The main objective of this project is to understand scaled dot-product attention and to implement a simple attention mechanism in a sequence-to-sequence model. The task involves translating sentences from German to English using the Multi30k dataset, which contains over 31,000 bitext sentences describing common visual scenes in both languages.
We start from the definition of a single-query scaled dot-product attention mechanism. Given a query $\mathbf{q}$, a set of key vectors $\mathbf{k}_1, ..., \mathbf{k}_m$, and a set of value vectors $\mathbf{v}_1, ..., \mathbf{v}_m$, the attended output is
$$ \textbf{a} = \sum_{i=1}^m \alpha_i \mathbf{v}_i, \hspace{20pt} \alpha_i = \mbox{softmax}(\mathbf{q}\mathbf{k}_i^T) $$
where the attention weights $\alpha_i$ form a probability distribution over the $m$ values.
Q. Describe what properties of the keys and queries would result in the output $\textbf{a}$ being approximately equal to a single value vector, i.e., $\textbf{a} \approx \mathbf{v}_j$ for some $j$.
For $\textbf{a} \approx \mathbf{v}_j$ to hold, the similarity score $\mathbf{q}\mathbf{k}_j^T$ must be significantly higher than all of the other scores: since the softmax turns the scores into a probability distribution, a dominant score pushes $\alpha_j \approx 1$ and $\alpha_i \approx 0$ for $i \neq j$. In other words, given a query $\mathbf{q}$, the key $\mathbf{k}_j$ must have a much larger dot product with $\mathbf{q}$ than any other key.
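As a quick numerical illustration of this saturation effect, the sketch below implements single-query dot-product attention in NumPy (the helper name and the toy keys/values are our own, not part of the assignment code): one dominant score drives the softmax weight for that key toward 1.

```python
import numpy as np

def single_query_attention(q, K, V):
    """Single-query dot-product attention: a = sum_i softmax(q K^T)_i * v_i."""
    scores = K @ q                       # one similarity score per key
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over the m scores
    return alpha @ V, alpha

K = np.eye(3, 4)                         # three orthonormal keys in R^4
V = np.array([[1., 0., 0., 0.],
              [0., 2., 0., 0.],
              [0., 0., 3., 0.]])         # corresponding values
q = 10.0 * K[2]                          # large dot product with the third key only

a, alpha = single_query_attention(q, K, V)
print(alpha)   # ~[4.5e-05, 4.5e-05, 0.9999] -> almost all weight on the third key
print(a)       # ~[0, 0, 3, 0], i.e. approximately the third value vector
```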
Q. Consider a set of key vectors $\{\mathbf{k}_1, ..., \mathbf{k}_m\}$ that are mutually orthogonal unit vectors ($\mathbf{k}_i\mathbf{k}_j^T = 0$ for $i \neq j$ and $\mathbf{k}_i\mathbf{k}_i^T = 1$), and let $\mathbf{v}_a$ and $\mathbf{v}_b$ be the values associated with two of these keys, $\mathbf{k}_a$ and $\mathbf{k}_b$. Give an expression for a query $\mathbf{q}$ such that the output is approximately the average of the two values, $\textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$.
From $\textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$, the factor $\frac{1}{2}$ must come from the attention weights: we need $\alpha_a = \alpha_b \approx \frac{1}{2}$ while $\alpha_i \approx 0$ for all $i \neq a, b$. Since $\alpha_i = \mbox{softmax}(\mathbf{q}\mathbf{k}_i^T)$, the query should produce equally large scores for $\mathbf{k}_a$ and $\mathbf{k}_b$ and much smaller scores for every other key.
Taking $c$ to be a large constant, the following expression for $\mathbf{q}$ satisfies $\textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$:
$$ \mathbf{q}=c(\mathbf{k}_a + \mathbf{k}_b) $$
Computing the dot product of $\mathbf{q}$ with each key (using the fact that the keys are orthonormal) confirms that this expression satisfies the condition:
$$ \mathbf{q}\mathbf{k}_a^T=c(\mathbf{k}_a\mathbf{k}_a^T + \mathbf{k}_b\mathbf{k}_a^T)=c(1+0)=c $$
$$ \mathbf{q}\mathbf{k}_b^T=c(\mathbf{k}_a\mathbf{k}_b^T + \mathbf{k}_b\mathbf{k}_b^T)=c(0+1)=c $$
$$ \mathbf{q}\mathbf{k}_i^T=c(\mathbf{k}_a\mathbf{k}_i^T + \mathbf{k}_b\mathbf{k}_i^T)=c(0+0)=0 \hspace{20pt} (i \neq a, b) $$ Therefore, the scores are $\mathbf{q}\mathbf{k}_a^T = \mathbf{q}\mathbf{k}_b^T = c$ and $\mathbf{q}\mathbf{k}_i^T = 0$ otherwise, so for a sufficiently large $c$ the softmax yields $\alpha_a \approx \alpha_b \approx \frac{1}{2}$ and $\alpha_i \approx 0$. $\textbf{a}$ can then be written as
$$ \textbf{a} = \alpha_a \mathbf{v}_a + \alpha_b \mathbf{v}_b + \sum_{j=1, j\neq a, b}^m \alpha_j \mathbf{v}_j $$
$$ \textbf{a} \approx \frac{1}{2}\mathbf{v}_a + \frac{1}{2}\mathbf{v}_b + 0 $$
$$ \textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b) $$
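This construction can also be checked numerically; a minimal sketch, assuming orthonormal keys built with np.eye (the indices a_idx, b_idx and the constant c are arbitrary choices):

```python
import numpy as np

def single_query_attention(q, K, V):
    scores = K @ q
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax
    return alpha @ V

m = d = 6
K = np.eye(m, d)                         # mutually orthogonal unit-norm keys
V = np.random.randn(m, d)                # arbitrary values
a_idx, b_idx, c = 1, 4, 50.0             # two chosen keys and a large constant

q = c * (K[a_idx] + K[b_idx])            # q = c (k_a + k_b)
a = single_query_attention(q, K, V)
print(np.allclose(a, 0.5 * (V[a_idx] + V[b_idx]), atol=1e-6))   # True
```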
Q. Now consider a set of key vectors $\{\mathbf{k}_1, ..., \mathbf{k}_m\}$ that are no longer exactly orthonormal: each key is written as $\mathbf{k}_i = \lambda_i\mathbf{\mu}_i$, where the $\mathbf{\mu}_i$ are mutually orthogonal unit vectors and the scales are randomly sampled as $\lambda_i \sim \mathcal{N}(1, \beta)$ for some variance $\beta$. Using the same query construction as before, what happens to the output $\textbf{a}$ across different samplings of $\lambda_1, ..., \lambda_m$, particularly when $\beta$ is large?
From the expression for
$\mathbf{q}$ in Task 1.2,
$$ \mathbf{q}=c(\mathbf{k}_a + \mathbf{k}_b) $$ By substituting
$\mathbf{k}_i = \mathbf{\mu}_i*\lambda_i$ ,
$$ \mathbf{q}=c(\mathbf{\mu}_a*\lambda_a + \mathbf{\mu}_b*\lambda_b) $$ The scores for keys $\mathbf{k}_a$ and $\mathbf{k}_b$ become
$$ \mathbf{q}\mathbf{k}_a^T=c(\lambda_a^2*\mathbf{\mu}_a\mathbf{\mu}_a^T + \lambda_a\lambda_b*\mathbf{\mu}_b\mathbf{\mu}_a^T)=c\lambda_a^2 $$
$$ \mathbf{q}\mathbf{k}_b^T=c(\lambda_a\lambda_b*\mathbf{\mu}_a\mathbf{\mu}_b^T + \lambda_b^2*\mathbf{\mu}_b\mathbf{\mu}_b^T)=c\lambda_b^2 $$ When
$\lambda_a \approx \lambda_b$ ,
$$ \textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b) $$ When
$\lambda_a \gg \lambda_b$ or $\lambda_a \ll \lambda_b$,
$$ \textbf{a} \approx \mathbf{v}_a \text{ or } \textbf{a} \approx \mathbf{v}_b $$ Since the scales are randomly sampled as $\lambda_i \sim \mathcal{N}(1, \beta)$, the expected scores are
$$ \mathbb{E}[\mathbf{q}\mathbf{k}_a^T]=\mathbb{E}[c\lambda_a^2]=c(1+\beta) $$
$$ \mathbb{E}[\mathbf{q}\mathbf{k}_b^T]=\mathbb{E}[c\lambda_b^2]=c(1+\beta) $$
Because the two scores have the same expectation, neither key is preferred on average. Over multiple resamplings of $\lambda_1, ..., \lambda_m$, the output therefore averages out to
$$ \textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b) $$
even though any single sample can be strongly skewed toward $\mathbf{v}_a$ or $\mathbf{v}_b$ when $\beta$ is large.
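The effect of the random scales can be simulated directly; a minimal sketch (the sample count, seed, and the choice of β = 1 are our own assumptions):

```python
import numpy as np

def single_query_attention(q, K, V):
    scores = K @ q
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax
    return alpha @ V

rng = np.random.default_rng(0)
m = d = 6
MU = np.eye(m, d)                        # orthonormal directions mu_i
V = rng.normal(size=(m, d))
a_idx, b_idx, c, beta = 1, 4, 20.0, 1.0  # large beta => noisy key scales

outputs = []
for _ in range(1000):
    lam = rng.normal(1.0, np.sqrt(beta), size=m)   # lambda_i ~ N(1, beta)
    K = lam[:, None] * MU                          # k_i = lambda_i mu_i
    q = c * (K[a_idx] + K[b_idx])                  # same construction as Task 1.2
    outputs.append(single_query_attention(q, K, V))

print(outputs[0])                 # a single sample is often skewed toward v_a or v_b
print(np.mean(outputs, axis=0))   # roughly 0.5 * (V[a_idx] + V[b_idx]) on average
```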
Q. Let's now consider a simple version of multi-head attention that averages the attended features resulting from two different queries. Here, two queries $\mathbf{q}_1$ and $\mathbf{q}_2$ are defined, each producing its own attended feature ($\textbf{a}_1$ and $\textbf{a}_2$), and the final output is $\textbf{a} = \frac{1}{2}(\textbf{a}_1 + \textbf{a}_2)$. Using the randomly scaled keys from the previous task, give expressions for $\mathbf{q}_1$ and $\mathbf{q}_2$ such that $\textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$ regardless of the sampled scales.
From Task 1.3, the single-query expression
$$ \mathbf{q}=c(\mathbf{\mu}_a*\lambda_a + \mathbf{\mu}_b*\lambda_b) $$
only yields $\textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$ on average over resamplings. Instead, each query can be aligned with a single key direction $\mathbf{\mu}_i$, so that the random scales no longer shift the balance between $\mathbf{v}_a$ and $\mathbf{v}_b$. The following expressions for $\mathbf{q}_1$ and $\mathbf{q}_2$ ensure $\textbf{a} = \frac{1}{2}(\textbf{a}_1 + \textbf{a}_2) \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$:
$$ \mathbf{q}_1=c_1\lambda_a*\mathbf{\mu}_a $$
$$ \mathbf{q}_2=c_2\lambda_b*\mathbf{\mu}_b $$ Computing the dot products of each query with the keys (again using the orthonormality of the $\mathbf{\mu}_i$),
$$ \mathbf{q}_1\mathbf{k}_a^T=c_1\lambda_a*\mathbf{\mu}_a\mathbf{k}_a^T =c_1\lambda_a^2*(\mathbf{\mu}_a\mathbf{\mu}_a^T)=c_1\lambda_a^2 $$
$$ \mathbf{q}_2\mathbf{k}_b^T=c_2\lambda_b*(\mathbf{\mu}_b\mathbf{k}_b^T) =c_2\lambda_b^2*(\mathbf{\mu}_b\mathbf{\mu}_b^T)=c_2\lambda_b^2 $$ while $\mathbf{q}_1\mathbf{k}_i^T = 0$ for $i \neq a$ and $\mathbf{q}_2\mathbf{k}_j^T = 0$ for $j \neq b$. For sufficiently large $c_1$ and $c_2$, each softmax therefore concentrates on its own key, and $\textbf{a}_1$ and $\textbf{a}_2$ can be expressed as
$$ \alpha_i=\mbox{softmax}(\mathbf{q}_1\mathbf{k}_i^T) \hspace{20pt} \alpha_j=\mbox{softmax}(\mathbf{q}_2\mathbf{k}_j^T) $$
$$ \alpha_i = \begin{cases} 1 & \text{if } i = a \\ 0 & \text{if } i \neq a \end{cases} \hspace{20pt} \alpha_j = \begin{cases} 1 & \text{if } j = b \\ 0 & \text{if } j \neq b \end{cases} $$
$$ \textbf{a}_1 = \alpha_a \mathbf{v}_a = \mathbf{v}_a \hspace{40pt} \textbf{a}_2 = \alpha_b \mathbf{v}_b = \mathbf{v}_b $$ The final output is the average of
$\textbf{a}_1$ and $\textbf{a}_2$:
$$ \textbf{a} = \frac{1}{2}(\textbf{a}_1 + \textbf{a}_2) $$
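A numerical check of this two-query construction; a minimal sketch using one example draw of the scales (hard-coded here for reproducibility) to show that each query still locks onto its own key:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

m = d = 6
MU = np.eye(m, d)                         # orthonormal directions mu_i
V = np.random.randn(m, d)
a_idx, b_idx, c1, c2 = 1, 4, 50.0, 50.0

lam = np.array([1.3, 0.7, 1.1, 0.9, 1.6, 0.5])   # one example draw of lambda_i ~ N(1, beta)
K = lam[:, None] * MU                            # k_i = lambda_i mu_i

q1 = c1 * lam[a_idx] * MU[a_idx]          # q_1 = c_1 lambda_a mu_a
q2 = c2 * lam[b_idx] * MU[b_idx]          # q_2 = c_2 lambda_b mu_b
a1 = softmax(K @ q1) @ V                  # ~ v_a despite the noisy scales
a2 = softmax(K @ q2) @ V                  # ~ v_b
a = 0.5 * (a1 + a2)
print(np.allclose(a, 0.5 * (V[a_idx] + V[b_idx]), atol=1e-6))   # True
```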
Machine translation involves automatically converting text from one language to another using computational models. In the context of translating German to English, scaled dot-product attention is a core mechanism in modern neural networks, particularly in models like the Transformer. This mechanism allows the model to focus on the relevant parts of the input sentence when generating each output word, enabling it to capture the nuances and structure of German and translate them effectively into English.
By using scaled dot-product attention, the model effectively learns the common alignment patterns and structural differences between German and English; several of these are visualized in the attention maps in Figure 1.
Figure 1: Attention diagram examples (Examples 1–4).
- SVO vs SOV
English typically follows an SVO (Subject-Verb-Object) structure, while German often employs an SOV (Subject-Object-Verb) structure in subordinate clauses. This difference leads to deviations from the diagonal in the verb area on attention maps, as illustrated in Examples 1 and 2. For instance, deviations can be seen in the phrases "are climbing" in Example 1 and "are anticipating" in Example 2.
- Prepositions and Conjunctions
Prepositions and their objects usually align closely, but their usage can differ in certain cases. For example, in Example 3, the English preposition "to" corresponds to the German preposition "zu," and the English conjunction "while" corresponds to the German conjunction "während." Both of these cases show deviations rather than diagonal alignment. In Example 4, the English preposition "in front of" corresponds to the German preposition "vor," which also shows significant deviation.
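To produce maps like those in Figure 1, the per-step weights $\alpha$ returned by the attention module can be drawn as a heatmap over source and target tokens. A minimal matplotlib sketch (the token lists and the random attn matrix below are placeholders for the model's actual outputs):

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: rows are target (English) steps, columns are source (German) tokens.
src_tokens = ["zwei", "männer", "klettern", "."]
trg_tokens = ["two", "men", "are", "climbing", "."]
attn = np.random.rand(len(trg_tokens), len(src_tokens))
attn /= attn.sum(axis=1, keepdims=True)   # each target step's weights sum to 1

fig, ax = plt.subplots()
ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(src_tokens)))
ax.set_xticklabels(src_tokens, rotation=45)
ax.set_yticks(range(len(trg_tokens)))
ax.set_yticklabels(trg_tokens)
ax.set_xlabel("source (German)")
ax.set_ylabel("target (English)")
plt.tight_layout()
plt.show()
```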
    import numpy as np
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SingleQueryScaledDotProductAttention(nn.Module):
        # kq_dim is the dimension of keys and queries. Linear layers project the
        # encoder/decoder hidden states to this dimension.
        def __init__(self, enc_hid_dim, dec_hid_dim, kq_dim=512):
            super().__init__()
            self.W_k = nn.Linear(enc_hid_dim * 2, kq_dim)
            self.W_q = nn.Linear(dec_hid_dim, kq_dim)
            self.kq_dim = kq_dim

        # hidden is h_t^{d} from Eq. (11) and has dim => [batch_size, dec_hid_dim]
        # encoder_outputs is the word representations from Eq. (6)
        #   and has dim => [src_len, batch_size, enc_hid_dim * 2]
        def forward(self, hidden, encoder_outputs):
            # Compute q = W_q h_t^{d} => [batch_size, 1, kq_dim]
            query = self.W_q(hidden).unsqueeze(1)
            # Compute k_j = W_k h_j^{e} => [batch_size, kq_dim, src_len]
            key = self.W_k(encoder_outputs).permute(1, 2, 0)
            # Compute score_j = q k_j^T / sqrt(kq_dim) => [batch_size, src_len]
            score = torch.bmm(query, key).squeeze(1) / np.sqrt(self.kq_dim)
            # Compute alpha_j = softmax(score_j) => [batch_size, src_len]
            alpha = F.softmax(score, dim=1)
            # Compute a = sum_j alpha_j h_j^{e} => [batch_size, enc_hid_dim * 2]
            attended_val = torch.sum(encoder_outputs.permute(1, 0, 2) * alpha.unsqueeze(2), dim=1)

            assert attended_val.shape == (hidden.shape[0], encoder_outputs.shape[2])
            assert alpha.shape == (hidden.shape[0], encoder_outputs.shape[0])
            return attended_val, alpha
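A quick shape check with dummy tensors confirms the expected dimensions (the sizes below are arbitrary; the snippet assumes the class above is already defined):

```python
import torch

enc_hid_dim, dec_hid_dim = 64, 64
src_len, batch_size = 7, 3

attn = SingleQueryScaledDotProductAttention(enc_hid_dim, dec_hid_dim, kq_dim=128)
hidden = torch.randn(batch_size, dec_hid_dim)                        # h_t^{d}
encoder_outputs = torch.randn(src_len, batch_size, enc_hid_dim * 2)  # h_1^{e} ... h_T^{e}

attended_val, alpha = attn(hidden, encoder_outputs)
print(attended_val.shape)   # torch.Size([3, 128]) -> [batch_size, enc_hid_dim * 2]
print(alpha.shape)          # torch.Size([3, 7])   -> [batch_size, src_len]
print(alpha.sum(dim=1))     # each row of attention weights sums to 1
```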
Models were trained and evaluated with the Dummy baseline (reported as "None"), MeanPool ("Mean"), and the scaled dot-product attention mechanism ("SDP"). For each method, three independent runs were conducted, and the mean and variance of the results were computed to compare performance and consistency.
| Attn. | PPL Mean | PPL Var | BLEU Mean | BLEU Var |
|---|---|---|---|---|
| None | 18.107 | 0.057 | 16.3 | 0.118 |
| Mean | 15.809 | 0.028 | 18.599 | 0.277 |
| SDP | 10.447 | 0.049 | 34.64 | 0.793 |
The trends indicate that the scaled dot-product attention mechanism (SDP) significantly improves the model's performance, yielding both lower perplexity and higher BLEU scores than the None and Mean variants. However, the higher variance in BLEU scores with SDP suggests some run-to-run variability, likely caused by random initialization.
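For reference, the per-method aggregation reported in the table can be reproduced from the three run scores; a minimal sketch (the numbers shown are placeholders, not the actual run logs):

```python
import numpy as np

def summarize(runs):
    """Mean and (population) variance of a metric over independent runs."""
    runs = np.asarray(runs, dtype=float)
    return runs.mean(), runs.var()       # np.var uses ddof=0 by default

bleu_runs = [34.1, 34.8, 35.0]           # placeholder BLEU scores from three runs
mean, var = summarize(bleu_runs)
print(f"BLEU mean = {mean:.3f}, variance = {var:.3f}")
```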