The main objective of this project is to understand scaled dot-product attention and to implement a simple attention mechanism in a sequence-to-sequence model. The task involves translating sentences from German to English using the Multi30k dataset, which contains over 31,000 bitext sentences describing common visual scenes in both languages.
We start from the definition of a single-query scaled dot-product attention mechanism. Given a query $\mathbf{q}$, a set of key vectors $\mathbf{k}_1, ..., \mathbf{k}_m$, and a set of value vectors $\mathbf{v}_1, ..., \mathbf{v}_m$, the attended output is
$$ \textbf{a} = \sum_{i=1}^m \alpha_i \mathbf{v}_i, \hspace{20pt} \alpha_i = \mbox{softmax}(\mathbf{q}\mathbf{k}_i^T) $$
where the attention weights $\alpha_i$ form a probability distribution over the $m$ values.
Q. Describe what properties of the keys and queries would result in the output $\textbf{a}$ being approximately equal to a single value vector, i.e., $\textbf{a} \approx \mathbf{v}_j$ for some $j$.
For $\textbf{a} \approx \mathbf{v}_j$ to hold, the similarity score $\mathbf{q}\mathbf{k}_j^T$ must be significantly higher than all of the other scores: since the softmax turns the scores into a probability distribution, a dominant score pushes $\alpha_j \approx 1$ and $\alpha_i \approx 0$ for $i \neq j$. In other words, given a query $\mathbf{q}$, the key $\mathbf{k}_j$ must have a much larger dot product with $\mathbf{q}$ than any other key.
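As a quick numerical illustration of this saturation effect, the sketch below implements single-query dot-product attention in NumPy (the helper name and the toy keys/values are our own, not part of the assignment code): one dominant score drives the softmax weight for that key toward 1.

```python
import numpy as np

def single_query_attention(q, K, V):
    """Single-query dot-product attention: a = sum_i softmax(q K^T)_i * v_i."""
    scores = K @ q                       # one similarity score per key
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over the m scores
    return alpha @ V, alpha

K = np.eye(3, 4)                         # three orthonormal keys in R^4
V = np.array([[1., 0., 0., 0.],
              [0., 2., 0., 0.],
              [0., 0., 3., 0.]])         # corresponding values
q = 10.0 * K[2]                          # large dot product with the third key only

a, alpha = single_query_attention(q, K, V)
print(alpha)   # ~[4.5e-05, 4.5e-05, 0.9999] -> almost all weight on the third key
print(a)       # ~[0, 0, 3, 0], i.e. approximately the third value vector
```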
Q. Consider a set of key vectors $\{\mathbf{k}_1, ..., \mathbf{k}_m\}$ that are mutually orthogonal unit vectors ($\mathbf{k}_i\mathbf{k}_j^T = 0$ for $i \neq j$ and $\mathbf{k}_i\mathbf{k}_i^T = 1$), and let $\mathbf{v}_a$ and $\mathbf{v}_b$ be the values associated with two of these keys, $\mathbf{k}_a$ and $\mathbf{k}_b$. Give an expression for a query $\mathbf{q}$ such that the output is approximately the average of the two values, $\textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$.
From $\textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$, the factor $\frac{1}{2}$ must come from the attention weights: we need $\alpha_a = \alpha_b \approx \frac{1}{2}$ while $\alpha_i \approx 0$ for all $i \neq a, b$. Since $\alpha_i = \mbox{softmax}(\mathbf{q}\mathbf{k}_i^T)$, the query should produce equally large scores for $\mathbf{k}_a$ and $\mathbf{k}_b$ and much smaller scores for every other key.
Taking $c$ to be a large constant, the following expression for $\mathbf{q}$ satisfies $\textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$:
$$ \mathbf{q}=c(\mathbf{k}_a + \mathbf{k}_b) $$
Computing the dot product of $\mathbf{q}$ with each key (using the fact that the keys are orthonormal) confirms that this expression satisfies the condition:
$$ \mathbf{q}\mathbf{k}_a^T=c(\mathbf{k}_a\mathbf{k}_a^T + \mathbf{k}_b\mathbf{k}_a^T)=c(1+0)=c $$
$$ \mathbf{q}\mathbf{k}_b^T=c(\mathbf{k}_a\mathbf{k}_b^T + \mathbf{k}_b\mathbf{k}_b^T)=c(0+1)=c $$
$$ \mathbf{q}\mathbf{k}_i^T=c(\mathbf{k}_a\mathbf{k}_i^T + \mathbf{k}_b\mathbf{k}_i^T)=c(0+0)=0 \hspace{20pt} (i \neq a, b) $$ Therefore, the scores are $\mathbf{q}\mathbf{k}_a^T = \mathbf{q}\mathbf{k}_b^T = c$ and $\mathbf{q}\mathbf{k}_i^T = 0$ otherwise, so for a sufficiently large $c$ the softmax yields $\alpha_a \approx \alpha_b \approx \frac{1}{2}$ and $\alpha_i \approx 0$. $\textbf{a}$ can then be written as
$$ \textbf{a} = \alpha_a \mathbf{v}_a + \alpha_b \mathbf{v}_b + \sum_{j=1, j\neq a, b}^m \alpha_j \mathbf{v}_j $$
$$ \textbf{a} \approx \frac{1}{2}\mathbf{v}_a + \frac{1}{2}\mathbf{v}_b + 0 $$
$$ \textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b) $$
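This construction can also be checked numerically; a minimal sketch, assuming orthonormal keys built with np.eye (the indices a_idx, b_idx and the constant c are arbitrary choices):

```python
import numpy as np

def single_query_attention(q, K, V):
    scores = K @ q
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax
    return alpha @ V

m = d = 6
K = np.eye(m, d)                         # mutually orthogonal unit-norm keys
V = np.random.randn(m, d)                # arbitrary values
a_idx, b_idx, c = 1, 4, 50.0             # two chosen keys and a large constant

q = c * (K[a_idx] + K[b_idx])            # q = c (k_a + k_b)
a = single_query_attention(q, K, V)
print(np.allclose(a, 0.5 * (V[a_idx] + V[b_idx]), atol=1e-6))   # True
```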
Q. Now consider a set of key vectors $\{\mathbf{k}_1, ..., \mathbf{k}_m\}$ that are no longer exactly orthonormal: each key is written as $\mathbf{k}_i = \lambda_i\mathbf{\mu}_i$, where the $\mathbf{\mu}_i$ are mutually orthogonal unit vectors and the scales are randomly sampled as $\lambda_i \sim \mathcal{N}(1, \beta)$ for some variance $\beta$. Using the same query construction as before, what happens to the output $\textbf{a}$ across different samplings of $\lambda_1, ..., \lambda_m$, particularly when $\beta$ is large?
From the expression for
$\mathbf{q}$ in Task 1.2,
$$ \mathbf{q}=c(\mathbf{k}_a + \mathbf{k}_b) $$ By substituting
$\mathbf{k}_i = \mathbf{\mu}_i*\lambda_i$ ,
$$ \mathbf{q}=c(\mathbf{\mu}_a*\lambda_a + \mathbf{\mu}_b*\lambda_b) $$ The scores for keys $\mathbf{k}_a$ and $\mathbf{k}_b$ become
$$ \mathbf{q}\mathbf{k}_a^T=c(\lambda_a^2*\mathbf{\mu}_a\mathbf{\mu}_a^T + \lambda_a\lambda_b*\mathbf{\mu}_b\mathbf{\mu}_a^T)=c\lambda_a^2 $$
$$ \mathbf{q}\mathbf{k}_b^T=c(\lambda_a\lambda_b*\mathbf{\mu}_a\mathbf{\mu}_b^T + \lambda_b^2*\mathbf{\mu}_b\mathbf{\mu}_b^T)=c\lambda_b^2 $$ When
$\lambda_a \approx \lambda_b$ ,
$$ \textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b) $$ When
$\lambda_a \gg \lambda_b$ or $\lambda_a \ll \lambda_b$,
$$ \textbf{a} \approx \mathbf{v}_a \text{ or } \textbf{a} \approx \mathbf{v}_b $$ Since the scales are randomly sampled as $\lambda_i \sim \mathcal{N}(1, \beta)$, the expected scores are
$$ \mathbb{E}[\mathbf{q}\mathbf{k}_a^T]=\mathbb{E}[c\lambda_a^2]=c(1+\beta) $$
$$ \mathbb{E}[\mathbf{q}\mathbf{k}_b^T]=\mathbb{E}[c\lambda_b^2]=c(1+\beta) $$
Because the two scores have the same expectation, neither key is preferred on average. Over multiple resamplings of $\lambda_1, ..., \lambda_m$, the output therefore averages out to
$$ \textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b) $$
even though any single sample can be strongly skewed toward $\mathbf{v}_a$ or $\mathbf{v}_b$ when $\beta$ is large.
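The effect of the random scales can be simulated directly; a minimal sketch (the sample count, seed, and the choice of β = 1 are our own assumptions):

```python
import numpy as np

def single_query_attention(q, K, V):
    scores = K @ q
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax
    return alpha @ V

rng = np.random.default_rng(0)
m = d = 6
MU = np.eye(m, d)                        # orthonormal directions mu_i
V = rng.normal(size=(m, d))
a_idx, b_idx, c, beta = 1, 4, 20.0, 1.0  # large beta => noisy key scales

outputs = []
for _ in range(1000):
    lam = rng.normal(1.0, np.sqrt(beta), size=m)   # lambda_i ~ N(1, beta)
    K = lam[:, None] * MU                          # k_i = lambda_i mu_i
    q = c * (K[a_idx] + K[b_idx])                  # same construction as Task 1.2
    outputs.append(single_query_attention(q, K, V))

print(outputs[0])                 # a single sample is often skewed toward v_a or v_b
print(np.mean(outputs, axis=0))   # roughly 0.5 * (V[a_idx] + V[b_idx]) on average
```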
Q. Let's now consider a simple version of multi-head attention that averages the attended features resulting from two different queries. Here, two queries $\mathbf{q}_1$ and $\mathbf{q}_2$ are defined, each producing its own attended feature ($\textbf{a}_1$ and $\textbf{a}_2$), and the final output is $\textbf{a} = \frac{1}{2}(\textbf{a}_1 + \textbf{a}_2)$. Using the randomly scaled keys from the previous task, give expressions for $\mathbf{q}_1$ and $\mathbf{q}_2$ such that $\textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$ regardless of the sampled scales.
From Task 1.3, the single-query expression
$$ \mathbf{q}=c(\mathbf{\mu}_a*\lambda_a + \mathbf{\mu}_b*\lambda_b) $$
only yields $\textbf{a} \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$ on average over resamplings. Instead, each query can be aligned with a single key direction $\mathbf{\mu}_i$, so that the random scales no longer shift the balance between $\mathbf{v}_a$ and $\mathbf{v}_b$. The following expressions for $\mathbf{q}_1$ and $\mathbf{q}_2$ ensure $\textbf{a} = \frac{1}{2}(\textbf{a}_1 + \textbf{a}_2) \approx \frac{1}{2}(\mathbf{v}_a + \mathbf{v}_b)$:
$$ \mathbf{q}_1=c_1\lambda_a*\mathbf{\mu}_a $$
$$ \mathbf{q}_2=c_2\lambda_b*\mathbf{\mu}_b $$ Computing the dot products of each query with the keys (again using the orthonormality of the $\mathbf{\mu}_i$),
$$ \mathbf{q}_1\mathbf{k}_a^T=c_1\lambda_a*\mathbf{\mu}_a\mathbf{k}_a^T =c_1\lambda_a^2*(\mathbf{\mu}_a\mathbf{\mu}_a^T)=c_1\lambda_a^2 $$
$$ \mathbf{q}_2\mathbf{k}_b^T=c_2\lambda_b*(\mathbf{\mu}_b\mathbf{k}_b^T) =c_2\lambda_b^2*(\mathbf{\mu}_b\mathbf{\mu}_b^T)=c_2\lambda_b^2 $$ while $\mathbf{q}_1\mathbf{k}_i^T = 0$ for $i \neq a$ and $\mathbf{q}_2\mathbf{k}_j^T = 0$ for $j \neq b$. For sufficiently large $c_1$ and $c_2$, each softmax therefore concentrates on its own key, and $\textbf{a}_1$ and $\textbf{a}_2$ can be expressed as
$$ \alpha_i=\mbox{softmax}(\mathbf{q}_1\mathbf{k}_i^T) \hspace{20pt} \alpha_j=\mbox{softmax}(\mathbf{q}_2\mathbf{k}_j^T) $$
$$ \alpha_i = \begin{cases} 1 & \text{if } i = a \\ 0 & \text{if } i \neq a \end{cases} \hspace{20pt} \alpha_j = \begin{cases} 1 & \text{if } j = b \\ 0 & \text{if } j \neq b \end{cases} $$
$$ \textbf{a}_1 = \alpha_a \mathbf{v}_a = \mathbf{v}_a \hspace{40pt} \textbf{a}_2 = \alpha_b \mathbf{v}_b = \mathbf{v}_b $$ The final output is the average of
$\textbf{a}_1$ and $\textbf{a}_2$:
$$ \textbf{a} = \frac{1}{2}(\textbf{a}_1 + \textbf{a}_2) $$
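A numerical check of this two-query construction; a minimal sketch using one example draw of the scales (hard-coded here for reproducibility) to show that each query still locks onto its own key:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

m = d = 6
MU = np.eye(m, d)                         # orthonormal directions mu_i
V = np.random.randn(m, d)
a_idx, b_idx, c1, c2 = 1, 4, 50.0, 50.0

lam = np.array([1.3, 0.7, 1.1, 0.9, 1.6, 0.5])   # one example draw of lambda_i ~ N(1, beta)
K = lam[:, None] * MU                            # k_i = lambda_i mu_i

q1 = c1 * lam[a_idx] * MU[a_idx]          # q_1 = c_1 lambda_a mu_a
q2 = c2 * lam[b_idx] * MU[b_idx]          # q_2 = c_2 lambda_b mu_b
a1 = softmax(K @ q1) @ V                  # ~ v_a despite the noisy scales
a2 = softmax(K @ q2) @ V                  # ~ v_b
a = 0.5 * (a1 + a2)
print(np.allclose(a, 0.5 * (V[a_idx] + V[b_idx]), atol=1e-6))   # True
```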
Machine translation involves automatically converting text from one language to another using computational models. In the context of translating German to English, scaled dot-product attention is a core mechanism in modern neural networks, particularly in models like the Transformer. This mechanism allows the model to focus on the relevant parts of the input sentence when generating each output word, enabling it to capture the nuances and structure of German and translate them effectively into English.
By using scaled dot-product attention, the model effectively learns the common alignment patterns and structural differences between German and English; several of these are visualized in the attention maps in Figure 1.
Figure 1: Attention diagram examples (Examples 1–4).
- SVO vs SOV
English typically follows an SVO (Subject-Verb-Object) structure, while German often employs an SOV (Subject-Object-Verb) structure in subordinate clauses. This difference leads to deviations from the diagonal in the verb area on attention maps, as illustrated in Examples 1 and 2. For instance, deviations can be seen in the phrases "are climbing" in Example 1 and "are anticipating" in Example 2.
- Prepositions and Conjunctions
Prepositions and their objects usually align closely, but their usage can differ in certain cases. For example, in Example 3, the English preposition "to" corresponds to the German preposition "zu," and the English conjunction "while" corresponds to the German conjunction "während." Both of these cases show deviations rather than diagonal alignment. In Example 4, the English preposition "in front of" corresponds to the German preposition "vor," which also shows significant deviation.
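To produce maps like those in Figure 1, the per-step weights $\alpha$ returned by the attention module can be drawn as a heatmap over source and target tokens. A minimal matplotlib sketch (the token lists and the random attn matrix below are placeholders for the model's actual outputs):

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: rows are target (English) steps, columns are source (German) tokens.
src_tokens = ["zwei", "männer", "klettern", "."]
trg_tokens = ["two", "men", "are", "climbing", "."]
attn = np.random.rand(len(trg_tokens), len(src_tokens))
attn /= attn.sum(axis=1, keepdims=True)   # each target step's weights sum to 1

fig, ax = plt.subplots()
ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(src_tokens)))
ax.set_xticklabels(src_tokens, rotation=45)
ax.set_yticks(range(len(trg_tokens)))
ax.set_yticklabels(trg_tokens)
ax.set_xlabel("source (German)")
ax.set_ylabel("target (English)")
plt.tight_layout()
plt.show()
```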
    import numpy as np
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SingleQueryScaledDotProductAttention(nn.Module):
        # kq_dim is the dimension of keys and queries. Linear layers project the
        # encoder/decoder hidden states to this dimension.
        def __init__(self, enc_hid_dim, dec_hid_dim, kq_dim=512):
            super().__init__()
            self.W_k = nn.Linear(enc_hid_dim * 2, kq_dim)
            self.W_q = nn.Linear(dec_hid_dim, kq_dim)
            self.kq_dim = kq_dim

        # hidden is h_t^{d} from Eq. (11) and has dim => [batch_size, dec_hid_dim]
        # encoder_outputs is the word representations from Eq. (6)
        #   and has dim => [src_len, batch_size, enc_hid_dim * 2]
        def forward(self, hidden, encoder_outputs):
            # Compute q = W_q h_t^{d} => [batch_size, 1, kq_dim]
            query = self.W_q(hidden).unsqueeze(1)
            # Compute k_j = W_k h_j^{e} => [batch_size, kq_dim, src_len]
            key = self.W_k(encoder_outputs).permute(1, 2, 0)
            # Compute score_j = q k_j^T / sqrt(kq_dim) => [batch_size, src_len]
            score = torch.bmm(query, key).squeeze(1) / np.sqrt(self.kq_dim)
            # Compute alpha_j = softmax(score_j) => [batch_size, src_len]
            alpha = F.softmax(score, dim=1)
            # Compute a = sum_j alpha_j h_j^{e} => [batch_size, enc_hid_dim * 2]
            attended_val = torch.sum(encoder_outputs.permute(1, 0, 2) * alpha.unsqueeze(2), dim=1)

            assert attended_val.shape == (hidden.shape[0], encoder_outputs.shape[2])
            assert alpha.shape == (hidden.shape[0], encoder_outputs.shape[0])
            return attended_val, alpha
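A quick shape check with dummy tensors confirms the expected dimensions (the sizes below are arbitrary; the snippet assumes the class above is already defined):

```python
import torch

enc_hid_dim, dec_hid_dim = 64, 64
src_len, batch_size = 7, 3

attn = SingleQueryScaledDotProductAttention(enc_hid_dim, dec_hid_dim, kq_dim=128)
hidden = torch.randn(batch_size, dec_hid_dim)                        # h_t^{d}
encoder_outputs = torch.randn(src_len, batch_size, enc_hid_dim * 2)  # h_1^{e} ... h_T^{e}

attended_val, alpha = attn(hidden, encoder_outputs)
print(attended_val.shape)   # torch.Size([3, 128]) -> [batch_size, enc_hid_dim * 2]
print(alpha.shape)          # torch.Size([3, 7])   -> [batch_size, src_len]
print(alpha.sum(dim=1))     # each row of attention weights sums to 1
```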
Models were trained and evaluated with the Dummy baseline (reported as "None"), MeanPool ("Mean"), and the scaled dot-product attention mechanism ("SDP"). For each method, three independent runs were conducted, and the mean and variance of the results were computed to compare performance and consistency.
| Attn. | PPL Mean | PPL Var | BLEU Mean | BLEU Var |
|---|---|---|---|---|
| None | 18.107 | 0.057 | 16.3 | 0.118 |
| Mean | 15.809 | 0.028 | 18.599 | 0.277 |
| SDP | 10.447 | 0.049 | 34.64 | 0.793 |
The trends indicate that the scaled dot-product attention mechanism (SDP) significantly improves the model's performance, yielding both lower perplexity and higher BLEU scores than the None and Mean variants. However, the higher variance in BLEU scores with SDP suggests some run-to-run variability, likely caused by random initialization.
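For reference, the per-method aggregation reported in the table can be reproduced from the three run scores; a minimal sketch (the numbers shown are placeholders, not the actual run logs):

```python
import numpy as np

def summarize(runs):
    """Mean and (population) variance of a metric over independent runs."""
    runs = np.asarray(runs, dtype=float)
    return runs.mean(), runs.var()       # np.var uses ddof=0 by default

bleu_runs = [34.1, 34.8, 35.0]           # placeholder BLEU scores from three runs
mean, var = summarize(bleu_runs)
print(f"BLEU mean = {mean:.3f}, variance = {var:.3f}")
```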