This post contains some of my notes, mostly on the multi-head attention aspect of the original Transformer (a recurrence-free, attention-based, sequence-to-sequence model), as presented in the paper “Attention Is All You Need” (Vaswani et al., 2017). It assumes familiarity with the attention mechanism (see my post on that).

Transformer architecture. Image source: "Attention Is All You Need" (Vaswani et al., 2017)

Above is a diagram of the Transformer architecture for reference.

Query, key, value perspective of attention

The paper views the attention function as mapping a query and a set of key-value pairs to an output. This perspective is similar to that of retrieval from a database, in which a query (e.g. a SELECT query in SQL) is compared with a set of keys, and the corresponding values are returned.

In the earlier (non-Transformer) sequence-to-sequence attention model of “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau et al., 2015), the query would be the decoder hidden state $s_i$, and the keys/values would be the encoder hidden states $h$. Recall that the context vector for decoding step $i$ can be formulated as follows:

$c_i = \mathrm{f}(s_i, h) = \mathrm{softmax}(\mathrm{align}(s_i, h))h$

For comparison, under this paper's query/key/value perspective, that formulation becomes something like the following (the queries, keys and values are each bundled together into matrices):

${\mathrm{Attention}}(Q, K, V) = {\mathrm{softmax}}({\mathrm{align}}(Q, K))V$

The output is essentially a weighted average of the values.
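
To make this concrete, here is a minimal NumPy sketch of the formulation above (my own illustration, not the paper's code). The alignment function is passed in as a parameter; a plain dot product stands in for it here, since the paper's scaled version is discussed below:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, align):
    # Attention(Q, K, V) = softmax(align(Q, K)) V
    weights = softmax(align(Q, K), axis=-1)  # (seqlen_q, seqlen_k); rows sum to 1
    return weights @ V                       # weighted average of the values

# Toy example: 3 queries attending over 4 key-value pairs.
dot_align = lambda Q, K: Q @ K.T
Q = np.random.randn(3, 8)    # 3 queries of dimension 8
K = np.random.randn(4, 8)    # 4 keys of dimension 8
V = np.random.randn(4, 16)   # 4 values of dimension 16
print(attention(Q, K, V, dot_align).shape)  # (3, 16)
```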

Types of attention layers in the Transformer

Note that the attention layers belong to stacks of identical blocks (see diagram).

  1. Encoder-decoder attention
    • Queries come from the previous decoder layer
    • Keys/values come from the encoder output
  2. Encoder self-attention
    • Keys, values and queries are all from the same place - the output of the previous encoder layer
  3. Decoder self-attention
    • Keys, values and queries are all from the same place - output of previous decoder layer
    • Each decoder position $i$ can attend to all decoder positions $\le i$. This is enforced by masking out the softmax’s input values (the alignment scores) corresponding to positions that should not be attended to; see the sketch after this list.
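
Here is a minimal sketch of that masking, reusing the attention helper from the earlier snippet. A masked alignment function sets the scores of disallowed positions to a large negative number (standing in for $-\infty$), so their softmax weights become effectively zero:

```python
def causal_align(Q, K):
    # Dot-product scores with future positions masked out: decoder position i
    # may only attend to positions <= i (Q and K come from the same sequence,
    # so the score matrix is square).
    scores = Q @ K.T
    disallowed = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    return np.where(disallowed, -1e9, scores)   # ~ -inf before the softmax

X = np.random.randn(5, 8)                   # 5 decoder positions
out = attention(X, X, X, causal_align)      # masked decoder self-attention
weights = softmax(causal_align(X, X), axis=-1)
print(np.triu(weights, k=1).max())          # ~0: no weight on future positions
```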

Scaled dot-product alignment function

The alignment function for attention used in this paper is the scaled dot-product:

$\mathrm{align}(Q, K) = \frac{QK^T}{\sqrt{d_k}}$

  • $Q$ and $K$: each of these is a matrix that consists of a set of queries or keys (respectively) packed together
  • $d_k$: the dimension of a single key $k$

The paper mentions that additive and dot-product attention perform similarly for small values of $d_k$, but that additive attention outperforms dot-product attention for larger values. The authors suspect that the dot products grow large as $d_k$ grows, pushing the softmax into regions where its gradients are extremely small.

To counter that effect, they scale the dot products by $\frac{1}{\sqrt{d_k}}$. This keeps the variance of the scores from growing with $d_k$, which can be derived from the basic laws of variance.
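
To sketch the derivation (under the assumption, as in the paper, that the components of a query $q$ and key $k$ are independent random variables with mean $0$ and variance $1$):

$\mathrm{Var}(q \cdot k) = \mathrm{Var}\left(\sum_{j=1}^{d_k} q_j k_j\right) = \sum_{j=1}^{d_k} \mathrm{Var}(q_j k_j) = \sum_{j=1}^{d_k} \mathrm{Var}(q_j)\mathrm{Var}(k_j) = d_k$

So the unscaled dot product has standard deviation $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings the variance of the scores back to $1$ regardless of $d_k$.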

Multi-head attention

What is multi-head attention?

Remember that in “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau et al., 2015), a single alignment/weight was computed between each decoder time step (query) and encoder time step (key/value). There was only one attention “head”.

In contrast, multiple alignments are computed for each query and key-value pair in the Transformer, so there are multiple attention “heads”. This is called multi-head attention.

Why multiple attention heads?

The reason is probably similar to why multiple kernels are used in convolutional layers. Different heads can, in theory, focus on capturing different dependencies and patterns in the training data, such as different rules of grammar.

What is self-attention?

In certain parts of the encoder and decoder in the Transformer, the alignments for attention are computed between items in the same sequence. The queries, keys and values are the same ($Q=K=V$). This is called self-attention.
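
In terms of the earlier sketch, self-attention simply amounts to calling the attention helper with the same matrix for all three arguments:

```python
X = np.random.randn(5, 8)                      # one sequence of 5 token representations
self_attended = attention(X, X, X, dot_align)  # queries, keys and values are all X
```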

Why use self-attention in addition to encoder-decoder attention in the Transformer?

Different, and potentially useful, relations can be modeled from attending to tokens within the same sequence as in self-attention (versus between different sequences as in encoder-decoder attention).

Self-attention helps the model understand which parts of the sequence are relevant to other parts of it, while encoder-decoder attention helps the model understand which parts of the input sequence are relevant to the outputs.

For example, consider the input sentence “I told my sister that our mother has a present for her.” in the context of an English-to-SomeLanguage translation network. Self-attention could capture the relation between the words “sister” and “her”.

Why self-attention versus convolution/recurrence?

You might wonder why the Transformer uses the self-attention mechanism, rather than convolution or recurrence, for mapping one sequence to another.

This has to do with:

  • computational complexity per layer
  • amount of computation that can be parallelized
  • path length between long-range dependencies in the network (affecting how easy it is to learn those long-range dependencies)

This post will not go into the details, but you can refer to the paper “Attention Is All You Need” (Vaswani et al., 2017) for them.

Multi-head attention layer equations

Reminder of basic attention layer equations in RNNs for comparison

Recall the context vector from the pre-Transformer paper “Neural Machine Translation by Jointly Learning to Align and Translate” (Bahdanau et al., 2015) that provided the decoder with all the information about the input sequence that it needed for a decoding step.

This context vector, or attention output, $c_i$ per decoding step $i$ can be formulated as follows:

$c_i = \mathrm{f}(s_i, h) = \mathrm{softmax}(\mathrm{align}(s_i, h))h$

  • $s_i$: the decoder hidden state at step $i$
  • $h$: the encoder hidden states
  • $\mathrm{align}(s_i, h)$: outputs a vector of alignment scores between $s_i$ and each element of $h$

Transformer multi-head attention layer equations

Similarly, in the Transformer, the attention output can be formulated as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\mathrm{align}(Q, K))V$

  • $Q, K, V$
    • Each of these is a matrix consisting of a sequence of queries, keys or values (respectively)
      • I think it is presented this way (with $Q$ representing a whole query sequence, rather than a single query as in the RNN attention equations) because in some uses of attention (e.g. encoder self-attention) the entire query sequence is known up front, in both training and evaluation, and processing it as one matrix is more efficient
    • $Q \in \R^{\mathrm{seqlen_q} \times d_k}$, $K \in \R^{\mathrm{seqlen_k} \times d_k}$, $V \in \R^{\mathrm{seqlen_k} \times d_v}$
  • $\mathrm{align}(Q, K)$
    • This will be the scaled dot-product alignment function as discussed earlier: $\frac{QK^T}{\sqrt{d_k}} \in \R^{\mathrm{seqlen_q} \times \mathrm{seqlen_k}}$, where $d_k$ is the dimensionality of a single input key $k$ (a quick shape check follows this list)
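
A quick NumPy shape check of these definitions (reusing the helpers from the first snippet; the sequence lengths and dimensions are arbitrary):

```python
seqlen_q, seqlen_k, d_k, d_v = 3, 5, 8, 16
scaled_dot_align = lambda Q, K: Q @ K.T / np.sqrt(d_k)

Q = np.random.randn(seqlen_q, d_k)
K = np.random.randn(seqlen_k, d_k)
V = np.random.randn(seqlen_k, d_v)

assert scaled_dot_align(Q, K).shape == (seqlen_q, seqlen_k)
assert attention(Q, K, V, scaled_dot_align).shape == (seqlen_q, d_v)
```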

But these are not the final formulas.


To account for multiple attention heads, each of $Q$, $K$ and $V$ is actually first linearly projected by a matrix (unique to each of them and to each attention head) to a new dimensionality before being fed into the attention function. These projection matrices are what allow the different attention heads to learn different patterns. So each attention head actually outputs:

$\mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$

$= \mathrm{softmax}(\mathrm{align}(QW^Q_i, KW^K_i))VW^V_i$

  • $i$: the attention head index
  • $Q \in \R^{\mathrm{seqlen_q} \times d_\mathrm{model}}$, $K, V \in \R^{\mathrm{seqlen_k} \times d_\mathrm{model}}$
  • $W^Q_i, W^K_i \in \R^{d_\mathrm{model} \times d_k}$, $W^V_i \in \R^{d_\mathrm{model} \times d_v}$
    • Transformation matrices for the queries, keys and values, respectively, of attention head $i$
  • $\mathrm{align}(QW^Q_i, KW^K_i) = \frac{QW^Q_i(KW^K_i)^T}{\sqrt{d_k}}$, the scaled dot-product alignment function as mentioned earlier
    • $\in \R^{\mathrm{seqlen_q} \times \mathrm{seqlen_k}}$
  • $d_\mathrm{model}$: the dimensionality of a single query/key/value
  • $d_k$, $d_v$: the dimensionality of a single projected query/key or value, respectively
    • In the paper, this is $\frac{d_{\mathrm{model}}}{h}$ where $h$ is the number of attention heads. Projecting the queries/keys/values onto a lower dimension reduces the costs of subsequent computations.
  • $\mathrm{head}_i \in \R^{\mathrm{seqlen_q} \times d_v}$, the shape of each attention head's output

These attention heads are then combined by concatenation, and then transformed again, to produce the multi-head attention layer output:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \dots, \mathrm{head}_h)W^O$

  • $W^O \in \R^{hd_v \times d_\mathrm{model}}$, where $h$ is the number of parallel attention layers/heads ($h = 8$ in the paper)
  • $\mathrm{MultiHead}(Q, K, V) \in \R^{\mathrm{seqlen_q} \times d_\mathrm{model}}$, the shape of the layer output
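
A toy NumPy sketch of these multi-head equations, reusing the attention helper from the first snippet; random matrices stand in for the learned projections $W^Q_i$, $W^K_i$, $W^V_i$ and $W^O$:

```python
def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, align):
    # Wq[i], Wk[i]: (d_model, d_k); Wv[i]: (d_model, d_v); Wo: (h*d_v, d_model)
    heads = [attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i], align)
             for i in range(len(Wq))]              # each head: (seqlen_q, d_v)
    return np.concatenate(heads, axis=-1) @ Wo     # (seqlen_q, d_model)

d_model, h = 16, 4
d_k = d_v = d_model // h                           # as in the paper: d_model / h
scaled_dot_align = lambda Q, K: Q @ K.T / np.sqrt(d_k)

Wq = [np.random.randn(d_model, d_k) for _ in range(h)]
Wk = [np.random.randn(d_model, d_k) for _ in range(h)]
Wv = [np.random.randn(d_model, d_v) for _ in range(h)]
Wo = np.random.randn(h * d_v, d_model)

X = np.random.randn(5, d_model)                    # a sequence of 5 tokens
print(multi_head_attention(X, X, X, Wq, Wk, Wv, Wo, scaled_dot_align).shape)  # (5, 16)
```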

What is the effect of stacking the multi-head attention layers?

Stacking the multi-head attention layers allows the model to identify patterns at different levels of abstraction and learn long-range dependencies, like the stacked convolutional layers in CNNs do.

Acknowledgments