This post contains my notes on the relative positional encoding method proposed in the paper “Self-Attention with Relative Position Representations” (Shaw et al., 2018) as an alternative to sinusoidal positional encoding in Transformers. References to “the paper” refer to that paper, unless otherwise stated. This post assumes knowledge of self-attention in the original Transformer.

Absolute versus relative positional encoding

Before discussing absolute and relative positional encodings, here is what I mean by them in this context.

Absolute positional encoding

  • Encodes the position of a token without reference to another token
  • The encoding function is in the form $p(t)$, where $t$ is some token position
  • Sinusoidal positional encoding as proposed in the original Transformer paper is an example of this

Relative positional encoding

  • Encodes the position of a token relative to another token. In other words, it encodes the distance between two tokens
  • The encoding function is in the form $p(i, j)$, where $i$ and $j$ are token positions

The motivation for relative positional encoding

The paper proposes a relative positional encoding method as an alternative to the sinusoidal positional encoding method in the original Transformer. Why might we want relative positional encoding over absolute positional encoding?

Generalizes better to sequence lengths not seen during training

Absolute positional encoding might not allow the model to generalize well to sequence lengths not seen during training (and some types of absolute positional encoding cannot even produce an encoding for a position beyond a fixed maximum, so the model cannot process sequences past a certain length). The upstream “neurons” might not interpret the encodings of unseen positions effectively, because they haven’t been trained on them (or not enough).

In contrast, the authors of the relative positional encoding paper “hypothesized that precise relative position information is not useful beyond a certain distance”. This implies that we might only need to encode a small range of relative positions (potentially much smaller than the maximum sequence length), which should be much easier to train the model on. This means that relative positional encoding should be better at generalizing to sequence lengths not seen during training.

More intuitive

Personally, when I read and write, I don’t think about the positions of words in absolute terms, but rather relative to each other (e.g. X positions before or after a certain word). Always thinking in terms of absolute positions seems very difficult and unscalable to me.

This suggests that relative positional encoding may have some efficiency benefits over absolute positional encoding, and may be worth trying.

Self-attention with relative positional encoding

This section details the changes that the paper’s authors propose making to the original Transformer, for replacing the sinusoidal positional encoding with relative positional encoding.

Proposed changes to allow for relation-aware self-attention

What are the changes that the paper proposes?

  1. Remove the sinusoidal positional encoding layer from the original Transformer

    • From the paper: “In our experiments we did not observe any benefit from including sinusoidal position encodings in addition to relative position representations”
  2. Add two new learned vectors $\textcolor{limegreen}{a_{ij}^V}$ and $\textcolor{limegreen}{a_{ij}^K}$, representing the relative position between the tokens at positions $i$ and $j$, to the value and key spaces of self-attention, respectively

    • These relative positional representations can be shared across all attention heads
    • The reason separate vectors are added directly to each space is that the paper’s authors wanted to avoid an additional linear transformation to project the positional information into those spaces
    • Note that integrating $\textcolor{limegreen}{a_{ij}^V}$ may not be necessary for machine translation:
      • The authors of the paper tested their model with and without this, and found that for machine translation, it doesn’t seem to offer benefit. They mentioned that “further work is needed to determine whether this is true for other tasks”.

Details of changes to self-attention equations

The following are the changes to the self-attention equations as described in the paper. (See the paper, my posts introducing the attention mechanism, and/or my post on multi-head attention in the Transformer, for more context on these equations.) A short code sketch of these per-position computations follows the list.

  1. Add $\textcolor{limegreen}{a_{ij}^V}$ to value space

    • $z_i = \sum_{j=1}^{\mathrm{seqlen_x}}\alpha_{ij}(x_j W^V + \textcolor{limegreen}{a_{ij}^V})$
      • This is also known as the context vector for position $i$ of the sequence. If $\mathrm{head_h} \in \R^{\mathrm{seqlen_x} \times d_z}$ is the attention head indexed by $h$, then $\mathrm{head_h}[i] = z_i$.
      • $\textcolor{limegreen}{a_{ij}^V} \in \R^{d_z}$
      • $x_j \in \R^{d_{x}}$
      • $W^V \in \R^{d_{x} \times d_z}$
      • $z_i \in \R^{d_z}$
  2. Add $\textcolor{limegreen}{a_{ij}^K}$ to key space

    • $e_{ij} = \frac{x_i W^Q(x_j W^K + \textcolor{limegreen}{a_{ij}^K} )^T}{\sqrt{d_z}}$
      • This calculates the scalar alignment between the tokens at positions $i$ and $j$, and later gets softmaxed using the other $e_{ij}$ to produce $\alpha_{ij}$.
      • $\textcolor{limegreen}{a_{ij}^K} \in \R^{d_z}$
      • $x_i, x_j \in \R^{d_{x}}$
      • $W^Q, W^K \in \R^{d_{x} \times d_z}$
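Below is a minimal NumPy sketch of these two per-position equations, computing the context vector $z_i$ for a single query position $i$. The function and variable names are my own, not from the paper, and the relative position representations are assumed to be given as full $(n, n, d_z)$ arrays.

```python
# A sketch of relation-aware self-attention for one query position i.
# x: (n, d_x) token representations for one sequence
# W_Q, W_K, W_V: (d_x, d_z) projection matrices
# a_K, a_V: (n, n, d_z) relative position representations,
#           where a_K[i, j] = a_ij^K and a_V[i, j] = a_ij^V
import numpy as np

def relation_aware_attention_row(x, W_Q, W_K, W_V, a_K, a_V, i):
    d_z = W_Q.shape[1]

    q_i = x[i] @ W_Q                              # (d_z,)
    K = x @ W_K                                   # (n, d_z)
    V = x @ W_V                                   # (n, d_z)

    # e_ij = q_i . (k_j + a_ij^K) / sqrt(d_z), for all j at once
    e_i = (K + a_K[i]) @ q_i / np.sqrt(d_z)       # (n,)

    # alpha_ij = softmax over j
    alpha_i = np.exp(e_i - e_i.max())
    alpha_i /= alpha_i.sum()

    # z_i = sum_j alpha_ij * (v_j + a_ij^V)
    z_i = alpha_i @ (V + a_V[i])                  # (d_z,)
    return z_i
```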

Details of the relative position representations

  • Relative positions are clipped to a maximum absolute value of $k$, so the range of relative positions represented is $[-k, k]$
    • As mentioned earlier, the authors “hypothesized that precise relative position information is not useful beyond a certain distance”
  • $a_{ij}^K = w^K_{\mathrm{clip}(j-i, k)} \in \R^{d_z}$
  • $a_{ij}^V = w^V_{\mathrm{clip}(j-i, k)} \in \R^{d_z}$
  • $\mathrm{clip}(x, k) = \max(-k, \min(k, x))$
  • We learn the relative position representations $w^K = (w^K_{-k}, \dots, w^K_{k})$ and $w^V = (w^V_{-k}, \dots, w^V_{k})$ (the sketch after this list shows the corresponding lookup)
    • $w^K, w^V \in \R^{(2k+1) \times d_z}$
    • These can be shared across all attention heads
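Here is a small sketch (my own naming, with random placeholders standing in for learned parameters) of how these clipped relative position representations can be materialized as lookups into the tables $w^K$ and $w^V$:

```python
# Builds a_K and a_V of shape (n, n, d_z), where
# a_K[i, j] = w_K[clip(j - i, k)] and similarly for a_V.
import numpy as np

def relative_position_tables(n, k, d_z):
    rng = np.random.default_rng(0)
    # Learned parameters in a real model; random placeholders here.
    w_K = rng.normal(size=(2 * k + 1, d_z))
    w_V = rng.normal(size=(2 * k + 1, d_z))

    positions = np.arange(n)
    rel = positions[None, :] - positions[:, None]   # rel[i, j] = j - i
    rel = np.clip(rel, -k, k) + k                   # shift into [0, 2k] for indexing

    return w_K[rel], w_V[rel]                       # each (n, n, d_z)
```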

More efficient implementation of changes for relative positional encoding

The original Transformer parallelizes the self-attention alignment computations (involving multiplication of matrices of shapes $n \times d_z$ and $d_z \times n$ for sequence length $n$) across the batch size $b$ and the number of attention heads $h$.

So in practice, the original Transformer runs $bh$ parallel computations of the following, each producing the alignments for a single sequence and a single attention head (a code sketch follows the list below):

$align(Q, K) = \frac{QK^T}{\sqrt{d_z}}$

  • $Q, K \in \R^{\mathrm{n} \times d_z}$
    • These represent a sequence transformed into the query and key spaces, respectively
  • $align(Q, K) \in \R^{n \times n}$
  • This computes the scalar alignment $e_{ij}$ for all positions $i$ and $j$ of a sequence, in a single matrix multiplication
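As a point of reference, here is that baseline computation sketched with the batch and head dimensions made explicit (shapes and names are my own):

```python
# Baseline alignment: bh independent (n, d_z) x (d_z, n) matrix products,
# expressed as one batched matmul over the leading (b, h) dimensions.
import numpy as np

def align(Q, K):
    # Q, K: (b, h, n, d_z); returns alignments e of shape (b, h, n, n)
    d_z = Q.shape[-1]
    return Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_z)
```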

What should that computation be like with relative positional encoding?

Recall the equation for calculating alignment with relative positional encoding given by the paper:

$e_{ij} = \frac{x_i W^Q(x_j W^K + \textcolor{limegreen}{a_{ij}^K} )^T}{\sqrt{d_z}}$

We would like to package these such that they represent the entire sequence, for more efficient computing, as in the original Transformer. So a naive form could be something like this:

$align(Q, K) = \frac{Q(K + \textcolor{limegreen}{A^K})^T}{\sqrt{d_z}}$

  • This is just the same as in the original Transformer, except with the addition of the relative positional information $A^K$ to the key space
  • $Q, K \in \R^{n \times d_z}$
  • There would be $bh$ parallel computations of this

$A^K$ would have to contain the relative positional information for each of the $n \times n$ position pairs $(i, j)$ in the sequence, which can all be different from each other, so $A^K \in \R^{n \times n \times d_z}$. Computing $K + A^K$ means there would have to be a new intermediate tensor with the same dimensionality as $A^K$ for each of the $bh$ parallel computations, in addition to the final result tensor. Also, since $K$ and $A^K$ have different shapes, the addition involves broadcasting.
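A sketch of this naive formulation (my own naming), which makes the broadcasted $(b, h, n, n, d_z)$ intermediate explicit:

```python
# Naive relation-aware alignment: K is broadcast against A_K, creating a
# (b, h, n, n, d_z) intermediate before the dot products with Q.
import numpy as np

def align_naive(Q, K, A_K):
    # Q, K: (b, h, n, d_z); A_K: (n, n, d_z); returns (b, h, n, n)
    d_z = Q.shape[-1]
    keys = K[:, :, None, :, :] + A_K                              # (b, h, n, n, d_z)
    # e[b, h, i, j] = sum_d Q[b, h, i, d] * keys[b, h, i, j, d]
    return np.einsum('bhid,bhijd->bhij', Q, keys) / np.sqrt(d_z)
```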

We can avoid broadcasting and make this more efficient by splitting the above equation into two tensor multiplications:

$align(Q, K) = \frac{QK^T + Q(A^K)^T}{\sqrt{d_z}}$

  • The first tensor product can be computed in $bh$ parallel computations of $n \times d_z$ and $d_z \times n$ tensors as in the original Transformer
  • The second tensor product can be reshaped so that there are $n$ parallel computations of $bh \times d_z$ and $d_z \times n$ tensors, and the result reshaped back so it can be added to the first term

This same technique can also be applied to more efficiently compute the part of the self-attention equation where positional information is added to the value space.
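Here is a sketch of the split computation for the key-space term, under my own naming and shape conventions; the same reshaping idea carries over to the value-space term. This version never materializes an $(n, n, d_z)$-shaped intermediate per batch and head.

```python
# Efficient relation-aware alignment: split into Q K^T plus Q (A_K)^T,
# with the second term reshaped so A_K is not expanded across b and h.
import numpy as np

def align_efficient(Q, K, A_K):
    # Q, K: (b, h, n, d_z); A_K: (n, n, d_z); returns (b, h, n, n)
    b, h, n, d_z = Q.shape

    # First term: the usual bh parallel (n, d_z) x (d_z, n) products.
    content = Q @ K.transpose(0, 1, 3, 2)                    # (b, h, n, n)

    # Second term: move the query position axis to the front so there are
    # n parallel (b*h, d_z) x (d_z, n) products against A_K.
    Q_t = Q.transpose(2, 0, 1, 3).reshape(n, b * h, d_z)     # (n, b*h, d_z)
    rel = Q_t @ A_K.transpose(0, 2, 1)                       # (n, b*h, n)
    rel = rel.reshape(n, b, h, n).transpose(1, 2, 0, 3)      # (b, h, n, n)

    return (content + rel) / np.sqrt(d_z)
```

The naive and split versions should agree up to floating point error, which makes `np.allclose` a convenient sanity check when implementing this.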

Acknowledgments

  • “Self-Attention with Relative Position Representations” (Shaw et al., 2018)