This post contains my notes about the improved relative positional encoding method from the paper “Music Transformer” (Huang et al., 2018). References to “the paper” will refer to the aforementioned one, unless otherwise stated. This assumes knowledge of the original Transformer and relative positional encoding.

The improved relative positional encoding method proposed by the paper is a significant contribution, as it reduces the intermediate memory requirement of relative positional encodings in self-attention (as in “Self-Attention with Relative Position Representations” (Shaw et al., 2018)) from $O(L^2D)$ to $O(LD)$. This allows for training on longer sequences.

Review of the Shaw et al. (2018) method for relative positional encoding

To prepare for understanding the aforementioned changes proposed by the paper, recall the equation for self-attention alignment with relative positional encoding, from “Self-Attention with Relative Position Representations” (Shaw et al., 2018), which adds relative position information to the key space:

$align(Q, K) = \frac{QK^T + S^{rel}}{\sqrt{D}}$

  • $L$: sequence length
  • $D$: dimensionality of the representation of a single sequence token
  • $Q, K \in \R^{L \times D}$
    • These represent the sequence, transformed into the query and key spaces, respectively
  • $S^{rel} = Q(R^K)^T$
    • $R^K \in \R^{L \times L \times D}$
      • This contains the relative positional encoding for all position pairs for the key space
    • Note that, as in Shaw et al. (2018), this can be reshaped so that there are $L$ parallel multiplications of $(bh) \times D$ and $D \times L$ matrices (where $b$ is the batch size and $h$ is the number of attention heads), and then reshaped again so it can be added to the first term

$R^K$ looks like this:

$\underset{(L \times L \times D)} {R^K} = \begin{bmatrix}w_0 & w_1 & … & w_{L-1} \\\ w_{-1} & w_0 & … & w_{L-2} \\\ … & … & … & … \\\ w_{-L+1} & w_{-L+2} & … & w_0 \end{bmatrix}$

  • Where $w_x \in \R^{D}$ represents the relative position $x$
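
As a concrete sketch of this (the names `rel_emb_table` and `build_R` are mine, not from either paper), $R^K$ can be materialized in NumPy by indexing a table of the $2L-1$ relative-position embeddings:

```python
import numpy as np

L, D = 4, 8  # toy sequence length and representation size

# One D-dimensional embedding per relative position in [-(L-1), L-1];
# row 0 holds w_{-(L-1)}, row 2L-2 holds w_{L-1}.
rel_emb_table = np.random.randn(2 * L - 1, D)

def build_R(table, L):
    """Materialize R^K with R^K[i, j] = w_{j-i}, an (L, L, D) tensor."""
    rel = np.arange(L)[None, :] - np.arange(L)[:, None]  # rel[i, j] = j - i
    return table[rel + (L - 1)]                          # shift offsets into table indices

R = build_R(rel_emb_table, L)   # shape (L, L, D)
```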

Then:

$S^{rel} = \underset{(L \times L)} {Q{R^K}^T} = \begin{bmatrix}q_0w_0 & q_0w_1 & … & q_0w_{L-1} \\\ q_1w_{-1} & q_1w_0 & … & q_1w_{L-2} \\\ … & … & … & … \\\ q_{L-1}w_{-L+1} & q_{L-1}w_{-L+2} & … & q_{L-1}w_0 \end{bmatrix}$

  • Where $q_x \in \R^{D}$ represents the $x^{th}$ sequence token in the query space
  • Note that in decoder self-attention, entries above the diagonal will be made irrelevant by masking
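
Continuing that sketch ($Q$ and $K$ here are just random stand-ins for the projected queries and keys), the naive $S^{rel}$ and the alignment can be computed directly from the full $(L \times L \times D)$ tensor, which is exactly the $O(L^2D)$ intermediate the paper wants to avoid:

```python
Q = np.random.randn(L, D)   # sequence projected into the query space
K = np.random.randn(L, D)   # sequence projected into the key space

# S_rel[i, j] = dot(q_i, w_{j-i}); R itself is the O(L^2 D) intermediate
S_rel = np.einsum('id,ijd->ij', Q, R)

# unnormalized attention logits, before masking and softmax
align = (Q @ K.T + S_rel) / np.sqrt(D)
```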

Positional information is also added to the value space, but I won’t write out the equation, because as you will see in the next section, it is not needed for understanding the changes introduced by the paper.

Improved relative positional encoding that reduces memory requirements

What are the changes to relative positional encoding in Transformer self-attention that the paper proposes?

Don’t add relative positional encoding to value space

The paper proposes adding relative positional information only to the key space, and no longer to the value space. This is reasonable: the experiments in “Self-Attention with Relative Position Representations” (Shaw et al., 2018) suggested that also adding positional information to the value space was not beneficial for language translation. So we no longer need both $R^K$ and $R^V$; we only need $R^K$, which from now on we will simply call $R$.

Calculate $Q{E^r}^T$ instead of $QR^T$ to save memory

The paper’s authors observed that only a subset of the terms in $R \in \R^{L \times L \times D}$ is actually needed: namely, the embeddings for relative positions in the range $[-L+1, 0]$.

  • I think this must mean that masked self-attention is used (each token may only attend to previous tokens of the sequence), which implies this change is aimed at the Transformer decoder. The paper implies that the Music Transformer is a decoder-only Transformer (at least, it does not use the encoder described in the original Transformer paper), although this is never stated very explicitly.

The required embeddings are all present in the first column of $R$ (just in reverse order). We will call the matrix containing them $E^r \in \R^{L \times D}$, which looks like this:

$\underset{(L \times D)} {E^r} = \begin{bmatrix}w_{-L+1} \\\ w_{-L+2} \\\ … \\\ w_{0} \end{bmatrix}$

So the authors of the paper propose multiplying $Q$ by ${E^r}^T$ instead of by $R^T$, and then performing a few operations on the result to make its contents match those of $QR^T$. Importantly, this reduces the intermediate memory requirement of relative positional encoding from $O(L^2D)$ to $O(LD)$.

${Q{E^r}^T}$ will look like this:

$\underset{(L \times L)} {Q{E^r}^T} = \begin{bmatrix}q_0w_{-L+1} & q_0w_{-L+2} & … & q_0w_0 \\\ q_1w_{-L+1} & q_1w_{-L+2} & … & q_1w_{0} \\\ … & … & … & … \\\ q_{L-1}w_{-L+1} & q_{L-1}w_{-L+2} & … & q_{L-1}w_0 \end{bmatrix}$
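
Continuing the sketch, $E^r$ is just the first $L$ rows of the relative-embedding table (the embeddings for positions $-L+1$ through $0$, given how I indexed the table above), and ${Q{E^r}^T}$ is a single matrix product, with no $(L \times L \times D)$ tensor in sight:

```python
# E^r: embeddings for relative positions -(L-1), ..., 0
E_r = rel_emb_table[:L]     # shape (L, D)

# QEr_T[i, r] = dot(q_i, w_{-(L-1)+r}), matching the matrix above
QEr_T = Q @ E_r.T           # shape (L, L)
```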

This does not yet look like $QR^T$, but we want it to match $QR^T$ on and below the diagonal. How can we do that?

The terms we need in ${Q{E^r}^T}$ lie on its diagonal from the bottom left to the top right (the diagonal with the “$/$” orientation) and below it. To transform this into $QR^T$, we effectively need to shift the terms of each row of index $i_q$ to the left by $(L-1) - i_q$, meaning that the entry at position $(i_q, r)$ should move to $(i_q, r - [(L-1) - i_q]) = (i_q, r - (L-1) + i_q)$.

The operations we will have to perform to cause this transformation will be detailed in the next section.
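
Before moving on, here is a brute-force reference for that mapping (a loop-based sketch of my own, not how the paper implements it); the pad-reshape-slice procedure below reproduces it for all entries on and below the diagonal:

```python
def shift_reference(QEr_T):
    """Move the entry at (i_q, r) to (i_q, r - (L-1) + i_q); entries that
    would land outside the matrix are dropped, untouched slots stay zero."""
    L = QEr_T.shape[0]
    out = np.zeros_like(QEr_T)
    for i_q in range(L):
        for r in range(L):
            j = r - (L - 1) + i_q
            if 0 <= j < L:
                out[i_q, j] = QEr_T[i_q, r]
    return out
```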

Transform contents of $Q{E^r}^T$ to be like those of $QR^T$

${Q{E^r}^T}$ will then be rearranged so that the contents look just like those of $QR^T$ to produce $S^{rel}$. How is this rearrangement done?

1. Pad a dummy column of length $L$ before the leftmost column

After this operation, ${Q{E^r}^T}$ will have been transformed into something like this:

$\underset{(L \times (L+1))} {S^{rel'}} = \begin{bmatrix}(dum) & q_0w_{-L+1} & q_0w_{-L+2} & … & q_0w_0 \\\ (dum) & q_1w_{-L+1} & q_1w_{-L+2} & … & q_1w_{0} \\\ (dum) & … & … & … & … \\\ (dum) & q_{L-1}w_{-L+1} & q_{L-1}w_{-L+2} & … & q_{L-1}w_0 \end{bmatrix}$
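
In NumPy, still using the running sketch, this is a single pad call (zeros work fine as the dummy values, since these entries are never used):

```python
# (L, L) -> (L, L + 1): prepend one dummy (zero) column
padded = np.pad(QEr_T, ((0, 0), (1, 0)))
```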

2. Reshape the matrix to be $(L+1, L)$ (rows, columns)

This is assuming row-major ordering. After this operation, $S^{rel'}$ (from step 1) will look something like this:

$\underset{((L+1) \times L)} {S^{rel'}} = \begin{bmatrix}(dum) & q_0w_{-L+1} & q_0w_{-L+2} & … \\\ q_0w_0 & (dum) & q_1w_{-L+1} & q_1w_{-L+2} \\\ … & q_1w_{0} & (dum) & … \\\ … & … & … & (dum) \\\ q_{L-1}w_{-L+1} & q_{L-1}w_{-L+2} & … & q_{L-1}w_0 \end{bmatrix}$
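
With NumPy's default row-major (C) ordering, this is a plain reshape of the padded matrix from the previous step:

```python
# flatten row-major and refold into L + 1 rows of length L
reshaped = padded.reshape(L + 1, L)
```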

3. Slice the matrix to retain only the last $L$ rows (and all the columns)

After this operation, $S^{rel'}$ (from step 2) will look something like this:

$\underset{(L \times L)} {S^{rel}} = \begin{bmatrix}q_0w_0 & (dum) & q_1w_{-L+1} & q_1w_{-L+2} \\\ … & q_1w_{0} & (dum) & … \\\ … & … & … & (dum) \\\ q_{L-1}w_{-L+1} & q_{L-1}w_{-L+2} & … & q_{L-1}w_0 \end{bmatrix}$
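
And the slice is just indexing away the first row:

```python
# keep the last L rows of the (L + 1, L) matrix
S_rel_skewed = reshaped[1:, :]   # matches Q R^T on and below the diagonal
```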

The effect after all three steps is that each entry at position $(i_q, r)$ ends up at position $(i_q, r - (L-1) + i_q)$, exactly the mapping derived earlier (entries that would fall outside the matrix, along with the dummy values, wrap around to positions above the diagonal).

Now we have an $L \times L$ matrix that is equivalent to $QR^T$ on and below the diagonal. The entries above the diagonal don’t matter, because this is masked self-attention (as in the decoder of the original Transformer).
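
Putting the three steps together (“skewing” is the paper’s name for this procedure; the code itself is my own sketch) and checking it against both the naive $S^{rel}$ and the loop-based reference from earlier:

```python
def skew(QEr_T):
    """Rearrange Q (E^r)^T so it matches Q R^T on and below the diagonal."""
    L = QEr_T.shape[0]
    padded = np.pad(QEr_T, ((0, 0), (1, 0)))   # step 1: dummy column on the left
    reshaped = padded.reshape(L + 1, L)        # step 2: row-major reshape
    return reshaped[1:, :]                     # step 3: keep the last L rows

# agreement is only expected on and below the diagonal; the rest gets masked anyway
mask = np.tril(np.ones((L, L), dtype=bool))
assert np.allclose(skew(QEr_T)[mask], S_rel[mask])
assert np.allclose(skew(QEr_T)[mask], shift_reference(QEr_T)[mask])
```

In a real implementation the same three operations are applied in a batched way (one $(L \times L)$ matrix per attention head and per example), but the per-matrix logic is identical.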

Impacts of these changes in practice

As previously stated, these changes significantly reduce the intermediate memory requirement of relative attention (from $O(L^2D)$ to $O(LD)$) and allow the model to deal with much longer sequences.

To emphasize that significance, the following table from the paper shows the practical differences in length limits and memory usage:

Image source: "Music Transformer" (Huang et al., 2018)

Acknowledgments

  • “Music Transformer” (Huang et al., 2018)
  • “Self-Attention with Relative Position Representations” (Shaw et al., 2018)