This post contains my notes on the improved relative positional encoding method from the paper “Music Transformer” (Huang et al., 2018). Unless otherwise stated, “the paper” refers to that one. This post assumes familiarity with the original Transformer and with relative positional encoding.

The improved relative positional encoding method proposed by the paper is a significant contribution, as it reduces the intermediate memory requirement of relative positional encodings in self-attention (as in “Self-Attention with Relative Position Representations” (Shaw et al., 2018)) from $O(L^2 D)$ to $O(LD)$. This allows for training on longer sequences.

Review of Shaw et al. (2018) method for relative positional encoding

To prepare for understanding the changes proposed by the paper, recall the equation for self-attention alignment with relative positional encoding from “Self-Attention with Relative Position Representations” (Shaw et al., 2018), which adds relative position information to the key space:

$$\mathrm{align}(Q, K) = \frac{QK^T + S^{rel}}{\sqrt{D}}$$

  • $L$: sequence length
  • $D$: dimensionality of the representation of a single sequence token
  • $Q, K \in \R^{L \times D}$
    • These represent the sequence, transformed into the query and key spaces, respectively
  • $S^{rel} = Q(R^K)^T$
    • $R^K \in \R^{L \times L \times D}$
      • This contains the relative positional encoding for every pair of positions, for the key space
    • Note that this computation can be reshaped into $L$ parallel multiplications of $bh \times D$ and $D \times L$ matrices (where $b$ is the batch size and $h$ is the number of attention heads), with the result reshaped again so it can be added to the first term

$R^K$ looks like this:

$$\underset{(L \times L \times D)}{R^K} = \begin{bmatrix} w_0 & w_1 & \dots & w_{L-1} \\ w_{-1} & w_0 & \dots & w_{L-2} \\ \dots & \dots & \dots & \dots \\ w_{-L+1} & w_{-L+2} & \dots & w_0 \end{bmatrix}$$

  • Where $w_x \in \R^{D}$ represents relative position $x$

Then:

$$S^{rel} = \underset{(L \times L)}{Q(R^K)^T} = \begin{bmatrix} q_0 w_0 & q_0 w_1 & \dots & q_0 w_{L-1} \\ q_1 w_{-1} & q_1 w_0 & \dots & q_1 w_{L-2} \\ \dots & \dots & \dots & \dots \\ q_{L-1} w_{-L+1} & q_{L-1} w_{-L+2} & \dots & q_{L-1} w_0 \end{bmatrix}$$

  • Where $q_x \in \R^{D}$ represents the $x^{th}$ sequence token in the query space
  • Note that in decoder self-attention, entries above the diagonal will be made irrelevant by masking
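
To make the memory cost concrete, here is a minimal NumPy sketch of this style of computation for a single attention head (no batching), materializing the full $L \times L \times D$ tensor $R^K$. All of the names below (`L`, `D`, `w`, `r_k`, `q`, `s_rel_shaw`) are mine, chosen for illustration; they are not from either paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 6, 4  # toy sequence length and per-token dimensionality

# One D-dimensional embedding w_x for every relative position x in [-(L-1), L-1]
w = {x: rng.normal(size=D) for x in range(-(L - 1), L)}

# R^K[i, j] = w_{j - i}: the relative position of key j with respect to query i
r_k = np.stack([np.stack([w[j - i] for j in range(L)]) for i in range(L)])  # (L, L, D)

q = rng.normal(size=(L, D))  # the sequence projected into the query space

# S^rel[i, j] = q_i . w_{j - i}; the intermediate r_k costs O(L^2 * D) memory
s_rel_shaw = np.einsum('id,ijd->ij', q, r_k)  # (L, L)
```

The toy sizes keep everything small, but the $L \times L \times D$ intermediate is exactly what becomes prohibitive for long sequences.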

Positional information is also added to the value space, but I won’t write out the equation because, as you will see in the next section, it is not needed for understanding the changes introduced by the paper.

Improved relative positional encoding that reduces memory requirements

What are the changes to relative positional encoding in Transformer self-attention that the paper proposes?

Don’t add relative positional encoding to value space

The paper proposes to add relative positional information only to the key space, and no longer to the value space. This is very reasonable, because the experiments in “Self-Attention with Relative Position Representations” (Shaw et al., 2018) suggested that also adding positional information to the value space was not beneficial for machine translation. So we no longer need both $R^K$ and $R^V$; we only need $R^K$, which from now on we will simply call $R$.

Calculate $Q{E^r}^T$ instead of $QR^T$ to save memory

The paper’s authors observed that they only need a subset of the terms in $R \in \R^{L \times L \times D}$: namely, the embeddings for relative positions in the range $[-L+1, 0]$.

  • I think this must mean that masked self-attention is used (each position can only attend to previous tokens of the sequence), which implies this change is intended for the Transformer decoder. The paper implies that the Music Transformer is a decoder-only Transformer (at least, that it does not use the encoder described in the original Transformer paper), although this is never stated very explicitly.

The required terms are all present in the first column of $R$. We will call the matrix containing all of these terms (ordered from $w_{-L+1}$ up to $w_0$) $E^r \in \R^{L \times D}$, which looks like this:

$$\underset{(L \times D)}{E^r} = \begin{bmatrix} w_{-L+1} \\ w_{-L+2} \\ \dots \\ w_0 \end{bmatrix}$$

So the authors of the paper propose multiplying $Q$ by ${E^r}^T$ instead of by $R^T$, and then performing some operations on the result to make its contents match those of $QR^T$. Importantly, this reduces the intermediate memory requirement of relative positional encoding from $O(L^2 D)$ to $O(LD)$.

$Q{E^r}^T$ will look like this:

$$\underset{(L \times L)}{Q{E^r}^T} = \begin{bmatrix} q_0 w_{-L+1} & q_0 w_{-L+2} & \dots & q_0 w_0 \\ q_1 w_{-L+1} & q_1 w_{-L+2} & \dots & q_1 w_0 \\ \dots & \dots & \dots & \dots \\ q_{L-1} w_{-L+1} & q_{L-1} w_{-L+2} & \dots & q_{L-1} w_0 \end{bmatrix}$$
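
Continuing the toy NumPy sketch from the previous section (same illustrative names), $E^r$ can be built from the same relative position embeddings, and $Q{E^r}^T$ computed with a single matrix product; only the $L \times D$ matrix $E^r$ is materialized.

```python
# Rows of E^r are w_{-L+1}, w_{-L+2}, ..., w_0: an (L, D) matrix
e_r = np.stack([w[x] for x in range(-(L - 1), 1)])

# qe[i, r] = q_i . w_{r - (L - 1)}, matching the matrix shown above
qe = q @ e_r.T  # (L, L)
```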

This does not look like $QR^T$, but we want to make it match $QR^T$ on and below the diagonal. How can we do that?

The terms we need in $Q{E^r}^T$ lie on its anti-diagonal (the diagonal with the “/” orientation, running from bottom left to top right) and below it, so to transform this into $QR^T$ we effectively need to shift the terms of the row with index $i_q$ to the left by $(L-1) - i_q$, which means the entry at position $(i_q, r)$ should move to $(i_q, r - [(L-1) - i_q]) = (i_q, r - (L-1) + i_q)$.

The operations that perform this transformation are detailed in the next section.

Transform contents of $Q{E^r}^T$ to be like those of $QR^T$

$Q{E^r}^T$ will then be rearranged so that its contents look just like those of $QR^T$, producing $S^{rel}$. How is this rearrangement done?

1. Pad a dummy column of length $L$ before the leftmost column

After this operation, $Q{E^r}^T$ will have been transformed into something like this:

$$\underset{(L \times (L+1))}{S^{rel'}} = \begin{bmatrix} (dum) & q_0 w_{-L+1} & q_0 w_{-L+2} & \dots & q_0 w_0 \\ (dum) & q_1 w_{-L+1} & q_1 w_{-L+2} & \dots & q_1 w_0 \\ (dum) & \dots & \dots & \dots & \dots \\ (dum) & q_{L-1} w_{-L+1} & q_{L-1} w_{-L+2} & \dots & q_{L-1} w_0 \end{bmatrix}$$
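
In the toy NumPy sketch, this step is a single pad; the dummy column here happens to be zeros, but its contents never matter.

```python
# Pad one dummy (zero) column on the left of Q(E^r)^T: (L, L) -> (L, L + 1)
padded = np.pad(qe, ((0, 0), (1, 0)))
```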

2. Reshape the matrix to be $(L+1, L)$ (rows, columns)

This is assuming row-major ordering. After this operation, $S^{rel'}$ (from step 1) will look something like this:

$$\underset{((L+1) \times L)}{S^{rel'}} = \begin{bmatrix} (dum) & q_0 w_{-L+1} & q_0 w_{-L+2} & \dots \\ q_0 w_0 & (dum) & q_1 w_{-L+1} & q_1 w_{-L+2} \\ \dots & q_1 w_0 & (dum) & \dots \\ \dots & \dots & \dots & (dum) \\ q_{L-1} w_{-L+1} & q_{L-1} w_{-L+2} & \dots & q_{L-1} w_0 \end{bmatrix}$$
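
In the sketch, this is one reshape (NumPy uses row-major ordering by default).

```python
# Reinterpret the L * (L + 1) elements as an (L + 1, L) matrix
reshaped = padded.reshape(L + 1, L)
```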

3. Slice the matrix to retain only the last $L$ rows (and all the columns)

After this operation, $S^{rel'}$ (from step 2) will look something like this:

$$\underset{(L \times L)}{S^{rel}} = \begin{bmatrix} q_0 w_0 & (dum) & q_1 w_{-L+1} & q_1 w_{-L+2} \\ \dots & q_1 w_0 & (dum) & \dots \\ \dots & \dots & \dots & (dum) \\ q_{L-1} w_{-L+1} & q_{L-1} w_{-L+2} & \dots & q_{L-1} w_0 \end{bmatrix}$$
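
And in the sketch, the final step is a slice.

```python
# Keep only the last L rows: (L + 1, L) -> (L, L)
s_rel = reshaped[1:]
```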

The effect after all three steps is that each entry at position $(i_q, r)$ gets moved to $(i_q, r - (L-1) + i_q)$.

Now we have an $L \times L$ matrix that is equivalent to $QR^T$ on and below the diagonal. Note that the entries above the diagonal don’t matter, because this is supposed to be used with masked self-attention (as in the decoder of the original Transformer).
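
As a sanity check on the toy sketch, the skewed result agrees with the Shaw et al.-style $S^{rel}$ computed earlier on and below the diagonal (the only entries that survive masking):

```python
# Compare only on and below the diagonal; entries above it are masked anyway
mask = np.tril(np.ones((L, L), dtype=bool))
assert np.allclose(s_rel[mask], s_rel_shaw[mask])
```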

Impacts of these changes in practice

As previously stated, these changes significantly reduce the intermediate memory requirement of relative attention (from $O(L^2 D)$ to $O(LD)$) and allow the model to deal with much longer sequences more effectively.

To emphasize that significance, the following table from the paper shows the differences in maximum sequence length and memory usage in practice:

Image source: "Music Transformer" (Huang et al., 2018)

Acknowledgments

  • “Music Transformer” (Huang et al., 2018)
  • “Self-Attention with Relative Position Representations” (Shaw et al., 2018)