This post contains my notes on the positional encoding in the original Transformer (a recurrence-free, attention-based, sequence-to-sequence model), as described in the paper “Attention Is All You Need” (Vaswani et al., 2017). I want to give the positional encoding special attention because I found it to be the most non-trivial, and most interesting, part of the Transformer architecture.

Transformer architecture. Image source: "Attention Is All You Need" (Vaswani et al., 2017)

Above is a diagram of the Transformer architecture for reference. The positional encoding layer comes right after the input embedding layers in the encoder and decoder, and its output is summed with the input embeddings.

Why does the Transformer need the positional encoding layer?

The ordering of the input tokens is important to the meaning of the sequence. For example, the phrase “my sister wants to eat a chicken” has a different meaning if the words are rearranged to be “a chicken wants to eat my sister”. So sequence-processing models should be able to capture positional information of the inputs.

In RNNs, the recurrent structure already intrinsically encodes positional information about the input sequence (but also requires it to be processed sequentially).

But the Transformer has no such structure (e.g., recurrent or convolutional) that intrinsically encodes positional information about the input sequence. This lack of recurrence makes it much more parallelizable, but it also means the model needs another way of encoding the input sequence’s positional information. It does that using the positional encoding layer.

What do we want from a positional encoding?

Some criteria for a positional encoding to satisfy:

  1. Unique for each position. Encoding for position $a$ should be different from that of position $b$, where $a \neq b$.
  2. Distance between the encodings for positions $i$ and $j$ of any sequence $A$ should be the same as the distance between the encodings for those positions of any other sequence $B$ (see “Self-Attention with Relative Position Representations” (Shaw et al., 2018)).
  3. Ideally, values should be zero-centered and between -1 and 1 (remember why we use normalization in neural networks). We don’t want huge values, like 1000, because huge activations and the possibly huge gradients that follow can make learning unstable.
  4. Ideally, it should also generalize easily to sequences longer than those seen during training.

Some naive positional encoding methods:

  1. One scalar per position, equal to the position index, so that a sequence is encoded as [0, 1, 2, …, <sequence_length> - 1]
    • This obviously doesn’t satisfy criterion #3
    • This also doesn’t satisfy criterion #4. If the input sequence is longer than any sequence the model was trained on, the model will see position values it has never encountered and won’t know what they mean
  2. Normalized version of naive method #1, so that values are zero-centered and between -1 and 1
    • This doesn’t satisfy criterion #2, because the encoded distance between two fixed positions now depends on the sequence length (a quick numerical illustration follows this list)
  3. One-hot vector per position
    • This doesn’t satisfy criterion #4, since the vector’s dimensionality is tied to the maximum sequence length
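As a quick numerical illustration of why naive method #2 fails criterion #2 (my own sketch, not from the paper), the snippet below rescales position indices to $[-1, 1]$ and shows that the distance between two fixed positions then depends on the sequence length:

```python
# Naive method #2: encode position t in a length-n sequence as a single
# scalar, linearly rescaled to [-1, 1].
def normalized_position(t, n):
    return 2 * t / (n - 1) - 1

# Distance between positions 2 and 5 in a length-10 vs. a length-100 sequence.
d_short = abs(normalized_position(5, 10) - normalized_position(2, 10))
d_long = abs(normalized_position(5, 100) - normalized_position(2, 100))

print(d_short)  # ~0.667
print(d_long)   # ~0.061
# The same pair of positions ends up at very different "distances" depending
# on the sequence length, violating criterion #2.
```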

The Transformer’s positional encoding method

The authors of the original Transformer paper proposed a very elegant positional encoding method that satisfies all of the aforementioned criteria.

The formula

Let $P_t$ be a $d$-dimensional positional encoding vector for position $t$ of any sequence. Let $P_{t, j}$ be the value at index $j$ of $P_t$. Then:

$P_{t, 2i} = \sin(\frac{t}{10000^{\frac{2i}{d}}})$

$P_{t, 2i+1} = \cos(\frac{t}{10000^{\frac{2i}{d}}})$

where $i \in \{0, 1, \ldots, \frac{d}{2} - 1\}$ indexes a sine-cosine pair; the sine and cosine within a pair share the same period.

Note that in the paper, $d = 512$. (This must equal the input embedding dimension, since we are summing the positional encodings with the input embeddings.)
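
Here is a minimal NumPy sketch of this formula (my own illustration; the helper name `positional_encoding` and its arguments are mine, and real implementations may vectorize it differently):

```python
import numpy as np

def positional_encoding(max_len, d):
    """Return a (max_len, d) matrix whose row t is P_t. Assumes d is even."""
    P = np.zeros((max_len, d))
    positions = np.arange(max_len)[:, np.newaxis]    # t = 0 .. max_len - 1
    i = np.arange(d // 2)[np.newaxis, :]             # pair index i
    angles = positions / (10000 ** (2 * i / d))      # t / 10000^(2i/d)
    P[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    P[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return P

P = positional_encoding(max_len=50, d=512)
print(P.shape)           # (50, 512)
print(P.min(), P.max())  # values stay within [-1, 1]
```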


What does $P_t$ look like? Let’s plug in some values for $i$. Assume $d$ is even.

| $i$ | $j$ | $P_{t, j}$ | Period |
| --- | --- | --- | --- |
| $0$ | $0$ | $\sin(t)$ | $2\pi$ |
| $0$ | $1$ | $\cos(t)$ | $2\pi$ |
| $1$ | $2$ | $\sin(\frac{t}{10000^{\frac{2}{d}}})$ | $2\pi \cdot 10000^{\frac{2}{d}}$ |
| $1$ | $3$ | $\cos(\frac{t}{10000^{\frac{2}{d}}})$ | $2\pi \cdot 10000^{\frac{2}{d}}$ |
| $2$ | $4$ | $\sin(\frac{t}{10000^{\frac{4}{d}}})$ | $2\pi \cdot 10000^{\frac{4}{d}}$ |
| $2$ | $5$ | $\cos(\frac{t}{10000^{\frac{4}{d}}})$ | $2\pi \cdot 10000^{\frac{4}{d}}$ |
| … | … | … | … |
| $\frac{d-2}{2}$ | $d-2$ | $\sin(\frac{t}{10000^{\frac{d-2}{d}}})$ | $2\pi \cdot 10000^{\frac{d-2}{d}}$ |
| $\frac{d-2}{2}$ | $d-1$ | $\cos(\frac{t}{10000^{\frac{d-2}{d}}})$ | $2\pi \cdot 10000^{\frac{d-2}{d}}$ |

You can see that each dimension $j$ of the positional encoding corresponds to a different sinusoid. The periods are different for each sine-cosine pair indexed here by $i$, and increase from $2\pi$ (for $i = 0$) up to $2\pi \cdot 10000^{\frac{d-2}{d}}$, which is just short of $20000\pi$. The values are in a nice range for neural networks, $[-1, 1]$.

How does this virtually guarantee uniqueness for each encoding vector $P_t$ in practice?

Analog clock analogy

The Transformer’s positional encoding can be likened to an analog clock’s time encoding, in which the hands maintain a unique configuration as time goes on until the clock’s entire cycle is completed. The distinct periods of the positional encoding are like the distinct periods of a clock’s hands (second, minute, hour, etc.). Encoding a subsequent position is like incrementing a clock’s time by rotating its hands.

What about repeats? Why the $10000$ in the encoding formula?

We can see that, because $P_t$ is composed of sinusoids with these periods, it can only start to repeat when $t$ reaches the least common multiple of the periods, which is at least as large as the largest period, $2\pi \cdot 10000^{\frac{d-2}{d}}$.

So if $d=512$ as in the paper, the largest period is $2\pi \cdot 10000^{\frac{510}{512}} \approx 60611.477$. This is already much larger than the sequence lengths used to train the original Transformer, and it is only a very conservative lower bound: the actual least common multiple of the periods is much higher (I won’t bother calculating it here). So the authors didn’t have to worry about $P_t$ repeating in their use case, and this may be why they chose such a large base, $10000$, in the encoding formula.
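
For concreteness, here is that largest period computed in plain Python (just a back-of-the-envelope check):

```python
import math

d = 512
largest_period = 2 * math.pi * 10000 ** ((d - 2) / d)
print(largest_period)  # ~60611.5, far longer than typical training sequence lengths
```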

Note that since we only use integer values of $t$ to generate the encoding values, and the point $t_{repeat}$ where $P_t$ would begin to repeat is almost certainly not close to an integer, in practice we would likely only encounter an approximate repeat much further past $t_{repeat}$.

Why use pairs of sines and cosines?

The authors of the Transformer paper mentioned that this positional encoding method was chosen because $P_{t + k}$ can be represented as a linear function of $P_{t}$ for any fixed offset $k$, which they hypothesized would allow the model to easily learn to attend by relative positions. This is possible because we use both sines and cosines in the encoding.

Firstly, how does the positional encoding allow $P_{t+k}$ to be represented as a linear function of $P_{t}$?

For $P_{t+k}$ to be represented as a linear function of $P_{t}$, there must exist a matrix $R$, depending only on the offset $k$ and not on $t$, such that $P_{t+k} = R P_t$. What is this $R$?

Let’s expand $P_{t+k} = R P_t$ so that we can better inspect it. Let $w_i = \frac{1}{10000^{\frac{2i}{d}}}$ be the $i^{th}$ distinct frequency of the sinusoids in our positional encoding.

$P_{t+k} = \begin{bmatrix}\sin(w_0 (t + k))\\\ \cos(w_0 (t + k))\\\ \sin(w_1 (t + k))\\\ \cos(w_1 (t + k))\\\ … \\\ \sin(w_{\frac{d-2}{2}} (t + k))\\\ \cos(w_{\frac{d-2}{2}} (t + k))\end{bmatrix}$ $=R P_t = R\begin{bmatrix}\sin(w_0 (t))\\\ \cos(w_0 (t))\\\ \sin(w_1 (t))\\\ \cos(w_1 (t))\\\ … \\\ \sin(w_{\frac{d-2}{2}} (t))\\\ \cos(w_{\frac{d-2}{2}} (t))\end{bmatrix}$

You can see from the above equation that each same-period sine-cosine pair in $P_{t+k}$ just results from a rotation around a unit circle of its corresponding pair in $P_{t}$. Using the clock analogy from earlier, $R$ is just like separately rotating each hand of the clock.

The formulas for rotating $\sin(t)$ and $\cos(t)$ by angle $\theta$ are as follows:

$\sin(t + \theta) = \sin(t)\cos(\theta) + \cos(t)\sin(\theta)$

$\cos(t + \theta) = \cos(t)\cos(\theta) - \sin(t)\sin(\theta)$

From this, we know what the rotation matrix should be for a single same-period sine-cosine pair of the positional encoding. Here is that rotation matrix in action:

$\begin{bmatrix}\sin(w_i (t + k))\\\ \cos(w_i (t + k))\end{bmatrix} = \begin{bmatrix}\cos(w_ik) & \sin(w_ik) \\\ -\sin(w_ik) & \cos(w_i k) \end{bmatrix} \begin{bmatrix}\sin(w_i t)\\\ \cos(w_i t)\end{bmatrix}$

We can extend this to account for all the sine-cosine pairs of $P_t$. The rotation matrix $\underset{(d \times d)} R$ would look something like this:

$\underset{(d \times d)} R = \begin{bmatrix}\cos(w_0 k) & \sin(w_0 k) & 0 & 0 & 0 & … \\\ -\sin(w_0 k) & \cos(w_0 k) & 0 & 0 & 0 & … \\\ 0 & 0 & \cos(w_1 k) & \sin(w_1 k) & 0 & … \\\ 0 & 0 & -\sin(w_1 k) & \cos(w_1 k) & 0 & … \\\ 0 & 0 & 0 & 0 & … & … \\\ … & … & … & … & … & … \end{bmatrix}$

You can see that $R$ is a linear transformation that satisfies $P_{t+k} = RP_t$ and could easily be the learned weight matrix of a feed-forward layer, since it depends only on the fixed offset $k$ and not on the position $t$.
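
Here is a small numerical check of this relationship (my own sketch; the dimension and positions are arbitrary, and `encode` is just a hypothetical helper that lays out $P_t$ in the sine/cosine order used above):

```python
import numpy as np

d, t, k = 8, 7, 3                                 # small dimension, arbitrary positions
w = 1.0 / 10000 ** (2 * np.arange(d // 2) / d)    # frequencies w_i

def encode(pos):
    # P_pos laid out as [sin(w_0 pos), cos(w_0 pos), sin(w_1 pos), cos(w_1 pos), ...]
    out = np.empty(d)
    out[0::2] = np.sin(w * pos)
    out[1::2] = np.cos(w * pos)
    return out

# Block-diagonal R built from 2x2 blocks [[cos, sin], [-sin, cos]], as written above.
R = np.zeros((d, d))
for i, wi in enumerate(w):
    c, s = np.cos(wi * k), np.sin(wi * k)
    R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]

# R depends only on k, not on t, yet it maps P_t to P_{t+k}.
assert np.allclose(R @ encode(t), encode(t + k))
```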


Now, imagine if $P_t$ only contained sines, or only cosines. Can we find a linear transformation $R$ that satisfies $P_{t+k} = R P_t$ in that case?

As a reminder, the formulas for rotating $\sin(t)$ and $\cos(t)$ by angle $\theta$ are as follows:

$\sin(t + \theta) = \sin(t)\cos(\theta) + \cos(t)\sin(\theta)$

$\cos(t + \theta) = \cos(t)\cos(\theta) - \sin(t)\sin(\theta)$

Notice that $\sin(t + \theta)$ and $\cos(t + \theta)$ are each a linear combination of both $\sin(t)$ and $\cos(t)$. These expressions can’t be reduced to linear functions of just one of the two, because cosine is not a constant multiple of sine; the two are related only nonlinearly, through $\sin^2(\theta) + \cos^2(\theta) = 1$. So having both sines and cosines in the encoding appears necessary for a valid $R$ to exist.
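
To make the obstruction concrete (a short derivation of my own, not from the paper): with only sines, a single component of the encoding would be $\sin(w_i t)$, and we would need a constant $r$ (depending only on $k$) such that

$\sin(w_i (t + k)) = r \sin(w_i t) \text{ for all } t$

Expanding the left side gives $\sin(w_i t)\cos(w_i k) + \cos(w_i t)\sin(w_i k)$. At $t = 0$ the right side is $0$ while the left side is $\sin(w_i k)$, which is nonzero in general, so no such constant $r$ exists. Keeping the cosine partner is exactly what supplies the missing $\cos(w_i t)$ term and makes the rotation matrix $R$ possible.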

What about using a learned positional encoding?

The authors of the paper mentioned that they experimented with learnable positional embeddings (we could just use a typical embedding layer for that), but their method involving sinusoids yielded similar results, and could additionally “allow the model to extrapolate to sequence lengths longer than the ones encountered during training”.
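
For contrast, a learned positional embedding is essentially a lookup table with one trainable $d$-dimensional vector per position, up to some maximum length. Below is a minimal PyTorch-style sketch of that idea (my own, not the paper’s code; the class name and `max_len` parameter are made up for illustration):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """One trainable d-dimensional vector per position, up to max_len positions."""
    def __init__(self, max_len, d):
        super().__init__()
        self.embed = nn.Embedding(max_len, d)

    def forward(self, x):
        # x: (batch, seq_len, d) token embeddings
        positions = torch.arange(x.size(1), device=x.device)  # 0 .. seq_len - 1
        return x + self.embed(positions)                       # broadcast over batch

# Positions beyond max_len have no row in the table, so unlike the sinusoidal
# encoding, this variant cannot extrapolate to longer sequences.
```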

Acknowledgments