Positional Encoding & Variants
Self-attention computes the same function regardless of token order — it is permutation-invariant by design. Shuffle "The cat sat on the mat" into "mat the cat sat on The" and, if no positional information is provided, the attention scores between any pair of tokens remain identical. That is untenable for language, where word order carries meaning. Position must be injected explicitly, and the choice of how to encode it has far-reaching consequences for extrapolation beyond training length, KV-cache efficiency, and context-window extension.
The Permutation-Invariance Problem
In a standard feedforward network, position is implicit in the input layout — input neuron 1 always receives the first feature. In a transformer, all positions are treated symmetrically: there is no preferred position, and no mechanism for the attention computation to distinguish "token at position 3" from "token at position 7" without an explicit signal.
The solution is to add a positional signal to the token embeddings (or to the query and key vectors directly, in the case of RoPE) so that attention scores between two tokens depend not only on their semantic content but also on their relative or absolute positions.
Sinusoidal Positional Encoding
The original "Attention Is All You Need" paper used a fixed, non-learned positional encoding based on sinusoids of different frequencies. Each position p gets a unique d_model-dimensional vector in which dimension pair (2i, 2i+1) uses a sinusoid of angular frequency 1/10000^(2i/d_model): PE(p, 2i) = sin(p / 10000^(2i/d_model)) and PE(p, 2i+1) = cos(p / 10000^(2i/d_model)).
The multi-frequency design means that nearby positions have similar high-frequency components (the fast-varying dimensions nearly match for close positions) while the slow-varying dimensions provide a coarse global position signal. The authors chose sinusoids because a linear transformation of PE(pos) can produce PE(pos+k) for any offset k — making relative position arithmetic expressible in the attention layer through the learned WQ and WK matrices.
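The construction above can be sketched directly in NumPy (the dimensions chosen here are illustrative, not tied to any particular model):

```python
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[p, 2i] = sin(p / 10000**(2i/d_model)); PE[p, 2i+1] = cos(same)."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dims: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dims: cosine
    return pe

pe = sinusoidal_encoding(128, 64)
# Nearby positions correlate more strongly than distant ones:
assert np.dot(pe[10], pe[11]) > np.dot(pe[10], pe[100])
```

The final assertion illustrates the coarse-to-fine property described above: the dot product between encodings decays with distance because the fast-varying dimensions decorrelate quickly while the slow-varying ones drift apart gradually.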
Sinusoidal encoding was found to work well in practice but was later surpassed by learned embeddings for fixed-length tasks and by relative positional methods for variable-length or long-context applications.
Learned Absolute Positional Embeddings
The simplest alternative to fixed sinusoids: a trainable embedding lookup table E of shape (max_length × d_model). The embedding for position p is E[p], learned end-to-end from the training data. The position embedding is added to the token embedding before the first transformer block.
Advantages:
- Simple to implement — just an nn.Embedding lookup
- The model learns what positional signal is actually useful for the task
- Empirically matches or slightly outperforms sinusoidal on fixed-length benchmarks

Limitations:
- Cannot extrapolate beyond max_length — positions beyond the table have no embedding
- Performance degrades sharply at inference for sequences longer than the training length
- The table adds parameters proportional to max_length
GPT-2 and GPT-3 use learned absolute positional embeddings with a fixed maximum context length (1024 and 2048 tokens respectively). The inability to extrapolate is why extending these models to longer contexts requires fine-tuning with longer sequences — the embeddings for new positions do not exist.
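A minimal NumPy stand-in for the lookup-and-add step (random weights in place of learned ones; the GPT-2-small-style sizes are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
max_length, d_model = 1024, 768       # illustrative GPT-2-small-sized table

# Trainable lookup table; random here, but learned end-to-end in practice.
pos_table = rng.normal(scale=0.02, size=(max_length, d_model))

def add_positions(token_embeddings: np.ndarray) -> np.ndarray:
    seq_len = token_embeddings.shape[0]
    if seq_len > max_length:
        # The failure mode: positions past the table simply do not exist.
        raise ValueError("no embedding for positions beyond max_length")
    return token_embeddings + pos_table[:seq_len]

x = rng.normal(size=(16, d_model))    # 16 token embeddings
y = add_positions(x)
```

The hard length limit is visible in the code: there is no principled value to return for position 1024 and beyond, which is exactly why context extension for these models requires fine-tuning with new position rows.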
Relative Positional Encodings
Rather than encoding the absolute position of each token, relative positional encodings encode the distance between token pairs. The intuition: what matters for most linguistic relationships is not "token 37 and token 41" but "tokens that are 4 positions apart." T5 introduced learnable scalar biases added to the attention logits, one per (bucketed) relative distance.
Because the biases depend only on relative distance and not absolute position, the model generalises better to sequences of lengths not seen during training. T5 was trained with maximum sequence lengths of 512 but can be applied to longer sequences at inference with modest quality degradation.
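A sketch of the T5-style bias, simplified to clipped rather than logarithmically bucketed distances (the table values are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, seq_len, max_distance = 4, 8, 16

# One learnable scalar per (head, clipped relative distance); random here.
bias_table = rng.normal(size=(num_heads, 2 * max_distance + 1))

i = np.arange(seq_len)[:, None]
j = np.arange(seq_len)[None, :]
# Clip so any sequence length indexes into the fixed-size table.
rel = np.clip(j - i, -max_distance, max_distance) + max_distance
bias = bias_table[:, rel]             # (num_heads, seq_len, seq_len)

logits = rng.normal(size=(num_heads, seq_len, seq_len)) + bias  # pre-softmax
```

Because the bias matrix is a pure function of j − i, the same table serves any sequence length: longer inputs simply reuse the clipped long-distance entries, which is what gives the method its length generalisation.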
RoPE: Rotary Position Embeddings
RoPE (Su et al., 2021; widely adopted 2023–2024) is the dominant positional encoding for modern open-weight LLMs. It encodes position by rotating the query and key vectors by angles proportional to position before computing the attention dot product. The key mathematical property: the dot product of two rotated vectors depends only on the relative angle between them — which is proportional to the relative position.
- No added parameters — positional signal is injected at attention time, not at input
- Naturally encodes relative position via the rotation angle
- Compatible with KV caching — position is encoded in Q and K, not in stored V
- Extendable to longer contexts via frequency scaling (YaRN, LongRoPE)
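The rotation and its relative-position property can be checked in a few lines of NumPy (an illustrative single-vector sketch, not an optimised implementation):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by angles pos * base**(-2i/d)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin   # 2-D rotation per pair
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# The score depends only on the offset (4), not the absolute positions:
s1 = rope(q, 5) @ rope(k, 9)
s2 = rope(q, 100) @ rope(k, 104)
assert np.isclose(s1, s2)
```

The assertion is the KV-cache-friendly property in action: cached keys rotated at their original positions still produce correct scores against any later query, since only the offset enters the dot product.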
Models using RoPE include:
- Llama 2, Llama 3 (base=500000 for Llama 3)
- Mistral, Mixtral
- Gemma, Gemma 2
- PaLM 2, Qwen 2, DeepSeek V3
- Yi, Falcon 2
ALiBi: Attention with Linear Biases
ALiBi (Press et al., 2021) takes a different approach: rather than rotating query/key vectors, it subtracts a linear penalty proportional to the distance between tokens from the attention logit. Tokens further apart receive more negative attention scores, biasing the model toward local attention.
ALiBi's key advantage is zero-shot context length generalisation: a model trained on 2048 tokens can be applied to 4096-token sequences without any retraining, with modest quality degradation rather than catastrophic failure. The linear penalty simply extrapolates — distant tokens get penalised more, but the mechanism does not break. MPT (MosaicML), BLOOM, and Baichuan use ALiBi.
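A sketch of the ALiBi bias matrix (the geometric head slopes follow the paper's power-of-two recipe; the sizes are illustrative):

```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Linear distance penalty -m_h * (i - j), one slope m_h per head."""
    # Geometric slopes 2**-1 .. 2**-8 when num_heads is a power of two.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = i - j                  # how far back key j is from query i
    return -slopes[:, None, None] * distance   # (heads, seq, seq)

bias = alibi_bias(num_heads=8, seq_len=16)
# Keys further back receive a more negative bias:
assert bias[0, 10, 2] < bias[0, 10, 9]
```

Nothing in the function references a maximum length: calling it with a larger seq_len just extends the same linear penalty, which is exactly why the method degrades gracefully rather than breaking beyond the training length.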
Context Length Extension
A major engineering challenge in 2023–2024 was extending models pre-trained with short contexts (4K–8K tokens) to handle 128K–1M+ token windows. Most successful approaches operate on the RoPE frequency schedule:
Position interpolation (Chen et al., 2023): scale down all position indices so the maximum position in the long context maps onto the maximum position seen during training. Simple and effective with fine-tuning, but it compresses position resolution — nearby tokens become harder to distinguish. Used to extend Llama models from 4K to 32K.
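The index rescaling amounts to a single multiplication (the sizes here assume a 4K-trained model extended to 32K):

```python
# Position interpolation: rescale indices so a longer context maps into the
# trained range. Fine-tuning on long sequences is still needed in practice.
train_len, target_len = 4096, 32768
scale = train_len / target_len        # 0.125: 8x compression of resolution

def interpolated_positions(seq_len: int) -> list:
    # Position 32767 maps to 4095.875, inside the trained [0, 4096) range.
    return [p * scale for p in range(seq_len)]

pos = interpolated_positions(target_len)
assert max(pos) < train_len
```

The cost is visible in the output: adjacent tokens now sit 0.125 position units apart instead of 1.0, which is the resolution loss the text describes.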
YaRN, "Yet Another RoPE extensioN" (Peng et al., 2023), applies different scaling to different frequency bands of RoPE — high-frequency dimensions (local context) are not interpolated, while low-frequency dimensions (global context) are. It outperforms uniform interpolation and requires only ~0.1% of the original training compute for fine-tuning.
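A simplified sketch of band-dependent scaling (the wavelength thresholds and linear ramp here are assumed values; YaRN's exact ramp construction and attention-temperature correction are omitted):

```python
import numpy as np

d, base, scale = 128, 10000.0, 8.0            # head dim, RoPE base, extension
freqs = base ** (-np.arange(0, d, 2) / d)     # per-pair rotation frequencies
wavelengths = 2 * np.pi / freqs               # tokens per full rotation

# Linear ramp between two wavelength thresholds (assumed values): dims with
# short wavelengths keep their frequency, long-wavelength dims are divided
# by the scale factor, and dims in between are blended.
low, high = 32.0, 4096.0
ramp = np.clip((wavelengths - low) / (high - low), 0.0, 1.0)
new_freqs = freqs * (1 - ramp) + (freqs / scale) * ramp
```

The highest-frequency dimensions come out unchanged (preserving local position resolution) while the lowest-frequency dimensions are fully slowed down by the scale factor, stretching the global position signal over the longer context.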
Meta extended Llama 3 to 128K context using a combination of increased RoPE base frequency (500000 vs 10000), long-context continued pre-training on data with long documents, and YaRN-style frequency scaling. The result is a natively long-context model without architectural changes.
Context length extension is now a routine post-training step for production LLMs. The key insight across all methods: the RoPE rotation angles are a continuous function of position and base frequency, so changing the base frequency or scaling the position indices shifts the model's effective context without requiring architectural surgery.
Checklist: Do You Understand This?
- Can you explain why self-attention is permutation-invariant without positional encoding, and give a concrete example of two sentences that would be indistinguishable without it?
- Can you describe the sinusoidal encoding formula and explain what the geometric progression of wavelengths (from 2π to 10000·2π tokens) is designed to capture?
- Can you explain why learned absolute positional embeddings cannot extrapolate beyond training length, while sinusoidal encoding can?
- Can you describe what RoPE does to query and key vectors and explain the mathematical property that makes the dot product depend only on relative position?
- Can you describe what ALiBi does to attention logits and explain why it enables zero-shot context length generalisation?
- Can you describe the position interpolation approach for context length extension and explain what quality tradeoff it makes?
- Can you name the positional encoding used by Llama 3, Mistral, and Gemma, and explain one practical advantage over learned absolute embeddings?