Positional Encoding & Variants
Self-attention computes the same function regardless of token order — it is permutation-invariant by design. Shuffle "The cat sat on the mat" into "mat the cat sat on The" and, if no positional information is provided, the attention scores between any pair of tokens remain identical. That is untenable for language, where word order carries meaning. Position must be injected explicitly, and the choice of how to encode it has far-reaching consequences for extrapolation beyond training length, KV-cache efficiency, and context-window extension.
The Permutation-Invariance Problem
In a standard feedforward network, position is implicit in the input layout — input neuron 1 always receives the first feature. In a transformer, all positions are treated symmetrically: there is no preferred position, and no mechanism for the attention computation to distinguish "token at position 3" from "token at position 7" without an explicit signal.
The solution is to add a positional signal to the token embeddings (or to the query and key vectors directly, in the case of RoPE) so that attention scores between two tokens depend not only on their semantic content but also on their relative or absolute positions.
Sinusoidal Positional Encoding
The original "Attention Is All You Need" paper used a fixed, non-learned positional encoding based on sinusoids of different frequencies. Each position p gets a unique d_model-dimensional vector in which dimension pair (2i, 2i+1) uses a sinusoid of angular frequency 1/10000^(2i/d_model): PE(p, 2i) = sin(p / 10000^(2i/d_model)) and PE(p, 2i+1) = cos(p / 10000^(2i/d_model)).
The multi-frequency design means that nearby positions have similar high-frequency components (the fast-varying dimensions nearly match for close positions) while the slow-varying dimensions provide a coarse global position signal. The authors chose sinusoids because a linear transformation of PE(pos) can produce PE(pos+k) for any offset k — making relative position arithmetic expressible in the attention layer through the learned WQ and WK matrices.
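The construction above can be sketched directly in NumPy (the dimensions chosen here are illustrative, not tied to any particular model):

```python
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[p, 2i] = sin(p / 10000**(2i/d_model)); PE[p, 2i+1] = cos(same)."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dims: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dims: cosine
    return pe

pe = sinusoidal_encoding(128, 64)
# Nearby positions correlate more strongly than distant ones:
assert np.dot(pe[10], pe[11]) > np.dot(pe[10], pe[100])
```

The final assertion illustrates the coarse-to-fine property described above: the dot product between encodings decays with distance because the fast-varying dimensions decorrelate quickly while the slow-varying ones drift apart gradually.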
Sinusoidal encoding was found to work well in practice but was later surpassed by learned embeddings for fixed-length tasks and by relative positional methods for variable-length or long-context applications.
Learned Absolute Positional Embeddings
The simplest alternative to fixed sinusoids: a trainable embedding lookup table E of shape (max_length × d_model). The embedding for position p is E[p], learned end-to-end from the training data. The position embedding is added to the token embedding before the first transformer block.
Advantages:
- Simple to implement — just an nn.Embedding lookup
- The model learns what positional signal is actually useful for the task
- Empirically matches or slightly outperforms sinusoidal on fixed-length benchmarks

Limitations:
- Cannot extrapolate beyond max_length — positions beyond the table have no embedding
- Performance degrades sharply at inference for sequences longer than the training length
- The table adds parameters proportional to max_length
GPT-2 and GPT-3 use learned absolute positional embeddings with a fixed maximum context length (1024 and 2048 tokens respectively). The inability to extrapolate is why extending these models to longer contexts requires fine-tuning with longer sequences — the embeddings for new positions do not exist.
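A minimal NumPy stand-in for the lookup-and-add step (random weights in place of learned ones; the GPT-2-small-style sizes are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
max_length, d_model = 1024, 768       # illustrative GPT-2-small-sized table

# Trainable lookup table; random here, but learned end-to-end in practice.
pos_table = rng.normal(scale=0.02, size=(max_length, d_model))

def add_positions(token_embeddings: np.ndarray) -> np.ndarray:
    seq_len = token_embeddings.shape[0]
    if seq_len > max_length:
        # The failure mode: positions past the table simply do not exist.
        raise ValueError("no embedding for positions beyond max_length")
    return token_embeddings + pos_table[:seq_len]

x = rng.normal(size=(16, d_model))    # 16 token embeddings
y = add_positions(x)
```

The hard length limit is visible in the code: there is no principled value to return for position 1024 and beyond, which is exactly why context extension for these models requires fine-tuning with new position rows.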
Relative Positional Encodings
Rather than encoding the absolute position of each token, relative positional encodings encode the distance between token pairs. The intuition: what matters for most linguistic relationships is not "token 37 and token 41" but "tokens that are 4 positions apart." T5 introduced learnable scalar biases added to the attention logits, one per (bucketed) relative distance.
Because the biases depend only on relative distance and not absolute position, the model generalises better to sequences of lengths not seen during training. T5 was trained with maximum sequence lengths of 512 but can be applied to longer sequences at inference with modest quality degradation.
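A sketch of the T5-style bias, simplified to clipped rather than logarithmically bucketed distances (the table values are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, seq_len, max_distance = 4, 8, 16

# One learnable scalar per (head, clipped relative distance); random here.
bias_table = rng.normal(size=(num_heads, 2 * max_distance + 1))

i = np.arange(seq_len)[:, None]
j = np.arange(seq_len)[None, :]
# Clip so any sequence length indexes into the fixed-size table.
rel = np.clip(j - i, -max_distance, max_distance) + max_distance
bias = bias_table[:, rel]             # (num_heads, seq_len, seq_len)

logits = rng.normal(size=(num_heads, seq_len, seq_len)) + bias  # pre-softmax
```

Because the bias matrix is a pure function of j − i, the same table serves any sequence length: longer inputs simply reuse the clipped long-distance entries, which is what gives the method its length generalisation.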
RoPE: Rotary Position Embeddings
RoPE (Su et al., 2021; widely adopted 2023–2024) is the dominant positional encoding for modern open-weight LLMs. It encodes position by rotating the query and key vectors by angles proportional to position before computing the attention dot product. The key mathematical property: the dot product of two rotated vectors depends only on the relative angle between them — which is proportional to the relative position.
- No added parameters — positional signal is injected at attention time, not at input
- Naturally encodes relative position via the rotation angle
- Compatible with KV caching — position is encoded in Q and K, not in stored V
- Extendable to longer contexts via frequency scaling (YaRN, LongRoPE)
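The rotation and its relative-position property can be checked in a few lines of NumPy (an illustrative single-vector sketch, not an optimised implementation):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by angles pos * base**(-2i/d)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin   # 2-D rotation per pair
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# The score depends only on the offset (4), not the absolute positions:
s1 = rope(q, 5) @ rope(k, 9)
s2 = rope(q, 100) @ rope(k, 104)
assert np.isclose(s1, s2)
```

The assertion is the KV-cache-friendly property in action: cached keys rotated at their original positions still produce correct scores against any later query, since only the offset enters the dot product.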
Models using RoPE include:
- Llama 2, Llama 3 (base=500000 for Llama 3)
- Mistral, Mixtral
- Gemma, Gemma 2
- PaLM 2, Qwen 2, DeepSeek V3
- Yi, Falcon 2
ALiBi: Attention with Linear Biases
ALiBi (Press et al., 2021) takes a different approach: rather than rotating query/key vectors, it subtracts a linear penalty proportional to the distance between tokens from the attention logit. Tokens further apart receive more negative attention scores, biasing the model toward local attention.
ALiBi's key advantage is zero-shot context length generalisation: a model trained on 2048 tokens can be applied to 4096-token sequences without any retraining, with modest quality degradation rather than catastrophic failure. The linear penalty simply extrapolates — distant tokens get penalised more, but the mechanism does not break. MPT (MosaicML), BLOOM, and Baichuan use ALiBi.
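A sketch of the ALiBi bias matrix (the geometric head slopes follow the paper's power-of-two recipe; the sizes are illustrative):

```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Linear distance penalty -m_h * (i - j), one slope m_h per head."""
    # Geometric slopes 2**-1 .. 2**-8 when num_heads is a power of two.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = i - j                  # how far back key j is from query i
    return -slopes[:, None, None] * distance   # (heads, seq, seq)

bias = alibi_bias(num_heads=8, seq_len=16)
# Keys further back receive a more negative bias:
assert bias[0, 10, 2] < bias[0, 10, 9]
```

Nothing in the function references a maximum length: calling it with a larger seq_len just extends the same linear penalty, which is exactly why the method degrades gracefully rather than breaking beyond the training length.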
Context Length Extension
A major engineering challenge in 2023–2024 was extending models pre-trained with short contexts (4K–8K tokens) to handle 128K–1M+ token windows. Most successful approaches operate on the RoPE frequency schedule:
Position interpolation (Chen et al., 2023): scale down all position indices so the maximum position in the long context maps onto the maximum position seen during training. Simple and effective with fine-tuning, but it compresses position resolution — nearby tokens become harder to distinguish. Used to extend Llama models from 4K to 32K.
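The index rescaling amounts to a single multiplication (the sizes here assume a 4K-trained model extended to 32K):

```python
# Position interpolation: rescale indices so a longer context maps into the
# trained range. Fine-tuning on long sequences is still needed in practice.
train_len, target_len = 4096, 32768
scale = train_len / target_len        # 0.125: 8x compression of resolution

def interpolated_positions(seq_len: int) -> list:
    # Position 32767 maps to 4095.875, inside the trained [0, 4096) range.
    return [p * scale for p in range(seq_len)]

pos = interpolated_positions(target_len)
assert max(pos) < train_len
```

The cost is visible in the output: adjacent tokens now sit 0.125 position units apart instead of 1.0, which is the resolution loss the text describes.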
YaRN, "Yet Another RoPE extensioN" (Peng et al., 2023), applies different scaling to different frequency bands of RoPE — high-frequency dimensions (local context) are not interpolated, while low-frequency dimensions (global context) are. It outperforms uniform interpolation and requires only ~0.1% of the original training compute for fine-tuning.
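A simplified sketch of band-dependent scaling (the wavelength thresholds and linear ramp here are assumed values; YaRN's exact ramp construction and attention-temperature correction are omitted):

```python
import numpy as np

d, base, scale = 128, 10000.0, 8.0            # head dim, RoPE base, extension
freqs = base ** (-np.arange(0, d, 2) / d)     # per-pair rotation frequencies
wavelengths = 2 * np.pi / freqs               # tokens per full rotation

# Linear ramp between two wavelength thresholds (assumed values): dims with
# short wavelengths keep their frequency, long-wavelength dims are divided
# by the scale factor, and dims in between are blended.
low, high = 32.0, 4096.0
ramp = np.clip((wavelengths - low) / (high - low), 0.0, 1.0)
new_freqs = freqs * (1 - ramp) + (freqs / scale) * ramp
```

The highest-frequency dimensions come out unchanged (preserving local position resolution) while the lowest-frequency dimensions are fully slowed down by the scale factor, stretching the global position signal over the longer context.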
Meta extended Llama 3 to 128K context using a combination of increased RoPE base frequency (500000 vs 10000), long-context continued pre-training on data with long documents, and YaRN-style frequency scaling. The result is a natively long-context model without architectural changes.
Context length extension is now a routine post-training step for production LLMs. The key insight across all methods: the RoPE rotation angles are a continuous function of position and base frequency, so changing the base frequency or scaling the position indices shifts the model's effective context without requiring architectural surgery.
Checklist: Do You Understand This?
- Can you explain why self-attention is permutation-invariant without positional encoding, and give a concrete example of two sentences that would be indistinguishable without it?
- Can you describe the sinusoidal encoding formula and explain what the geometric progression of wavelengths (from 2π to 10000·2π tokens) is designed to capture?
- Can you explain why learned absolute positional embeddings cannot extrapolate beyond training length, while sinusoidal encoding can?
- Can you describe what RoPE does to query and key vectors and explain the mathematical property that makes the dot product depend only on relative position?
- Can you describe what ALiBi does to attention logits and explain why it enables zero-shot context length generalisation?
- Can you describe the position interpolation approach for context length extension and explain what quality tradeoff it makes?
- Can you name the positional encoding used by Llama 3, Mistral, and Gemma, and explain one practical advantage over learned absolute embeddings?