The Full Transformer Block
The transformer block is the repeating unit of every modern large language model. A GPT-4-class model stacks 96 or more of these blocks, each adding depth, expressiveness, and the capacity to represent increasingly abstract features. Understanding the internal structure of a single block — and why each design decision was made — is prerequisite knowledge for understanding training dynamics, inference cost, and the engineering tradeoffs in production models.
Block Structure
A modern transformer block follows the Pre-LayerNorm pattern and contains exactly two sublayers, a multi-head self-attention sublayer and a position-wise feedforward sublayer, each wrapped in a residual connection:

1. LayerNorm: normalise the residual stream before attention, stabilising the activations entering the attention computation.
2. Self-attention: queries, keys, and values are projected per head, attention weights are computed, and the output is projected back to d_model.
3. Residual add: add the attention output to the original input, x = x + Attn(LN(x)), giving a gradient highway through all blocks.
4. LayerNorm: normalise again before the feedforward sublayer.
5. FFN: a two-layer MLP applied position-wise: expand to 4×d_model, activate, project back. This is where factual associations are stored.
6. Residual add: add the FFN output, x = x + FFN(LN(x)). The block output enters the next block's residual stream.
Written as equations, where x is the residual stream at the block's input:

x = x + Attn(LayerNorm(x))
x = x + FFN(LayerNorm(x))
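The two residual updates, x = x + Attn(LN(x)) and x = x + FFN(LN(x)), can be made concrete with a minimal forward pass. The sketch below uses toy dimensions, simplified single-head attention without a causal mask, and an arbitrary small initialisation scale; all of these are illustrative assumptions, not any production model's values:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position independently across the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head self-attention (multi-head and causal mask omitted).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return (weights @ v) @ Wo

def ffn(x, W1, W2):
    # Position-wise two-layer MLP; ReLU here, GELU/SwiGLU in practice.
    return np.maximum(x @ W1, 0.0) @ W2

def pre_ln_block(x, params):
    Wq, Wk, Wv, Wo, W1, W2 = params
    x = x + attention(layer_norm(x), Wq, Wk, Wv, Wo)  # x = x + Attn(LN(x))
    x = x + ffn(layer_norm(x), W1, W2)                # x = x + FFN(LN(x))
    return x

rng = np.random.default_rng(0)
d, seq = 64, 8
params = [rng.normal(0, 0.02, s) for s in
          [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
x = rng.normal(size=(seq, d))
y = pre_ln_block(x, params)
print(y.shape)  # (8, 64)
```

Note that the block preserves the shape (seq, d_model), which is what lets dozens of identical blocks be stacked.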
Pre-LayerNorm vs Post-LayerNorm
The placement of layer normalisation relative to the residual connection is a subtle but important training stability choice. The original transformer used Post-LN (normalise after the residual sum). All major modern LLMs use Pre-LN (normalise before the sublayer computation).
Post-LN: x = LayerNorm(x + Sublayer(x))

The residual sum is pushed through LayerNorm at every layer, so the main path is repeatedly renormalised and gradients must pass back through every normalisation. At initialisation this leaves gradient magnitudes near the output layers disproportionately large, which causes training instability for very deep networks: without careful learning-rate warmup, large learning rates or deep models often diverge early in training.
Pre-LN: x = x + Sublayer(LayerNorm(x))

The sublayer receives normalised input, but the residual stream itself is never forced through a normalisation operation: it can grow freely, and the identity path carries gradients unimpeded. Pre-LN reduces or eliminates the need for learning-rate warmup, converges more reliably at depth, and is the universal choice in GPT-3, LLaMA, Mistral, PaLM, and their successors (with a single final LayerNorm applied to the stream before the output head).
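The stability difference is visible even in a toy simulation with linear sublayers (the dimensions, depth, and weight scale below are arbitrary illustrative choices): the Post-LN stream is renormalised at every layer, while the Pre-LN stream grows freely with depth and the identity path is preserved.

```python
import numpy as np

def ln(x, eps=1e-5):
    # LayerNorm over the feature dimension of a single vector.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
d, depth = 256, 48
Ws = [rng.normal(0, 1 / np.sqrt(d), (d, d)) for _ in range(depth)]

post = pre = rng.normal(size=d)
for W in Ws:
    post = ln(post + post @ W)   # Post-LN: stream renormalised every layer
    pre = pre + ln(pre) @ W      # Pre-LN: stream grows freely

# Post-LN stream norm is pinned to ~sqrt(d); Pre-LN stream norm grows.
print(round(np.linalg.norm(post), 1), round(np.linalg.norm(pre), 1))
```

The growing Pre-LN stream is harmless because each sublayer only ever sees the normalised copy; the raw stream exists to carry residual contributions and gradients.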
Residual Connections
Residual connections (He et al., ResNet, 2015) are among the most important architectural innovations in deep learning. Without them, training networks deeper than roughly 20 layers was impractical — gradients vanished or exploded over the long backward pass. Residual connections solve this by creating direct gradient highways from any layer back to any earlier layer.
Each sublayer learns to compute a delta — a correction to be added to the existing residual stream — rather than a complete transformation. This makes the learning problem easier: a sublayer that learns to output zero leaves the stream unchanged, which is the identity function. At initialisation, weights can be set close to zero so that each block starts as approximately the identity, and training progressively specialises each block's delta.
For a transformer with 96 layers, there are 192 residual connections (two per block). The gradient can travel from the output layer to the first layer through any combination of these connections, following the identity path for any block that has not yet learned a useful transformation. This is why transformers can be trained at extreme depths.
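The gradient argument can be checked numerically. For a linear sublayer F(x) = xW, the Jacobian of y = F(x) + x is W.T + I, so even with near-zero weights the identity term keeps the backward path open, and the product of many such Jacobians stays well-conditioned (toy dimensions below):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 16, 96
W = rng.normal(0, 1e-3, (d, d))   # sublayer weights near zero at init

# For y = F(x) + x with F(x) = x @ W, the Jacobian dy/dx is W.T + I.
# Even when F contributes almost nothing, the identity term remains.
J = W.T + np.eye(d)

# Backpropagating through 96 such blocks multiplies 96 Jacobians;
# the product stays close to the identity instead of vanishing.
grad = np.eye(d)
for _ in range(depth):
    grad = grad @ J
print(np.linalg.norm(J - np.eye(d)), np.linalg.norm(grad - np.eye(d)))
```

Without the `+ np.eye(d)` term the same product would be a chain of near-zero matrices and the gradient would vanish immediately.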
Feed-Forward Network (FFN)
The FFN sublayer is a two-layer MLP applied independently to each token position — it has no cross-position interaction (unlike attention). Despite this locality, the FFN accounts for roughly two-thirds of total transformer compute and a disproportionate fraction of the model's factual knowledge storage.
Research on mechanistic interpretability (Geva et al., 2021) has shown that FFN layers function as key-value memories. Each row of W1 is a "key" that fires for a particular input pattern; the corresponding column of W2 is a "value" that gets added to the residual stream. Factual associations like "the capital of France is Paris" are stored in these weight matrices, and targeted edits to FFN weights can alter specific factual beliefs (this is the basis for knowledge-editing techniques like ROME).
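The key-value reading is just linear algebra: with h = relu(W1 x), the output W2 h is a weighted sum of the columns of W2, weighted by how strongly each row of W1 fires on the input. A sketch with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 32, 128
W1 = rng.normal(size=(d_ff, d_model))   # rows of W1 act as "keys"
W2 = rng.normal(size=(d_model, d_ff))   # columns of W2 act as "values"
x = rng.normal(size=d_model)

h = np.maximum(W1 @ x, 0.0)             # each key fires (or not) on x
out = W2 @ h                            # standard FFN output

# Equivalent key-value memory view: sum over memories of
# (key activation) * (value vector added to the residual stream).
kv_view = sum(h[i] * W2[:, i] for i in range(d_ff))
print(np.allclose(out, kv_view))  # True
```

Editing a single "value" column therefore changes what gets written to the residual stream whenever its key fires, which is the intuition behind weight-surgery methods.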
Activation Functions
The choice of activation function in the FFN expansion has evolved significantly from the original transformer.
| Activation | Formula | Properties | Used By |
|---|---|---|---|
| ReLU | max(0, x) | Simple, sparse; dying ReLU problem; gradient 0 for x<0 | Original Transformer (2017) |
| GELU | x · Φ(x) | Smooth ReLU approximation; non-zero gradient everywhere; good empirical performance | BERT, GPT-2, GPT-3 |
| SwiGLU | Swish(W₁x) ⊙ (W₃x) | Gated linear unit; uses 3 weight matrices; 2/3 expansion to match FLOPs; strong empirical results | PaLM, Llama 2/3, Mistral, Gemma |
SwiGLU (Shazeer, 2020) uses a gated linear unit architecture: the input is projected to two streams, one through a Swish (smooth ReLU variant) activation and one through a linear projection, then multiplied elementwise. This gating enables the FFN to selectively amplify or suppress features. Because SwiGLU requires three weight matrices (W1, W2, W3) instead of two, the expansion dimension is typically reduced to 2/3 × 4 × dmodel to maintain the same total parameter count and FLOPs. Empirically, SwiGLU consistently outperforms GELU and ReLU at equivalent compute.
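A sketch of the SwiGLU computation and the parameter-count bookkeeping in NumPy (dmodel = 512 and the initialisation scale are illustrative assumptions):

```python
import numpy as np

def swish(x):
    return x / (1 + np.exp(-x))   # x * sigmoid(x), a.k.a. SiLU

d = 512
d_ff_gelu = 4 * d                     # standard 4x expansion, 2 matrices
d_ff_swiglu = int(2 / 3 * 4 * d)      # 2/3 of 4x, to pay for a 3rd matrix

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d, d_ff_swiglu))  # gate projection (activated)
W3 = rng.normal(0, 0.02, (d, d_ff_swiglu))  # linear projection
W2 = rng.normal(0, 0.02, (d_ff_swiglu, d))  # down-projection

def swiglu_ffn(x):
    # Elementwise gating: the swish stream scales the linear stream.
    return (swish(x @ W1) * (x @ W3)) @ W2

gelu_params = 2 * d * d_ff_gelu
swiglu_params = W1.size + W2.size + W3.size
print(gelu_params, swiglu_params)   # roughly equal by construction
```

The 2/3 rescaling makes the comparison between activations fair: three matrices at 2/3 the width cost the same FLOPs and parameters as two matrices at full width.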
Layer Normalisation and RMSNorm
Layer normalisation (Ba et al., 2016) normalises the activations at each position independently across the feature dimension (dmodel). This is in contrast to batch normalisation, which normalises across the batch dimension and is unsuitable for autoregressive language modelling (where batch size is often 1 at inference).
RMSNorm (Zhang and Sennrich, 2019) is a simplified variant that removes the mean-centering operation, computing only the root mean square normalisation. It has 20–40% lower compute cost than full LayerNorm and empirically matches its performance. LLaMA, Mistral, Gemma, and most modern open LLMs use RMSNorm over LayerNorm — a practical optimisation that compounds across the hundreds of normalisation operations in a deep network.
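The difference between the two is exactly the mean-centering step, which means they coincide on zero-mean input. A sketch omitting the learned gain and bias parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Subtract the mean, then divide by the standard deviation.
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    # No mean subtraction: only divide by the root mean square.
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
x_centred = x - x.mean(-1, keepdims=True)

# On zero-mean input the two are identical; in general they differ.
print(np.allclose(layer_norm(x_centred), rms_norm(x_centred)))  # True
```

Dropping the mean computation removes one reduction per normalisation, which is why the saving compounds across the hundreds of norm operations in a deep network.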
Depth vs Width Tradeoffs
Modern LLMs are designed with specific depth (number of layers) and width (dmodel and dff) tradeoffs that affect their capabilities, training efficiency, and inference cost.
More depth (layers):

- Each layer adds a computation step, enabling multi-step reasoning chains
- Deeper models tend to learn more abstract, compositional representations
- Increases pipeline-parallelism opportunities during training and inference
- Adds serialised compute per token at inference (compute cannot be parallelised across layers)

More width (dmodel and dff):

- Larger dmodel means more capacity per computation step
- A wider FFN stores more factual associations per layer
- Enables more attention heads and richer per-layer representations
- Increases memory-bandwidth requirements (larger weight matrices to load)
Modern frontier LLMs typically use 32–96 layers with dmodel of 4096–12288. Llama 3 70B uses 80 layers with dmodel = 8192. GPT-4 (speculated architecture) is reported to use a mixture-of-experts structure rather than scaling depth and width uniformly. The Chinchilla scaling laws (Hoffmann et al., 2022) provide guidance on compute-optimal depth/width ratios given a training FLOP budget, but inference cost — which scales with depth — often motivates shallower, wider models for deployment.
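A back-of-the-envelope parameter count illustrates how depth and width combine. The sketch below plugs in a Llama-3-70B-like shape (80 layers, dmodel 8192, dff 28672, 64 query heads with 8 KV heads under grouped-query attention, ~128k vocabulary, untied embeddings assumed); norms and biases are ignored, so the total is approximate:

```python
def approx_params(layers, d_model, d_ff, n_heads, n_kv_heads, vocab):
    head_dim = d_model // n_heads
    # Attention: Q and O projections are d x d; K and V are reduced
    # under grouped-query attention to n_kv_heads * head_dim columns.
    attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
    # SwiGLU FFN: three matrices between d_model and d_ff.
    ffn = 3 * d_model * d_ff
    # Input embedding plus output head (untied).
    embed = 2 * vocab * d_model
    return layers * (attn + ffn) + embed

total = approx_params(80, 8192, 28672, 64, 8, 128_256)
print(f"{total / 1e9:.1f}B")  # 70.6B
```

Even in this rough accounting, the FFN term dominates the per-layer cost, which is consistent with the compute breakdown discussed above.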
Checklist: Do You Understand This?
- Can you write the Pre-LN transformer block equations for both the attention sublayer and the FFN sublayer, showing where the LayerNorm and residual add occur?
- Can you explain why Pre-LN is more training-stable than Post-LN, and describe what goes wrong with Post-LN at depth?
- Can you derive the gradient through a residual block (y = F(x) + x) and explain why the identity term I prevents vanishing gradients even when F(x) contributes nothing?
- Can you describe what the FFN sublayer does to each token position — including the expansion ratio, why dff = 4 × dmodel is the standard, and what "factual associations" means in this context?
- Can you explain what SwiGLU is and why it uses three weight matrices instead of two, requiring the expansion ratio to be reduced?
- Can you explain what RMSNorm is, how it differs from LayerNorm, and why it is preferred in modern LLMs?
- Can you describe one tradeoff between adding more layers versus making each layer wider in terms of model capability and inference cost?