The Full Transformer Block
The transformer block is the repeating unit of every modern large language model. A GPT-4-class model stacks 96 or more of these blocks, each adding depth, expressiveness, and the capacity to represent increasingly abstract features. Understanding the internal structure of a single block — and why each design decision was made — is prerequisite knowledge for understanding training dynamics, inference cost, and the engineering tradeoffs in production models.
Block Structure
A modern transformer block follows the Pre-LayerNorm pattern and contains exactly two sublayers, a multi-head self-attention sublayer and a position-wise feedforward sublayer, each wrapped in a residual connection:

1. LayerNorm: normalise the residual stream before attention, stabilising the activations entering the attention computation.
2. Self-attention: queries, keys, and values are projected per head, attention weights are computed, and the output is projected back to d_model.
3. Residual add: add the attention output to the original input, x = x + Attn(LN(x)), giving a gradient highway through all blocks.
4. LayerNorm: normalise again before the feedforward sublayer.
5. FFN: a two-layer MLP applied position-wise: expand to 4×d_model, activate, project back. This is where factual associations are stored.
6. Residual add: add the FFN output, x = x + FFN(LN(x)). The block output enters the next block's residual stream.
Written as equations, where x is the residual stream at the block's input:

x = x + Attn(LayerNorm(x))
x = x + FFN(LayerNorm(x))
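The two residual updates, x = x + Attn(LN(x)) and x = x + FFN(LN(x)), can be made concrete with a minimal forward pass. The sketch below uses toy dimensions, simplified single-head attention without a causal mask, and an arbitrary small initialisation scale; all of these are illustrative assumptions, not any production model's values:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position independently across the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head self-attention (multi-head and causal mask omitted).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return (weights @ v) @ Wo

def ffn(x, W1, W2):
    # Position-wise two-layer MLP; ReLU here, GELU/SwiGLU in practice.
    return np.maximum(x @ W1, 0.0) @ W2

def pre_ln_block(x, params):
    Wq, Wk, Wv, Wo, W1, W2 = params
    x = x + attention(layer_norm(x), Wq, Wk, Wv, Wo)  # x = x + Attn(LN(x))
    x = x + ffn(layer_norm(x), W1, W2)                # x = x + FFN(LN(x))
    return x

rng = np.random.default_rng(0)
d, seq = 64, 8
params = [rng.normal(0, 0.02, s) for s in
          [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]]
x = rng.normal(size=(seq, d))
y = pre_ln_block(x, params)
print(y.shape)  # (8, 64)
```

Note that the block preserves the shape (seq, d_model), which is what lets dozens of identical blocks be stacked.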
Pre-LayerNorm vs Post-LayerNorm
The placement of layer normalisation relative to the residual connection is a subtle but important training stability choice. The original transformer used Post-LN (normalise after the residual sum). All major modern LLMs use Pre-LN (normalise before the sublayer computation).
Post-LN: x = LayerNorm(x + Sublayer(x))

The residual sum is pushed through LayerNorm at every layer, so the main path is repeatedly renormalised and gradients must pass back through every normalisation. At initialisation this leaves gradient magnitudes near the output layers disproportionately large, which causes training instability for very deep networks: without careful learning-rate warmup, large learning rates or deep models often diverge early in training.
Pre-LN: x = x + Sublayer(LayerNorm(x))

The sublayer receives normalised input, but the residual stream itself is never forced through a normalisation operation: it can grow freely, and the identity path carries gradients unimpeded. Pre-LN reduces or eliminates the need for learning-rate warmup, converges more reliably at depth, and is the universal choice in GPT-3, LLaMA, Mistral, PaLM, and their successors (with a single final LayerNorm applied to the stream before the output head).
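The stability difference is visible even in a toy simulation with linear sublayers (the dimensions, depth, and weight scale below are arbitrary illustrative choices): the Post-LN stream is renormalised at every layer, while the Pre-LN stream grows freely with depth and the identity path is preserved.

```python
import numpy as np

def ln(x, eps=1e-5):
    # LayerNorm over the feature dimension of a single vector.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
d, depth = 256, 48
Ws = [rng.normal(0, 1 / np.sqrt(d), (d, d)) for _ in range(depth)]

post = pre = rng.normal(size=d)
for W in Ws:
    post = ln(post + post @ W)   # Post-LN: stream renormalised every layer
    pre = pre + ln(pre) @ W      # Pre-LN: stream grows freely

# Post-LN stream norm is pinned to ~sqrt(d); Pre-LN stream norm grows.
print(round(np.linalg.norm(post), 1), round(np.linalg.norm(pre), 1))
```

The growing Pre-LN stream is harmless because each sublayer only ever sees the normalised copy; the raw stream exists to carry residual contributions and gradients.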
Residual Connections
Residual connections (He et al., ResNet, 2015) are among the most important architectural innovations in deep learning. Without them, training networks deeper than roughly 20 layers was impractical — gradients vanished or exploded over the long backward pass. Residual connections solve this by creating direct gradient highways from any layer back to any earlier layer.
Each sublayer learns to compute a delta — a correction to be added to the existing residual stream — rather than a complete transformation. This makes the learning problem easier: a sublayer that learns to output zero leaves the stream unchanged, which is the identity function. At initialisation, weights can be set close to zero so that each block starts as approximately the identity, and training progressively specialises each block's delta.
For a transformer with 96 layers, there are 192 residual connections (two per block). The gradient can travel from the output layer to the first layer through any combination of these connections, following the identity path for any block that has not yet learned a useful transformation. This is why transformers can be trained at extreme depths.
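The gradient argument can be checked numerically. For a linear sublayer F(x) = xW, the Jacobian of y = F(x) + x is W.T + I, so even with near-zero weights the identity term keeps the backward path open, and the product of many such Jacobians stays well-conditioned (toy dimensions below):

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 16, 96
W = rng.normal(0, 1e-3, (d, d))   # sublayer weights near zero at init

# For y = F(x) + x with F(x) = x @ W, the Jacobian dy/dx is W.T + I.
# Even when F contributes almost nothing, the identity term remains.
J = W.T + np.eye(d)

# Backpropagating through 96 such blocks multiplies 96 Jacobians;
# the product stays close to the identity instead of vanishing.
grad = np.eye(d)
for _ in range(depth):
    grad = grad @ J
print(np.linalg.norm(J - np.eye(d)), np.linalg.norm(grad - np.eye(d)))
```

Without the `+ np.eye(d)` term the same product would be a chain of near-zero matrices and the gradient would vanish immediately.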
Feed-Forward Network (FFN)
The FFN sublayer is a two-layer MLP applied independently to each token position — it has no cross-position interaction (unlike attention). Despite this locality, the FFN accounts for roughly two-thirds of total transformer compute and a disproportionate fraction of the model's factual knowledge storage.
Research on mechanistic interpretability (Geva et al., 2021) has shown that FFN layers function as key-value memories. Each row of W1 is a "key" that fires for a particular input pattern; the corresponding column of W2 is a "value" that gets added to the residual stream. Factual associations like "the capital of France is Paris" are stored in these weight matrices, and targeted edits to FFN weights can alter specific factual beliefs (this is the basis for knowledge-editing techniques like ROME).
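The key-value reading is just linear algebra: with h = relu(W1 x), the output W2 h is a weighted sum of the columns of W2, weighted by how strongly each row of W1 fires on the input. A sketch with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 32, 128
W1 = rng.normal(size=(d_ff, d_model))   # rows of W1 act as "keys"
W2 = rng.normal(size=(d_model, d_ff))   # columns of W2 act as "values"
x = rng.normal(size=d_model)

h = np.maximum(W1 @ x, 0.0)             # each key fires (or not) on x
out = W2 @ h                            # standard FFN output

# Equivalent key-value memory view: sum over memories of
# (key activation) * (value vector added to the residual stream).
kv_view = sum(h[i] * W2[:, i] for i in range(d_ff))
print(np.allclose(out, kv_view))  # True
```

Editing a single "value" column therefore changes what gets written to the residual stream whenever its key fires, which is the intuition behind weight-surgery methods.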
Activation Functions
The choice of activation function in the FFN expansion has evolved significantly from the original transformer.
| Activation | Formula | Properties | Used By |
|---|---|---|---|
| ReLU | max(0, x) | Simple, sparse; dying ReLU problem; gradient 0 for x<0 | Original Transformer (2017) |
| GELU | x · Φ(x) | Smooth ReLU approximation; non-zero gradient everywhere; good empirical performance | BERT, GPT-2, GPT-3 |
| SwiGLU | Swish(W₁x) ⊙ (W₃x) | Gated linear unit; uses 3 weight matrices; 2/3 expansion to match FLOPs; strong empirical results | PaLM, Llama 2/3, Mistral, Gemma |
SwiGLU (Shazeer, 2020) uses a gated linear unit architecture: the input is projected to two streams, one through a Swish (smooth ReLU variant) activation and one through a linear projection, then multiplied elementwise. This gating enables the FFN to selectively amplify or suppress features. Because SwiGLU requires three weight matrices (W1, W2, W3) instead of two, the expansion dimension is typically reduced to 2/3 × 4 × dmodel to maintain the same total parameter count and FLOPs. Empirically, SwiGLU consistently outperforms GELU and ReLU at equivalent compute.
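A sketch of the SwiGLU computation and the parameter-count bookkeeping in NumPy (dmodel = 512 and the initialisation scale are illustrative assumptions):

```python
import numpy as np

def swish(x):
    return x / (1 + np.exp(-x))   # x * sigmoid(x), a.k.a. SiLU

d = 512
d_ff_gelu = 4 * d                     # standard 4x expansion, 2 matrices
d_ff_swiglu = int(2 / 3 * 4 * d)      # 2/3 of 4x, to pay for a 3rd matrix

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.02, (d, d_ff_swiglu))  # gate projection (activated)
W3 = rng.normal(0, 0.02, (d, d_ff_swiglu))  # linear projection
W2 = rng.normal(0, 0.02, (d_ff_swiglu, d))  # down-projection

def swiglu_ffn(x):
    # Elementwise gating: the swish stream scales the linear stream.
    return (swish(x @ W1) * (x @ W3)) @ W2

gelu_params = 2 * d * d_ff_gelu
swiglu_params = W1.size + W2.size + W3.size
print(gelu_params, swiglu_params)   # roughly equal by construction
```

The 2/3 rescaling makes the comparison between activations fair: three matrices at 2/3 the width cost the same FLOPs and parameters as two matrices at full width.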
Layer Normalisation and RMSNorm
Layer normalisation (Ba et al., 2016) normalises the activations at each position independently across the feature dimension (dmodel). This is in contrast to batch normalisation, which normalises across the batch dimension and is unsuitable for autoregressive language modelling (where batch size is often 1 at inference).
RMSNorm (Zhang and Sennrich, 2019) is a simplified variant that removes the mean-centering operation, computing only the root mean square normalisation. It has 20–40% lower compute cost than full LayerNorm and empirically matches its performance. LLaMA, Mistral, Gemma, and most modern open LLMs use RMSNorm over LayerNorm — a practical optimisation that compounds across the hundreds of normalisation operations in a deep network.
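The difference between the two is exactly the mean-centering step, which means they coincide on zero-mean input. A sketch omitting the learned gain and bias parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Subtract the mean, then divide by the standard deviation.
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-5):
    # No mean subtraction: only divide by the root mean square.
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
x_centred = x - x.mean(-1, keepdims=True)

# On zero-mean input the two are identical; in general they differ.
print(np.allclose(layer_norm(x_centred), rms_norm(x_centred)))  # True
```

Dropping the mean computation removes one reduction per normalisation, which is why the saving compounds across the hundreds of norm operations in a deep network.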
Depth vs Width Tradeoffs
Modern LLMs are designed with specific depth (number of layers) and width (dmodel and dff) tradeoffs that affect their capabilities, training efficiency, and inference cost.
More depth (layers):

- Each layer adds a computation step, enabling multi-step reasoning chains
- Deeper models tend to learn more abstract, compositional representations
- Increases pipeline-parallelism opportunities during training and inference
- Adds serialised compute per token at inference (compute cannot be parallelised across layers)

More width (dmodel and dff):

- Larger dmodel means more capacity per computation step
- A wider FFN stores more factual associations per layer
- Enables more attention heads and richer per-layer representations
- Increases memory-bandwidth requirements (larger weight matrices to load)
Modern frontier LLMs typically use 32–96 layers with dmodel of 4096–12288. Llama 3 70B uses 80 layers with dmodel = 8192. GPT-4 (speculated architecture) is reported to use a mixture-of-experts structure rather than scaling depth and width uniformly. The Chinchilla scaling laws (Hoffmann et al., 2022) provide guidance on compute-optimal depth/width ratios given a training FLOP budget, but inference cost — which scales with depth — often motivates shallower, wider models for deployment.
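A back-of-the-envelope parameter count illustrates how depth and width combine. The sketch below plugs in a Llama-3-70B-like shape (80 layers, dmodel 8192, dff 28672, 64 query heads with 8 KV heads under grouped-query attention, ~128k vocabulary, untied embeddings assumed); norms and biases are ignored, so the total is approximate:

```python
def approx_params(layers, d_model, d_ff, n_heads, n_kv_heads, vocab):
    head_dim = d_model // n_heads
    # Attention: Q and O projections are d x d; K and V are reduced
    # under grouped-query attention to n_kv_heads * head_dim columns.
    attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
    # SwiGLU FFN: three matrices between d_model and d_ff.
    ffn = 3 * d_model * d_ff
    # Input embedding plus output head (untied).
    embed = 2 * vocab * d_model
    return layers * (attn + ffn) + embed

total = approx_params(80, 8192, 28672, 64, 8, 128_256)
print(f"{total / 1e9:.1f}B")  # 70.6B
```

Even in this rough accounting, the FFN term dominates the per-layer cost, which is consistent with the compute breakdown discussed above.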
Checklist: Do You Understand This?
- Can you write the Pre-LN transformer block equations for both the attention sublayer and the FFN sublayer, showing where the LayerNorm and residual add occur?
- Can you explain why Pre-LN is more training-stable than Post-LN, and describe what goes wrong with Post-LN at depth?
- Can you derive the gradient through a residual block (y = F(x) + x) and explain why the identity term I prevents vanishing gradients even when F(x) contributes nothing?
- Can you describe what the FFN sublayer does to each token position — including the expansion ratio, why dff = 4 × dmodel is the standard, and what "factual associations" means in this context?
- Can you explain what SwiGLU is and why it uses three weight matrices instead of two, requiring the expansion ratio to be reduced?
- Can you explain what RMSNorm is, how it differs from LayerNorm, and why it is preferred in modern LLMs?
- Can you describe one tradeoff between adding more layers versus making each layer wider in terms of model capability and inference cost?