Pre-training Objectives
A language model cannot be supervised by human labels when training on trillions of tokens; no labelling workforce could annotate data at that scale. Instead, the model learns from a self-supervised signal: the correct answer is derived automatically from the training text itself. The specific form of that signal is the pre-training objective, and it is the single most consequential choice in determining what kind of model you get. Generative vs. representational capability, encoder vs. decoder, zero-shot instruction following vs. retrieval-oriented embeddings: all flow from the choice of objective.
Causal Language Modeling (CLM)
Causal Language Modeling, also called autoregressive language modeling, is the dominant pretraining objective for modern LLMs. The task is simple: given all tokens to the left, predict the next token.
During training, the model processes the entire sequence in a single forward pass thanks to causal (lower-triangular) masking in the attention layers. Position i computes its loss using only positions 1 through i−1. The loss is the average cross-entropy over every position in every sequence:
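Written out, with x_1 … x_N the token sequence and θ the model parameters (a standard formulation, reconstructed here):

```latex
\mathcal{L}_{\mathrm{CLM}} = -\frac{1}{N}\sum_{i=1}^{N} \log p_{\theta}\left(x_i \mid x_1, \ldots, x_{i-1}\right)
```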
The key property is that every token in every sequence contributes a gradient signal; there are no "ignored" positions. A 4096-token sequence yields 4096 prediction tasks in a single forward pass, which makes CLM exceptionally efficient in training signal extracted per token, despite the simplicity of the objective.
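A minimal numpy sketch of this construction (the array names are illustrative, not from any particular codebase):

```python
import numpy as np

# Toy sequence of token ids; a real batch would use length ~4096.
tokens = np.array([5, 9, 2, 7, 3])

# CLM inputs and targets are the same sequence shifted by one:
# the context tokens[:i+1] must predict tokens[i+1].
inputs, targets = tokens[:-1], tokens[1:]

# The causal (lower-triangular) mask lets position i attend only to
# positions 0..i, so all predictions happen in one forward pass.
causal_mask = np.tril(np.ones((len(inputs), len(inputs)), dtype=bool))

print(list(targets))        # one next-token target per context position
print(causal_mask.astype(int))
```

Note that the shift means a sequence of n tokens gives n−1 targets; with a beginning-of-sequence token prepended, every original token gets a prediction.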
CLM produces decoder-only architectures naturally: a transformer with causal masking generates text autoregressively by sampling the next token, appending it to the context, and repeating. GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Gemma, DeepSeek, Grok: every major frontier model as of 2025 uses CLM pretraining.
Masked Language Modeling (MLM)
Masked Language Modeling, introduced by BERT (Devlin et al., 2018), presents the model with a sequence where some tokens have been replaced by a special [MASK] token. The objective is to predict the original token at each masked position. Unlike CLM, the model can use context from both directions, left and right, since there is no causal constraint on attention.
BERT's implementation masks 15% of tokens, with a deliberate mix to avoid the model learning that [MASK] always means "predict here" (which would not generalise to inference, where [MASK] is absent): 80% of selected tokens are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. The loss is computed only over the selected positions, so the gradient signal covers just 15% of tokens per batch, a meaningful efficiency gap compared to CLM.
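The 80/10/10 split can be sketched as follows (a simplified token-level version; `VOCAB_SIZE` and the literal string `[MASK]` stand in for real tokenizer ids):

```python
import random

MASK, VOCAB_SIZE = "[MASK]", 1000  # placeholders for real special-token ids

def bert_mask(tokens, rng, mask_prob=0.15):
    """Return (corrupted tokens, loss positions) using BERT's 80/10/10 split."""
    corrupted, loss_positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:          # select ~15% of positions
            loss_positions.append(i)          # loss is computed only here
            roll = rng.random()
            if roll < 0.8:                    # 80% of selected: [MASK]
                corrupted[i] = MASK
            elif roll < 0.9:                  # 10% of selected: random token
                corrupted[i] = rng.randrange(VOCAB_SIZE)
            # remaining 10% of selected: token left unchanged
    return corrupted, loss_positions

corrupted, positions = bert_mask(list(range(1000)), random.Random(0))
print(len(positions))  # roughly 15% of 1000 positions carry a loss
```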
MLM Strengths
- Bidirectional context produces richer representations of each token
- Better intrinsic sentence embeddings for classification, NLI, semantic similarity
- BERT-style encoders remain state-of-the-art for many discriminative NLP tasks
- Lower (pseudo-)perplexity on in-domain text relative to same-size CLM models, though the two measures are not strictly comparable
MLM Limitations
- Not natively generative โ cannot produce open-ended text
- Only 15% of tokens contribute gradient per step (low signal efficiency)
- The [MASK] token creates a train/inference mismatch (masks don't appear at test time)
- Scaling to very large models yields diminishing returns vs. CLM
Next Sentence Prediction (NSP)
BERT also trained with a secondary objective: given two sentences A and B, predict whether B is the real next sentence following A in the corpus (50% of the time) or a randomly sampled unrelated sentence (50%). The intuition was that this would teach cross-sentence coherence needed for tasks like question answering and textual inference.
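A sketch of how NSP training pairs are assembled (hypothetical helper; a real pipeline operates on tokenized sentence streams):

```python
import random

def make_nsp_example(corpus, doc_idx, sent_idx, rng):
    """Build one NSP pair: (sentence_a, sentence_b, is_next label)."""
    sentence_a = corpus[doc_idx][sent_idx]
    if rng.random() < 0.5:
        # Positive (50%): B is the sentence that really follows A.
        return sentence_a, corpus[doc_idx][sent_idx + 1], True
    # Negative (50%): B is sampled from a different, random document.
    other = rng.choice([d for i, d in enumerate(corpus) if i != doc_idx])
    return sentence_a, rng.choice(other), False

corpus = [["A1.", "A2.", "A3."], ["B1.", "B2."]]
sent_a, sent_b, is_next = make_nsp_example(corpus, 0, 0, random.Random(1))
```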
Subsequent research was unkind to NSP. RoBERTa (Liu et al., 2019) demonstrated that training without NSP and with longer sequences produced equal or better downstream performance across nearly all tasks. The hypothesis is that NSP is too easy โ distinguishing unrelated sentences is a low-signal task that takes compute away from the main MLM objective. DistilBERT, RoBERTa, ALBERT (with sentence-order prediction, a harder variant), and all modern encoder models have abandoned or replaced NSP.
Span Corruption (T5 Approach)
T5 (Raffel et al., 2020) introduced span corruption as a bridge between MLM and CLM. Instead of masking individual tokens at random, contiguous spans of text are replaced by a single sentinel token (e.g. <extra_id_0>, <extra_id_1>). The model must reconstruct all dropped spans in the decoder output.
T5's default corrupts 15% of tokens in spans averaging 3 tokens each. The benefits are twofold: the encoder develops bidirectional representations (seeing full context around the masked span), and the decoder trains autoregressively on the reconstruction task, so the model retains generative capability. This is an encoder-decoder architecture: the encoder processes the corrupted input; the decoder generates the removed spans.
Span corruption is substantially more training-signal-efficient than standard MLM per token of input, because reconstructing a span requires predicting multiple tokens from a single masked region: the decoder receives gradient over every token it generates, not just 15% of the input. Flan-T5, mT5, and Google's internal research models extensively use span corruption.
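The input/target construction can be illustrated directly (a simplified sketch with fixed, hand-picked spans rather than the random sampling used in practice):

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel; the target lists the
    dropped spans, each introduced by its sentinel (T5-style formatting)."""
    inp, tgt, cursor = [], [], 0
    for sid, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sid}>"
        inp.extend(tokens[cursor:start])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.extend(tokens[start:end])
        cursor = end
    inp.extend(tokens[cursor:])
    tgt.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inp, tgt

tokens = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = corrupt_spans(tokens, [(1, 3), (6, 7)])
# inp: the <extra_id_0> fox jumps over <extra_id_1> lazy dog
# tgt: <extra_id_0> quick brown <extra_id_1> the <extra_id_2>
```

The decoder trains on every token of `tgt`, which is where the per-token efficiency gain over standard MLM comes from.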
Prefix Language Modeling
Prefix LM (explored in UL2, Tay et al., 2022; also in GLM and PaLM variants) is a hybrid objective. A prefix portion of the sequence is processed with bidirectional attention (like an encoder), and the suffix is generated autoregressively (like a decoder). The split point is sampled randomly during training.
Prefix LM โ Key Property
The prefix (e.g. a question or instruction) sees full bidirectional context, producing a richer representation. The completion is generated autoregressively, so the model is still generative. This is directly analogous to how instruction-following works at inference: a prompt (prefix) is provided and the model generates a response (suffix).
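The corresponding attention mask is easy to write down (numpy sketch; `prefix_len` would be sampled randomly during training):

```python
import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """mask[i, j] is True when position i may attend to position j:
    bidirectional within the prefix, causal everywhere else."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # prefix attends bidirectionally
    return mask

m = prefix_lm_mask(seq_len=5, prefix_len=3)
print(m.astype(int))
```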
UL2 Mixture
Google's UL2 framework proposed training with a mixture of objectives: short-span masking (BERT-like), long-span masking (aggressive denoising), and causal LM, randomly switching between them per batch. Flan-UL2 uses UL2 pretraining. The claim is that objective diversity improves generalisation across tasks more than any single objective alone.
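A toy sketch of per-batch objective switching (the denoiser names follow UL2's R/X/S terminology, but the rates and span lengths below are placeholders, not the paper's values):

```python
import random

# Placeholder configs loosely shaped like UL2's mixture-of-denoisers.
DENOISERS = {
    "R": {"mean_span": 3, "corrupt_rate": 0.15},   # regular short-span masking
    "X": {"mean_span": 32, "corrupt_rate": 0.50},  # extreme long-span denoising
    "S": {"causal": True},                         # sequential (causal LM)
}

def pick_denoiser(rng):
    """Sample one objective per batch, as in UL2's random switching."""
    return rng.choice(sorted(DENOISERS))

rng = random.Random(0)
schedule = [pick_denoiser(rng) for _ in range(8)]
print(schedule)
```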
Objective Comparison and Why CLM Won
The practical outcome of the 2018–2025 scaling era is clear: causal language modeling at scale dominates. Understanding why requires looking at three factors.
| Objective | Signal efficiency | Generative? | Bidirectional? | Best for |
|---|---|---|---|---|
| CLM | 100% of tokens | Yes | No (causal only) | Text generation, instruction following, agents |
| MLM | ~15% of tokens | No | Yes | Classification, NLI, sentence embeddings |
| Span corruption | High (decoder sees all spans) | Yes (encoder-decoder) | Encoder yes | Seq2seq tasks, summarisation, translation |
| Prefix LM | Varies | Yes | Prefix only | Instruction following, conditional generation |
| NSP | Very low (binary) | No | Yes | Largely abandoned |
CLM's dominance comes from three compounding advantages. First, full gradient signal: every token predicts the next, so no position is wasted. At a trillion tokens of pretraining, the difference between 15% and 100% token utilisation is enormous. Second, natural fit to generation: the model learns to produce coherent text by doing exactly the task it will be used for at inference time, with no train/inference mismatch. Third, emergent instruction following: as model size and token count scale, CLM-trained models spontaneously improve at following instructions embedded in their context (few-shot prompting), which then transfers to instruction fine-tuning. MLM-trained models lack this generative mechanism entirely.
The encoder-decoder line (T5, mT5, Flan-T5) remains competitive for structured output tasks and is widely used in Google's production systems, but GPT-style decoder-only CLM has become the de facto standard for frontier models because it unifies pretraining, instruction tuning, and generation under a single architecture and objective.
Checklist: Do You Understand This?
- In CLM pretraining, how many tokens contribute a gradient signal per forward pass, and why is this different from MLM?
- BERT masks 15% of tokens, but further splits that 15% into three categories. What are they, and why is the split used rather than always replacing with [MASK]?
- Span corruption in T5 replaces spans with sentinel tokens. Why does this produce a more efficient gradient signal than standard MLM?
- What is the "train/inference mismatch" problem with MLM, and why does it not affect CLM?
- If you wanted to build a model optimised for sentence-level semantic similarity search (embedding retrieval), which pre-training objective would you choose and why?
- NSP was a core BERT objective but was abandoned in RoBERTa. What did ablation studies show, and what is the current understanding of why NSP failed?
- Name two current production models that use CLM and one that uses span corruption. What architectural difference reflects the objective choice?