Chinchilla & Optimal Training
In 2022, Jordan Hoffmann and colleagues at DeepMind published a paper with a provocative finding: nearly every large language model trained between 2020 and 2022 was significantly undertrained. Not undertrained in the sense of needing more compute — they had enormous compute budgets — but undertrained relative to their size. The models were too large for the amount of data they saw. The paper introduced a revised optimality condition that reshaped how the entire industry thinks about training budgets.
The Core Finding
The Kaplan et al. (2020) scaling laws suggested that model size N should grow faster than dataset size D as compute scales. In practice, this meant building larger and larger models (GPT-3: 175B, Gopher: 280B, Megatron-Turing NLG: 530B) while keeping training token counts relatively modest.
Hoffmann et al. ran a systematic compute-controlled study: for a fixed FLOP budget C, they trained many models at different (N, D) combinations and measured which combination achieved the lowest loss. Their finding:
Compute-Optimal Condition (Chinchilla)
N_optimal ∝ C^0.5
D_optimal ∝ C^0.5
Model size and dataset size should scale equally with compute. As a rule of thumb: train on approximately 20 tokens per parameter. A 10B parameter model should see ~200B tokens; a 70B model should see ~1.4T tokens.
This is a direct contradiction of the Kaplan recommendation. Kaplan said N should grow faster; Chinchilla says N and D should grow at the same rate. The practical implication: for any given compute budget, the optimal model is much smaller and more heavily trained than what the field had been building.
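The rule of thumb can be sketched numerically. This uses the common back-of-envelope approximation C ≈ 6ND for training FLOPs (an approximation I am bringing in here; it appears in the scaling-law literature but is not stated above), combined with the ~20 tokens/parameter rule:

```python
def chinchilla_allocation(flops: float, tokens_per_param: float = 20.0):
    """Split a training FLOP budget C into (params, tokens) under the
    ~20 tokens/parameter rule, using the C ~ 6*N*D approximation."""
    # C = 6*N*D and D = r*N  =>  C = 6*r*N^2  =>  N = sqrt(C / (6*r))
    n_params = (flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's approximate budget: 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs
n, d = chinchilla_allocation(5.88e23)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.2f}T tokens")  # 70B params, 1.40T tokens
```

Plugging in Chinchilla's own budget recovers the published 70B / 1.4T split, which is a useful consistency check on the 20:1 rule.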
Chinchilla vs. Gopher — The Proof
DeepMind demonstrated the finding concretely by training Chinchilla and comparing it against Gopher, their own previous flagship model, using the same compute budget:
| Model | Parameters | Training Tokens | Tokens / Param | Result |
|---|---|---|---|---|
| Gopher | 280B | 300B | ~1.1× | Undertrained — too few tokens for its size |
| Chinchilla | 70B | 1.4T | ~20× | Outperforms Gopher on most benchmarks |
Same compute budget, 4× fewer parameters, 4.6× more training data. Chinchilla won on language modeling loss and on most downstream evaluations including MMLU, BIG-bench, and reading comprehension tasks. The model was also far cheaper to run at inference time — 4× fewer parameters means roughly 4× faster inference and 4× lower memory requirements.
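The "same compute budget" claim can be sanity-checked with the standard ~6 FLOPs per parameter per token estimate (an approximation not stated in this section; under it the two budgets agree to within roughly 20%):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    # Back-of-envelope: ~6 FLOPs per parameter per training token
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)       # ~5.0e23 FLOPs
chinchilla = train_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs
print(f"ratio: {chinchilla / gopher:.2f}")
```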
The Compute-Optimal Frontier
The "compute-optimal frontier" is the curve in (N, D) space that achieves the lowest possible loss for each value of total compute C. With N on the horizontal axis and D on the vertical, points below the curve represent models trained with too few tokens for their size (undertrained). Points above represent models trained with too many tokens for their size — a regime that rarely occurs in practice because it requires prohibitively large datasets.
On the Frontier
Chinchilla (70B / 1.4T), LLaMA 2 (70B / 2T), Mistral 7B (7B / extended training) — these models traded a smaller N for a much larger D and achieved excellent quality-per-FLOP.
Below the Frontier (Undertrained)
GPT-3 (175B / 300B tokens ≈ 1.7 tokens/param), Gopher (280B / 300B tokens), PaLM (540B / 780B tokens ≈ 1.4 tokens/param). These models were too large for the data they saw.
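The undertrained labels above follow directly from the tokens-per-parameter ratios. A minimal sketch, treating the 20 tokens/param figure as a soft threshold (it is a rule of thumb, not a hard boundary from the paper):

```python
def tokens_per_param(n_params: float, n_tokens: float) -> float:
    return n_tokens / n_params

models = {
    "GPT-3":      (175e9, 300e9),
    "Gopher":     (280e9, 300e9),
    "PaLM":       (540e9, 780e9),
    "Chinchilla": (70e9, 1.4e12),
}
for name, (n, d) in models.items():
    ratio = tokens_per_param(n, d)
    label = "undertrained" if ratio < 20 else "at/beyond optimum"
    print(f"{name:10s} {ratio:5.1f} tokens/param -> {label}")
```

Every pre-Chinchilla flagship lands well below 20; Chinchilla itself sits right at the rule-of-thumb line.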
Why the Industry Initially Ignored Chinchilla
The Chinchilla paper was published in March 2022. Yet GPT-4 (released 2023) is rumored to be over 1T parameters, and many subsequent models continued to be built at massive scale. Why did the field not immediately converge on Chinchilla-optimal training?
Inference Cost Dominates at Scale
Chinchilla optimizes for training efficiency — the best quality per FLOP of training compute. But when you deploy a model to millions of users, inference cost dominates. A smaller model trained on more data is cheaper to serve. The Chinchilla condition is optimal for a one-shot training run, but a model used for inference billions of times benefits from being as small as possible, even if training it required more tokens.
Data Constraints Were Not Yet Binding
In 2022–2023, the internet corpus was still large enough that Chinchilla-scale token counts were readily achievable, so data scarcity did not yet force a rethink of the allocation. Meanwhile, the urgency of the inference-efficiency argument was not fully appreciated until inference at scale became the primary cost center for companies like OpenAI and Anthropic.
Larger Models Have Better Absolute Performance
Even if a 70B model trained optimally beats a 280B undertrained model per FLOP, the compute-optimal 280B model (trained on 5.6T tokens) would be even better. Organizations with access to very large compute budgets could exceed the Chinchilla allocation for N and still improve absolute performance — they just were not doing so efficiently.
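The arithmetic behind this point is worth making concrete. A sketch under the same C ≈ 6ND approximation used earlier — the ~19× figure is derived here, not taken from the paper:

```python
# Gopher as actually trained: 280B params, 300B tokens
actual = 6 * 280e9 * 300e9          # ≈ 5.0e23 FLOPs
# Compute-optimal at 280B: ~20 tokens/param = 5.6T tokens
optimal_280b = 6 * 280e9 * 5.6e12   # ≈ 9.4e24 FLOPs
print(f"Extra compute needed: {optimal_280b / actual:.0f}x")  # 19x
```

That is, making a 280B model compute-optimal would cost roughly 19× Gopher's actual training budget — affordable only to the largest labs, but strictly better in absolute performance.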
Overtrained Models — The LLaMA Strategy
The most significant practical development following the Chinchilla paper was the deliberate decision to overtrain small models — that is, to train models on far more tokens than the Chinchilla optimum for their parameter count. The goal is not training efficiency but inference efficiency.
| Model | Params | Training Tokens | Chinchilla Optimal | Ratio vs Optimal |
|---|---|---|---|---|
| LLaMA 1 7B | 7B | 1T | 140B | 7× overtrained |
| LLaMA 3 8B | 8B | 15T | 160B | ~94× overtrained |
| Mistral 7B | 7B | ~1T+ | 140B | 7×+ overtrained |
LLaMA 3 8B trained on 15 trillion tokens is not optimal from a training-compute perspective — the same loss could be reached with a larger model trained on fewer tokens using less total compute. But the resulting 8B model is fast to serve, fits on a consumer GPU, and substantially outperforms an 8B model trained only to its Chinchilla optimum of ~160B tokens. Mistral 7B similarly outperforms Llama 2 13B — a model nearly twice its size — because it was trained on more data.
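The "ratio vs optimal" column in the table above reduces to one line of arithmetic. A minimal sketch, again treating 20 tokens/param as the nominal optimum:

```python
def overtraining_ratio(n_params: float, n_tokens: float, rule: float = 20.0) -> float:
    """How far past the nominal Chinchilla optimum (~20 tokens/param) a run goes."""
    return n_tokens / (rule * n_params)

print(f"LLaMA 1 7B: {overtraining_ratio(7e9, 1e12):.1f}x")   # ≈ 7.1x
print(f"LLaMA 3 8B: {overtraining_ratio(8e9, 15e12):.1f}x")  # ≈ 93.8x
```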
This strategy is only viable when you have access to enough high-quality training data. The scarcity of such data (post-2024) is a meaningful constraint on how far this approach can be pushed.
Beyond Chinchilla — What Comes Next
The Chinchilla laws, like the Kaplan laws before them, describe an empirical optimum under specific conditions. They assume:
- A fixed, static training dataset (no data repetition or curation)
- Standard autoregressive training with cross-entropy loss
- No post-training (RLHF, instruction tuning, preference optimization)
- Inference cost is not a consideration
Real-world training violates all four assumptions. Post-training alignment (RLHF, DPO) often produces larger gains than additional pre-training tokens at the same compute cost. Synthetic data generation and data curation change the effective quality of each token. And inference scaling — using more compute at inference time via chain-of-thought, best-of-N sampling, or reasoning models — is emerging as a parallel dimension that the Chinchilla framework does not address at all.
Checklist: Do You Understand This?
- Can you explain what the Chinchilla paper found, and how it contradicts the Kaplan et al. recommendation?
- Can you state the compute-optimal condition: N_optimal ∝ C^0.5, D_optimal ∝ C^0.5, and the 20 tokens/parameter rule of thumb?
- Can you compare Chinchilla and Gopher in terms of parameters, training tokens, and outcome?
- Do you understand why inference cost drives labs to "overtrain" small models beyond the Chinchilla optimum?
- Can you explain why LLaMA 3 8B training on 15T tokens is a rational strategy despite being far from Chinchilla-optimal?
- Do you know the four key assumptions the Chinchilla framework makes, and which ones modern training violates?
- Can you explain why Mistral 7B outperforms Llama 2 13B despite having fewer parameters?