Scaling Laws — Compute, Data, Parameters
In 2020, researchers at OpenAI published a landmark empirical study showing that language model performance does not improve randomly as you add resources — it follows precise mathematical relationships called scaling laws. Understanding these laws explains why the AI field spent several years single-mindedly pursuing larger models, and why that strategy eventually had to be refined.
The Kaplan et al. Scaling Laws (2020)
The original paper by Jared Kaplan and colleagues at OpenAI found that language model loss — how well a model predicts the next token — follows smooth power laws with respect to three quantities:
- Model size (N): more parameters = lower loss, regardless of architecture details. The relationship is a straight line on a log-log plot.
- Dataset size (D): more training tokens = lower loss. But there are diminishing returns — doubling data does not halve the loss.
- Compute (C): total floating-point operations during training. Compute is the master budget — N and D are how you allocate it.
The core finding: loss follows a power law L ∝ X^(-α) for each of N, D, and C independently (when the others are not the bottleneck). The exponent α determines how quickly performance improves per unit increase. Crucially, Kaplan et al. found that model size N has a stronger effect than dataset size D — suggesting that for a fixed compute budget, it is better to train a larger model on fewer tokens than a smaller model on more tokens.
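The power-law form is easy to sketch in code. The constants below are the Kaplan et al. reported fit for model size (N_c ≈ 8.8×10¹³, α_N ≈ 0.076); treat them as illustrative rather than exact:

```python
# Sketch of the Kaplan-style power law for model size:
#   L(N) = (N_c / N) ** alpha_N
# Constants are the fitted values reported by Kaplan et al. (2020),
# used here purely for illustration.
N_C = 8.8e13      # fitted "critical" parameter count
ALPHA_N = 0.076   # power-law exponent for model size

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss when data and compute are not bottlenecks."""
    return (N_C / n_params) ** ALPHA_N

# A fixed multiplicative increase in N lowers predicted loss by a fixed ratio:
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(f"loss ratio per 2x params: {ratio:.3f}")  # 2**-0.076 ≈ 0.949
```

Note the key property of a power law: every doubling of N multiplies the predicted loss by the same constant factor, which is exactly what a straight line on a log-log plot means.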
This conclusion would later be challenged by the Chinchilla paper, but for 2020–2022 it drove the industry toward ever-larger models: GPT-3 (175B), PaLM (540B), and Gopher (280B) were all products of the "scale the model" interpretation of these laws.
FLOP Counting — The Training Budget
To apply scaling laws, you need to estimate the compute cost of training. The standard approximation used in the literature is:

C ≈ 6ND

where N is the parameter count and D is the number of training tokens. The factor of 6 comes from roughly 2 FLOPs per parameter per token in the forward pass and 4 in the backward pass. This is a rule of thumb, not an exact formula: it ignores attention costs (which grow with context length), embedding operations, and optimizer overhead. But it is accurate enough for order-of-magnitude comparisons and for applying scaling-law predictions.
For a concrete example: GPT-3 has N = 175 billion parameters and was trained on D ≈ 300 billion tokens. Plugging in: C ≈ 6 × 175×10⁹ × 300×10⁹ ≈ 3.15 × 10²³ FLOPs. This is consistent with the reported training compute.
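This arithmetic is easy to mechanize. A minimal helper, using the GPT-3 figures quoted above:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """C ≈ 6ND: ~2 FLOPs per parameter per token forward, ~4 backward."""
    return 6 * n_params * n_tokens

# GPT-3: N = 175B parameters, D ≈ 300B tokens
gpt3 = training_flops(175e9, 300e9)
print(f"GPT-3 training compute ≈ {gpt3:.2e} FLOPs")  # ≈ 3.15e+23
```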
The Three Limiting Regimes
Any training run operates in one of three regimes, depending on which resource is the binding constraint:
Compute-Limited
You have a fixed training budget (GPU hours / dollars). You cannot afford to train longer or scale the model further. This is the most common real-world situation — the question becomes: given C FLOPs, how should you split the budget between N and D?
Data-Limited
You have run out of high-quality training data. Adding more parameters or compute does not help because the model has already memorized the available data. This is increasingly relevant as the best web-scale datasets approach saturation. Synthetic data generation is the primary response.
Parameter-Limited
The model is too small to represent the patterns in your data. You could train for longer and loss would still be high because the model lacks capacity. The fix is to increase N, not D or C.
In practice, research labs operate in the compute-limited regime almost exclusively. The Chinchilla paper later showed that most labs were also implicitly operating in a data-limited regime — their models were too large for the data they were training on, so the extra parameters bought little — but that is covered in the next page.
Power Law Exponents — What the Slopes Mean
A power law on a log-log plot appears as a straight line. The slope of that line is the exponent α. For language model loss:
| Variable | Typical Exponent (α) | Interpretation |
|---|---|---|
| N (parameters) | ≈ 0.076 | 10× more params → loss drops by ~16% |
| D (tokens) | ≈ 0.095 | 10× more data → loss drops by ~20% |
| C (compute) | ≈ 0.050 | 10× more compute → loss drops by ~11% |
These exponents are small — this is the important lesson. Language model loss does not drop quickly. To halve the loss requires enormous increases in scale. This is why AI capabilities feel "incremental" on a per-generation basis despite orders-of-magnitude increases in compute. A steeper exponent (higher α) would mean faster improvement per unit resource; a shallower exponent means diminishing returns set in faster.
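To see concretely how small these exponents are, the per-decade improvement can be computed directly from the values in the table above:

```python
# Convert a power-law exponent into "loss drop per 10x resource increase".
# Exponents are the Kaplan-style fits quoted in the table above.
exponents = {"N (params)": 0.076, "D (tokens)": 0.095, "C (compute)": 0.050}

for name, alpha in exponents.items():
    remaining = 10 ** (-alpha)  # fraction of loss left after a 10x increase
    print(f"{name}: 10x -> loss x {remaining:.3f} ({1 - remaining:.0%} drop)")
```

A decade of extra compute buys only about an 11% loss reduction, which is why halving the loss requires many orders of magnitude more resources.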
Using Scaling Laws to Predict Performance
One of the most practically useful applications of scaling laws is performance extrapolation: run a series of small training experiments at different compute budgets, fit a power law to the results, and extrapolate to predict what a much larger model will achieve.
Extrapolation Workflow
- Train models at 10M, 100M, 1B parameters on the same data mix
- Record validation loss at each scale
- Fit a line to (log N, log L) — verify linearity holds
- Extrapolate the line to 10B or 100B parameters to predict expected loss
- Use that predicted loss to estimate downstream benchmark performance
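The fit-and-extrapolate step above can be sketched with NumPy. The pilot-run losses here are made-up numbers chosen to lie near a clean power law, purely for illustration:

```python
import numpy as np

# Hypothetical validation losses from three small pilot runs
# (illustrative numbers, not real measurements).
params = np.array([1e7, 1e8, 1e9])
losses = np.array([4.2, 3.5, 2.9])

# Fit a straight line in log-log space: log10(L) = slope * log10(N) + b.
slope, intercept = np.polyfit(np.log10(params), np.log10(losses), 1)

# Extrapolate to a 10B-parameter run -- valid only if the line does not bend.
pred_10b = 10 ** (slope * np.log10(1e10) + intercept)
print(f"slope = {slope:.3f}, predicted loss at 10B params = {pred_10b:.2f}")
```

Before trusting the extrapolation, check that the three pilot points actually sit on the fitted line; visible curvature at small scale means the power-law assumption already fails.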
This is how labs decide whether to commit the GPU budget for a full-scale run. If the extrapolated loss does not reach a useful threshold, the architecture or data mix needs to change before scaling up.
The key assumption is that the power law observed at small scale continues at large scale — that the line does not bend. This assumption has held surprisingly well for loss, but it breaks down for downstream task performance.
Limitations of Scaling Laws
Scaling laws are a powerful planning tool, but they come with serious limitations that are important to understand before relying on them:
They Predict Loss, Not Task Performance
Scaling laws predict cross-entropy loss on next-token prediction. Downstream benchmark performance (GSM8K, MMLU, HumanEval) can be non-monotonic with scale — a model can get worse on a specific task as it gets larger, before getting better again. Loss predicts the general capability envelope, not specific task scores.
Distribution Shift Between Training and Evaluation
Loss is measured on a held-out slice of the training distribution. If your evaluation benchmark uses a different distribution — domain-specific language, tasks that require reasoning not well-represented in training data — the scaling law prediction does not apply.
Architecture Matters (In the Constant, Not the Exponent)
Different architectures have different multiplicative constants in front of the power law. A Mixture-of-Experts model and a dense transformer may follow the same slope but start at different loss levels. Architecture improvements shift the curve down without changing the exponent.
Data Quality Is Not Captured
Scaling laws assume a fixed data distribution. Training on higher-quality data (filtered web text, curated books) gives better loss than raw web data at the same token count. The "D" in scaling laws measures quantity, not quality.
Why Scaling Laws Matter in Practice
Scaling laws transformed AI development from empirical trial-and-error into something closer to engineering. Before them, deciding whether to train a larger model required intuition and luck. After them, labs could make quantitative predictions and allocate compute rationally.
They also created a powerful narrative: if capabilities follow smooth power laws, then continued investment in compute will yield continued improvement. This narrative drove billions in GPU infrastructure investment from 2020 to 2024, and it largely held — GPT-4, Claude 3, and Gemini Ultra all represented clear step-changes in capability predicted by larger training runs.
The limits of the original Kaplan laws — particularly the claim that model size matters more than data — were exposed by the Chinchilla paper (2022), which revised the optimal allocation formula. That story is on the next page.
Checklist: Do You Understand This?
- Can you state what N, D, and C represent in the scaling laws framework?
- Can you use the C ≈ 6ND formula to estimate training compute given model size and token count?
- Do you understand what a power law exponent means — specifically, why a larger exponent means faster improvement?
- Can you explain the three limiting regimes (compute-, data-, parameter-limited) and give an example of each?
- Do you know why scaling laws predict loss well but predict downstream task performance poorly?
- Can you describe how a lab uses small-scale extrapolation experiments before committing to a large training run?
- Do you understand why the Kaplan conclusion ("bigger model beats more data") was later challenged?