
Emergent Abilities & Phase Transitions

Scaling laws predict that loss decreases smoothly and continuously as models get larger. But when you measure whether a model can do something — solve a math problem, follow a multi-step instruction, write working code — the improvement is often not smooth at all. Some capabilities appear to be essentially absent at small scale and then emerge sharply as models cross a threshold. This phenomenon, called emergent abilities, is one of the most discussed and most contested ideas in modern AI research.

Wei et al. (2022) — The Emergence Paper

Jason Wei and colleagues at Google Brain published "Emergent Abilities of Large Language Models" in 2022, documenting dozens of tasks where model performance was near random at small scale and then improved sharply, sometimes dramatically, as model size increased. The paper defined emergence as: an ability is emergent if it is not present in smaller models but is present in larger models.

Examples of Emergent Abilities

  • 3-digit arithmetic: Near-chance accuracy below ~10B parameters; sharp improvement above ~50B
  • Chain-of-thought reasoning: Step-by-step reasoning essentially non-existent below a certain scale, then suddenly effective
  • Multi-step instruction following: Small models fail; large models generalize across instruction phrasings
  • Word-in-context disambiguation: Identifying which sense of a homonym is being used — appears late
  • Logical deduction: Syllogism solving near random at small scale

The Pattern

Across many tasks in the Wei et al. dataset, the performance curve looks like a step function: flat near random for a long stretch, then a sharp upward jump, then continued improvement. This is qualitatively different from the smooth power-law loss curves that scaling laws predict.

The threshold at which emergence occurs varies by task — some emerge at ~7B parameters, others require >100B. There is no single "emergence scale."

In-Context Learning as an Emergent Ability

The most practically important emergent ability is in-context learning (ICL) — the ability to perform a task by reading a few examples in the prompt, without any weight updates. In small models, providing examples in the prompt does not help performance, and sometimes hurts it. In large models, few-shot prompting reliably improves performance across a wide range of tasks.
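
A concrete (hypothetical) few-shot prompt illustrates what ICL looks like in practice: the task is specified only through examples, and no weights are updated.

```python
# A minimal few-shot prompt (hypothetical example, not tied to any model API).
# The "task" -- English-to-French translation -- is specified only by the
# two examples; the model must infer the pattern and continue it.
prompt = """Translate English to French.
sea -> mer
sky -> ciel
cheese ->"""

print(prompt)
# A sufficiently large model typically completes this with "fromage";
# small models tend to ignore the in-context pattern entirely.
```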

Why In-Context Learning Is Special

ICL is not learned explicitly during training — the model is trained only on next-token prediction, never directly trained to "learn from examples in the prompt." The ability to adapt to examples at inference time appears to emerge from the combination of scale, diverse training data, and the attention mechanism's ability to use the context window as working memory.

This is why GPT-3 (2020) was such a significant inflection point: it was the first model large enough for ICL to be reliably useful, enabling a new paradigm of prompting-based adaptation that did not require fine-tuning.

The Phase Transition Analogy

The term "phase transition" is borrowed from physics — it describes a qualitative change in the state of a system at a critical threshold. Water freezes at 0°C: below that temperature it is rigid ice, above it flows as liquid. The analogy resonates because neural networks appear to undergo similar categorical shifts at certain scales.

However, the analogy also misleads. Physical phase transitions are driven by well-understood thermodynamic principles and occur at sharp, predictable thresholds determined by fundamental constants. Neural network emergence is empirical and poorly understood — the "threshold" varies by task, metric, model family, and training data. There is no equivalent of the thermodynamic formula that tells you at what parameter count a given capability will appear.

The Counterargument — Emergence as a Metric Artifact

In 2023, Rylan Schaeffer and colleagues published a challenge to the emergence narrative: "Are Emergent Abilities of Large Language Models a Mirage?" Their central argument is that apparent emergence is a consequence of using non-linear or discontinuous metrics, not evidence of genuine phase transitions in the underlying model.

Binary Metrics Create Apparent Discontinuities

Most emergence papers use binary accuracy: a problem is either solved or not. A model that solves 0 of 10 arithmetic problems scores 0%; a model that solves 1 scores 10%. The underlying capability — the model's probability of producing the correct answer — may be increasing smoothly, but the metric does not capture that until it crosses the threshold required to actually solve a problem. The apparent "jump" is the threshold of the metric, not a jump in the model's capability.

Continuous Metrics Show Smooth Improvement

Schaeffer et al. showed that when the same tasks are evaluated with continuous metrics (e.g., token-level probability of the correct answer, or edit distance to the correct answer), the apparent discontinuity often disappears. Improvement looks like a smooth curve — consistent with what scaling laws predict.
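
The metric effect can be sketched numerically. Assume, purely for illustration, that a model's per-token probability of the correct answer grows smoothly (here, logistically) with log parameter count; exact-match accuracy on a five-token answer then looks like a step, while the continuous token-probability metric does not.

```python
import math

# Hypothetical sketch: assume the per-token probability of the correct
# answer improves smoothly (logistically) with log10(parameter count).
def p_token(log10_params: float) -> float:
    return 1 / (1 + math.exp(-2 * (log10_params - 10)))

ANSWER_LEN = 5  # the task's answer is 5 tokens long

for log_n in [8, 9, 10, 11, 12]:
    p = p_token(log_n)
    exact_match = p ** ANSWER_LEN  # expected exact-match score: all 5 tokens
                                   # must be right, so the metric is a steep
                                   # non-linear transform of the smooth p
    print(f"10^{log_n} params  token-prob={p:.3f}  exact-match={exact_match:.4f}")
# The continuous metric (token-prob) climbs smoothly; the exact-match
# column stays near zero and then rises steeply over the same interval,
# which reads as an abrupt "emergence".
```

The numbers and the logistic curve are invented for the sketch; the shape of the effect — a smooth quantity pushed through a steep non-linear metric — is the point.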

Metric Choice Is Not Innocent

The choice of metric is partly a choice about what "capability" means. Binary accuracy on a multi-step reasoning task is a meaningful metric — getting the answer right matters. But the paper shows that researchers should be cautious about interpreting the shape of performance curves as evidence about the underlying structure of model capabilities.

What the Debate Tells Us

Both perspectives are partially correct. The resolution is roughly:

  • The Schaeffer argument is correct about metrics: many apparent discontinuities vanish under better measurement. The underlying model probability often improves smoothly.
  • The Wei argument is correct about task thresholds: some tasks genuinely require a minimum capability level to be solvable at all. A model that gets 50% of the required sub-steps right still solves essentially 0% of the composed problems. This is a real phenomenon, not purely a measurement artifact.
  • The phase transition framing is metaphorical: it captures the subjective experience of using models at different scales, but does not imply that there are sharp, predictable thresholds driven by an underlying law equivalent to thermodynamics.

Unpredictability — The Safety Implication

Regardless of how emergence is interpreted mechanistically, it has a practical consequence: you cannot reliably predict which capabilities a larger model will have by testing smaller models. Scaling laws predict loss; they do not predict which tasks will cross the threshold for competent performance at any given scale.

The Predictability Problem

A lab training a model 10× larger than their current largest model cannot reliably enumerate which new capabilities will appear. They can predict the loss, but not which specific tasks will cross the competency threshold.

This is a meaningful AI safety concern: it means that capability evaluations on smaller models may miss dangerous capabilities that only emerge at scale. A model could develop the ability to synthesize harmful information, execute social engineering strategies, or perform novel cyberattacks at a scale threshold that was not identified during pre-deployment testing.

This unpredictability is part of why "red-teaming" — adversarially probing large models for dangerous capabilities — is conducted on the final large model, not on smaller proxies.

Loss Curves vs. Benchmark Performance — The Disconnect

A key technical insight underlying this entire discussion: benchmark performance is a non-linear function of loss. A model's cross-entropy loss decreases smoothly with scale. But the probability of correctly solving a multi-step problem is a composition of many individual next-token probabilities, and this composition is highly non-linear.

Scale          Per-step accuracy   10-step problem accuracy   Appearance
Small model    50%                 0.1% (0.5^10)              Appears unable
Medium model   80%                 10.7% (0.8^10)             Marginal
Large model    95%                 60% (0.95^10)              Appears to emerge

In this stylized example, per-step accuracy improved smoothly (50% → 80% → 95%), but the composed 10-step accuracy jumped from near-zero to usable in the same range. The "emergence" is an artifact of the composition, not a qualitative change in what the model is doing at each step. This is precisely the Schaeffer argument, and it is compelling for many observed cases.
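
The table's numbers follow directly from treating the steps as independent (a simplifying assumption); a few lines reproduce them:

```python
# Reproduce the stylized table: composed accuracy on a 10-step problem,
# assuming each step succeeds independently with the per-step accuracy.
STEPS = 10
for scale, p_step in [("Small", 0.50), ("Medium", 0.80), ("Large", 0.95)]:
    p_task = p_step ** STEPS
    print(f"{scale:6s}  per-step={p_step:.0%}  {STEPS}-step={p_task:.1%}")
# → 10-step accuracies of ~0.1%, ~10.7%, ~59.9%
```

Real sub-steps are not independent, so the exact numbers are stylized, but the exponential amplification of per-step differences is generic.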

Practical Implications for Builders

For people building with LLMs rather than studying them theoretically, the emergence debate resolves into a few actionable observations:

What This Means for Model Selection

If your task requires multi-step reasoning, instruction following, or reliable composition of many steps, smaller models may fail completely even if they "understand" the individual components. Do not evaluate capability thresholds with small proxy models — test on the model you intend to deploy.

What This Means for Prompting

Chain-of-thought prompting is itself an emergent ability — it does not help small models and can hurt performance. If you are using a small or quantized model and chain-of-thought is not helping, this is expected, not a prompting failure.

What This Means for Evaluation

Benchmark scores are non-linear in loss, which means small differences in loss between models can produce large differences in benchmark performance — and vice versa. A model with slightly higher loss on a simple benchmark may dramatically outperform on a complex multi-step evaluation.
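
As a rough sketch — assuming, for illustration only, that per-step success probability is exp(-loss) and that a benchmark task chains 20 steps — halving the per-token loss can nearly triple composed task accuracy:

```python
import math

# Sketch: how a small loss gap becomes a large benchmark gap.
# Assumption (illustrative only): per-step success probability = exp(-loss),
# and the benchmark task requires 20 consecutive correct steps.
STEPS = 20
for loss in [0.10, 0.05]:
    p_step = math.exp(-loss)   # per-step success probability
    p_task = p_step ** STEPS   # = exp(-loss * STEPS)
    print(f"loss={loss:.2f}  per-step={p_step:.3f}  task={p_task:.1%}")
# loss=0.10 gives task accuracy of e^-2 (~13.5%); loss=0.05 gives e^-1
# (~36.8%): a 0.05 difference in loss becomes a ~23-point benchmark gap.
```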

What This Means for Safety

Dangerous capabilities cannot be fully assessed by testing smaller models. Safety evaluations must be conducted on the model at its final scale, and even then, new capabilities may surface under novel prompting conditions or in deployment.

Checklist: Do You Understand This?

  • Can you define emergent abilities as described in Wei et al. (2022) and give three concrete examples?
  • Can you explain why in-context learning is considered an emergent ability and why GPT-3 was a significant inflection point?
  • Can you articulate the Schaeffer et al. counterargument — specifically, how binary metrics can create the appearance of discontinuous improvement?
  • Do you understand the resolution: that per-step accuracy may improve smoothly while composed task accuracy appears to jump?
  • Can you explain why scaling laws (which predict loss) cannot reliably predict which benchmark capabilities will appear at scale?
  • Can you explain the safety implication: why evaluating smaller models does not reliably identify dangerous capabilities in larger models?
  • Do you understand why chain-of-thought prompting is ineffective on small models but powerful on large ones?