
Test-Time Compute & Tree Search

The standard mental model of AI scaling focuses on training compute — bigger models trained on more data. Test-time compute scaling is a different axis: spend more compute at inference time to get better answers, without changing the model's weights at all. This page explains the key approaches, why they work, and the practical tradeoffs that determine when they are worth using.

The Key Idea

A trained language model can be queried in many ways. The simplest is greedy decoding: predict the single most likely next token at each step. But this is not optimal. The model can also generate many candidate answers, search over reasoning paths, or apply a verifier to check its work. All of these approaches spend additional compute at inference time to improve output quality — and all of them can be meaningfully scaled.

The core insight, formalized by Snell et al. (2024) in "Scaling LLM Test-Time Compute Optimally," is that test-time compute and training compute are partially fungible. Within limits, a smaller model given more inference compute can match a larger model given less. This does not mean you can infinitely substitute one for the other — but it does mean that how you allocate compute across training and inference is a design choice with real consequences.

Best-of-N Sampling

The simplest test-time compute approach: generate N independent responses to the same prompt, score each one using a reward model or verifier, and return the highest-scoring response.

Best-of-N algorithm

  1. Sample N responses from the model (temperature > 0 for diversity)
  2. Score each response using a reward model or outcome verifier
  3. Return the response with the highest score

Cost: linear in N. Accuracy improvement: roughly logarithmic — each doubling of N gives diminishing returns but consistent gains. Commonly used with N = 4, 8, 16, 32, or 64.

Best-of-N is simple to implement and does not require any changes to the model. Its main limitation is that all N samples are generated independently — it cannot learn from failed attempts or explore the reasoning space strategically. Despite this, it provides consistent improvement and is a strong baseline for any test-time compute method.
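The three steps above can be sketched in a few lines of Python. The `generate` and `score` callables are hypothetical stand-ins for a sampled LLM call (temperature > 0) and a reward model or verifier; this is a minimal sketch, not a production implementation.

```python
def best_of_n(generate, score, n=16):
    """Sample n candidates independently and return the highest-scoring one.

    `generate` stands in for an LLM call with temperature > 0 (so the
    n samples differ); `score` stands in for a reward model or verifier.
    """
    candidates = [generate() for _ in range(n)]  # step 1: sample N responses
    return max(candidates, key=score)            # steps 2-3: score and pick best

# Toy stand-ins: the "model" emits numeric answers, the "verifier"
# prefers answers close to 4.
answers = iter(["2", "5", "4", "3"])
best = best_of_n(lambda: next(answers), lambda a: -abs(int(a) - 4), n=4)
# → "4"
```

Because the samples are independent, the N generations can run fully in parallel; only the scoring-and-select step needs all candidates.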

Beam Search

Beam search maintains N candidate sequences (called beams) at each generation step. At each step, each current beam is extended by its candidate next tokens, and the top-N candidates by cumulative score are kept. This prunes low-probability branches early rather than generating full sequences and discarding them afterward.

How the two compare:

  • Generation independence: best-of-N draws N fully independent samples; beam search's N beams share prefixes and diverge at each step.
  • Pruning: best-of-N does none (all N sequences are completed); beam search prunes at each token step and explores the space more efficiently.
  • Output diversity: high for best-of-N (stochastic sampling); low for beam search (deterministic pruning reduces diversity).
  • Common use: best-of-N for reasoning tasks and open-ended generation; beam search originally for translation and structured generation.
  • LLM suitability: high for best-of-N; limited for beam search, since reduced diversity hurts open-ended tasks.

Beam search was dominant in sequence-to-sequence models (translation, summarization) but proved less effective for open-ended LLM generation. Highly probable sequences in beam search tend to be repetitive and generic — a known failure mode called beam search degeneracy. For reasoning tasks with LLMs, best-of-N sampling with a verifier generally outperforms standard beam search.
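The extend-then-prune step that defines beam search can be sketched as follows. The `extend` and `score` callables are hypothetical stand-ins for the model's next-token proposals and cumulative log-probabilities; a real decoder would work over token IDs and logits.

```python
import heapq

def beam_search_step(beams, extend, score, width):
    """One beam-search step: extend every beam with its candidate next
    tokens, then keep only the top-`width` sequences by cumulative score.

    `extend` and `score` stand in for the model's token proposals and
    cumulative log-probabilities (hypothetical, for illustration).
    """
    candidates = [beam + [tok] for beam in beams for tok in extend(beam)]
    return heapq.nlargest(width, candidates, key=score)

# Toy run over tokens "a"/"b" with a score that favors "a": after two
# steps the top beam is ["a", "a"], and low-scoring branches never
# reach full length.
beams = [[]]
for _ in range(2):
    beams = beam_search_step(beams, lambda b: ["a", "b"],
                             lambda seq: seq.count("a"), width=2)
```

Note how the pruning that makes this efficient is also what collapses diversity: every surviving beam is drawn from the same few high-scoring prefixes.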

Monte Carlo Tree Search (MCTS)

MCTS is the most principled approach to test-time reasoning search. Originally developed for game playing (AlphaGo, AlphaZero), it builds a tree of possible reasoning states through four repeated phases:

1. Selection (UCB1)

Start from the root (the input prompt). At each node, select the child that maximizes the Upper Confidence Bound (UCB1) formula: estimated value + exploration bonus. The exploration bonus is higher for less-visited nodes, balancing exploitation of known good paths with exploration of unknown ones.
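The UCB1 score described above is a one-line formula: mean value plus an exploration term that shrinks as a node accumulates visits. A minimal sketch (the exploration constant `c` is a tunable assumption, commonly around √2):

```python
import math

def ucb1(total_value, visits, parent_visits, c=1.41):
    """UCB1 score for a child node: mean value plus an exploration bonus.

    Unvisited nodes score infinity, so every child is tried at least once;
    thereafter the bonus shrinks as the node's visit count grows.
    """
    if visits == 0:
        return float("inf")
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

With equal mean values, the less-visited child scores higher, which is exactly the exploration behavior selection relies on.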

2. Expansion

Add one or more new child nodes to the selected leaf. In LLM reasoning, a "node" is typically a partial reasoning step or sentence. Expansion generates candidate next steps using the LLM.

3. Simulation (Rollout)

From the new node, generate a complete solution by sampling to completion. Evaluate the final answer using a reward signal (correct/incorrect for math, or a learned reward model score). This gives an estimate of how promising the current partial path is.

4. Backpropagation

Propagate the reward from the rollout back up to the root, updating value estimates for all nodes on the path. Over many iterations, nodes on paths that frequently lead to correct answers accumulate high value estimates.

MCTS is more computationally efficient than exhaustive best-of-N because it focuses sampling on promising branches. The tradeoff is implementation complexity and the need for a step-level reward signal — which requires a process reward model.
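The four phases above can be sketched as a single loop. Here `expand(state)` plays the LLM's role of proposing candidate next reasoning steps, and `rollout(state)` plays the verifier's role of completing a path and returning a reward; both are hypothetical stand-ins, and a real system would sample stochastically rather than deterministically.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def mcts(root_state, expand, rollout, iters=100, c=1.41):
    """Minimal MCTS sketch over reasoning states (illustrative only)."""
    root = Node(root_state)
    for _ in range(iters):
        # 1. Selection: descend by UCB1 while every child has been visited.
        node = root
        while node.children and all(ch.visits > 0 for ch in node.children):
            node = max(node.children, key=lambda ch:
                       ch.value / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: propose candidate next steps at an unexpanded node.
        if not node.children:
            node.children = [Node(s, node) for s in expand(node.state)]
        if node.children:
            node = next((ch for ch in node.children if ch.visits == 0),
                        node.children[0])
        # 3. Simulation: complete the path and score the final answer.
        reward = rollout(node.state)
        # 4. Backpropagation: update value estimates up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Recommend the most-visited first step.
    return max(root.children, key=lambda ch: ch.visits).state

# Toy check: states are bit-strings; only the "1..." branch is rewarded,
# so visits concentrate there and "1" is recommended as the first step.
result = mcts("", expand=lambda s: [s + "0", s + "1"] if len(s) < 3 else [],
              rollout=lambda s: 1.0 if s.startswith("1") else 0.0, iters=50)
```

Even in this toy, the rewarded branch absorbs most of the iteration budget, which is the efficiency advantage over uniform best-of-N sampling.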

Process Reward Models (PRMs)

Standard reward models evaluate the final answer: correct or incorrect. Process Reward Models (PRMs) evaluate each reasoning step, assigning a quality score to every intermediate step in the chain-of-thought.

Lightman et al. (OpenAI, 2023), "Let's Verify Step by Step," demonstrated that PRMs significantly outperform outcome reward models (ORMs) for mathematical reasoning. The key finding: a PRM trained to detect errors in individual steps can guide tree search to prune bad reasoning branches early, before they reach the final answer. This makes tree search dramatically more efficient.

How ORMs and PRMs compare:

  • What it scores: an ORM scores the final answer only; a PRM scores each reasoning step individually.
  • Training data needed: ORMs need answer labels (correct/incorrect); PRMs need step-level human annotations of correctness.
  • Search effectiveness: ORMs are weak here, since they cannot prune bad paths early; PRMs are strong, pruning bad branches at each step.
  • Training cost: lower for ORMs; higher for PRMs (step annotations are expensive).

PRMs have a significant limitation: they require fine-grained human annotation of reasoning steps, which is expensive and difficult to scale. Research has explored automatic PRM training using Monte Carlo rollouts (estimate step value by sampling completions from that step), but this adds its own computational overhead.
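How a PRM turns per-step scores into a solution-level score can be sketched as follows. The `step_scorer` is a hypothetical stand-in for a trained PRM that returns the probability each step is correct; taking the minimum (the weakest step) or the product of step scores are two common aggregations.

```python
def prm_score(steps, step_scorer, aggregate="min"):
    """Score a chain-of-thought from per-step scores, PRM-style.

    `step_scorer` stands in for a trained process reward model returning
    P(step is correct). The solution score is either the weakest step
    ("min") or the product of all step scores.
    """
    scores = [step_scorer(step) for step in steps]
    if aggregate == "min":
        return min(scores)
    product = 1.0
    for p in scores:
        product *= p
    return product

# Toy scorer: one flawed step drags the whole chain down, which is
# what lets search prune this branch before reaching a final answer.
steps = ["define x", "drop the negative root", "conclude"]
scorer = lambda s: 0.2 if "drop" in s else 0.9
```

Either aggregation makes a single bad step fatal to the whole chain, which is exactly the early-pruning signal tree search needs.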

Scaling Test-Time Compute: The Snell et al. Result

Snell et al. (2024), "Scaling LLM Test-Time Compute Optimally," made the compute fungibility claim precise. Their key finding: at a fixed total compute budget, it is often better to use a smaller model with more test-time compute than a larger model with standard decoding.

The optimal allocation depends on task difficulty

For easy problems that the base model solves reliably, additional test-time compute provides no benefit. For hard problems at the edge of the model's capability, test-time compute substantially helps. The optimal strategy adaptively allocates more compute to harder problems — which requires an estimate of problem difficulty before spending the compute.

Compute-matched comparisons favor test-time scaling

When a small model plus test-time compute is given the same total compute budget as a large model with greedy decoding, the small model with search often wins on reasoning benchmarks. This does not mean small models are always better — training compute still matters for general capability — but it complicates the "just train a bigger model" argument for reasoning tasks.

Which approach to use matters

Not all test-time compute methods scale equally. PRM-guided beam search scales better than best-of-N sampling at large N, and revision-based approaches (generate, evaluate, revise) are sometimes more efficient than pure search. The choice of method is itself an optimization problem.

The o1 Inference Approach (Inferred)

OpenAI has not published the architecture of o1's reasoning system. Based on what they have disclosed and community analysis, the most widely accepted inference is:

  • o1 uses extended chain-of-thought — it generates a long internal reasoning trace before producing an answer. The "thinking tokens" shown in the UI represent this trace.
  • The model was trained with reinforcement learning to improve the quality of these reasoning traces on verifiable tasks (math, code), using outcome rewards and likely process rewards.
  • The model has learned to allocate more thinking tokens to harder problems — it does not always generate a fixed-length chain, but adjusts depth based on apparent task difficulty.
  • The low/medium/high compute settings exposed by the OpenAI API appear to correspond to different budgets for how long the model is allowed to think before producing a response.

This is consistent with the test-time compute framework: o1 is a model trained to use inference compute more effectively, not simply a larger model. The "intelligence" is partly in knowing when and how to think longer.

Practical Tradeoffs

Latency increases with compute

Test-time compute methods increase response latency. A 40-sample best-of-N run uses roughly 40× the compute of a single greedy sample; parallel sampling can hide much of the wall-clock cost, but scoring and selection still wait for every sample to finish. Extended thinking in o1-style models can add seconds or tens of seconds to simple responses.

Cost scales with compute

API costs for models like o1 and o3 at high compute settings are significantly higher than for standard models. Running best-of-32 on a hosted API multiplies costs by 32. This is acceptable for high-stakes reasoning tasks; it is not acceptable for high-volume routine queries.

Verifier quality is a bottleneck

Best-of-N and tree search are only as good as the reward model or verifier used to evaluate candidates. A poor verifier introduces systematic errors, selecting confidently wrong answers over correct but lower-scored ones. This is a fundamental challenge: you need good verification to benefit from search, but verification is itself a hard problem.

Best suited for: high-stakes reasoning tasks

Mathematical problem solving, complex coding, scientific analysis, legal reasoning, and any task where a verifiable correct answer exists and the cost of an error exceeds the cost of additional inference compute. Not worth it for simple factual queries, creative writing, or routine extraction tasks.

Checklist: Do You Understand This?

  • Can you explain what "test-time compute scaling" means and how it differs from training compute scaling?
  • Can you describe the best-of-N algorithm and explain why its accuracy improvement is logarithmic rather than linear in N?
  • Can you explain why beam search reduces output diversity and when that makes it inappropriate for LLM tasks?
  • Can you describe all four phases of MCTS (selection, expansion, simulation, backpropagation) in the context of LLM reasoning?
  • Can you explain the difference between an outcome reward model and a process reward model, and why PRMs enable more effective search?
  • Can you state the key finding of Snell et al. (2024) and explain its implication for how we think about model size vs. inference compute?
  • Can you explain the inferred inference mechanism of o1 and how it relates to the test-time compute framework?
  • Can you identify three practical constraints that limit when test-time compute scaling is worth using?