What is Test-Time Compute?
For years, making AI models smarter meant training them on more data with more compute. Test-time compute is a different approach: spend more compute at the moment of inference, when the model is actually answering your question. This shift underpins every modern reasoning model.
Training Compute vs Inference Compute
Traditional language models have a fixed compute budget at inference time: given a prompt, the model generates a response directly, with no intermediate reasoning phase. Quality is entirely determined by the training run.
Test-time compute (TTC) changes this: the model gets to spend additional compute during the inference call itself. It generates intermediate reasoning tokens — a chain of thought — before committing to a final answer. The more compute it spends thinking, the better its answers tend to be on hard problems.
The key insight
Training compute and inference compute are both important, but they scale differently. A model trained with 10× more compute is 10× more expensive to produce — you build it once. Inference compute scales per query — you pay it every time. TTC lets you trade cost-per-query for quality-per-query on demand.
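To make that trade-off concrete, here is a toy cost model. All the dollar figures are illustrative assumptions, not real prices; the point is only that training cost is paid once while inference cost scales with query volume:

```python
def total_cost(training_cost, cost_per_query, num_queries):
    # One-time training cost plus a cost paid on every query.
    return training_cost + cost_per_query * num_queries

# Two hypothetical routes to the same answer quality:
#   plan_big: train a 10x bigger model once, keep inference cheap
#   plan_ttc: keep the small model, spend 10x more per query on thinking
TRAIN, PER_QUERY = 1_000_000, 0.01  # illustrative dollars
def plan_big(n): return total_cost(10 * TRAIN, PER_QUERY, n)
def plan_ttc(n): return total_cost(TRAIN, 10 * PER_QUERY, n)
```

At a million queries the TTC plan is far cheaper; at a billion queries the bigger training run wins. That crossover is why TTC is attractive for low-volume, high-value queries.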
How the Reasoning Loop Works
When a reasoning model receives a prompt, it doesn't immediately generate a response. Instead, it runs a structured reasoning loop: it generates intermediate reasoning tokens, explores and checks candidate approaches, backtracks when a line of thinking fails, and only then commits to a final answer.
The reasoning loop is a trained behaviour — not a 'think step by step' prompt trick
This is called chain-of-thought reasoning, but in reasoning models it is a trained behaviour — the model has been reinforcement-learned to produce high-quality thinking traces that lead to correct answers. It is not the same as the old "think step by step" prompt trick (though that was an early precursor to this idea).
Thinking Tokens: What They Are and How They're Billed
Thinking tokens are the tokens generated during the reasoning phase. They are:
- Usually hidden from the user — Claude shows a collapsible block, OpenAI shows a summary or hides them entirely; DeepSeek-R1 shows the full trace
- Billed at normal token rates by most providers — a single hard query can generate 10,000–30,000 thinking tokens before the final answer
- Not retained across turns — thinking tokens are discarded once the response is complete; only the final answer is carried forward in the conversation context
- Variable in length — the model decides how much to think based on the problem's perceived difficulty (with optional capping via a budget)
Adjustable Reasoning Effort
Most reasoning model APIs expose a way to control how much thinking the model does:
Use medium effort as the default — reserve high only for demonstrably hard problems
| Effort Level | Max thinking tokens | Typical latency | Best for |
|---|---|---|---|
| Low / fast | ~1,000–4,000 | 5–15 seconds | Moderately hard problems where some reasoning helps |
| Medium | ~8,000–16,000 | 20–60 seconds | Complex maths, debugging, structured analysis |
| High / thorough | ~32,000–64,000+ | 1–5 minutes | Research-grade problems, competition maths, hard proofs |
Cost note: At high effort, a single query can cost $0.50–$2.00 with top-tier reasoning models. At scale this adds up fast. Use medium effort as the default and reserve high only for problems that demonstrably need it.
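In practice this often amounts to a small helper that maps an effort setting to a thinking-token budget before making the API call. The function and dictionary below are illustrative, not any provider's API (real APIs expose this differently, e.g. OpenAI's `reasoning_effort` parameter or Anthropic's `budget_tokens` thinking setting; check current provider docs). The budgets mirror the table above:

```python
# Illustrative thinking-token budgets per effort level, mirroring the table above.
EFFORT_BUDGETS = {"low": 4_000, "medium": 16_000, "high": 64_000}

def thinking_budget(effort="medium"):
    # Medium is the sensible default; reserve "high" for demonstrably hard problems.
    if effort not in EFFORT_BUDGETS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return EFFORT_BUDGETS[effort]
```

Defaulting the parameter to "medium" encodes the cost note above directly in the code path, so callers must opt in to the expensive setting.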
Where Test-Time Compute Excels
Strong TTC benefit
Mathematics
Multi-step algebra, calculus, combinatorics, olympiad-level proofs. AIME scores improved from ~15% (standard models) to 90%+ (o3/o4-mini).
Formal logic & constraint satisfaction
Logic puzzles, scheduling problems, SAT-like constraint reasoning.
Complex code generation
Algorithms requiring careful implementation, debugging subtle bugs, SWE-bench-style software engineering tasks.
Multi-step planning
Agentic task planning, resource allocation, decision trees with many dependent steps.
Scientific reasoning
Graduate-level science (GPQA benchmark), hypothesis evaluation, experiment design.
Structured analysis
Analysing a complex document, weighing evidence with multiple constraints, detecting inconsistencies in a policy or contract.
Where It Doesn't Help
Minimal or negative TTC benefit
- Factual recall — "What year was the Eiffel Tower built?" — the answer is either in the weights or not; reasoning adds latency without accuracy gain
- Simple generation — Writing a tweet, summarising a paragraph, translating a sentence — fluency tasks, not reasoning tasks
- Real-time chat — 30-second thinking time is incompatible with conversational UX expectations
- High-volume pipelines — If you process 100,000 documents/day, reasoning-model pricing and latency make it impractical for most steps
- Current knowledge needs — Thinking hard doesn't help with information the model wasn't trained on; you still need retrieval (RAG)
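One practical consequence of these two lists is a routing step in front of the model: send reasoning-heavy tasks to a reasoning model and everything else to a cheaper, faster standard model. A toy sketch, where the task categories and model names are illustrative and a production system would use a proper classifier:

```python
# Illustrative task categories that benefit from test-time compute.
REASONING_TASKS = {"math", "proof", "debugging", "planning", "constraints"}

def route(task_type):
    # Multi-step reasoning tasks go to the reasoning model;
    # recall, fluency, and high-volume tasks go to a cheaper standard model.
    return "reasoning-model" if task_type in REASONING_TASKS else "standard-model"
```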
Reasoning Models and Tool Use
A key advance in o3 and o4-mini (April 2025) was integrating tool use within the reasoning trace. Earlier reasoning models could only think first and call tools afterwards. Now, reasoning models can:
- Search the web mid-reasoning to fetch a fact needed to continue
- Run code in a Python interpreter to check a calculation
- Analyse an uploaded file to answer a question about it
- Generate an image and then reason about whether it meets the requirements
This makes reasoning models much more capable in agentic settings — the model can gather evidence, verify it, and incorporate findings all within a single reasoning trace.
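The pattern can be sketched as a loop in which each reasoning step may either emit a tool call, whose result is fed back into the trace, or commit to a final answer. Everything below is a stub for illustration; real implementations use a provider's tool-calling API:

```python
def reasoning_loop(model_step, tools, prompt, max_steps=10):
    # model_step(trace) returns ("tool", name, args) to call a tool mid-reasoning,
    # or ("answer", text) to commit to a final answer.
    trace = [prompt]
    for _ in range(max_steps):
        kind, *payload = model_step(trace)
        if kind == "answer":
            return payload[0]
        name, args = payload
        trace.append(tools[name](*args))  # tool result re-enters the reasoning trace
    raise RuntimeError("no final answer within the step budget")

# Stub model: verifies a calculation with a tool, then answers.
def stub_model(trace):
    if len(trace) == 1:
        return ("tool", "calc", (6, 7))
    return ("answer", f"result is {trace[-1]}")

answer = reasoning_loop(stub_model, {"calc": lambda a, b: a * b}, "what is 6*7?")
```

The essential point the sketch captures is that the tool result lands inside the trace, so subsequent reasoning can build on verified evidence rather than on a guess.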
The Inference Cost Curve
Reasoning tokens are not free. The cost premium varies by model:
| Model | Input ($/1M tokens) | Output ($/1M tokens) | vs GPT-4o |
|---|---|---|---|
| GPT-4o (standard) | $2.50 | $10.00 | baseline |
| o4-mini | $1.10 | $4.40 | cheaper input; thinking tokens are output |
| o3 | $10.00 | $40.00 | ~4–10× more expensive at high effort |
| DeepSeek-R1 (API) | $0.55 | $2.19 | significantly cheaper; open-weight option available |
Because thinking tokens are output tokens, and a single hard query can produce 30,000+ thinking tokens, real-world cost per query at high effort can easily reach $0.50–$3.00 for o3. Plan accordingly for production use.
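Plugging the table's rates into that arithmetic makes the premium concrete. Assume a query with 1,000 prompt tokens, 30,000 thinking tokens, and a 1,000-token answer (the token counts are illustrative):

```python
def query_cost(input_tokens, output_tokens, input_rate, output_rate):
    # Rates are dollars per million tokens; thinking tokens are billed as output.
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# o3 at the table's rates: 30,000 thinking tokens + 1,000 answer tokens of output.
o3_cost = query_cost(1_000, 30_000 + 1_000, 10.00, 40.00)
# GPT-4o answering directly: no thinking tokens at all.
gpt4o_cost = query_cost(1_000, 1_000, 2.50, 10.00)
```

Here the o3 query costs $1.25 against GPT-4o's $0.0125, a roughly 100× gap driven almost entirely by the thinking tokens.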
Checklist: Do You Understand This?
- Can you explain the difference between training compute and inference compute (test-time compute)?
- What are thinking tokens, and how are they billed?
- Name three task types where TTC provides strong benefit, and three where it doesn't help.
- What does "adjustable reasoning effort" mean in practice?
- Why did integrating tool use within the reasoning trace matter for agentic applications?
- Roughly how much more expensive is o3 at high effort compared to GPT-4o?