What is Test-Time Compute?
For years, making AI models smarter meant training them on more data with more compute. Test-time compute is a different approach: spend more compute at the moment of inference, when the model is actually answering your question. This shift underpins every modern reasoning model.
Training Compute vs Inference Compute
Traditional language models have a fixed compute budget at inference time: given a prompt, the model generates a response directly, with no intermediate reasoning phase. Quality is entirely determined by the training run.
Test-time compute (TTC) changes this: the model gets to spend additional compute during the inference call itself. It generates intermediate reasoning tokens — a chain of thought — before committing to a final answer. The more compute it spends thinking, the better its answers tend to be on hard problems.
The key insight
Training compute and inference compute are both important, but they scale differently. A model trained with 10× more compute is 10× more expensive to produce — you build it once. Inference compute scales per query — you pay it every time. TTC lets you trade cost-per-query for quality-per-query on demand.
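To make that trade-off concrete, here is a toy cost model. All the dollar figures are illustrative assumptions, not real prices; the point is only that training cost is paid once while inference cost scales with query volume:

```python
def total_cost(training_cost, cost_per_query, num_queries):
    # One-time training cost plus a cost paid on every query.
    return training_cost + cost_per_query * num_queries

# Two hypothetical routes to the same answer quality:
#   plan_big: train a 10x bigger model once, keep inference cheap
#   plan_ttc: keep the small model, spend 10x more per query on thinking
TRAIN, PER_QUERY = 1_000_000, 0.01  # illustrative dollars
def plan_big(n): return total_cost(10 * TRAIN, PER_QUERY, n)
def plan_ttc(n): return total_cost(TRAIN, 10 * PER_QUERY, n)
```

At a million queries the TTC plan is far cheaper; at a billion queries the bigger training run wins. That crossover is why TTC is attractive for low-volume, high-value queries.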
How the Reasoning Loop Works
When a reasoning model receives a prompt, it doesn't immediately generate a response. Instead, it runs a structured reasoning loop: it generates intermediate reasoning tokens, explores and checks candidate approaches, backtracks when a line of thinking fails, and only then commits to a final answer.
The reasoning loop is a trained behaviour — not a 'think step by step' prompt trick
This is called chain-of-thought reasoning, but in reasoning models it is a trained behaviour — the model has been reinforcement-learned to produce high-quality thinking traces that lead to correct answers. It is not the same as the old "think step by step" prompt trick (though that was an early precursor to this idea).
Thinking Tokens: What They Are and How They're Billed
Thinking tokens are the tokens generated during the reasoning phase. They are:
- Usually hidden from the user — Claude shows a collapsible block, OpenAI shows a summary or hides them entirely; DeepSeek-R1 shows the full trace
- Billed at normal token rates by most providers — a single hard query can generate 10,000–30,000 thinking tokens before the final answer
- Not retained across turns — thinking tokens are discarded once the response is complete; only the final answer is carried forward in the conversation context
- Variable in length — the model decides how much to think based on the problem's perceived difficulty (with optional capping via a budget)
Adjustable Reasoning Effort
Most reasoning model APIs expose a way to control how much thinking the model does:
Use medium effort as the default — reserve high only for demonstrably hard problems
| Effort Level | Max thinking tokens | Typical latency | Best for |
|---|---|---|---|
| Low / fast | ~1,000–4,000 | 5–15 seconds | Moderately hard problems where some reasoning helps |
| Medium | ~8,000–16,000 | 20–60 seconds | Complex maths, debugging, structured analysis |
| High / thorough | ~32,000–64,000+ | 1–5 minutes | Research-grade problems, competition maths, hard proofs |
Cost note: At high effort, a single query can cost $0.50–$2.00 with top-tier reasoning models. At scale this adds up fast. Use medium effort as the default and reserve high only for problems that demonstrably need it.
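In practice this often amounts to a small helper that maps an effort setting to a thinking-token budget before making the API call. The function and dictionary below are illustrative, not any provider's API (real APIs expose this differently, e.g. OpenAI's `reasoning_effort` parameter or Anthropic's `budget_tokens` thinking setting; check current provider docs). The budgets mirror the table above:

```python
# Illustrative thinking-token budgets per effort level, mirroring the table above.
EFFORT_BUDGETS = {"low": 4_000, "medium": 16_000, "high": 64_000}

def thinking_budget(effort="medium"):
    # Medium is the sensible default; reserve "high" for demonstrably hard problems.
    if effort not in EFFORT_BUDGETS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return EFFORT_BUDGETS[effort]
```

Defaulting the parameter to "medium" encodes the cost note above directly in the code path, so callers must opt in to the expensive setting.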
Where Test-Time Compute Excels
Strong TTC benefit
Mathematics
Multi-step algebra, calculus, combinatorics, olympiad-level proofs. AIME scores improved from ~15% (standard models) to 90%+ (o3/o4-mini).
Formal logic & constraint satisfaction
Logic puzzles, scheduling problems, SAT-like constraint reasoning.
Complex code generation
Algorithms requiring careful implementation, debugging subtle bugs, SWE-bench-style software engineering tasks.
Multi-step planning
Agentic task planning, resource allocation, decision trees with many dependent steps.
Scientific reasoning
Graduate-level science (GPQA benchmark), hypothesis evaluation, experiment design.
Structured analysis
Analysing a complex document, weighing evidence with multiple constraints, detecting inconsistencies in a policy or contract.
Where It Doesn't Help
Minimal or negative TTC benefit
- Factual recall — "What year was the Eiffel Tower built?" — the answer is either in the weights or not; reasoning adds latency without accuracy gain
- Simple generation — Writing a tweet, summarising a paragraph, translating a sentence — fluency tasks, not reasoning tasks
- Real-time chat — 30-second thinking time is incompatible with conversational UX expectations
- High-volume pipelines — If you process 100,000 documents/day, reasoning-model pricing and latency make it impractical for most steps
- Current knowledge needs — Thinking hard doesn't help with information the model wasn't trained on; you still need retrieval (RAG)
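One practical consequence of these two lists is a routing step in front of the model: send reasoning-heavy tasks to a reasoning model and everything else to a cheaper, faster standard model. A toy sketch, where the task categories and model names are illustrative and a production system would use a proper classifier:

```python
# Illustrative task categories that benefit from test-time compute.
REASONING_TASKS = {"math", "proof", "debugging", "planning", "constraints"}

def route(task_type):
    # Multi-step reasoning tasks go to the reasoning model;
    # recall, fluency, and high-volume tasks go to a cheaper standard model.
    return "reasoning-model" if task_type in REASONING_TASKS else "standard-model"
```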
Reasoning Models and Tool Use
A key advance in o3 and o4-mini (April 2025) was integrating tool use within the reasoning trace. Earlier reasoning models could only think first and call tools afterwards. Now, reasoning models can:
- Search the web mid-reasoning to fetch a fact needed to continue
- Run code in a Python interpreter to check a calculation
- Analyse an uploaded file to answer a question about it
- Generate an image and then reason about whether it meets the requirements
This makes reasoning models much more capable in agentic settings — the model can gather evidence, verify it, and incorporate findings all within a single reasoning trace.
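The pattern can be sketched as a loop in which each reasoning step may either emit a tool call, whose result is fed back into the trace, or commit to a final answer. Everything below is a stub for illustration; real implementations use a provider's tool-calling API:

```python
def reasoning_loop(model_step, tools, prompt, max_steps=10):
    # model_step(trace) returns ("tool", name, args) to call a tool mid-reasoning,
    # or ("answer", text) to commit to a final answer.
    trace = [prompt]
    for _ in range(max_steps):
        kind, *payload = model_step(trace)
        if kind == "answer":
            return payload[0]
        name, args = payload
        trace.append(tools[name](*args))  # tool result re-enters the reasoning trace
    raise RuntimeError("no final answer within the step budget")

# Stub model: verifies a calculation with a tool, then answers.
def stub_model(trace):
    if len(trace) == 1:
        return ("tool", "calc", (6, 7))
    return ("answer", f"result is {trace[-1]}")

answer = reasoning_loop(stub_model, {"calc": lambda a, b: a * b}, "what is 6*7?")
```

The essential point the sketch captures is that the tool result lands inside the trace, so subsequent reasoning can build on verified evidence rather than on a guess.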
The Inference Cost Curve
Reasoning tokens are not free. The cost premium varies by model:
| Model | Input ($/1M tokens) | Output ($/1M tokens) | vs GPT-4o |
|---|---|---|---|
| GPT-4o (standard) | $2.50 | $10.00 | baseline |
| o4-mini | $1.10 | $4.40 | cheaper input; thinking tokens are output |
| o3 | $10.00 | $40.00 | ~4–10× more expensive at high effort |
| DeepSeek-R1 (API) | $0.55 | $2.19 | significantly cheaper; open-weight option available |
Because thinking tokens are output tokens, and a single hard query can produce 30,000+ thinking tokens, real-world cost per query at high effort can easily reach $0.50–$3.00 for o3. Plan accordingly for production use.
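Plugging the table's rates into that arithmetic makes the premium concrete. Assume a query with 1,000 prompt tokens, 30,000 thinking tokens, and a 1,000-token answer (the token counts are illustrative):

```python
def query_cost(input_tokens, output_tokens, input_rate, output_rate):
    # Rates are dollars per million tokens; thinking tokens are billed as output.
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# o3 at the table's rates: 30,000 thinking tokens + 1,000 answer tokens of output.
o3_cost = query_cost(1_000, 30_000 + 1_000, 10.00, 40.00)
# GPT-4o answering directly: no thinking tokens at all.
gpt4o_cost = query_cost(1_000, 1_000, 2.50, 10.00)
```

Here the o3 query costs $1.25 against GPT-4o's $0.0125, a roughly 100× gap driven almost entirely by the thinking tokens.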
Checklist: Do You Understand This?
- Can you explain the difference between training compute and inference compute (test-time compute)?
- What are thinking tokens, and how are they billed?
- Name three task types where TTC provides strong benefit, and three where it doesn't help.
- What does "adjustable reasoning effort" mean in practice?
- Why did integrating tool use within the reasoning trace matter for agentic applications?
- Roughly how much more expensive is o3 at high effort compared to GPT-4o?