Chain-of-Thought — Why It Works
Chain-of-thought prompting is one of the most practically significant discoveries in the history of large language models. A small change — asking the model to show its reasoning steps — produces dramatic improvements on tasks where direct prompting fails. This page explains what the research actually found, the mechanistic hypotheses for why it works, and the important caveats around faithfulness and scale.
The Original Finding
In 2022, Wei et al. published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," one of the most cited papers in NLP. The setup was simple: instead of giving a model few-shot examples that jump directly from question to answer, the authors included intermediate reasoning steps in each example. The effect was striking. On grade-school math benchmarks like GSM8K, a 540B parameter model with chain-of-thought examples solved 57% of problems — compared to under 20% with standard few-shot prompting. On symbolic reasoning tasks, the improvement was even larger.
The same year, Kojima et al. discovered that you do not need to write out worked examples at all. Simply appending the phrase "Let's think step by step" to a prompt — a technique called zero-shot chain-of-thought — is enough to trigger reasoning behavior in sufficiently large models. This zero-shot approach achieved around 78% on MultiArith versus 18% for standard zero-shot prompting. Five words appended to a prompt more than quadrupled accuracy on a mathematical reasoning benchmark.
| Technique | Paper | What changes in the prompt | Key result |
|---|---|---|---|
| Few-shot CoT | Wei et al. (2022) | Examples include intermediate reasoning steps | GSM8K: ~17% → 57% (PaLM 540B) |
| Zero-shot CoT | Kojima et al. (2022) | Append "Let's think step by step" | MultiArith: 18% → 78% (InstructGPT 175B) |
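In code, the difference between the two techniques is purely a matter of prompt construction. A minimal sketch — the worked example is written in the style of Wei et al.'s prompts, and the helper names are illustrative, not from either paper:

```python
# One worked example whose answer spells out intermediate steps,
# in the style of Wei et al.'s few-shot CoT prompts.
FEW_SHOT_COT_EXAMPLE = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def few_shot_cot_prompt(question: str) -> str:
    """Few-shot CoT: prepend examples that include reasoning steps."""
    return f"{FEW_SHOT_COT_EXAMPLE}Q: {question}\nA:"

def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: no examples, just the Kojima et al. trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."
```

Everything else — model, decoding settings, answer extraction — stays the same; only the prompt string changes.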
What CoT Actually Does: Three Hypotheses
Why does generating intermediate text help? There is no single agreed-upon answer, but three mechanistic hypotheses have significant research support:
1. Intermediate tokens as working memory
Autoregressive generation is sequential — each token is predicted from all previous tokens. When a model writes intermediate reasoning steps, those tokens become part of the context that subsequent tokens are conditioned on. This effectively gives the model external scratch space. The model does not need to compress a multi-step computation into a single forward pass through its weights; instead it can offload intermediate results to the token stream, then read them back when computing later steps.
2. Increased effective computation per input
A transformer applies a roughly fixed amount of computation per generated token. When a model generates a 200-token chain of thought before producing an answer, it has run roughly 200× as many forward passes over the problem as it would have for a one-token answer. The total compute budget for answering is therefore proportional to output length, and CoT deliberately increases that length. This reframes CoT as a way to dynamically allocate inference compute.
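As a back-of-envelope illustration: a common approximation is ~2 × parameter-count FLOPs per generated token (ignoring the attention term, which grows with context length). The model size below is arbitrary, chosen only to make the ratio concrete:

```python
# Back-of-envelope: decoding FLOPs scale linearly with tokens generated.
# Rule of thumb: ~2 * n_params FLOPs per token (attention cost ignored).

def decode_flops(n_params: float, n_output_tokens: int) -> float:
    return 2 * n_params * n_output_tokens

n_params = 70e9  # hypothetical 70B-parameter model
direct = decode_flops(n_params, 1)    # one-token direct answer
cot = decode_flops(n_params, 200)     # 200-token chain of thought
print(cot / direct)  # → 200.0 — the compute ratio equals the token ratio
```

The approximation is crude, but the linear relationship it illustrates is the point: the chain's length is the compute budget.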
3. Commitment and constraint propagation
When a model commits to a reasoning step early — "The total number of apples is 12" — that statement becomes a hard constraint on subsequent generation. The model cannot easily contradict itself mid-chain without producing incoherent text. This forces a degree of internal consistency that direct answer generation lacks. Committing to sub-steps progressively narrows the space of valid continuations, guiding the model toward a consistent final answer.
These hypotheses are complementary rather than competing. The actual mechanism likely involves all three operating simultaneously, which is why CoT improvements are so consistent across different model families and task types.
The Faithfulness Question
A deeper question: do the intermediate reasoning steps actually cause the final answer, or are they elaborate post-hoc rationalizations that happen to look plausible? This is the faithfulness problem, and the evidence is genuinely mixed.
| Evidence for faithfulness | Evidence against faithfulness |
|---|---|
| Removing or masking the scratchpad significantly hurts performance — if the steps were post-hoc they should not matter | Scratchpad tokens can sometimes be replaced with incorrect steps without changing the final answer |
| Introducing deliberate errors into the chain causes the final answer to change in predictable ways | Models produce correct answers with logically invalid intermediate steps in some cases |
| Models asked to find errors in their own chains can sometimes locate them, suggesting the chain is actively used | The model's natural-language explanation of its reasoning can diverge from which internal components are actually active |
| Process reward models that evaluate individual steps show that step quality correlates with final answer quality | Human-interpretable reasoning may be a learned output style that partially decouples from internal computation |
The practical takeaway: CoT reasoning steps are partially faithful — they matter and influence the output, but they are not a transparent window into the model's computation. This matters for interpretability: you cannot fully trust a chain-of-thought trace as an explanation of why the model reached a particular conclusion.
Emergence and Scale
One of the most important findings in the Wei et al. paper: chain-of-thought benefits are emergent with model scale. For models below roughly 100 billion parameters, CoT prompting either provides no improvement or actively hurts performance. Small models generate incorrect reasoning chains — plausible-sounding but wrong — which then mislead the final answer. The capability appears relatively suddenly at large scale, which is why CoT was initially missed: researchers testing on smaller models would have seen no effect.
This has a practical implication for 2025. Many widely-used open models (7B, 13B, 34B) sit below the threshold where CoT reliably helps on genuine multi-step problems. Instruction-tuning and reinforcement learning from human feedback can shift this threshold somewhat — a carefully fine-tuned 7B reasoning model can outperform a raw 7B base model on CoT tasks — but the fundamental scaling relationship remains. Models trained specifically for reasoning (like DeepSeek-R1 distillations) are a partial exception, but they were fine-tuned on CoT data generated by much larger models.
Self-Consistency: Sampling Multiple Chains
Wang et al. (2022) introduced a simple but powerful extension to CoT: instead of generating one reasoning chain and taking its answer, generate N independent chains (by sampling with temperature > 0) and take a majority vote on the final answer. This is called self-consistency.
The logic is that different reasoning paths reach the same correct answer via different routes, while errors tend to produce diverse wrong answers. Aggregating across many paths substantially improves accuracy. On GSM8K, self-consistency with 40 samples pushed PaLM 540B from 57% (single CoT) to 74%. The cost is linear in the number of samples — 40 samples costs 40× more inference compute. This is an early, explicit instance of test-time compute scaling.
Self-consistency algorithm
- Generate N independent completions for the same prompt (temperature > 0 for diversity)
- Extract the final answer from each completion
- Take a majority vote across all N final answers
- Return the most common answer
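The steps above can be sketched as a single function; `sample_chain` and `extract_answer` are hypothetical stand-ins for the sampling API call and a task-specific answer parser:

```python
from collections import Counter
from typing import Callable

def self_consistency(
    prompt: str,
    sample_chain: Callable[[str], str],    # one completion, temperature > 0
    extract_answer: Callable[[str], str],  # final answer from one chain
    n: int = 40,
) -> str:
    """Majority vote over the final answers of n independent chains."""
    answers = [extract_answer(sample_chain(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Each chain is sampled independently, so the n calls can run in parallel; the n× cost is in total compute, not necessarily latency.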
Typical N values: 10–40 samples. Diminishing returns appear beyond 40 in most benchmarks. Works best on tasks with discrete, verifiable answers (math, logic, classification). Does not help when the model is systematically biased — all samples will agree on the wrong answer.
When CoT Helps — And When It Does Not
| Task type | CoT benefit | Reason |
|---|---|---|
| Multi-step arithmetic | Large | Requires sequential sub-computations; errors accumulate without explicit steps |
| Logic puzzles and symbolic reasoning | Large | Many sub-steps with hard constraints; CoT enforces consistency across steps |
| Multi-hop commonsense reasoning | Moderate | Tasks requiring multiple inference steps across facts benefit most |
| Complex code generation | Moderate | Planning helps on hard problems; simple coding tasks do not benefit |
| Simple factual lookup | None / negative | Direct answer is sufficient; CoT adds cost and can introduce noise |
| Sentiment classification | None | Single-step task with no sub-structure; chain adds noise without benefit |
| Translation | Mixed | Not decomposable into sequential sub-steps in a useful way for most sentence pairs |
The practical rule: use chain-of-thought when the task has multiple steps where an error in an early step would propagate to the final answer. Skip it for tasks where the answer is directly retrievable or where the chain cannot be meaningfully structured. Every CoT token costs inference compute and adds latency — it should be deliberate, not default.
Notable Extensions
Least-to-Most Prompting
Decompose a complex problem into simpler sub-problems, solve them in order, and use prior answers to solve later sub-problems. Particularly effective on compositional generalization tasks where standard CoT still fails.
Tree of Thoughts
Instead of a single linear chain, generate a tree of partial reasoning paths and use a search procedure (BFS or DFS) to find the best path. The model evaluates candidate next steps, not just final answers, enabling backtracking when a branch leads nowhere. Precursor to process-reward-model-guided tree search.
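A minimal breadth-first sketch of the idea, assuming hypothetical `propose_steps` (generate candidate next thoughts) and `score` (evaluate a partial path) callables that would be backed by model calls in a real system:

```python
from typing import Callable, List

def tot_bfs(
    root: str,
    propose_steps: Callable[[List[str]], List[str]],  # candidate next thoughts
    score: Callable[[List[str]], float],              # value of a partial path
    depth: int = 3,
    beam: int = 2,
) -> List[str]:
    """Keep the `beam` best partial reasoning paths at each depth."""
    frontier = [[root]]
    for _ in range(depth):
        candidates = [path + [step]
                      for path in frontier
                      for step in propose_steps(path)]
        if not candidates:
            break
        # Pruning low-scoring branches is what enables backtracking:
        # a dead-end path is dropped in favor of its surviving siblings.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)
```

The key difference from linear CoT is that partial paths are evaluated and compared, not just final answers.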
ReAct (Reason + Act)
Interleaves reasoning steps with external actions (tool calls, web searches, code execution). Each observation from an action feeds back into the reasoning chain, grounding CoT in verifiable external information and reducing hallucination on factual tasks.
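The interleaving can be sketched as a loop; `generate` and the tool registry are hypothetical stand-ins for a model call and real tools, and the `Action: tool[arg]` line format is one common convention:

```python
from typing import Callable, Dict

def react_loop(
    question: str,
    generate: Callable[[str], str],          # next model step, given transcript
    tools: Dict[str, Callable[[str], str]],  # tool name -> callable
    max_steps: int = 5,
) -> str:
    """Alternate model reasoning with tool calls until an answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(transcript)
        transcript += step + "\n"
        if step.startswith("Answer:"):
            return step.removeprefix("Answer:").strip()
        if "Action:" in step:
            # Parse "Action: tool[arg]", run the tool, feed back the result.
            action = step.split("Action:", 1)[1].strip()
            name, arg = action.split("[", 1)
            observation = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return ""
```

The observation lines are what ground the chain: later reasoning steps condition on tool output rather than on the model's own guesses.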
Program of Thought
Use executable code as the reasoning chain. Instead of natural-language steps, generate a Python script and execute it. Dramatically more reliable on arithmetic tasks because code execution is deterministic and results are verified, not hallucinated.
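A minimal sketch, assuming a hypothetical `generate_code` call that returns a Python script assigning its result to a variable named `answer`:

```python
def program_of_thought(question: str, generate_code) -> object:
    """Execute the model-written script and read back its `answer` variable.

    Real systems sandbox this step; exec-ing untrusted model output
    directly is unsafe.
    """
    script = generate_code(question)  # hypothetical model call
    namespace: dict = {}
    exec(script, namespace)  # arithmetic is computed by the interpreter, not predicted
    return namespace["answer"]
```

The reliability gain comes from the last line: the numeric result is produced by the Python interpreter, so it cannot be hallucinated, only mis-derived.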
Checklist: Do You Understand This?
- Can you describe the key finding of Wei et al. (2022) in one sentence, including the magnitude of improvement on GSM8K?
- Can you explain zero-shot CoT and what the Kojima et al. result implies about where CoT capability is stored?
- Can you articulate all three mechanistic hypotheses for why CoT works, and explain why they are complementary rather than competing?
- Can you explain the faithfulness problem and provide one piece of evidence on each side of the debate?
- Can you explain why CoT benefits are emergent with scale, and what this means for using CoT with 7B open models?
- Can you describe the self-consistency algorithm and explain the inference cost tradeoff it introduces?
- Can you name at least two task types where CoT helps significantly and two where it provides no benefit?
- Can you explain how Tree of Thoughts extends standard linear CoT, and why it enables backtracking?