Chain-of-Thought — Why It Works
Chain-of-thought prompting is one of the most practically significant discoveries in the history of large language models. A small change — asking the model to show its reasoning steps — produces dramatic improvements on tasks where direct prompting fails. This page explains what the research actually found, the mechanistic hypotheses for why it works, and the important caveats around faithfulness and scale.
The Original Finding
In 2022, Wei et al. published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," one of the most cited papers in NLP. The setup was simple: instead of giving a model few-shot examples that jump directly from question to answer, the authors included intermediate reasoning steps in each example. The effect was striking. On grade-school math benchmarks like GSM8K, a 540B parameter model with chain-of-thought examples solved 57% of problems — compared to under 20% with standard few-shot prompting. On symbolic reasoning tasks, the improvement was even larger.
The same year, Kojima et al. discovered that you do not need to write out worked examples at all. Simply appending the phrase "Let's think step by step" to a prompt — a technique called zero-shot chain-of-thought — is enough to trigger reasoning behavior in sufficiently large models. This zero-shot approach achieved around 78% on MultiArith versus 18% for standard zero-shot prompting. Five words appended to a prompt more than quadrupled accuracy on a mathematical reasoning benchmark.
| Technique | Paper | What changes in the prompt | Key result |
|---|---|---|---|
| Few-shot CoT | Wei et al. (2022) | Examples include intermediate reasoning steps | GSM8K: ~17% → 57% (PaLM 540B) |
| Zero-shot CoT | Kojima et al. (2022) | Append "Let's think step by step" | MultiArith: 18% → 78% (InstructGPT 175B) |
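In code, the difference between the two techniques is purely a matter of prompt construction. A minimal sketch — the worked example is written in the style of Wei et al.'s prompts, and the helper names are illustrative, not from either paper:

```python
# One worked example whose answer spells out intermediate steps,
# in the style of Wei et al.'s few-shot CoT prompts.
FEW_SHOT_COT_EXAMPLE = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def few_shot_cot_prompt(question: str) -> str:
    """Few-shot CoT: prepend examples that include reasoning steps."""
    return f"{FEW_SHOT_COT_EXAMPLE}Q: {question}\nA:"

def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: no examples, just the Kojima et al. trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."
```

Everything else — model, decoding settings, answer extraction — stays the same; only the prompt string changes.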
What CoT Actually Does: Three Hypotheses
Why does generating intermediate text help? There is no single agreed-upon answer, but three mechanistic hypotheses have significant research support:
1. Intermediate tokens as working memory
Autoregressive generation is sequential — each token is predicted from all previous tokens. When a model writes intermediate reasoning steps, those tokens become part of the context that subsequent tokens are conditioned on. This effectively gives the model external scratch space. The model does not need to compress a multi-step computation into a single forward pass through its weights; instead it can offload intermediate results to the token stream, then read them back when computing later steps.
2. Increased effective computation per input
A transformer applies a roughly fixed amount of computation per generated token. When a model generates a 200-token chain of thought before producing an answer, it has run roughly 200× as many forward passes over the problem as it would have for a one-token answer. The total compute budget for answering is therefore proportional to output length, and CoT deliberately increases that length. This reframes CoT as a way to dynamically allocate inference compute.
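As a back-of-envelope illustration: a common approximation is ~2 × parameter-count FLOPs per generated token (ignoring the attention term, which grows with context length). The model size below is arbitrary, chosen only to make the ratio concrete:

```python
# Back-of-envelope: decoding FLOPs scale linearly with tokens generated.
# Rule of thumb: ~2 * n_params FLOPs per token (attention cost ignored).

def decode_flops(n_params: float, n_output_tokens: int) -> float:
    return 2 * n_params * n_output_tokens

n_params = 70e9  # hypothetical 70B-parameter model
direct = decode_flops(n_params, 1)    # one-token direct answer
cot = decode_flops(n_params, 200)     # 200-token chain of thought
print(cot / direct)  # → 200.0 — the compute ratio equals the token ratio
```

The approximation is crude, but the linear relationship it illustrates is the point: the chain's length is the compute budget.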
3. Commitment and constraint propagation
When a model commits to a reasoning step early — "The total number of apples is 12" — that statement becomes a hard constraint on subsequent generation. The model cannot easily contradict itself mid-chain without producing incoherent text. This forces a degree of internal consistency that direct answer generation lacks. Committing to sub-steps progressively narrows the space of valid continuations, guiding the model toward a consistent final answer.
These hypotheses are complementary rather than competing. The actual mechanism likely involves all three operating simultaneously, which is why CoT improvements are so consistent across different model families and task types.
The Faithfulness Question
A deeper question: do the intermediate reasoning steps actually cause the final answer, or are they elaborate post-hoc rationalizations that happen to look plausible? This is the faithfulness problem, and the evidence is genuinely mixed.
| Evidence for faithfulness | Evidence against faithfulness |
|---|---|
| Removing or masking the scratchpad significantly hurts performance — if the steps were post-hoc they should not matter | Scratchpad tokens can sometimes be replaced with incorrect steps without changing the final answer |
| Introducing deliberate errors into the chain causes the final answer to change in predictable ways | Models produce correct answers with logically invalid intermediate steps in some cases |
| Models asked to find errors in their own chains can sometimes locate them, suggesting the chain is actively used | The model's natural-language explanation of its reasoning can diverge from which internal components are actually active |
| Process reward models that evaluate individual steps show that step quality correlates with final answer quality | Human-interpretable reasoning may be a learned output style that partially decouples from internal computation |
The practical takeaway: CoT reasoning steps are partially faithful — they matter and influence the output, but they are not a transparent window into the model's computation. This matters for interpretability: you cannot fully trust a chain-of-thought trace as an explanation of why the model reached a particular conclusion.
Emergence and Scale
One of the most important findings in the Wei et al. paper: chain-of-thought benefits are emergent with model scale. For models below roughly 100 billion parameters, CoT prompting either provides no improvement or actively hurts performance. Small models generate incorrect reasoning chains — plausible-sounding but wrong — which then mislead the final answer. The capability appears relatively suddenly at large scale, which is why CoT was initially missed: researchers testing on smaller models would have seen no effect.
This has a practical implication for 2025. Many widely-used open models (7B, 13B, 34B) sit below the threshold where CoT reliably helps on genuine multi-step problems. Instruction-tuning and reinforcement learning from human feedback can shift this threshold somewhat — a carefully fine-tuned 7B reasoning model can outperform a raw 7B base model on CoT tasks — but the fundamental scaling relationship remains. Models trained specifically for reasoning (like DeepSeek-R1 distillations) are a partial exception, but they were fine-tuned on CoT data generated by much larger models.
Self-Consistency: Sampling Multiple Chains
Wang et al. (2022) introduced a simple but powerful extension to CoT: instead of generating one reasoning chain and taking its answer, generate N independent chains (by sampling with temperature > 0) and take a majority vote on the final answer. This is called self-consistency.
The logic is that different reasoning paths reach the same correct answer via different routes, while errors tend to produce diverse wrong answers. Aggregating across many paths substantially improves accuracy. On GSM8K, self-consistency with 40 samples pushed PaLM 540B from 57% (single CoT) to 74%. The cost is linear in the number of samples — 40 samples costs 40× more inference compute. This is an early, explicit instance of test-time compute scaling.
Self-consistency algorithm
- Generate N independent completions for the same prompt (temperature > 0 for diversity)
- Extract the final answer from each completion
- Take a majority vote across all N final answers
- Return the most common answer
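The steps above can be sketched as a single function; `sample_chain` and `extract_answer` are hypothetical stand-ins for the sampling API call and a task-specific answer parser:

```python
from collections import Counter
from typing import Callable

def self_consistency(
    prompt: str,
    sample_chain: Callable[[str], str],    # one completion, temperature > 0
    extract_answer: Callable[[str], str],  # final answer from one chain
    n: int = 40,
) -> str:
    """Majority vote over the final answers of n independent chains."""
    answers = [extract_answer(sample_chain(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Each chain is sampled independently, so the n calls can run in parallel; the n× cost is in total compute, not necessarily latency.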
Typical N values: 10–40 samples. Diminishing returns appear beyond 40 in most benchmarks. Works best on tasks with discrete, verifiable answers (math, logic, classification). Does not help when the model is systematically biased — all samples will agree on the wrong answer.
When CoT Helps — And When It Does Not
| Task type | CoT benefit | Reason |
|---|---|---|
| Multi-step arithmetic | Large | Requires sequential sub-computations; errors accumulate without explicit steps |
| Logic puzzles and symbolic reasoning | Large | Many sub-steps with hard constraints; CoT enforces consistency across steps |
| Multi-hop commonsense reasoning | Moderate | Tasks requiring multiple inference steps across facts benefit most |
| Complex code generation | Moderate | Planning helps on hard problems; simple coding tasks do not benefit |
| Simple factual lookup | None / negative | Direct answer is sufficient; CoT adds cost and can introduce noise |
| Sentiment classification | None | Single-step task with no sub-structure; chain adds noise without benefit |
| Translation | Mixed | Not decomposable into sequential sub-steps in a useful way for most sentence pairs |
The practical rule: use chain-of-thought when the task has multiple steps where an error in an early step would propagate to the final answer. Skip it for tasks where the answer is directly retrievable or where the chain cannot be meaningfully structured. Every CoT token costs inference compute and adds latency — it should be deliberate, not default.
Notable Extensions
Least-to-Most Prompting
Decompose a complex problem into simpler sub-problems, solve them in order, and use prior answers to solve later sub-problems. Particularly effective on compositional generalization tasks where standard CoT still fails.
Tree of Thoughts
Instead of a single linear chain, generate a tree of partial reasoning paths and use a search procedure (BFS or DFS) to find the best path. The model evaluates candidate next steps, not just final answers, enabling backtracking when a branch leads nowhere. Precursor to process-reward-model-guided tree search.
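A minimal breadth-first sketch of the idea, assuming hypothetical `propose_steps` (generate candidate next thoughts) and `score` (evaluate a partial path) callables that would be backed by model calls in a real system:

```python
from typing import Callable, List

def tot_bfs(
    root: str,
    propose_steps: Callable[[List[str]], List[str]],  # candidate next thoughts
    score: Callable[[List[str]], float],              # value of a partial path
    depth: int = 3,
    beam: int = 2,
) -> List[str]:
    """Keep the `beam` best partial reasoning paths at each depth."""
    frontier = [[root]]
    for _ in range(depth):
        candidates = [path + [step]
                      for path in frontier
                      for step in propose_steps(path)]
        if not candidates:
            break
        # Pruning low-scoring branches is what enables backtracking:
        # a dead-end path is dropped in favor of its surviving siblings.
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)
```

The key difference from linear CoT is that partial paths are evaluated and compared, not just final answers.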
ReAct (Reason + Act)
Interleaves reasoning steps with external actions (tool calls, web searches, code execution). Each observation from an action feeds back into the reasoning chain, grounding CoT in verifiable external information and reducing hallucination on factual tasks.
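The interleaving can be sketched as a loop; `generate` and the tool registry are hypothetical stand-ins for a model call and real tools, and the `Action: tool[arg]` line format is one common convention:

```python
from typing import Callable, Dict

def react_loop(
    question: str,
    generate: Callable[[str], str],          # next model step, given transcript
    tools: Dict[str, Callable[[str], str]],  # tool name -> callable
    max_steps: int = 5,
) -> str:
    """Alternate model reasoning with tool calls until an answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(transcript)
        transcript += step + "\n"
        if step.startswith("Answer:"):
            return step.removeprefix("Answer:").strip()
        if "Action:" in step:
            # Parse "Action: tool[arg]", run the tool, feed back the result.
            action = step.split("Action:", 1)[1].strip()
            name, arg = action.split("[", 1)
            observation = tools[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return ""
```

The observation lines are what ground the chain: later reasoning steps condition on tool output rather than on the model's own guesses.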
Program of Thought
Use executable code as the reasoning chain. Instead of natural-language steps, generate a Python script and execute it. Dramatically more reliable on arithmetic tasks because code execution is deterministic and results are verified, not hallucinated.
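A minimal sketch, assuming a hypothetical `generate_code` call that returns a Python script assigning its result to a variable named `answer`:

```python
def program_of_thought(question: str, generate_code) -> object:
    """Execute the model-written script and read back its `answer` variable.

    Real systems sandbox this step; exec-ing untrusted model output
    directly is unsafe.
    """
    script = generate_code(question)  # hypothetical model call
    namespace: dict = {}
    exec(script, namespace)  # arithmetic is computed by the interpreter, not predicted
    return namespace["answer"]
```

The reliability gain comes from the last line: the numeric result is produced by the Python interpreter, so it cannot be hallucinated, only mis-derived.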
Checklist: Do You Understand This?
- Can you describe the key finding of Wei et al. (2022) in one sentence, including the magnitude of improvement on GSM8K?
- Can you explain zero-shot CoT and what the Kojima et al. result implies about where CoT capability is stored?
- Can you articulate all three mechanistic hypotheses for why CoT works, and explain why they are complementary rather than competing?
- Can you explain the faithfulness problem and provide one piece of evidence on each side of the debate?
- Can you explain why CoT benefits are emergent with scale, and what this means for using CoT with 7B open models?
- Can you describe the self-consistency algorithm and explain the inference cost tradeoff it introduces?
- Can you name at least two task types where CoT helps significantly and two where it provides no benefit?
- Can you explain how Tree of Thoughts extends standard linear CoT, and why it enables backtracking?