Reasoning Models Explained
Since late 2024, a new type of AI model has appeared: reasoning models. They tackle hard problems far better than standard models — but they work differently and are not always the right choice. This page explains what they do and when to use them.
What Makes a Reasoning Model Different
A standard model (like GPT-4o or Claude Sonnet) reads your message and immediately starts generating a response. It's fast but can miss nuance in complex problems.
A reasoning model pauses first. Before giving you an answer, it generates a long internal "thinking" trace — working through the problem step by step, trying different approaches, checking its own logic, and backtracking when it spots a mistake. Only after that thinking process does it produce a final answer.
Standard model
- You send a message
- Model generates tokens immediately
- Response appears (fast)
Good for: most everyday tasks, writing, editing, chat
Reasoning model
- You send a message
- Model generates internal thinking (many tokens, often hidden)
- Model checks and refines its reasoning
- Final answer appears (slower)
Good for: hard maths, complex logic, multi-step planning, debugging
The thinking is usually hidden. In ChatGPT, you sometimes see a "Thinking..." indicator. Claude shows a collapsed "thinking" section. DeepSeek-R1 shows the full reasoning trace. The internal thinking can be thousands of tokens long — even if the final answer is short.
Why This Makes Models Much Better at Hard Problems
The improvement on difficult tasks is dramatic. On AIME 2024 (a hard invitational maths competition), DeepSeek-R1 went from 15.6% to 77.9% accuracy through reinforcement learning that rewarded correct reasoning. OpenAI's o3 scores above 90% on maths benchmarks where standard, non-reasoning GPT-4-class models struggle to reach 50%.
The reasoning process helps because:
- Decomposition — Breaking a hard problem into steps makes each step easier to get right
- Self-checking — The model can catch its own mistakes mid-thought and correct them before the final answer
- Backtracking — If an approach fails, the model can try a different path
- More compute on hard parts — The model spends more tokens on the difficult parts of a problem rather than rushing through them
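The control flow described above can be sketched as a toy search: propose a step, check it, and backtrack when the check fails. This is purely an illustration of the *shape* of a reasoning trace on a small puzzle, not how any real model is implemented.

```python
# Toy illustration of decompose / self-check / backtrack control flow.
# It mimics the shape of a reasoning trace, not a real model.

def solve_with_backtracking(target, numbers, expr="", total=0):
    """Find a subset of `numbers` that sums to `target`.

    Each recursive call is one "thinking step": try an option,
    check progress, and backtrack if the path cannot work.
    """
    if total == target:                  # self-check: goal reached
        return expr
    if not numbers or total > target:    # self-check: dead end
        return None                      # signal the caller to backtrack
    head, rest = numbers[0], numbers[1:]
    # Decomposition: try including the next number first...
    found = solve_with_backtracking(target, rest, f"{expr}+{head}", total + head)
    if found:
        return found
    # ...otherwise backtrack and try the path that skips it.
    return solve_with_backtracking(target, rest, expr, total)

print(solve_with_backtracking(10, [7, 3, 5, 2]))  # → "+7+3"
```

A standard model answering in one pass is like emitting the expression directly with no dead-end check; the reasoning model's advantage is exactly the `return None` branch that lets it abandon a failing path.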
The Major Reasoning Models (2025)
| Model | Made by | Access | Thinking visible? | Adjustable? |
|---|---|---|---|---|
| o3 | OpenAI | ChatGPT Plus / API | Partial (summary) | Yes (effort level) |
| o4-mini | OpenAI | ChatGPT / API | Partial (summary) | Yes (effort level) |
| DeepSeek-R1 | DeepSeek | Free API / download locally | Yes (full trace) | No |
| Claude + Extended Thinking | Anthropic | Claude.ai Pro / API | Collapsed block | Yes (budget) |
| Gemini 2.5 Pro (thinking) | Google | Gemini Advanced / API | Partial | Yes (budget) |
The Trade-off: Better Answers, But Slower and Costlier
Reasoning models are not always better. The thinking tokens cost money and time. For a simple question like "What is the capital of France?", a reasoning model is overkill — it spends tokens thinking about something that needs no thought at all.
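A back-of-the-envelope calculation makes the cost difference concrete. The token counts and the per-token price below are made-up placeholder numbers, not real provider pricing; substitute current rates from your provider.

```python
# Back-of-the-envelope cost comparison. All numbers are illustrative
# placeholders, NOT real provider pricing.

PRICE_PER_1K_OUTPUT_TOKENS = 0.01   # assumed flat rate, in dollars

def response_cost(answer_tokens, thinking_tokens=0):
    """Thinking tokens are typically billed as output tokens."""
    total = answer_tokens + thinking_tokens
    return total * PRICE_PER_1K_OUTPUT_TOKENS / 1000

standard = response_cost(answer_tokens=200)                     # no thinking
reasoning = response_cost(answer_tokens=200, thinking_tokens=8000)

print(f"standard:  ${standard:.4f}")   # $0.0020
print(f"reasoning: ${reasoning:.4f}")  # $0.0820 — 41x more for the same answer
```

The answer the user sees is identical in length; the entire price difference comes from the hidden thinking tokens, which is why high-volume automated pipelines feel the cost first.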
Use a reasoning model for
- Hard maths or scientific calculations
- Logic puzzles and brain teasers
- Complex multi-step planning
- Debugging tricky code errors
- Analysing a complicated argument or text
- The standard model gave a wrong or inconsistent answer
Skip it and use a standard model for
- Simple factual questions
- Writing, editing, brainstorming
- Summarising documents
- Real-time chat where speed matters
- Creative writing
- High-volume automated tasks (cost adds up fast)
Adjustable Thinking: Fast vs Thorough
Some reasoning models let you control how much they think. This is sometimes called a "thinking budget" or "reasoning effort level."
- Low effort — quicker, cheaper, good for moderately hard problems
- High effort — longer thinking, better answers on the hardest problems, but noticeably slower and more expensive
o4-mini on low effort is often a good sweet spot — much smarter than a standard model at a fraction of o3's cost. Start there and upgrade only if needed.
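In practice, the effort level or budget is usually just one extra field in the API request. The sketch below builds example request payloads; the field names (`reasoning_effort` for OpenAI's o-series, `thinking.budget_tokens` for Anthropic's extended thinking) follow the providers' public docs at the time of writing, and the model IDs are examples — check current documentation before relying on either.

```python
# Example request payloads for adjustable thinking. Field names and
# model IDs follow public provider docs at the time of writing — verify
# against current documentation before use.

def openai_payload(prompt, effort="low"):
    # o-series models accept reasoning_effort: "low" | "medium" | "high"
    return {
        "model": "o4-mini",
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }

def anthropic_payload(prompt, budget_tokens=4096):
    # Extended thinking is enabled explicitly, with a token budget;
    # max_tokens must be larger than the thinking budget.
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": budget_tokens + 2048,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }
```

Note the asymmetry: OpenAI exposes a coarse three-level dial, while Anthropic takes an exact token budget, so "try low effort first" translates to a small `budget_tokens` value on the Anthropic side.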
DeepSeek-R1: Open-Weight Reasoning
When DeepSeek-R1 launched in January 2025, it shocked the AI world: it matched frontier reasoning model performance while being open-weight — meaning you can download and run it yourself for free.
Key facts about DeepSeek-R1:
- Free to use via DeepSeek's website and API (very cheap API pricing)
- Can be downloaded and run locally via Ollama on a capable home GPU
- Shows its full reasoning trace (not hidden) — good for learning and debugging
- Strong on maths and coding; somewhat weaker on long-form English writing
- Context: trained in China, so data privacy and censorship considerations apply for sensitive topics
A Simple Test: Does Thinking Help?
Not sure whether to use a reasoning model? Try this:
- Ask a standard model (e.g. GPT-4o or Claude Sonnet) your question
- If the answer looks right and you can verify it — done
- If the answer seems wrong, inconsistent, or incomplete — try o3, o4-mini, or DeepSeek-R1
- If the reasoning model also struggles, the problem may need a different approach (more context, clearer framing, or a specialist tool)
You'll quickly develop an intuition for which types of problems benefit. Anything that involves counting steps, tracking constraints, or doing calculations is a strong candidate.
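The escalation loop above can be written down as a tiny routing sketch. Here `ask` and `verify` are hypothetical stand-ins for your own API call and checking logic; the stubs in the demo exist only to show the control flow.

```python
# Sketch of the try-cheap-first, escalate-on-failure routine.
# `ask` and `verify` are hypothetical placeholders for a real API
# call and a real answer check.

def route(question, ask, verify):
    """Try a standard model first; escalate to reasoning only if the check fails."""
    for model in ("standard", "reasoning"):
        answer = ask(model, question)
        if verify(question, answer):
            return model, answer
    # Both tiers struggled: rethink the framing or use a specialist tool.
    return None, None

# Demo with stubs: the standard model "fails" on anything not marked easy.
ask = lambda model, q: "42" if (model == "reasoning" or "easy" in q) else "?"
verify = lambda q, a: a == "42"

print(route("easy question", ask, verify))   # → ('standard', '42')
print(route("hard question", ask, verify))   # → ('reasoning', '42')
```

The point of the pattern is that verification is cheap relative to reasoning tokens, so paying for the expensive model only on verified failures keeps average cost close to the standard model's.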
Checklist: Do You Understand This?
- Can you explain in plain words what a reasoning model does differently from a standard model?
- Can you name two reasoning models and who makes them?
- Name three tasks where a reasoning model is the better choice.
- Name three tasks where a standard model is fine and the reasoning model is overkill.
- What does "adjustable thinking budget" mean and why does it matter?
- What is special about DeepSeek-R1 compared to other reasoning models?