
How o1/o3 Reasons: What We Know

OpenAI's o-series models, o1 (September 2024) and o3 (announced December 2024), represent a different approach to improving AI capability. Rather than scaling pre-training data and parameters, they scale inference-time reasoning. This page covers what OpenAI has disclosed, what can be reliably inferred from the evidence, and what the benchmark results actually demonstrate.

o1: The First Thinking Model

OpenAI released o1 in September 2024 as the first publicly available model that uses extended internal reasoning before responding. The key visible difference from GPT-4o: o1 shows a "thinking" phase (a summarized view of its reasoning) before producing its final answer. Tasks that take GPT-4o seconds take o1 tens of seconds or longer. The tradeoff is accuracy: on hard reasoning tasks, o1 is dramatically better.

OpenAI described the training approach briefly: o1 was trained with reinforcement learning to "think" before answering. The model learns to generate useful internal reasoning chains that lead to correct answers on verifiable tasks. The thinking tokens are not shown to users in their raw form (they are summarized), and the model decides how long to think based on apparent task difficulty.

What OpenAI has officially stated about o1

  • Trained with reinforcement learning to reason before answering
  • Uses a private chain of thought (not exposed verbatim to users)
  • Learns to allocate more thinking to harder problems
  • Thinking tokens are hidden from the context window of subsequent turns (to prevent gaming)
  • Architecture details (number of parameters, exact RL algorithm) not disclosed
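The stated behavior that thinking tokens are hidden from subsequent turns can be sketched from the client's perspective. A minimal illustration, assuming a hypothetical message format in which hidden reasoning appears as its own role (this is not the real API schema):

```python
# Sketch: assembling the next turn's context the way OpenAI describes for
# o-series models, where reasoning tokens are produced (and billed) in the
# turn that generated them but never re-enter later context windows.
# The "reasoning" role below is an illustrative stand-in, not actual API shape.

def build_next_context(history):
    """Keep user and assistant messages; drop hidden reasoning entries."""
    return [m for m in history if m["role"] != "reasoning"]

history = [
    {"role": "user", "content": "Prove that 2^10 > 10^3."},
    {"role": "reasoning", "content": "(hidden chain of thought...)"},
    {"role": "assistant", "content": "2^10 = 1024 and 10^3 = 1000, so 2^10 > 10^3."},
    {"role": "user", "content": "Now compare 2^20 and 10^6."},
]

# Only the visible conversation is carried forward.
context = build_next_context(history)
```

One motivation OpenAI gives for this design is preventing the model from being gamed through its own raw reasoning traces.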

Benchmark Results: The Evidence

The benchmark improvements for o1 over GPT-4o are the most compelling public evidence for the effectiveness of the reasoning approach. These are not marginal improvements.

| Benchmark | What it tests | GPT-4o | o1 |
|---|---|---|---|
| AIME 2024 | High school math competition (AMC/AIME problems) | 13.4% | 83.3% |
| GPQA Diamond | PhD-level science questions (biology, chemistry, physics) | 56.1% | 78.3% |
| Codeforces | Competitive programming problems | ~11th percentile | 89th percentile |
| MATH | Competition math at various difficulty levels | ~76% | ~94% |
| SWE-bench Verified | Real GitHub issues requiring code patches | ~49% | ~49% |

The AIME result is particularly striking. AIME is a competition that fewer than 5% of American Mathematics Competition participants qualify for, and its problems require multi-step original mathematical reasoning. GPT-4o solves roughly 1 in 7 problems; o1 solves more than 8 in 10. That gap is direct evidence that the reasoning approach works. Note the SWE-bench result: coding on real-world repositories (as opposed to competitive programming) showed no improvement, suggesting the reasoning gains are concentrated in well-structured problem domains.

o3: The ARC-AGI Result

OpenAI announced o3 in December 2024. The most striking result: o3 at high compute settings achieved 87.5% on ARC-AGI, the Abstraction and Reasoning Corpus benchmark designed by François Chollet specifically to resist LLM solutions by requiring genuinely novel reasoning that cannot be pattern-matched from training data.

The previous AI state of the art on ARC-AGI was around 34%. Humans score around 85%. o3's 87.5% matched human performance on a benchmark specifically designed to be hard for LLMs. This was a significant result in AI capability evaluation, though it generated substantial discussion about what ARC-AGI actually measures and whether o3 truly "solved" the benchmark or exploited properties of the test set.

The ARC-AGI result: what it does and does not show

The o3 ARC-AGI result demonstrates that test-time compute scaling can produce qualitative improvements in novel reasoning tasks, not just quantitative gains on existing benchmarks. However, the "high compute" setting reportedly cost thousands of dollars per task, meaning the result is not representative of how o3 performs in practical deployment. It also remains debated whether ARC-AGI measures the kind of general fluid intelligence Chollet intended, or whether o3 exploited specific properties of the benchmark.

Inferred Architecture: What the Evidence Suggests

OpenAI has not disclosed the internal architecture of o1 or o3. Based on the o1 system card and blog posts, community analysis, and analogous open systems (particularly DeepSeek-R1), the following reconstruction is widely accepted as the most likely approach:

Extended chain-of-thought with RL training

The thinking tokens represent an extended chain of thought generated autoregressively. The model was trained with reinforcement learning to generate reasoning traces that maximize the probability of correct final answers on verifiable tasks (math problems with checkable answers, code with runnable tests). The RL training signal does not require human annotation of reasoning steps, only final answer correctness.
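This outcome-only reward can be sketched concretely. A toy version, assuming a hypothetical convention in which the model ends its trace with an "Answer:" marker (the marker and helper are invented for illustration):

```python
import re

def extract_final_answer(completion: str) -> str:
    """Pull the text after a final 'Answer:' marker (illustrative convention)."""
    match = re.search(r"Answer:\s*(.+)\s*$", completion)
    return match.group(1).strip() if match else ""

def outcome_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the final answer matches ground truth, else 0.0.
    The chain of thought in between is unconstrained and never graded."""
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

trace = "Let x = 3. Then x^2 + 1 = 10. Answer: 10"
reward = outcome_reward(trace, "10")  # 1.0: only the final answer mattered
```

The point of the sketch: the reward function never inspects the reasoning steps, so any trace that reliably lands on correct answers is reinforced, however it gets there.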

Learned compute allocation

The model appears to have learned when longer reasoning helps and when it does not. Simple factual questions receive short thinking traces; hard mathematical or logical problems receive long ones. This allocation is learned from the RL training signal, not hardcoded. The model that thinks longer on hard problems and shorter on easy ones will accumulate more reward than one that allocates compute uniformly.
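A toy calculation shows why adaptive allocation is reward-optimal. Assume reward is correctness probability minus a per-token cost (both numbers here are invented for the example; the real training signal is not public):

```python
TOKEN_COST = 0.0001  # hypothetical per-token penalty folded into the reward

def expected_reward(p_correct: float, thinking_tokens: int) -> float:
    """Invented reward shape: accuracy payoff minus compute cost."""
    return p_correct - TOKEN_COST * thinking_tokens

# Easy question: long thinking barely improves accuracy, so it is wasted cost.
easy_short = expected_reward(0.95, 100)    # 0.94
easy_long  = expected_reward(0.97, 2000)   # 0.77

# Hard question: long thinking lifts accuracy a lot, so the cost pays off.
hard_short = expected_reward(0.20, 100)    # 0.19
hard_long  = expected_reward(0.80, 2000)   # 0.60

# The adaptive policy (short on easy, long on hard) beats both uniform policies.
adaptive      = easy_short + hard_long     # 1.54
uniform_short = easy_short + hard_short    # 1.13
uniform_long  = easy_long + hard_long      # 1.37
```

Under any reward of roughly this shape, a policy that matches thinking length to difficulty dominates, which is the mechanism by which the allocation behavior can emerge from RL without being hardcoded.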

Process reward model guidance (likely)

The Lightman et al. process reward model (PRM) research at OpenAI (2023), combined with the observation that o1 appears to outperform what extended CoT prompting alone would predict, suggests a process reward model is involved in training or inference-time search. This would enable the model to evaluate intermediate reasoning steps, not just final answers.
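If a PRM is involved as hypothesized, one common way to use it is best-of-n selection over sampled reasoning chains, scored step by step. A sketch with hand-made scores standing in for a learned model's outputs:

```python
def chain_score(step_scores):
    """Aggregate per-step PRM scores; min-pooling punishes any single bad step
    (one of several aggregation choices used in the PRM literature)."""
    return min(step_scores)

def select_best_chain(chains):
    """chains: list of (final_answer, [per-step scores]) pairs."""
    return max(chains, key=lambda c: chain_score(c[1]))

# Hand-made scores in place of a real PRM's per-step judgments:
candidates = [
    ("x = 4", [0.9, 0.8, 0.3]),   # one shaky step drags the whole chain down
    ("x = 7", [0.7, 0.7, 0.7]),   # consistently sound reasoning wins
]
best = select_best_chain(candidates)  # selects ("x = 7", ...)
```

The contrast with the outcome-only reward above is that a PRM can reject a chain for a single weak intermediate step even when its final answer looks plausible.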

The crucial caveat: this is inference from public evidence, not a disclosed architecture. DeepSeek-R1 has demonstrated that extended CoT with GRPO RL training alone (without a PRM) can produce comparable results. It is possible o1 uses a simpler RL approach than commonly assumed.

The Compute Settings and Cost Tradeoff

OpenAI exposes compute settings for the o-series API: low, medium, and high. These correspond to different budgets for how long the model is allowed to reason. The relationship between compute and accuracy is consistent but shows diminishing returns.

| Setting | Thinking time | Relative cost | Best use case |
|---|---|---|---|
| Low | Short (seconds) | ~2–5× standard model | Routine tasks where some reasoning is useful |
| Medium | Moderate (5–30s) | ~10–20× standard model | Most reasoning tasks; good accuracy/cost balance |
| High | Long (30s–minutes) | ~50–100× standard model | Maximum accuracy on hardest problems; not for production volume |

The cost difference between o1-high and GPT-4o is not marginal. Running o1 at high compute settings for a production application handling thousands of queries would be economically infeasible without very careful task selection. The practical recommendation: use o1 for tasks where reasoning quality clearly matters and errors have real consequences, not as a general-purpose replacement for GPT-4o.
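The multipliers make the economics easy to estimate. A back-of-envelope helper using the midpoints of the quoted ranges (illustrative numbers only, not real pricing):

```python
# Midpoints of the relative-cost ranges quoted for each setting (illustrative).
MULTIPLIER = {"low": 3.5, "medium": 15.0, "high": 75.0}

def estimated_cost(queries: int, baseline_cost_per_query: float, setting: str) -> float:
    """Rough bill for a query volume at a given compute setting, relative to
    a standard (GPT-4o-class) baseline cost per query."""
    return queries * baseline_cost_per_query * MULTIPLIER[setting]

# 10,000 queries at a $0.01 baseline:
#   low    -> roughly  $350
#   medium -> roughly  $1,500
#   high   -> roughly  $7,500
# which is why uniform high-compute production use rarely makes sense.
```

Even with invented numbers, the shape of the conclusion holds: the high setting is a tool for selected hard problems, not a default.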

Limitations and Failure Modes

Still hallucinates β€” and confidently

o1's extended reasoning does not eliminate hallucination. It can produce long, detailed, internally consistent chains of thought that lead to a confidently stated incorrect answer. Worse, the extended reasoning can make confident errors harder to detect: the answer comes with apparent justification.

Cannot verify its own reasoning errors

o1 cannot reliably catch its own mistakes by re-reading its reasoning chain. Self-checking helps sometimes, but the model that generates an incorrect chain of thought is the same model evaluating it; it tends to confirm rather than refute its own prior conclusions.

Reasoning gains are domain-specific

The dramatic improvements appear primarily in structured domains with verifiable answers: mathematics, competitive programming, formal logic. On open-ended tasks (writing, general knowledge, nuanced judgment), o1 shows more modest improvements over GPT-4o and the cost increase is rarely justified.

Thinking tokens are opaque

The reasoning trace shown to users is a summarized version, not the raw thinking tokens. This means you cannot fully audit what the model actually computed. The visible reasoning chain is itself a generated output, subject to the same faithfulness concerns as chain-of-thought prompting generally.

What "Reasoning" Actually Means Here

It is tempting to describe o1 as reasoning in the same sense humans reason: planning, exploring possibilities, checking work. The more precise description: o1 is a statistical model that has learned, through reinforcement learning, to produce extended token sequences that correlate with correct answers on hard problems. The "search" over possibilities happens in the token generation process, guided by learned policies and value estimates, not through symbolic manipulation or explicit world models.

This is not a dismissal; the practical capability is real and significant. But it means the failure modes are different from human reasoning. o1 does not get tired or distracted, but it also does not truly understand what it is doing in the way a human mathematician does. Its errors reflect the statistical nature of the underlying model, not reasoning errors in a cognitive sense.

Checklist: Do You Understand This?

  • Can you describe what o1 does differently from GPT-4o at inference time, based on what OpenAI has officially stated?
  • Can you give three specific benchmark comparisons between GPT-4o and o1, with numbers?
  • Can you explain what the ARC-AGI result demonstrates, and what it does not demonstrate?
  • Can you describe the inferred training approach for o1 (extended CoT + RL) and explain which parts are confirmed vs. inferred?
  • Can you explain why the model learns to allocate more thinking to harder problems, and how this emerges from RL training?
  • Can you describe the three compute settings exposed by the o1 API and when each is appropriate?
  • Can you name three specific limitations of o1 that distinguish it from the "perfect reasoner" framing?
  • Can you explain precisely what "reasoning" means in the context of o1, and why it differs from symbolic reasoning?