
Codex Agent

OpenAI Codex is an agentic software engineering product — not to be confused with the original 2021 Codex text-completion model that powered GitHub Copilot's first generation. The current Codex is a full software engineering agent built on GPT-5-Codex, capable of taking a task description and autonomously building, testing, debugging, and refactoring across a real codebase.

What Codex Is

Codex is OpenAI's answer to the question: what happens when you give a highly capable language model access to a real development environment and ask it to get work done? Rather than answering coding questions conversationally, Codex acts as a delegate: you hand it a task, and it executes that task across your codebase, writes and runs tests, reads error output, adjusts its approach, and reports back when done (or when it needs clarification).

The underlying model is GPT-5-Codex — a variant of GPT-5 specifically optimised for software engineering tasks. It is tuned for code comprehension across entire repositories, test writing, multi-file refactoring, and interpreting compiler and test output.

Capabilities

What Codex Can Do

  • Build full projects from a written specification
  • Add features to an existing codebase
  • Write, run, and fix failing unit tests
  • Debug issues by reading stack traces and runtime errors
  • Perform large-scale refactors across many files simultaneously
  • Conduct code reviews and produce structured feedback
  • Handle long sessions without losing context
  • Reason across web sources, cloud environments, and IDE context

Task Execution Model

You give Codex a task — written in natural language — and it executes autonomously. You can interrupt mid-task and redirect it if you want to change the approach. It maintains context over long sessions and can handle tasks that span hours of real development work. It is designed to be used like a junior engineer you can delegate to: you describe what you want, it does the work, you review the output.
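The execute-test-fix loop described above can be sketched in a few lines. This is an illustrative toy, not Codex's actual implementation; `run_checks`, `agent_loop`, and the candidate patches are invented for the example.

```python
# Hypothetical sketch of an agentic execute-test-fix loop (NOT Codex's real
# internals): the agent applies a candidate patch, runs the checks, reads the
# failure message, and retries until the checks pass or attempts run out.

def run_checks(code):
    """Stand-in for a test runner: returns (passed, error_message)."""
    env = {}
    try:
        exec(code, env)
        assert env["add"](2, 3) == 5, "add(2, 3) should be 5"
        return True, ""
    except AssertionError as e:
        return False, str(e)

def agent_loop(candidate_patches, max_attempts=3):
    """Try candidate patches in order, stopping at the first that passes."""
    for attempt, code in enumerate(candidate_patches[:max_attempts], start=1):
        passed, error = run_checks(code)
        if passed:
            return f"done after {attempt} attempt(s)"
        # A real agent would feed `error` back into the model here
        # to generate the next candidate patch.
    return "needs clarification"

buggy = "def add(a, b):\n    return a - b"   # first attempt fails the checks
fixed = "def add(a, b):\n    return a + b"   # revised attempt passes

print(agent_loop([buggy, fixed]))  # done after 2 attempt(s)
```

The key property this sketch shares with the real agent is the feedback loop: test output flows back into the next attempt, rather than the user relaying errors by hand.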

Codex vs ChatGPT Coding

The distinction is important for understanding when to use each:

| Dimension | ChatGPT (coding questions) | Codex Agent |
| --- | --- | --- |
| Interaction model | Conversational — ask, receive answer | Delegation — assign task, agent executes |
| Code execution | Sandbox only (Code Interpreter) | Real codebase, real environment |
| Scope | Single function, snippet, explanation | Entire feature, refactor, project build |
| Test integration | Writes tests, does not run them | Writes and runs tests, fixes failures |
| Session length | Short turn-by-turn | Long autonomous sessions |

Performance Benchmarks

As of early 2026, GPT-5.3-Codex leads Terminal-Bench 2.0 at 77.3%, compared to Claude Code's 65.4%. On SWE-Bench Pro (real-world GitHub issue resolution), it achieves approximately 56.8%. These are among the highest scores recorded on agentic coding benchmarks at the time of writing.
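To make the Terminal-Bench 2.0 margin explicit, here is the arithmetic on the two scores quoted above (a calculation on the reported numbers, not an additional measurement):

```python
# Gap between the Terminal-Bench 2.0 scores quoted above (early 2026).
codex_score = 77.3        # GPT-5.3-Codex, as reported
claude_code_score = 65.4  # Claude Code, as reported

absolute_gap = round(codex_score - claude_code_score, 1)
relative_gap = round(absolute_gap / claude_code_score * 100, 1)

print(absolute_gap)  # 11.9  (percentage points)
print(relative_gap)  # 18.2  (% higher relative to Claude Code's score)
```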

Benchmark scores should be treated as directional indicators rather than guarantees — performance varies significantly by codebase, language, and task type.

Access and Availability

Codex is included with ChatGPT Plus, Pro, Business, Edu, and Enterprise plans — there is no separate pricing or add-on required. It is accessible from:

  • chatgpt.com: Via the web interface with a dedicated Codex tab
  • VS Code extension: Integrated into the editor for in-IDE task delegation
  • Codex CLI: Terminal-based access (see the Codex CLI page for details)

Checklist

  • What is the underlying model that powers Codex Agent, and what is it optimised for?
  • How does the interaction model of Codex differ from using ChatGPT for coding help?
  • What can Codex do with tests that ChatGPT coding mode cannot?
  • On which ChatGPT plans is Codex included?
  • What does Codex's Terminal-Bench 2.0 score indicate relative to competitors?