Coding Agent Architecture
A coding agent is an LLM that can read a codebase, write and edit files, execute code, run tests, and iterate until a task is complete: autonomously, within a sandbox, and with a human checkpoint before anything is committed. This architecture underpins Claude Code, GitHub Copilot Agent mode, and custom internal coding agents.
System Architecture
The human review gate is non-negotiable: the agent never commits autonomously.
Codebase Context: The Hardest Problem
A coding agent needs to understand a codebase that is often much larger than the LLM's context window. Naively dumping all files into context is expensive, slow, and often counterproductive (the model gets confused by irrelevant code). Production coding agents use a structured context strategy:
Repository map
A compact summary of the repository structure (files, classes, functions, and their signatures) without the full implementation bodies. This fits in a few thousand tokens and gives the agent a map of what exists and where. Tools like tree-sitter parse code structure statically; aider's repo-map algorithm selects the most relevant symbols based on the current task.
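A minimal repository-map extractor can be sketched with the standard-library ast module, keeping only signatures and dropping function bodies. The function and class names below are illustrative, not taken from any specific tool:

```python
# Repo-map sketch: extract compact signature lines from Python source
# using ast, without including any implementation bodies.
import ast

def file_signatures(source: str) -> list[str]:
    """Return signature lines for top-level classes and functions."""
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    args = ", ".join(a.arg for a in item.args.args)
                    lines.append(f"    def {item.name}({args})")
    return lines

source = "class Cart:\n    def add(self, item, qty=1):\n        self.items.append((item, qty))\n"
print(file_signatures(source))  # ['class Cart', '    def add(self, item, qty)']
```

Running this over every file and concatenating the results yields a map small enough to keep in context permanently; production tools additionally rank symbols by relevance to the current task.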
Relevant file retrieval
Given the repository map and the task, the agent selects which files to read in full. Retrieval strategies used in practice:
- Semantic search over code: embed all functions/classes, retrieve by similarity to the task description
- Symbol search: grep for specific class/function names mentioned in the task
- Dependency tracing: given a file to modify, find all files that import it (impact analysis)
- Test file pairing β automatically include test files alongside the implementation files they test
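The dependency-tracing strategy can be sketched by parsing import statements; a file whose imports reference the modified module is part of the impact set. The module names below are illustrative:

```python
# Dependency-tracing sketch: detect whether a source file imports a
# given top-level module, using ast rather than fragile regex matching.
import ast

def imports_module(source: str, module: str) -> bool:
    """True if source has `import module...` or `from module... import ...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == module for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] == module:
                return True
    return False

print(imports_module("from billing.invoices import render", "billing"))  # True
print(imports_module("import os, sys", "billing"))                       # False
```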
Context window management
Read only what is needed. Use the read_file tool rather than pre-loading everything. The agent can request additional files mid-task as it discovers dependencies. Cache file contents within a session to avoid re-reading unchanged files. For very large files, read specific line ranges rather than the full file.
The Tool Set
A minimal but sufficient tool set for a coding agent covers five categories (file reading, file editing, command execution, testing, and search/navigation):
| Tool | What it does | Key parameters |
|---|---|---|
| read_file | Read a file's content (all or a line range) | path, start_line, end_line |
| write_file | Write or overwrite a file with new content | path, content |
| edit_file | Apply a targeted diff/patch to a file (safer than a full write) | path, old_content, new_content |
| run_command | Execute a shell command and return stdout/stderr | command, timeout, working_dir |
| run_tests | Run the test suite (or a subset) and return results | test_path, filter |
| search_code | Grep/ripgrep for a pattern across the codebase | pattern, file_glob, context_lines |
| list_directory | List files in a directory with metadata | path, recursive |
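Tools like these are typically declared as JSON-schema definitions for a function-calling LLM API. The exact wrapper format varies by provider; the shape below is illustrative, showing two of the tools from the table:

```python
# Sketch of tool declarations as JSON-schema objects. Field names such as
# "input_schema" follow one common provider convention; others differ.
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a file's content, optionally a line range.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "start_line": {"type": "integer"},
                "end_line": {"type": "integer"},
            },
            "required": ["path"],
        },
    },
    {
        "name": "edit_file",
        "description": "Replace old_content with new_content in a file.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old_content": {"type": "string"},
                "new_content": {"type": "string"},
            },
            "required": ["path", "old_content", "new_content"],
        },
    },
]
```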
Security note: The run_command tool is the most dangerous. Always run it inside a sandboxed container with no network access and no access to production credentials. The container should be ephemeral: created fresh per session and destroyed afterwards. Never run this tool in the user's live environment without explicit sandboxing.
Planning: Task Decomposition
Before writing any code, a well-designed coding agent produces a plan:
- Read and understand the task (issue description, failing test, feature request)
- Explore the repository map and retrieve relevant files
- Identify which files need to be created or modified
- Identify which tests need to pass (or be written first in TDD mode)
- Decompose into ordered implementation steps
- Optionally, present the plan to the user before execution (confirm gate)
The plan is kept in the agent's working memory (typically as a list in the system prompt or conversation context) and updated as steps complete. This prevents the agent from losing track on long multi-file tasks.
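The working-memory plan can be as simple as an ordered list of steps with a status flag, re-rendered into the prompt each turn. The field names and status markers below are illustrative:

```python
# Sketch of a plan held in working memory: ordered steps with status,
# rendered as a checklist the agent sees (and updates) every turn.
from dataclasses import dataclass, field

@dataclass
class Step:
    description: str
    status: str = "pending"   # pending | in_progress | done

@dataclass
class Plan:
    steps: list[Step] = field(default_factory=list)

    def render(self) -> str:
        marks = {"pending": "[ ]", "in_progress": "[~]", "done": "[x]"}
        return "\n".join(f"{marks[s.status]} {s.description}" for s in self.steps)

plan = Plan([Step("write failing test", "done"),
             Step("implement fix", "in_progress")])
print(plan.render())
# [x] write failing test
# [~] implement fix
```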
Execution Loop
The core ReAct-style execution loop iterates until all planned steps complete or the agent decides it cannot proceed:
Set a maximum step limit (e.g., 50 tool calls) to prevent runaway loops. If the agent hits the limit without succeeding, it should return a partial result and a status summary rather than silently failing.
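The loop with its step limit can be sketched as follows; `call_llm` and `execute_tool` are placeholders for the model call and tool dispatch, and the message format is illustrative:

```python
# ReAct-style execution loop with a hard step limit. On limit hit the
# agent returns a partial result and status instead of failing silently.
MAX_STEPS = 50

def run_agent(task: str, call_llm, execute_tool) -> dict:
    messages = [{"role": "user", "content": task}]
    for step in range(MAX_STEPS):
        reply = call_llm(messages)              # model picks the next action
        if reply.get("done"):                   # model signals completion
            return {"status": "complete", "steps": step + 1,
                    "result": reply["content"]}
        observation = execute_tool(reply["tool"], reply["args"])
        messages.append({"role": "assistant", "content": str(reply)})
        messages.append({"role": "tool", "content": observation})
    return {"status": "step_limit_reached", "steps": MAX_STEPS,
            "result": "partial; see conversation log"}
```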
Sandboxed Execution
All file writes and command executions must happen inside an isolated environment:
Docker container (recommended)
- No network access (--network none)
- Read-only mounts for source code; writable /workspace
- Resource limits (CPU, memory, disk I/O)
- Non-root user inside container
- Time limit per command (30s default)
- Container destroyed after session ends
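The constraints above translate into a `docker run` invocation; a sketch that builds the argv (image name and resource limits are illustrative):

```python
# Build the `docker run` argv for an ephemeral sandbox container
# matching the constraints above. Not executed here; pass the result
# to subprocess.run when launching the session sandbox.
def sandbox_argv(session_id: str, repo_dir: str) -> list[str]:
    return [
        "docker", "run", "-d", "--name", f"agent-sandbox-{session_id}",
        "--network", "none",               # no network access
        "--memory", "2g", "--cpus", "1",   # resource limits
        "--user", "1000:1000",             # non-root user inside container
        "-v", f"{repo_dir}:/src:ro",       # read-only source mount
        "--mount", "type=tmpfs,destination=/workspace",  # writable workspace
        "sandbox-image:latest", "sleep", "infinity",
    ]
```

The per-command 30 s time limit is enforced by the run_command tool itself; container teardown (`docker rm -f`) happens when the session ends.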
E2B cloud sandbox
- Managed sandbox-as-a-service
- Instant spin-up (<150ms)
- Python, Node, and custom environments
- Filesystem persistence across calls within a session
- Automatic cleanup on timeout
- Good for cloud-hosted coding agents
The sandbox must contain the same language runtime, dependencies, and toolchain as the target environment. This is where "it works in the sandbox but fails in production" bugs come from; invest in keeping the sandbox environment aligned with CI/CD.
Test-Driven Development Mode
The most reliable coding agent workflow follows TDD: write failing tests first, then implement until the tests pass. This gives the agent an unambiguous success criterion:
- Agent reads the task and writes test cases that would verify the correct implementation
- Runs the tests; all should fail (red)
- Implements the feature/fix iteratively
- Runs the tests after each implementation step; success when all pass (green)
- Optional: refactor while keeping tests green
Without a clear success criterion (tests passing), the agent has no reliable way to know when it is done and may keep making unnecessary changes or stop too early.
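The red-green loop can be sketched as follows; `implement_next_change` stands in for the agent's edit step, and the round limit is an illustrative safeguard:

```python
# TDD loop sketch: verify the new tests fail first (red), then iterate
# implementation until the test command exits 0 (green).
import subprocess

def tdd_loop(test_cmd: list[str], implement_next_change,
             max_rounds: int = 10) -> bool:
    # Red: freshly written tests must fail before any implementation,
    # otherwise they are vacuous and prove nothing.
    if subprocess.run(test_cmd, capture_output=True).returncode == 0:
        raise RuntimeError("tests pass before implementation")
    for _ in range(max_rounds):
        implement_next_change()
        # Green: done as soon as the whole suite passes
        if subprocess.run(test_cmd, capture_output=True).returncode == 0:
            return True
    return False
```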
Human-in-the-Loop: The Review Gate
A coding agent should never commit code autonomously. The human review gate is a non-negotiable checkpoint:
- Agent completes implementation; all tests pass in the sandbox
- Agent generates a diff/PR summary: what changed, why, and which tests validate it
- Human reviews the diff in their IDE or PR interface
- Human approves, requests changes, or rejects
- If changes requested, agent resumes the execution loop with the feedback
- Only on human approval does the agent commit/push or open a PR
This gate is where hallucinated logic, subtle security issues, and unintended side effects get caught. The agent handles the mechanical work; the human maintains ownership of what enters the codebase.
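The gate itself is a small piece of control flow: the agent blocks on an explicit human decision and only commits on approval. `ask_human` is a placeholder for the IDE or PR-review interface:

```python
# Review-gate sketch: nothing is committed without explicit approval.
# "revise" feeds the human's feedback back into the execution loop.
def review_gate(summary: str, diff: str, ask_human) -> str:
    decision = ask_human(f"{summary}\n\n{diff}")  # approve | revise | reject
    if decision == "approve":
        return "committed"   # only now commit/push or open a PR
    if decision == "revise":
        return "revise"      # agent resumes the loop with feedback
    return "rejected"
```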
SWE-bench: Measuring Coding Agent Quality
SWE-bench (Software Engineering Benchmark) is the standard evaluation for coding agents. It consists of 2,294 real GitHub issues from popular open-source Python repositories (scikit-learn, matplotlib, Django, etc.). For each issue, the agent must produce a patch that makes the failing test pass on the original repository.
| Agent | SWE-bench Verified score (approx. 2025) | Notes |
|---|---|---|
| Claude Code (Anthropic) | ~70β72% | Highest published as of mid-2025; uses full tool loop |
| o3 + Claude scaffold | ~69% | Strong reasoning + tool use combination |
| GitHub Copilot Agent | ~55β60% | IDE-integrated; good for in-context editing tasks |
| GPT-4o (standard) | ~33β38% | Without reasoning; general-purpose model |
SWE-bench Verified (a curated subset of ~500 high-quality issues) is the more reliable metric; the full SWE-bench set has many ambiguous or underspecified issues. Use SWE-bench Verified scores when comparing coding agent claims.
Framework Choices
| Tool | Best for | Key constraint |
|---|---|---|
| Claude Code (Anthropic) | Terminal-based autonomous coding; highest SWE-bench performance | Requires Claude API; terminal workflow |
| GitHub Copilot Agent | IDE-integrated; PR workflow; team environments already on GitHub | GitHub ecosystem only; moderate autonomy |
| Cursor | Interactive coding with agent assist; best UX for interactive editing | Primarily interactive, not fully autonomous |
| Custom (LangGraph + tools) | Full control; custom sandboxes; enterprise requirements | High implementation effort; maintain yourself |
| Aider | Open-source; multi-model support; repo-map context | CLI-only; less mature than Claude Code for complex tasks |
Checklist: Do You Understand This?
- Why can't a coding agent simply load all files in a repository into context?
- What is a repository map and how does it help with codebase context?
- Why must the run_command tool always execute inside a sandboxed container?
- What is the TDD workflow for a coding agent, and why does it produce more reliable results?
- What happens at the human-in-the-loop review gate, and why is autonomous commit never acceptable?
- What does SWE-bench Verified measure, and what score represents state-of-the-art as of 2025?
- What should the agent do when it hits its maximum step limit without completing the task?