
Coding Agent Architecture

A coding agent is an LLM that can read a codebase, write and edit files, execute code, run tests, and iterate until a task is complete: autonomously, within a sandbox, and with a human checkpoint before anything is committed. This architecture underpins Claude Code, GitHub Copilot Agent mode, and custom internal coding agents.

System Architecture

  • Input: Task Input (issue / ticket / natural language)
  • Context: Repository Map (symbols + signatures via tree-sitter); File Retrieval (semantic + symbol + dependency search)
  • Agent Core: LLM Planner (task decomposition + action selection); Tool Loop (read / write / run / test / search)
  • Execution (sandboxed): Docker Container (no network, ephemeral, resource-limited); Test Runner (pytest / jest / cargo test)
  • Review Gate: Human Review (diff + PR summary before any commit)

The human review gate is non-negotiable: the agent never commits autonomously.

Codebase Context: The Hardest Problem

A coding agent needs to understand a codebase that is often much larger than the LLM's context window. Naively dumping all files into context is expensive, slow, and often counterproductive (the model gets confused by irrelevant code). Production coding agents use a structured context strategy:

Repository map

A compact summary of the repository structure (files, classes, functions, and their signatures) without the full implementation body. This fits in a few thousand tokens and gives the agent a map of what exists and where. Tools like tree-sitter parse code structure statically; aider's repo-map algorithm selects the most relevant symbols based on the current task.
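
A minimal sketch of building such a map for a Python codebase, using the standard-library ast module as a stand-in for tree-sitter; the function names and output format here are illustrative, not aider's actual algorithm.

```python
import ast
from pathlib import Path

def file_signatures(path: Path) -> list[str]:
    """Extract top-level class and function signatures from one Python file."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    sigs = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"  def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            sigs.append(f"  class {node.name}")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    args = ", ".join(a.arg for a in item.args.args)
                    sigs.append(f"    def {item.name}({args})")
    return sigs

def repo_map(root: str) -> str:
    """Compact signatures-only view of the repository, small enough for the prompt."""
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        sigs = file_signatures(path)
        if sigs:
            lines.append(str(path.relative_to(root)))
            lines.extend(sigs)
    return "\n".join(lines)

print(repo_map("."))
```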

Relevant file retrieval

Given the repository map and the task, the agent selects which files to read in full. Retrieval strategies used in practice (a sketch of two of them follows the list):

  • Semantic search over code: embed all functions/classes, retrieve by similarity to the task description
  • Symbol search: grep for specific class/function names mentioned in the task
  • Dependency tracing: given a file to modify, find all files that import it (impact analysis)
  • Test file pairing: automatically include test files alongside the implementation files they test
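
A minimal sketch of the symbol-search and dependency-tracing strategies using plain text matching; the helper names are hypothetical and a production agent would typically shell out to ripgrep instead.

```python
import re
from pathlib import Path

def symbol_search(root: str, symbol: str) -> list[str]:
    """Files whose text mentions a class/function name taken from the task (grep-style)."""
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    return [
        str(p) for p in Path(root).rglob("*.py")
        if pattern.search(p.read_text(encoding="utf-8", errors="ignore"))
    ]

def importers_of(root: str, module: str) -> list[str]:
    """Files that import the given module: crude impact analysis for a planned edit."""
    pattern = re.compile(
        rf"^\s*(from\s+{re.escape(module)}\b|import\s+{re.escape(module)}\b)", re.M
    )
    return [
        str(p) for p in Path(root).rglob("*.py")
        if pattern.search(p.read_text(encoding="utf-8", errors="ignore"))
    ]

# e.g. symbol_search("src", "UserRepository"); importers_of("src", "billing")
```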

Context window management

Read only what is needed. Use the read_file tool rather than pre-loading everything. The agent can request additional files mid-task as it discovers dependencies. Cache file contents within a session to avoid re-reading unchanged files. For very large files, read specific line ranges rather than the full file.
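
A minimal sketch of a read_file tool with line-range support and a per-session cache; the signature mirrors the tool table below but is illustrative, not any specific framework's API.

```python
from pathlib import Path

_session_cache: dict[str, list[str]] = {}  # path -> lines, valid for one agent session

def read_file(path: str, start_line: int | None = None, end_line: int | None = None) -> str:
    """Return a file's content (or a 1-indexed line range), caching reads within the session."""
    if path not in _session_cache:
        _session_cache[path] = Path(path).read_text(encoding="utf-8").splitlines(keepends=True)
    lines = _session_cache[path]
    if start_line is None:
        return "".join(lines)
    return "".join(lines[start_line - 1 : end_line])
```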

The Tool Set

A minimal but sufficient tool set for a coding agent covers five categories:

  • read_file: read a file's content (all or a line range). Parameters: path, start_line, end_line
  • write_file: write or overwrite a file with new content. Parameters: path, content
  • edit_file: apply a targeted diff/patch to a file (safer than a full write). Parameters: path, old_content, new_content
  • run_command: execute a shell command and return stdout/stderr. Parameters: command, timeout, working_dir
  • run_tests: run the test suite (or a subset) and return results. Parameters: test_path, filter
  • search_code: grep/ripgrep for a pattern across the codebase. Parameters: pattern, file_glob, context_lines
  • list_directory: list files in a directory with metadata. Parameters: path, recursive
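
A sketch of how two of these tools might be declared for a tool-calling model, using JSON-Schema-style parameter blocks; the exact wire format varies by provider, so treat the field names as an assumption to adapt.

```python
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a file's content, optionally restricted to a line range.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "start_line": {"type": "integer"},
                "end_line": {"type": "integer"},
            },
            "required": ["path"],
        },
    },
    {
        "name": "run_tests",
        "description": "Run the test suite (or a filtered subset) and return the results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "test_path": {"type": "string"},
                "filter": {"type": "string"},
            },
            "required": [],
        },
    },
]
```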

Security note: The run_command tool is the most dangerous. Always run it inside a sandboxed container with no network access and no access to production credentials. The container should be ephemeral: created fresh per session and destroyed after. Never run this tool in the user's live environment without explicit sandboxing.

Planning: Task Decomposition

Before writing any code, a well-designed coding agent produces a plan:

  1. Read and understand the task (issue description, failing test, feature request)
  2. Explore the repository map and retrieve relevant files
  3. Identify which files need to be created or modified
  4. Identify which tests need to pass (or be written first in TDD mode)
  5. Decompose into ordered implementation steps
  6. Optionally, present the plan to the user before execution (confirm gate)

The plan is kept in the agent's working memory (typically as a list in the system prompt or conversation context) and updated as steps complete. This prevents the agent from losing track on long multi-file tasks.
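
A minimal sketch of such a working-memory plan, re-rendered into the prompt each turn; the structure is illustrative rather than any specific framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    description: str         # e.g. "add retry logic to http_client.request" (hypothetical step)
    status: str = "pending"  # pending | in_progress | done | blocked

@dataclass
class Plan:
    steps: list[PlanStep] = field(default_factory=list)

    def next_step(self) -> PlanStep | None:
        return next((s for s in self.steps if s.status != "done"), None)

    def render(self) -> str:
        """Rendered into the system prompt every turn so the agent keeps its place."""
        return "\n".join(f"[{s.status}] {s.description}" for s in self.steps)
```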

Execution Loop

The core ReAct-style execution loop iterates until all planned steps complete or the agent decides it cannot proceed:

  1. Think: what is the next action to take?
  2. Act: call a tool (read / write / run / test / search)
  3. Observe: receive and parse the tool output
  4. Evaluate: are the tests passing? Did the action succeed?
  5. Update plan: mark the step complete or add correction steps
  6. Exit or escalate: if done, hand off to human review; if stuck three times on the same step, surface to the human


Set a maximum step limit (e.g., 50 tool calls) to prevent runaway loops. If the agent hits the limit without succeeding, it should return a partial result and a status summary rather than silently failing.
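
A minimal sketch of this loop with a hard step cap; call_llm and dispatch_tool are caller-supplied, hypothetical stand-ins for the model call and the sandboxed tool execution.

```python
MAX_STEPS = 50  # hard cap on tool calls to prevent runaway loops

def run_agent(task: str, call_llm, dispatch_tool) -> dict:
    """ReAct loop: think -> act -> observe -> evaluate, until done or the step limit.

    call_llm(history) returns {"type": "tool"|"finish", ...}; dispatch_tool(action)
    returns the tool output as a string. Both are hypothetical stand-ins.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        action = call_llm(history)                 # Think: choose the next action
        if action["type"] == "finish":             # Exit: hand off to human review
            return {"status": "done", "summary": action.get("summary", "")}
        observation = dispatch_tool(action)        # Act: run the tool in the sandbox
        history.append({"role": "assistant", "content": repr(action)})
        history.append({"role": "tool", "content": observation})  # Observe
    # Step limit hit: return a partial result and a status summary, never fail silently
    return {"status": "incomplete", "summary": "max steps reached", "history": history}
```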

Sandboxed Execution

All file writes and command executions must happen inside an isolated environment:

Docker container (recommended)

  • No network access (--network none)
  • Read-only mounts for source code; writable /workspace
  • Resource limits (CPU, memory, disk I/O)
  • Non-root user inside container
  • Time limit per command (30s default)
  • Container destroyed after session ends

E2B cloud sandbox

  • Managed sandbox-as-a-service
  • Instant spin-up (<150ms)
  • Python, Node, and custom environments
  • Filesystem persistence across calls within a session
  • Automatic cleanup on timeout
  • Good for cloud-hosted coding agents

The sandbox must contain the same language runtime, dependencies, and toolchain as the target environment. This is where "it works in sandbox but fails in production" bugs come from; invest in keeping the sandbox environment aligned with CI/CD.
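
A sketch of launching a container with the Docker constraints listed above, driven from Python; it assumes Docker is installed, and the image name and mount paths are placeholders.

```python
import subprocess

def start_sandbox(repo_dir: str, image: str = "coding-agent-sandbox:latest") -> str:
    """Start an ephemeral, network-isolated container and return its ID."""
    result = subprocess.run(
        [
            "docker", "run", "--detach", "--rm",
            "--network", "none",               # no network access
            "--memory", "2g", "--cpus", "2",   # resource limits
            "--user", "1000:1000",             # non-root user inside the container
            "-v", f"{repo_dir}:/src:ro",       # read-only source mount
            "-v", "/workspace",                # writable scratch volume
            image, "sleep", "infinity",
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # container ID; remove it when the session ends
```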

Test-Driven Development Mode

The most reliable coding agent workflow follows TDD: write failing tests first, then implement until the tests pass. This gives the agent an unambiguous success criterion:

  1. Agent reads the task and writes test cases that would verify the correct implementation
  2. Runs the tests; all should fail (red)
  3. Implements the feature/fix iteratively
  4. Runs the tests after each implementation step; success when all pass (green)
  5. Optional: refactor while keeping tests green

Without a clear success criterion (tests passing), the agent has no reliable way to know when it is done and may keep making unnecessary changes or stop too early.
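
A minimal sketch of the red/green check that gives the loop its success criterion, assuming a pytest project; the parameter names follow the run_tests tool above.

```python
import subprocess

def run_tests(test_path: str = "tests", filter: str | None = None) -> tuple[bool, str]:
    """Run pytest in the sandbox and report (all_passed, combined_output)."""
    cmd = ["pytest", test_path, "-q"]
    if filter:
        cmd += ["-k", filter]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    return proc.returncode == 0, proc.stdout + proc.stderr

# Red: before implementing, the newly written tests must fail
# (the "test_new_feature" filter is a hypothetical example).
passed, _ = run_tests(filter="test_new_feature")
assert not passed, "new tests unexpectedly pass before implementation"
# Green: the execution loop only exits once run_tests(...) returns True.
```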

Human-in-the-Loop: The Review Gate

A coding agent should never commit code autonomously. The human review gate is a non-negotiable checkpoint:

  • Agent completes the implementation and all tests pass in the sandbox
  • Agent generates a diff/PR summary: what changed, why, which tests validate it
  • Human reviews the diff in their IDE or PR interface
  • Human approves, requests changes, or rejects
  • If changes requested, agent resumes the execution loop with the feedback
  • Only on human approval does the agent commit/push or open a PR

This gate is where hallucinated logic, subtle security issues, and unintended side effects get caught. The agent handles the mechanical work; the human maintains ownership of what enters the codebase.
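
A sketch of assembling the review artifact handed to the human, assuming git is available in the workspace; summarize is a hypothetical LLM call, and nothing here commits or pushes.

```python
import subprocess

def build_review_packet(workdir: str, test_output: str) -> dict:
    """Collect the diff and supporting context for the human reviewer; no commit happens here."""
    diff = subprocess.run(
        ["git", "-C", workdir, "diff", "--stat", "--patch"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {
        "diff": diff,                # exact changes, reviewed in the IDE or PR interface
        "tests": test_output,        # which tests validate the change
        "summary": summarize(diff),  # hypothetical LLM call: what changed and why
    }
```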

SWE-bench: Measuring Coding Agent Quality

SWE-bench (Software Engineering Benchmark) is the standard evaluation for coding agents. It consists of 2,294 real GitHub issues from popular open-source Python repositories (scikit-learn, matplotlib, Django, etc.). For each issue, the agent must produce a patch that makes the failing test pass on the original repository.

Approximate SWE-bench Verified scores (2025):

  • Claude Code (Anthropic): ~70–72%. Highest published as of mid-2025; uses the full tool loop.
  • o3 + Claude scaffold: ~69%. Strong reasoning plus tool-use combination.
  • GitHub Copilot Agent: ~55–60%. IDE-integrated; good for in-context editing tasks.
  • GPT-4o (standard): ~33–38%. Without reasoning; general-purpose model.

SWE-bench Verified (a curated subset of ~500 high-quality issues) is the more reliable metric; SWE-bench full has many ambiguous or underspecified issues. Use SWE-bench Verified scores when comparing coding agent claims.

Framework Choices

  • Claude Code (Anthropic): terminal-based autonomous coding; highest SWE-bench performance. Key constraint: requires the Claude API; terminal workflow.
  • GitHub Copilot Agent: IDE-integrated; PR workflow; team environments already on GitHub. Key constraint: GitHub ecosystem only; moderate autonomy.
  • Cursor: interactive coding with agent assist; best UX for interactive editing. Key constraint: primarily interactive, not fully autonomous.
  • Custom (LangGraph + tools): full control; custom sandboxes; enterprise requirements. Key constraint: high implementation effort; you maintain it yourself.
  • Aider: open-source; multi-model support; repo-map context. Key constraint: CLI-only; less mature than Claude Code for complex tasks.

Checklist: Do You Understand This?

  • Why can't a coding agent simply load all files in a repository into context?
  • What is a repository map and how does it help with codebase context?
  • Why must the run_command tool always execute inside a sandboxed container?
  • What is the TDD workflow for a coding agent, and why does it produce more reliable results?
  • What happens at the human-in-the-loop review gate, and why is autonomous commit never acceptable?
  • What does SWE-bench Verified measure, and what score represents state-of-the-art as of 2025?
  • What should the agent do when it hits its maximum step limit without completing the task?