
Coding Agent Architecture

A coding agent is an LLM that can read a codebase, write and edit files, execute code, run tests, and iterate until a task is complete: autonomously, within a sandbox, and with a human checkpoint before anything is committed. This architecture underpins Claude Code, GitHub Copilot Agent mode, and custom internal coding agents.

System Architecture

  • Input: Task Input (issue / ticket / natural language)
  • Context: Repository Map (symbols + signatures via tree-sitter); File Retrieval (semantic + symbol + dependency search)
  • Agent Core: LLM Planner (task decomposition + action selection); Tool Loop (read / write / run / test / search)
  • Execution (sandboxed): Docker Container (no network, ephemeral, resource-limited); Test Runner (pytest / jest / cargo test)
  • Review Gate: Human Review (diff + PR summary before any commit)

The human review gate is non-negotiable: the agent never commits autonomously.

Codebase Context: The Hardest Problem

A coding agent needs to understand a codebase that is often much larger than the LLM's context window. Naively dumping all files into context is expensive, slow, and often counterproductive (the model gets confused by irrelevant code). Production coding agents use a structured context strategy:

Repository map

A compact summary of the repository structure (files, classes, functions, and their signatures) without the full implementation body. This fits in a few thousand tokens and gives the agent a map of what exists and where. Tools like tree-sitter parse code structure statically; aider's repo-map algorithm selects the most relevant symbols based on the current task.
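
A minimal sketch of building such a map for a Python codebase, using the standard-library ast module as a stand-in for tree-sitter; the function names and output format here are illustrative, not aider's actual algorithm.

```python
import ast
from pathlib import Path

def file_signatures(path: Path) -> list[str]:
    """Extract top-level class and function signatures from one Python file."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    sigs = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"  def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            sigs.append(f"  class {node.name}")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    args = ", ".join(a.arg for a in item.args.args)
                    sigs.append(f"    def {item.name}({args})")
    return sigs

def repo_map(root: str) -> str:
    """Compact signatures-only view of the repository, small enough for the prompt."""
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        sigs = file_signatures(path)
        if sigs:
            lines.append(str(path.relative_to(root)))
            lines.extend(sigs)
    return "\n".join(lines)

print(repo_map("."))
```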

Relevant file retrieval

Given the repository map and the task, the agent selects which files to read in full. Retrieval strategies used in practice (a sketch of two of them follows the list):

  • Semantic search over code: embed all functions/classes, retrieve by similarity to the task description
  • Symbol search: grep for specific class/function names mentioned in the task
  • Dependency tracing: given a file to modify, find all files that import it (impact analysis)
  • Test file pairing: automatically include test files alongside the implementation files they test
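
A minimal sketch of the symbol-search and dependency-tracing strategies using plain text matching; the helper names are hypothetical and a production agent would typically shell out to ripgrep instead.

```python
import re
from pathlib import Path

def symbol_search(root: str, symbol: str) -> list[str]:
    """Files whose text mentions a class/function name taken from the task (grep-style)."""
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    return [
        str(p) for p in Path(root).rglob("*.py")
        if pattern.search(p.read_text(encoding="utf-8", errors="ignore"))
    ]

def importers_of(root: str, module: str) -> list[str]:
    """Files that import the given module: crude impact analysis for a planned edit."""
    pattern = re.compile(
        rf"^\s*(from\s+{re.escape(module)}\b|import\s+{re.escape(module)}\b)", re.M
    )
    return [
        str(p) for p in Path(root).rglob("*.py")
        if pattern.search(p.read_text(encoding="utf-8", errors="ignore"))
    ]

# e.g. symbol_search("src", "UserRepository"); importers_of("src", "billing")
```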

Context window management

Read only what is needed. Use the read_file tool rather than pre-loading everything. The agent can request additional files mid-task as it discovers dependencies. Cache file contents within a session to avoid re-reading unchanged files. For very large files, read specific line ranges rather than the full file.
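
A minimal sketch of a read_file tool with line-range support and a per-session cache; the signature mirrors the tool table below but is illustrative, not any specific framework's API.

```python
from pathlib import Path

_session_cache: dict[str, list[str]] = {}  # path -> lines, valid for one agent session

def read_file(path: str, start_line: int | None = None, end_line: int | None = None) -> str:
    """Return a file's content (or a 1-indexed line range), caching reads within the session."""
    if path not in _session_cache:
        _session_cache[path] = Path(path).read_text(encoding="utf-8").splitlines(keepends=True)
    lines = _session_cache[path]
    if start_line is None:
        return "".join(lines)
    return "".join(lines[start_line - 1 : end_line])
```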

The Tool Set

A minimal but sufficient tool set for a coding agent covers five categories:

  • read_file: read a file's content (all or a line range). Parameters: path, start_line, end_line
  • write_file: write or overwrite a file with new content. Parameters: path, content
  • edit_file: apply a targeted diff/patch to a file (safer than a full write). Parameters: path, old_content, new_content
  • run_command: execute a shell command and return stdout/stderr. Parameters: command, timeout, working_dir
  • run_tests: run the test suite (or a subset) and return results. Parameters: test_path, filter
  • search_code: grep/ripgrep for a pattern across the codebase. Parameters: pattern, file_glob, context_lines
  • list_directory: list files in a directory with metadata. Parameters: path, recursive
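
A sketch of how two of these tools might be declared for a tool-calling model, using JSON-Schema-style parameter blocks; the exact wire format varies by provider, so treat the field names as an assumption to adapt.

```python
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a file's content, optionally restricted to a line range.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "start_line": {"type": "integer"},
                "end_line": {"type": "integer"},
            },
            "required": ["path"],
        },
    },
    {
        "name": "run_tests",
        "description": "Run the test suite (or a filtered subset) and return the results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "test_path": {"type": "string"},
                "filter": {"type": "string"},
            },
            "required": [],
        },
    },
]
```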

Security note: The run_command tool is the most dangerous. Always run it inside a sandboxed container with no network access and no access to production credentials. The container should be ephemeral: created fresh per session and destroyed after. Never run this tool in the user's live environment without explicit sandboxing.

Planning: Task Decomposition

Before writing any code, a well-designed coding agent produces a plan:

  1. Read and understand the task (issue description, failing test, feature request)
  2. Explore the repository map and retrieve relevant files
  3. Identify which files need to be created or modified
  4. Identify which tests need to pass (or be written first in TDD mode)
  5. Decompose into ordered implementation steps
  6. Optionally, present the plan to the user before execution (confirm gate)

The plan is kept in the agent's working memory (typically as a list in the system prompt or conversation context) and updated as steps complete. This prevents the agent from losing track on long multi-file tasks.
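
A minimal sketch of such a working-memory plan, re-rendered into the prompt each turn; the structure is illustrative rather than any specific framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    description: str         # e.g. "add retry logic to http_client.request" (hypothetical step)
    status: str = "pending"  # pending | in_progress | done | blocked

@dataclass
class Plan:
    steps: list[PlanStep] = field(default_factory=list)

    def next_step(self) -> PlanStep | None:
        return next((s for s in self.steps if s.status != "done"), None)

    def render(self) -> str:
        """Rendered into the system prompt every turn so the agent keeps its place."""
        return "\n".join(f"[{s.status}] {s.description}" for s in self.steps)
```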

Execution Loop

The core ReAct-style execution loop iterates until all planned steps complete or the agent decides it cannot proceed:

  1. Think: what is the next action to take?
  2. Act: call a tool (read / write / run / test / search)
  3. Observe: receive and parse the tool output
  4. Evaluate: are the tests passing? Did the action succeed?
  5. Update plan: mark the step complete or add correction steps
  6. Exit or escalate: if done, hand off to human review; if stuck three times on the same step, surface to the human


Set a maximum step limit (e.g., 50 tool calls) to prevent runaway loops. If the agent hits the limit without succeeding, it should return a partial result and a status summary rather than silently failing.
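
A minimal sketch of this loop with a hard step cap; call_llm and dispatch_tool are caller-supplied, hypothetical stand-ins for the model call and the sandboxed tool execution.

```python
MAX_STEPS = 50  # hard cap on tool calls to prevent runaway loops

def run_agent(task: str, call_llm, dispatch_tool) -> dict:
    """ReAct loop: think -> act -> observe -> evaluate, until done or the step limit.

    call_llm(history) returns {"type": "tool"|"finish", ...}; dispatch_tool(action)
    returns the tool output as a string. Both are hypothetical stand-ins.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        action = call_llm(history)                 # Think: choose the next action
        if action["type"] == "finish":             # Exit: hand off to human review
            return {"status": "done", "summary": action.get("summary", "")}
        observation = dispatch_tool(action)        # Act: run the tool in the sandbox
        history.append({"role": "assistant", "content": repr(action)})
        history.append({"role": "tool", "content": observation})  # Observe
    # Step limit hit: return a partial result and a status summary, never fail silently
    return {"status": "incomplete", "summary": "max steps reached", "history": history}
```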

Sandboxed Execution

All file writes and command executions must happen inside an isolated environment:

Docker container (recommended)

  • No network access (--network none)
  • Read-only mounts for source code; writable /workspace
  • Resource limits (CPU, memory, disk I/O)
  • Non-root user inside container
  • Time limit per command (30s default)
  • Container destroyed after session ends

E2B cloud sandbox

  • Managed sandbox-as-a-service
  • Instant spin-up (<150ms)
  • Python, Node, and custom environments
  • Filesystem persistence across calls within a session
  • Automatic cleanup on timeout
  • Good for cloud-hosted coding agents

The sandbox must contain the same language runtime, dependencies, and toolchain as the target environment. This is where "it works in sandbox but fails in production" bugs come from; invest in keeping the sandbox environment aligned with CI/CD.
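
A sketch of launching a container with the Docker constraints listed above, driven from Python; it assumes Docker is installed, and the image name and mount paths are placeholders.

```python
import subprocess

def start_sandbox(repo_dir: str, image: str = "coding-agent-sandbox:latest") -> str:
    """Start an ephemeral, network-isolated container and return its ID."""
    result = subprocess.run(
        [
            "docker", "run", "--detach", "--rm",
            "--network", "none",               # no network access
            "--memory", "2g", "--cpus", "2",   # resource limits
            "--user", "1000:1000",             # non-root user inside the container
            "-v", f"{repo_dir}:/src:ro",       # read-only source mount
            "-v", "/workspace",                # writable scratch volume
            image, "sleep", "infinity",
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()  # container ID; remove it when the session ends
```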

Test-Driven Development Mode

The most reliable coding agent workflow follows TDD: write failing tests first, then implement until the tests pass. This gives the agent an unambiguous success criterion:

  1. Agent reads the task and writes test cases that would verify the correct implementation
  2. Runs the tests; all should fail (red)
  3. Implements the feature/fix iteratively
  4. Runs the tests after each implementation step; success when all pass (green)
  5. Optional: refactor while keeping tests green

Without a clear success criterion (tests passing), the agent has no reliable way to know when it is done and may keep making unnecessary changes or stop too early.
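
A minimal sketch of the red/green check that gives the loop its success criterion, assuming a pytest project; the parameter names follow the run_tests tool above.

```python
import subprocess

def run_tests(test_path: str = "tests", filter: str | None = None) -> tuple[bool, str]:
    """Run pytest in the sandbox and report (all_passed, combined_output)."""
    cmd = ["pytest", test_path, "-q"]
    if filter:
        cmd += ["-k", filter]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    return proc.returncode == 0, proc.stdout + proc.stderr

# Red: before implementing, the newly written tests must fail
# (the "test_new_feature" filter is a hypothetical example).
passed, _ = run_tests(filter="test_new_feature")
assert not passed, "new tests unexpectedly pass before implementation"
# Green: the execution loop only exits once run_tests(...) returns True.
```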

Human-in-the-Loop: The Review Gate

A coding agent should never commit code autonomously. The human review gate is a non-negotiable checkpoint:

  • Agent completes the implementation and all tests pass in the sandbox
  • Agent generates a diff/PR summary: what changed, why, which tests validate it
  • Human reviews the diff in their IDE or PR interface
  • Human approves, requests changes, or rejects
  • If changes requested, agent resumes the execution loop with the feedback
  • Only on human approval does the agent commit/push or open a PR

This gate is where hallucinated logic, subtle security issues, and unintended side effects get caught. The agent handles the mechanical work; the human maintains ownership of what enters the codebase.
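
A sketch of assembling the review artifact handed to the human, assuming git is available in the workspace; summarize is a hypothetical LLM call, and nothing here commits or pushes.

```python
import subprocess

def build_review_packet(workdir: str, test_output: str) -> dict:
    """Collect the diff and supporting context for the human reviewer; no commit happens here."""
    diff = subprocess.run(
        ["git", "-C", workdir, "diff", "--stat", "--patch"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {
        "diff": diff,                # exact changes, reviewed in the IDE or PR interface
        "tests": test_output,        # which tests validate the change
        "summary": summarize(diff),  # hypothetical LLM call: what changed and why
    }
```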

SWE-bench: Measuring Coding Agent Quality

SWE-bench (Software Engineering Benchmark) is the standard evaluation for coding agents. It consists of 2,294 real GitHub issues from popular open-source Python repositories (scikit-learn, matplotlib, Django, etc.). For each issue, the agent must produce a patch that makes the failing test pass on the original repository.

Approximate SWE-bench Verified scores (2025):

  • Claude Code (Anthropic): ~70–72%. Highest published as of mid-2025; uses the full tool loop.
  • o3 + Claude scaffold: ~69%. Strong reasoning plus tool-use combination.
  • GitHub Copilot Agent: ~55–60%. IDE-integrated; good for in-context editing tasks.
  • GPT-4o (standard): ~33–38%. Without reasoning; general-purpose model.

SWE-bench Verified (a curated subset of ~500 high-quality issues) is the more reliable metric; SWE-bench full has many ambiguous or underspecified issues. Use SWE-bench Verified scores when comparing coding agent claims.

Framework Choices

  • Claude Code (Anthropic): terminal-based autonomous coding; highest SWE-bench performance. Key constraint: requires the Claude API; terminal workflow.
  • GitHub Copilot Agent: IDE-integrated; PR workflow; team environments already on GitHub. Key constraint: GitHub ecosystem only; moderate autonomy.
  • Cursor: interactive coding with agent assist; best UX for interactive editing. Key constraint: primarily interactive, not fully autonomous.
  • Custom (LangGraph + tools): full control; custom sandboxes; enterprise requirements. Key constraint: high implementation effort; you maintain it yourself.
  • Aider: open-source; multi-model support; repo-map context. Key constraint: CLI-only; less mature than Claude Code for complex tasks.

Checklist: Do You Understand This?

  • Why can't a coding agent simply load all files in a repository into context?
  • What is a repository map and how does it help with codebase context?
  • Why must the run_command tool always execute inside a sandboxed container?
  • What is the TDD workflow for a coding agent, and why does it produce more reliable results?
  • What happens at the human-in-the-loop review gate, and why is autonomous commit never acceptable?
  • What does SWE-bench Verified measure, and what score represents state-of-the-art as of 2025?
  • What should the agent do when it hits its maximum step limit without completing the task?