Intermediate

Context Windows

Claude supports a 200,000-token context window. Understanding what that means in practice — how much content fits, how it affects quality and cost, and when it replaces retrieval — shapes how you architect Claude-powered applications.

What Actually Fits in 200K Tokens?

A rough rule: 1 token ≈ 0.75 words in English prose. At 200K tokens, you can fit approximately:

Content TypeApproximate VolumeNotes
English prose~150,000 wordsAbout 500–600 pages of a typical book
PDF documents~250–400 pagesVaries with tables, headers, whitespace
Source code~15,000–25,000 linesCode tokenises at ~1.3–1.5 tokens/word (identifiers, symbols)
JSON/CSV data~5,000–20,000 rowsDense structured data tokenises less efficiently
Chat history~1,000–2,000 turnsDepends heavily on average message length

In practice, long contexts mean you can feed Claude an entire codebase, a full contract, a transcript of a lengthy meeting, or multiple research papers simultaneously — without needing to chunk and retrieve.

Long Context vs RAG: When to Use Which

The 200K context window changes the retrieval calculus. For many tasks, stuffing the full document into context is simpler and more accurate than building a RAG pipeline:

Use long context directly when:

  • The document fits in context (<200K tokens)
  • You need cross-document reasoning (comparing section A to section Z)
  • The query is unpredictable — you can't pre-plan what chunks to retrieve
  • You want zero retrieval infrastructure complexity
  • Accuracy matters more than cost and the document is read infrequently

Use RAG when:

  • Your knowledge base is far larger than 200K tokens
  • You need to search across thousands of documents
  • Cost is critical — you pay per token, RAG retrieves only relevant chunks
  • You need up-to-date information (RAG indexes refresh; context is static)
  • You need source attribution at the chunk level

A practical heuristic: if your document collection fits in 200K tokens and changes infrequently, long context is often the right choice. If your corpus is large, dynamic, or cost-sensitive at scale, build a RAG pipeline instead.

Attention and the "Lost in the Middle" Problem

Longer contexts don't always mean better performance. Research has identified a pattern called "lost in the middle": information placed in the middle of a very long context is retrieved less reliably than information at the start or end.

  • Claude's position: Claude performs well on long-context recall tasks and is specifically trained to handle 200K windows. But no model is immune to some quality degradation at extreme context lengths.
  • Mitigation: Place the most critical information — system instructions, the most relevant document sections, key constraints — at the beginning or end of the context, not buried in the middle.
  • Evaluation: Always test your specific use case at the actual context length you plan to use in production. Benchmark accuracy at 10K, 50K, 100K, and 200K tokens to understand where quality starts to degrade for your task.

Cost Implications

Claude's API pricing charges per input token and per output token. Large context windows amplify input token costs significantly:

  • Input cost scales linearly: Sending a 200K-token document costs 200× more than sending a 1K-token prompt. For tasks run thousands of times per day, this matters enormously.
  • Prompt caching: Anthropic's API supports prompt caching — if a large portion of your context is repeated across calls (e.g. a large system prompt, a reference document), cached tokens cost significantly less. Structure your prompts to put static content first to maximise cache hits.
  • Haiku for long-context tasks: If your task is well-defined (e.g. extracting specific fields from a long document), using Haiku with a large context can be cheaper than using Sonnet with a large context — Haiku's lower per-token price applies to all tokens including the large input.

Strategies for Managing Context Efficiently

  • Summarise conversation history: For long-running chat sessions, periodically summarise earlier parts of the conversation into a compact summary that replaces the raw turn history. This keeps costs manageable without losing continuity.
  • Selective inclusion: Don't send an entire document if only a portion is relevant. Pre-filter, chunk, or trim to the sections most likely to matter before sending to Claude.
  • Structured extraction in passes: For very large corpora, use a first-pass extraction (with Haiku) to pull out the 5–10% of content relevant to the query, then feed that condensed output to Sonnet/Opus for reasoning.
  • Context window as working memory, not permanent storage: Context is stateless between API calls — it resets each time. For persistent memory across sessions, use Projects (Claude.ai) or an external memory store, not the raw context window.

Checklist: Do You Understand This?

  • 200K tokens ≈ 150,000 words ≈ 500 book pages — enough for entire codebases, contracts, or reports
  • Long context is simpler than RAG for single-document or small-corpus use cases where cross-section reasoning matters
  • RAG is preferred when the knowledge base exceeds 200K tokens, updates frequently, or cost is critical at scale
  • Place critical information at the start or end of long contexts — avoid burying key facts in the middle
  • Input token cost scales linearly with context length — use prompt caching and selective content inclusion to control costs
  • Context is stateless — it resets between API calls; use external memory or Projects for persistence across sessions

Page built: 01 Jun 2026