Intermediate

Long Document Handling

Claude's 200K-token context window allows you to work with genuinely long documents — entire books, legal contracts, codebases, and research collections — without building a retrieval pipeline. This page covers how to use that capability effectively.

Long Context vs RAG for Document Work

For single-document or small-collection work, loading the full document into context is simpler and often more accurate than chunking and retrieving:

Long context works well when

The document, plus your instructions and room for the response, fits within 200K tokens
Your queries are unpredictable — hard to pre-plan what to retrieve
You need cross-document reasoning (comparing section A to section Z)
You want simplicity — no retrieval infrastructure needed
The document is queried only a few times (so re-sending the full text with each query is affordable)

RAG works better when

Your collection is far larger than 200K tokens
You need to search across thousands of documents
The same documents are queried frequently (cost matters at scale)
You need source attribution at the chunk level
Documents update frequently and you need fresh indexing
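Before choosing, you can roughly check the first criterion in each list. The sketch below is a heuristic only. It assumes about 4 characters per token for English prose, which approximates but is not Claude's actual tokenizer. It also reserves illustrative budgets for instructions and output, and uses a made-up query-volume threshold. All names are hypothetical.

```python
# Assumption: roughly 4 characters per token for English prose. This is a
# heuristic, not Claude's tokenizer; use a real token count for production.
CHARS_PER_TOKEN = 4
CONTEXT_WINDOW = 200_000


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (rounded up)."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def fits_in_context(document: str, prompt_tokens: int = 2_000,
                    output_tokens: int = 4_000,
                    window: int = CONTEXT_WINDOW) -> bool:
    """True if the document, the instructions and the reserved output all fit."""
    needed = estimate_tokens(document) + prompt_tokens + output_tokens
    return needed <= window


def choose_strategy(documents: list[str], expected_queries: int,
                    high_query_volume: int = 1_000) -> str:
    """Toy decision rule mirroring the two lists above (thresholds illustrative)."""
    combined = "\n\n".join(documents)
    if not fits_in_context(combined):
        return "rag"  # collection is larger than the context window
    if expected_queries >= high_query_volume:
        return "rag"  # re-sending every token on every query gets expensive
    return "long_context"
```

A real decision also weighs criteria that a single function cannot capture, such as update frequency and the need for chunk-level attribution.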

Chunking vs Full-Document: When Each Is Better

Even within a single document, you sometimes need to choose between loading the whole thing vs sending selected sections:

Full document: Best for summarisation, questions that might span the whole document, and when you don't know in advance which parts are relevant. The trade-offs are cost (every token is billed on every request) and some quality degradation at extreme lengths.
Selective chunking: Best for targeted extraction from a known section. If you know the answer is in Chapter 3, send only Chapter 3. This is cheaper and potentially more accurate because the model's attention is less diluted.
Hierarchical approach: First summarise each section, then synthesise the summaries. For documents that exceed the 200K-token window, this is the practical way to cover the full text without a retrieval pipeline, and it works well for broad overviews.
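Selective chunking needs a way to pull out a single section. The following is a minimal sketch. It assumes chapters begin on lines like "Chapter 3" or "Chapter 3: Results". `HEADING_PATTERN` and both helpers are hypothetical and should be adapted to your document's real heading convention.

```python
import re

# Assumption: each chapter starts on its own line, e.g. "Chapter 3: Results".
HEADING_PATTERN = re.compile(r"^Chapter[ \t]+(\d+)\b.*$", re.MULTILINE)


def split_by_chapter(document: str) -> dict[int, str]:
    """Map chapter number -> chapter text (heading included).

    Any text before the first heading (e.g. a preface) is dropped.
    """
    matches = list(HEADING_PATTERN.finditer(document))
    sections: dict[int, str] = {}
    for i, match in enumerate(matches):
        start = match.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        sections[int(match.group(1))] = document[start:end].strip()
    return sections


def select_chapter(document: str, chapter: int) -> str:
    """Return only the requested chapter, to send instead of the whole document."""
    sections = split_by_chapter(document)
    if chapter not in sections:
        raise KeyError(f"Chapter {chapter} not found")
    return sections[chapter]
```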

Q&A Over Long Documents

For question-answering over uploaded long documents:

Be specific in your question: Vague questions like "what does this document say about risks?" produce vague answers. "List all risk factors mentioned in Section 2, with the exact language used" produces precise output.
Ask for source location: "For each point, tell me which section or page it comes from." This lets you verify the answer against the source.
Decompose multi-part questions: Instead of "summarise the contract and identify risks and obligations", ask these as three separate questions. Claude generally handles one focused task better than three interleaved ones.
Ask what's missing: "Are there any aspects of [topic] that the document does NOT cover?" This surfaces gaps in the document's coverage instead of leaving you to infer them from what the answer omits.
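The tips above can be combined into a prompt template. The sketch below puts the document first and the question last, wrapped in XML-style `<document>` tags, and adds requests for source locations and explicit gaps. The template wording and helper names are illustrative assumptions, not an official format.

```python
def build_qa_prompt(document: str, question: str, title: str = "document") -> str:
    """Document first, question last, plus requests for sources and gaps."""
    return (
        f'<document title="{title}">\n{document}\n</document>\n\n'
        f"{question}\n\n"
        "For each point, tell me which section or page it comes from. "
        "If the document does not cover part of the question, say so explicitly."
    )


def build_decomposed_prompts(document: str, questions: list[str]) -> list[str]:
    """One focused prompt per question instead of one combined request."""
    return [build_qa_prompt(document, q) for q in questions]


# Usage: three focused requests instead of one interleaved one.
contract_text = "..."  # hypothetical placeholder for the full contract text
prompts = build_decomposed_prompts(contract_text, [
    "Summarise the contract.",
    "List every risk factor, quoting the exact language used.",
    "List every obligation of each party, quoting the exact language used.",
])
```

Each decomposed prompt re-sends the full document, so three focused questions cost roughly three times the input tokens of one combined question.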

Summarisation at Scale

For documents that fit in context, direct summarisation works well:

"Summarise this document in 5 bullet points for a non-specialist audience."
"Give me an executive summary (3 paragraphs) followed by a detailed summary of each section."
"Summarise the key findings, methodology, and limitations of this paper."

For documents too large for the context window, use a hierarchical summarisation approach:

Split the document into chunks that each fit in context
Ask Claude to summarise each chunk independently (these can be run in parallel)
Concatenate the summaries and ask Claude to synthesise them into a final summary
For very deep hierarchies, repeat this process at the synthesis level
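Here is a rough sketch of the map-reduce steps above. `summarise` is a hypothetical callable standing in for your model call (prompt in, text out). The prompt wording, the 400,000-character chunk size (roughly 100K tokens if you assume about 4 characters per token), and the depth limit are all illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical stand-in for your model call: takes a prompt, returns text.
Summariser = Callable[[str], str]

CHUNK_PROMPT = "Summarise the following text:\n\n"
SYNTHESIS_PROMPT = "Combine these partial summaries into one coherent summary:\n\n"


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # skip blank paragraphs
        # Hard-split any paragraph that is itself longer than the limit.
        for start in range(0, len(para), max_chars):
            piece = para[start:start + max_chars]
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def hierarchical_summarise(text: str, summarise: Summariser,
                           max_chars: int = 400_000, max_workers: int = 4,
                           max_depth: int = 5, depth: int = 0) -> str:
    """Map-reduce summary: summarise chunks in parallel, then synthesise."""
    prefix = CHUNK_PROMPT if depth == 0 else SYNTHESIS_PROMPT
    chunks = chunk_text(text, max_chars)
    if len(chunks) <= 1:
        return summarise(prefix + text)  # fits in one request
    if depth >= max_depth:
        raise RuntimeError("summaries are not shrinking; shorten them or raise max_chars")
    # Map step: each chunk is summarised independently, in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = list(pool.map(lambda chunk: summarise(prefix + chunk), chunks))
    # Reduce step: recurse on the concatenated summaries.
    return hierarchical_summarise("\n\n".join(summaries), summarise, max_chars,
                                  max_workers, max_depth, depth + 1)
```

The recursion implements "repeat at the synthesis level". The depth guard stops it when the summaries are not getting shorter, which would otherwise loop forever.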

This map-reduce approach trades some cross-chunk coherence for scalability: each chunk is summarised in isolation, so information that exists only in the relationships between chunks is lost before the synthesis step.

Attention Degradation Warning

Claude's 200K context window does not mean 200K-token contexts perform identically to 10K-token contexts. Research on long-context language models has identified one effect in particular:

"Lost in the middle" effect

Information placed in the middle of a very long context is retrieved less reliably than information at the start or end. This effect has been documented across many large language models. At very long lengths (for example, 150K+ tokens), recall accuracy for facts buried in the middle of the context can degrade noticeably.

Practical mitigations:

Place the most critical information — system instructions, the most relevant document sections, the key constraints — at the beginning or end of the prompt
For retrieval-critical tasks at extreme lengths, consider a hybrid approach: use retrieval to identify the relevant section, then send that whole section (rather than isolated chunks) as context
Test your specific use case at the actual context length you plan to deploy — don't assume 200K performance from benchmarks at shorter context lengths
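A synthetic "needle in a haystack" check is one way to test at your target length. You hide a known fact at different relative depths in filler text of the size you plan to deploy, then measure how often the model returns it. In this sketch, `ask` is a hypothetical wrapper around your model call. The needle, question and default depths are made-up examples. For realistic results, use text from your own documents as filler.

```python
import random
from typing import Callable

# Hypothetical: ask(prompt) -> answer text, wrapping your model call.
Ask = Callable[[str], str]

NEEDLE = "The access code for the archive room is 7-4-1-9."
QUESTION = "What is the access code for the archive room? Answer with the code only."
EXPECTED = "7-4-1-9"


def build_haystack(filler: list[str], target_chars: int, depth: float,
                   needle: str = NEEDLE) -> str:
    """Cycle filler paragraphs up to target_chars, then insert the needle
    at a relative depth (0.0 = start, 1.0 = end)."""
    if not filler:
        raise ValueError("filler must not be empty")
    paragraphs: list[str] = []
    total = 0
    while total < target_chars:
        paragraph = filler[len(paragraphs) % len(filler)]
        paragraphs.append(paragraph)
        total += len(paragraph) + 2  # +2 for the "\n\n" separator
    paragraphs.insert(round(depth * len(paragraphs)), needle)
    return "\n\n".join(paragraphs)


def recall_by_depth(ask: Ask, filler: list[str], target_chars: int,
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials: int = 3,
                    seed: int = 0) -> dict[float, float]:
    """Fraction of trials whose answer contains EXPECTED, per needle depth."""
    rng = random.Random(seed)
    results: dict[float, float] = {}
    for depth in depths:
        hits = 0
        for _ in range(trials):
            shuffled = filler[:]
            rng.shuffle(shuffled)  # vary the surrounding text between trials
            doc = build_haystack(shuffled, target_chars, depth)
            answer = ask(f"<document>\n{doc}\n</document>\n\n{QUESTION}")
            if EXPECTED in answer:
                hits += 1
        results[depth] = hits / trials
    return results
```

A low score at middle depths combined with high scores at 0.0 and 1.0 is the "lost in the middle" pattern. This synthetic check complements testing your real task at the same length but does not replace it.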

Checklist: Do You Understand This?

Full-document context is simpler than RAG for single documents under 200K tokens — best for unpredictable queries and cross-section reasoning
Use RAG for corpora larger than the context window or high-frequency querying where cost matters
Selective chunking (send only relevant sections) is cheaper and sometimes more accurate than full-document context
For documents larger than 200K tokens, use hierarchical summarisation (summarise chunks, then synthesise summaries)
Place critical information at the start or end of long contexts — "lost in the middle" degradation affects retrieval at extreme lengths
Always test your specific use case at your target context length — don't assume peak performance across the entire 200K window