Long Context vs RAG
Gemini 2.5 Pro supports 1 million tokens. Llama 4 Scout supports 10 million. With windows this large, why not just stuff your entire document corpus into the prompt and skip RAG entirely? The answer comes down to cost, latency, accuracy, and scale. This page gives you a rigorous framework for choosing between them — or combining them.
The Core Choice
Long context wins for prototypes; RAG wins for production at scale
Cost: The 1,250× Difference
The most decisive factor at scale is cost per query (prices are indicative snapshots; verify against current provider rate cards):
| Approach | Query cost (1M-token context) | Query cost (RAG, ~3K context) | Ratio |
|---|---|---|---|
| Gemini 2.5 Pro | ~$0.075 (input) + generation | ~$0.00022 (3K tokens) | ~340× cheaper with RAG |
| GPT-4o | ~$2.50 per 1M input tokens | ~$0.0075 (3K tokens) | ~333× cheaper with RAG |
| Claude Sonnet | ~$3.00 per 1M input tokens | ~$0.009 (3K tokens) | ~333× cheaper with RAG |
| Optimised RAG (with cache) | N/A | ~$0.00008 (semantic cache hit) | Up to 1,250× cheaper vs 1M-context |
At 10,000 queries per day, stuffing a 1M-token context costs roughly $750/day with Gemini 2.5 Pro, versus about $2.25/day for plain RAG at ~3K tokens per query (and roughly $0.80/day with a semantic cache): a difference of around $270,000/year from a single architectural choice.
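The arithmetic above can be sketched as a small cost calculator. The $0.075-per-1M-token price and 3K-token RAG context are the illustrative figures from the table, not authoritative rates:

```python
def query_cost(context_tokens: int, price_per_million: float) -> float:
    """Input-token cost of one query (generation tokens excluded)."""
    return context_tokens / 1_000_000 * price_per_million

def daily_cost(queries_per_day: int, context_tokens: int,
               price_per_million: float) -> float:
    return queries_per_day * query_cost(context_tokens, price_per_million)

# Figures from the table above: ~$0.075 per 1M input tokens (illustrative),
# ~3K tokens of injected context for a RAG query.
PRICE = 0.075

long_ctx = daily_cost(10_000, 1_000_000, PRICE)  # full-corpus stuffing
rag = daily_cost(10_000, 3_000, PRICE)           # top-k retrieval only

print(f"long context: ${long_ctx:,.2f}/day vs RAG: ${rag:.2f}/day")
# long context: $750.00/day vs RAG: $2.25/day
```

Plug in your own provider's rates and per-query context size to find the crossover for your workload.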
Latency: The 45-Second Wall
| Approach | Time to first token | Total response time |
|---|---|---|
| 1M-token context (Gemini 2.5 Pro) | 20–45 seconds | 45–90 seconds |
| 100K-token context (GPT-4o) | 5–15 seconds | 15–30 seconds |
| Basic RAG (3K injected context) | 1–2 seconds | 3–8 seconds |
| Optimised RAG (with reranking + cache) | 0.5–1 second | 2–5 seconds |
A 45-second wait is acceptable for batch document processing. It is not acceptable for a conversational chatbot or any real-time interaction. Latency alone rules out long-context stuffing for interactive applications.
Accuracy: The Lost-in-the-Middle Problem
Counterintuitively, larger context does not always mean better accuracy on the information within that context. The "lost in the middle" problem is well-documented:
- Gemini 2.5 Pro at 1M tokens: accuracy drops to approximately 77% on RULER (long-context retrieval benchmark) — even though it achieves 98%+ on short contexts
- Most frontier models: accuracy degrades to 65–70% when relevant information is in the middle of a very long window
- RAG with top-k retrieval: surfaces only the most relevant 3–10 chunks, keeping context tight and precise — typical domain accuracy of 85–95%
Why the middle gets lost
LLMs attend more strongly to tokens at the beginning and end of the context window. Information buried in the middle of a 1M-token window competes with hundreds of thousands of other tokens for attention. RAG sidesteps this entirely — it surfaces only the relevant content, so everything in the context is by definition near the beginning.
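The mechanism is easy to see in code: top-k retrieval scores every chunk and keeps only the best few, so the prompt stays small no matter how large the corpus grows. The scoring function below is a toy keyword-overlap stand-in for real vector similarity:

```python
def score(query: str, chunk: str) -> float:
    # Fraction of query words that appear in the chunk (toy relevance).
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def top_k(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Rank the whole corpus, keep only the k best chunks.
    return sorted(corpus, key=lambda ch: score(query, ch), reverse=True)[:k]

corpus = [
    "Refunds are issued within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3-5 business days.",
    "Refunds require the original receipt.",
]
context = top_k("when are refunds issued", corpus, k=2)
# Only two short chunks enter the prompt: there is no "middle" for
# relevant content to get lost in.
```

The corpus could be four chunks or four million; the context handed to the model is always k chunks.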
Decision Framework
- Is this a prototype or low-volume internal tool? Yes → long context is fine. No infrastructure needed, acceptable cost at low volume.
- Do users need fast, interactive responses? Yes → RAG is required. Long context is too slow for interactive applications.
- Will you run more than ~1,000 queries per day? Yes → RAG is almost certainly cheaper. Calculate the cost crossover for your token count.
- Does the corpus change frequently? Yes → RAG. Re-indexing is fast; stuffing a changing corpus into every prompt is unmanageable.
- Do you need citations or source attribution? Yes → RAG. Long-context generation cannot reliably attribute which part of the context informed the answer.
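The questions above can be encoded as a simple router. The function and its thresholds are illustrative, not a definitive policy:

```python
def choose_architecture(
    queries_per_day: int,
    needs_realtime: bool,
    corpus_changes_often: bool,
    needs_citations: bool,
) -> str:
    # Any hard requirement on latency, freshness, or attribution forces RAG.
    if needs_realtime or corpus_changes_often or needs_citations:
        return "RAG"
    # Above ~1K queries/day the cost crossover favours RAG (see table above).
    if queries_per_day > 1_000:
        return "RAG"
    return "long context"

print(choose_architecture(50, False, False, False))    # prototype
print(choose_architecture(10_000, True, True, True))   # support bot
```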
| Scenario | Recommended approach | Reason |
|---|---|---|
| Prototype — exploring a 30-doc dataset this week | Long context | Zero infrastructure, fastest time to insight |
| Customer support bot on 5,000-page knowledge base | RAG | Volume, latency, cost, update frequency all favour RAG |
| Reviewing a single 300-page legal contract once | Long context | One-off task; setup cost of RAG not justified |
| Compliance Q&A across 10,000 contracts | RAG (+ GraphRAG for cross-doc) | Scale, latency, cost, citation requirements |
| Nightly batch analysis of 1,000 reports | Long context acceptable | No latency constraint; cost manageable if one-off per doc |
| Real-time research assistant (>1K queries/day) | RAG | Cost and latency; long context impractical at volume |
The Hybrid Pattern: Long Context + RAG Together
The best-performing production systems often combine both:
Route by query complexity — use long context only where retrieval precision is insufficient
Practical hybrid approach
- Use RAG to retrieve the top 5–10 most relevant documents
- For synthesis queries that require deep reasoning across those documents, stuff the full text of the retrieved documents into a long-context call (now manageable — 5–10 docs is 20K–50K tokens, not 1M)
- Cache common RAG results to cut cost further
- Reserve true 1M-token calls for genuine all-corpus analysis tasks that justify the cost
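The four steps above can be sketched as a routing function. Here `retrieve`, `answer_short`, and `answer_long` are placeholders for your retrieval stack and LLM client, and `is_synthesis_query` is a crude keyword heuristic standing in for a real query classifier:

```python
def is_synthesis_query(query: str) -> bool:
    # Toy stand-in for a real classifier of deep-reasoning queries.
    return any(w in query.lower() for w in ("compare", "summarise", "across", "trend"))

def hybrid_answer(query: str, retrieve, answer_short, answer_long, k: int = 8):
    docs = retrieve(query, k=k)          # RAG narrows the corpus to ~20K-50K tokens
    if is_synthesis_query(query):
        return answer_long(query, docs)  # full text of k docs, still far below 1M
    return answer_short(query, docs[:3]) # tight context for factoid queries
```

Factoid queries stay on the cheap, fast path; only synthesis queries pay for a larger (but still bounded) context.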
When Long Context Genuinely Wins
Use cases where long context is the right choice
Full codebase analysis
Analysing or refactoring an entire repo (100K–500K tokens) where relationships across all files matter — RAG would fragment the context.
Book or long-document Q&A (one-off)
A single 300-page document analysed once — the setup cost of RAG is not justified.
Multi-document synthesis (small set)
Synthesising insights across 10–20 long documents into a single output — RAG would miss cross-document relationships.
Few-shot with many examples
Tasks where providing 50–100 examples dramatically improves output quality — long context enables large example libraries.
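The few-shot case can be sketched as a budget-based example packer. Token counts are approximated here by whitespace splitting; a real system would use the model's tokenizer:

```python
def pack_examples(examples: list[str], budget_tokens: int) -> str:
    """Greedily pack few-shot examples until the token budget is spent."""
    packed, used = [], 0
    for ex in examples:
        n = len(ex.split())  # crude token estimate
        if used + n > budget_tokens:
            break            # budget exhausted: stop packing
        packed.append(ex)
        used += n
    return "\n\n".join(packed)

examples = [f"Input: example {i}\nOutput: label {i}" for i in range(100)]
prompt = pack_examples(examples, budget_tokens=300)
```

With a large window the budget can accommodate the 50-100 examples mentioned above rather than the handful a short context allows.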
Checklist: Do You Understand This?
- Why is a 1M-token context window not a direct replacement for RAG in production?
- At what query volume does the cost difference between long context and RAG become significant?
- What is the "lost in the middle" problem and how does RAG avoid it?
- Name three scenarios where long context is the better choice over RAG.
- Describe the hybrid pattern: when do you use RAG, and when do you escalate to long context?
- Roughly how much more expensive is a 1M-token query than an equivalent RAG query at scale?