🧠 All Things AI
Intermediate

Long Context vs RAG

Gemini 2.5 Pro supports 1 million tokens. Llama 4 Scout supports 10 million. With windows this large, why not just stuff your entire document corpus into the prompt and skip RAG entirely? The answer comes down to cost, latency, accuracy, and scale. This page gives you a rigorous framework for choosing between them — or combining them.

The Core Choice

Workloads range from prototypes and one-off tasks (<50 docs) and static small corpora, through dynamic or large corpora, up to high-volume production (>10K queries/day). The two approaches sit at opposite ends of that range:

  • Long context (stuffing): simple, no infrastructure — but expensive and slow at scale
  • RAG (retrieval-augmented): cheap and fast at scale — but requires retrieval infrastructure

Long context wins for prototypes; RAG wins for production at scale

Cost: The 1,250× Difference

The most decisive factor at scale is cost per query:

| Approach | Query cost (1M-token context) | Query cost (RAG, ~3K context) | Ratio |
|---|---|---|---|
| Gemini 2.5 Pro | ~$0.075 (input) + generation | ~$0.00022 (3K tokens) | ~340× cheaper with RAG |
| GPT-4o | ~$2.50 per 1M input tokens | ~$0.0075 (3K tokens) | ~333× cheaper with RAG |
| Claude Sonnet | ~$3.00 per 1M input tokens | ~$0.009 (3K tokens) | ~333× cheaper with RAG |
| Optimised RAG (with cache) | N/A | ~$0.00008 (semantic cache hit) | Up to 1,250× cheaper vs 1M-context |

At 10,000 queries per day, the difference between stuffing a 1M-token context and using RAG is roughly $750/day versus about $2.25/day with Gemini 2.5 Pro (around $0.80/day with semantic caching) — a roughly $270,000/year difference from a single architectural choice.
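The crossover arithmetic above can be reproduced in a few lines. The price constant is this page's illustrative Gemini 2.5 Pro input figure, not live pricing, and generation (output-token) cost is deliberately excluded:

```python
# Cost crossover between long-context stuffing and RAG, using this page's
# illustrative Gemini 2.5 Pro input price (not live pricing).
PRICE_PER_M_INPUT = 0.075  # USD per 1M input tokens (figure from the table above)

def query_cost(context_tokens: int) -> float:
    """Input-token cost of a single query (generation cost excluded)."""
    return PRICE_PER_M_INPUT * context_tokens / 1_000_000

long_ctx = query_cost(1_000_000)  # stuff the whole corpus: ~$0.075/query
rag = query_cost(3_000)           # ~3K tokens of retrieved chunks

daily_queries = 10_000
daily_diff = daily_queries * (long_ctx - rag)
print(f"long context: ${long_ctx:.5f}/query, RAG: ${rag:.6f}/query")
print(f"difference at {daily_queries:,} queries/day: "
      f"${daily_diff:,.2f}/day, ${daily_diff * 365:,.0f}/year")
```

Swapping in your own model's per-token price and average retrieved-context size gives the crossover point for your workload.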

Latency: The 45-Second Wall

| Approach | Time to first token | Total response time |
|---|---|---|
| 1M-token context (Gemini 2.5 Pro) | 20–45 seconds | 45–90 seconds |
| 100K-token context (GPT-4o) | 5–15 seconds | 15–30 seconds |
| Basic RAG (3K injected context) | 1–2 seconds | 3–8 seconds |
| Optimised RAG (with reranking + cache) | 0.5–1 second | 2–5 seconds |

A 45-second wait is acceptable for batch document processing. It is not acceptable for a conversational chatbot or any real-time interaction. Latency alone rules out long-context stuffing for interactive applications.

Accuracy: The Lost-in-the-Middle Problem

Counterintuitively, larger context does not always mean better accuracy on the information within that context. The "lost in the middle" problem is well-documented:

  • Gemini 2.5 Pro at 1M tokens: accuracy drops to approximately 77% on RULER (long-context retrieval benchmark) — even though it achieves 98%+ on short contexts
  • Most frontier models: accuracy degrades to 65–70% when relevant information is in the middle of a very long window
  • RAG with top-k retrieval: surfaces only the most relevant 3–10 chunks, keeping context tight and precise — typical domain accuracy of 85–95%

Why the middle gets lost

LLMs attend more strongly to tokens at the beginning and end of the context window. Information buried in the middle of a 1M-token window competes with hundreds of thousands of other tokens for attention. RAG sidesteps this entirely — it surfaces only the relevant content into a short context, so nothing is buried in a deep middle.
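This degradation is typically measured with "needle in a haystack" probes: plant a known fact at a chosen depth in filler text and ask the model to retrieve it. A minimal sketch of the prompt construction (the filler sentence and passcode question are placeholders; benchmarks like RULER use far more varied material):

```python
def build_needle_prompt(needle: str, depth: float, filler_sentences: int = 1000) -> str:
    """Insert `needle` at relative position `depth` (0.0 = start, 1.0 = end)
    inside a long run of filler text, for a retrieval probe."""
    filler = ["The sky was grey and the harbour was quiet that morning."] * filler_sentences
    position = int(depth * len(filler))
    context = filler[:position] + [needle] + filler[position:]
    question = "What is the secret passcode mentioned in the text?"
    return " ".join(context) + "\n\n" + question

# Probe the middle of the window, where accuracy degrades most:
prompt = build_needle_prompt("The secret passcode is 7421.", depth=0.5)
```

Sweeping `depth` from 0.0 to 1.0 against a model and plotting retrieval accuracy is how the characteristic "U-shaped" attention curve is usually demonstrated.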

Decision Framework

1. Is your corpus static and under ~50 documents?
   Yes → long context is fine. No infrastructure needed, acceptable cost at low volume.

2. Do you need sub-2-second latency?
   Yes → RAG is required. Long context is too slow for interactive applications.

3. Will this run more than ~100 queries per day?
   Yes → RAG is almost certainly cheaper. Calculate the cost crossover for your token count.

4. Does your corpus update frequently?
   Yes → RAG. Re-indexing is fast; stuffing a changing corpus into every prompt is unmanageable.

5. Do you need source citations?
   Yes → RAG. Long-context generation cannot reliably attribute which part of the context informed the answer.
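The five questions collapse into a small routing function. This is a sketch of the framework only — the thresholds (~50 docs, ~100 queries/day) are the rough ones used on this page, and hard RAG requirements are checked first:

```python
def recommend(
    needs_sub_2s_latency: bool,
    queries_per_day: int,
    corpus_updates_frequently: bool,
    needs_citations: bool,
    corpus_is_static: bool,
    corpus_docs: int,
) -> str:
    """Apply the five-question framework; hard RAG requirements win."""
    if needs_sub_2s_latency or corpus_updates_frequently or needs_citations:
        return "RAG"
    if queries_per_day > 100:
        return "RAG"  # almost certainly cheaper; confirm with a cost crossover
    if corpus_is_static and corpus_docs < 50:
        return "long context"
    return "RAG"  # default when no long-context case is made

# Weekend prototype over 30 static docs:
recommend(False, 20, False, False, True, 30)        # -> "long context"
# Customer support bot on a large, changing knowledge base:
recommend(True, 10_000, True, True, False, 5_000)   # -> "RAG"
```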

| Scenario | Recommended approach | Reason |
|---|---|---|
| Prototype — exploring a 30-doc dataset this week | Long context | Zero infrastructure, fastest time to insight |
| Customer support bot on 5,000-page knowledge base | RAG | Volume, latency, cost, update frequency all favour RAG |
| Reviewing a single 300-page legal contract once | Long context | One-off task; setup cost of RAG not justified |
| Compliance Q&A across 10,000 contracts | RAG (+ GraphRAG for cross-doc) | Scale, latency, cost, citation requirements |
| Nightly batch analysis of 1,000 reports | Long context acceptable | No latency constraint; cost manageable if one-off per doc |
| Real-time research assistant (>1K queries/day) | RAG | Cost and latency; long context impractical at volume |

The Hybrid Pattern: Long Context + RAG Together

The best-performing production systems often combine both:

  • Query intent classification: classify each query as simple factual vs complex cross-doc
  • Simple query → RAG: fast, cheap path for ~80% of queries
  • Complex query → long context: retrieve top-N docs, stuff full text for synthesis
  • Very complex → agentic RAG: multi-hop retrieval with reasoning loop

Route by query complexity — use long context only where retrieval precision is insufficient
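The routing layer can be sketched as follows. Production systems usually classify intent with a small, cheap LLM call; the keyword heuristic here is only a stand-in for that classifier:

```python
def classify(query: str) -> str:
    """Toy stand-in for an LLM-based intent classifier."""
    q = query.lower()
    if any(w in q for w in ("compare", "synthesise", "across", "summarise all")):
        return "complex"
    if any(w in q for w in ("trace", "step by step", "chain")):
        return "very_complex"
    return "simple"

def route(query: str) -> str:
    """Dispatch each query to the cheapest path that can answer it."""
    return {
        "simple": "rag",                # fast, cheap path for most queries
        "complex": "long_context",      # retrieve top-N docs, stuff full text
        "very_complex": "agentic_rag",  # multi-hop retrieval with reasoning loop
    }[classify(query)]

route("What is our refund window?")                    # -> "rag"
route("Compare termination clauses across contracts")  # -> "long_context"
```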

Practical hybrid approach

  • Use RAG to retrieve the top 5–10 most relevant documents
  • For synthesis queries that require deep reasoning across those documents, stuff the full text of the retrieved documents into a long-context call (now manageable — 5–10 docs is 20K–50K tokens, not 1M)
  • Cache common RAG results to cut cost further
  • Reserve true 1M-token calls for genuine all-corpus analysis tasks that justify the cost
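The caching step can be sketched like this. A production semantic cache matches on embedding similarity; exact matching on a normalised query string, as here, is the simplest approximation and the `answer_fn` is a placeholder for the full RAG pipeline:

```python
from typing import Callable

class QueryCache:
    """Exact-match stand-in for a semantic cache over RAG answers."""

    def __init__(self, answer_fn: Callable[[str], str]):
        self.answer_fn = answer_fn       # placeholder for retrieve + generate
        self.store: dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def normalise(query: str) -> str:
        """Collapse case and whitespace so trivial variants share a key."""
        return " ".join(query.lower().split())

    def ask(self, query: str) -> str:
        key = self.normalise(query)
        if key in self.store:
            self.hits += 1               # cached path: ~$0.00008/query above
            return self.store[key]
        answer = self.answer_fn(key)     # full RAG path
        self.store[key] = answer
        return answer

cache = QueryCache(answer_fn=lambda q: f"answer for: {q}")
cache.ask("What is the refund window?")   # miss: runs the full RAG path
cache.ask("what is the  refund window?")  # hit: same normalised key
```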

When Long Context Genuinely Wins

Use cases where long context is the right choice:

  • Full codebase analysis: analysing or refactoring an entire repo (100K–500K tokens) where relationships across all files matter — RAG would fragment the context.
  • Book or long-document Q&A (one-off): a single 300-page document analysed once — the setup cost of RAG is not justified.
  • Multi-document synthesis (small set): synthesising insights across 10–20 long documents into a single output — RAG would miss cross-document relationships.
  • Few-shot with many examples: tasks where providing 50–100 examples dramatically improves output quality — long context enables large example libraries.

Checklist: Do You Understand This?

  • Why is a 1M-token context window not a direct replacement for RAG in production?
  • At what query volume does the cost difference between long context and RAG become significant?
  • What is the "lost in the middle" problem and how does RAG avoid it?
  • Name three scenarios where long context is the better choice over RAG.
  • Describe the hybrid pattern: when do you use RAG, and when do you escalate to long context?
  • Roughly how much more expensive is a 1M-token query than an equivalent RAG query at scale?