When to Use RAG
RAG is not the right answer to every knowledge problem, and building unnecessary retrieval infrastructure is one of the most common over-engineering mistakes in AI systems. This page gives you a decision framework: the specific signals that tell you RAG is warranted, the alternatives and when each wins, and the 2026 production pattern that most mature teams converge on.
The Wrong Question and the Right One
Most teams frame the decision as "RAG or fine-tuning?", but that is the wrong question. The right question is: what is the failure mode you are trying to fix?
| If your chatbot fails because… | Reach for… |
|---|---|
| It doesn't know your company's policies, products, or documents | RAG |
| Its knowledge is stale: things changed since training | RAG |
| It cannot cite sources or verify claims | RAG |
| Its tone is wrong, format is inconsistent, or it ignores instructions | Fine-tuning or stronger system prompt |
| It uses the wrong vocabulary or terminology for your domain | Fine-tuning |
| Your entire knowledge base fits in a single prompt (<50 documents) | Long context or prompt caching |
| It gives good answers but is too slow or too expensive at scale | Semantic caching, smaller model, or quantisation |
| It refuses valid requests or outputs wrong classifications | Fine-tuning or prompt engineering |
Identify the failure mode first; the right tool follows from that diagnosis.
RAG fixes knowledge gaps. Fine-tuning fixes behaviour gaps. Conflating them leads to systems that retrieve perfectly but still output in the wrong format, or systems that are beautifully tuned but confidently hallucinate proprietary facts they were never trained on.
Signals That Tell You RAG Is Right
Use RAG when three or more of these signals are true for your use case:
Strong RAG signals
- Corpus >200 documents: smaller corpora often fit in a single long-context prompt, eliminating the need for retrieval infrastructure
- Knowledge updates frequently: weekly, daily, or in real time (news, pricing, regulations, support docs); re-indexing is far cheaper than retraining
- Hallucinations are unacceptable: regulated industries (healthcare, legal, finance), customer-facing support, any system where wrong facts cause harm or erode trust
- Source citations required: compliance, audit trails, users who need to verify claims independently
- Proprietary or private data: internal docs, customer records, codebases that are not and cannot be in a model's training data
- High query volume at scale: the cost advantage of RAG over long context grows rapidly; at 1M queries/day the difference is orders of magnitude
RAG is probably overkill when
- Your corpus is small and static: under ~50 documents that rarely change; just use a long-context model or prompt caching
- The model already knows the domain: general coding assistants, writing tools, math tutors; parametric knowledge is sufficient
- The failure mode is behaviour, not facts: wrong tone, wrong format, wrong classification; retrieval cannot fix these
- You're in early prototyping: validate the core use case with long context first, and add retrieval infra only when you have evidence it's needed
- You need sub-200ms latency: retrieval adds 300ms–1s at minimum; for real-time applications consider serving from cache or a local model
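The signal lists above can be sketched as a rough scoring heuristic. Everything below (type names, field names, thresholds) is illustrative, not a prescribed API; tune the thresholds to your own corpus and traffic:

```python
from dataclasses import dataclass

@dataclass
class UseCase:
    """Illustrative inputs for the RAG-vs-alternatives heuristic."""
    corpus_docs: int              # number of documents in the knowledge base
    updates_per_month: float      # how often the corpus changes
    citations_required: bool      # compliance / audit-trail needs
    proprietary_data: bool        # content absent from model training data
    queries_per_day: int          # expected production volume
    hallucination_tolerance: str  # "low" | "medium" | "high"

def recommend(uc: UseCase) -> str:
    """Count the strong RAG signals from the lists above; three or more
    point at RAG, otherwise prefer something simpler."""
    signals = sum([
        uc.corpus_docs > 200,
        uc.updates_per_month >= 4,           # weekly or faster
        uc.hallucination_tolerance == "low",
        uc.citations_required,
        uc.proprietary_data,
        uc.queries_per_day > 100_000,
    ])
    if signals >= 3:
        return "RAG"
    if uc.corpus_docs < 50 and uc.updates_per_month < 1:
        return "long context or prompt caching"
    return "start with long context; revisit RAG with evidence"

# A regulated support bot over 5,000 weekly-updated docs trips
# several signals at once:
bot = UseCase(5000, 4, True, True, 250_000, "low")
print(recommend(bot))  # -> RAG
```

The point is not the exact weights but the shape of the decision: count knowledge-side signals first, and fall back to simpler options when few of them fire.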
RAG vs the Alternatives
RAG vs Long Context (1M+ token windows)
Gemini 2.5 Pro (1M tokens), Llama 4 Scout (10M tokens), and Claude 3.7 Sonnet (200K tokens) make it tempting to abandon retrieval altogether and just send everything. The numbers tell a different story at scale:
| Dimension | Long context | RAG |
|---|---|---|
| Cost per query (1M tokens) | ~$0.10 | ~$0.00008 (optimised) |
| Latency | 45–60s | 500ms–1.5s |
| Accuracy at full load | 65–77% (degrades: "lost in the middle") | 85–95% (tight, relevant context) |
| Knowledge updates | Rewrite the prompt | Re-index the document |
| Best for | Prototyping; single large static document; one-off analysis | Production; large/dynamic corpora; high query volume |
The "lost in the middle" problem is real: LLMs systematically underweight information in the middle of very long contexts. RAG sidesteps this by surfacing only the 3β10 most relevant chunks β keeping the injected context short, precise, and at the top of the prompt.
RAG vs Prompt Caching
For small, static knowledge bases (under ~200K tokens, updated infrequently), prompt caching is a cheaper and simpler alternative to retrieval. Both Anthropic and OpenAI offer prompt caching: repeated system-prompt prefixes are computed once and cached, reducing both latency and input token costs by 50–90%.
Use prompt caching when:
- Knowledge base fits in ~50–200K tokens
- Content is stable (weekly or slower updates)
- You want zero retrieval infrastructure
- Internal copilot with modest, known doc set
Switch to RAG when:
- Corpus grows beyond ~200K tokens
- Documents update daily or faster
- You need per-query source citations
- Query volume makes full-context expensive
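As one hedged illustration of the caching route: Anthropic's Messages API lets you mark a long, static system prefix as cacheable with a `cache_control` block, so repeat requests reuse the computed prefix. The payload shape below reflects the documented API at the time of writing; the model name and handbook placeholder are illustrative, so verify against current docs before relying on it:

```python
# Request body for Anthropic's Messages API with prompt caching: the
# large, static knowledge base rides in the system prompt and is marked
# cacheable, so repeat requests skip recomputing that prefix.
request_body = {
    "model": "claude-sonnet-4-5",  # illustrative model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support assistant. Answer only from the handbook below.",
        },
        {
            "type": "text",
            "text": "<the full static handbook goes here>",
            "cache_control": {"type": "ephemeral"},  # cache up to this block
        },
    ],
    "messages": [
        {"role": "user", "content": "How do I reset my password?"},
    ],
}
```

Note what is absent: no chunking, no embeddings, no vector store. That simplicity is exactly why this route wins for small, stable corpora.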
RAG vs Fine-Tuning
Fine-tuning bakes knowledge and behaviour into model weights. RAG injects knowledge at query time. They are not competing solutions; they fix different problems:
| Dimension | Fine-tuning wins | RAG wins |
|---|---|---|
| Latency | Sub-second; no retrieval step | Adds 300ms–1.5s for retrieval |
| Knowledge freshness | Frozen at training time | Updated on re-index; no retraining |
| Behaviour consistency | Model internalises tone, format, policy | Cannot change model behaviour |
| Source citations | Cannot cite; knowledge is opaque | Every answer traceable to a document |
| Corpus size | Scales: becomes model knowledge | Scales: index grows without retraining |
| Update cost | Full re-run (hours, $100s–$1,000s) | Re-index affected documents (minutes) |
| Best for | Tone, classification, structured output, domain vocabulary, policy adherence | Dynamic/proprietary knowledge, citations, regulated accuracy |
Fine-tuning a legal model will likely outperform both plain prompting and RAG on legal question-answering benchmarks, because the model has internalised domain terminology and reasoning patterns. But it still cannot tell you about a case filed last week.
RAG vs Pure Generation (no knowledge augmentation)
For tasks that rely entirely on the model's parametric knowledge (general coding, writing, math, brainstorming, summarisation of provided text) there is no retrieval needed. Adding RAG infrastructure to these use cases is pure overhead.
- Coding assistants (no private codebase): the model knows the language and frameworks
- Writing and editing tools: no external knowledge required
- Math and reasoning tasks: parametric knowledge is sufficient
- Summarising a document the user just pasted: the context is already in the prompt
- General Q&A on widely-known topics with low hallucination risk
Full Decision Framework
| Scenario | Approach | Rationale |
|---|---|---|
| Large dynamic knowledge base (1,000+ docs, updates weekly+) | RAG | Fresh knowledge without retraining; citations; scales cost-effectively |
| Small static corpus (<50 docs, rarely changes) | Long context or prompt caching | No retrieval infrastructure needed; simpler, cheaper |
| Medium static corpus (50β200K tokens, monthly updates) | Prompt caching | Fits in context; 50β90% cost reduction with caching; no retrieval overhead |
| Wrong tone / format / terminology (no knowledge gap) | Fine-tuning | RAG cannot change model behaviour; fine-tuning internalises style and policy |
| Regulated domain needing source citations | RAG | Every claim traceable to a retrieved document; audit trail required |
| High-volume FAQ bot (>100k queries/day) | RAG + semantic cache | 1,250× cheaper than long-context at scale; cache cuts repeat query costs further |
| Prototyping β validating a concept this week | Long context | No infra, instant iteration; migrate to RAG once you have evidence of scale |
| Sub-200ms latency requirement | Fine-tuning or prompt cache | Retrieval adds 300ms–1.5s minimum; infeasible for real-time use cases |
| Cross-document reasoning (compliance across 10,000 contracts) | GraphRAG or agentic RAG | Standard vector search can't reason across document relationships; needs graph or multi-hop |
| Mature production system: facts + consistent behaviour | RAG + fine-tuning (hybrid) | RAG handles knowledge freshness; fine-tuning handles style, policy, domain vocabulary |
The 2026 Production Pattern
For most production-grade AI systems in 2026, the consensus answer is a composable adaptation stack: each technique applied to the problem it is best suited to fix.
1. Prompt engineering. Most quality improvements come from a better system prompt. Free and instant; always do this first.
2. Long context. Prove the concept using full-document context. No infrastructure cost, fast iteration.
3. RAG. When the corpus is dynamic, large, proprietary, or citations are required. Add retrieval infra here, not before.
4. Fine-tuning. When behaviour is the bottleneck: tone, terminology, policy adherence, structured output. Fine-tuning + RAG outperform either alone.
5. Semantic caching. For cost at scale: 68.8% cost reduction on FAQ-style bots with 30–50% cache hit rates.
The 2025 LaRA benchmark confirmed there is no single best approach: the right choice depends on task type, model behaviour, context length, and retrieval quality. The practical implication: start simple, measure your actual failure modes, and add complexity only where evidence demands it.
Common Mistakes
Building RAG you don't need
- RAG for a 20-page handbook: stuff it in the system prompt and use prompt caching; no vector DB needed
- RAG to fix a tone problem: retrieval cannot change how the model writes; you need fine-tuning or a better system prompt
- RAG in prototype phase: validate product-market fit first; add infrastructure once you have evidence the product needs to scale
- Over-retrieving: high k values and wide semantic search hurt precision and inflate cost by up to 80%; retrieve less, retrieve better
Skipping RAG when you need it
- Hoping fine-tuning fixes hallucinations: baking stale facts into weights does not help when knowledge changes weekly; it just makes confident errors harder to update
- Using long context at scale: 1M-token queries cost ~$0.10 each; at 100K queries/day that is $10,000/day vs ~$8/day with optimised RAG
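The arithmetic behind those figures, using the per-query estimates quoted above (both are rough numbers; your own token counts and pricing will differ):

```python
# Rough daily-cost comparison at 100K queries/day, using the
# per-query estimates quoted in the text.
queries_per_day = 100_000
long_context_cost_per_query = 0.10   # ~1M tokens sent on every query
rag_cost_per_query = 0.00008         # optimised retrieval + short prompt

daily_long_context = queries_per_day * long_context_cost_per_query
daily_rag = queries_per_day * rag_cost_per_query
ratio = long_context_cost_per_query / rag_cost_per_query

print(f"${daily_long_context:,.0f}/day vs ${daily_rag:,.0f}/day "
      f"({ratio:,.0f}x cheaper)")
# prints: $10,000/day vs $8/day (1,250x cheaper)
```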
- No grounding enforcement: adding retrieval without a system prompt that instructs the model to answer only from context; the model still hallucinates
- Stale index: retrieval is only as good as the last re-index; a stale corpus becomes a hallucination anchor
2025β2026 Developments
- Llama 4 Scout at 10M tokens: the largest public context window as of early 2026 significantly expands the "use long context" zone, but the cost and accuracy arguments for RAG at scale remain unchanged.
- Prompt caching now standard: both Anthropic and OpenAI offer automatic prompt caching with 50–90% input token cost reductions, making the "small static corpus" case for RAG even weaker.
- Fine-tuning more accessible: OpenAI expanded fine-tuning controls, validation metrics, multimodal (vision) support, and workflow tooling in 2025; the barrier to the hybrid RAG + fine-tuning pattern has dropped significantly.
- Agentic RAG as the new baseline: agents that decide whether to retrieve (versus answering from parametric knowledge) reduce over-retrieval and handle multi-hop reasoning, addressing two of the biggest RAG cost and quality problems simultaneously.
- GraphRAG for structured corpora: for knowledge bases with strong entity relationships (contracts, regulations, org charts), Microsoft's GraphRAG approach achieves 99% precision on structured queries, at 3–5× the build cost of standard RAG; worth it only for cross-document relational reasoning.
Checklist: Do You Understand This?
- Can you state the core question to ask before choosing RAG vs alternatives?
- Can you name four signals that indicate RAG is the right choice for a system?
- Do you know when prompt caching is a better alternative to RAG, and what the size threshold roughly is?
- Can you explain why fine-tuning cannot fix a knowledge-freshness problem, and why RAG cannot fix a behaviour-consistency problem?
- Do you understand why the cost difference between long context and optimised RAG is roughly 1,250× at scale?
- Can you describe the five-step progression from prompt engineering → long context → RAG → fine-tuning → semantic caching?
- Can you identify the two most common mistakes: building RAG you don't need, and skipping RAG when you do?