🧠 All Things AI
Intermediate

Retrieval vs Generation

Every chatbot draws on two sources of knowledge: what the model learned during training (parametric memory) and what you inject into the prompt at runtime (non-parametric retrieval). The choice between relying on one or the other — or combining them — is one of the most consequential architectural decisions you will make. Get it wrong and you pay 1,250× too much in compute, hallucinate confidently on your own data, or build infrastructure you don't need. This page maps the full decision space and the tradeoffs at every level.

[Figure: a spectrum of use cases running from Parametric (Pure Generation), "knowledge in weights — fast, opaque, frozen at training cutoff", to Non-Parametric (RAG), "knowledge in corpus — accurate, citeable, always fresh". Creative/brainstorming and general reasoning & code sit at the parametric end; domain-specific Q&A and regulated/high-stakes use cases sit at the RAG end.]

Most production chatbots sit to the right — RAG for accuracy, generation for fluency

The Core Distinction

When an LLM answers a question it can draw on two fundamentally different sources:

| Dimension | Parametric memory (pure generation) | Non-parametric memory (retrieval / RAG) |
|---|---|---|
| Where knowledge lives | Encoded in model weights during training | External corpus — vector store, database, files — accessed at query time |
| How it is accessed | Implicit — model "remembers" by generating | Explicit — embed query → search → retrieve → inject into prompt |
| Currency | Frozen at training cutoff | As fresh as your last index update |
| Verifiability | Cannot cite sources — knowledge is opaque | Every answer traceable to a retrieved document |
| Hallucination risk | High (25–40% on domain-specific queries) | Lower (2–15% with proper grounding) |
| Latency | <100ms (no retrieval step) | 500ms–2s (retrieval adds overhead) |
| Cost per query | Low ($0.001–0.01) | Very low ($0.00008 optimized) — up to 1,250× cheaper than long-context |
| Updatability | Requires full retraining or fine-tuning | Add documents to index — no model change needed |
| Best for | General reasoning, code patterns, creative tasks, brainstorming | FAQ bots, customer support, domain-specific Q&A, regulated industries |

RAG (Retrieval-Augmented Generation) is the standard pattern that fuses both: retrieve relevant documents from an external store, inject them into the prompt as context, then let the model generate a grounded answer. Anthropic's Contextual Retrieval technique achieved a 49% reduction in failed retrievals, and a 67% reduction when combined with reranking — demonstrating how strongly retrieval quality drives final answer quality.

When to Use Each Approach

Use pure generation when

  • Queries test general reasoning, math, or coding — no private corpus needed
  • Answers don't require current or proprietary information
  • Latency is critical (sub-200ms requirement)
  • Hallucination is acceptable — brainstorming, creative drafts, code stubs
  • Dataset is small enough to fit in a single prompt (<50 documents)
  • Failure mode is wrong format or tone, not missing facts

Use RAG when

  • Your corpus has 1,000+ documents or updates frequently
  • Precision and source citations are required (legal, medical, regulated industries)
  • Data is proprietary and cannot be baked into model weights
  • Real-time or frequently updated data (news, regulations, product catalogs)
  • Response latency budget is 1–2 seconds (acceptable for most web apps)
  • Failure mode is outdated or missing facts — not just wrong format

The long-context temptation — and its limits

With Gemini 2.5 Pro at 1M tokens and Llama 4 Scout at 10M tokens, it is tempting to just stuff your entire corpus into the context window. Resist this for production systems:

  • Cost: A 1M-token query costs roughly $0.10; an equivalent RAG query costs $0.00008 — a 1,250× difference at scale
  • Latency: Full 1M-token generation takes 45–60 seconds; RAG returns in 500–800ms
  • Accuracy degrades: Gemini 2.5 Pro accuracy drops to ~77% at full 1M-token load; competitors hit 65–70%
  • Long context is useful for: Prototyping, one-off analysis, static corpora under ~100 documents
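
To make the cost bullet concrete, here is the arithmetic behind the 1,250× figure, using the per-query prices quoted above (a back-of-envelope sketch; real prices vary by model and provider):

```python
# Per-query costs quoted above (approximate; real pricing varies by provider)
LONG_CONTEXT_COST = 0.10       # ~$0.10 for one 1M-token query
RAG_COST = 0.00008             # ~$0.00008 for one optimized RAG query

ratio = LONG_CONTEXT_COST / RAG_COST
print(f"{ratio:,.0f}x")        # → 1,250x

# At a production volume of 100k queries/day, the gap compounds fast
queries_per_day = 100_000
monthly_long = LONG_CONTEXT_COST * queries_per_day * 30
monthly_rag = RAG_COST * queries_per_day * 30
print(f"${monthly_long:,.0f}/month vs ${monthly_rag:,.0f}/month")  # → $300,000/month vs $240/month
```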

[Diagram: the two answer paths compared.
Pure generation path: user query → LLM forward pass (parametric knowledge only) → response, with no source citations.
RAG path: user query → embed & search (vector + BM25 hybrid) → rerank (cross-encoder) → inject context (top-k chunks) → LLM generation → grounded answer + citations.]

RAG adds a retrieval step before generation — the quality of retrieval determines the quality of the answer

The RAG Pipeline

A standard RAG system has two phases: offline indexing and online query-time retrieval.

Phase 1 — Offline indexing

  1. Load: Ingest documents from files, APIs, databases, web scrapes
  2. Chunk: Split into segments (typically 256–512 tokens with overlap) — see the chunking page for full strategy
  3. Embed: Convert each chunk to a dense vector using an embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, Nomic-embed)
  4. Store: Write vectors + metadata to a vector database (Pinecone, Qdrant, pgvector, Weaviate)

Phase 2 — Query-time retrieval

  1. Embed query: Convert the user's question to a vector using the same model
  2. Search: Top-k approximate nearest-neighbour search in the vector store (typically k=3–10)
  3. Rerank (optional): Pass candidates through a cross-encoder reranker for precision (adds 200–400ms latency)
  4. Inject: Prepend retrieved chunks to the prompt as context
  5. Generate: LLM produces an answer grounded in the retrieved context
  6. Enforce grounding: System prompt instructs model to answer only from context and flag uncertainty
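
The two phases above can be sketched end to end in plain Python. The hashing "embedding" below is a toy stand-in for a real embedding model, and the final LLM call is left out; everything else mirrors the steps listed:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy feature-hashing "embedding" — a stand-in for a real model
    # such as text-embedding-3-large; not semantically aware
    v = [0.0] * dim
    for tok in text.lower().split():
        v[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def chunk(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    # Overlapping word windows (real systems chunk by tokens, 256–512 with overlap)
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size - overlap)]

# Phase 1 — offline indexing: load → chunk → embed → store
docs = {"doc-1": ("RAG retrieves relevant chunks from an external corpus and "
                  "injects them into the prompt so the model answers from "
                  "fresh citeable context instead of frozen weights")}
index = [{"vec": embed(c), "text": c, "doc_id": d, "chunk_id": i}
         for d, text in docs.items() for i, c in enumerate(chunk(text))]

# Phase 2 — query time: embed query → top-k search → inject → generate
def retrieve(query: str, k: int = 3) -> list[dict]:
    qv = embed(query)
    sim = lambda r: sum(a * b for a, b in zip(r["vec"], qv))  # cosine (unit vectors)
    return sorted(index, key=sim, reverse=True)[:k]

question = "where does RAG get its context?"
hits = retrieve(question)
context = "\n".join(f"[{h['doc_id']}#{h['chunk_id']}] {h['text']}" for h in hits)
prompt = ("Answer ONLY from the context below; say you don't know if it is "
          f"not covered.\n\nContext:\n{context}\n\nQuestion: {question}")
# `prompt` now goes to the LLM; the [doc#chunk] IDs become the citations
```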

Hybrid Retrieval Architectures

Basic single-hop vector search is insufficient for complex production requirements. The field has evolved toward several hybrid patterns:

Dense + Sparse (Hybrid Retrieval)

Combines vector semantic search with lexical keyword search (BM25, ElasticSearch). Dense retrieval captures conceptual similarity ("what does cloud computing cost?" finds pricing documents even without the word "cost"). Sparse excels at exact-term precision ("SKU-XJ-4421" must match exactly). Reciprocal Rank Fusion (RRF) merges the two result sets without needing a learned ensemble.

Supported natively in Pinecone, Weaviate; or via two-stack (vector DB + OpenSearch).
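
Reciprocal Rank Fusion itself is only a few lines: each result list contributes 1/(k + rank) per document, and the summed scores give the merged order (k = 60 is the commonly used constant):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    # score(d) = sum over result lists of 1 / (k + rank_of_d), rank starting at 1
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]    # semantic (vector) ranking
sparse = ["doc_c", "doc_a", "doc_d"]   # lexical (BM25) ranking
fused = rrf([dense, sparse])
print([d for d, _ in fused])           # → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear high in both lists (doc_a, doc_c) float to the top without any learned weighting — which is exactly why RRF needs no ensemble training.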

GraphRAG (Microsoft, 2025)

Replaces flat chunk retrieval with a knowledge graph: entities, relationships, and taxonomies are extracted from the corpus and stored in a graph database (Neo4j). At query time the system traverses relationships rather than searching for nearest vectors. This enables global/thematic queries that span many documents: "What are the compliance risks across all vendor contracts?" GraphRAG achieves 99% precision on structured queries but costs 3–5× more to build than baseline RAG. Use it when cross-document reasoning is your primary use case.

Agentic RAG (2025–2026 production standard)

An LLM agent decides whether retrieval is necessary at all — for simple factual questions well-covered by parametric knowledge, it skips the retrieval step entirely. For complex queries it plans multi-hop retrieval strategies, reflects on intermediate results, and re-queries with refined searches. This reduces over-retrieval (a major cost driver) and handles reasoning that spans many documents without bloating a single context window.

Emerging research: TreePS-RAG, RAG-Critic (ACL 2025), T-GRAG for temporal conflict resolution.
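
The core routing decision — retrieve or answer from weights — can be as simple as a classifier in front of the pipeline. The sketch below uses a crude keyword heuristic (the marker lists are hypothetical, chosen for illustration); production agentic systems typically ask a small, fast LLM to make this call instead:

```python
# Words suggesting the answer needs fresh or corpus-specific data (illustrative lists)
FRESHNESS_MARKERS = {"latest", "current", "today", "price", "policy"}

def needs_retrieval(query: str, corpus_terms: set[str]) -> bool:
    # Retrieve when the query references fresh data or corpus-specific terms;
    # otherwise let parametric knowledge answer and skip the retrieval cost
    words = set(query.lower().split())
    return bool(words & FRESHNESS_MARKERS) or bool(words & corpus_terms)

corpus_terms = {"sku-xj-4421", "acme", "warranty"}
print(needs_retrieval("what is the latest warranty change?", corpus_terms))  # → True
print(needs_retrieval("explain python decorators", corpus_terms))            # → False
```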

Speculative Retrieval

Overlaps the retrieval pipeline with early generation — the model begins producing tokens while retrieval is still in flight. Reduces time-to-first-token (TTFT) by 20–30% in high-throughput systems. Risk: speculative hallucinations if the retrieved context contradicts the early tokens. Requires a fallback/correction mechanism. Best for latency-sensitive applications with predictable query patterns.
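
A minimal sketch of the overlap, with stubbed-out retrieval and generation calls and a deliberately crude support check standing in for the real correction mechanism:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stubs standing in for the real retrieval pipeline and LLM calls
def retrieve(query: str) -> str:
    time.sleep(0.05)                          # simulated retrieval latency
    return "docs: the plan price rose to $30/mo in January"

def speculative_draft(query: str) -> str:
    time.sleep(0.02)                          # draft from parametric memory only
    return "The plan costs $25/mo."

def grounded_answer(query: str, context: str) -> str:
    return f"Per the {context}."

def answer(query: str) -> str:
    with ThreadPoolExecutor(max_workers=2) as pool:
        ctx_future = pool.submit(retrieve, query)               # retrieval in flight...
        draft = pool.submit(speculative_draft, query).result()  # ...while drafting
        context = ctx_future.result()
    # Correction step: keep the draft only if the retrieved context supports it.
    # (Substring check is a placeholder for a real entailment/verification model.)
    return draft if draft in context else grounded_answer(query, context)

print(answer("how much does the plan cost?"))
```

Here the stale speculative draft ($25) contradicts the retrieved context ($30), so the fallback regenerates a grounded answer — the safety valve the paragraph above describes.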

Cache-Augmented Generation

Semantic caching stores computed query embeddings and LLM responses. On a cache hit, the response is returned without any LLM call. Production workloads typically see 30–50% cache hit rates on FAQ-style bots, yielding up to a 68.8% cost reduction. Pair with retrieval for maximum efficiency: cache warm paths, retrieve cold paths.

Performance Tradeoffs

| Metric | Pure generation | Basic RAG | Optimized RAG | Long context (1M) |
|---|---|---|---|---|
| Latency | <100ms | 1–2s | 500–800ms | 45–60s |
| Cost per query | $0.001–0.01 | ~$0.002 | ~$0.00008 | ~$0.10 |
| Hallucination rate | 25–40% | 5–15% | 2–5% | 20–25% |
| Domain accuracy | 60–75% | 75–90% | 85–95% | 65–77% |
| Source citations | None | Yes | Yes | None |
| Corpus updatability | Retrain / fine-tune | Re-index documents | Re-index documents | Rewrite prompt |
| Infrastructure complexity | None beyond LLM API | Vector DB + embedding model | + reranker + cache + router | None beyond LLM API |

Long context accuracy degrades at scale because the "lost in the middle" problem is real — LLMs systematically underweight information in the middle of large windows. RAG sidesteps this by surfacing only the most relevant 3–10 chunks, keeping the context tight and precise.

Failure Modes

Pure generation failures

  • Knowledge cutoff: model confidently answers with stale information
  • Domain hallucination: invents product names, SKUs, regulations, people
  • Cannot cite: no verifiable source for any claim
  • Unstable policy: model behavior on edge cases varies between calls
  • Proprietary knowledge gap: anything not in training data is unknown

RAG failures (cascade effect)

  • Retrieval miss: wrong chunks retrieved — the highest single failure driver
  • Reranker miss: relevant document deprioritized behind noise
  • Context fragmentation: answer spans multiple chunks, model loses thread
  • Hallucination persistence: retrieved context doesn't prevent confabulation
  • Cascade failure: at 95% accuracy per layer, a 4-layer pipeline has only 81% end-to-end reliability (0.95⁴)
  • Stale corpus: outdated documents become hallucination anchors
  • Over-retrieval: enterprises overpay by up to 80% from excessive k values
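
The cascade arithmetic is worth internalizing: per-stage accuracies multiply, so even good stages compound into noticeable end-to-end loss. A few lines make it concrete:

```python
from functools import reduce

def pipeline_reliability(stage_accuracies: list[float]) -> float:
    # End-to-end reliability = product of per-stage accuracies
    return reduce(lambda acc, s: acc * s, stage_accuracies, 1.0)

# Four stages (e.g. retrieve, rerank, inject, generate) at 95% each
print(pipeline_reliability([0.95] * 4))   # ≈ 0.8145 → only ~81% end to end
# Raising each stage to 99% recovers most of the loss
print(pipeline_reliability([0.99] * 4))   # ≈ 0.9606
```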

Tools and Frameworks

| Tool | Role | Strengths | Latency (p99) |
|---|---|---|---|
| LlamaIndex | Data & retrieval framework | Purpose-built for indexing: data connectors, node parsers, vector/tree/graph/keyword indices, hybrid retrieval | 30ms @ 1,000 RPS |
| LangChain | Orchestration & agents | Agent loops, tool calling, memory management, chain composition — typically layers above LlamaIndex | 45ms @ 1,000 RPS |
| Pinecone | Vector database | Native dense + sparse hybrid retrieval, serverless, production-grade | <50ms |
| Qdrant | Vector database | Open-source, on-premise option, payload filtering, sparse vector support | <40ms |
| pgvector | Vector extension for Postgres | No new infrastructure if already on Postgres; SQL filtering; lower operational overhead | 50–100ms |
| GraphRAG (Microsoft) | Knowledge graph retrieval | 99% precision on structured/relational queries; global thematic queries | 300–800ms |
| Cohere Rerank | Cross-encoder reranker | Boosts precision after retrieval; 67% fewer failed retrievals combined with contextual retrieval | +200–400ms |

Decision Framework

| Scenario | Approach | Rationale |
|---|---|---|
| Customer support on product docs (1,000+ pages, updates monthly) | RAG — hybrid dense+sparse | Accuracy, citations, updatability without retraining |
| Code generation assistant (no private codebase) | Pure generation | Low latency, no private corpus needed, model trained well on code |
| Compliance & audit across 10,000 contracts | Agentic RAG + GraphRAG | Cross-document reasoning, audit trail, regulatory traceability |
| Real-time news or market intelligence bot | RAG with frequent re-indexing | Parametric knowledge is stale; fresh corpus required |
| Prototyping — exploring a 50-doc dataset this week | Long context (1M+ tokens) | Simplest path; no indexing overhead; acceptable for one-off |
| Cost-sensitive high-volume FAQ bot (>100k queries/day) | RAG + semantic cache + router | 68.8% cost reduction from cache; router avoids over-retrieval |
| Multi-document research assistant (reasoning across sources) | Agentic RAG with multi-hop | Agent plans sequential retrievals; handles cross-document reasoning |

The 2026 Production Pattern

For most production chatbot systems in 2026, the consensus architecture is a dual pipeline: RAG handles factual accuracy and knowledge freshness, fine-tuning (or strong system-prompt engineering) handles style, tone, and policy behavior. Neither alone is sufficient:

  • RAG without style tuning produces accurate but off-brand responses
  • Fine-tuning without RAG produces fluent but hallucinating responses
  • The combination achieves both factual grounding and behavioural consistency

A typical production stack:

Query → Semantic cache → (cache hit: return immediately)
  ↓ cache miss
Intent router → decides: retrieve or answer from model
  ↓ retrieve path
Embed query → Hybrid search (dense + sparse) → Reranker
  → Inject top-k chunks → LLM (fine-tuned or system-prompted)
  → Grounding check → Citation extraction → Response

2025–2026 Developments

  • Standard RAG declared insufficient: Basic single-hop vector search is no longer competitive for complex enterprise queries. Agentic retrieval, hybrid dense+sparse, and GraphRAG are the new baselines for serious production systems.
  • Contextual Retrieval (Anthropic): Adding document-level context to each chunk before embedding — a low-cost change that cut retrieval failures by 49%, or 67% combined with reranking. Now considered a standard best practice.
  • Llama 4 Scout at 10M tokens: The largest publicly available context window in early 2026. Changes the calculus for very long static documents but doesn't change the cost argument for dynamic, high-volume production systems.
  • RAG as knowledge runtime: The enterprise trajectory for 2026–2030 is RAG evolving from a retrieval technique into a full knowledge runtime — unified orchestration of retrieval, verification, reasoning, access control, and audit trails.
  • Evaluation tooling maturing: RAGAS, LangWatch, and Arize have standardised how teams measure retrieval quality, not just generation quality — recall@k, context precision, answer faithfulness, and end-to-end cascade reliability are now table stakes metrics for any RAG deployment.

Checklist: Do You Understand This?

  • Can you explain why a 1M-token context window is not a replacement for RAG in production?
  • Do you know the six stages of a RAG pipeline (load → chunk → embed → store → retrieve → generate)?
  • Can you describe the cascade failure problem and why a 4-layer pipeline with 95% per-layer accuracy yields only 81% end-to-end reliability?
  • Do you know when to reach for GraphRAG vs standard vector retrieval?
  • Can you name two hybrid retrieval strategies and what problem each solves (dense+sparse vs speculative)?
  • Do you understand why semantic caching can cut RAG costs by up to 68.8%?
  • Can you choose the right approach from the decision framework table for a given scenario?