Retrieval vs Generation
Every chatbot draws on two sources of knowledge: what the model learned during training (parametric memory) and what you inject into the prompt at runtime (non-parametric retrieval). The choice between relying on one or the other — or combining them — is one of the most consequential architectural decisions you will make. Get it wrong and you either overpay for compute by up to 1,250× per query, hallucinate confidently on your own data, or build infrastructure you don't need. This page maps the full decision space and the tradeoffs at every level.
Most production chatbots combine the two: RAG for accuracy, generation for fluency.
The Core Distinction
When an LLM answers a question it can draw on two fundamentally different sources:
| Dimension | Parametric memory (pure generation) | Non-parametric memory (retrieval / RAG) |
|---|---|---|
| Where knowledge lives | Encoded in model weights during training | External corpus — vector store, database, files — accessed at query time |
| How it is accessed | Implicit — model "remembers" by generating | Explicit — embed query → search → retrieve → inject into prompt |
| Currency | Frozen at training cutoff | As fresh as your last index update |
| Verifiability | Cannot cite sources — knowledge is opaque | Every answer traceable to a retrieved document |
| Hallucination risk | High (25–40% on domain-specific queries) | Lower (2–15% with proper grounding) |
| Latency | <100ms (no retrieval step) | 500ms–2s (retrieval adds overhead) |
| Cost per query | Low ($0.001–0.01) | Very low ($0.00008 optimized) — up to 1,250× cheaper than long-context |
| Updatability | Requires full retraining or fine-tuning | Add documents to index — no model change needed |
| Best for | General reasoning, code patterns, creative tasks, brainstorming | FAQ bots, customer support, domain-specific Q&A, regulated industries |
RAG (Retrieval-Augmented Generation) is the standard pattern that fuses both: retrieve relevant documents from an external store, inject them into the prompt as context, then let the model generate a grounded answer. Anthropic's Contextual Retrieval technique achieved a 49% reduction in failed retrievals and a 67% reduction with reranking — demonstrating how much the retrieval quality drives the final answer quality.
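The fusion step is mostly prompt construction. A minimal sketch of injecting retrieved chunks into a grounded prompt (the chunk texts here are toy examples, and the eventual LLM call is left out):

```python
# Minimal sketch of the RAG fusion pattern: retrieved chunks are injected
# into the prompt so every claim in the answer is traceable to a source.

def build_grounded_prompt(question, chunks):
    """Number each chunk so the model can cite it in the answer."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer ONLY from the context below. Cite chunk numbers. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy retrieved chunks — in practice these come from the vector store:
chunks = ["The Pro plan costs $49/month.", "Refunds are issued within 14 days."]
prompt = build_grounded_prompt("How much is the Pro plan?", chunks)
```

The prompt is then sent to the LLM as usual; the grounding instruction is what turns retrieval into lower hallucination rates.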
When to Use Each Approach
Use pure generation when
- Queries test general reasoning, math, or coding — no private corpus needed
- Answers don't require current or proprietary information
- Latency is critical (sub-200ms requirement)
- Hallucination is acceptable — brainstorming, creative drafts, code stubs
- Dataset is small enough to fit in a single prompt (<50 documents)
- Failure mode is wrong format or tone, not missing facts
Use RAG when
- Your corpus has 1,000+ documents or updates frequently
- Precision and source citations are required (legal, medical, regulated industries)
- Data is proprietary and cannot be baked into model weights
- Real-time or frequently updated data (news, regulations, product catalogs)
- Response latency budget is 1–2 seconds (acceptable for most web apps)
- Failure mode is outdated or missing facts — not just wrong format
The long-context temptation — and its limits
With Gemini 2.5 Pro at 1M tokens and Llama 4 Scout at 10M tokens, it is tempting to just stuff your entire corpus into the context window. Resist this for production systems:
- Cost: A 1M-token query costs roughly $0.10; an equivalent RAG query costs $0.00008 — a 1,250× difference at scale
- Latency: Full 1M-token generation takes 45–60 seconds; RAG returns in 500–800ms
- Accuracy degrades: Gemini 2.5 Pro accuracy drops to ~77% at full 1M-token load; competitors hit 65–70%
- Long context is useful for: Prototyping, one-off analysis, static corpora under ~100 documents
RAG adds a retrieval step before generation — the quality of retrieval determines the quality of the answer
The RAG Pipeline
A standard RAG system has two phases: offline indexing and online query-time retrieval.
Phase 1 — Offline indexing
- Load: Ingest documents from files, APIs, databases, web scrapes
- Chunk: Split into segments (typically 256–512 tokens with overlap) — see the chunking page for full strategy
- Embed: Convert each chunk to a dense vector using an embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, Nomic-embed)
- Store: Write vectors + metadata to a vector database (Pinecone, Qdrant, pgvector, Weaviate)
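The four indexing steps above can be sketched end to end. The embedder below is a deterministic toy stand-in (not semantic), used only so the example runs without an API key; production would call an embedding model such as text-embedding-3-large:

```python
# Sketch of the offline indexing phase: load → chunk → embed → store.
import hashlib

def chunk(text, size=400, overlap=50):
    """Split into overlapping windows (character-based here; token-based in practice)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk_text, dim=8):
    """Toy deterministic embedding — NOT semantic, illustration only."""
    digest = hashlib.sha256(chunk_text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

index = []  # stand-in for a vector database
doc = "Acme returns policy. " * 60  # a loaded document
for i, c in enumerate(chunk(doc)):
    index.append({
        "id": i,
        "vector": embed(c),
        "text": c,
        "meta": {"source": "policy.txt"},  # metadata enables filtering later
    })
```

The overlap between adjacent chunks preserves sentences that would otherwise be cut at a boundary.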
Phase 2 — Query-time retrieval
- Embed query: Convert the user's question to a vector using the same model
- Search: Top-k approximate nearest-neighbour search in the vector store (typically k=3–10)
- Rerank (optional): Pass candidates through a cross-encoder reranker for precision (adds 200–400ms latency)
- Inject: Prepend retrieved chunks to the prompt as context
- Generate: LLM produces an answer grounded in the retrieved context
- Enforce grounding: System prompt instructs model to answer only from context and flag uncertainty
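The query-time steps can be sketched with an exact nearest-neighbour scan (production systems use approximate search such as HNSW). The 2-D vectors are toy examples and the final LLM call is a placeholder:

```python
# Sketch of query-time retrieval: embed query → top-k search → inject.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=3):
    """Exact scan for clarity; real vector DBs use ANN indices."""
    ranked = sorted(index, key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return ranked[:k]

index = [
    {"vector": [1.0, 0.1], "text": "Pricing: the Pro plan is $49/month."},
    {"vector": [0.1, 1.0], "text": "Security: data is encrypted at rest."},
    {"vector": [0.9, 0.3], "text": "Billing: invoices are issued monthly."},
]
hits = top_k([1.0, 0.0], index, k=2)  # query embedding for a pricing question
context = "\n".join(h["text"] for h in hits)
# The context is then injected into the prompt and sent to the LLM.
```

An optional cross-encoder reranking pass would reorder `hits` before injection, at the cost of extra latency.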
Hybrid Retrieval Architectures
Basic single-hop vector search is insufficient for complex production requirements. The field has evolved toward several hybrid patterns:
Dense + Sparse (Hybrid Retrieval)
Combines vector semantic search with lexical keyword search (BM25, Elasticsearch). Dense retrieval captures conceptual similarity ("what does cloud computing cost?" finds pricing documents even without the word "cost"). Sparse excels at exact-term precision ("SKU-XJ-4421" must match exactly). Reciprocal Rank Fusion (RRF) merges the two result sets without needing a learned ensemble.
Supported natively in Pinecone and Weaviate, or via a two-stack setup (vector DB + OpenSearch).
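RRF itself is a few lines: each result list contributes 1/(k + rank) per document, so documents ranked well by either retriever rise to the top without any score calibration between the two systems. A sketch (k=60 is the commonly used constant):

```python
# Reciprocal Rank Fusion: merge dense and sparse rankings without training.

def rrf(rankings, k=60):
    """Each ranking is an ordered list of doc ids, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Lower rank → larger contribution; k damps the top-rank dominance.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]    # semantic hits
sparse = ["d7", "d3", "d9"]   # BM25 hits (e.g. an exact SKU match)
fused = rrf([dense, sparse])  # → ["d3", "d7", "d1", "d9"]
```

Documents appearing in both lists ("d3", "d7") outrank documents found by only one retriever, which is exactly the behaviour you want from a fusion step.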
GraphRAG (Microsoft, 2025)
Replaces flat chunk retrieval with a knowledge graph: entities, relationships, and taxonomies are extracted from the corpus and stored in a graph database (Neo4j). At query time the system traverses relationships rather than searching for nearest vectors. This enables global/thematic queries that span many documents: "What are the compliance risks across all vendor contracts?" GraphRAG achieves 99% precision on structured queries but costs 3–5× more to build than baseline RAG. Use it when cross-document reasoning is your primary use case.
Agentic RAG (2025–2026 production standard)
An LLM agent decides whether retrieval is necessary at all — for simple factual questions well-covered by parametric knowledge, it skips the retrieval step entirely. For complex queries it plans multi-hop retrieval strategies, reflects on intermediate results, and re-queries with refined searches. This reduces over-retrieval (a major cost driver) and handles reasoning that spans many documents without bloating a single context window.
Emerging research: TreePS-RAG, RAG-Critic (ACL 2025), T-GRAG for temporal conflict resolution.
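The core routing decision can be illustrated with a deliberately simple heuristic. Real agentic systems ask the LLM itself whether retrieval is needed; this keyword version is only an illustrative stand-in:

```python
# Sketch of the agentic routing decision: skip retrieval when parametric
# knowledge suffices, saving the retrieval cost on simple queries.

# Hypothetical markers that a query depends on the private corpus:
NEEDS_CORPUS = ("our ", "sku", "contract", "policy", "invoice", "internal")

def route(query):
    """Return 'retrieve' for corpus-dependent queries, 'generate' otherwise."""
    q = query.lower()
    if any(term in q for term in NEEDS_CORPUS):
        return "retrieve"
    return "generate"

# General knowledge → parametric memory is enough:
assert route("What is a binary search tree?") == "generate"
# Proprietary question → must hit the corpus:
assert route("What does our refund policy say?") == "retrieve"
```

In a production agent, the same decision point also chooses between single-hop retrieval and a planned multi-hop strategy.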
Speculative Retrieval
Overlaps the retrieval pipeline with early generation — the model begins producing tokens while retrieval is still in flight. Reduces time-to-first-token (TTFT) by 20–30% in high-throughput systems. Risk: speculative hallucinations if the retrieved context contradicts the early tokens. Requires a fallback/correction mechanism. Best for latency-sensitive applications with predictable query patterns.
Cache-Augmented Generation
Semantic caching stores computed query embeddings and LLM responses. On a cache hit, the response is returned without any LLM call. Production workloads typically see 30–50% cache hit rates on FAQ-style bots, yielding up to a 68.8% cost reduction. Pair with retrieval for maximum efficiency: cache warm paths, retrieve cold paths.
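A semantic cache differs from an exact-match cache in one way: lookup is a similarity comparison over query embeddings, not a key equality check. A minimal sketch with toy 2-D vectors and a linear scan (production caches index their entries):

```python
# Sketch of a semantic cache: return a stored answer when a new query's
# embedding is close enough to a previously answered one.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold  # similarity needed to count as a hit
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query_vec):
        for vec, response in self.entries:
            if cosine(query_vec, vec) >= self.threshold:
                return response     # cache hit: no retrieval, no LLM call
        return None                 # miss: fall through to the RAG pipeline

    def put(self, query_vec, response):
        self.entries.append((query_vec, response))

cache = SemanticCache()
cache.put([1.0, 0.0], "The Pro plan costs $49/month.")
hit = cache.get([0.99, 0.05])   # near-duplicate phrasing → hit
miss = cache.get([0.0, 1.0])    # unrelated query → miss
```

The threshold is the key tuning knob: too low and paraphrases of different questions collide; too high and the hit rate collapses.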
Performance Tradeoffs
| Metric | Pure generation | Basic RAG | Optimized RAG | Long context (1M) |
|---|---|---|---|---|
| Latency | <100ms | 1–2s | 500–800ms | 45–60s |
| Cost per query | $0.001–0.01 | ~$0.002 | ~$0.00008 | ~$0.10 |
| Hallucination rate | 25–40% | 5–15% | 2–5% | 20–25% |
| Domain accuracy | 60–75% | 75–90% | 85–95% | 65–77% |
| Source citations | None | Yes | Yes | None |
| Corpus updatability | Retrain / fine-tune | Re-index documents | Re-index documents | Rewrite prompt |
| Infrastructure complexity | None beyond LLM API | Vector DB + embedding model | + reranker + cache + router | None beyond LLM API |
Long context accuracy degrades at scale because the "lost in the middle" problem is real — LLMs systematically underweight information in the middle of large windows. RAG sidesteps this by surfacing only the most relevant 3–10 chunks, keeping the context tight and precise.
Failure Modes
Pure generation failures
- Knowledge cutoff: model confidently answers with stale information
- Domain hallucination: invents product names, SKUs, regulations, people
- Cannot cite: no verifiable source for any claim
- Unstable policy: model behavior on edge cases varies between calls
- Proprietary knowledge gap: anything not in training data is unknown
RAG failures (cascade effect)
- Retrieval miss: wrong chunks retrieved — the highest single failure driver
- Reranker miss: relevant document deprioritized behind noise
- Context fragmentation: answer spans multiple chunks, model loses thread
- Hallucination persistence: retrieved context doesn't prevent confabulation
- Cascade failure: at 95% accuracy per layer, a 4-layer pipeline has only 81% end-to-end reliability (0.95⁴)
- Stale corpus: outdated documents become hallucination anchors
- Over-retrieval: enterprises overpay by up to 80% from excessive k values
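The cascade arithmetic above is worth internalizing: per-layer reliability compounds multiplicatively, so even good individual components yield a noticeably worse end-to-end number.

```python
# End-to-end reliability of a pipeline with independent per-layer accuracy.
layers = 4       # e.g. retrieve → rerank → inject → generate
per_layer = 0.95
end_to_end = per_layer ** layers
print(f"{end_to_end:.4f}")  # 0.95^4 = 0.8145, i.e. ~81%
```

This is why improving the single weakest layer (usually retrieval) moves end-to-end reliability more than polishing an already-strong one.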
Tools and Frameworks
| Tool | Role | Strengths | Latency (p99) |
|---|---|---|---|
| LlamaIndex | Data & retrieval framework | Purpose-built for indexing: data connectors, node parsers, vector/tree/graph/keyword indices, hybrid retrieval | 30ms @ 1,000 RPS |
| LangChain | Orchestration & agents | Agent loops, tool calling, memory management, chain composition — typically sits a layer above LlamaIndex | 45ms @ 1,000 RPS |
| Pinecone | Vector database | Native dense + sparse hybrid retrieval, serverless, production-grade | <50ms |
| Qdrant | Vector database | Open-source, on-premise option, payload filtering, sparse vector support | <40ms |
| pgvector | Vector extension for Postgres | No new infrastructure if already on Postgres; SQL filtering; lower operational overhead | 50–100ms |
| GraphRAG (Microsoft) | Knowledge graph retrieval | 99% precision on structured/relational queries; global thematic queries | 300–800ms |
| Cohere Rerank | Cross-encoder reranker | Boosts precision after retrieval; 67% fewer failed retrievals combined with contextual retrieval | +200–400ms |
Decision Framework
| Scenario | Approach | Rationale |
|---|---|---|
| Customer support on product docs (1,000+ pages, updates monthly) | RAG — hybrid dense+sparse | Accuracy, citations, updatability without retraining |
| Code generation assistant (no private codebase) | Pure generation | Low latency, no private corpus needed, model trained well on code |
| Compliance & audit across 10,000 contracts | Agentic RAG + GraphRAG | Cross-document reasoning, audit trail, regulatory traceability |
| Real-time news or market intelligence bot | RAG with frequent re-indexing | Parametric knowledge is stale; fresh corpus required |
| Prototyping — exploring a 50-doc dataset this week | Long context (1M+ tokens) | Simplest path; no indexing overhead; acceptable for one-off |
| Cost-sensitive high-volume FAQ bot (>100k queries/day) | RAG + semantic cache + router | 68.8% cost reduction from cache; router avoids over-retrieval |
| Multi-document research assistant (reasoning across sources) | Agentic RAG with multi-hop | Agent plans sequential retrievals; handles cross-document reasoning |
The 2026 Production Pattern
For most production chatbot systems in 2026, the consensus architecture is a dual pipeline: RAG handles factual accuracy and knowledge freshness, fine-tuning (or strong system-prompt engineering) handles style, tone, and policy behavior. Neither alone is sufficient:
- RAG without style tuning produces accurate but off-brand responses
- Fine-tuning without RAG produces fluent but hallucinating responses
- The combination achieves both factual grounding and behavioural consistency
A typical production stack layers a router and a semantic cache in front of hybrid dense + sparse retrieval with reranking, with a style-tuned (or system-prompted) model generating the final grounded answer.
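One way to picture that dual pipeline is as a configuration outline. Every component name below is an example choice drawn from this page, not a prescription:

```python
# Illustrative shape of the dual-pipeline production stack; component
# choices are examples only, taken from the options discussed above.
production_stack = {
    "router": "agentic: decide retrieve vs. generate per query",
    "cache": {
        "type": "semantic",
        "expected_hit_rate": "30-50% on FAQ-style traffic",
    },
    "retrieval": {
        "dense": "vector DB (e.g. Pinecone, Qdrant, pgvector)",
        "sparse": "BM25 / OpenSearch",
        "fusion": "reciprocal rank fusion",
        "reranker": "cross-encoder (e.g. Cohere Rerank)",
    },
    "generation": {
        "model": "fine-tuned or system-prompted for tone and policy",
        "grounding": "answer only from retrieved context; flag uncertainty",
    },
}
```

RAG supplies the facts; the tuned model supplies the voice.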
2025–2026 Developments
- Standard RAG declared insufficient: Basic single-hop vector search is no longer competitive for complex enterprise queries. Agentic retrieval, hybrid dense+sparse, and GraphRAG are the new baselines for serious production systems.
- Contextual Retrieval (Anthropic): Adding document-level context to each chunk before embedding — a low-cost change that cut retrieval failures by 49%, or 67% combined with reranking. Now considered a standard best practice.
- Llama 4 Scout at 10M tokens: The largest publicly available context window in early 2026. Changes the calculus for very long static documents but doesn't change the cost argument for dynamic, high-volume production systems.
- RAG as knowledge runtime: The enterprise trajectory for 2026–2030 is RAG evolving from a retrieval technique into a full knowledge runtime — unified orchestration of retrieval, verification, reasoning, access control, and audit trails.
- Evaluation tooling maturing: RAGAS, LangWatch, and Arize have standardised how teams measure retrieval quality, not just generation quality — recall@k, context precision, answer faithfulness, and end-to-end cascade reliability are now table stakes metrics for any RAG deployment.
Checklist: Do You Understand This?
- Can you explain why a 1M-token context window is not a replacement for RAG in production?
- Do you know the six stages of a RAG pipeline (load → chunk → embed → store → retrieve → generate)?
- Can you describe the cascade failure problem and why a 4-layer pipeline with 95% per-layer accuracy yields only 81% end-to-end reliability?
- Do you know when to reach for GraphRAG vs standard vector retrieval?
- Can you name two hybrid retrieval strategies and what problem each solves (dense+sparse vs speculative)?
- Do you understand why semantic caching can cut RAG costs by up to 68.8%?
- Can you choose the right approach from the decision framework table for a given scenario?