Advanced RAG
Basic RAG — embed, search, include top-k chunks — works for many use cases, but retrieval quality degrades for vague queries, complex questions, and mismatches between query phrasing and document language. Advanced techniques address these failure modes systematically.
Hybrid Search: Dense + Sparse
Figure: Advanced retrieval pipeline (hybrid search + reranking)
Pure vector search often misses results when the query contains specific terms (product codes, names, technical jargon) that need exact matching. Hybrid search combines dense vector similarity with sparse keyword search (BM25) and merges the results.
- Dense retrieval (embeddings): Finds semantically similar content even with different wording
- Sparse retrieval (BM25/TF-IDF): Finds content with exact keyword matches — critical for codes, names, and technical terms
- Reciprocal Rank Fusion (RRF): The standard merging algorithm — combines rankings from both systems without requiring score normalisation
Weaviate and Qdrant have native hybrid search. For vector stores without built-in hybrid support, or for custom setups, maintain a parallel keyword index (for example Elasticsearch, which scores with BM25 by default) and merge results using RRF. Most production RAG systems benefit from hybrid search, so implement it if retrieval quality matters.
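The RRF merge step can be sketched in a few lines. This is a minimal illustration, not a production implementation: the document IDs are made up, and k = 60 is the constant commonly used with RRF, not a value taken from any particular vector store. Each document scores the sum of 1 / (k + rank) over every list it appears in.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Only rank positions are used (starting at 1), never the raw scores,
    which is why dense and sparse results need no score normalisation.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers, best match first.
dense_hits = ["doc_a", "doc_b", "doc_c"]   # embedding search
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # BM25 keyword search

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists (doc_a, doc_c) rise above documents that only one retriever found, which is the behaviour hybrid search relies on.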
HyDE: Hypothetical Document Embeddings
Queries and documents often land in different regions of the embedding space, because a question is phrased differently from its answer. HyDE bridges this gap by generating a hypothetical answer first, then using the answer (not the question) as the search query.
- User asks: "What is the refund policy for software subscriptions?"
- Ask Claude to generate a hypothetical answer: "Software subscriptions can be refunded within 30 days of purchase if the software is defective..."
- Embed the hypothetical answer and use it to search the index
- The hypothetical answer's embedding is closer to the actual policy document than the question's embedding
HyDE adds one LLM call, with its latency and cost, to the retrieval step. It often improves recall for factual Q&A over structured documents, although the gain depends on the hypothetical answer resembling the real one. Implement it when baseline retrieval misses relevant chunks for well-phrased questions.
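The HyDE flow above might look like the following sketch. The `generate` and `embed` callables are placeholders for an LLM call and an embedding model; their names and signatures are assumptions made for illustration, as is the in-memory `index` of (chunk ID, vector) pairs.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    question: str,
    generate: Callable[[str], str],   # hypothetical: LLM writes a plausible answer
    embed: Callable[[str], Vector],   # hypothetical: embedding model
    index: Sequence[Tuple[str, Vector]],
    top_k: int = 3,
) -> List[str]:
    """Search with the embedding of a generated answer instead of the question."""
    hypothetical_answer = generate(question)
    query_vec = embed(hypothetical_answer)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]
```

Passing the two models in as functions keeps the sketch independent of any specific SDK; in a real system `generate` would prompt the model for a short answer and `embed` would use the same embedding model that built the index.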
Reranking: Cross-Encoder Rescoring
Similarity search retrieves a candidate set (top-20 or top-50 chunks). A reranker then scores each candidate against the query more precisely — using a cross-encoder model that jointly considers both the query and the document, unlike the bi-encoder embedding approach used for retrieval.
- Retrieve broadly (top-20): High recall — don't miss the relevant chunk
- Rerank (cross-encoder): Score each candidate accurately — identify the truly relevant chunks
- Pass top-5 to Claude: High precision — Claude's context window contains only the best matches
Reranker options: cross-encoder/ms-marco-MiniLM-L-6-v2 (open-source, fast), Cohere Rerank API (managed), Jina Reranker API. The two-stage retrieve-then-rerank pattern is standard in production RAG — it combines the speed of vector search with the precision of cross-encoder scoring.
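A minimal sketch of the retrieve-then-rerank pattern follows. The `score_pair` callable is a stand-in for a cross-encoder (for example, one of the rerankers listed above wrapped in a function); the word-overlap scorer here is only a toy substitute so the example runs without a model, and it is not how a real cross-encoder scores.

```python
from typing import Callable, List, Sequence


def retrieve_then_rerank(
    query: str,
    candidates: Sequence[str],               # broad set from vector search, e.g. top-20
    score_pair: Callable[[str, str], float],  # hypothetical cross-encoder wrapper
    top_n: int = 5,
) -> List[str]:
    """Score every (query, candidate) pair jointly and keep the best top_n."""
    scored = [(score_pair(query, text), text) for text in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer: fraction of query words that appear in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


candidates = [
    "shipping times by region",
    "refund policy for software subscriptions",
    "office opening hours",
]
best = retrieve_then_rerank("refund policy software", candidates, overlap_score, top_n=1)
print(best)  # ['refund policy for software subscriptions']
```

Because the reranker scores each pair separately, its cost grows linearly with the candidate count, which is why the candidate set is kept to tens of chunks, not the whole index.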
Query Rewriting and Expansion
User queries are often ambiguous, short, or missing context. Use Claude to transform the query before retrieval:
- Query expansion: Generate multiple rephrasings of the same question, retrieve for each, merge results — improves recall for queries with ambiguous terminology
- Step-back prompting: Ask Claude to identify the broader concept behind the specific question, retrieve on the broader concept — useful for questions that require background context
- Decomposition: For complex multi-part questions, break into sub-questions, retrieve separately, combine results — enables answering questions that span multiple documents
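The retrieve-for-each-and-merge step shared by query expansion and decomposition can be sketched as below. The rewritten queries would come from an LLM call, which is left out here; `retrieve` is a hypothetical function returning chunk IDs best-first, and merging by best rank is one simple choice among several (RRF would also work).

```python
from typing import Callable, Dict, List, Sequence


def multi_query_retrieve(
    queries: Sequence[str],                      # rephrasings or sub-questions
    retrieve: Callable[[str], Sequence[str]],    # hypothetical: chunk IDs, best first
    top_k: int = 5,
) -> List[str]:
    """Retrieve for every query variant and merge, keeping each chunk's best rank."""
    best_rank: Dict[str, int] = {}
    for query in queries:
        for rank, chunk_id in enumerate(retrieve(query), start=1):
            if chunk_id not in best_rank or rank < best_rank[chunk_id]:
                best_rank[chunk_id] = rank
    # Lower rank is better; ties are broken by ID so output is deterministic.
    merged = sorted(best_rank, key=lambda chunk_id: (best_rank[chunk_id], chunk_id))
    return merged[:top_k]
```

Deduplicating by chunk ID matters here: different phrasings of the same question tend to return overlapping results, and passing duplicates on would waste context.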
Multi-Hop Retrieval
Some questions cannot be answered from a single retrieved chunk — the answer requires combining information across multiple documents or following a chain of references. Multi-hop retrieval handles this:
- Retrieve initial chunks based on the original query
- Ask Claude to identify what additional information is needed based on initial results
- Retrieve again using the identified information needs
- Repeat until Claude has sufficient context or a maximum hop count is reached
- Generate the final answer from all accumulated retrieved content
Multi-hop retrieval is more complex and adds latency. Use it for knowledge bases where answers inherently span multiple documents — regulatory Q&A, contract analysis, technical architecture questions.
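The loop described above can be sketched as follows. Both callables are assumptions for illustration: `retrieve` returns chunks for a query, and `next_query` stands in for the LLM step that inspects the accumulated chunks and either proposes a follow-up query or returns None when the context is sufficient.

```python
from typing import Callable, List, Optional, Sequence


def multi_hop_retrieve(
    question: str,
    retrieve: Callable[[str], Sequence[str]],
    next_query: Callable[[str, List[str]], Optional[str]],  # hypothetical LLM step
    max_hops: int = 3,
) -> List[str]:
    """Retrieve iteratively until no follow-up is needed or max_hops is reached."""
    collected: List[str] = []
    seen = set()
    query: Optional[str] = question
    for _ in range(max_hops):
        for chunk in retrieve(query):
            if chunk not in seen:       # keep each chunk once, in discovery order
                seen.add(chunk)
                collected.append(chunk)
        query = next_query(question, collected)
        if query is None:               # the model judges the context sufficient
            break
    return collected
```

The hop limit bounds both latency and cost, since each hop adds one retrieval plus one LLM call. The final answer is then generated from everything in the returned list.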
When to Apply Each Technique
- Always consider: Hybrid search — adds robustness for specific terms at low implementation cost
- High-value retrieval: Reranking — consistent quality improvement for production systems
- Poor baseline recall: HyDE or query expansion — when users find the system "doesn't find obvious things"
- Complex knowledge bases: Multi-hop — when answers require connecting multiple documents
Implement in order of impact. As a rule of thumb, the largest gains tend to come from (1) hybrid search, (2) reranking, (3) query rewriting, in that order. Add complexity only when retrieval quality metrics, such as recall@k on a labelled set of queries, justify it.
Checklist: Do You Understand This?
- Hybrid search: dense (embedding) + sparse (BM25) merged via RRF — handles both semantic and exact-match retrieval
- HyDE: generate hypothetical answer → embed answer → search with answer embedding — closes query-document vocabulary gap
- Reranking: retrieve broad set (top-20) → cross-encoder rescores → pass top-5 to Claude — high recall + high precision
- Query rewriting: expand, rephrase, or decompose the query before retrieval — improves coverage for ambiguous queries
- Multi-hop: chain retrievals for questions spanning multiple documents — adds latency, use for complex knowledge bases