Advanced RAG

Basic RAG — embed, search, include top-k chunks — works for many use cases, but retrieval quality degrades for vague queries, complex questions, and mismatches between query phrasing and document language. Advanced techniques address these failure modes systematically.

Hybrid Search: Dense + Sparse

Query → Dense Search (embedding similarity) → Sparse Search (BM25 keyword match) → RRF Merge (rank fusion) → Reranker (cross-encoder rescore) → Top-k Chunks

Figure: Advanced retrieval pipeline, hybrid search + reranking

Pure vector search often misses results when the query contains specific terms (product codes, names, technical jargon) that need exact matching. Hybrid search runs dense vector similarity and sparse keyword search (BM25) on the same query, then merges the two result lists.

Dense retrieval (embeddings): Finds semantically similar content even with different wording
Sparse retrieval (BM25/TF-IDF): Finds content with exact keyword matches — critical for codes, names, and technical terms
Reciprocal Rank Fusion (RRF): A widely used merging algorithm. It scores each document by summing 1 / (k + rank) over the rankings it appears in, so it needs only rank positions and no score normalisation

Weaviate and Qdrant have native hybrid search. For Pinecone or custom setups, one option is to maintain a parallel BM25 index (Elasticsearch, Typesense) and merge results using RRF. Most production RAG systems benefit from hybrid search, so implement it if retrieval quality matters.
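As a rough illustration of the merge step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of document IDs. The function name, the sample IDs, and the constant k=60 (a commonly used default) are assumptions for illustration, not taken from any particular library.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs using only rank positions.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant; 60 is a common default (an assumption here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical embedding-search results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that rank well in both lists (doc_a, doc_c) rise above documents that appear in only one, without the dense and sparse scores ever being compared directly.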

HyDE: Hypothetical Document Embeddings

Queries and documents often land in different regions of the embedding space, because a question is phrased differently from its answer. HyDE bridges this gap by generating a hypothetical answer first, then using the answer (not the question) as the search query.

User asks: "What is the refund policy for software subscriptions?"
Ask Claude to generate a hypothetical answer: "Software subscriptions can be refunded within 30 days of purchase if the software is defective..."
Embed the hypothetical answer and use it to search the index
The hypothetical answer's embedding is closer to the actual policy document than the question's embedding

HyDE adds one LLM call to the retrieval step, which costs extra latency. It often improves recall for factual Q&A over structured documents. Implement it when baseline retrieval misses relevant chunks for well-phrased questions.
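The HyDE flow can be sketched in a few lines. Everything here is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` is a hypothetical stand-in for the LLM call, and the two chunks are invented to make the effect visible.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hyde_search(question, chunks, generate_answer, top_k=1):
    """Search with the embedding of a hypothetical answer instead of the question."""
    hypothetical = generate_answer(question)  # the one extra LLM call
    query_vec = embed(hypothetical)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]

def fake_llm(question):
    # Hypothetical stand-in for an LLM call; returns a canned draft answer.
    return "Software subscriptions can be refunded within 30 days of purchase."

chunks = [
    "Subscriptions are refunded within 30 days of purchase on request.",
    "What is our policy on office software installs? Ask IT.",
]
question = "What is the refund policy for software subscriptions?"

hyde_top = hyde_search(question, chunks, fake_llm)
# Baseline for comparison: search with the question's own embedding.
question_top = sorted(chunks, key=lambda c: cosine(embed(question), embed(c)), reverse=True)[:1]
```

In this contrived setup the raw question matches the second chunk on surface wording ("what", "is", "policy", "software"), while the hypothetical answer matches the actual policy text. Real embedding models behave less crudely, so treat this as an illustration of the mechanism rather than a measurement.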

Reranking: Cross-Encoder Rescoring

Similarity search retrieves a candidate set (top-20 or top-50 chunks). A reranker then scores each candidate against the query more precisely — using a cross-encoder model that jointly considers both the query and the document, unlike the bi-encoder embedding approach used for retrieval.

Retrieve broadly (top-20): High recall — don't miss the relevant chunk
Rerank (cross-encoder): Score each candidate accurately — identify the truly relevant chunks
Pass top-5 to Claude: High precision — Claude's context window contains only the best matches

Reranker options: cross-encoder/ms-marco-MiniLM-L-6-v2 (open-source, fast), Cohere Rerank API (managed), Jina Reranker API. The two-stage retrieve-then-rerank pattern is standard in production RAG — it combines the speed of vector search with the precision of cross-encoder scoring.
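A minimal sketch of the retrieve-then-rerank pattern follows. The scorer is passed in as a callable so that any of the rerankers above could sit behind it; `toy_overlap_score` is a hypothetical word-overlap stand-in, not a cross-encoder, and the sample chunks are invented.

```python
def retrieve_then_rerank(query, candidates, score_pair, retrieve_k=20, final_k=5):
    """Two-stage retrieval: keep a broad shortlist, rescore each pair, keep the best.

    candidates: chunks already ordered by first-stage similarity, best first.
    score_pair: callable (query, chunk) -> float, standing in for a cross-encoder.
    """
    shortlist = candidates[:retrieve_k]
    rescored = sorted(shortlist, key=lambda chunk: score_pair(query, chunk), reverse=True)
    return rescored[:final_k]

def toy_overlap_score(query, chunk):
    # Hypothetical scorer: fraction of query words that appear in the chunk.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

first_stage = [
    "password policy overview",
    "how to reset the admin password",
    "admin console tour",
]
top = retrieve_then_rerank("reset admin password", first_stage, toy_overlap_score, final_k=2)
print(top)
```

The first-stage order put the best chunk second; rescoring each (query, chunk) pair moves it to the top before the list is cut down to `final_k`.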

Query Rewriting and Expansion

User queries are often ambiguous, short, or missing context. Use Claude to transform the query before retrieval:

Query expansion: Generate multiple rephrasings of the same question, retrieve for each, merge results — improves recall for queries with ambiguous terminology
Step-back prompting: Ask Claude to identify the broader concept behind the specific question, retrieve on the broader concept — useful for questions that require background context
Decomposition: For complex multi-part questions, break into sub-questions, retrieve separately, combine results — enables answering questions that span multiple documents
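Query expansion, the first technique above, might look like the sketch below. The `rewrite` and `retrieve` callables are assumptions: in a real system the first would be an LLM call and the second a vector or hybrid search. The toy index and chunk IDs are invented for illustration.

```python
def expanded_retrieve(query, rewrite, retrieve, per_query_k=3):
    """Retrieve for the original query plus its rephrasings and merge the results.

    rewrite: callable query -> list of rephrasings (an LLM call in practice).
    retrieve: callable (query, k) -> list of chunk IDs, best first.
    """
    merged, seen = [], set()
    for q in [query, *rewrite(query)]:
        for chunk_id in retrieve(q, per_query_k):
            if chunk_id not in seen:  # de-duplicate across rephrasings
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged

# Hypothetical keyword index standing in for a real retriever.
toy_index = {"pto": ["hr-12"], "vacation": ["hr-12", "hr-30"], "leave": ["hr-07"]}

def toy_retrieve(q, k):
    hits = []
    for word in q.lower().split():
        hits.extend(toy_index.get(word, []))
    return hits[:k]

def toy_rewrite(q):
    # Hypothetical stand-in for LLM-generated rephrasings.
    return ["vacation days", "annual leave"]

result = expanded_retrieve("pto allowance", toy_rewrite, toy_retrieve)
print(result)
```

The original query alone reaches one chunk; the rephrasings surface two more that use different terminology. Step-back prompting and decomposition fit the same shape, with `rewrite` returning a broader query or a list of sub-questions instead.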

Multi-Hop Retrieval

Some questions cannot be answered from a single retrieved chunk — the answer requires combining information across multiple documents or following a chain of references. Multi-hop retrieval handles this:

Retrieve initial chunks based on the original query
Ask Claude to identify what additional information is needed based on initial results
Retrieve again using the identified information needs
Repeat until Claude has sufficient context or a maximum hop count is reached
Generate the final answer from all accumulated retrieved content

Multi-hop retrieval is more complex and adds latency: each hop costs one more LLM call and one more retrieval round. Use it for knowledge bases where answers inherently span multiple documents, such as regulatory Q&A, contract analysis, and technical architecture questions.
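The retrieval loop can be sketched as below, stopping before the final generation step. The `next_query` callable stands in for the LLM call that decides what is still missing; it and the two-document toy corpus are hypothetical.

```python
def multi_hop_retrieve(query, retrieve, next_query, max_hops=3):
    """Retrieve iteratively until no further need is reported or max_hops is reached.

    retrieve: callable query -> list of chunks.
    next_query: callable (original_query, chunks_so_far) -> follow-up query or None;
                an LLM call in practice, a hypothetical stand-in here.
    """
    collected = []
    current = query
    for _ in range(max_hops):
        for chunk in retrieve(current):
            if chunk not in collected:
                collected.append(chunk)
        current = next_query(query, collected)
        if current is None:  # enough context gathered
            break
    return collected

# Toy corpus: the answer requires following a reference from one document to another.
docs = {
    "who approves contract X": ["Contract X is governed by Policy 7."],
    "Policy 7": ["Policy 7 requires CFO approval."],
}

def toy_retrieve(q):
    return docs.get(q, [])

def toy_next_query(original, chunks):
    if any("Policy 7 requires" in c for c in chunks):
        return None            # the referenced policy has been found
    if any("Policy 7" in c for c in chunks):
        return "Policy 7"      # follow the reference
    return None

context = multi_hop_retrieve("who approves contract X", toy_retrieve, toy_next_query)
print(context)
```

The `max_hops` cap bounds latency and cost even when the follow-up step keeps asking for more information.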

When to Apply Each Technique

Always consider: Hybrid search — adds robustness for specific terms at low implementation cost
High-value retrieval: Reranking — consistent quality improvement for production systems
Poor baseline recall: HyDE or query expansion, for when users report that the system "doesn't find obvious things"
Complex knowledge bases: Multi-hop — when answers require connecting multiple documents

Implement in order of impact. Most teams see the largest gains from: (1) hybrid search, (2) reranking, (3) query rewriting. Add complexity only when retrieval quality metrics justify it.
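One simple way to check whether a technique earns its complexity is to compare recall at k before and after adding it, on a small set of queries with known relevant chunks. The sketch below assumes such labels exist; the chunk IDs and the two result lists are invented.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

relevant = {"c1", "c2"}  # hypothetical labelled chunks for one test query
baseline = recall_at_k(["c4", "c9", "c1", "c7"], relevant, k=3)
with_hybrid = recall_at_k(["c1", "c2", "c9", "c4"], relevant, k=3)
print(baseline, with_hybrid)
```

Averaging this over a few dozen representative queries gives a number to compare across pipeline variants.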

Checklist: Do You Understand This?

Hybrid search: dense (embedding) + sparse (BM25) merged via RRF — handles both semantic and exact-match retrieval
HyDE: generate hypothetical answer → embed answer → search with answer embedding — closes query-document vocabulary gap
Reranking: retrieve broad set (top-20) → cross-encoder rescores → pass top-5 to Claude — high recall + high precision
Query rewriting: expand, rephrase, or decompose the query before retrieval — improves coverage for ambiguous queries
Multi-hop: chain retrievals for questions spanning multiple documents — adds latency, use for complex knowledge bases