Search & Reranking
Retrieval quality is the single biggest driver of RAG output quality: if the wrong chunks reach the LLM, no amount of prompt engineering recovers the answer. This page covers the full retrieval stack: how vector search and keyword search work and where each fails, how to fuse them with hybrid retrieval, how query expansion and HyDE improve recall, and how reranking restores precision after you've cast a wide net. The production benchmark result sets the target: BM25 alone 58% accuracy → hybrid 79% → hybrid + rerank 91%.
Two Retrieval Paradigms
| Dimension | Dense (vector / semantic) | Sparse (keyword / BM25) |
|---|---|---|
| How it works | Query and document are embedded into dense vectors; similarity is cosine distance in high-dimensional space | Term-frequency scoring (BM25); ranks documents by exact keyword matches, weighted by rarity and document length |
| Strength | Captures semantics – "cost" matches "price", synonyms, paraphrases, cross-lingual queries | Exact-term precision – serial numbers, product codes, names, technical identifiers |
| Weakness | Misses rare/specific terms; "SKU-XJ-4421" may not match if not in training vocabulary | Vocabulary mismatch – "inexpensive" does not match "cheap"; no semantic understanding |
| Speed | Approximate nearest-neighbour (ANN); fast at <100ms for millions of vectors | Inverted index lookup; extremely fast (<10ms); ElasticSearch / OpenSearch / Typesense |
| Accuracy alone | ~79% in production benchmarks (semantic only) | ~58% in production benchmarks (BM25 only) |
| Best for | Natural language queries, concept search, Q&A over prose | Technical queries with exact identifiers, proper nouns, code snippets |
Neither approach dominates in production. Dense retrieval wins on semantic understanding; sparse wins on exact-term recall. Combining them is the first upgrade every RAG system should make.
Hybrid Retrieval
Hybrid retrieval runs both dense and sparse searches independently, then merges the two ranked result lists before passing candidates to the LLM (or a reranker). The merge step is the key design decision.
Reciprocal Rank Fusion (RRF) – the standard merge algorithm
RRF ignores raw similarity scores entirely (which have incompatible scales between BM25 and cosine) and works only on rank position. For each document d, its RRF score is:

RRF(d) = Σᵢ 1 / (k + rankᵢ(d))

where the sum runs over the result lists, rankᵢ(d) is d's position in list i, and k is a constant (typically 60) that dampens the impact of very-high-ranked documents. Documents that rank highly in both lists accumulate the highest RRF scores. Documents that rank highly in only one list still appear – preserving the benefit of each search type.
- No score normalisation required – works out-of-the-box across any two retrieval methods
- No learned parameters – zero-shot, no training data needed
- Scales to billion-document indices where global score normalisation is expensive
- Native support in Pinecone, Weaviate, Elasticsearch, and MongoDB Atlas
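The merge step can be sketched in a few lines of Python – a minimal RRF implementation, assuming each retriever returns a list of document IDs ordered best-first (the IDs below are illustrative):

```python
# Minimal Reciprocal Rank Fusion (RRF) sketch. Each input list is a
# ranking of document IDs, best-first; ranks are 1-based.
def rrf_fuse(result_lists, k=60):
    """Merge ranked lists by summing 1 / (k + rank) per document."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # vector search ranking
sparse = ["doc_c", "doc_a", "doc_d"]   # BM25 ranking
print(rrf_fuse([dense, sparse]))       # doc_a and doc_c, ranked in both lists, come out on top
```

Note that no score from either retriever is inspected – only positions – which is exactly why RRF needs no normalisation or tuning beyond k.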
Implementation options
- Native hybrid DB: Pinecone, Weaviate, MongoDB Atlas – run both searches and RRF in one query call; lowest operational overhead
- Two-stack: vector DB (Qdrant / pgvector) + keyword search (Elasticsearch / OpenSearch) – more control, more infra
- BGE-M3 single model: one model produces dense + sparse vectors simultaneously – no separate BM25 index needed; best for latency-sensitive systems
Alpha weighting
Some implementations let you tune an alpha weight: score = α·dense + (1−α)·sparse.
- α = 1.0 → pure dense
- α = 0.0 → pure sparse
- α = 0.5 → equal weight (typical starting point)
- α = 0.75 → dense-heavy (natural language heavy corpora)
- Tune α against your recall@k eval set – don't guess
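A hedged sketch of alpha weighting. Because raw BM25 and cosine scores live on different scales, each score list is min-max normalised to [0, 1] before blending; the document IDs and scores below are illustrative, not from any real index:

```python
# Alpha-weighted score fusion sketch. Raw BM25 and cosine scores are
# not comparable, so each list is min-max normalised before blending.
def normalise(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0              # avoid divide-by-zero
    return {d: (s - lo) / span for d, s in scores.items()}

def alpha_fuse(dense_scores, sparse_scores, alpha=0.5):
    d, s = normalise(dense_scores), normalise(sparse_scores)
    docs = set(d) | set(s)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in docs}
    return sorted(fused, key=fused.get, reverse=True)

dense  = {"doc_a": 0.91, "doc_b": 0.62}       # cosine similarities
sparse = {"doc_b": 14.2, "doc_c": 9.1}        # BM25 scores
print(alpha_fuse(dense, sparse, alpha=0.75))  # dense-heavy blend favours doc_a
```

Min-max normalisation is one of several reasonable choices here; this per-query normalisation is also exactly the cost that RRF avoids.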
Query-Side Techniques
Even with hybrid search, a poorly phrased query retrieves poor results. These techniques improve recall before the search even runs.
Query expansion
Use an LLM to generate synonym terms, related concepts, and alternative phrasings of the original query. The expanded terms are added to the keyword search component, increasing the chance of matching documents that use different vocabulary.
Example: user asks "how do I reduce latency?" → expanded with "performance optimisation", "response time", "throughput bottleneck", "p95 latency" before BM25 search.
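A minimal sketch of the expansion step. In production an LLM proposes the synonym terms; the hardcoded map below is a stand-in so the flow is runnable:

```python
# Query expansion sketch for the keyword leg of hybrid search.
# EXPANSIONS stands in for an LLM call that generates related terms.
EXPANSIONS = {
    "reduce latency": ["performance optimisation", "response time",
                       "throughput bottleneck", "p95 latency"],
}

def expand_query(query):
    terms = [query] + EXPANSIONS.get(query, [])
    # OR the variants together into a BM25-style keyword query
    return " OR ".join(f'"{t}"' for t in terms)

print(expand_query("reduce latency"))
```

The dense leg of the hybrid search still uses the original query; only the keyword query is widened.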
HyDE – Hypothetical Document Embeddings
Instead of embedding the query, ask an LLM to write a hypothetical answer to the query, then embed that document for retrieval. The hypothesis occupies the same embedding space as real answers, often finding better matches than the short, ambiguous original query.
Benchmark: Hybrid + HyDE achieves NDCG of 0.91 on mixed queries vs standard Hybrid at 0.85 – a significant lift for open-ended questions.
Cost: one extra LLM call per query (~$0.001–0.005 with a small model). Worth it when queries are short, ambiguous, or use very different vocabulary from your corpus.
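A sketch of the HyDE flow, with `generate_answer` standing in for the LLM call and `embed` for a real embedding model – both are toy stubs, not actual APIs:

```python
# HyDE sketch: embed a hypothetical ANSWER instead of the raw query.
def generate_answer(query):
    # Stand-in for one LLM call, e.g. "Write a short passage answering: ..."
    return ("Latency can be reduced by caching hot paths, batching "
            "requests, and profiling p95 response times.")

def embed(text):
    # Toy bag-of-words "embedding"; a real system uses an embedding model
    vocab = ["latency", "caching", "batching", "p95", "throughput"]
    words = text.lower().split()
    return [sum(w.startswith(v) for w in words) for v in vocab]

def hyde_query_vector(query):
    # The key move: search with the hypothesis vector, not embed(query)
    return embed(generate_answer(query))

print(hyde_query_vector("how do I reduce latency?"))
```

The hypothesis is richer in corpus-like vocabulary than the short query, so its vector lands closer to the real answer documents.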
Multi-query retrieval (RAG-Fusion)
Ask an LLM to generate 3–5 alternative phrasings of the original query. Run retrieval for each sub-query independently, then merge all result sets with RRF. Documents that surface across multiple sub-queries receive the highest scores.
Benchmarks show +8–10% answer accuracy and +30–40% answer comprehensiveness vs vanilla RAG. Cost: 3–5× more retrieval calls and embedding calls per query. Best for complex, multi-faceted questions – overkill for simple factoid lookups.
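The fan-out-and-merge flow can be sketched as follows. `retrieve` returns canned rankings as a stand-in for real hybrid retrieval, and the sub-queries would normally come from an LLM rewrite prompt:

```python
# RAG-Fusion sketch: fan out sub-queries, retrieve per query, merge with RRF.
def rrf_merge(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query):
    # Stub: canned per-query rankings instead of a real search backend
    canned = {
        "how do I cancel?":        ["refund_policy", "faq", "tos"],
        "cancellation policy":     ["tos", "refund_policy"],
        "how to end subscription": ["billing", "refund_policy"],
    }
    return canned.get(query, [])

sub_queries = ["how do I cancel?", "cancellation policy",
               "how to end subscription"]        # normally LLM-generated
fused = rrf_merge([retrieve(q) for q in sub_queries])
print(fused[0])  # refund_policy surfaces in all three lists and wins
```

The document appearing in every sub-query's results accumulates the most RRF mass – exactly the "consensus across phrasings" signal the technique relies on.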
Step-back prompting
For specific queries that require broad context, first retrieve on a more general "step-back" version of the query, then on the specific query. Example: "What does Section 4.3 say about termination?" → step-back: "What are the contract termination clauses?" → specific: "Section 4.3 termination notice period". Combines general context with specific detail in the injected chunks.
Reranking
Retrieval is a recall problem: get enough relevant documents into the candidate set (top-50 to top-100). Reranking is a precision problem: from those candidates, surface the top-3 to top-10 most relevant for the LLM. These are different optimisation targets requiring different models.
Bi-encoder vs cross-encoder
| Dimension | Bi-encoder (embedding model) | Cross-encoder (reranker) |
|---|---|---|
| How it processes | Query and document embedded separately; similarity = cosine distance | Query + document concatenated; processed together through transformer; outputs a relevance score |
| Accuracy | Good – but misses fine-grained relevance signals | Higher – full attention across query + document reveals subtle relevance |
| Speed | Fast – precompute document embeddings; ANN search at query time | Slow at scale – full forward pass per (query, document) pair; cannot precompute |
| Scales to | Millions of documents | 50–200 candidates per query (hence: retrieve first, then rerank) |
| Role | First-stage retrieval (cast wide net) | Second-stage reranking (precision cut) |
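A toy illustration of why the two stages split this way: the bi-encoder's document vectors are computed once at index time, while the cross-encoder must score every (query, document) pair at query time. Both "models" here are trivial stubs, not real transformers:

```python
# Two-stage retrieve-then-rerank sketch with stub models.
def bi_encode(text):                      # runs at INDEX time for docs
    vocab = ["refund", "cancel", "billing", "latency"]
    return [text.lower().count(v) for v in vocab]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def cross_score(query, doc):              # runs at QUERY time, per pair
    # Stand-in for a joint forward pass over the concatenated pair
    return sum(w in doc.lower() for w in query.lower().split())

docs = {"d1": "refund and cancel policy", "d2": "billing latency report"}
index = {d: bi_encode(t) for d, t in docs.items()}   # precomputed once

query = "cancel refund"
stage1 = sorted(docs, key=lambda d: cosine(bi_encode(query), index[d]),
                reverse=True)[:2]                    # wide, cheap net
stage2 = max(stage1, key=lambda d: cross_score(query, docs[d]))
print(stage2)
```

The asymmetry is the whole point: `index` is built once, but `cross_score` cannot be cached because it needs the query – which is why the cross-encoder only ever sees a small candidate set.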
Reranker Models
| Model | Type | Latency added | Cost | Best for |
|---|---|---|---|---|
| Cohere Rerank 3 | Cross-encoder (API) | 200–400ms | $0.002 / 1K searches | Production, multilingual, no infra to manage |
| Cohere Rerank 3 Nimble | Cross-encoder (API, fast) | 100–200ms | Lower than Rerank 3 | Latency-sensitive production with high throughput |
| Voyage Rerank 2 | Cross-encoder (API) | 150–300ms | ~$0.05 / 1M tokens | Pairs well with Voyage embeddings; high BEIR accuracy |
| Mixedbread reranker-large | Cross-encoder (open-source) | 200–400ms | Free (self-hosted) | Top BEIR score (57.49); self-hosted; outperforms Cohere on benchmarks |
| FlashRank | Cross-encoder (lightweight) | <50ms (CPU) | Free (open-source) | CPU-bound environments, edge deployments, cost-sensitive low-latency |
| ColBERT / RAGatouille | Late interaction | Tens of ms (precomputed doc embeddings) | Free (open-source) | High accuracy + near-bi-encoder speed; token-level interaction; large corpora |
ColBERT: late interaction explained
Standard cross-encoders cannot precompute document representations – every query forces a full forward pass over every candidate. ColBERT solves this with late interaction: query and document are encoded separately (like bi-encoders), but similarity is computed as token-level MaxSim – the sum, over query tokens, of each query token's maximum similarity to any document token. This captures fine-grained relevance while allowing document embeddings to be precomputed and cached.
With PLAID's centroid pruning, ColBERT achieves tens-of-milliseconds retrieval on large corpora – bridging the gap between cross-encoder accuracy and bi-encoder speed. RAGatouille is the recommended Python library for ColBERT integration in RAG pipelines.
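MaxSim itself is simple to state in code. A toy sketch with hand-made 2-dimensional token vectors (real ColBERT embeddings are learned and much higher-dimensional):

```python
# Late-interaction MaxSim sketch. Document token embeddings can be
# precomputed; at query time each query token takes its maximum
# similarity over all document tokens, and the maxima are summed.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens, doc_tokens):
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy 2-d "embeddings", one vector per token (precomputable per doc)
query_toks = [[1.0, 0.0], [0.0, 1.0]]
doc_a_toks = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]   # covers both query tokens
doc_b_toks = [[0.9, 0.1], [0.8, 0.2]]               # covers only the first

print(maxsim(query_toks, doc_a_toks) > maxsim(query_toks, doc_b_toks))  # True
```

Because each query token independently finds its best-matching document token, a document must cover all aspects of the query to score highly – the fine-grained signal cross-encoders get, without the per-pair forward pass.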
The Full Retrieval Pipeline
The three stages – retrieval, fusion, reranking – compose into a cascade:
BM25 alone 58% → hybrid 79% → hybrid + rerank 91%. Each stage adds latency – add only where you have measured a gap.
Not every stage is always necessary. Start with dense-only, measure recall@k, then add stages where you see gaps. Adding all three stages is correct only when you have measured that each stage improves your specific eval metrics.
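A skeletal version of the cascade, with every stage stubbed out, showing where each k value applies:

```python
# End-to-end cascade sketch. All retrievers and the reranker are
# stand-ins; the structure and the k values at each stage are the point.
def dense_search(query, k=100):           # stage 1a: wide ANN net
    return [f"d{i}" for i in range(k)]

def sparse_search(query, k=100):          # stage 1b: wide BM25 net
    return [f"d{i}" for i in range(2, k + 2)]

def rrf(lists, k=60):                     # stage 2: fusion
    scores = {}
    for lst in lists:
        for rank, doc in enumerate(lst, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query, candidates):            # stage 3: precision cut
    return candidates                     # stand-in for a cross-encoder

def retrieve_pipeline(query):
    candidates = rrf([dense_search(query), sparse_search(query)])[:50]
    return rerank(query, candidates)[:5]  # small, precise set for the LLM

print(len(retrieve_pipeline("example query")))  # 5 chunks reach the context
```

Swapping any stub for a real implementation (a vector DB client, an Elasticsearch query, a reranker API) leaves the shape of the pipeline unchanged.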
Choosing k Values
| Stage | Typical k | Notes |
|---|---|---|
| Dense retrieval candidates | 50–100 | High recall; cost is cheap (ANN search); wider net for reranker to work with |
| After RRF fusion | 50 | Pass merged top-50 to reranker; balance reranker compute cost vs recall |
| After reranking | 3–10 | Only inject the top-3 to top-10 into the LLM context; stay under ~2,500 tokens injected total |
| Final LLM context | 3–5 chunks typical | More chunks = more cost and risk of "lost in the middle" dilution; precision matters more than volume here |
Over-retrieval is a real cost driver – enterprises overpay by up to 80% from excessive k values. Retrieval is cheap; LLM tokens are expensive. Keep the final injected k small and precise.
What to Optimise First
- Establish a dense-only baseline – measure recall@10 and recall@50 with your embedding model and chunking strategy before adding any complexity
- Add hybrid (BM25 + dense with RRF) – if recall@50 is below 90%, this is your highest-leverage single upgrade; adds <10ms latency with native hybrid DB support
- Add reranking once recall@50 >90% – if the right documents are in the top-50 but the top-5 is still wrong, add a cross-encoder; adds 200–400ms latency
- Add HyDE or multi-query if recall@50 is still low – for corpora with strong vocabulary mismatch between queries and documents; adds one LLM call per query
- Add contextual retrieval to the indexing pipeline – prepend document context to each chunk before embedding; 49–67% retrieval failure reduction for low indexing cost
Failure Modes
Retrieval failures
- Vocabulary mismatch: user says "cost", doc says "pricing" – dense retrieval helps, but hybrid catches both
- Short query, ambiguous intent: "what is the limit?" retrieves wrong document type; HyDE or step-back prompting improves this
- Top-k too small: the relevant chunk is at position 15 but k=10; increase retrieval k before adding more expensive reranking
- Stale index: documents updated but not re-indexed; retriever surfaces outdated chunks with full confidence
Reranking failures
- Relevant document not in candidates: reranker cannot rescue a document that was never retrieved; fix retrieval recall first
- Reranking noise: passing 200+ candidates to a cross-encoder inflates latency and cost without improving precision over top-50
- Latency budget exceeded: adding a reranker to a sub-500ms pipeline breaks SLAs; use ColBERT or FlashRank for latency-constrained systems
- Final k too large: injecting 20 reranked chunks into the LLM causes "lost in the middle" dilution; keep final k at 3β10
2025β2026 Developments
- Hybrid retrieval is now the production default – pure vector search is no longer considered sufficient for production RAG. Native hybrid support (dense + sparse + RRF) is now built into Pinecone, Weaviate, MongoDB Atlas, and Elasticsearch, removing the operational barrier.
- BGE-M3 unifies three retrieval paradigms – BAAI's BGE-M3 produces dense, sparse (SPLADE), and ColBERT embeddings from a single model pass, enabling a full hybrid + late-interaction pipeline without multiple separate embedding models.
- Mixedbread leads BEIR benchmarks – open-source Mixedbread reranker-large (BEIR 57.49) outperforms Cohere Rerank on several benchmarks, making fully open-source, self-hosted high-accuracy reranking practical for the first time.
- HyDE adoption growing – Hypothetical Document Embeddings are seeing wider production adoption as teams discover that the NDCG improvement (0.85 → 0.91) is worth the additional LLM call for open-ended queries.
- Agentic retrieval (retrieval as a tool call) – agents that decide whether and how to retrieve (single-hop vs multi-hop, which index to query) are replacing static pipeline configurations in advanced systems, allowing dynamic retrieval strategies per query.
Checklist: Do You Understand This?
- Can you explain why BM25 alone achieves ~58% accuracy, hybrid ~79%, and hybrid + reranking ~91%?
- Do you understand how RRF works – why it uses rank position rather than raw scores?
- Can you describe HyDE and when the extra LLM call is justified?
- Do you know the difference between a bi-encoder (embedding model) and a cross-encoder (reranker) – and why you need both?
- Can you explain ColBERT's late interaction mechanism and why it achieves cross-encoder accuracy at near-bi-encoder speed?
- Do you know the typical k values at each pipeline stage – retrieval (50–100), after reranking (3–10)?
- Can you name two failure modes that reranking cannot fix (answer: document never retrieved, final k too large)?