Search & Reranking
Retrieval quality is the single biggest driver of RAG output quality: if the wrong chunks reach the LLM, no amount of prompt engineering recovers the answer. This page covers the full retrieval stack: how vector search and keyword search work and where each fails, how to fuse them with hybrid retrieval, how query expansion and HyDE improve recall, and how reranking restores precision after you've cast a wide net. The production benchmark result sets the target: BM25 alone 58% accuracy → hybrid 79% → hybrid + rerank 91%.
Two Retrieval Paradigms
| Dimension | Dense (vector / semantic) | Sparse (keyword / BM25) |
|---|---|---|
| How it works | Query and document are embedded into dense vectors; similarity is cosine distance in high-dimensional space | Term-frequency scoring (BM25); ranks documents by exact keyword matches, weighted by rarity and document length |
| Strength | Captures semantics – "cost" matches "price", synonyms, paraphrases, cross-lingual queries | Exact-term precision – serial numbers, product codes, names, technical identifiers |
| Weakness | Misses rare/specific terms; "SKU-XJ-4421" may not match if not in training vocabulary | Vocabulary mismatch – "inexpensive" does not match "cheap"; no semantic understanding |
| Speed | Approximate nearest-neighbour (ANN); fast at <100ms for millions of vectors | Inverted index lookup; extremely fast (<10ms); ElasticSearch / OpenSearch / Typesense |
| Accuracy alone | ~79% in production benchmarks (semantic only) | ~58% in production benchmarks (BM25 only) |
| Best for | Natural language queries, concept search, Q&A over prose | Technical queries with exact identifiers, proper nouns, code snippets |
Neither approach dominates in production. Dense retrieval wins on semantic understanding; sparse wins on exact-term recall. Combining them is the first upgrade every RAG system should make.
Hybrid Retrieval
Hybrid retrieval runs both dense and sparse searches independently, then merges the two ranked result lists before passing candidates to the LLM (or a reranker). The merge step is the key design decision.
Reciprocal Rank Fusion (RRF) – the standard merge algorithm
RRF ignores raw similarity scores entirely (which have incompatible scales between BM25 and cosine) and works only on rank position. For each document d, its RRF score is:

RRF(d) = Σᵢ 1 / (k + rankᵢ(d))

where the sum runs over the result lists, rankᵢ(d) is d's position in list i, and k is a constant (typically 60) that dampens the impact of very-high-ranked documents. Documents that rank highly in both lists accumulate the highest RRF scores. Documents that rank highly in only one list still appear – preserving the benefit of each search type.
- No score normalisation required – works out-of-the-box across any two retrieval methods
- No learned parameters – zero-shot, no training data needed
- Scales to billion-document indices where global score normalisation is expensive
- Native support in Pinecone, Weaviate, Elasticsearch, and MongoDB Atlas
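The merge step can be sketched in a few lines of Python – a minimal RRF implementation, assuming each retriever returns a list of document IDs ordered best-first (the IDs below are illustrative):

```python
# Minimal Reciprocal Rank Fusion (RRF) sketch. Each input list is a
# ranking of document IDs, best-first; ranks are 1-based.
def rrf_fuse(result_lists, k=60):
    """Merge ranked lists by summing 1 / (k + rank) per document."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # vector search ranking
sparse = ["doc_c", "doc_a", "doc_d"]   # BM25 ranking
print(rrf_fuse([dense, sparse]))       # doc_a and doc_c, ranked in both lists, come out on top
```

Note that no score from either retriever is inspected – only positions – which is exactly why RRF needs no normalisation or tuning beyond k.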
Implementation options
- Native hybrid DB: Pinecone, Weaviate, MongoDB Atlas – run both searches and RRF in one query call; lowest operational overhead
- Two-stack: vector DB (Qdrant / pgvector) + keyword search (Elasticsearch / OpenSearch) – more control, more infra
- BGE-M3 single model: one model produces dense + sparse vectors simultaneously – no separate BM25 index needed; best for latency-sensitive systems
Alpha weighting
Some implementations let you tune an alpha weight: score = α·dense + (1−α)·sparse.
- α = 1.0 → pure dense
- α = 0.0 → pure sparse
- α = 0.5 → equal weight (typical starting point)
- α = 0.75 → dense-heavy (natural language heavy corpora)
- Tune α against your recall@k eval set – don't guess
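A hedged sketch of alpha weighting. Because raw BM25 and cosine scores live on different scales, each score list is min-max normalised to [0, 1] before blending; the document IDs and scores below are illustrative, not from any real index:

```python
# Alpha-weighted score fusion sketch. Raw BM25 and cosine scores are
# not comparable, so each list is min-max normalised before blending.
def normalise(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0              # avoid divide-by-zero
    return {d: (s - lo) / span for d, s in scores.items()}

def alpha_fuse(dense_scores, sparse_scores, alpha=0.5):
    d, s = normalise(dense_scores), normalise(sparse_scores)
    docs = set(d) | set(s)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in docs}
    return sorted(fused, key=fused.get, reverse=True)

dense  = {"doc_a": 0.91, "doc_b": 0.62}       # cosine similarities
sparse = {"doc_b": 14.2, "doc_c": 9.1}        # BM25 scores
print(alpha_fuse(dense, sparse, alpha=0.75))  # dense-heavy blend favours doc_a
```

Min-max normalisation is one of several reasonable choices here; this per-query normalisation is also exactly the cost that RRF avoids.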
Query-Side Techniques
Even with hybrid search, a poorly phrased query retrieves poor results. These techniques improve recall before the search even runs.
Query expansion
Use an LLM to generate synonym terms, related concepts, and alternative phrasings of the original query. The expanded terms are added to the keyword search component, increasing the chance of matching documents that use different vocabulary.
Example: user asks "how do I reduce latency?" → expanded with "performance optimisation", "response time", "throughput bottleneck", "p95 latency" before BM25 search.
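A minimal sketch of the expansion step. In production an LLM proposes the synonym terms; the hardcoded map below is a stand-in so the flow is runnable:

```python
# Query expansion sketch for the keyword leg of hybrid search.
# EXPANSIONS stands in for an LLM call that generates related terms.
EXPANSIONS = {
    "reduce latency": ["performance optimisation", "response time",
                       "throughput bottleneck", "p95 latency"],
}

def expand_query(query):
    terms = [query] + EXPANSIONS.get(query, [])
    # OR the variants together into a BM25-style keyword query
    return " OR ".join(f'"{t}"' for t in terms)

print(expand_query("reduce latency"))
```

The dense leg of the hybrid search still uses the original query; only the keyword query is widened.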
HyDE – Hypothetical Document Embeddings
Instead of embedding the query, ask an LLM to write a hypothetical answer to the query, then embed that document for retrieval. The hypothesis occupies the same embedding space as real answers, often finding better matches than the short, ambiguous original query.
Benchmark: Hybrid + HyDE achieves NDCG of 0.91 on mixed queries vs standard Hybrid at 0.85 – a significant lift for open-ended questions.
Cost: one extra LLM call per query (~$0.001–0.005 with a small model). Worth it when queries are short, ambiguous, or use very different vocabulary from your corpus.
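A sketch of the HyDE flow, with `generate_answer` standing in for the LLM call and `embed` for a real embedding model – both are toy stubs, not actual APIs:

```python
# HyDE sketch: embed a hypothetical ANSWER instead of the raw query.
def generate_answer(query):
    # Stand-in for one LLM call, e.g. "Write a short passage answering: ..."
    return ("Latency can be reduced by caching hot paths, batching "
            "requests, and profiling p95 response times.")

def embed(text):
    # Toy bag-of-words "embedding"; a real system uses an embedding model
    vocab = ["latency", "caching", "batching", "p95", "throughput"]
    words = text.lower().split()
    return [sum(w.startswith(v) for w in words) for v in vocab]

def hyde_query_vector(query):
    # The key move: search with the hypothesis vector, not embed(query)
    return embed(generate_answer(query))

print(hyde_query_vector("how do I reduce latency?"))
```

The hypothesis is richer in corpus-like vocabulary than the short query, so its vector lands closer to the real answer documents.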
Multi-query retrieval (RAG-Fusion)
Ask an LLM to generate 3–5 alternative phrasings of the original query. Run retrieval for each sub-query independently, then merge all result sets with RRF. Documents that surface across multiple sub-queries receive the highest scores.
Benchmarks show +8–10% answer accuracy and +30–40% answer comprehensiveness vs vanilla RAG. Cost: 3–5× more retrieval calls and embedding calls per query. Best for complex, multi-faceted questions – overkill for simple factoid lookups.
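The fan-out-and-merge flow can be sketched as follows. `retrieve` returns canned rankings as a stand-in for real hybrid retrieval, and the sub-queries would normally come from an LLM rewrite prompt:

```python
# RAG-Fusion sketch: fan out sub-queries, retrieve per query, merge with RRF.
def rrf_merge(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query):
    # Stub: canned per-query rankings instead of a real search backend
    canned = {
        "how do I cancel?":        ["refund_policy", "faq", "tos"],
        "cancellation policy":     ["tos", "refund_policy"],
        "how to end subscription": ["billing", "refund_policy"],
    }
    return canned.get(query, [])

sub_queries = ["how do I cancel?", "cancellation policy",
               "how to end subscription"]        # normally LLM-generated
fused = rrf_merge([retrieve(q) for q in sub_queries])
print(fused[0])  # refund_policy surfaces in all three lists and wins
```

The document appearing in every sub-query's results accumulates the most RRF mass – exactly the "consensus across phrasings" signal the technique relies on.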
Step-back prompting
For specific queries that require broad context, first retrieve on a more general "step-back" version of the query, then on the specific query. Example: "What does Section 4.3 say about termination?" → step-back: "What are the contract termination clauses?" → specific: "Section 4.3 termination notice period". Combines general context with specific detail in the injected chunks.
Reranking
Retrieval is a recall problem: get enough relevant documents into the candidate set (top-50 to top-100). Reranking is a precision problem: from those candidates, surface the top-3 to top-10 most relevant for the LLM. These are different optimisation targets requiring different models.
Bi-encoder vs cross-encoder
| Dimension | Bi-encoder (embedding model) | Cross-encoder (reranker) |
|---|---|---|
| How it processes | Query and document embedded separately; similarity = cosine distance | Query + document concatenated; processed together through transformer; outputs a relevance score |
| Accuracy | Good – but misses fine-grained relevance signals | Higher – full attention across query + document reveals subtle relevance |
| Speed | Fast – precompute document embeddings; ANN search at query time | Slow at scale – full forward pass per (query, document) pair; cannot precompute |
| Scales to | Millions of documents | 50–200 candidates per query (hence: retrieve first, then rerank) |
| Role | First-stage retrieval (cast wide net) | Second-stage reranking (precision cut) |
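A toy illustration of why the two stages split this way: the bi-encoder's document vectors are computed once at index time, while the cross-encoder must score every (query, document) pair at query time. Both "models" here are trivial stubs, not real transformers:

```python
# Two-stage retrieve-then-rerank sketch with stub models.
def bi_encode(text):                      # runs at INDEX time for docs
    vocab = ["refund", "cancel", "billing", "latency"]
    return [text.lower().count(v) for v in vocab]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def cross_score(query, doc):              # runs at QUERY time, per pair
    # Stand-in for a joint forward pass over the concatenated pair
    return sum(w in doc.lower() for w in query.lower().split())

docs = {"d1": "refund and cancel policy", "d2": "billing latency report"}
index = {d: bi_encode(t) for d, t in docs.items()}   # precomputed once

query = "cancel refund"
stage1 = sorted(docs, key=lambda d: cosine(bi_encode(query), index[d]),
                reverse=True)[:2]                    # wide, cheap net
stage2 = max(stage1, key=lambda d: cross_score(query, docs[d]))
print(stage2)
```

The asymmetry is the whole point: `index` is built once, but `cross_score` cannot be cached because it needs the query – which is why the cross-encoder only ever sees a small candidate set.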
Reranker Models
| Model | Type | Latency added | Cost | Best for |
|---|---|---|---|---|
| Cohere Rerank 3 | Cross-encoder (API) | 200–400ms | $0.002 / 1K searches | Production, multilingual, no infra to manage |
| Cohere Rerank 3 Nimble | Cross-encoder (API, fast) | 100–200ms | Lower than Rerank 3 | Latency-sensitive production with high throughput |
| Voyage Rerank 2 | Cross-encoder (API) | 150–300ms | ~$0.05 / 1M tokens | Pairs well with Voyage embeddings; high BEIR accuracy |
| Mixedbread reranker-large | Cross-encoder (open-source) | 200–400ms | Free (self-hosted) | Top BEIR score (57.49); self-hosted; outperforms Cohere on benchmarks |
| FlashRank | Cross-encoder (lightweight) | <50ms (CPU) | Free (open-source) | CPU-bound environments, edge deployments, cost-sensitive low-latency |
| ColBERT / RAGatouille | Late interaction | Tens of ms (precomputed doc embeddings) | Free (open-source) | High accuracy + near-bi-encoder speed; token-level interaction; large corpora |
ColBERT: late interaction explained
Standard cross-encoders cannot precompute document representations – every query forces a full forward pass over every candidate. ColBERT solves this with late interaction: query and document are encoded separately (like bi-encoders), but similarity is computed as token-level MaxSim – the sum, over query tokens, of each query token's maximum similarity to any document token. This captures fine-grained relevance while allowing document embeddings to be precomputed and cached.
With PLAID's centroid pruning, ColBERT achieves tens-of-milliseconds retrieval on large corpora – bridging the gap between cross-encoder accuracy and bi-encoder speed. RAGatouille is the recommended Python library for ColBERT integration in RAG pipelines.
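MaxSim itself is simple to state in code. A toy sketch with hand-made 2-dimensional token vectors (real ColBERT embeddings are learned and much higher-dimensional):

```python
# Late-interaction MaxSim sketch. Document token embeddings can be
# precomputed; at query time each query token takes its maximum
# similarity over all document tokens, and the maxima are summed.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens, doc_tokens):
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy 2-d "embeddings", one vector per token (precomputable per doc)
query_toks = [[1.0, 0.0], [0.0, 1.0]]
doc_a_toks = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]   # covers both query tokens
doc_b_toks = [[0.9, 0.1], [0.8, 0.2]]               # covers only the first

print(maxsim(query_toks, doc_a_toks) > maxsim(query_toks, doc_b_toks))  # True
```

Because each query token independently finds its best-matching document token, a document must cover all aspects of the query to score highly – the fine-grained signal cross-encoders get, without the per-pair forward pass.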
The Full Retrieval Pipeline
The three stages – retrieval, fusion, reranking – compose into a cascade:
BM25 alone 58% → hybrid 79% → hybrid + rerank 91%. Each stage adds latency – add only where you have measured a gap.
Not every stage is always necessary. Start with dense-only, measure recall@k, then add stages where you see gaps. Adding all three stages is correct only when you have measured that each stage improves your specific eval metrics.
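A skeletal version of the cascade, with every stage stubbed out, showing where each k value applies:

```python
# End-to-end cascade sketch. All retrievers and the reranker are
# stand-ins; the structure and the k values at each stage are the point.
def dense_search(query, k=100):           # stage 1a: wide ANN net
    return [f"d{i}" for i in range(k)]

def sparse_search(query, k=100):          # stage 1b: wide BM25 net
    return [f"d{i}" for i in range(2, k + 2)]

def rrf(lists, k=60):                     # stage 2: fusion
    scores = {}
    for lst in lists:
        for rank, doc in enumerate(lst, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query, candidates):            # stage 3: precision cut
    return candidates                     # stand-in for a cross-encoder

def retrieve_pipeline(query):
    candidates = rrf([dense_search(query), sparse_search(query)])[:50]
    return rerank(query, candidates)[:5]  # small, precise set for the LLM

print(len(retrieve_pipeline("example query")))  # 5 chunks reach the context
```

Swapping any stub for a real implementation (a vector DB client, an Elasticsearch query, a reranker API) leaves the shape of the pipeline unchanged.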
Choosing k Values
| Stage | Typical k | Notes |
|---|---|---|
| Dense retrieval candidates | 50–100 | High recall; cost is cheap (ANN search); wider net for reranker to work with |
| After RRF fusion | 50 | Pass merged top-50 to reranker; balance reranker compute cost vs recall |
| After reranking | 3–10 | Only inject the top-3 to top-10 into the LLM context; stay under ~2,500 tokens injected total |
| Final LLM context | 3–5 chunks typical | More chunks = more cost and risk of "lost in the middle" dilution; precision matters more than volume here |
Over-retrieval is a real cost driver – enterprises overpay by up to 80% from excessive k values. Retrieval is cheap; LLM tokens are expensive. Keep the final injected k small and precise.
What to Optimise First
- Establish a dense-only baseline – measure recall@10 and recall@50 with your embedding model and chunking strategy before adding any complexity
- Add hybrid (BM25 + dense with RRF) – if recall@50 is below 90%, this is your highest-leverage single upgrade; adds <10ms latency with native hybrid DB support
- Add reranking once recall@50 >90% – if the right documents are in the top-50 but the top-5 is still wrong, add a cross-encoder; adds 200–400ms latency
- Add HyDE or multi-query if recall@50 is still low – for corpora with strong vocabulary mismatch between queries and documents; adds one LLM call per query
- Add contextual retrieval to the indexing pipeline – prepend document context to each chunk before embedding; 49–67% retrieval failure reduction for low indexing cost
Failure Modes
Retrieval failures
- Vocabulary mismatch: user says "cost", doc says "pricing" – dense retrieval helps, but hybrid catches both
- Short query, ambiguous intent: "what is the limit?" retrieves wrong document type; HyDE or step-back prompting improves this
- Top-k too small: the relevant chunk is at position 15 but k=10; increase retrieval k before adding more expensive reranking
- Stale index: documents updated but not re-indexed; retriever surfaces outdated chunks with full confidence
Reranking failures
- Relevant document not in candidates: reranker cannot rescue a document that was never retrieved; fix retrieval recall first
- Reranking noise: passing 200+ candidates to a cross-encoder inflates latency and cost without improving precision over top-50
- Latency budget exceeded: adding a reranker to a sub-500ms pipeline breaks SLAs; use ColBERT or FlashRank for latency-constrained systems
- Final k too large: injecting 20 reranked chunks into the LLM causes "lost in the middle" dilution; keep final k at 3β10
2025β2026 Developments
- Hybrid retrieval is now the production default – pure vector search is no longer considered sufficient for production RAG. Native hybrid support (dense + sparse + RRF) is now built into Pinecone, Weaviate, MongoDB Atlas, and Elasticsearch, removing the operational barrier.
- BGE-M3 unifies three retrieval paradigms – BAAI's BGE-M3 produces dense, sparse (SPLADE), and ColBERT embeddings from a single model pass, enabling a full hybrid + late-interaction pipeline without multiple separate embedding models.
- Mixedbread leads BEIR benchmarks – open-source Mixedbread reranker-large (BEIR 57.49) outperforms Cohere Rerank on several benchmarks, making fully open-source, self-hosted high-accuracy reranking practical for the first time.
- HyDE adoption growing – Hypothetical Document Embeddings are seeing wider production adoption as teams discover that the NDCG improvement (0.85 → 0.91) is worth the additional LLM call for open-ended queries.
- Agentic retrieval (retrieval as a tool call) – agents that decide whether and how to retrieve (single-hop vs multi-hop, which index to query) are replacing static pipeline configurations in advanced systems, allowing dynamic retrieval strategies per query.
Checklist: Do You Understand This?
- Can you explain why BM25 alone achieves ~58% accuracy, hybrid ~79%, and hybrid + reranking ~91%?
- Do you understand how RRF works – why it uses rank position rather than raw scores?
- Can you describe HyDE and when the extra LLM call is justified?
- Do you know the difference between a bi-encoder (embedding model) and a cross-encoder (reranker) – and why you need both?
- Can you explain ColBERT's late interaction mechanism and why it achieves cross-encoder accuracy at near-bi-encoder speed?
- Do you know the typical k values at each pipeline stage – retrieval (50–100), after reranking (3–10)?
- Can you name two failure modes that reranking cannot fix (answer: document never retrieved, final k too large)?