
Search & Reranking

Retrieval quality is the single biggest driver of RAG output quality — if the wrong chunks reach the LLM, no amount of prompt engineering recovers the answer. This page covers the full retrieval stack: how vector search and keyword search work and where each fails, how to fuse them with hybrid retrieval, how query expansion and HyDE improve recall, and how reranking restores precision after you've cast a wide net. The production benchmark result sets the target: BM25 alone 58% accuracy → hybrid 79% → hybrid + rerank 91%.

Two Retrieval Paradigms

| Dimension | Dense (vector / semantic) | Sparse (keyword / BM25) |
| --- | --- | --- |
| How it works | Query and document are embedded into dense vectors; similarity is cosine distance in high-dimensional space | Term-frequency scoring (BM25); ranks documents by exact keyword matches, weighted by rarity and document length |
| Strength | Captures semantics — "cost" matches "price"; synonyms, paraphrases, cross-lingual queries | Exact-term precision — serial numbers, product codes, names, technical identifiers |
| Weakness | Misses rare/specific terms; "SKU-XJ-4421" may not match if not in the training vocabulary | Vocabulary mismatch — "inexpensive" does not match "cheap"; no semantic understanding |
| Speed | Approximate nearest-neighbour (ANN); fast at <100ms for millions of vectors | Inverted index lookup; extremely fast (<10ms); Elasticsearch / OpenSearch / Typesense |
| Accuracy alone | ~79% in production benchmarks (semantic only) | ~58% in production benchmarks (BM25 only) |
| Best for | Natural language queries, concept search, Q&A over prose | Technical queries with exact identifiers, proper nouns, code snippets |

Neither approach dominates in production. Dense retrieval wins on semantic understanding; sparse wins on exact-term recall. Combining them is the first upgrade every RAG system should make.

Hybrid Retrieval

Hybrid retrieval runs both dense and sparse searches independently, then merges the two ranked result lists before passing candidates to the LLM (or a reranker). The merge step is the key design decision.

Reciprocal Rank Fusion (RRF) — the standard merge algorithm

RRF ignores raw similarity scores entirely (which have incompatible scales between BM25 and cosine) and works only on rank position. For each document, its RRF score is:

RRF_score(doc) = Σ_i 1 / (k + rank_i(doc))

Where k is a constant (typically 60) that dampens the impact of very highly ranked documents, and rank_i(doc) is the document's position in result list i. Documents that rank highly in both lists accumulate the highest RRF scores. Documents that rank highly in only one list still appear — preserving the benefit of each search type.

  • No score normalisation required — works out of the box across any two retrieval methods
  • No learned parameters — zero-shot, no training data needed
  • Scales to billion-document indices where global score normalisation is expensive
  • Native support in Pinecone, Weaviate, Elasticsearch, and MongoDB Atlas
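The formula above is small enough to sketch directly. A minimal plain-Python version, fusing two toy ranked lists of document IDs (the IDs and rankings here are illustrative, not from any real index):

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: merge ranked lists of doc IDs by rank position.

    Raw similarity scores are ignored entirely, so BM25 and cosine
    result lists can be fused without any score normalisation.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest accumulated RRF score first
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["d3", "d1", "d7", "d2"]    # toy dense top-4
sparse_hits = ["d1", "d9", "d3", "d4"]   # toy BM25 top-4
merged = rrf_merge([dense_hits, sparse_hits])
# d1 and d3 rank near the top because they appear in BOTH lists;
# d9 (sparse-only) and d7 (dense-only) still survive the merge
```

Note how d1 beats d3 even though d3 tops the dense list: appearing at ranks 1 and 2 across the two lists accumulates more reciprocal-rank mass than ranks 1 and 3.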

Implementation options

  • Native hybrid DB: Pinecone, Weaviate, MongoDB Atlas — run both searches and RRF in one query call; lowest operational overhead
  • Two-stack: vector DB (Qdrant / pgvector) + keyword search (Elasticsearch / OpenSearch) — more control, more infra
  • BGE-M3 single model: one model produces dense + sparse vectors simultaneously — no separate BM25 index needed; best for latency-sensitive systems

Alpha weighting

Some implementations let you tune an alpha weight: score = α·dense + (1 − α)·sparse.

  • α = 1.0 → pure dense
  • α = 0.0 → pure sparse
  • α = 0.5 → equal weight (typical starting point)
  • α = 0.75 → dense-heavy (corpora dominated by natural language)
  • Tune α against your recall@k eval set — don't guess
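Because alpha weighting combines raw scores (unlike RRF), the two score distributions must first be brought onto a comparable scale. A minimal sketch, assuming min-max normalisation to [0, 1] — one common choice; real implementations vary — with toy scores:

```python
def hybrid_score(dense_scores, sparse_scores, alpha=0.5):
    """Alpha-weighted fusion: score = alpha * dense + (1 - alpha) * sparse.

    Each score dict is min-max normalised first, since cosine similarities
    and BM25 scores live on incompatible scales.
    """
    def normalise(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on uniform scores
        return {d: (s - lo) / span for d, s in scores.items()}

    dense = normalise(dense_scores)
    sparse = normalise(sparse_scores)
    docs = set(dense) | set(sparse)
    return {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
            for d in docs}

dense = {"d1": 0.92, "d2": 0.85, "d3": 0.40}   # toy cosine similarities
sparse = {"d2": 14.2, "d4": 9.8, "d1": 3.1}    # toy BM25 scores
fused = hybrid_score(dense, sparse, alpha=0.75)  # dense-heavy weighting
```

With α = 0.75, d2 wins overall: strong in both lists beats d1's dense-only peak, mirroring the intuition behind RRF.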

Query-Side Techniques

Even with hybrid search, a poorly phrased query retrieves poor results. These techniques improve recall before the search even runs.

Query expansion

Use an LLM to generate synonym terms, related concepts, and alternative phrasings of the original query. The expanded terms are added to the keyword search component, increasing the chance of matching documents that use different vocabulary.

Example: user asks "how do I reduce latency?" → expanded with "performance optimisation", "response time", "throughput bottleneck", "p95 latency" before the BM25 search.

HyDE — Hypothetical Document Embeddings

Instead of embedding the query, ask an LLM to write a hypothetical answer to the query, then embed that document for retrieval. The hypothesis occupies the same embedding space as real answers, often finding better matches than the short, ambiguous original query.

Benchmark: Hybrid + HyDE achieves an NDCG of 0.91 on mixed queries vs 0.85 for standard hybrid — a significant lift for open-ended questions.

Cost: one extra LLM call per query (~$0.001–0.005 with a small model). Worth it when queries are short, ambiguous, or use very different vocabulary from your corpus.
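The mechanics fit in a few lines. A hedged sketch of the HyDE flow, where `generate`, `embed`, and `vector_search` are hypothetical placeholders for your LLM client, embedding model, and ANN index — not any specific library's API:

```python
def hyde_retrieve(query, generate, embed, vector_search, top_k=10):
    """HyDE: embed a hypothetical ANSWER instead of the query itself.

    `generate`, `embed`, and `vector_search` are injected callables —
    swap in your real LLM call, embedding model, and vector index.
    """
    prompt = f"Write a short passage that answers the question:\n{query}"
    hypothetical_doc = generate(prompt)   # the one extra LLM call per query
    vector = embed(hypothetical_doc)      # hypothesis lives in answer-space
    return vector_search(vector, top_k=top_k)
```

The key line is the embedding call: it receives the generated passage, never the raw query, so retrieval happens answer-to-answer rather than question-to-answer.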

Multi-query retrieval (RAG-Fusion)

Ask an LLM to generate 3–5 alternative phrasings of the original query. Run retrieval for each sub-query independently, then merge all result sets with RRF. Documents that surface across multiple sub-queries receive the highest scores.

Benchmarks show +8–10% answer accuracy and +30–40% answer comprehensiveness vs vanilla RAG. Cost: 3–5× more retrieval and embedding calls per query. Best for complex, multi-faceted questions — overkill for simple factoid lookups.
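A minimal sketch of the RAG-Fusion loop, self-contained with its own RRF accumulation; `rewrite` (the LLM producing rephrasings) and `retrieve` (any ranked search) are hypothetical injected callables:

```python
def rag_fusion(query, rewrite, retrieve, k=60, top_k=10):
    """RAG-Fusion: retrieve per query rephrasing, merge everything with RRF.

    `rewrite` stands in for an LLM call producing 3-5 alternative
    phrasings; `retrieve` for any ranked search (dense, sparse, hybrid).
    """
    sub_queries = [query] + rewrite(query)
    scores = {}
    for sub_query in sub_queries:
        # RRF accumulation across every sub-query's ranked result list
        for rank, doc_id in enumerate(retrieve(sub_query), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A document returned by all sub-queries accumulates reciprocal-rank mass from each list, which is exactly why cross-phrasing consensus floats to the top.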

Step-back prompting

For specific queries that require broad context, first retrieve on a more general "step-back" version of the query, then on the specific query. Example: "What does Section 4.3 say about termination?" → step-back: "What are the contract termination clauses?" → specific: "Section 4.3 termination notice period". This combines general context with specific detail in the injected chunks.

Reranking

Retrieval is a recall problem: get enough relevant documents into the candidate set (top-50 to top-100). Reranking is a precision problem: from those candidates, surface the top-3 to top-10 most relevant for the LLM. These are different optimisation targets requiring different models.

Bi-encoder vs cross-encoder

| Dimension | Bi-encoder (embedding model) | Cross-encoder (reranker) |
| --- | --- | --- |
| How it processes | Query and document embedded separately; similarity = cosine distance | Query + document concatenated and processed together through the transformer; outputs a relevance score |
| Accuracy | Good — but misses fine-grained relevance signals | Higher — full attention across query + document reveals subtle relevance |
| Speed | Fast — precompute document embeddings; ANN search at query time | Slow at scale — full forward pass per (query, document) pair; cannot precompute |
| Scales to | Millions of documents | 50–200 candidates per query (hence: retrieve first, then rerank) |
| Role | First-stage retrieval (cast a wide net) | Second-stage reranking (precision cut) |

Reranker Models

| Model | Type | Latency added | Cost | Best for |
| --- | --- | --- | --- | --- |
| Cohere Rerank 3 | Cross-encoder (API) | 200–400ms | $0.002 / 1K searches | Production, multilingual, no infra to manage |
| Cohere Rerank 3 Nimble | Cross-encoder (API, fast) | 100–200ms | Lower than Rerank 3 | Latency-sensitive production with high throughput |
| Voyage Rerank 2 | Cross-encoder (API) | 150–300ms | ~$0.05 / 1M tokens | Pairs well with Voyage embeddings; high BEIR accuracy |
| Mixedbread reranker-large | Cross-encoder (open-source) | 200–400ms | Free (self-hosted) | Top BEIR score (57.49); self-hosted; outperforms Cohere on benchmarks |
| FlashRank | Cross-encoder (lightweight) | <50ms (CPU) | Free (open-source) | CPU-bound environments, edge deployments, cost-sensitive low-latency |
| ColBERT / RAGatouille | Late interaction | Tens of ms (precomputed doc embeddings) | Free (open-source) | High accuracy at near-bi-encoder speed; token-level interaction; large corpora |

ColBERT: late interaction explained

Standard cross-encoders cannot precompute document representations — every query forces a full forward pass over every candidate. ColBERT solves this with late interaction: query and document are encoded separately (like bi-encoders), but similarity is computed as token-level MaxSim — the sum, over query tokens, of each query token's maximum similarity against all document tokens. This captures fine-grained relevance while allowing document embeddings to be precomputed and cached.

With PLAID's centroid pruning, ColBERT achieves tens-of-milliseconds retrieval on large corpora — bridging the gap between cross-encoder accuracy and bi-encoder speed. RAGatouille is the recommended Python library for ColBERT integration in RAG pipelines.
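The MaxSim operator itself is tiny. A toy sketch with 2-d token vectors (real ColBERT uses ~128-d normalised token embeddings, precomputed per document):

```python
def maxsim(query_tokens, doc_tokens):
    """ColBERT late interaction: for each query token, take its maximum
    dot-product similarity over all document tokens, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]     # two toy query token vectors
doc_a = [[0.9, 0.1], [0.2, 0.8]]     # one close match per query token
doc_b = [[0.5, 0.5], [0.5, 0.5]]     # only diffuse, partial matches
score_a = maxsim(query, doc_a)       # 0.9 + 0.8 = 1.7
score_b = maxsim(query, doc_b)       # 0.5 + 0.5 = 1.0
```

Because `doc_tokens` depends only on the document, those vectors can be encoded once at indexing time and cached — the "late" part is that query-document interaction happens only inside this cheap MaxSim step.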

The Full Retrieval Pipeline

The three stages — retrieval, fusion, reranking — compose into a cascade:

  1. Query — optionally rewritten via HyDE, query expansion, or multi-query
  2. Dense search — vector ANN → top-50 candidates (run in parallel with sparse)
  3. Sparse search (BM25) — keyword match → top-50 candidates
  4. RRF merge — rank fusion → combined top-50
  5. Reranker — cross-encoder → top-5 to top-10
  6. LLM generation — grounded answer with the top-k chunks injected

BM25 alone 58% → hybrid 79% → hybrid + rerank 91%. Each stage adds latency — add a stage only where you have measured a gap.

Not every stage is always necessary. Start with dense-only, measure recall@k, then add stages where you see gaps. Adding all three stages is correct only when you have measured that each stage improves your specific eval metrics.

Choosing k Values

| Stage | Typical k | Notes |
| --- | --- | --- |
| Dense retrieval candidates | 50–100 | High recall; ANN search is cheap; wider net for the reranker to work with |
| After RRF fusion | 50 | Pass the merged top-50 to the reranker; balances reranker compute cost against recall |
| After reranking | 3–10 | Inject only the top-3 to top-10 into the LLM context; stay under ~2,500 injected tokens total |
| Final LLM context | 3–5 chunks typical | More chunks = more cost and more risk of "lost in the middle" dilution; precision matters more than volume here |

Over-retrieval is a real cost driver — enterprises overpay by up to 80% through excessive k values. Retrieval is cheap; LLM tokens are expensive. Keep the final injected k small and precise.

What to Optimise First

  1. Establish a dense-only baseline — measure recall@10 and recall@50 with your embedding model and chunking strategy before adding any complexity
  2. Add hybrid (BM25 + dense with RRF) — if recall@50 is below 90%, this is your highest-leverage single upgrade; adds <10ms latency with native hybrid DB support
  3. Add reranking once recall@50 exceeds 90% — if the right documents are in the top-50 but the top-5 is still wrong, add a cross-encoder; adds 200–400ms latency
  4. Add HyDE or multi-query if recall@50 is still low — for corpora with strong vocabulary mismatch between queries and documents; adds one LLM call per query
  5. Add contextual retrieval to the indexing pipeline — prepend document context to each chunk before embedding; 49–67% retrieval-failure reduction for low indexing cost
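Steps 1–3 all hinge on measuring recall@k. A minimal sketch of the metric over a toy eval set (the doc IDs and relevance labels here are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the gold-relevant docs that appear in the top-k retrieved."""
    hits = sum(1 for doc_id in relevant if doc_id in retrieved[:k])
    return hits / len(relevant)

# Toy eval set: (retrieved ranking, gold-relevant doc IDs) per query
evals = [
    (["d1", "d5", "d3", "d9"], {"d3", "d9"}),  # one of two relevant in top-3
    (["d2", "d4", "d8", "d6"], {"d7"}),        # relevant doc never retrieved
]
mean_recall_at_3 = sum(recall_at_k(r, rel, 3) for r, rel in evals) / len(evals)
# (0.5 + 0.0) / 2 = 0.25 — a gap that hybrid or HyDE should close
```

Run this at k=10 and k=50 against each candidate configuration: if recall@50 is already high but recall@10 lags, a reranker is the right fix; if recall@50 itself is low, improve retrieval first.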

Failure Modes

Retrieval failures

  • Vocabulary mismatch: the user says "cost", the doc says "pricing" — dense retrieval helps, and hybrid catches both
  • Short query, ambiguous intent: "what is the limit?" retrieves the wrong document type; HyDE or step-back prompting improves this
  • Top-k too small: the relevant chunk is at position 15 but k=10; increase retrieval k before adding more expensive reranking
  • Stale index: documents were updated but not re-indexed; the retriever surfaces outdated chunks with full confidence

Reranking failures

  • Relevant document not in candidates: a reranker cannot rescue a document that was never retrieved; fix retrieval recall first
  • Reranking noise: passing 200+ candidates to a cross-encoder inflates latency and cost without improving precision over top-50
  • Latency budget exceeded: adding a reranker to a sub-500ms pipeline breaks SLAs; use ColBERT or FlashRank for latency-constrained systems
  • Final k too large: injecting 20 reranked chunks into the LLM causes "lost in the middle" dilution; keep the final k at 3–10

2025–2026 Developments

  • Hybrid retrieval is now the production default — pure vector search is no longer considered sufficient for production RAG. Native hybrid support (dense + sparse + RRF) is now built into Pinecone, Weaviate, MongoDB Atlas, and Elasticsearch, removing the operational barrier.
  • BGE-M3 unifies three retrieval paradigms — BAAI's BGE-M3 produces dense, sparse (SPLADE-style), and ColBERT embeddings from a single model pass, enabling a full hybrid + late-interaction pipeline without multiple separate embedding models.
  • Mixedbread leads BEIR benchmarks — the open-source Mixedbread reranker-large (BEIR 57.49) outperforms Cohere Rerank on several benchmarks, making fully open-source, self-hosted, high-accuracy reranking practical for the first time.
  • HyDE adoption is growing — Hypothetical Document Embeddings are seeing wider production adoption as teams find the NDCG improvement (0.85 → 0.91) worth the extra LLM call for open-ended queries.
  • Agentic retrieval — retrieval as a tool call — agents that decide whether and how to retrieve (single-hop vs multi-hop, which index to query) are replacing static pipeline configurations in advanced systems, enabling dynamic retrieval strategies per query.

Checklist: Do You Understand This?

  • Can you explain why BM25 alone achieves ~58% accuracy, hybrid ~79%, and hybrid + reranking ~91%?
  • Do you understand how RRF works — why it uses rank position rather than raw scores?
  • Can you describe HyDE and when the extra LLM call is justified?
  • Do you know the difference between a bi-encoder (embedding model) and a cross-encoder (reranker) — and why you need both?
  • Can you explain ColBERT's late-interaction mechanism and why it achieves cross-encoder accuracy at near-bi-encoder speed?
  • Do you know the typical k values at each pipeline stage — retrieval (50–100), after reranking (3–10)?
  • Can you name two failure modes that reranking cannot fix? (Answer: the document was never retrieved; the final k is too large.)