🧠 All Things AI
Intermediate

RAG Pitfalls & What Goes Wrong

By some industry estimates, more than 80% of in-house RAG projects never make it out of the proof-of-concept stage. The gap between a demo that impresses in a meeting and a system that is boringly reliable in production is almost always explained by a handful of repeatable, preventable failure modes that tutorials rarely warn you about.

Where Failures Hide in the Pipeline

A RAG system has six distinct stages, and a failure at any one of them poisons everything downstream. Most teams only test the LLM generation step and assume the rest is fine.

| Stage | What happens here | Most common failure |
|---|---|---|
| 1. Ingestion | Parse, clean, and load raw documents | Garbage in — bad parsing produces junk chunks |
| 2. Chunking | Split documents into indexable units | Answer split across two chunks — never retrieved together |
| 3. Embedding & Indexing | Convert chunks to vectors, store with metadata | Stale index — new docs added but embeddings not refreshed |
| 4. Retrieval | Find relevant chunks for a given query | Correct chunks exist but rank below the cutoff k |
| 5. Context Construction | Assemble retrieved chunks into an LLM prompt | Lost-in-the-middle — key info buried and ignored |
| 6. Generation | LLM synthesises an answer from context + query | Model ignores context and answers from parametric memory |

Stage 1 — Ingestion Pitfalls

Bad ingestion is the silent killer. No retrieval strategy, no matter how sophisticated, can rescue a pipeline built on malformed chunks. Teams routinely skip this step in demos and discover the problem months into production.

Pitfalls
  • PDF parsing collapse — PDFs with multi-column layouts, scanned images, or embedded tables produce scrambled text when parsed with naive tools like PyPDF2. The resulting chunks are nonsensical.
  • Table flattening — HTML and PDF tables become a stream of numbers with no column headers when converted to plain text. Queries about that data fail completely.
  • Boilerplate noise — Headers, footers, navigation bars, disclaimers, and cookie banners all end up in the index. They pollute retrieval with irrelevant matches.
  • No deduplication — The same document ingested twice (common after re-indexing runs) inflates recall scores in tests but returns identical chunks that waste context space.
  • No versioning — Source documents are updated but the index is not. Old versions live alongside new ones with no way to tell which is current.
Fixes
  • Use structure-aware parsers — Unstructured.io, LlamaParse, or Azure Document Intelligence for PDFs, not naive text extraction
  • Extract tables as structured JSON or Markdown, not raw text — then embed the table representation separately
  • Strip known boilerplate patterns at parse time using regex or a classification model
  • Compute a content hash for each document — skip re-ingestion if the hash is unchanged
  • Store doc_version and last_updated in metadata; implement TTL-based re-indexing for time-sensitive content
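
The content-hash fix can be sketched in a few lines. The function names and the whitespace normalisation step below are illustrative choices, not a fixed recipe:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's normalised text."""
    normalised = " ".join(text.split())  # trivial whitespace edits don't change the hash
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def should_reingest(doc_text, seen_hashes):
    """Skip re-ingestion when an identical document is already indexed."""
    h = content_hash(doc_text)
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    return True
```

In practice `seen_hashes` would live in your metadata store alongside `doc_version` and `last_updated`, so a re-indexing run can cheaply skip unchanged documents.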

Stage 2 — Chunking Pitfalls

Pitfalls
  • Context cliff — Fixed-size chunks cut mid-sentence. The answer that starts at character 495 and ends at 520 is split across two chunks that will never be retrieved together.
  • Orphaned context — A chunk says "As mentioned above, the rate limit is 1,000 RPS" with no reference to what "above" is. Retrieved in isolation, it's useless.
  • Code block breaks — A function split across two chunks embeds differently in each half. Neither half retrieves for the right query.
  • One-size-fits-all chunking — Using 512 tokens for FAQ entries (too large) and legal contracts (too small) at the same time. Optimal chunk size varies by content type.
Fixes
  • Use semantic or structure-aware chunking — split on paragraph boundaries, headings, or semantic similarity breaks, not character counts
  • Apply contextual retrieval — prepend the document title, section heading, and a one-sentence summary to every chunk before embedding
  • Treat code blocks as atomic units — detect language fences and keep them whole
  • Tune chunk size per document type — run recall experiments at multiple sizes before committing to production values
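
A minimal sketch of the first two fixes: paragraph-boundary chunking plus a contextual prefix. The function names and the 800-character default are illustrative, and a production chunker would also handle headings and code fences:

```python
def chunk_by_paragraph(text, max_chars=800):
    """Split on blank-line paragraph boundaries, packing whole paragraphs
    into chunks of up to max_chars instead of cutting mid-sentence."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)  # next paragraph would overflow: close the chunk
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

def add_context_prefix(chunk, doc_title, section_heading):
    """Contextual retrieval: embed the chunk together with its document and
    section context so it is not orphaned when retrieved in isolation."""
    return f"Document: {doc_title}\nSection: {section_heading}\n\n{chunk}"
```

The prefix is added before embedding, so a chunk that says "the rate limit is 1,000 RPS" carries its document and section identity into the vector space.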

Stages 3–4 — Indexing & Retrieval Pitfalls

Academic research (Barnett et al., "Seven Failure Points When Engineering a Retrieval Augmented Generation System", 2024) identifies "missed the top-ranked documents" and "not in context" as two of the most common failure points in production systems — both retrieval problems.

| Pitfall | Symptom | Fix |
|---|---|---|
| Vocabulary mismatch | User asks about "pricing"; docs say "cost" — pure dense search fails to connect them | Hybrid dense + BM25 sparse retrieval |
| k too small | The answer chunk is ranked 8th; system only retrieves top-5 | Retrieve 50–100 candidates, rerank to top 5–10 |
| Stale index | Docs updated last week; index reflects last month | Incremental indexing triggered on document change events |
| Embedding model mismatch | Index built with one model; queries embedded with a different model after an upgrade | Version-pin the embedding model; re-index before swapping models |
| Answer not in documents | LLM confidently answers but the question isn't covered in the indexed corpus | Add a no-answer gate — if max retrieval score < threshold, return "I don't have this information" |
| Ignoring metadata filters | Retrieval returns docs from all regions/versions when query is scope-specific | Apply pre-filter metadata constraints before vector search |
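
One common way to implement the hybrid fix is reciprocal rank fusion (RRF), which merges the dense and BM25 ranked lists using only ranks, so the two retrievers' raw scores never need to be comparable. A minimal sketch (the k=60 constant is the conventional default, not a tuned value):

```python
def reciprocal_rank_fusion(dense_ids, sparse_ids, k=60):
    """Merge two ranked lists of chunk IDs; a chunk ranked well by either
    retriever rises in the fused ordering."""
    scores = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk that appears in both lists, even at modest ranks, beats a chunk that tops only one of them, which is exactly the behaviour you want for vocabulary-mismatch queries.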

Stage 5 — Context Construction Pitfalls

The Lost-in-the-Middle Problem

LLMs pay more attention to information at the start and end of their context window. Critical chunks placed in the middle of a long prompt are systematically underweighted — the LLM fails to extract the answer even though it was technically provided.

This is not a prompt-writing issue — it is a fundamental attention distribution property of transformer models, confirmed across GPT-4, Claude, and Gemini in 2024 research.

Fix: Place the most relevant chunks at the start and end of the context. Use reranking to select fewer, higher-quality chunks rather than stuffing the context. Or switch to contextual compression: summarise each chunk before injecting it.
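
The first-and-last placement can be done mechanically when assembling the context. A sketch, assuming the chunks arrive already sorted by relevance:

```python
def order_for_attention(chunks_by_relevance):
    """Interleave chunks so the most relevant land at the start and end of
    the context window and the least relevant sink toward the middle."""
    ordered = [None] * len(chunks_by_relevance)
    front, back = 0, len(chunks_by_relevance) - 1
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:   # ranks 1, 3, 5, ... fill from the front
            ordered[front] = chunk
            front += 1
        else:            # ranks 2, 4, 6, ... fill from the back
            ordered[back] = chunk
            back -= 1
    return ordered
```
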

Context Window Overflow & Information Flooding

Retrieving too many chunks and injecting all of them degrades generation quality. Counter-intuitively, more context is not always better — signal-to-noise ratio in the context window directly determines LLM accuracy.

Fix: Target 3–5 high-precision chunks after reranking, not 20 loosely-relevant ones. Compress chunks with summarisation if the content is long.

Context Fragmentation

The answer requires two facts from different sections of a document — one retrieved, one not. The LLM sees a partial picture and either hallucinates the missing piece or refuses to answer. This is a chunking failure that manifests as a generation failure.

Fix: Parent-child retrieval — index small child chunks for retrieval precision, but inject the full parent section into context.
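
A parent-child index can be sketched with plain dictionaries. The fixed-width child split below stands in for whatever chunker you actually use; the point is the `parent_id` back-reference:

```python
def build_parent_child_index(sections, child_size=200):
    """Index small child chunks for retrieval precision, each pointing back
    at the full parent section it came from."""
    children = []
    for parent_id, section in enumerate(sections):
        for start in range(0, len(section), child_size):
            children.append({"text": section[start:start + child_size],
                             "parent_id": parent_id})
    return children

def expand_to_parents(hits, sections):
    """Swap retrieved children for their whole parent sections,
    deduplicated, preserving retrieval order."""
    seen, context = set(), []
    for hit in hits:
        if hit["parent_id"] not in seen:
            seen.add(hit["parent_id"])
            context.append(sections[hit["parent_id"]])
    return context
```

Retrieval runs over the small `children`; generation sees the full parent sections, so both halves of a fragmented answer arrive together.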

Stage 6 — Generation Pitfalls

Pitfalls
  • Parametric override — The LLM "knows" an answer from training and ignores the context, even when context contradicts it. Especially common for well-known facts or recent events.
  • Hallucination on partial context — Retrieval provides a partial answer; LLM fills the gaps with confident invention rather than admitting uncertainty.
  • Instruction following failure — LLM generates a long essay when the user wanted a bullet list; formatting instructions in the system prompt were overridden by example patterns in the retrieved context.
  • Context leakage — The LLM reveals the raw chunk text verbatim, exposing internal document structure or metadata the user should not see.
Fixes
  • Explicitly instruct: "Answer ONLY from the provided context. If the context does not contain the answer, say so."
  • Add a no-answer path — prompt the model to output a structured {"answer": null, "reason": "not in documents"} rather than hallucinating
  • Separate system instructions from context in the prompt structure — user instructions should come after context, not before
  • Strip sensitive metadata from chunk text before injection; keep it only in the citation metadata object
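
The first three fixes can be combined in the prompt builder. The exact instruction wording and the no-answer schema below are illustrative, not a standard:

```python
import json

NO_ANSWER = {"answer": None, "reason": "not in documents"}

def build_prompt(context_chunks, question):
    """Context first, user instructions after it, with an explicit
    no-answer instruction so the model has a sanctioned way to refuse."""
    context = "\n\n---\n\n".join(context_chunks)
    return (
        f"Context:\n{context}\n\n"
        "Answer ONLY from the provided context. If the context does not "
        f"contain the answer, reply with exactly: {json.dumps(NO_ANSWER)}\n\n"
        f"Question: {question}"
    )

def parse_answer(raw):
    """Detect the structured no-answer output; anything else is the answer."""
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict) and parsed.get("answer") is None:
            return None
    except ValueError:  # not JSON, so treat it as a plain-text answer
        pass
    return raw
```

Parsing the no-answer path on the way out matters as much as prompting for it: the caller gets a clean `None` to route to a fallback message instead of string-matching on refusal phrasings.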

Cross-Cutting Pitfalls

Some pitfalls span multiple stages and are harder to attribute to a single fix:

Testing on the Happy Path Only

RAG demos always use clean, well-formatted documents and well-phrased queries. Production brings messy PDFs, typos, ambiguous questions, and users who ask things that are adjacent to — but not in — the corpus. Testing only the happy path gives false confidence. Build a test set that includes unanswerable questions, adversarial queries, and documents with mixed quality.

Regenerating the Test Set Between Runs

A common anti-pattern: regenerate synthetic QA pairs from the documents each time you run evaluation. Because the test set changes, you cannot tell if a score improvement is a real gain or just sampling luck. Fix: freeze the test set in version control and never regenerate it during optimization.

No Observability

Without tracing each pipeline stage, a wrong answer is a black box. You cannot tell if it failed because the answer wasn't indexed, the right chunk wasn't retrieved, the LLM ignored the context, or the generation was hallucinated. Instrument every stage: log query embeddings, retrieved chunk IDs, reranker scores, and the final prompt sent to the LLM.
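
Stage-level tracing needs little more than a shared record that every stage appends to. A minimal sketch (field names are illustrative; in production you would emit these records to your logging or tracing backend):

```python
import time
import uuid

def new_trace(query):
    """Start one trace record per query; every pipeline stage appends to it."""
    return {"trace_id": str(uuid.uuid4()), "query": query, "stages": []}

def log_stage(trace, stage, **fields):
    """Record the stage name, a timestamp, and stage-specific evidence
    (retrieved chunk IDs, reranker scores, the final prompt, ...)."""
    trace["stages"].append({"stage": stage, "ts": time.time(), **fields})
    return trace
```

With retrieved chunk IDs and the final prompt captured per query, a wrong answer stops being a black box: you can replay the trace stage by stage and see exactly where it broke.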

Skipping Reranking to Save Latency

Teams skip reranking in early iterations to reduce latency, then never add it because "it seemed fine in testing." In production, retrieval noise degrades answer quality over time as the corpus grows. Adding reranking after launch requires re-evaluating the entire pipeline.

Treating RAG as a One-Time Setup

A RAG system that works at launch degrades silently as source documents change, user query patterns shift, and embedding models are updated. Production RAG requires ongoing monitoring: retrieval recall, answer faithfulness, citation accuracy, and index freshness should all be tracked as continuous metrics.

Debugging a Failing RAG System

When a RAG system gives a wrong answer, use this five-question diagnostic to find the stage where it broke:

| Question | How to check | Failure stage if "No" |
|---|---|---|
| Is the answer in the source documents at all? | Manually search the raw document corpus | Missing content — indexing gap |
| Did the right chunk get retrieved? | Log retrieved chunk IDs; verify the answer chunk appears in top k | Retrieval failure — improve hybrid / rerank |
| Did the chunk make it into the final prompt? | Log the full prompt sent to the LLM; check if chunk was truncated or dropped | Context construction failure — lost-in-middle or overflow |
| Did the LLM read the chunk? | Ask the LLM directly: "What does [source chunk] say about X?" | Generation failure — model ignored context |
| Is the chunk text correct after parsing? | Retrieve the stored chunk text and inspect it for parse errors | Ingestion failure — bad parsing |

2025 Production Lessons

Long-context models do not replace good RAG

Many teams tried to skip RAG by stuffing entire document corpora into 128K-token context windows in 2024–2025. The result: significantly higher cost, the lost-in-the-middle problem at scale, and worse accuracy than a well-tuned RAG system. Long context is a complement, not a replacement.

Structure-aware ingestion is the highest-leverage improvement

Teams that invested in proper document parsing — handling PDFs, tables, images, and code blocks as structured objects — saw larger accuracy gains than teams that spent the same time optimizing retrieval algorithms. Fix the foundation first.

Robustness is iterative, not designed-in upfront

A RAG system's robustness evolves in production, not in planning. The failure modes that matter are the ones your real users surface. Build observability first, deploy to real traffic early, and iterate based on actual failure logs rather than synthetic benchmarks.

The no-answer path is not optional

RAG systems that skip an explicit "I don't know" path reliably create a trust problem in production. Users will ask questions outside the corpus. Without a no-answer path, the LLM hallucinates — and users notice, and trust collapses. Implement a retrieval confidence threshold and a graceful fallback before launch.
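
A minimal sketch of the gate. The 0.55 threshold is a placeholder you would tune against your own retrieval score distribution, and `generate` stands in for whatever calls your LLM:

```python
FALLBACK = "I don't have this information in the indexed documents."

def answer_or_fallback(retrieval_scores, generate, threshold=0.55):
    """Refuse to generate when retrieval confidence is low, instead of
    letting the LLM guess from parametric memory."""
    if not retrieval_scores or max(retrieval_scores) < threshold:
        return FALLBACK
    return generate()  # only call the LLM when the evidence clears the bar
```
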

Checklist: Do You Understand This?

  • Can you name the six stages of a RAG pipeline and the most common failure at each?
  • Do you know why bad ingestion cannot be fixed by better retrieval?
  • Can you explain the lost-in-the-middle problem and two ways to mitigate it?
  • Do you know what to check first when a RAG system gives a wrong answer — and the five diagnostic questions to use?
  • Can you explain why regenerating the test set between evaluation runs is an anti-pattern?
  • Do you understand why a no-answer gate (retrieval confidence threshold) is essential in production?
  • Can you name three cross-cutting pitfalls that span multiple pipeline stages?
  • Do you understand why long-context models do not replace a well-tuned RAG system?