RAG Pipeline Anatomy
A RAG system has two distinct pipelines: ingestion (offline, builds the index) and retrieval (online, answers queries). Each stage has choices that meaningfully affect quality. This page walks through every stage with the key decisions at each step.
The Ingestion Pipeline
Offline ingestion pipeline — runs once (or on update)
Stage 1: Loading
Load your source documents into a format the pipeline can process. Common loaders:
- PDF: pypdf, pdfplumber, or unstructured (unstructured handles tables and images better than simple text extractors)
- Web pages: BeautifulSoup or trafilatura (strip navigation, ads, and boilerplate)
- Databases: Query rows and format each row as a text document before ingestion
- Office documents: python-docx (Word), openpyxl (Excel)
Preserve metadata at this stage: source URL, document title, creation date, author. You will attach this metadata to each chunk for filtering and attribution.
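As a rough illustration of the database case with metadata preserved, the sketch below turns one row into a document. The Document class, the row_to_document helper, and the column names (id, title, body, created, author) are hypothetical and chosen only for this example; the sample row is made-up data.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A loaded source document plus the metadata to carry forward."""
    text: str
    metadata: dict = field(default_factory=dict)


def row_to_document(row: dict, table: str) -> Document:
    """Format one database row as a text document (hypothetical schema)."""
    # Render the content columns as "column: value" lines.
    body_columns = ("title", "body")
    text = "\n".join(f"{col}: {row[col]}" for col in body_columns)
    # Keep attribution fields as metadata instead of mixing them into the text.
    metadata = {
        "source": f"{table}/{row['id']}",
        "title": row["title"],
        "created": row["created"],
        "author": row["author"],
    }
    return Document(text=text, metadata=metadata)


# Illustrative row; the values are invented for the example.
row = {
    "id": 7,
    "title": "Leave policy",
    "body": "Full-time employees receive 25 days of annual leave.",
    "created": "2025-01-15",
    "author": "HR",
}
doc = row_to_document(row, table="policies")
```

Keeping metadata in its own dictionary means later stages can copy it onto every chunk without parsing it back out of the text.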
Stage 2: Chunking
Split documents into chunks small enough for meaningful embedding but large enough to contain useful context. Common strategies:
- Fixed-size with overlap: 512 tokens per chunk, 50–100 token overlap between adjacent chunks. Simple and consistent but may split mid-sentence or mid-concept.
- Sentence-based: Split on sentence boundaries. Preserves semantic units but chunk sizes vary widely.
- Paragraph-based: Split on double newlines. Natural for prose documents; good default for most cases.
- Section-based: Split on headers (h1, h2). Keeps content semantically grouped; requires well-structured source documents.
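The fixed-size strategy from the list above can be sketched as a sliding window over a token list. This is a minimal sketch, assuming the text has already been tokenized; the chunk_fixed_size name and the placeholder tokens are illustrative, and a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_fixed_size(
    tokens: list[str], chunk_size: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Placeholder tokens stand in for real tokenizer output.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_fixed_size(tokens, chunk_size=512, overlap=64)
```

With 1200 tokens, a 512-token window, and a 64-token overlap, the window advances 448 tokens at a time, so the last chunk is shorter than the others.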
Chunk size trade-off: smaller chunks = more precise retrieval (the retrieved text directly answers the question) but less context per chunk. Larger chunks = more surrounding context but less retrieval precision. Start with 512–1024 tokens and tune based on retrieval quality.
Stage 3: Embedding
Convert each chunk to a dense vector using an embedding model. The embedding captures the semantic meaning of the text — chunks with similar meaning have similar vectors.
- Use the same embedding model for ingestion and query time — vectors must be in the same space
- Common models: text-embedding-3-small (OpenAI), embed-english-v3.0 (Cohere), BAAI/bge-small-en-v1.5 (open-source, runs locally)
- Embed in batches for efficiency; most embedding APIs accept lists of texts
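Batching can be sketched independently of any provider. In the sketch below, embed_batch is a hypothetical callable standing in for whatever embedding client you use, and toy_embed_batch is a placeholder that is not a real model; only the batching logic is the point.

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[Vector]],
    batch_size: int = 64,
) -> list[Vector]:
    """Call embed_batch once per batch and keep vectors in input order."""
    vectors: list[Vector] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def toy_embed_batch(batch: Sequence[str]) -> list[Vector]:
    """Placeholder, NOT a real model: maps each text to a 2-d vector."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]


texts = [f"chunk number {i}" for i in range(150)]
vectors = embed_in_batches(texts, toy_embed_batch, batch_size=64)
```

Passing the embedding call in as a parameter also makes it easy to reuse the same function at query time, which helps keep ingestion and query vectors in the same space.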
Stage 4: Storing
Insert the vectors and their associated metadata into a vector database. Each record stores:
- The vector (the embedding)
- The original chunk text
- Metadata: source document, chunk index, section, date, etc.
Good metadata design enables filtered retrieval — finding only chunks from a specific document, date range, or category.
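To make the record layout concrete, here is a minimal in-memory sketch. The Record class, the filter_records helper, and the metadata keys are assumptions made for illustration, and the vectors and dates are invented sample values; a real vector database exposes its own schema and filter syntax.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    vector: list[float]  # the embedding
    text: str            # the original chunk text
    metadata: dict       # source document, chunk index, date, etc.


def filter_records(
    records: list[Record],
    source: Optional[str] = None,
    after: Optional[date] = None,
) -> list[Record]:
    """Keep records matching the given source and dated strictly after `after`."""
    kept = []
    for record in records:
        if source is not None and record.metadata.get("source") != source:
            continue
        record_date = record.metadata.get("date")
        # Records without a date cannot satisfy a date filter.
        if after is not None and (record_date is None or record_date <= after):
            continue
        kept.append(record)
    return kept


# Illustrative records; vectors and dates are invented for the example.
records = [
    Record([0.1, 0.9], "Annual leave entitlement is 25 days per year...",
           {"source": "Employee Handbook", "chunk_index": 0, "date": date(2023, 6, 1)}),
    Record([0.2, 0.8], "Leave entitlement for part-time staff is prorated...",
           {"source": "HR Policy Update", "chunk_index": 0, "date": date(2025, 1, 10)}),
]
recent = filter_records(records, after=date(2024, 1, 1))
```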
The Retrieval Pipeline
Online retrieval pipeline — runs per user question
Stage 5: Query Embedding
At query time, embed the user's question using the same embedding model used during ingestion. This converts the query into a vector in the same semantic space as the stored chunks.
Stage 6: Similarity Search
Search the vector database for the chunks whose vectors are most similar to the query vector. Most databases use cosine similarity or dot product; the two produce the same ranking when all vectors are normalized to unit length. Retrieve the top-k results (typically k=3–10, depending on how much context you want to include and your context window budget).
Apply metadata filters here if needed: "only retrieve from documents dated after 2024-01-01" or "only from the legal-contracts collection."
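The search step can be sketched as a brute-force scan: filter by metadata, score every remaining chunk by cosine similarity, and keep the k best. This is an illustrative sketch only. The top_k function, its keep parameter, and the tiny 2-d vectors are assumptions, and production vector databases typically use approximate nearest-neighbor indexes rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=3, keep=None):
    """Return the k highest-scoring (score, text) pairs.

    records is a list of (vector, text, metadata) tuples; keep is an
    optional predicate over metadata applied before scoring.
    """
    candidates = [r for r in records if keep is None or keep(r[2])]
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for vector, text, _metadata in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-d vectors; real embeddings have hundreds or thousands of dimensions.
records = [
    ([1.0, 0.0], "chunk A", {"collection": "legal-contracts"}),
    ([0.7, 0.7], "chunk B", {"collection": "hr"}),
    ([0.0, 1.0], "chunk C", {"collection": "legal-contracts"}),
]
query = [0.9, 0.1]
results = top_k(query, records, k=2)
legal_only = top_k(
    query, records, k=2, keep=lambda m: m["collection"] == "legal-contracts"
)
```

Filtering before scoring means the k results all satisfy the filter; filtering after scoring could return fewer than k chunks.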
Stage 7: Augmentation
Insert the retrieved chunks into Claude's context. A typical augmented prompt structure:
You are a helpful assistant. Answer the user's question using only the provided context.
If the answer is not in the context, say you don't know.
Context:
[Source: Employee Handbook, Section 4.2]
Annual leave entitlement is 25 days per year for all full-time employees...
[Source: HR Policy Update, Jan 2025]
Leave entitlement for part-time staff is prorated based on contracted hours...
Question: How many days of annual leave do I get?
Stage 8: Generation
Claude generates an answer grounded in the retrieved context. Key prompt instructions that improve RAG output quality:
- "Answer using only the provided context" — reduces hallucination over retrieved text
- "If the context does not contain the answer, say so" — prevents fabrication for missing information
- "Cite the source for each claim" — enables attribution and user verification
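One way to combine the retrieved chunks with these instructions is a small prompt-building function. This is a sketch under assumptions: build_prompt and the chunk dictionary keys (source, text) are hypothetical, and the resulting string would be sent to the model through whichever client you use.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble instructions, labeled context chunks, and the question."""
    instructions = (
        "You are a helpful assistant. Answer the user's question using only "
        "the provided context.\n"
        "If the answer is not in the context, say you don't know.\n"
        "Cite the source for each claim."
    )
    # Label every chunk with its source so the answer can cite it.
    context_blocks = [f"[Source: {c['source']}]\n{c['text']}" for c in chunks]
    context = "\n\n".join(context_blocks)
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {question}"


chunks = [
    {
        "source": "Employee Handbook, Section 4.2",
        "text": "Annual leave entitlement is 25 days per year for all full-time employees...",
    },
    {
        "source": "HR Policy Update, Jan 2025",
        "text": "Leave entitlement for part-time staff is prorated based on contracted hours...",
    },
]
prompt = build_prompt("How many days of annual leave do I get?", chunks)
```

Placing the question after the context keeps the source labels next to the text they describe, which makes the citation instruction easier to follow.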
Checklist: Do You Understand This?
- Ingestion: load → chunk → embed → store. Runs offline. Preserve metadata throughout.
- Chunking trade-off: smaller = more precise retrieval, larger = more context per chunk. Start at 512–1024 tokens.
- Embedding: same model for ingestion and query time. Batch for efficiency.
- Retrieval: embed query → similarity search (top-k) → insert chunks into context → generate
- Prompt instructions: "use only the provided context" + "say if the answer is not there" + "cite sources"