Intermediate

RAG Pipeline Anatomy

A RAG system has two distinct pipelines: ingestion (offline, builds the index) and retrieval (online, answers queries). The choices made at each stage directly affect answer quality. This page walks through every stage and the key decisions at each step.

The Ingestion Pipeline

Loader (PDF / HTML / DB / DOCX) → Cleaner (strip boilerplate) → Chunker (fixed / semantic / hierarchical) → Embedder (text-embedding model) → Vector Store (Pinecone / Chroma / pgvector)

Figure: Offline ingestion pipeline, runs once (or on update)

Stage 1: Loading

Load your source documents into a format the pipeline can process. Common loaders:

PDF: pypdf, pdfplumber, or unstructured — unstructured handles tables and images better than simple text extractors
Web pages: BeautifulSoup or trafilatura — strip navigation, ads, and boilerplate
Databases: Query rows and format each row as a text document before ingestion
Office documents: python-docx (Word), openpyxl (Excel)

Preserve metadata at this stage: source URL, document title, creation date, author. You will attach this metadata to each chunk for filtering and attribution.
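
As a minimal sketch of the database case above, the following turns each row into a text document and keeps its metadata alongside the text. The `articles` table, its columns, the `Document` class, and the `db://` source identifier are all hypothetical names chosen for illustration; only the standard-library `sqlite3` module is used.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Document:
    """A loaded document: raw text plus metadata for later filtering and attribution."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_rows_as_documents(conn: sqlite3.Connection) -> list[Document]:
    """Format each row of the hypothetical `articles` table as one text document."""
    rows = conn.execute(
        "SELECT id, title, author, created, body FROM articles ORDER BY id"
    ).fetchall()
    docs = []
    for row_id, title, author, created, body in rows:
        docs.append(
            Document(
                text=f"{title}\n\n{body}",
                metadata={
                    "source": f"db://articles/{row_id}",  # hypothetical identifier scheme
                    "title": title,
                    "author": author,
                    "created": created,
                },
            )
        )
    return docs


# Demo with an in-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles "
    "(id INTEGER PRIMARY KEY, title TEXT, author TEXT, created TEXT, body TEXT)"
)
conn.execute(
    "INSERT INTO articles (title, author, created, body) VALUES (?, ?, ?, ?)",
    ("Leave Policy", "HR", "2025-01-15", "Full-time employees receive 25 days of annual leave."),
)
docs = load_rows_as_documents(conn)
print(docs[0].metadata["source"])
```

Keeping the metadata in a dictionary next to the text makes it easy to copy onto every chunk produced in the next stage.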

Stage 2: Chunking

Split documents into chunks small enough for meaningful embedding but large enough to contain useful context. Common strategies:

Fixed-size with overlap: 512 tokens per chunk, 50–100 token overlap between adjacent chunks. Simple and consistent but may split mid-sentence or mid-concept.
Sentence-based: Split on sentence boundaries. Preserves semantic units but chunk sizes vary widely.
Paragraph-based: Split on double newlines. Natural for prose documents; good default for most cases.
Section-based: Split on headers (h1, h2). Keeps content semantically grouped; requires well-structured source documents.

Chunk size is a trade-off. Smaller chunks give more precise retrieval (the retrieved text directly answers the question) but carry less context per chunk. Larger chunks carry more surrounding context but retrieve less precisely. Start with 512–1024 tokens and tune based on measured retrieval quality.
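
The fixed-size and paragraph-based strategies can be sketched in a few lines. This is an illustration under a loud assumption: whitespace-separated words stand in for tokens, whereas a real pipeline would count tokens with the tokenizer of its embedding model. The function names and default sizes are hypothetical.

```python
def chunk_fixed(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


def chunk_paragraphs(text: str) -> list[str]:
    """Split on double newlines, dropping empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


# Assumption: words approximate tokens for this demo only.
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_fixed(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
# one two three four
# four five six seven
# seven eight nine ten
```

The overlap means the last token of one chunk reappears as the first token of the next, so a concept that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.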

Stage 3: Embedding

Convert each chunk to a dense vector using an embedding model. The embedding captures the semantic meaning of the text — chunks with similar meaning have similar vectors.

Use the same embedding model for ingestion and query time — vectors must be in the same space
Common models: text-embedding-3-small (OpenAI), embed-english-v3.0 (Cohere), BAAI/bge-small-en-v1.5 (open-source, runs locally)
Embed in batches for efficiency — most embedding APIs accept lists of texts
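
The batching pattern might look like the sketch below. `toy_embed_batch` is a stand-in, not a real model: it hashes words into a small normalized vector so the example runs without network access, and it does not capture semantic meaning. In practice you would pass in a function that wraps your provider's embedding call; the names and the batch size here are assumptions.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real models produce hundreds or thousands of dimensions


def toy_embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in for an embedding API call that accepts a list of texts."""
    vectors = []
    for text in texts:
        vec = [0.0] * DIM
        for word in text.lower().split():
            idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
            vec[idx] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
        vectors.append([v / norm for v in vec])
    return vectors


def embed_in_batches(texts, embed_batch, batch_size=2):
    """Call `embed_batch` once per batch instead of once per text."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors


chunks = ["annual leave policy", "part-time staff", "expense claims", "remote work", "onboarding"]
vectors = embed_in_batches(chunks, toy_embed_batch, batch_size=2)
print(len(vectors), len(vectors[0]))
```

Passing the embedding function as a parameter also makes it straightforward to reuse exactly the same function at query time, which keeps ingestion and query vectors in the same space.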

Stage 4: Storing

Insert the vectors and their associated metadata into a vector database. Each record stores:

The vector (the embedding)
The original chunk text
Metadata: source document, chunk index, section, date, etc.

Good metadata design enables filtered retrieval — finding only chunks from a specific document, date range, or category.
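
To make the record layout concrete, here is a toy in-memory store. It is only an illustration of what each record holds and how a metadata filter works; `Record`, `InMemoryStore`, and `where` are hypothetical names, the vectors are made up, and a real vector database would add indexing and persistence.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    vector: list[float]  # the embedding
    text: str            # the original chunk text
    metadata: dict = field(default_factory=dict)  # source, chunk index, section, date, ...


class InMemoryStore:
    """Toy stand-in for a vector database."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def add(self, vector: list[float], text: str, **metadata) -> None:
        self.records.append(Record(vector, text, dict(metadata)))

    def where(self, **conditions) -> list[Record]:
        """Return records whose metadata equals every given key/value pair."""
        return [
            r for r in self.records
            if all(r.metadata.get(key) == value for key, value in conditions.items())
        ]


store = InMemoryStore()
store.add([0.1, 0.9], "Annual leave entitlement is 25 days per year...",
          source="Employee Handbook", section="4.2", chunk_index=0)
store.add([0.2, 0.8], "Leave entitlement for part-time staff is prorated...",
          source="HR Policy Update", date="2025-01", chunk_index=0)

handbook_only = store.where(source="Employee Handbook")
print(len(handbook_only))
```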

The Retrieval Pipeline

User Query (natural language question) → Query Embed (same model as ingestion) → Similarity Search (top-k nearest vectors) → Augment Prompt (chunks + original query) → Generate (Claude answers with context)

Figure: Online retrieval pipeline, runs per user question

Stage 5: Query Embedding

At query time, embed the user's question using the same embedding model used during ingestion. This converts the query into a vector in the same semantic space as the stored chunks.

Stage 6: Similarity Search

Search the vector database for the chunks whose vectors are most similar to the query vector. Most databases score similarity with cosine similarity or dot product; the two produce the same ranking when vectors are normalized to unit length. Retrieve the top-k results (typically k = 3–10, depending on how much context you want to include and your context window budget).

Apply metadata filters here if needed: "only retrieve from documents dated after 2024-01-01" or "only from the legal-contracts collection."
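
A brute-force version of this step can be sketched as follows. It assumes records are plain dictionaries with hypothetical `id`, `vector`, and `date` keys, that dates are ISO strings (so string comparison orders them correctly), and that the collection is small enough to scan; production databases use approximate nearest-neighbor indexes instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, records, k=3, min_date=None):
    """Apply the optional date filter, then return the k most similar records."""
    candidates = [r for r in records if min_date is None or r["date"] >= min_date]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vec, r["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Made-up two-dimensional vectors for illustration only.
records = [
    {"id": "A", "vector": [1.0, 0.0], "date": "2023-06-01"},
    {"id": "B", "vector": [0.9, 0.1], "date": "2024-03-01"},
    {"id": "C", "vector": [0.0, 1.0], "date": "2025-01-10"},
]
query = [1.0, 0.0]

print([r["id"] for r in top_k(query, records, k=2)])
# ['A', 'B']
print([r["id"] for r in top_k(query, records, k=2, min_date="2024-01-01")])
# ['B', 'C']
```

Filtering before ranking means the date constraint can change which chunks make the top-k: record A is the closest match overall but is excluded once the filter applies.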

Stage 7: Augmentation

Insert the retrieved chunks into Claude's context. A typical augmented prompt structure:

You are a helpful assistant. Answer the user's question using only the provided context.
If the answer is not in the context, say you don't know.

Context:
[Source: Employee Handbook, Section 4.2]
Annual leave entitlement is 25 days per year for all full-time employees...

[Source: HR Policy Update, Jan 2025]
Leave entitlement for part-time staff is prorated based on contracted hours...

Question: How many days of annual leave do I get?
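
Assembling a prompt of this shape from retrieved chunks could look like the sketch below. The `build_prompt` function and the `source` and `text` keys are hypothetical; the wording of the instructions is copied from the example above.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Join retrieved chunks, each labeled with its source, ahead of the question."""
    context = "\n\n".join(f"[Source: {c['source']}]\n{c['text']}" for c in chunks)
    return (
        "You are a helpful assistant. Answer the user's question using only the provided context.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


retrieved = [
    {
        "source": "Employee Handbook, Section 4.2",
        "text": "Annual leave entitlement is 25 days per year for all full-time employees...",
    },
    {
        "source": "HR Policy Update, Jan 2025",
        "text": "Leave entitlement for part-time staff is prorated based on contracted hours...",
    },
]
prompt = build_prompt("How many days of annual leave do I get?", retrieved)
print(prompt)
```

Labeling every chunk with its source in the prompt gives the model something concrete to cite in its answer.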

Stage 8: Generation

Claude generates an answer grounded in the retrieved context. Key prompt instructions that improve RAG output quality:

"Answer using only the provided context": reduces hallucination by keeping the answer tied to the retrieved text
"If the context does not contain the answer, say so": prevents fabrication when information is missing
"Cite the source for each claim": enables attribution and lets users verify the answer

Checklist: Do You Understand This?

Ingestion: load → chunk → embed → store. Runs offline. Preserve metadata throughout.
Chunking trade-off: smaller = more precise retrieval, larger = more context per chunk. Start at 512–1024 tokens.
Embedding: same model for ingestion and query time. Batch for efficiency.
Retrieval: embed query → similarity search (top-k) → insert chunks into context → generate.
Prompt instructions: "use only the provided context" + "say if the answer is not there" + "cite sources"