Chunking & Embeddings
Your RAG pipeline is only as good as what you put into the vector store. Chunking (how you split documents before embedding) and your choice of embedding model are the two decisions with the highest leverage on retrieval quality. A poor chunking strategy breaks answers that span multiple segments; the wrong embedding model mis-ranks the most relevant passages.
Why Chunking Matters
When you index a document, you embed chunks, not the whole document. At query time, the retriever finds the chunks most similar to the query and injects them into the prompt. Two failure modes dominate:
The optimal size depends on content type and query pattern; always measure retrieval recall before optimising.
Chunks too small
- Individual chunks lack context: a sentence about "it" is meaningless without its surrounding paragraphs
- The answer spans multiple chunks but only one is retrieved
- High recall requires retrieving many chunks, inflating context and cost
- Pronouns and cross-references lose their referents
Chunks too large
- The embedding represents the average of many ideas, so retrieval precision drops
- Relevant sentences buried in a large chunk score lower than a small, focused chunk
- More tokens injected per chunk means fewer chunks fit in the context window
- The "context cliff": response quality drops sharply around 2,500 injected tokens
Chunk Size & Overlap
| Content type | Recommended chunk size | Rationale |
|---|---|---|
| FAQ / support docs | 200–400 tokens | Each Q&A pair is self-contained; small chunks embed one idea precisely |
| General web content / articles | 400–512 tokens (baseline) | Best balance of precision and context; RecursiveCharacterTextSplitter default |
| Technical documentation | 512–800 tokens | Concepts are interlinked; slightly larger chunks preserve logical coherence |
| Legal / academic papers | 800–1,500 tokens | Arguments span long passages; analytical queries need more context per chunk |
| Factoid queries (any content) | 256–512 tokens | Specific facts embedded in focused short chunks retrieve more precisely |
| Analytical queries (any content) | 1,024+ tokens | Reasoning requires broader context; larger chunks reduce fragmentation |
Overlap: start at 10–20%, then question it
Overlap (repeating the last N tokens of a chunk at the start of the next) was the standard recommendation to avoid cutting an idea at a chunk boundary. The canonical starting point is 50–100 tokens of overlap on a 512-token chunk (~10–20%).
However, a January 2026 systematic analysis found that overlap provided no measurable recall improvement and only increased index size and embedding cost. Practical guidance: use overlap as a default safety net, but measure your pipeline; if recall does not improve, remove it and cut costs.
Overlap matters most when your splitter cuts mid-sentence. If you use a semantic or sentence-aware splitter, overlap adds less value.
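Mechanically, fixed overlap is just a sliding window over the token stream. A minimal sketch, operating on a pre-tokenised list (the tokeniser itself is assumed; in practice use the tokeniser of your embedding model):

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=50):
    """Fixed-size chunking: each chunk repeats the last `overlap`
    tokens of the previous one, so an idea cut at a boundary still
    appears whole in at least one chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

Setting `overlap=0` gives you the no-overlap variant to A/B-test against, in line with the 2026 findings above.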
Chunking Strategies
Start with the simplest strategy that works. Move to more sophisticated approaches only when baseline recall measurements show a gap.
1. Fixed-size (character / token) splitting
Start here for every project
Split at a fixed token count. LangChain's RecursiveCharacterTextSplitter is the standard implementation: it tries progressively smaller separators to respect natural boundaries while staying within the token limit.
Recall: 85–90% at 400 tokens (MTEB-style benchmarks)
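A simplified, character-based sketch of the recursive idea (the real splitter is token-aware and configurable; the function name here is illustrative, not LangChain's API):

```python
def recursive_split(text, max_len=400, separators=("\n\n", "\n", ". ", " ")):
    """Try the coarsest separator first; recurse with finer separators
    on any piece that is still over max_len (character counts stand in
    for token counts in this sketch)."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate          # keep packing this chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > max_len:         # still too big: go finer
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Paragraph breaks are preferred over line breaks, which are preferred over sentence ends, so chunks align with natural boundaries whenever the budget allows.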
2. Semantic chunking
Higher accuracy, higher cost
Embed every sentence, then group sentences whose embeddings are similar into a chunk. A new chunk starts when semantic similarity drops below a threshold, so chunk boundaries align with topic shifts rather than arbitrary token counts.
Recall: 91–92% (LLMSemanticChunker: 91.9%)
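A sketch of the core loop with plain-Python cosine similarity. `embed` is a stand-in for any sentence-embedding call, and real implementations typically compare each sentence against a rolling window of preceding sentences rather than only the previous one:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.7):
    """Start a new chunk whenever similarity between adjacent
    sentence embeddings drops below the threshold."""
    chunks, prev = [[sentences[0]]], embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev, vec) < threshold:
            chunks.append([sent])       # topic shift: open a new chunk
        else:
            chunks[-1].append(sent)
        prev = vec
    return [" ".join(c) for c in chunks]
```

The threshold is the main knob: lower it and chunks grow; raise it and boundaries multiply. It should be tuned against measured recall, not set by intuition.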
3. Document-aware / structure-preserving chunking
Essential for structured content
Uses document structure (Markdown headers, HTML tags, PDF layout, code blocks, table boundaries) to define chunk boundaries. Tools like Unstructured.io detect content type and apply element-aware parsing before chunking.
4. Parent-child chunking (hierarchical)
Best of both worlds for precision vs context
Index small child chunks for precision retrieval, but return the larger parent chunk as context to the LLM. Example: index 128-token sentence-level chunks for embedding similarity search; when a chunk is retrieved, expand it to the full 512-token paragraph before injecting into the prompt.
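A minimal in-memory sketch of the pattern; the class name and the `score` callback are illustrative, not a real library API:

```python
class ParentChildIndex:
    """Index small child chunks for precise retrieval; return the
    larger parent chunk they came from as LLM context."""

    def __init__(self):
        self.children = []   # list of (child_text, parent_id)
        self.parents = {}    # parent_id -> parent_text

    def add(self, parent_id, parent_text, child_texts):
        self.parents[parent_id] = parent_text
        for child in child_texts:
            self.children.append((child, parent_id))

    def retrieve(self, query, score):
        """score(query, child_text) -> similarity. Finds the
        best-matching child, then expands to its parent
        (top-k and parent deduplication omitted for brevity)."""
        _, parent_id = max(self.children, key=lambda c: score(query, c[0]))
        return self.parents[parent_id]
```

In production the child vectors live in the vector store and the parent expansion is a metadata lookup, but the shape of the data is the same: every child carries a pointer to its parent.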
5. Contextual retrieval (Anthropic, 2024–2025)
High-impact, low-complexity improvement
Before embedding each chunk, prepend a short context summary generated by an LLM: the document title, section heading path, and a 1–2 sentence description of what the chunk covers. This makes chunks self-contained: a chunk that says "the rate is 3.5%" becomes "[Mortgage Products / Fixed Rate Loans] The standard fixed 30-year rate is 3.5%...". This reduces retrieval failure by 49% on its own, or by 67% when combined with reranking.
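The transformation itself is trivial; the cost is one LLM call per chunk to produce the summary. A sketch of the assembly step (in practice the summarisation prompt includes the full document so the LLM has the needed context):

```python
def contextualise(chunk, doc_title, section_path, summary):
    """Prepend title, heading path, and an LLM-written summary so the
    chunk embeds -- and later retrieves -- as self-contained text."""
    header = f"[{doc_title} / {' / '.join(section_path)}] {summary}"
    return header + "\n" + chunk
```

Both the embedding and the text injected into the prompt should use the contextualised form, so retrieval and generation stay consistent.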
6. Late chunking
Emerging: context-aware embeddings
Embed the entire document first using a long-context embedding model, then pool the resulting per-token embeddings into chunk-sized segments. Because the full document was processed before splitting, each chunk embedding carries context from the surrounding text: pronouns resolve correctly and cross-references are captured. Requires long-context embedding models (JinaAI, Nomic) rather than standard 512-token models.
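The pooling mechanics, assuming you already have one embedding per document token from a single long-context forward pass (mean pooling shown here; the span offsets come from whatever splitter you use):

```python
def late_chunk(token_embeddings, chunk_spans):
    """Mean-pool per-token embeddings into one vector per chunk span.
    Each token vector already attended to the whole document, so the
    pooled chunk vectors carry surrounding context."""
    pooled = []
    for start, end in chunk_spans:
        span = token_embeddings[start:end]
        dim = len(span[0])
        pooled.append([sum(vec[i] for vec in span) / len(span)
                       for i in range(dim)])
    return pooled
```

Contrast with standard chunking, where each chunk is embedded in isolation and tokens never see text outside their own chunk.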
Strategy Chooser
| Situation | Recommended strategy |
|---|---|
| Starting a new project, any content type | RecursiveCharacterTextSplitter, 400–512 tokens, 10% overlap |
| PDFs with tables, code, mixed layout | Document-aware (Unstructured.io) + fixed-size per element |
| Chunks are retrieving but answers are bad (context missing) | Contextual retrieval (prepend document context) or parent-child |
| Long narrative or research documents covering many topics | Semantic chunking for topic-aware boundaries |
| Precision is good but answers lack coherence | Parent-child: retrieve small child, inject large parent |
| Documents with many pronouns and cross-references | Late chunking or contextual retrieval |
| Maximum accuracy, cost is secondary | Semantic chunking + contextual retrieval + reranking |
Embedding Models
Embedding models convert text to dense vectors. The quality of the embedding determines how well semantic similarity search finds relevant chunks. All models below support asymmetric retrieval (short query matched against long document chunk).
| Model | MTEB score | Dimensions | Cost | Best for |
|---|---|---|---|---|
| Voyage AI voyage-3.5 | ~66 | 1,024 | ~$0.06 / 1M | Best retrieval-specific accuracy; trained on adversarial negatives; top choice for RAG |
| Voyage voyage-3.5-lite | 66.1 | 512 | Low | Best accuracy-cost balance for production RAG |
| Cohere embed-v4 | 65.2 | 1,024 | $0.12 / 1M | Top accuracy benchmark, multilingual, enterprise |
| OpenAI text-embedding-3-large | 64.6 | 3,072 (matryoshka) | $0.13 / 1M | Already on OpenAI stack; matryoshka allows dimension reduction |
| OpenAI text-embedding-3-small | 62.3 | 1,536 | $0.02 / 1M | Cost-sensitive, high-volume; 6.5× cheaper than large with a minor accuracy drop |
| Nomic Embed Text V2 | ~63 | 768 | Free (open-source) | Open-source, self-hosted, MoE architecture, multilingual, transparent training data |
| BGE-M3 (BAAI) | ~62 | 1,024 | Free (open-source) | Multi-functionality (dense + sparse + colbert); strong multilingual; self-hosted |
How to choose an embedding model
- Already on OpenAI: start with text-embedding-3-small ($0.02/1M); upgrade to 3-large only if recall measurements show a gap
- Maximum retrieval accuracy, budget available: Voyage AI voyage-3.5, purpose-built for retrieval and trained on adversarial negatives
- Best accuracy-to-cost ratio in production: voyage-3.5-lite, 66.1% MTEB at one of the lowest price points
- Self-hosted / open-source: Nomic Embed V2 (transparency, MoE architecture) or BGE-M3 (dense + sparse + colbert in one model)
- Critical constraint: always embed queries and documents with the same model; cross-model similarity is meaningless
Matryoshka Embeddings
OpenAI's text-embedding-3 models use Matryoshka Representation Learning (MRL): the embedding is structured so that the first N dimensions form a valid, lower-quality embedding on their own. You can truncate from 3,072 down to 256 dimensions with a controlled accuracy/cost tradeoff:
| Dimensions | MIRACL avg score | Storage vs full |
|---|---|---|
| 3,072 (full) | 54.9 | 1× |
| 1,536 | 54.4 (−0.9%) | 0.5× |
| 512 | 52.0 (−5.3%) | 0.17× |
| 256 | 49.8 (−9.3%) | 0.08× |
Practically: using 1,536 dimensions instead of 3,072 cuts vector storage cost in half with under 1% accuracy loss. Useful when you have millions of chunks and vector storage cost is material.
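Truncation is just a slice plus renormalisation; OpenAI's embeddings API can also do this server-side via the `dimensions` request parameter. A sketch of the client-side version:

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` dimensions of an MRL embedding and
    L2-renormalise so cosine similarities stay comparable."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Renormalisation matters: without it, truncated vectors have smaller norms and dot-product scores are no longer comparable across dimension counts.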
Production Indexing Pipeline
The recommended baseline for most RAG systems: parse with document-aware tooling, split into 400–512 token chunks, contextualise each chunk, embed, and index.
The contextualise step reduces retrieval failure by 49–67%; it is now standard practice, not optional.
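Wired together, the baseline can be sketched with each stage injected as a callable, so any stage can be swapped as recall measurements dictate (all five stage functions here are assumptions for illustration, not a real library API):

```python
def index_document(doc, parse, chunk, contextualise, embed, store):
    """Baseline indexing: parse -> chunk -> contextualise -> embed -> store.
    Queries must later be embedded with the same `embed` model."""
    for element in parse(doc):            # document-aware parsing
        for c in chunk(element):          # fixed-size baseline split
            text = contextualise(doc, c)  # prepend document context
            store(text, embed(text))      # persist text + vector
```

Keeping the stages decoupled like this is what makes the "measure, then upgrade one stage" workflow cheap.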
Failure Modes
Chunking failures
- Tables split across chunks: use document-aware splitting; a half-table chunk embeds as noise
- Code blocks broken mid-function: treat code blocks as atomic units regardless of token count
- Headers separated from body: ensure header + first paragraph stay in the same chunk
- Context cliff: injecting 3,000+ tokens degrades answer quality sharply; stay under ~2,500 tokens total
Embedding failures
- Mixing embedding models: querying with model A against an index built with model B produces meaningless similarity scores
- Embedding stale chunks: re-index on document update; stale embeddings confidently surface outdated content
- Token limit exceeded: most models have a 512–8,192 token input limit; chunks exceeding it are silently truncated
- Dimension mismatch: changing embedding models requires full re-indexing
2025β2026 Developments
- Contextual retrieval (Anthropic) is now standard practice: the 49–67% retrieval failure reduction has moved it from "advanced technique" to "default best practice."
- Overlap questioned by 2026 research: a January 2026 systematic analysis challenged the blanket recommendation for 10–20% overlap. Measure your pipeline; remove overlap if it doesn't move recall metrics.
- Late chunking gaining traction: JinaAI and Nomic have released long-context embedding models (up to 8,192 tokens) that make late chunking practical.
- Voyage AI now the top RAG benchmark choice: embedding models trained specifically on retrieval tasks consistently outperform general-purpose embeddings on RAG benchmarks.
- BGE-M3 unifies dense + sparse: enables hybrid retrieval without running two separate embedding pipelines.
Checklist: Do You Understand This?
- Can you explain why too-small and too-large chunks hurt RAG quality in different ways?
- Do you know the recommended starting chunk size for FAQ docs vs legal documents?
- Can you describe contextual retrieval and why it reduces retrieval failures by 49–67%?
- Do you understand the parent-child chunking pattern and when it beats standard single-level chunking?
- Can you name three embedding models and state when you would choose each?
- Do you know what Matryoshka embeddings are and why you might use fewer than the maximum dimensions?
- Can you identify the three most dangerous embedding failure modes (mixing models, stale index, token truncation)?