Chunking & Embeddings
Your RAG pipeline is only as good as what you put into the vector store. Chunking (how you split documents before embedding) and your choice of embedding model are the two decisions with the highest leverage on retrieval quality. A poor chunking strategy breaks answers that span multiple segments; the wrong embedding model mis-ranks the most relevant passages.
Why Chunking Matters
When you index a document, you embed chunks, not the whole document. At query time, the retriever finds the chunks most similar to the query and injects them into the prompt. Two failure modes dominate:
The optimal size depends on content type and query pattern; always measure retrieval recall before optimising.
Chunks too small
- Individual chunks lack context: a sentence about "it" is meaningless without its surrounding paragraphs
- The answer spans multiple chunks but only one is retrieved
- High recall requires retrieving many chunks, inflating context and cost
- Pronouns and cross-references lose their referents
Chunks too large
- The embedding represents the average of many ideas, so retrieval precision drops
- Relevant sentences buried in a large chunk score lower than a small, focused chunk
- More tokens injected per chunk means fewer chunks fit in the context window
- The "context cliff": response quality drops sharply around 2,500 injected tokens
Chunk Size & Overlap
| Content type | Recommended chunk size | Rationale |
|---|---|---|
| FAQ / support docs | 200–400 tokens | Each Q&A pair is self-contained; small chunks embed one idea precisely |
| General web content / articles | 400–512 tokens (baseline) | Best balance of precision and context; RecursiveCharacterTextSplitter default |
| Technical documentation | 512–800 tokens | Concepts are interlinked; slightly larger chunks preserve logical coherence |
| Legal / academic papers | 800–1,500 tokens | Arguments span long passages; analytical queries need more context per chunk |
| Factoid queries (any content) | 256–512 tokens | Specific facts embedded in focused short chunks retrieve more precisely |
| Analytical queries (any content) | 1,024+ tokens | Reasoning requires broader context; larger chunks reduce fragmentation |
Overlap: start at 10–20%, then question it
Overlap (repeating the last N tokens of a chunk at the start of the next) was the standard recommendation to avoid cutting an idea at a chunk boundary. The canonical starting point is 50–100 tokens of overlap on a 512-token chunk (~10–20%).
However, a January 2026 systematic analysis found that overlap provided no measurable recall improvement and only increased index size and embedding cost. Practical guidance: use overlap as a default safety net, but measure your pipeline; if recall does not improve, remove it and cut costs.
Overlap matters most when your splitter cuts mid-sentence. If you use a semantic or sentence-aware splitter, overlap adds less value.
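Mechanically, fixed overlap is just a sliding window over the token stream. A minimal sketch, operating on a pre-tokenised list (the tokeniser itself is assumed; in practice use the tokeniser of your embedding model):

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=50):
    """Fixed-size chunking: each chunk repeats the last `overlap`
    tokens of the previous one, so an idea cut at a boundary still
    appears whole in at least one chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```

Setting `overlap=0` gives you the no-overlap variant to A/B-test against, in line with the 2026 findings above.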
Chunking Strategies
Start with the simplest strategy that works. Move to more sophisticated approaches only when baseline recall measurements show a gap.
1. Fixed-size (character / token) splitting
Start here for every project
Split at a fixed token count. LangChain's RecursiveCharacterTextSplitter is the standard implementation: it tries progressively smaller separators to respect natural boundaries while staying within the token limit.
Recall: 85–90% at 400 tokens (MTEB-style benchmarks)
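A simplified, character-based sketch of the recursive idea (the real splitter is token-aware and configurable; the function name here is illustrative, not LangChain's API):

```python
def recursive_split(text, max_len=400, separators=("\n\n", "\n", ". ", " ")):
    """Try the coarsest separator first; recurse with finer separators
    on any piece that is still over max_len (character counts stand in
    for token counts in this sketch)."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate          # keep packing this chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > max_len:         # still too big: go finer
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Paragraph breaks are preferred over line breaks, which are preferred over sentence ends, so chunks align with natural boundaries whenever the budget allows.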
2. Semantic chunking
Higher accuracy, higher cost
Embed every sentence, then group sentences whose embeddings are similar into a chunk. A new chunk starts when semantic similarity drops below a threshold, so chunk boundaries align with topic shifts rather than arbitrary token counts.
Recall: 91–92% (LLMSemanticChunker: 91.9%)
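A sketch of the core loop with plain-Python cosine similarity. `embed` is a stand-in for any sentence-embedding call, and real implementations typically compare each sentence against a rolling window of preceding sentences rather than only the previous one:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.7):
    """Start a new chunk whenever similarity between adjacent
    sentence embeddings drops below the threshold."""
    chunks, prev = [[sentences[0]]], embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev, vec) < threshold:
            chunks.append([sent])       # topic shift: open a new chunk
        else:
            chunks[-1].append(sent)
        prev = vec
    return [" ".join(c) for c in chunks]
```

The threshold is the main knob: lower it and chunks grow; raise it and boundaries multiply. It should be tuned against measured recall, not set by intuition.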
3. Document-aware / structure-preserving chunking
Essential for structured content
Uses document structure (Markdown headers, HTML tags, PDF layout, code blocks, table boundaries) to define chunk boundaries. Tools like Unstructured.io detect content type and apply element-aware parsing before chunking.
4. Parent-child chunking (hierarchical)
Best of both worlds for precision vs context
Index small child chunks for precision retrieval, but return the larger parent chunk as context to the LLM. Example: index 128-token sentence-level chunks for embedding similarity search; when a chunk is retrieved, expand it to the full 512-token paragraph before injecting into the prompt.
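A minimal in-memory sketch of the pattern; the class name and the `score` callback are illustrative, not a real library API:

```python
class ParentChildIndex:
    """Index small child chunks for precise retrieval; return the
    larger parent chunk they came from as LLM context."""

    def __init__(self):
        self.children = []   # list of (child_text, parent_id)
        self.parents = {}    # parent_id -> parent_text

    def add(self, parent_id, parent_text, child_texts):
        self.parents[parent_id] = parent_text
        for child in child_texts:
            self.children.append((child, parent_id))

    def retrieve(self, query, score):
        """score(query, child_text) -> similarity. Finds the
        best-matching child, then expands to its parent
        (top-k and parent deduplication omitted for brevity)."""
        _, parent_id = max(self.children, key=lambda c: score(query, c[0]))
        return self.parents[parent_id]
```

In production the child vectors live in the vector store and the parent expansion is a metadata lookup, but the shape of the data is the same: every child carries a pointer to its parent.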
5. Contextual retrieval (Anthropic, 2024–2025)
High-impact, low-complexity improvement
Before embedding each chunk, prepend a short context summary generated by an LLM: the document title, section heading path, and a 1–2 sentence description of what the chunk covers. This makes chunks self-contained: a chunk that says "the rate is 3.5%" becomes "[Mortgage Products / Fixed Rate Loans] The standard fixed 30-year rate is 3.5%...". This reduces retrieval failure by 49% on its own, or by 67% when combined with reranking.
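The transformation itself is trivial; the cost is one LLM call per chunk to produce the summary. A sketch of the assembly step (in practice the summarisation prompt includes the full document so the LLM has the needed context):

```python
def contextualise(chunk, doc_title, section_path, summary):
    """Prepend title, heading path, and an LLM-written summary so the
    chunk embeds -- and later retrieves -- as self-contained text."""
    header = f"[{doc_title} / {' / '.join(section_path)}] {summary}"
    return header + "\n" + chunk
```

Both the embedding and the text injected into the prompt should use the contextualised form, so retrieval and generation stay consistent.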
6. Late chunking
Emerging: context-aware embeddings
Embed the entire document first using a long-context embedding model, then pool the resulting per-token embeddings into chunk-sized segments. Because the full document was processed before splitting, each chunk embedding carries context from the surrounding text: pronouns resolve correctly and cross-references are captured. Requires long-context embedding models (JinaAI, Nomic) rather than standard 512-token models.
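The pooling mechanics, assuming you already have one embedding per document token from a single long-context forward pass (mean pooling shown here; the span offsets come from whatever splitter you use):

```python
def late_chunk(token_embeddings, chunk_spans):
    """Mean-pool per-token embeddings into one vector per chunk span.
    Each token vector already attended to the whole document, so the
    pooled chunk vectors carry surrounding context."""
    pooled = []
    for start, end in chunk_spans:
        span = token_embeddings[start:end]
        dim = len(span[0])
        pooled.append([sum(vec[i] for vec in span) / len(span)
                       for i in range(dim)])
    return pooled
```

Contrast with standard chunking, where each chunk is embedded in isolation and tokens never see text outside their own chunk.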
Strategy Chooser
| Situation | Recommended strategy |
|---|---|
| Starting a new project, any content type | RecursiveCharacterTextSplitter, 400–512 tokens, 10% overlap |
| PDFs with tables, code, mixed layout | Document-aware (Unstructured.io) + fixed-size per element |
| Chunks are retrieving but answers are bad (context missing) | Contextual retrieval (prepend document context) or parent-child |
| Long narrative or research documents covering many topics | Semantic chunking for topic-aware boundaries |
| Precision is good but answers lack coherence | Parent-child: retrieve small child, inject large parent |
| Documents with many pronouns and cross-references | Late chunking or contextual retrieval |
| Maximum accuracy, cost is secondary | Semantic chunking + contextual retrieval + reranking |
Embedding Models
Embedding models convert text to dense vectors. The quality of the embedding determines how well semantic similarity search finds relevant chunks. All models below support asymmetric retrieval (short query matched against long document chunk).
| Model | MTEB score | Dimensions | Cost | Best for |
|---|---|---|---|---|
| Voyage AI voyage-3.5 | ~66 | 1,024 | ~$0.06 / 1M | Best retrieval-specific accuracy; trained on adversarial negatives; top choice for RAG |
| Voyage voyage-3.5-lite | 66.1 | 512 | Low | Best accuracy-cost balance for production RAG |
| Cohere embed-v4 | 65.2 | 1,024 | $0.12 / 1M | Top accuracy benchmark, multilingual, enterprise |
| OpenAI text-embedding-3-large | 64.6 | 3,072 (matryoshka) | $0.13 / 1M | Already on OpenAI stack; matryoshka allows dimension reduction |
| OpenAI text-embedding-3-small | 62.3 | 1,536 | $0.02 / 1M | Cost-sensitive, high-volume; 6.5× cheaper than large with a minor accuracy drop |
| Nomic Embed Text V2 | ~63 | 768 | Free (open-source) | Open-source, self-hosted, MoE architecture, multilingual, transparent training data |
| BGE-M3 (BAAI) | ~62 | 1,024 | Free (open-source) | Multi-functionality (dense + sparse + colbert); strong multilingual; self-hosted |
How to choose an embedding model
- Already on OpenAI: start with text-embedding-3-small ($0.02/1M); upgrade to 3-large only if recall measurements show a gap
- Maximum retrieval accuracy, budget available: Voyage AI voyage-3.5, purpose-built for retrieval and trained on adversarial negatives
- Best accuracy-to-cost ratio in production: voyage-3.5-lite, 66.1% MTEB at one of the lowest price points
- Self-hosted / open-source: Nomic Embed V2 (transparency, MoE architecture) or BGE-M3 (dense + sparse + colbert in one model)
- Critical constraint: always embed queries and documents with the same model; cross-model similarity is meaningless
Matryoshka Embeddings
OpenAI's text-embedding-3 models use Matryoshka Representation Learning (MRL): the embedding is structured so that the first N dimensions form a valid, lower-quality embedding on their own. You can truncate from 3,072 down to 256 dimensions with a controlled accuracy/cost tradeoff:
| Dimensions | MIRACL avg score | Storage vs full |
|---|---|---|
| 3,072 (full) | 54.9 | 1× |
| 1,536 | 54.4 (−0.9%) | 0.5× |
| 512 | 52.0 (−5.3%) | 0.17× |
| 256 | 49.8 (−9.3%) | 0.08× |
Practically: using 1,536 dimensions instead of 3,072 cuts vector storage cost in half with under 1% accuracy loss. Useful when you have millions of chunks and vector storage cost is material.
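Truncation is just a slice plus renormalisation; OpenAI's embeddings API can also do this server-side via the `dimensions` request parameter. A sketch of the client-side version:

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` dimensions of an MRL embedding and
    L2-renormalise so cosine similarities stay comparable."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Renormalisation matters: without it, truncated vectors have smaller norms and dot-product scores are no longer comparable across dimension counts.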
Production Indexing Pipeline
The recommended baseline for most RAG systems: parse with document-aware tooling, split into 400–512 token chunks, contextualise each chunk, embed, and index.
The contextualise step reduces retrieval failure by 49–67%; it is now standard practice, not optional.
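Wired together, the baseline can be sketched with each stage injected as a callable, so any stage can be swapped as recall measurements dictate (all five stage functions here are assumptions for illustration, not a real library API):

```python
def index_document(doc, parse, chunk, contextualise, embed, store):
    """Baseline indexing: parse -> chunk -> contextualise -> embed -> store.
    Queries must later be embedded with the same `embed` model."""
    for element in parse(doc):            # document-aware parsing
        for c in chunk(element):          # fixed-size baseline split
            text = contextualise(doc, c)  # prepend document context
            store(text, embed(text))      # persist text + vector
```

Keeping the stages decoupled like this is what makes the "measure, then upgrade one stage" workflow cheap.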
Failure Modes
Chunking failures
- Tables split across chunks: use document-aware splitting; a half-table chunk embeds as noise
- Code blocks broken mid-function: treat code blocks as atomic units regardless of token count
- Headers separated from body: ensure header + first paragraph stay in the same chunk
- Context cliff: injecting 3,000+ tokens degrades answer quality sharply; stay under ~2,500 tokens total
Embedding failures
- Mixing embedding models: querying with model A against an index built with model B produces meaningless similarity scores
- Embedding stale chunks: re-index on document update; stale embeddings confidently surface outdated content
- Token limit exceeded: most models have a 512–8,192 token input limit; chunks exceeding it are silently truncated
- Dimension mismatch: changing embedding models requires full re-indexing
2025β2026 Developments
- Contextual retrieval (Anthropic) is now standard practice: the 49–67% retrieval failure reduction has moved it from "advanced technique" to "default best practice."
- Overlap questioned by 2026 research: a January 2026 systematic analysis challenged the blanket recommendation for 10–20% overlap. Measure your pipeline; remove overlap if it doesn't move recall metrics.
- Late chunking gaining traction: JinaAI and Nomic have released long-context embedding models (up to 8,192 tokens) that make late chunking practical.
- Voyage AI now the top RAG benchmark choice: embedding models trained specifically on retrieval tasks consistently outperform general-purpose embeddings on RAG benchmarks.
- BGE-M3 unifies dense + sparse: enables hybrid retrieval without running two separate embedding pipelines.
Checklist: Do You Understand This?
- Can you explain why too-small and too-large chunks hurt RAG quality in different ways?
- Do you know the recommended starting chunk size for FAQ docs vs legal documents?
- Can you describe contextual retrieval and why it reduces retrieval failures by 49–67%?
- Do you understand the parent-child chunking pattern and when it beats standard single-level chunking?
- Can you name three embedding models and state when you would choose each?
- Do you know what Matryoshka embeddings are and why you might use fewer than the maximum dimensions?
- Can you identify the three most dangerous embedding failure modes (mixing models, stale index, token truncation)?