🧠 All Things AI
Intermediate

Chunking & Embeddings

Your RAG pipeline is only as good as what you put into the vector store. Chunking (how you split documents before embedding) and your choice of embedding model are the two decisions with the highest leverage on retrieval quality. A poor chunking strategy breaks answers that span multiple segments; the wrong embedding model mis-ranks the most relevant passages.

Why Chunking Matters

When you index a document, you embed chunks, not the whole document. At query time, the retriever finds the chunks most similar to the query and injects them into the prompt. Two failure modes dominate:


The optimal size depends on content type and query pattern; always measure retrieval recall before optimising.

Chunks too small

  • Individual chunks lack context: a sentence about "it" is meaningless without the surrounding paragraphs
  • The answer spans multiple chunks but only one is retrieved
  • High recall requires retrieving many chunks, inflating context and cost
  • Pronouns and cross-references lose their referents

Chunks too large

  • The embedding represents the average of many ideas, so retrieval precision drops
  • Relevant sentences buried in a large chunk score lower than a small, focused chunk
  • More tokens injected per chunk means fewer chunks fit in the context window
  • The "context cliff" β€” response quality drops sharply around 2,500 injected tokens

Chunk Size & Overlap

| Content type | Recommended chunk size | Rationale |
| --- | --- | --- |
| FAQ / support docs | 200–400 tokens | Each Q&A pair is self-contained; small chunks embed one idea precisely |
| General web content / articles | 400–512 tokens (baseline) | Best balance of precision and context; RecursiveCharacterTextSplitter default |
| Technical documentation | 512–800 tokens | Concepts are interlinked; slightly larger chunks preserve logical coherence |
| Legal / academic papers | 800–1,500 tokens | Arguments span long passages; analytical queries need more context per chunk |
| Factoid queries (any content) | 256–512 tokens | Specific facts embedded in focused short chunks retrieve more precisely |
| Analytical queries (any content) | 1,024+ tokens | Reasoning requires broader context; larger chunks reduce fragmentation |

Overlap: start at 10–20%, then question it

Overlap (repeating the last N tokens of a chunk at the start of the next) was the standard recommendation to avoid cutting an idea at a boundary. The canonical starting point is 50–100 tokens of overlap on a 512-token chunk (~10–20%).

However, a January 2026 systematic analysis found that overlap provided no measurable recall improvement and only increased index size and embedding cost. Practical guidance: use overlap as a default safety net, but measure your pipeline; if recall does not improve, remove the overlap and cut costs.

Overlap matters most when your splitter cuts mid-sentence. If you use a semantic or sentence-aware splitter, overlap adds less value.
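At the token level, overlap is just a sliding window whose step is smaller than its size. A minimal sketch (splitter-library-agnostic; the integer list stands in for a tokenised document):

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks, repeating the last
    `overlap` tokens of each chunk at the start of the next."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = list(range(1200))  # stand-in for a tokenised document
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=50)
# each chunk starts with the last 50 tokens of the previous one
```

Removing overlap is then just setting `overlap=0`, which reduces the number of chunks (and therefore embedding cost and index size) for the same document.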

Chunking Strategies

Start with the simplest strategy that works. Move to more sophisticated approaches only when baseline recall measurements show a gap.

1. Fixed-size (character / token) splitting

Start here for every project

Split at a fixed token count. LangChain's RecursiveCharacterTextSplitter is the standard implementation: it tries progressively smaller separators to respect natural boundaries while staying within the token limit.

Use when: general documents, early prototyping, any content type as a baseline
Weakness: can cut sentences mid-thought; blind to document structure

Recall: 85–90% at 400 tokens (MTEB-style benchmarks)
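The recursive idea can be sketched in plain Python. This is a simplified illustration of the approach, not LangChain's actual implementation (which supports custom length functions and more separators):

```python
def recursive_split(text, max_len=400, separators=("\n\n", "\n", ". ", " ")):
    """Split `text` into pieces of at most `max_len` characters, trying the
    coarsest separator first and recursing with finer ones as needed."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = f"{current}{sep}{part}" if current else part
                if len(candidate) <= max_len:
                    current = candidate  # keep packing into the current chunk
                elif len(part) > max_len:
                    if current:
                        chunks.append(current)
                    # this piece is itself too long: recurse with finer separators
                    chunks.extend(recursive_split(part, max_len, separators))
                    current = ""
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            return chunks
    # no separator found: hard cut as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

The key property is that paragraph boundaries are preferred over sentence boundaries, and sentence boundaries over word boundaries, so a hard mid-word cut happens only when nothing else fits.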

2. Semantic chunking

Higher accuracy, higher cost

Embed every sentence, then group sentences whose embeddings are similar into a chunk. A new chunk starts when semantic similarity drops below a threshold, so chunk boundaries align with topic shifts rather than arbitrary token counts.

Use when: accuracy is the priority; documents cover many distinct topics; long-form narrative content
Weakness: requires embedding every sentence at indexing time (2–5× the compute of fixed-size)

Recall: 91–92% (LLMSemanticChunker: 91.9%)
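The boundary-detection logic can be sketched as follows. The toy bag-of-words `embed` function is a stand-in for a real sentence-embedding model call; only the thresholding logic is the point:

```python
import math
from collections import Counter

def embed(sentence):
    """Toy bag-of-words embedding; a real pipeline would call a
    sentence-embedding model here."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.15):
    """Start a new chunk whenever similarity to the previous sentence
    drops below `threshold` (interpreted as a topic shift)."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The cat sat on the mat.",
    "The cat slept on the mat.",
    "Quarterly revenue grew by ten percent.",
    "Revenue growth exceeded forecasts.",
]
```

With a real embedding model the threshold is usually tuned on a held-out set, or set adaptively (e.g. a percentile of observed adjacent-sentence similarities).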

3. Document-aware / structure-preserving chunking

Essential for structured content

Uses document structure (Markdown headers, HTML tags, PDF layout, code blocks, table boundaries) to define chunk boundaries. Tools like Unstructured.io detect content type and apply element-aware parsing before chunking.

Use when: PDFs with tables, Markdown docs, HTML pages, codebases. Domain accuracy improves 40%+ vs naive splitting on structured content
Weakness: requires document parsing infrastructure; more complex ingestion pipeline

4. Parent-child chunking (hierarchical)

Best of both worlds for precision vs context

Index small child chunks for precision retrieval, but return the larger parent chunk as context to the LLM. Example: index 128-token sentence-level chunks for embedding similarity search; when a chunk is retrieved, expand it to the full 512-token paragraph before injecting into the prompt.

Use when: retrieval precision is poor but injecting more context improves answer quality
Weakness: requires two-level index; more complex retrieval logic
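The two-level pattern can be sketched with a toy keyword-overlap scorer standing in for embedding similarity; the split sizes and scoring here are illustrative, not a real retriever:

```python
def build_parent_child_index(paragraphs, child_size=2):
    """Index small child chunks (here: pairs of sentences) that each
    point back to their full parent paragraph."""
    index = []  # (child_text, parent_id)
    for pid, para in enumerate(paragraphs):
        sentences = [s.strip() for s in para.split(".") if s.strip()]
        for i in range(0, len(sentences), child_size):
            child = ". ".join(sentences[i:i + child_size])
            index.append((child, pid))
    return index

def retrieve(query, index, paragraphs):
    """Score the small children (toy keyword overlap standing in for
    embedding similarity), but return the larger parent as context."""
    def score(child):
        return len(set(query.lower().split()) & set(child.lower().split()))
    best_child, pid = max(index, key=lambda e: score(e[0]))
    return paragraphs[pid]  # inject the parent, not the matched child
```

The matched child drives the ranking, but the answer the LLM sees includes the surrounding sentences the child alone would have dropped.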

5. Contextual retrieval (Anthropic, 2024–2025)

High-impact, low-complexity improvement

Before embedding each chunk, prepend a short context summary generated by an LLM: the document title, section heading path, and a 1–2 sentence description of what the chunk covers. This makes chunks self-contained: a chunk that says "the rate is 3.5%" becomes "[Mortgage Products / Fixed Rate Loans] The standard fixed 30-year rate is 3.5%..." Reduces retrieval failure by 49%, or by 67% when combined with reranking.

Use when: chunks are ambiguous without context (pronouns, tables, section-relative references)
Weakness: LLM call per chunk at indexing time (~$0.001–0.002 per 100 chunks with Claude Haiku)
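The transformation itself is trivial string assembly; the LLM's only job is producing the summary. A sketch (in production `summary` would come from a cheap per-chunk LLM call rather than being passed in):

```python
def contextualise(chunk, doc_title, section_path, summary):
    """Prepend document context so the chunk embeds, and retrieves,
    as a self-contained passage."""
    return f"[{doc_title} / {section_path}] {summary} {chunk}"

chunk = "The standard rate is 3.5% for qualifying borrowers."
ctx_chunk = contextualise(
    chunk,
    doc_title="Mortgage Products",
    section_path="Fixed Rate Loans",
    summary="This section describes 30-year fixed mortgage pricing.",
)
```

Note that `ctx_chunk` is what gets embedded and stored; at answer time you can inject either the contextualised text or the raw chunk plus metadata.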

6. Late chunking

Emerging: context-aware embeddings

Embed the entire document first using a long-context embedding model, then split the resulting embeddings into chunk-sized segments. Because the full document was processed before splitting, each chunk embedding carries context from surrounding text: pronouns resolve correctly, cross-references are captured. Requires long-context embedding models (JinaAI, Nomic) rather than standard 512-token models.
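The pooling step can be sketched as follows. The toy 2-d `token_vectors` stand in for the per-token outputs of a single long-context encoder pass over the whole document; real late chunking pools these contextualised token embeddings per chunk span:

```python
def late_chunk(token_vectors, spans):
    """Mean-pool full-document token embeddings into one vector per
    chunk span (start, end)."""
    chunk_vecs = []
    for start, end in spans:
        window = token_vectors[start:end]
        dim = len(window[0])
        chunk_vecs.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return chunk_vecs

# toy per-token embeddings for a 4-token "document"
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 1.0]]
vecs = late_chunk(token_vectors, spans=[(0, 2), (2, 4)])
```

The contrast with standard chunking: there, each chunk's tokens are encoded in isolation; here every token vector already attended to the full document before pooling.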

Strategy Chooser

| Situation | Recommended strategy |
| --- | --- |
| Starting a new project, any content type | RecursiveCharacterTextSplitter, 400–512 tokens, 10% overlap |
| PDFs with tables, code, mixed layout | Document-aware (Unstructured.io) + fixed-size per element |
| Chunks are retrieving but answers are bad (context missing) | Contextual retrieval (prepend document context) or parent-child |
| Long narrative or research documents covering many topics | Semantic chunking (topic-aware boundaries) |
| Precision is good but answers lack coherence | Parent-child: retrieve small child, inject large parent |
| Documents with many pronouns and cross-references | Late chunking or contextual retrieval |
| Maximum accuracy, cost is secondary | Semantic chunking + contextual retrieval + reranking |

Embedding Models

Embedding models convert text to dense vectors. The quality of the embedding determines how well semantic similarity search finds relevant chunks. All models below support asymmetric retrieval (short query matched against long document chunk).

| Model | MTEB score | Dimensions | Cost | Best for |
| --- | --- | --- | --- | --- |
| Voyage AI voyage-3.5 | ~66 | 1,024 | ~$0.06 / 1M | Best retrieval-specific accuracy; trained on adversarial negatives; top choice for RAG |
| Voyage voyage-3.5-lite | 66.1 | 512 | Low | Best accuracy-cost balance for production RAG |
| Cohere embed-v4 | 65.2 | 1,024 | $0.12 / 1M | Top accuracy benchmark, multilingual, enterprise |
| OpenAI text-embedding-3-large | 64.6 | 3,072 (matryoshka) | $0.13 / 1M | Already on the OpenAI stack; matryoshka allows dimension reduction |
| OpenAI text-embedding-3-small | 62.3 | 1,536 | $0.02 / 1M | Cost-sensitive, high-volume; 6.5× cheaper than large with a minor accuracy drop |
| Nomic Embed Text V2 | ~63 | 768 | Free (open-source) | Open-source, self-hosted, MoE architecture, multilingual, transparent training data |
| BGE-M3 (BAAI) | ~62 | 1,024 | Free (open-source) | Multi-functionality (dense + sparse + ColBERT); strong multilingual; self-hosted |

How to choose an embedding model

  • Already on OpenAI: start with text-embedding-3-small ($0.02/1M); upgrade to 3-large only if recall measurements show a gap
  • Maximum retrieval accuracy, budget available: Voyage AI voyage-3.5, purpose-built for retrieval and trained on adversarial negatives
  • Best accuracy-to-cost ratio in production: voyage-3.5-lite, 66.1% MTEB at one of the lowest price points
  • Self-hosted / open-source: Nomic Embed V2 (transparency, MoE architecture) or BGE-M3 (dense + sparse + ColBERT in one model)
  • Critical constraint: always embed queries and documents with the same model; cross-model similarity is meaningless
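The same-model constraint is cheap to enforce mechanically. A sketch of a tiny in-memory index that records which model built it and refuses mismatched queries (the class and its interface are illustrative, not any vector DB's API):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class VectorIndex:
    """Tiny in-memory index that remembers which embedding model built it
    and rejects queries embedded with a different one."""
    def __init__(self, model_name):
        self.model_name = model_name
        self.entries = []  # (vector, text)

    def add(self, vector, text):
        self.entries.append((vector, text))

    def search(self, query_vector, query_model):
        if query_model != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, "
                f"query embedded with {query_model!r}"
            )
        return max(self.entries, key=lambda e: cosine(e[0], query_vector))[1]
```

Production systems typically store the model name (and dimension count) in index metadata for exactly this check.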

Matryoshka Embeddings

OpenAI's text-embedding-3 models use Matryoshka Representation Learning (MRL): the embedding is structured so that the first N dimensions are a valid, lower-quality embedding. You can truncate from 3,072 to 256 dimensions with a controlled accuracy–cost tradeoff:

| Dimensions | MTEB score (MIRACL avg) | Storage vs full |
| --- | --- | --- |
| 3,072 (full) | 54.9 | 1× |
| 1,536 | 54.4 (−0.9%) | 0.5× |
| 512 | 52.0 (−5.3%) | 0.17× |
| 256 | 49.8 (−9.3%) | 0.08× |

Practically: using 1,536 dimensions instead of 3,072 cuts vector storage cost in half with under 1% accuracy loss. Useful when you have millions of chunks and vector storage cost is material.
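Truncation is a one-liner plus re-normalisation (the short 6-d vector below stands in for a full 3,072-d embedding):

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` dimensions of an MRL embedding and
    re-normalise to unit length, as required for cosine similarity."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]  # stand-in for a 3,072-d vector
small = truncate_embedding(full, 2)
```

With OpenAI's API you can alternatively pass the `dimensions` parameter at embedding time and skip client-side truncation; the client-side version is useful when you already have full vectors stored.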

Production Indexing Pipeline

The recommended baseline pipeline for most RAG systems:

  1. Load: S3, Drive, API, local files
  2. Parse: extract clean text; preserve structure (Unstructured.io)
  3. Chunk: RecursiveCharacterTextSplitter, 400–512 tokens
  4. Contextualise: prepend title + section path + 1-sentence summary
  5. Embed: batch embed (voyage-3.5-lite or text-embedding-3-small)
  6. Store: vectors + metadata + raw text into the vector DB

Step 4 (contextualise) reduces retrieval failure by 49–67%; it is now standard practice, not optional.

Also add monitoring: log chunk count, embedding latency, and failed parses; alert on index staleness. Re-index on document update.
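The wiring of these stages can be sketched as one function that takes each stage as a callable, so the parser, splitter, embedding model, and vector DB client can be swapped independently. All interfaces here are hypothetical stand-ins, not any particular library's API:

```python
def index_documents(docs, parse, chunk, contextualise, embed, store):
    """Run the load -> parse -> chunk -> contextualise -> embed -> store
    pipeline, collecting basic monitoring counters along the way."""
    stats = {"docs": 0, "chunks": 0, "failed_parses": 0}
    for doc in docs:
        try:
            text, metadata = parse(doc)
        except Exception:
            stats["failed_parses"] += 1  # log and alert in production
            continue
        for raw_chunk in chunk(text):
            enriched = contextualise(raw_chunk, metadata)
            store(embed(enriched), enriched, metadata)
            stats["chunks"] += 1
        stats["docs"] += 1
    return stats
```

Returning the counters makes the monitoring step from the text concrete: the caller can log them, emit metrics, and alert when `failed_parses` spikes.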

Failure Modes

Chunking failures

  • Tables split across chunks: use document-aware splitting; a half-table chunk embeds as noise
  • Code blocks broken mid-function: treat code blocks as atomic units regardless of token count
  • Headers separated from body: ensure header + first paragraph stay in the same chunk
  • Context cliff: injecting 3,000+ tokens degrades answer quality sharply; stay under ~2,500 tokens total

Embedding failures

  • Mixing embedding models: querying with model A against an index built with model B produces meaningless similarity scores
  • Embedding stale chunks: re-index on document update; stale embeddings surface outdated content confidently
  • Token limit exceeded: most models have a 512–8,192 token input limit; chunks exceeding it are silently truncated
  • Dimension mismatch: changing embedding models requires full re-indexing

2025–2026 Developments

  • Contextual retrieval (Anthropic) is now standard practice: the 49–67% retrieval failure reduction has moved it from "advanced technique" to "default best practice."
  • Overlap questioned by 2026 research: a January 2026 systematic analysis challenged the blanket recommendation of 10–20% overlap. Measure your pipeline; remove overlap if it doesn't move recall metrics.
  • Late chunking gaining traction: JinaAI and Nomic have released long-context embedding models (up to 8,192 tokens) that make late chunking practical.
  • Voyage AI now the top RAG benchmark choice: embedding models trained specifically on retrieval tasks consistently outperform general-purpose embeddings on RAG benchmarks.
  • BGE-M3 unifies dense + sparse: enables hybrid retrieval without running two separate embedding pipelines.

Checklist: Do You Understand This?

  • Can you explain why both too-small and too-large chunks hurt RAG quality, in different ways?
  • Do you know the recommended starting chunk size for FAQ docs vs legal documents?
  • Can you describe contextual retrieval and why it reduces retrieval failures by 49–67%?
  • Do you understand the parent-child chunking pattern and when it beats standard single-level chunking?
  • Can you name three embedding models and state when you would choose each?
  • Do you know what Matryoshka embeddings are and why you might use fewer than the maximum dimensions?
  • Can you identify the three most dangerous embedding failure modes (mixing models, stale index, token truncation)?