Intermediate

Chunking Strategies

Chunking splits documents into the indexable pieces that get embedded and stored. The chunk size and splitting strategy determine both retrieval precision (do we get the exact relevant passage?) and context quality (does the retrieved chunk contain enough surrounding context to be useful?).

Figure: Chunking pipeline (strategy choice drives retrieval precision). Document (any format) → Split Strategy (fixed / semantic / hierarchical) → Chunks (N-token pieces) → Overlap (10–15% shared tokens) → Embed Each Chunk

Fixed-Size Chunking

Split text into chunks of N tokens with optional overlap between adjacent chunks. The simplest approach.

Pros: Predictable chunk sizes; easy to implement; consistent embedding behaviour
Cons: Ignores natural sentence and paragraph boundaries, so a split can land in the middle of a sentence or concept. When the two halves of a concept end up in adjacent chunks, neither chunk matches a query about that concept well, which reduces retrieval quality.

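from langchain.text_splitter import RecursiveCharacterTextSplitter
# Newer LangChain releases also expose this class from the langchain_text_splitters package.

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # max chunk length; measured in characters by default (length_function=len)
    chunk_overlap=50,    # overlap between adjacent chunks, in the same unit as chunk_size
    separators=["\n\n", "\n", ". ", " ", ""]  # tried in order, coarsest first
)
# To size chunks in tokens, pass a token-counting length_function
# or build the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder(...).
chunks = splitter.split_text(document_text)  # document_text: the raw document string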

RecursiveCharacterTextSplitter tries to split on paragraph breaks first, then newlines, then sentence boundaries, then words — preserving as much semantic structure as possible within the fixed size constraint.
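To make the "coarsest separator first" idea concrete, here is a minimal pure-Python sketch of the same strategy. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, lengths are counted in characters, and overlap is left out to keep the logic visible.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into pieces of at most max_len characters, coarsest separator first."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # Separator not present: fall through to the next, finer one.
        return recursive_split(text, max_len, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate          # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)       # current chunk is full, emit it
        if len(part) <= max_len:
            current = part
        else:
            # A single part is still too long: split it with finer separators.
            chunks.extend(recursive_split(part, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one. It has two sentences.\n\nPara two is short.\n\n" + "word " * 30
pieces = recursive_split(sample, max_len=40)
```

In this sample the two short paragraphs survive as whole chunks, and only the long run of words is broken up at spaces, which is the behaviour the separator ordering is meant to produce.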

Semantic Chunking

Split on natural semantic boundaries: paragraphs, sections, headers, or topic shifts. Better than fixed-size for well-structured documents.

Paragraph-based: Split on double newlines (\n\n). Simple, and preserves the natural unit of discourse. Chunk sizes vary widely, so handle the outliers: merge chunks under 50 tokens (likely stray lines or headers) into the next paragraph, and split or truncate chunks over 2000 tokens.
Section-based: Split on markdown headers (#, ##) or HTML heading tags. Keeps a complete section together. Best for documentation, wikis, and well-structured reports.
Sentence-based: Split on sentence boundaries. More granular than paragraphs; useful for dense technical content where every sentence introduces a new fact.
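The paragraph-based and section-based variants can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions: the helper names (`count_tokens`, `split_oversized`, `split_paragraphs`, `split_sections`) are hypothetical, token counts are approximated by whitespace-separated words, and short paragraphs are handled by merging them into the paragraph that follows.

```python
import re


def count_tokens(text):
    # Assumption: whitespace word count as a rough stand-in for a real tokenizer.
    return len(text.split())


def split_oversized(text, max_tokens):
    """Break a too-long chunk into consecutive windows of max_tokens words."""
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def split_paragraphs(text, min_tokens=50, max_tokens=2000):
    """Paragraph-based chunks; short paragraphs are merged into the next one."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, carry = [], ""
    for para in paragraphs:
        merged = f"{carry}\n\n{para}" if carry else para
        if count_tokens(merged) < min_tokens:
            carry = merged               # too short: hold it and prepend to the next paragraph
            continue
        carry = ""
        chunks.extend(split_oversized(merged, max_tokens))
    if carry:
        chunks.append(carry)             # trailing short remainder kept as its own chunk
    return chunks


def split_sections(markdown):
    """Section-based chunks: a new chunk starts at every markdown header line.

    Simplification: a '#' line inside a fenced code block would also start a section.
    """
    sections, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]
```

For example, `split_sections("# A\nalpha\n\n## B\nbeta\n")` yields one chunk per header, each with its heading line kept at the top so the chunk stays self-describing.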

Hierarchical Chunking

Maintain a parent-child relationship between chunks. Store small chunks for precise retrieval, but include the parent (larger) chunk in the context sent to Claude.

Split the document into large parent chunks (1000–2000 tokens) — one per section
Split each parent into small child chunks (128–256 tokens) for indexing
Embed and store the child chunks in the vector database with a reference to their parent
At retrieval time: retrieve child chunks by similarity, then return the full parent chunk to Claude

Hierarchical chunking improves retrieval precision (small chunks match specific queries better) while maintaining context quality (Claude gets the full section, not just one sentence). It is more complex to implement, since you must store and resolve the child-to-parent mapping, but it tends to pay off most for long, structured documents.
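The four steps above can be sketched with in-memory lists. Everything here is an assumption made for illustration: the names `Child`, `build_index`, and `retrieve_parents` are hypothetical, sizes are counted in whitespace-separated words, and `score` is a toy word-overlap measure standing in for embedding similarity against a vector database.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int   # index of the parent section this child was cut from


def window(words, size):
    """Consecutive, non-overlapping windows of `size` words."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(sections, child_size=200):
    """Parents are the sections as given; children are small windows inside each parent."""
    parents = list(sections)
    children = [
        Child(text, parent_id)
        for parent_id, parent in enumerate(parents)
        for text in window(parent.split(), child_size)
    ]
    return parents, children


def score(query, text):
    # Toy similarity: number of shared lowercase words (stand-in for cosine similarity).
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, parents, children, top_k=3):
    """Rank children by similarity, then return each matched parent once, best first."""
    ranked = sorted(children, key=lambda c: score(query, c.text), reverse=True)[:top_k]
    seen, result = set(), []
    for child in ranked:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            result.append(parents[child.parent_id])
    return result
```

Deduplicating by `parent_id` matters in practice: several top-ranked children often come from the same section, and sending that section once keeps the context window free for other material.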

Why Chunk Overlap Matters

Chunk overlap adds the last N tokens of one chunk to the beginning of the next. This reduces information loss at chunk boundaries:

A sentence that straddles a boundary stays intact in at least one of the two chunks, provided the overlap is at least as long as the sentence
Context that spans a boundary (e.g., "As mentioned above, the policy states...") is more likely to be self-contained with overlap
Typical overlap: 10–15% of chunk size. For a 512-token chunk, 50–75 token overlap is reasonable.
Overlap increases storage and indexing cost (more chunks overall) — balance against retrieval quality improvement
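The cost side of this trade-off is easy to quantify. The sketch below (hypothetical helper names, operating on any list of tokens) slides a window with a stride of `chunk_size - overlap` and computes how many chunks that produces.

```python
import math


def chunk_with_overlap(tokens, chunk_size, overlap):
    """Sliding window: each chunk starts `chunk_size - overlap` tokens after the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                        # this window already reached the end of the input
    return chunks


def chunk_count(n_tokens, chunk_size, overlap):
    """Number of chunks the sliding window above produces for n_tokens tokens."""
    if n_tokens <= chunk_size:
        return 1 if n_tokens else 0
    return math.ceil((n_tokens - overlap) / (chunk_size - overlap))


tokens = list(range(10_000))             # stand-in for a 10,000-token document
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=50)
with_overlap = chunk_count(10_000, 512, 50)      # 22 chunks
without_overlap = chunk_count(10_000, 512, 0)    # 20 chunks
```

For a 10,000-token document, a 50-token overlap on 512-token chunks means 22 chunks to embed and store instead of 20, roughly a 10% increase, in line with the overlap fraction.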

Choosing Chunk Size

256 tokens: Very specific retrieval — the retrieved chunk directly addresses the query. Less surrounding context; may miss nuance. Use for dense Q&A knowledge bases where every sentence contains distinct facts.
512 tokens: Good default for most use cases. Balances precision and context. Start here.
1024 tokens: More context per chunk; less precise retrieval. Use when queries benefit from broader context — policy documents, legal text, technical explanations.
Document-level (no chunking): Only when documents are short enough to fit in context individually. Useful for whole-document tasks (summarise, classify) but not for Q&A over a large knowledge base.

There is no universally correct chunk size. Measure retrieval quality on your evaluation dataset at different chunk sizes — the optimal size depends on your document structure and query patterns.
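A chunk-size sweep can be as simple as the harness below. It is a sketch under loud assumptions: `hit_rate` is a hypothetical helper, chunks are word windows without overlap, ranking uses toy word overlap instead of embeddings, and a "hit" means the expected answer string appears in one of the top-k retrieved chunks. The document and evaluation pair are made-up toy data; in a real evaluation you would plug in your own retriever and a labelled query set.

```python
def chunk_words(text, size):
    """Fixed-size chunks of `size` whitespace-separated words (no overlap)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(query, chunk):
    # Toy relevance: shared lowercase words (stand-in for embedding similarity).
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def hit_rate(document, eval_set, chunk_size, top_k=1):
    """Fraction of (query, expected_answer) pairs whose answer appears in the top_k chunks."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one (query, expected) pair")
    chunks = chunk_words(document, chunk_size)
    hits = 0
    for query, expected in eval_set:
        ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)[:top_k]
        hits += any(expected in c for c in ranked)
    return hits / len(eval_set)


# Toy data, for illustration only.
document = "alpha beta gamma delta. The refund window is thirty days. epsilon zeta eta theta."
eval_set = [("refund window", "thirty days")]

# Sweep a few chunk sizes and compare the resulting hit rates.
results = {size: hit_rate(document, eval_set, size) for size in (4, 8, 100)}
```

Running the same sweep with overlap, different `top_k` values, or a different splitting strategy gives a like-for-like comparison on the same evaluation set.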

Checklist: Do You Understand This?

Fixed-size: predictable but may split mid-sentence; use RecursiveCharacterTextSplitter for best-effort boundary preservation
Semantic: split on paragraph/section boundaries — better for structured documents
Hierarchical: small chunks for retrieval + parent context for generation — best quality, more complex
Overlap: 10–15% of chunk size; reduces information loss at boundaries at the cost of more chunks to store
Start with 512 tokens + 50-token overlap; measure retrieval quality and adjust based on evaluation results