Chunking Strategies

Chunking, the splitting of documents into smaller sections, is the most underrated step of a RAG pipeline. Poor chunking leads to poor retrieval results, no matter how good your embedding model is.

Why Chunking?

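LLMs have limited context windows. Even though modern models can process 200K+ tokens, the rule holds: the more targeted the context, the better the answer. Retrieval also works at chunk granularity, because each chunk is embedded and matched against the query on its own. A chunk should therefore be a coherent, self-contained unit of information.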

The Four Chunking Strategies

1. Fixed-Size Chunking

Text is split into equal-sized blocks (e.g., 512 tokens).

✅ Easy to implement
✅ Predictable chunk sizes
❌ Breaks sentences and paragraphs
❌ Context lost at chunk boundaries
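
A minimal sketch of the idea, assuming the text has already been tokenized. Whitespace splitting stands in for a real tokenizer here, and the function name fixed_size_chunks is purely illustrative:

```python
def fixed_size_chunks(tokens: list[str], chunk_size: int = 512) -> list[list[str]]:
    """Split a token list into consecutive blocks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Assumption: whitespace splitting replaces a real tokenizer for this demo.
tokens = "The quick brown fox jumps over the lazy dog .".split()
chunks = fixed_size_chunks(tokens, chunk_size=4)
for chunk in chunks:
    print(chunk)
```

Note how the last chunk is shorter than the others and how the sentence is cut wherever the counter happens to land, which is exactly the weakness listed above.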

2. Recursive Character Splitting

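Text is split hierarchically using a prioritized list of separators: first at the coarsest boundaries (headings or paragraph breaks), then at line breaks, then at sentences, and so on until every piece fits the target chunk size.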

✅ Respects document structure
✅ Better semantic coherence
❌ Variable chunk sizes
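
Best practice: use an established implementation such as LangChain's RecursiveCharacterTextSplitter. By default it tries paragraph breaks, line breaks, and spaces in that order, and it measures chunk size in characters rather than tokens.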
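
To make the recursion concrete, here is a simplified plain-Python sketch. It is not LangChain's implementation: the function name recursive_split, the separator list, and the choice to drop separators between chunks are all assumptions made for illustration, and sizes are counted in characters.

```python
def recursive_split(
    text: str,
    chunk_size: int = 512,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # Keep merging small pieces into one chunk.
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Intro paragraph.\n\nSecond paragraph is longer. It has two sentences.\n\nEnd."
chunks = recursive_split(text, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The short paragraphs survive intact, and only the paragraph that exceeds the limit is broken down further at sentence level. Chunk sizes vary as a result, which is the trade-off noted above.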

3. Semantic Chunking

Embeddings of adjacent paragraphs are compared. When similarity drops below a threshold, a new chunk starts.

✅ Content-coherent chunks
✅ Adapts automatically
❌ Higher computational cost
❌ Threshold needs calibration
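
The threshold logic can be sketched as follows. The toy_embed function is a hypothetical stand-in (a simple bag-of-words count) so the example runs without a model; in practice you would replace it with calls to your embedding model, and the threshold of 0.2 is an arbitrary value chosen for this toy data.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(paragraphs: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent paragraphs fall below the similarity threshold."""
    if not paragraphs:
        return []
    chunks = [[paragraphs[0]]]
    for prev, curr in zip(paragraphs, paragraphs[1:]):
        if cosine(toy_embed(prev), toy_embed(curr)) < threshold:
            chunks.append([curr])  # Similarity dropped: open a new chunk.
        else:
            chunks[-1].append(curr)
    return chunks


paragraphs = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Interest rates rose again this quarter.",
]
groups = semantic_chunks(paragraphs)
print(groups)
```

The two paragraphs about cats stay together, while the unrelated third paragraph opens a new chunk. With a real embedding model the similarity values will be distributed differently, so the threshold has to be recalibrated on your own data.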

4. Document-Aware Chunking

The splitter uses the document's own structure: headings, tables, and lists serve as natural chunk boundaries.

✅ Ideal for structured documents (Markdown, HTML, LaTeX)
❌ Unstructured texts benefit little
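
For Markdown, a rough sketch could cut the document at headings and keep each heading together with its body. The function name split_markdown_by_heading is illustrative, and the sketch deliberately ignores edge cases such as "#" lines inside fenced code blocks.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[dict[str, str]]:
    """Cut a Markdown document at ATX headings (# to ######), keeping each heading with its body."""
    sections: list[dict[str, str]] = []
    heading = ""
    body: list[str] = []

    def flush() -> None:
        # Emit the section collected so far, skipping an empty preamble.
        if heading or any(line.strip() for line in body):
            sections.append({"heading": heading, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


doc = "# Setup\nInstall the package.\n\n## Options\n- fast\n- safe\n"
sections = split_markdown_by_heading(doc)
for section in sections:
    print(section)
```

The list under "Options" stays in one piece because the only boundaries are the headings. The heading text can also be stored as metadata for the chunk.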

Overlap — The Safety Net

Chunks should overlap by roughly 10–20% of the chunk size so that context isn't lost at boundaries.

Example: With 512 tokens per chunk and a 64-token overlap (12.5%), each chunk begins with the last 64 tokens of the previous one.
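
The same numbers as a sliding-window sketch: each window starts 512 - 64 = 448 tokens after the previous one. The function name overlapping_chunks and the dummy token list are assumptions for illustration.

```python
def overlapping_chunks(
    tokens: list[str], chunk_size: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Sliding window: each chunk starts chunk_size - overlap tokens after the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # This window already reaches the end of the text.
    return chunks


# Dummy tokens t0..t999 stand in for a tokenized document.
tokens = [f"t{i}" for i in range(1000)]
chunks = overlapping_chunks(tokens)
print(len(chunks), [len(c) for c in chunks])
```

The price of overlap is redundancy: the shared tokens are embedded and stored twice, so the index grows slightly as the overlap increases.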

Metadata Enrichment

Every chunk should carry metadata:

Source document: Filename, URL, author
Position: Chapter, page number, section
Timestamp: Creation and modification date
Tags: Topic, department, confidentiality level
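
One possible way to carry these fields is a small record type per chunk. The class name Chunk, its field names, the helper filter_by_tag, and the sample values (including the filename) are all hypothetical; adapt them to whatever schema your vector store expects.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Chunk:
    """A text chunk plus the metadata categories listed above (illustrative schema)."""
    text: str
    source: str                            # Source document: filename or URL
    author: Optional[str] = None
    chapter: Optional[str] = None          # Position
    page: Optional[int] = None
    section: Optional[str] = None
    created_at: Optional[datetime] = None  # Timestamps
    modified_at: Optional[datetime] = None
    tags: dict[str, str] = field(default_factory=dict)  # Topic, department, ...


def filter_by_tag(chunks: list[Chunk], key: str, value: str) -> list[Chunk]:
    """Keep only chunks whose tag `key` equals `value`, e.g. before similarity search."""
    return [c for c in chunks if c.tags.get(key) == value]


chunk = Chunk(
    text="Chunks should overlap so context is not lost at boundaries.",
    source="handbook.pdf",  # hypothetical filename
    page=12,
    tags={"topic": "rag", "confidentiality": "internal"},
)
print(filter_by_tag([chunk], "topic", "rag"))
```

Metadata like this lets you restrict a search to a department or confidentiality level and cite the source and page in the final answer.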

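Practical tip: Start with recursive character splitting (for example chunk_size=512 and chunk_overlap=64 in LangChain, where both values count characters unless you configure a token-based length function). Then test with real queries and optimize iteratively. There is no universally perfect chunk size; it depends on your documents and the questions asked of them.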