Chunking, the splitting of documents into smaller sections, is the most underrated step of a RAG pipeline. Bad chunking produces bad retrieval results, no matter how good your embedding model is.
LLMs have limited context windows. Even though modern models can process 200K+ tokens, the rule holds: the more targeted the context, the better the answer. Each chunk should therefore be a coherent unit of information.
Fixed-size chunking: Text is split into equal-sized blocks (e.g., 512 tokens).
Recursive chunking: Text is split hierarchically: first at headings, then at paragraphs, then at sentences.
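The hierarchical splitting described above can be sketched in a few lines. This is a minimal illustration, not a library implementation: the function name `recursive_split`, the character-based size limit, and the separator list (Markdown-style "# " headings, blank lines for paragraphs, ". " for sentences) are all assumptions made for the example.

```python
def recursive_split(text, max_chars=200, separators=("\n# ", "\n\n", ". ", " ")):
    """Split at the coarsest separator first; recurse into pieces that are still too big.

    Assumption: headings look like Markdown ("\n# "), paragraphs are separated by
    blank lines, and sentences end with ". ". Separators at chunk boundaries are dropped.
    """
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate          # keep merging small pieces
            continue
        if current.strip():
            chunks.append(current)       # flush what has been collected so far
        if len(piece) <= max_chars:
            current = piece
        else:
            current = ""
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current.strip():
        chunks.append(current)
    return chunks


text = "Intro paragraph. " * 5 + "\n\n" + "Second paragraph sentence. " * 20
chunks = recursive_split(text, max_chars=100)
```

The short first paragraph survives as a single chunk, while the long second paragraph is broken up at sentence boundaries because it exceeds the limit.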
Semantic chunking: Embeddings of adjacent paragraphs are compared. When their similarity drops below a threshold, a new chunk starts.
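As a rough sketch of this similarity-based approach: the code below compares each paragraph with its predecessor and opens a new chunk when cosine similarity falls below a threshold. The `embed` function is a toy bag-of-words stand-in so the example runs without dependencies; in practice you would call a real embedding model there. The threshold of 0.2 is an arbitrary illustrative value.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(paragraphs, threshold=0.2):
    """Group adjacent paragraphs; start a new chunk when similarity drops below threshold."""
    if not paragraphs:
        return []
    groups = [[paragraphs[0]]]
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])        # topic shift: open a new chunk
        else:
            groups[-1].append(cur)      # same topic: extend the current chunk
    return ["\n\n".join(g) for g in groups]


paragraphs = [
    "cats purr and cats sleep all day",
    "cats sleep on the sofa all day",
    "stock markets fell sharply today",
]
chunks = semantic_chunks(paragraphs)
```

Here the two paragraphs about cats end up in one chunk, and the unrelated third paragraph starts a second one.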
Structure-based chunking: Uses the document's own structure. Headings, tables, and lists serve as natural boundaries.
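A minimal sketch of the structure-driven approach, covering only the heading case: it assumes headings are Markdown-style lines starting with one to six "#" characters and starts a new chunk at each one. Tables and lists would need their own boundary rules, which are not shown here. The function name `split_by_headings` is made up for this example.

```python
import re


def split_by_headings(document):
    """Start a new chunk at every Markdown-style heading line (assumed format)."""
    chunks, current = [], []
    for line in document.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())  # close the previous section
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


doc = "# Setup\nInstall the tool.\n\n## Options\n- fast\n- safe\n\n# Usage\nRun it."
sections = split_by_headings(doc)
```

Each section keeps its heading as the first line, and the bullet list under "Options" stays in one piece instead of being cut in the middle.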
Chunks should overlap by 10–20% so context isn't lost at boundaries.
Example: With 512 tokens per chunk and a 64-token overlap (12.5%), each chunk begins with the last 64 tokens of the previous one.
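The overlap arithmetic can be made concrete with a small sliding-window sketch. It assumes the text is already tokenized into a list; the placeholder tokens `t0`, `t1`, ... and the function name `chunk_with_overlap` are purely illustrative.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # the window advances by 448 tokens with the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


tokens = [f"t{i}" for i in range(1000)]  # placeholder tokens for illustration
chunks = chunk_with_overlap(tokens)
```

With 1,000 tokens this yields three chunks starting at positions 0, 448, and 896, and the first 64 tokens of each chunk repeat the last 64 tokens of the one before it.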
Every chunk should carry metadata:
Practical tip: Start with recursive character splitting (chunk_size=512, overlap=64). Then test with real queries and optimize iteratively. The perfect chunk size depends on your documents.