Lesson 3 of 6 · 10 min read

Chunking Strategies

Chunking, the splitting of documents into smaller sections, is the most underrated step of a RAG pipeline. Bad chunking produces bad retrieval results, no matter how good your embedding model is.

Why Chunking?

LLMs have limited context windows. Even though modern models can process 200K+ tokens, the rule holds: the more targeted the context, the better the answer. A chunk should therefore be a coherent, self-contained unit of information.

The Four Chunking Strategies

1. Fixed-Size Chunking

Text is split into equal-sized blocks (e.g., 512 tokens).

  • ✅ Easy to implement
  • ✅ Predictable chunk sizes
  • ❌ Breaks sentences and paragraphs
  • ❌ Context lost at chunk boundaries
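
A minimal sketch of fixed-size chunking, to make the trade-off above concrete. It assumes whitespace-separated words as a stand-in for real tokens; in practice you would count tokens with the tokenizer of your embedding model. The function name `fixed_size_chunks` is illustrative.

```python
def fixed_size_chunks(tokens, chunk_size=512):
    """Split a token list into consecutive blocks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Assumption: whitespace splitting approximates tokenization.
text = "Retrieval quality depends on chunking. " * 300
tokens = text.split()

chunks = fixed_size_chunks(tokens, chunk_size=512)
print([len(c) for c in chunks])

# The cut ignores sentence boundaries: the first chunk stops mid-sentence.
print(" ".join(chunks[0][-2:]))
```

Every chunk except the last has exactly `chunk_size` tokens, which is what makes the sizes predictable, and the last two words of the first chunk show how a sentence gets cut in half.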

2. Recursive Character Splitting

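Text is split hierarchically along an ordered list of separators: first at the coarsest boundary (such as paragraph breaks), then at line breaks or sentences, then at words. A finer separator is used only when a piece is still larger than the target chunk size.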

  • ✅ Respects document structure
  • ✅ Better semantic coherence
  • ❌ Variable chunk sizes
  • Best Practice: LangChain's RecursiveCharacterTextSplitter
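
The idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it measures size in characters, uses an assumed separator list, and drops the separator wherever two chunks are cut apart. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text so that every chunk has at most chunk_size characters.

    The coarsest separator is tried first; finer ones are used only
    for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks


doc = (
    "Chunking matters.\n\n"
    + "A long paragraph sentence. " * 12
    + "\n\nShort closing paragraph."
)
chunks = recursive_split(doc, chunk_size=120)
for chunk in chunks:
    print(len(chunk), repr(chunk[:40]))
```

The short paragraphs survive intact, while the long middle paragraph is broken at sentence boundaries instead of at an arbitrary position.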

3. Semantic Chunking

Embeddings of adjacent paragraphs are compared. When similarity drops below a threshold, a new chunk starts.

  • ✅ Content-coherent chunks
  • ✅ Adapts automatically
  • ❌ Higher computational cost
  • ❌ Threshold needs calibration
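
A rough sketch of the boundary rule described above. To stay self-contained it uses a toy bag-of-words vector in place of a real embedding model, and the threshold of 0.2 is an arbitrary value chosen for this example; both are assumptions you would replace and calibrate on your own data.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(paragraphs, threshold=0.2):
    """Start a new chunk when adjacent paragraphs are less similar than threshold."""
    if not paragraphs:
        return []
    groups = [[paragraphs[0]]]
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])      # similarity dropped: new chunk
        else:
            groups[-1].append(cur)    # still on topic: extend current chunk
    return ["\n\n".join(g) for g in groups]


paragraphs = [
    "The invoice lists the payment amount and the payment due date.",
    "Late payment of the invoice amount incurs a fee.",
    "Our office dog enjoys long walks in the park.",
    "The dog also likes the park near the office.",
]
chunks = semantic_chunks(paragraphs)
for chunk in chunks:
    print(repr(chunk))
```

With a real model, each paragraph costs one embedding call, which is where the higher computational cost comes from.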

4. Document-Aware Chunking

This strategy uses the document's own structure: headings, tables, and lists serve as natural chunk boundaries.

  • ✅ Ideal for structured documents (Markdown, HTML, LaTeX)
  • ❌ Unstructured texts benefit little
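
For Markdown, a heading-based splitter might look like the sketch below. It is deliberately simple and makes assumptions: it only recognizes ATX headings (`#` to `######`) and would misread a `#` comment inside a fenced code block as a heading. The name `split_markdown_by_heading` is illustrative.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown):
    """Return one (heading, body) tuple per heading section."""
    sections = []
    heading, body = None, []

    def flush():
        # Emit the section collected so far, skipping empty preambles.
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


doc = (
    "# Setup\nInstall the package.\n\n"
    "## Options\n- verbose\n- quiet\n\n"
    "# Usage\nRun the tool.\n"
)
sections = split_markdown_by_heading(doc)
for heading, body in sections:
    print(heading, "->", repr(body))
```

Keeping the heading next to the body means it can be stored as metadata or prepended to the chunk text before embedding.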

Overlap — The Safety Net

Chunks should overlap by 10–20% so context isn't lost at boundaries.

Example: With 512 tokens per chunk and a 64-token overlap (12.5%), each chunk begins with the last 64 tokens of the previous one, so the window advances by 448 tokens at a time.
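
The overlap rule amounts to a sliding window whose stride is the chunk size minus the overlap. A small sketch, using integers as placeholder tokens (an assumption made so the overlap is easy to inspect):

```python
def sliding_window(tokens, chunk_size=512, overlap=64):
    """Fixed-size chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # 512 - 64 = 448 new tokens per chunk
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = list(range(1000))  # placeholder token IDs
chunks = sliding_window(tokens, chunk_size=512, overlap=64)

print([(c[0], c[-1]) for c in chunks])
print(chunks[1][:64] == chunks[0][-64:])
```

The price of the safety net is storage: with this stride, roughly one in eight tokens is embedded and stored twice.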

Metadata Enrichment

Every chunk should carry metadata:

  • Source document: Filename, URL, author
  • Position: Chapter, page number, section
  • Timestamp: Creation and modification date
  • Tags: Topic, department, confidentiality level
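
One way to carry these fields is a small record type stored alongside each chunk's text. The sketch below is an assumption about how you might model it; the field names, file names, and tag values are hypothetical examples, not a required schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Chunk:
    text: str
    source: str                        # filename or URL
    author: Optional[str] = None
    section: Optional[str] = None      # chapter or section title
    page: Optional[int] = None
    created: Optional[str] = None      # ISO 8601 date string
    modified: Optional[str] = None     # ISO 8601 date string
    tags: Dict[str, str] = field(default_factory=dict)


def filter_by_tag(chunks: List[Chunk], key: str, value: str) -> List[Chunk]:
    """Keep only chunks whose tag `key` equals `value`."""
    return [c for c in chunks if c.tags.get(key) == value]


# Hypothetical example data.
chunks = [
    Chunk(
        text="Refunds are processed within 14 days.",
        source="refund-policy.md",
        section="Refunds",
        page=3,
        created="2024-01-15",
        modified="2024-03-02",
        tags={"department": "support", "confidentiality": "public"},
    ),
    Chunk(
        text="Salary bands are reviewed annually.",
        source="hr-handbook.pdf",
        tags={"department": "hr", "confidentiality": "internal"},
    ),
]

public = filter_by_tag(chunks, "confidentiality", "public")
print([c.source for c in public])
```

Filtering on metadata before or after retrieval lets you restrict answers by confidentiality level or department and cite the source of each retrieved passage.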

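Practical tip: Start with recursive character splitting (chunk_size=512, overlap=64). In LangChain's RecursiveCharacterTextSplitter the overlap parameter is named chunk_overlap, and both values count characters by default unless you supply a token-based length function. Then test with real queries and optimize iteratively. The perfect chunk size depends on your documents.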