The quality of your RAG system stands or falls with the quality of your embeddings. The right model, the right chunking strategy, and thoughtful metadata enrichment make the difference between "sometimes finds something" and "always finds the right thing."

Commercial models:

| Model | Provider | Dimensions | Strengths |
|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | Highest quality, expensive |
| text-embedding-3-small | OpenAI | 1536 | Good price-performance ratio |
| embed-v4.0 | Cohere | 1536 (default; 256 to 1536 selectable) | Multilingual, compressible |
| voyage-3 | Voyage AI | 1024 | General-purpose; sibling models voyage-code-3 and voyage-law-2 specialize in code and legal |

Open-source models:

| Model | Dimensions | Strengths |
|---|---|---|
| BGE-large-en-v1.5 | 1024 | Strong MTEB results, English-only |
| E5-mistral-7b-instruct | 4096 | Instruction-based |
| GTE-large | 1024 | Alibaba, multilingual |
| nomic-embed-text-v1.5 | 768 | Compact, efficient |

Selection criteria:
1. Language → Multilingual model for DE/EN?
2. Domain → Specialized model (code, legal, medical)?
3. Budget → Commercial vs. open-source?
4. Latency → Smaller models = faster
5. Quality → Check benchmark results (MTEB)
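# Fixed-size chunking: simple but not optimal
# (split_text stands in for any fixed-size splitter; it is not a library function)
chunks = split_text(text, chunk_size=1000, overlap=200)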
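
The fixed-size call above relies on a `split_text` helper that the text does not define. As a minimal sketch, assuming the sizes are counted in characters (the source does not say whether it means characters or tokens), such a helper might look like this:

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the name matches the call in the text, but the
    implementation and the character-based sizing are assumptions.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Small sizes so the overlap is easy to inspect
chunks = split_text("abcdefghij" * 3, chunk_size=12, overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last `overlap` characters of its predecessor, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.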
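# Semantic chunking: splits at semantic boundaries
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # requires OPENAI_API_KEY in the environment
chunker = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    # Break wherever the distance between neighboring sentences exceeds the 95th percentile
    breakpoint_threshold_amount=95,
)
chunks = chunker.split_text(document)  # document: the raw text as a str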
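
To make the percentile threshold less of a black box, here is a rough, dependency-free sketch of the idea. It is not LangChain's implementation: the function names, the pre-split sentences, and the toy two-dimensional "embeddings" are all assumptions chosen for illustration.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation, pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def semantic_chunks(
    sentences: list[str], embed: Callable[[str], Vector], pct: float = 95
) -> list[str]:
    """Start a new chunk wherever neighboring sentences are unusually far apart."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy embeddings (hypothetical): first axis = "cats", second axis = "bonds"
toy_vectors = {
    "Cats purr.": (1.0, 0.0),
    "Cats nap.": (0.9, 0.1),
    "Bonds pay interest.": (0.0, 1.0),
    "Bonds mature.": (0.1, 0.9),
}
result = semantic_chunks(list(toy_vectors), toy_vectors.__getitem__)
print(result)
```

With a real embedding model, `embed` would call the model instead of a lookup table; the thresholding logic stays the same.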
| Strategy | Description | When to use |
|---|---|---|
| Markdown header | Splits at headers, preserves hierarchy | Documentation, wiki |
| HTML sections | Splits at HTML elements | Web content |
| Paragraph-based | Splits at paragraphs | Prose texts |
| Code-aware | Splits at functions/classes | Source code |
| Sliding window | Fixed size with overlap | Fallback / default |
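
As an illustration of the first row, the following sketch splits Markdown at headers and stores the header trail with each chunk. The function name and the output shape are assumptions, and the sketch ignores fenced code blocks, where a line starting with `#` would be mistaken for a header.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown at headers; each chunk records its header hierarchy."""
    chunks: list[dict] = []
    trail: list[tuple[int, str]] = []  # (level, title) of the enclosing headers
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headers": [title for _, title in trail], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # the previous section ends where a new header starts
            level = len(match.group(1))
            # Drop headers at the same or a deeper level than the new one
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "\n".join([
    "# Guide",
    "Intro text.",
    "## Install",
    "Run the installer.",
    "## Usage",
    "Call the API.",
    "# FAQ",
    "See above.",
])
sections = split_by_headers(doc)
for section in sections:
    print(" > ".join(section["headers"]), "|", section["text"])
```

Storing the header trail as metadata means a chunk like "Run the installer." still carries the context "Guide > Install" at retrieval time.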

Too small (< 200 tokens):
✗ Context loss: individual sentences without connection
✗ More chunks = higher retrieval costs

Too large (> 2000 tokens):
✗ Noise: irrelevant information in the chunk
✗ Lower retrieval precision

Optimal (300-800 tokens):
✓ Enough context for comprehensibility
✓ Focused enough for precise retrieval
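
These ranges can be turned into a quick audit of an existing chunk set. The sketch below is only a rough check: it approximates tokens by whitespace-separated words (real tokenizers usually produce more tokens than words), and the "acceptable" label for sizes between the named ranges is an assumption, as are the function names.

```python
from collections import Counter


def estimate_tokens(text: str) -> int:
    # Crude proxy: count whitespace-separated words instead of real tokens
    return len(text.split())


def classify_chunk(text: str) -> str:
    """Map a chunk to one of the size ranges from the guideline above."""
    n = estimate_tokens(text)
    if n < 200:
        return "too small"
    if n > 2000:
        return "too large"
    if 300 <= n <= 800:
        return "optimal"
    return "acceptable"  # 200-299 or 801-2000: not named in the guideline


# Synthetic chunks of 50, 500, 1000, and 2500 words
sample_chunks = ["word " * n for n in (50, 500, 1000, 2500)]
summary = Counter(classify_chunk(chunk) for chunk in sample_chunks)
print(summary)
```

For a real audit, swapping `estimate_tokens` for the tokenizer of the chosen embedding model gives more reliable counts.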
Chunks without metadata are like books without a table of contents. Metadata lets you filter candidates by source, date, or category and trace every hit back to where it came from:
chunk = {
    "text": "The new GDPR amendment affects...",
    "metadata": {
        "source": "compliance/gdpr-update-2026.pdf",
        "page": 12,
        "section": "Changes 2026",
        "category": "compliance",
        "date": "2026-01-15",
        "author": "Legal Team",
        "language": "en",
        "keywords": ["GDPR", "data protection", "compliance"]
    }
}
# Filter syntax differs between vector stores; the Mongo-style operators here are illustrative
results = vectorstore.similarity_search(
    query="GDPR amendments",
    filter={"category": "compliance", "date": {"$gte": "2026-01-01"}},
    k=5,
)
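
What such a filter does can be shown without any vector store. The sketch below is a plain-Python approximation of Mongo-style metadata filtering; the `matches` function and the operator table are hypothetical, and real vector stores each define their own supported operators and value types.

```python
import operator
from typing import Any

# Comparison operators in the Mongo-style notation used above
OPS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}


def matches(metadata: dict[str, Any], conditions: dict[str, Any]) -> bool:
    """Return True if the metadata satisfies every condition (implicit AND)."""
    for field, expected in conditions.items():
        if field not in metadata:
            return False
        actual = metadata[field]
        if isinstance(expected, dict):
            # Operator form, e.g. {"$gte": "2026-01-01"}; unknown operators raise KeyError
            if not all(OPS[op](actual, operand) for op, operand in expected.items()):
                return False
        elif actual != expected:
            return False
    return True


chunks = [
    {"text": "GDPR update", "metadata": {"category": "compliance", "date": "2026-01-15"}},
    {"text": "Old policy", "metadata": {"category": "compliance", "date": "2025-11-02"}},
    {"text": "Onboarding", "metadata": {"category": "hr", "date": "2026-02-01"}},
]
flt = {"category": "compliance", "date": {"$gte": "2026-01-01"}}
candidates = [c for c in chunks if matches(c["metadata"], flt)]
print([c["text"] for c in candidates])
```

The date comparison works on strings only because ISO 8601 dates (`YYYY-MM-DD`) sort lexicographically in chronological order; other date formats would need parsing first.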
Hybrid search combines vector search (semantic) with keyword search (BM25), so exact terms such as article numbers are matched alongside paraphrases:
Query: "GDPR Article 15 right of access"
Vector search: Finds semantically similar texts
BM25 search: Finds exact keyword matches
Hybrid (RRF): Reciprocal Rank Fusion merges both rankings → best results
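
Reciprocal Rank Fusion itself is only a few lines: each document scores the sum of 1 / (k + rank) over all rankings it appears in. The sketch below uses the commonly cited constant k = 60; the document IDs and the two input rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # higher rank, larger contribution
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists for the query above, best hit first
vector_hits = ["doc_access_rights", "doc_data_subject", "doc_art15"]
bm25_hits = ["doc_art15", "doc_access_rights", "doc_fines"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Because RRF only looks at ranks, it needs no normalization between cosine similarities and BM25 scores, which is why it is a popular default for hybrid search.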
Practical tip: Always test at least 3 chunking strategies with your real data. The "right" strategy depends heavily on your document type. Invest in metadata enrichment — it's the biggest lever for retrieval quality after chunking strategy.