Embedding Strategies

The quality of your RAG system stands or falls with the quality of your embeddings. The right model, the right chunking strategy, and thoughtful metadata enrichment make the difference between "sometimes finds something" and "reliably finds the right thing."

Embedding Models

Commercial Models

Model	Provider	Dimensions	Strengths
text-embedding-3-large	OpenAI	3072	Highest quality, expensive
text-embedding-3-small	OpenAI	1536	Good price-performance ratio
embed-v4.0	Cohere	1536 (default; 256, 512, 1024 selectable)	Multilingual, compressible
voyage-3	Voyage AI	1024	General-purpose; the sibling models voyage-code-3 and voyage-law-2 are specialized for code and legal

Open-Source Models

Model	Dimensions	Strengths
BGE-large-en-v1.5	1024	Strong MTEB results, English only
E5-mistral-7b-instruct	4096	Instruction-based
GTE-large	1024	Alibaba; English (gte-multilingual-base covers other languages)
nomic-embed-text-v1.5	768	Compact, efficient

Model Selection

Criteria:
1. Language → Multilingual model for DE/EN?
2. Domain → Specialized model (code, legal, medical)?
3. Budget → Commercial vs. open-source?
4. Latency → Smaller models = faster
5. Quality → Check benchmark results (MTEB)
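
The Dimensions column feeds directly into the budget and latency criteria, because vector width determines how much memory the index needs. The following is a rough estimate only, assuming uncompressed float32 vectors (4 bytes per value) and ignoring index overhead and metadata; the function name `index_size_gb` is illustrative.

```python
def index_size_gb(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead and metadata."""
    return num_chunks * dimensions * bytes_per_value / 1e9


# One million chunks at dimensions listed in the tables above
for name, dims in [
    ("text-embedding-3-large", 3072),
    ("text-embedding-3-small", 1536),
    ("nomic-embed-text-v1.5", 768),
]:
    print(f"{name}: {index_size_gb(1_000_000, dims):.2f} GB")
```

For one million chunks this comes to roughly 12.3 GB, 6.1 GB, and 3.1 GB respectively, before any quantization or dimension truncation.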

Chunking Strategies

Fixed-Size Chunking

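# Simple baseline, but it cuts at fixed offsets, even mid-sentence.
# split_text stands for any fixed-size splitter; it is not a library function.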
chunks = split_text(text, chunk_size=1000, overlap=200)
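
A minimal sketch of what such a `split_text` helper could look like, assuming sizes are counted in characters rather than tokens. The validation rules and the early exit on the last window are choices made for this illustration.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = split_text("a" * 2500, chunk_size=1000, overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Each chunk repeats the last 200 characters of its predecessor, so a sentence cut at a boundary still appears whole in one of the two neighboring chunks, provided it is shorter than the overlap.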

Semantic Chunking

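# Splits where the embedding distance between neighboring sentences jumps
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # needs OPENAI_API_KEY in the environment

chunker = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # split only at the largest 5% of distance jumps
)
chunks = chunker.split_text(document)  # document: the full text as a single string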
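
To make the percentile threshold concrete, here is a simplified sketch of the underlying idea: measure the cosine distance between neighboring sentence embeddings and start a new chunk wherever that distance exceeds the chosen percentile. This is not LangChain's implementation. The `embed` callable and the toy two-dimensional vectors are placeholders, and the code assumes non-zero vectors.

```python
import math
from typing import Callable, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm  # assumes neither vector is all zeros


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Sequence[float]],
    threshold_pct: float = 95.0,
) -> list[str]:
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    # distances[i] is the jump between sentence i and sentence i + 1
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, threshold_pct)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large jump: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy embeddings: three sentences about one topic, then two about another
toy = {"A1.": [1.0, 0.0], "A2.": [0.9, 0.1], "A3.": [1.0, 0.05],
       "B1.": [0.0, 1.0], "B2.": [0.1, 0.9]}
print(semantic_chunks(list(toy), toy.__getitem__))  # ['A1. A2. A3.', 'B1. B2.']
```

A higher percentile yields fewer, larger chunks; a lower one splits more aggressively.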

Document-Structure-Aware Chunking

Strategy	Description	When to use
Markdown header	Splits at headers, preserves hierarchy	Documentation, wiki
HTML sections	Splits at HTML elements	Web content
Paragraph-based	Splits at paragraphs	Prose texts
Code-aware	Splits at functions/classes	Source code
Sliding window	Fixed size with overlap	Fallback / default
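
As an illustration of the first row, a minimal sketch of a Markdown header splitter that keeps the header hierarchy with each chunk. The function name and the output shape (`headers`, `text`) are choices made for this example; the sketch does not handle `#` lines inside fenced code blocks.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_markdown_headers(markdown: str) -> list[dict]:
    """Return one chunk per header section, each tagged with its header path."""
    chunks: list[dict] = []
    path: list[str] = []  # current header hierarchy, e.g. ["Guide", "Install"]
    body: list[str] = []  # lines collected since the last header

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headers": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headers at the same or a deeper level
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro\n## Install\npip install x\n# FAQ\nSee above."
for chunk in split_by_markdown_headers(doc):
    print(" > ".join(chunk["headers"]), "|", chunk["text"])
```

Storing the header path with each chunk preserves the hierarchy and doubles as metadata for later filtering.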

Chunk Size Optimization

Too small (< 200 tokens):
  ✗ Context loss: isolated sentences without their surrounding meaning
  ✗ More chunks to embed, store, and search

Too large (> 2000 tokens):
  ✗ Noise: irrelevant information dilutes the chunk's embedding
  ✗ Lower retrieval precision

Good starting range (300-800 tokens), to be tuned on your own data:
  ✓ Enough context to be understood on its own
  ✓ Focused enough for precise retrieval
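
A quick way to see where your chunks fall relative to these bands is sketched below. Token counts are approximated at about 4 characters per token, a rule of thumb for English text and an assumption here; swap `estimate_tokens` for your embedding model's tokenizer to get real numbers. The "in between" label for sizes outside the named bands is also specific to this sketch.

```python
from collections import Counter


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def size_band(tokens: int) -> str:
    if tokens < 200:
        return "too small"
    if tokens > 2000:
        return "too large"
    if 300 <= tokens <= 800:
        return "starting range"
    return "in between"  # 200-299 or 801-2000 tokens


def size_report(chunks: list[str]) -> Counter:
    """Count how many chunks fall into each size band."""
    return Counter(size_band(estimate_tokens(c)) for c in chunks)


sample = ["x" * 400, "x" * 2000, "x" * 10_000]  # about 100, 500, 2500 tokens
print(size_report(sample))
```

Running the report over the output of each chunking strategy shows quickly whether a splitter produces many fragments or oversized blocks.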

Metadata Enrichment

Chunks without metadata are like books without a table of contents. Metadata lets you narrow a search by source, date, or category and trace every answer back to its origin:

chunk = {
    "text": "The new GDPR amendment affects...",
    "metadata": {
        "source": "compliance/gdpr-update-2026.pdf",
        "page": 12,
        "section": "Changes 2026",
        "category": "compliance",
        "date": "2026-01-15",
        "author": "Legal Team",
        "language": "en",
        "keywords": ["GDPR", "data protection", "compliance"]
    }
}

Metadata Filtering in Search

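# vectorstore: an existing LangChain-style vector store.
# Filter syntax is store-specific; the MongoDB-style $gte shown here is one
# common convention. Check your store's documentation for the exact form.
results = vectorstore.similarity_search(
    query="GDPR amendments",
    filter={"category": "compliance", "date": {"$gte": "2026-01-01"}},
    k=5,
)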
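
Since filter syntax differs between vector stores, the following store-independent sketch shows what the filter above means when applied to chunk dictionaries. `matches` is a hypothetical helper that supports only equality and `$gte`. The date comparison works on plain strings because ISO 8601 dates (YYYY-MM-DD) sort chronologically.

```python
def matches(metadata: dict, filter_spec: dict) -> bool:
    """True if the metadata satisfies every condition in filter_spec."""
    for field, condition in filter_spec.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if isinstance(condition, dict):
            for op, operand in condition.items():
                if op != "$gte":
                    raise ValueError(f"unsupported operator: {op}")
                if not value >= operand:
                    return False
        elif value != condition:  # a plain value means equality
            return False
    return True


chunks = [
    {"text": "GDPR amendment 2026", "metadata": {"category": "compliance", "date": "2026-01-15"}},
    {"text": "GDPR overview 2024", "metadata": {"category": "compliance", "date": "2024-05-02"}},
    {"text": "Onboarding guide", "metadata": {"category": "hr", "date": "2026-02-01"}},
]
filter_spec = {"category": "compliance", "date": {"$gte": "2026-01-01"}}
candidates = [c for c in chunks if matches(c["metadata"], filter_spec)]
print([c["text"] for c in candidates])  # ['GDPR amendment 2026']
```

Filtering before the vector comparison shrinks the candidate set, which helps both precision and speed.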

Hybrid Search

Hybrid search combines vector search (semantic) with keyword search (BM25), so that exact terms such as article numbers are matched alongside paraphrases:

Query: "GDPR Article 15 right of access"

Vector search: Finds semantically similar texts
BM25 search:   Finds exact keyword matches

Hybrid (RRF):  Merges both rankings with Reciprocal Rank Fusion → documents ranked highly by both methods come first
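
Reciprocal Rank Fusion itself is only a few lines: each document scores 1 / (k + rank) in every ranking it appears in, and the scores are summed. A minimal sketch, assuming each search returns a ranked list of document IDs; k = 60 is the commonly used constant, and the document IDs are made up for the example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists into one list sorted by summed RRF score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic ranking
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # keyword ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF uses only ranks, it needs no normalization between cosine similarities and BM25 scores, which live on different scales.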

Practical tip: Always test at least three chunking strategies on your real data, using a fixed set of queries whose correct answers you know. The "right" strategy depends heavily on your document type. Invest in metadata enrichment as well: after chunking strategy, it is the biggest lever for retrieval quality.
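
Comparing strategies requires a number to compare. One simple option is the hit rate at k: the share of test queries for which at least one of the top k retrieved chunks contains the expected answer snippet. The sketch below assumes a hypothetical `retrieve(query, k)` callable that returns chunk texts in ranked order. Matching on a snippet rather than a chunk ID keeps the test set reusable across strategies, since chunk boundaries change with each one.

```python
from typing import Callable


def hit_rate_at_k(
    test_set: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries for which a top-k chunk contains the expected snippet."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    hits = 0
    for query, expected_snippet in test_set:
        top_chunks = retrieve(query, k)[:k]
        if any(expected_snippet in chunk for chunk in top_chunks):
            hits += 1
    return hits / len(test_set)


# Stand-in retriever for demonstration; replace with a call to your pipeline
def fake_retrieve(query: str, k: int) -> list[str]:
    index = {
        "right of access": ["Article 15 grants the right of access.", "Unrelated text."],
        "retention period": ["Unrelated text.", "Another unrelated text."],
    }
    return index.get(query, [])[:k]


test_set = [("right of access", "Article 15"), ("retention period", "Article 5")]
print(hit_rate_at_k(test_set, fake_retrieve, k=2))  # 0.5
```

Run the same test set against each chunking strategy and keep the one with the highest hit rate at the k your application actually uses.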