Now we connect all the building blocks into a working end-to-end pipeline, from document ingestion through retrieval to final answer generation.
Documents → Loader → Chunker → Embedder → Vector DB
                                                ↑
User Query → Embedder → Similarity Search ──────┘
                                ↓
                          Top-K Chunks + Query → LLM → Answer
Different sources require different loaders, each of which turns its input format into plain text plus metadata.
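One way to organize loaders is a small registry keyed by file extension. The sketch below is only an illustration: `loaders`, `loadDocument`, and the regex-based cleanup are assumptions made for this example, and real PDF or HTML loaders would rely on a proper parser.

```javascript
// Hypothetical loader registry: maps a file extension to a function that
// turns raw file content into plain text. All names here are illustrative.
const loaders = {
  ".txt": (raw) => raw,
  // Drop Markdown heading markers at the start of a line.
  ".md": (raw) => raw.replace(/^#+[ \t]*/gm, ""),
  // Very rough tag stripping; a real HTML loader should use a parser.
  ".html": (raw) => raw.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim(),
};

function loadDocument(filename, raw) {
  const dot = filename.lastIndexOf(".");
  const ext = dot === -1 ? "" : filename.slice(dot).toLowerCase();
  const loader = loaders[ext];
  if (!loader) throw new Error(`No loader registered for "${ext}"`);
  // Keep the source so later stages can cite it.
  return { source: filename, text: loader(raw) };
}
```

Adding a new source type then means registering one more function, while the rest of the pipeline only ever sees `{ source, text }` objects.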
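// Pseudo-code for ingestion (recursiveSplit, embedModel, and vectorDB are placeholders)
// 1. Split the document into overlapping chunks
const chunks = recursiveSplit(document, { chunkSize: 512, overlap: 64 })
// 2. Embed all chunk texts in a single batch call
const embeddings = await embedModel.embed(chunks.map(c => c.text))
// 3. Store each vector together with the metadata needed for citations
await vectorDB.upsert(chunks.map((c, i) => ({
  id: c.id,
  vector: embeddings[i],
  metadata: { source: c.source, page: c.page }
})))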
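The chunking step in the pseudo-code above can be made concrete with a simplified stand-in. This sketch is not a recursive splitter: it cuts fixed-size character windows with overlap, and the function name `splitWithOverlap` and the `id` format are assumptions made for illustration.

```javascript
// Simplified stand-in for the recursiveSplit placeholder: fixed-size
// character windows with overlap. A real recursive splitter would first try
// paragraph and sentence boundaries before falling back to hard cuts.
function splitWithOverlap(text, { chunkSize = 512, overlap = 64, source = "unknown" } = {}) {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const step = chunkSize - overlap; // how far each window advances
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push({
      id: `${source}#${chunks.length}`,
      text: text.slice(start, start + chunkSize),
      source,
    });
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}
```

The overlap means that a sentence cut at a chunk boundary still appears in full in at least one neighboring chunk, provided it is shorter than the overlap.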
After the initial retrieval (for example, the top 20 chunks by vector similarity), a reranker model scores each candidate's relevance to the query and returns only the best 3–5 chunks.
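The two-stage pattern can be sketched as follows. Everything here is an assumption made for illustration: the in-memory `store` array, the function names, and especially `wordOverlap`, which is only a toy scorer standing in for a real reranker model.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Stage 1: cheap vector search over all stored chunks.
function retrieve(queryVector, store, topK = 20) {
  return store
    .map((item) => ({ ...item, score: cosine(queryVector, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Stage 2: rescore only the candidates with a more expensive function.
// scoreFn is a placeholder for a reranker model call (hypothetical signature).
function rerank(query, candidates, scoreFn, topN = 5) {
  return candidates
    .map((c) => ({ ...c, rerankScore: scoreFn(query, c.text) }))
    .sort((x, y) => y.rerankScore - x.rerankScore)
    .slice(0, topN);
}

// Toy scorer for demonstration only: fraction of query words found in the chunk.
function wordOverlap(query, text) {
  const words = query.toLowerCase().split(/\W+/).filter(Boolean);
  const haystack = new Set(text.toLowerCase().split(/\W+/));
  if (words.length === 0) return 0;
  return words.filter((w) => haystack.has(w)).length / words.length;
}
```

The design point is cost: the reranker sees only the `topK` candidates, so a slower but more precise scorer stays affordable.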
System prompt:
"Answer the question based on the following context.
If the context doesn't contain the answer, say so honestly.
Cite relevant sources."
Context: [Top-K Chunks]
Question: [User Query]
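A minimal sketch of how this template might be filled in code. The `{ role, content }` message shape and the numbered, source-tagged context format are assumptions; adapt both to whatever LLM client you use.

```javascript
const SYSTEM_PROMPT = [
  "Answer the question based on the following context.",
  "If the context doesn't contain the answer, say so honestly.",
  "Cite relevant sources.",
].join("\n");

// Formats chunks as numbered, source-tagged blocks so the model can cite them.
// The message shape is an assumption, not a specific provider's API.
function buildMessages(query, chunks) {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.source}, p. ${c.page})\n${c.text}`)
    .join("\n\n");
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: `Context:\n${context}\n\nQuestion: ${query}` },
  ];
}
```

Numbering the chunks gives the model a compact handle such as `[1]` to cite, which can be mapped back to the source and page afterwards.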
| Component | Recommendation | Alternative |
|---|---|---|
| Orchestration | LangChain / LlamaIndex | Haystack |
| Embeddings | OpenAI text-embedding-3 | Voyage, BGE |
| Vector DB | pgvector / Qdrant | Pinecone, Weaviate |
| Reranker | Cohere Rerank v3 | Cross-Encoder |
| LLM | Claude Opus / GPT-5 | Mixtral (self-hosted) |
Practical tip: Build a minimal prototype first, without reranking or query optimization. Measure its quality, then optimize the specific stage where the measurements show a weakness.
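One simple way to measure retrieval quality is a hit rate over a small hand-labeled test set. The sketch below assumes test cases of the form `{ question, expectedId }` and a `retrieveIds` function returning ranked chunk ids; both are hypothetical names for this example.

```javascript
// Fraction of test questions for which the expected chunk id appears among
// the first k retrieved ids. retrieveIds is a placeholder for your own
// retrieval step and must return ids ordered from best to worst.
function hitRateAtK(testCases, retrieveIds, k = 5) {
  if (testCases.length === 0) return 0;
  const hits = testCases.filter(({ question, expectedId }) =>
    retrieveIds(question).slice(0, k).includes(expectedId)
  ).length;
  return hits / testCases.length;
}
```

Running the same test set before and after adding a reranker shows whether the extra stage actually improves results.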
What is the purpose of reranking in a RAG pipeline?