Lesson 4 of 6 · 12 min read

Building a RAG Pipeline

Now we connect all the building blocks into a working end-to-end pipeline, from document ingestion through retrieval to final answer generation.

The Architecture

Documents → Loader → Chunker → Embedder → Vector DB
                                                ↑
User Query → Embedder → Similarity Search ──────┘
                                    ↓
                         Top-K Chunks + Query → LLM → Answer
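
The similarity-search box in the diagram can be illustrated with a small in-memory sketch. It assumes cosine similarity as the metric and a brute-force scan over all stored vectors; `StoredChunk`, `cosineSimilarity`, and `topK` are hypothetical names used only for illustration, and a real vector DB would typically use an approximate index instead of scanning everything.

```typescript
// Hypothetical in-memory stand-in for the vector DB lookup.
interface StoredChunk {
  id: string;
  vector: number[];
  text: string;
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force search: score every chunk, sort descending, keep the best k.
function topK(query: number[], chunks: StoredChunk[], k: number): StoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```

The query vector must come from the same embedding model as the stored vectors, otherwise the scores are not comparable.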

Step 1: Document Ingestion

Document Loader

Different sources require different loaders:

  • PDF: pypdf (successor to the deprecated PyPDF2), Unstructured, Adobe PDF Extract API
  • Web: BeautifulSoup, Crawl4AI, Firecrawl
  • Office: python-docx, openpyxl
  • Databases: SQL connector, API calls

Pre-Processing

  • Remove HTML tags
  • Convert tables to structured text
  • Eliminate headers/footers
  • Normalize encoding (UTF-8)
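
Two of these clean-up steps can be sketched as follows. This is a rough illustration, not a production cleaner: it assumes the input is an HTML string or an array of per-page texts, uses regular expressions where a real HTML parser would be more robust, and does not decode HTML entities. `stripHtml` and `dropRepeatedLines` are hypothetical helper names.

```typescript
// Remove tags with regular expressions; adequate for a sketch only.
function stripHtml(html: string): string {
  return html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, " ") // drop script/style blocks including content
    .replace(/<[^>]+>/g, " ")                           // drop all remaining tags
    .replace(/\s+/g, " ")                               // collapse whitespace
    .trim();
}

// Remove lines that occur on every page, which is typical for headers and footers.
function dropRepeatedLines(pages: string[]): string[] {
  if (pages.length < 2) return pages; // nothing to compare against
  const pageCount = new Map<string, number>();
  for (const page of pages) {
    const seen = new Set<string>();
    for (const raw of page.split("\n")) {
      const line = raw.trim();
      // Count each distinct non-empty line once per page.
      if (line !== "" && !seen.has(line)) {
        seen.add(line);
        pageCount.set(line, (pageCount.get(line) ?? 0) + 1);
      }
    }
  }
  return pages.map((page) =>
    page
      .split("\n")
      .filter((raw) => pageCount.get(raw.trim()) !== pages.length)
      .join("\n")
  );
}
```

The "occurs on every page" rule is a deliberately strict assumption; page numbers that change from page to page would need a looser pattern match.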

Step 2: Chunking + Embedding

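// Pseudo-code for ingestion (recursiveSplit, embedModel and vectorDB are placeholders)
// 1. Split the document into chunks of size 512 that overlap by 64
const chunks = recursiveSplit(document, { chunkSize: 512, overlap: 64 })
// 2. Embed all chunk texts in one batch; embeddings[i] belongs to chunks[i]
const embeddings = await embedModel.embed(chunks.map(c => c.text))
// 3. Store each vector together with its metadata
await vectorDB.upsert(chunks.map((c, i) => ({
  id: c.id,
  vector: embeddings[i],
  // keep the chunk text so that retrieval can pass it on to the LLM
  metadata: { source: c.source, page: c.page, text: c.text }
})))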
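
The lesson does not define `recursiveSplit`, so the following is only one possible interpretation, working on plain strings. It assumes sizes are counted in characters, tries paragraph, line, and word separators in that order, and implements overlap by prefixing each chunk with the tail of the previous piece. `splitPieces` and `recursiveSplitText` are hypothetical names.

```typescript
interface SplitOptions {
  chunkSize: number; // maximum chunk length, counted in characters here (assumption)
  overlap: number;   // characters repeated from the previous piece
}

// Tried in order: paragraphs, then lines, then words.
const SEPARATORS: string[] = ["\n\n", "\n", " "];

// Break text into pieces of at most maxLen characters, preferring coarse separators.
function splitPieces(text: string, maxLen: number, level: number = 0): string[] {
  if (text.length <= maxLen) return [text];
  if (level >= SEPARATORS.length) {
    // No separator left: fall back to hard cuts.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) cuts.push(text.slice(i, i + maxLen));
    return cuts;
  }
  const sep = SEPARATORS[level];
  const pieces: string[] = [];
  let current = "";
  for (const part of text.split(sep)) {
    const candidate = current === "" ? part : current + sep + part;
    if (candidate.length <= maxLen) {
      current = candidate; // still fits: keep merging
    } else {
      if (current !== "") pieces.push(current);
      current = part; // may itself be too long; handled by the recursion below
    }
  }
  if (current !== "") pieces.push(current);
  // Pieces that are still too long are split again with the next, finer separator.
  const result: string[] = [];
  for (const piece of pieces) {
    for (const sub of splitPieces(piece, maxLen, level + 1)) result.push(sub);
  }
  return result;
}

function recursiveSplitText(text: string, options: SplitOptions): string[] {
  const { chunkSize, overlap } = options;
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be in [0, chunkSize)");
  }
  const pieces = splitPieces(text, chunkSize - overlap);
  // Prefix every chunk except the first with the tail of the previous piece.
  return pieces.map((piece, i) =>
    i > 0 && overlap > 0 ? pieces[i - 1].slice(-overlap) + piece : piece
  );
}
```

Because pieces are limited to `chunkSize - overlap` characters before the overlap is added, no chunk exceeds `chunkSize`. A token-based splitter would follow the same structure with a tokenizer's length function in place of `text.length`.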

Step 3: Retrieval

Query Optimization

  • Query Rewriting: LLM reformulates the user question for better retrieval results
  • HyDE (Hypothetical Document Embedding): LLM generates a hypothetical answer whose embedding is used for search
  • Multi-Query: Generate multiple variants of the question and merge results
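
Merging the result lists of several query variants needs a fusion rule, which the lesson leaves open. One common choice is reciprocal rank fusion (RRF), used here as an assumption rather than as the lesson's prescribed method. The function name `reciprocalRankFusion` is illustrative, and the constant `k = 60` is a conventional default, not a tuned value.

```typescript
// Each inner array is a ranked list of chunk ids returned for one query variant.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // 1-based: the top hit contributes 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const ids: string[] = [];
  scores.forEach((_score, id) => ids.push(id));
  // Highest fused score first.
  return ids.sort((a, b) => (scores.get(b) ?? 0) - (scores.get(a) ?? 0));
}
```

Chunks that appear near the top of several lists rise in the fused ranking, while a chunk found by only one variant still stays in the candidate set.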

Reranking

After initial retrieval (for example, the top 20 chunks), a reranker model scores each chunk against the query and returns the best 3–5 chunks.
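
The two-stage pattern can be sketched with a pluggable scoring function. The `wordOverlap` scorer below is only a toy stand-in so that the example runs without a model; in practice the scorer would call a reranker model or API. All names (`Candidate`, `RelevanceScorer`, `rerank`, `wordOverlap`) are hypothetical.

```typescript
interface Candidate {
  id: string;
  text: string;
}

// A scorer rates how relevant a text is to the query; higher is better.
type RelevanceScorer = (query: string, text: string) => number;

// Re-score the candidates from the first retrieval stage and keep the best `keep`.
function rerank(
  query: string,
  candidates: Candidate[],
  score: RelevanceScorer,
  keep: number
): Candidate[] {
  return candidates
    .map((candidate) => ({ candidate, relevance: score(query, candidate.text) }))
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, keep)
    .map((entry) => entry.candidate);
}

// Toy stand-in for a reranker model: fraction of query words found in the text.
const wordOverlap: RelevanceScorer = (query, text) => {
  const words = query.toLowerCase().split(/\s+/).filter((w) => w !== "");
  if (words.length === 0) return 0;
  const haystack = text.toLowerCase();
  const hits = words.filter((w) => haystack.indexOf(w) !== -1).length;
  return hits / words.length;
};
```

Keeping the scorer as a parameter makes it easy to measure the pipeline with and without a real reranker.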

Step 4: Generation

System prompt:
"Answer the question based on the following context.
If the context doesn't contain the answer, say so honestly.
Cite relevant sources."

Context: [Top-K Chunks]
Question: [User Query]
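
Assembling this prompt from the retrieved chunks could look like the sketch below. The numbering of chunks and the `(source, p. N)` label are assumptions added so that the model has something to cite; `RetrievedChunk`, `SYSTEM_PROMPT`, and `buildUserMessage` are hypothetical names.

```typescript
interface RetrievedChunk {
  text: string;
  source: string;
  page?: number;
}

// The system prompt from the lesson, one instruction per line.
const SYSTEM_PROMPT: string = [
  "Answer the question based on the following context.",
  "If the context doesn't contain the answer, say so honestly.",
  "Cite relevant sources.",
].join("\n");

// Format the top-K chunks and the user query into a single user message.
function buildUserMessage(chunks: RetrievedChunk[], question: string): string {
  const context = chunks
    .map((chunk, i) => {
      const page = chunk.page !== undefined ? `, p. ${chunk.page}` : "";
      // Number each chunk so the answer can refer to it as [1], [2], ...
      return `[${i + 1}] (${chunk.source}${page})\n${chunk.text}`;
    })
    .join("\n\n");
  return `Context:\n${context}\n\nQuestion: ${question}`;
}
```

`SYSTEM_PROMPT` and the result of `buildUserMessage` would then be sent as the system and user messages of whichever LLM API the pipeline uses.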

Practical Stack (2026)

Component      | Recommendation            | Alternative
Orchestration  | LangChain / LlamaIndex    | Haystack
Embeddings     | OpenAI text-embedding-3   | Voyage, BGE
Vector DB      | pgvector / Qdrant         | Pinecone, Weaviate
Reranker       | Cohere Rerank v3          | Cross-Encoder
LLM            | Claude Opus / GPT-5       | Mixtral (self-hosted)

Practical tip: Build a minimal prototype first, without reranking or query optimization. Measure its quality, then target your optimizations at the weakest stage.
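
One simple way to measure retrieval quality is a hit rate over a small set of hand-labeled questions. The metric choice and all names here (`EvalCase`, `hitRateAtK`) are assumptions for illustration; the lesson does not prescribe a particular metric.

```typescript
interface EvalCase {
  question: string;
  relevantIds: string[]; // ids of chunks a human judged relevant
}

// Fraction of questions for which at least one relevant chunk is among the top k results.
function hitRateAtK(
  cases: EvalCase[],
  retrieve: (question: string) => string[], // returns ranked chunk ids
  k: number
): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const evalCase of cases) {
    const top = retrieve(evalCase.question).slice(0, k);
    if (top.some((id) => evalCase.relevantIds.indexOf(id) !== -1)) hits++;
  }
  return hits / cases.length;
}
```

Running the same evaluation set before and after adding reranking or query rewriting shows whether the extra stage is worth its cost.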

Quiz

Question 1 of 3

What is the purpose of reranking in a RAG pipeline?