Building a RAG Pipeline

Now we connect all the building blocks into a working end-to-end pipeline, from document ingestion through retrieval to final answer generation.

The Architecture

Documents → Loader → Chunker → Embedder → Vector DB
                                                ↑
User Query → Embedder → Similarity Search ──────┘
                                    ↓
                         Top-K Chunks + Query → LLM → Answer
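
To make the "Similarity Search" box in the diagram concrete, here is a minimal in-memory sketch. It assumes cosine similarity as the distance measure and a brute-force scan; the names `StoredChunk`, `cosineSimilarity`, and `topK` are illustrative and not part of any library. A real vector DB would typically use an approximate index instead of scanning every vector.

```typescript
// In-memory stand-in for the "Similarity Search" step (illustrative names).
type StoredChunk = { id: string; vector: number[] };

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored chunk against the query vector and keep the k best.
function topK(query: number[], store: StoredChunk[], k: number): { id: string; score: number }[] {
  return store
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dimensional vectors; real embeddings have far more dimensions.
const store: StoredChunk[] = [
  { id: "a", vector: [1, 0, 0] },
  { id: "b", vector: [0.9, 0.1, 0] },
  { id: "c", vector: [0, 1, 0] },
];
const hits = topK([1, 0, 0], store, 2);
console.log(hits.map((h) => h.id)); // [ 'a', 'b' ]
```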

Step 1: Document Ingestion

Document Loader

Different sources require different loaders:

PDF: pypdf (the successor to the deprecated PyPDF2), Unstructured, Adobe PDF Extract API
Web: BeautifulSoup, Crawl4AI, Firecrawl
Office: python-docx, openpyxl
Databases: SQL connector, API calls

Pre-Processing

Remove HTML tags
Convert tables to structured text
Strip repeated page headers and footers
Normalize encoding (UTF-8)
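
A rough sketch of how some of these clean-up steps could look, assuming the text has already been extracted page by page. The helper names (`stripHtml`, `removeRepeatedLines`) and the 60% threshold are assumptions made for illustration. Regex-based tag stripping is fragile and does not decode entities, so a proper HTML parser is the safer choice for production. `normalize("NFC")` handles Unicode normalization only; decoding bytes into UTF-8 text has to happen earlier, when the file is read.

```typescript
// Remove tags from an HTML fragment (simplified: no entity decoding).
function stripHtml(html: string): string {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ") // drop script/style blocks including content
    .replace(/<[^>]+>/g, " ")                        // drop all remaining tags
    .replace(/\s+/g, " ")                            // collapse whitespace
    .trim();
}

// Drop lines that occur on at least `minShare` of all pages (likely headers/footers).
function removeRepeatedLines(pages: string[], minShare = 0.6): string[] {
  const pageCount = new Map<string, number>();
  for (const page of pages) {
    const lines = page.split("\n").map((l) => l.trim()).filter((l) => l.length > 0);
    // Count each distinct line once per page.
    for (const line of Array.from(new Set(lines))) {
      pageCount.set(line, (pageCount.get(line) ?? 0) + 1);
    }
  }
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  return pages.map((page) =>
    page
      .split("\n")
      .filter((l) => (pageCount.get(l.trim()) ?? 0) < threshold)
      .join("\n")
  );
}

// Made-up sample pages with a shared header and footer.
const rawPages = [
  "ACME Handbook\nIntro text\nConfidential",
  "ACME Handbook\nSecond page text\nConfidential",
  "ACME Handbook\nThird page text\nConfidential",
];
const cleanedPages = removeRepeatedLines(rawPages).map((p) => p.normalize("NFC"));
console.log(cleanedPages); // [ 'Intro text', 'Second page text', 'Third page text' ]
```

Footers that change per page (for example "Page 3 of 10") are not caught by this exact-match approach and would need a pattern-based rule.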

Step 2: Chunking + Embedding

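// Pseudo-code for ingestion (recursiveSplit, embedModel and vectorDB are placeholders)
// 1. Split the document into overlapping chunks
const chunks = recursiveSplit(document, { chunkSize: 512, overlap: 64 })
// 2. Embed all chunk texts in one batch; embeddings[i] belongs to chunks[i]
const embeddings = await embedModel.embed(chunks.map(c => c.text))
// 3. Store each vector together with the metadata needed to cite its source later
await vectorDB.upsert(chunks.map((c, i) => ({
  id: c.id,
  vector: embeddings[i],
  metadata: { source: c.source, page: c.page }
})))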
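
The `recursiveSplit` call above is left undefined, so here is one possible sketch of it. It assumes sizes are counted in characters (the pseudo-code does not say whether 512 means characters or tokens), uses an assumed separator order from paragraph down to word, and returns only `id` and `text`; `source` and `page` would have to be attached by the loader. None of this is a specific library's implementation.

```typescript
type Chunk = { id: string; text: string };

// Assumed separator order: paragraphs, lines, sentences, words.
const SEPARATORS = ["\n\n", "\n", ". ", " "];

// Break text into pieces of at most maxLen characters, preferring coarse separators.
function splitToPieces(text: string, maxLen: number, level = 0): string[] {
  if (text.length <= maxLen) return [text];
  const sep = SEPARATORS[level];
  if (sep === undefined) {
    // No separator left: fall back to hard cuts.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) cuts.push(text.slice(i, i + maxLen));
    return cuts;
  }
  const parts = text.split(sep);
  const pieces: string[] = [];
  parts.forEach((part, i) => {
    // Re-attach the separator so no characters are lost.
    const withSep = i < parts.length - 1 ? part + sep : part;
    if (withSep.length > 0) pieces.push(...splitToPieces(withSep, maxLen, level + 1));
  });
  return pieces;
}

// Greedily merge pieces into chunks; each new chunk starts with the tail of the previous one.
function recursiveSplit(text: string, opts: { chunkSize: number; overlap: number }): Chunk[] {
  const { chunkSize, overlap } = opts;
  if (overlap < 0 || overlap >= chunkSize) throw new Error("overlap must be in [0, chunkSize)");
  // Pieces are capped at chunkSize - overlap so that tail + piece never exceeds chunkSize.
  const pieces = splitToPieces(text, chunkSize - overlap);
  const chunks: Chunk[] = [];
  let current = "";
  for (const piece of pieces) {
    if (current.length > 0 && current.length + piece.length > chunkSize) {
      chunks.push({ id: `chunk-${chunks.length}`, text: current });
      current = overlap > 0 ? current.slice(-overlap) : "";
    }
    current += piece;
  }
  if (current.length > 0) chunks.push({ id: `chunk-${chunks.length}`, text: current });
  return chunks;
}

// Made-up sample text: 40 short sentences.
const sample = Array.from({ length: 40 }, (_, i) => `Sentence number ${i} talks about retrieval.`).join(" ");
const sampleChunks = recursiveSplit(sample, { chunkSize: 200, overlap: 40 });
console.log(sampleChunks.length, sampleChunks.map((c) => c.text.length));
```

The overlap here is cut at a fixed character count and can start mid-word. Token-based sizes or sentence-aligned overlaps are common refinements.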

Step 3: Retrieval

Query Optimization

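Query Rewriting: An LLM reformulates the user question so that it matches the wording of the indexed documents more closely
HyDE (Hypothetical Document Embeddings): An LLM generates a hypothetical answer, and the embedding of that answer (rather than of the raw query) is used for the search
Multi-Query: Generate multiple variants of the question, retrieve for each variant, and merge the result lists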
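
The text does not say how Multi-Query results are merged. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than the required method. `generateVariants` and `search` are hypothetical callbacks standing in for an LLM call and a retriever, and `k = 60` is the constant commonly used with RRF.

```typescript
type RankedList = string[]; // chunk ids, best match first

// Reciprocal rank fusion: every list adds 1 / (k + rank) to a chunk's score.
function reciprocalRankFusion(lists: RankedList[], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Multi-Query wrapper; both callbacks are hypothetical and must be supplied by the caller.
async function multiQueryRetrieve(
  question: string,
  generateVariants: (q: string) => Promise<string[]>, // e.g. an LLM prompt asking for paraphrases
  search: (q: string) => Promise<RankedList>          // e.g. embed the query and ask the vector DB
): Promise<string[]> {
  const queries = [question, ...(await generateVariants(question))];
  const lists = await Promise.all(queries.map((q) => search(q)));
  return reciprocalRankFusion(lists).map((r) => r.id);
}

// Made-up result lists for three variants of the same question.
const fused = reciprocalRankFusion([
  ["c1", "c2", "c3"],
  ["c2", "c1", "c4"],
  ["c2", "c5", "c1"],
]);
console.log(fused.slice(0, 2).map((r) => r.id)); // [ 'c2', 'c1' ]
```

Chunks that appear near the top of several lists win, which makes the merge robust against a single poorly worded variant.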

Reranking

After the initial retrieval (for example, the top 20 chunks), a reranker model scores each chunk against the query and returns only the best 3–5 chunks.
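
The two-stage pattern can be sketched with a pluggable scoring function. Everything here is illustrative: `termOverlap` is a deliberately naive stand-in so the example runs without a model, whereas a real reranker would be a cross-encoder or a hosted rerank API, usually called asynchronously.

```typescript
type Candidate = { id: string; text: string };
type Scorer = (query: string, text: string) => number;

// Score every candidate against the query and keep the `keep` best.
function rerank(query: string, candidates: Candidate[], score: Scorer, keep = 5): Candidate[] {
  return candidates
    .map((candidate) => ({ candidate, relevance: score(query, candidate.text) }))
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, keep)
    .map((entry) => entry.candidate);
}

// Stand-in scorer: share of query terms that occur in the chunk (NOT a real reranker).
function termOverlap(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
  if (terms.length === 0) return 0;
  const words = new Set(text.toLowerCase().split(/\W+/));
  return terms.filter((t) => words.has(t)).length / terms.length;
}

// Made-up candidates, as they might come back from the initial retrieval.
const candidates: Candidate[] = [
  { id: "a", text: "Our office is closed on public holidays." },
  { id: "b", text: "Damaged items can be returned within 14 days." },
  { id: "c", text: "The refund policy covers damaged items." },
];
const best = rerank("refund policy for damaged items", candidates, termOverlap, 2);
console.log(best.map((c) => c.id)); // [ 'c', 'b' ]
```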

Step 4: Generation

System prompt:
"Answer the question based on the following context.
If the context doesn't contain the answer, say so honestly.
Cite relevant sources."

Context: [Top-K Chunks]
Question: [User Query]
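
Assembling this prompt in code might look like the following sketch. The message shape (`role` / `content`) mirrors the convention used by most chat APIs, but the type names and the numbered `[1] (source, p. N)` layout are assumptions chosen so that the model has something concrete to cite.

```typescript
type RetrievedChunk = { text: string; source: string; page?: number };
type ChatMessage = { role: "system" | "user"; content: string };

const SYSTEM_PROMPT = [
  "Answer the question based on the following context.",
  "If the context doesn't contain the answer, say so honestly.",
  "Cite relevant sources.",
].join("\n");

// Number each chunk and label it with its source so the answer can reference it.
function buildMessages(query: string, chunks: RetrievedChunk[]): ChatMessage[] {
  const context = chunks
    .map((chunk, i) => {
      const ref = chunk.page !== undefined ? `${chunk.source}, p. ${chunk.page}` : chunk.source;
      return `[${i + 1}] (${ref})\n${chunk.text}`;
    })
    .join("\n\n");
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: `Context:\n${context}\n\nQuestion: ${query}` },
  ];
}

// Made-up chunks for illustration.
const messages = buildMessages("How many vacation days do I get?", [
  { text: "Full-time employees receive 30 vacation days per year.", source: "handbook.pdf", page: 12 },
  { text: "Vacation requests must be approved by the team lead.", source: "wiki/vacation" },
]);
console.log(messages.map((m) => m.content).join("\n---\n"));
```

Keeping the context in the user message and the instructions in the system message is one reasonable split; some setups place both in a single prompt instead.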

Practical Stack (2026)

Component	Recommendation	Alternative
Orchestration	LangChain / LlamaIndex	Haystack
Embeddings	OpenAI text-embedding-3-small / -large	Voyage, BGE
Vector DB	pgvector / Qdrant	Pinecone, Weaviate
Reranker	Cohere Rerank v3	Cross-Encoder
LLM	Claude Opus / GPT-5	Mixtral (self-hosted)

Practical tip: Build a minimal prototype first, without reranking or query optimization. Measure its quality, then optimize the specific stage that falls short.
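
One simple way to put a number on retrieval quality is the hit rate at k: the share of test questions for which at least one known-relevant chunk shows up in the top k results. The sketch below assumes a small hand-labeled test set and a synchronous retriever; the types, the lookup-table retriever, and the sample data are all made up for illustration.

```typescript
type EvalCase = { question: string; relevantIds: string[] };
type Retriever = (question: string, k: number) => string[];

// Share of questions where at least one relevant chunk is among the top k results.
function hitRateAtK(cases: EvalCase[], retrieve: Retriever, k: number): number {
  if (cases.length === 0) return 0;
  const hitCount = cases.filter((testCase) => {
    const retrieved = retrieve(testCase.question, k);
    return testCase.relevantIds.some((id) => retrieved.indexOf(id) !== -1);
  }).length;
  return hitCount / cases.length;
}

// Fake retriever backed by a lookup table, standing in for the real pipeline.
const fakeIndex: Record<string, string[]> = {
  "What is the refund window?": ["c7", "c2", "c9"],
  "Who approves travel?": ["c4", "c1", "c3"],
};
const fakeRetriever: Retriever = (question, k) => (fakeIndex[question] ?? []).slice(0, k);

const evalCases: EvalCase[] = [
  { question: "What is the refund window?", relevantIds: ["c2"] },
  { question: "Who approves travel?", relevantIds: ["c8"] },
];
console.log(hitRateAtK(evalCases, fakeRetriever, 3)); // 0.5
```

Running the same test set before and after adding reranking or query optimization shows whether the extra stage actually helps.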