Lesson 6 of 6 · 11 min read

Production Deployment

A RAG pipeline in a notebook is a prototype. For production you need scaling, index management, incremental updates, cost optimization, and disaster recovery.

Scaling Vector Databases

Horizontal Scaling

Strategy        Description                                When to use
──────────────────────────────────────────────────────────────────────────────────
Sharding        Distribute data across multiple nodes      >10M vectors
Replicas        Read replicas for higher throughput        Many concurrent queries
Tiered storage  Hot/warm/cold storage by access frequency  Cost optimization
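A minimal sketch of how sharding changes the read and write path, assuming documents carry a string ID and each shard returns (doc_id, score) pairs where a higher score means more similar. The function names `shard_for` and `merge_top_k` are illustrative and not part of any particular vector database:

```python
import hashlib
import heapq


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard index in [0, num_shards)."""
    # A stable hash keeps the mapping identical across processes and restarts
    # (Python's built-in hash() is salted per process for strings).
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def merge_top_k(per_shard_hits: list[list[tuple[str, float]]], k: int) -> list[tuple[str, float]]:
    """Merge (doc_id, score) hits from all shards into one global top-k list."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])


# Writes go to exactly one shard; queries fan out to every shard and are merged.
target_shard = shard_for("doc-42", num_shards=4)
merged = merge_top_k([[("a", 0.9), ("b", 0.5)], [("c", 0.8)]], k=2)
```

With plain modulo hashing, changing the number of shards remaps most documents, so managed vector databases usually handle rebalancing for you.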

Sizing Guide

Vectors × Dimensions × 4 bytes (float32) = RAM requirement (minimum)

Example:
1M vectors × 1536 dimensions × 4 bytes = ~6 GB RAM
+ HNSW index overhead (~2x) = ~12 GB RAM
+ Safety margin (1.5x) = ~18 GB RAM recommended
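The sizing rule above can be written as a small calculator. This is a sketch using the same rules of thumb (float32 values, ~2x HNSW overhead, 1.5x safety margin); the function name and the decimal GB convention (1 GB = 10^9 bytes) are assumptions made for illustration:

```python
def estimate_ram_gb(
    num_vectors: int,
    dimensions: int,
    bytes_per_value: int = 4,     # float32
    index_overhead: float = 2.0,  # rule-of-thumb HNSW factor from the guide
    safety_margin: float = 1.5,
) -> dict:
    """Return raw, indexed, and recommended RAM in GB (1 GB = 1e9 bytes)."""
    raw_gb = num_vectors * dimensions * bytes_per_value / 1e9
    with_index_gb = raw_gb * index_overhead
    recommended_gb = with_index_gb * safety_margin
    return {
        "raw_gb": raw_gb,
        "with_index_gb": with_index_gb,
        "recommended_gb": recommended_gb,
    }


estimate = estimate_ram_gb(num_vectors=1_000_000, dimensions=1536)
for name, value in estimate.items():
    print(f"{name}: {value:.1f} GB")
```

The actual overhead depends on the database and index settings, so treat the result as a starting point for capacity planning and verify it against real memory usage.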

Index Management

Choosing Index Types

Index                      RAM     Query Speed  Recall  Build Speed
───────────────────────────────────────────────────────────────────
Flat                       High    Slow         100%    Fast
HNSW                       High    Very fast    95-99%  Slow
IVF                        Medium  Fast         90-98%  Medium
PQ (Product Quantization)  Low     Fast         85-95%  Slow
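The recall column compares an approximate index against exact (flat) search. A minimal sketch of that measurement in plain Python follows; `flat_search` and `recall_at_k` are illustrative helper names, and the dot product stands in for whatever similarity metric your index uses:

```python
def flat_search(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Exact search: score every vector by dot product, return the top-k indices."""
    scored = [
        (sum(q * v for q, v in zip(query, vector)), index)
        for index, vector in enumerate(vectors)
    ]
    scored.sort(reverse=True)
    return [index for _, index in scored[:k]]


def recall_at_k(exact_ids: list[int], approx_ids: list[int], k: int) -> float:
    """Fraction of the exact top-k results that the approximate index also returned."""
    exact = set(exact_ids[:k])
    if not exact:
        return 0.0
    return len(exact & set(approx_ids[:k])) / len(exact)


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
ground_truth = flat_search([1.0, 0.0], vectors, k=2)
# approx_ids would come from the HNSW/IVF/PQ index under test; hard-coded here.
recall = recall_at_k(ground_truth, approx_ids=[0, 1], k=2)
```

Averaging this value over a representative set of queries gives the recall figure to weigh against RAM and query speed.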

Index Tuning

# HNSW parameters
hnsw_config = {
    "m": 16,              # Connections per node (8-64)
    "ef_construction": 200, # Build quality (100-500)
    "ef_search": 100       # Query quality (50-500)
}
# Higher values = better quality, but slower and more RAM
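Because higher values cost latency and RAM, one common approach is to pick the smallest `ef_search` that still meets a recall target. The sketch below assumes a hypothetical `measure_recall` callable that runs your evaluation queries at a given `ef_search` and returns the average recall; the sample measurements are made up for illustration:

```python
from typing import Callable, Iterable, Optional


def pick_ef_search(
    candidates: Iterable[int],
    measure_recall: Callable[[int], float],
    target_recall: float,
) -> Optional[int]:
    """Return the smallest candidate whose measured recall meets the target."""
    for ef_search in sorted(candidates):
        if measure_recall(ef_search) >= target_recall:
            return ef_search
    return None  # no candidate reached the target; consider raising m or ef_construction


# Made-up measurements standing in for a real evaluation run.
sample_measurements = {50: 0.90, 100: 0.96, 200: 0.99}
chosen = pick_ef_search(sample_measurements, sample_measurements.__getitem__, target_recall=0.95)
```

`ef_search` can usually be changed at query time, while `m` and `ef_construction` require rebuilding the index, so tune the query-time parameter first.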

Incremental Updates

New documents must be integrated into the index without downtime:

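from typing import Iterator


def chunk_list(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class IncrementalIndexer:
    def __init__(self, vectorstore, embedder, splitter):
        self.vectorstore = vectorstore
        self.embedder = embedder
        self.splitter = splitter      # text splitter used in process_new_documents
        self.processed_docs = set()   # in-memory only; persist it to survive restarts

    async def process_new_documents(self, documents: list):
        # Skip documents that were already indexed
        new_docs = [d for d in documents if d.id not in self.processed_docs]
        if not new_docs:
            return

        # Create chunks
        chunks = self.splitter.split_documents(new_docs)

        # Process in batches (avoids memory spikes)
        for batch in chunk_list(chunks, batch_size=100):
            await self.vectorstore.aadd_documents(batch)

        # Mark as processed only after every batch was written
        self.processed_docs.update(d.id for d in new_docs)

    async def delete_outdated(self, doc_ids: list[str]):
        for doc_id in doc_ids:
            # Filter-based deletion depends on the vector store; some only accept IDs
            await self.vectorstore.adelete(filter={"source_id": doc_id})
            self.processed_docs.discard(doc_id)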

Update Strategies

Strategy    Description                     Latency
─────────────────────────────────────────────────────────
Real-time   Immediate update on change      Seconds
Batch       Periodic update (e.g., hourly)  Minutes
Blue-green  Build new index, then switch    Zero downtime
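The blue-green strategy can be sketched as a small router that keeps serving the old index until the new one passes validation. `BlueGreenRouter` and the `validate` callback are hypothetical names; many vector databases offer collection aliases that play the same role:

```python
from typing import Any, Callable


class BlueGreenRouter:
    """Serves queries from the active index and switches only after validation."""

    def __init__(self, name: str, index: Any):
        self._indexes = {name: index}
        self._active = name

    @property
    def active_name(self) -> str:
        return self._active

    @property
    def active_index(self) -> Any:
        return self._indexes[self._active]

    def deploy(self, name: str, new_index: Any, validate: Callable[[Any], bool]) -> str:
        """Activate new_index if it passes validation; return the previous name."""
        if not validate(new_index):
            raise ValueError(f"index {name!r} failed validation; keeping {self._active!r}")
        self._indexes[name] = new_index
        previous, self._active = self._active, name
        return previous

    def rollback(self, name: str) -> None:
        """Switch back to a previously registered index."""
        if name not in self._indexes:
            raise KeyError(name)
        self._active = name


# Lists stand in for real index handles in this example.
router = BlueGreenRouter("blue", ["doc-v1"])
previous = router.deploy("green", ["doc-v1", "doc-v2"], validate=lambda index: len(index) > 0)
```

Keeping the previous index registered until the new one has proven itself makes the rollback a single switch, at the cost of temporarily holding two indexes.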

Cost Optimization

Reduce Embedding Costs

Strategy                               Savings
─────────────────────────────────────────────
Smaller model (3-small instead of 3-large)  ~60%
Caching identical documents                 ~20-40%
Batch processing instead of single calls    ~10%
Open-source model (self-hosted)             ~80-90%
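Caching and batching can be combined in a thin wrapper around the embedding call. This is a sketch: `embed_batch` is a placeholder for whatever function sends a list of texts to your embedding model and returns one vector per text, and the in-memory dict would typically be replaced by a persistent store:

```python
import hashlib
from typing import Callable


class CachedEmbedder:
    """Embeds each distinct text once and sends only cache misses, in one batch."""

    def __init__(self, embed_batch: Callable[[list[str]], list[list[float]]]):
        self._embed_batch = embed_batch
        self._cache: dict[str, list[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: list[str]) -> list[list[float]]:
        keys = [self._key(text) for text in texts]
        # Collect unseen texts, deduplicated within this call as well.
        missing: dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._cache and key not in missing:
                missing[key] = text
        if missing:
            vectors = self._embed_batch(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self._cache[key] = vector
        return [self._cache[key] for key in keys]


# Fake embedding function that records what was actually sent to the model.
sent_batches: list[list[str]] = []

def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    sent_batches.append(texts)
    return [[float(len(text))] for text in texts]

embedder = CachedEmbedder(fake_embed_batch)
first = embedder.embed(["alpha", "be", "alpha"])
second = embedder.embed(["alpha", "gamma"])
```

The cache key must change whenever the embedding model changes, for example by including the model name in the hashed string.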

Optimize Inference Costs

  • Smaller context: Only top-3 instead of top-10 chunks to LLM
  • Model routing: Simple questions → cheaper model
  • Response caching: Cache frequent questions
  • Streaming: Reduces time-to-first-token (improves perceived latency; it does not lower token cost)
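Model routing can start as a simple heuristic. The sketch below is an assumption-heavy illustration: the model names `small-model` and `large-model`, the word limit, and the list of marker words are all placeholders to replace with values tuned on your own traffic:

```python
import string

COMPLEX_MARKERS = ("compare", "why", "explain", "analyze", "summarize")


def route_model(question: str, max_simple_words: int = 12) -> str:
    """Send short, factual questions to the cheaper model, everything else to the larger one."""
    words = [word.strip(string.punctuation) for word in question.lower().split()]
    is_long = len(words) > max_simple_words
    has_marker = any(marker in words for marker in COMPLEX_MARKERS)
    return "large-model" if is_long or has_marker else "small-model"


simple_route = route_model("What is the refund period?")
complex_route = route_model("Compare the 2023 and 2024 policies and explain the differences.")
```

Log which route each query took together with its quality scores, so the heuristic can later be replaced by a classifier if it misroutes too often.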

Monitoring

Metrics

Metric                  Target   Alert
─────────────────────────────────────────
Retrieval latency P95   < 200ms  > 500ms
End-to-end latency P95  < 3s     > 5s
Retrieval relevance     > 0.85   < 0.75
Faithfulness            > 0.90   < 0.80
Error rate              < 1%     > 5%
Token usage/query       < 5000   > 10000
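A sketch of how the alert thresholds above might be checked in code. The metric keys are made-up names, P95 is computed with the nearest-rank method, and the error rate is expressed as a fraction (5% = 0.05):

```python
import math

# Alert thresholds from the metrics table: (direction, threshold).
ALERT_RULES = {
    "retrieval_latency_p95_ms": ("above", 500),
    "end_to_end_latency_p95_s": ("above", 5),
    "retrieval_relevance": ("below", 0.75),
    "faithfulness": ("below", 0.80),
    "error_rate": ("above", 0.05),
    "tokens_per_query": ("above", 10000),
}


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def triggered_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of all metrics that crossed their alert threshold."""
    alerts = []
    for name, (direction, threshold) in ALERT_RULES.items():
        if name not in metrics:
            continue
        value = metrics[name]
        breached = value > threshold if direction == "above" else value < threshold
        if breached:
            alerts.append(name)
    return alerts


latencies_ms = list(range(10, 1010, 10))  # 100 sample measurements: 10, 20, ..., 1000
p95 = percentile(latencies_ms, 95)
alerts = triggered_alerts({"retrieval_latency_p95_ms": p95, "faithfulness": 0.92})
```

In practice these checks usually live in the monitoring system itself; the point of the sketch is that each table row maps to one direction and one threshold.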

Disaster Recovery

Backup Strategy

  • Vector store: Daily snapshots, weekly full backups
  • Raw documents: Always keep a copy of source documents
  • Index: Ensure re-build capability from raw documents
  • Config: Document embedding model version, chunk parameters
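One way to document the configuration is a small manifest stored next to every snapshot. The field names and example values below are assumptions for illustration; record whatever parameters determine how your vectors were produced:

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IndexManifest:
    embedding_model: str
    embedding_dimensions: int
    chunk_size: int
    chunk_overlap: int


def manifest_to_json(manifest: IndexManifest) -> str:
    """Serialize the manifest so it can be stored alongside the snapshot."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)


def manifest_from_json(text: str) -> IndexManifest:
    return IndexManifest(**json.loads(text))


def can_reuse_snapshot(stored: IndexManifest, current: IndexManifest) -> bool:
    """A snapshot is only usable if it was built with the same model and chunking."""
    return stored == current


# Example values only.
current = IndexManifest("text-embedding-3-small", 1536, chunk_size=512, chunk_overlap=64)
restored = manifest_from_json(manifest_to_json(current))
changed = IndexManifest("text-embedding-3-large", 3072, chunk_size=512, chunk_overlap=64)
```

If the stored manifest does not match the running configuration, query vectors and stored vectors are not comparable and the index has to be rebuilt from the raw documents.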

Recovery Procedure

1. Restore vector store from snapshot (fast)
2. If no snapshot: Re-embed from raw documents (hours)
3. Validation: Test golden dataset against restored pipeline
4. Monitoring: Closely monitor metrics in the first 24 hours
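The validation step can be sketched as a hit-rate check over a golden dataset. It assumes a hypothetical `retrieve(question, k)` function that returns the source IDs of the top-k chunks, and golden cases with `question` and `expected_source_id` keys; the 0.9 threshold is a placeholder:

```python
from typing import Callable


def validate_pipeline(
    retrieve: Callable[[str, int], list[str]],
    golden_set: list[dict],
    k: int = 5,
    min_hit_rate: float = 0.9,
) -> tuple[bool, float, list[str]]:
    """Return (passed, hit_rate, failed_questions) for the restored pipeline."""
    failures = []
    for case in golden_set:
        retrieved_ids = retrieve(case["question"], k)
        if case["expected_source_id"] not in retrieved_ids:
            failures.append(case["question"])
    hit_rate = 1 - len(failures) / len(golden_set) if golden_set else 0.0
    return hit_rate >= min_hit_rate, hit_rate, failures


# Fake retriever standing in for the restored pipeline.
fake_results = {
    "How long is the refund period?": ["policy-7", "faq-2"],
    "Who approves travel expenses?": ["faq-9"],
}
golden_set = [
    {"question": "How long is the refund period?", "expected_source_id": "policy-7"},
    {"question": "Who approves travel expenses?", "expected_source_id": "hr-3"},
]
passed, hit_rate, failures = validate_pipeline(
    lambda question, k: fake_results[question][:k], golden_set
)
```

Running the same check before the incident gives a baseline, so a restored pipeline can be compared against known-good numbers instead of an arbitrary threshold.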

Practical tip: Plan for blue-green deployment from the start. It enables index updates and model switches without downtime. Always keep the source documents: a vector index can be rebuilt from them, but lost source data cannot be recovered from the index.

Quiz

Question 1 of 3

What is the main advantage of the blue-green deployment pattern for RAG pipelines?