Production Deployment

A RAG pipeline in a notebook is a prototype. Production adds further requirements: scaling, index management, incremental updates, cost optimization, monitoring, and disaster recovery.

Scaling Vector Databases

Horizontal Scaling

Strategy	Description	When to use
Sharding	Distribute data across multiple nodes	>10M vectors
Replicas	Read replicas for higher throughput	Many concurrent queries
Tiered storage	Hot/warm/cold storage by access frequency	Cost optimization
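
To make the sharding row concrete, here is a minimal sketch of hash-based shard routing. It assumes you route documents yourself; many managed vector databases shard internally, in which case this is not needed. The function names `shard_for` and `route` are hypothetical.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard index using a stable hash."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # The first 8 bytes give a well-distributed integer that is identical
    # across processes (unlike Python's built-in hash()).
    return int.from_bytes(digest[:8], "big") % num_shards


def route(doc_ids: list[str], num_shards: int) -> dict[int, list[str]]:
    """Group document IDs by the shard that should store them."""
    shards: dict[int, list[str]] = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards
```

Note that plain modulo routing moves most documents when the shard count changes; consistent hashing is the usual alternative if you expect to add nodes often.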

Sizing Guide

Vectors × Dimensions × 4 bytes (float32) = RAM requirement (minimum)

Example:
1M vectors × 1536 dimensions × 4 bytes = ~6 GB RAM
+ HNSW index overhead (~2x) = ~12 GB RAM
+ Safety margin (1.5x) = ~18 GB RAM recommended
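
The same arithmetic as a small helper, so you can plug in your own numbers. The overhead and margin factors are the rough multipliers from the example above, not measured values; `estimate_ram_gb` is a hypothetical name.

```python
def estimate_ram_gb(
    num_vectors: int,
    dimensions: int,
    bytes_per_value: int = 4,     # float32
    index_overhead: float = 2.0,  # rough HNSW multiplier from the example
    safety_margin: float = 1.5,
) -> dict[str, float]:
    """Return raw, indexed, and recommended RAM in GB (1 GB = 1e9 bytes)."""
    raw = num_vectors * dimensions * bytes_per_value / 1e9
    with_index = raw * index_overhead
    recommended = with_index * safety_margin
    return {
        "raw_gb": raw,
        "with_index_gb": with_index,
        "recommended_gb": recommended,
    }


estimate = estimate_ram_gb(1_000_000, 1536)
print({name: round(value, 1) for name, value in estimate.items()})
```

For 1M vectors at 1536 dimensions this gives roughly 6.1, 12.3, and 18.4 GB, matching the rounded figures above.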

Index Management

Choosing Index Types

Index	RAM	Query Speed	Recall	Build Speed
Flat	High	Slow	100%	Fast
HNSW	High	Very fast	95-99%	Slow
IVF	Medium	Fast	90-98%	Medium
PQ (Product Quantization)	Low	Fast	85-95%	Slow

Index Tuning

# HNSW parameters
hnsw_config = {
    "m": 16,              # Connections per node (8-64)
    "ef_construction": 200, # Build quality (100-500)
    "ef_search": 100       # Query quality (50-500)
}
# Higher values = better recall, but slower builds and queries; a higher "m" also uses more RAM
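
Tuning only makes sense if you measure recall against exact (flat) search results. A minimal sketch, assuming you have already collected the approximate and exact result IDs per query; both function names are hypothetical.

```python
def recall_at_k(approx_ids: list[list], exact_ids: list[list], k: int) -> float:
    """Fraction of the exact top-k neighbors that the ANN index also returned."""
    if k <= 0 or not exact_ids or len(approx_ids) != len(exact_ids):
        raise ValueError("need k > 0 and equally long, non-empty result lists")
    hits = 0
    for approx, exact in zip(approx_ids, exact_ids):
        hits += len(set(approx[:k]) & set(exact[:k]))
    return hits / (k * len(exact_ids))


def smallest_ef_meeting_target(measured: dict[int, float], target: float):
    """Pick the cheapest ef_search whose measured recall reaches the target."""
    for ef_search in sorted(measured):
        if measured[ef_search] >= target:
            return ef_search
    return None  # no tested value was good enough
```

Run the same query set at several `ef_search` values, record recall for each, and keep the smallest value that meets your target.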

Incremental Updates

New documents must be integrated into the index without downtime:

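from itertools import islice


def chunk_list(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


class IncrementalIndexer:
    def __init__(self, vectorstore, embedder, splitter):
        self.vectorstore = vectorstore
        self.embedder = embedder
        self.splitter = splitter  # text splitter exposing split_documents()
        # In-memory only: persist this set if the indexer must survive restarts
        self.processed_docs = set()

    async def process_new_documents(self, documents: list):
        # Skip documents that were already indexed
        new_docs = [d for d in documents if d.id not in self.processed_docs]
        if not new_docs:
            return

        # Create chunks
        chunks = self.splitter.split_documents(new_docs)

        # Process in batches (avoids memory spikes)
        for batch in chunk_list(chunks, batch_size=100):
            await self.vectorstore.aadd_documents(batch)

        self.processed_docs.update(d.id for d in new_docs)

    async def delete_outdated(self, doc_ids: list[str]):
        # Assumes the store supports deletion by metadata filter and that each
        # chunk carries a "source_id" metadata field; not every backend does.
        for doc_id in doc_ids:
            await self.vectorstore.adelete(filter={"source_id": doc_id})
            self.processed_docs.discard(doc_id)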

Update Strategies

Strategy	Description	Latency
Real-time	Immediate update on change	Seconds
Batch	Periodic update (e.g., hourly)	Minutes
Blue-green	Build new index, then switch	Full rebuild time; no downtime at the switch
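
The blue-green row can be sketched as an alias that points at the live index and only moves after the new index passes validation. This is an in-memory illustration with a hypothetical `IndexAlias` class; real vector databases typically offer their own alias or collection-swap mechanism.

```python
from typing import Any, Callable


class IndexAlias:
    """Points queries at the live index; switches only after validation."""

    def __init__(self, live: Any):
        self.live = live
        self.previous = None  # kept so a bad switch can be undone

    def switch(self, candidate: Any, validate: Callable[[Any], bool]) -> bool:
        """Promote the candidate index if validation passes."""
        if not validate(candidate):
            return False  # live index stays untouched
        self.previous, self.live = self.live, candidate
        return True

    def rollback(self) -> bool:
        """Return to the previously live index, if there is one."""
        if self.previous is None:
            return False
        self.live, self.previous = self.previous, None
        return True
```

Queries always read `alias.live`, so the switch is a single reference change and the old index stays available for rollback until you drop it.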

Cost Optimization

Reduce Embedding Costs

Strategy                               Savings
─────────────────────────────────────────────
Smaller model (3-small instead of 3-large)  ~60%
Caching identical documents                 ~20-40%
Batch processing instead of single calls    ~10%
Open-source model (self-hosted)             ~80-90%
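
Caching identical documents can be as simple as keying embeddings by a content hash. A minimal sketch, assuming `embed_fn` is any function that takes a list of texts and returns one vector per text; `CachedEmbedder` is a hypothetical name and the cache here is in-memory only.

```python
import hashlib
from typing import Callable


class CachedEmbedder:
    """Wraps an embedding function and skips texts it has already embedded."""

    def __init__(self, embed_fn: Callable[[list[str]], list[list[float]]]):
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: list[str]) -> list[list[float]]:
        keys = [self._key(t) for t in texts]
        # Collect each uncached text once, preserving order
        missing: dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._cache and key not in missing:
                missing[key] = text
        if missing:
            # One batched call for everything that is new
            vectors = self._embed_fn(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self._cache[key] = vector
        return [self._cache[key] for key in keys]
```

This also batches the uncached texts into one call, which covers the batch-processing row. In production the dictionary would usually be replaced by a persistent store, and the cache key should include the embedding model name so a model switch does not return stale vectors.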

Optimize Inference Costs

Smaller context: Only top-3 instead of top-10 chunks to LLM
Model routing: Simple questions → cheaper model
Response caching: Cache frequent questions
Streaming: Lowers time-to-first-token (improves perceived latency, not token cost)
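
Model routing and response caching from the list above, as a minimal sketch. The word-count heuristic and the model names `cheap-model` and `strong-model` are placeholders; a real router would more likely use a classifier or retrieval confidence.

```python
from typing import Callable


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a cache key."""
    return " ".join(question.lower().split())


def choose_model(question: str, max_simple_words: int = 12) -> str:
    """Send short questions to the cheaper model (placeholder heuristic)."""
    if len(normalize(question).split()) <= max_simple_words:
        return "cheap-model"
    return "strong-model"


class ResponseCache:
    """Returns a stored answer for questions that were already answered."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def get_or_compute(self, question: str, compute: Callable[[str], str]) -> str:
        key = normalize(question)
        if key not in self._store:
            self._store[key] = compute(question)  # only pay for the first call
        return self._store[key]
```

Cached answers go stale when the index changes, so clear or version the cache whenever documents are updated.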

Monitoring

Metrics

Metric	Target	Alert
Retrieval latency P95	< 200ms	> 500ms
End-to-end latency P95	< 3s	> 5s
Retrieval relevance	> 0.85	< 0.75
Faithfulness	> 0.90	< 0.80
Error rate	< 1%	> 5%
Token usage/query	< 5000	> 10000
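
The alert column above translates directly into a threshold check. A minimal sketch using a nearest-rank P95; the metric key names are hypothetical and the error rate is expressed as a fraction (5% = 0.05).

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


# (direction, threshold) pairs taken from the Alert column
ALERTS = {
    "retrieval_latency_p95_ms": ("above", 500),
    "end_to_end_latency_p95_s": ("above", 5),
    "retrieval_relevance": ("below", 0.75),
    "faithfulness": ("below", 0.80),
    "error_rate": ("above", 0.05),
    "tokens_per_query": ("above", 10000),
}


def check_alerts(metrics: dict[str, float]) -> list[str]:
    """Return the names of all metrics that crossed their alert threshold."""
    fired = []
    for name, (direction, threshold) in ALERTS.items():
        if name not in metrics:
            continue
        value = metrics[name]
        if direction == "above" and value > threshold:
            fired.append(name)
        elif direction == "below" and value < threshold:
            fired.append(name)
    return fired
```

Values between target and alert threshold (for example a P95 retrieval latency of 300 ms) do not fire; treat that band as a warning zone to watch.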

Disaster Recovery

Backup Strategy

Vector store: Daily snapshots, weekly full backups
Raw documents: Always keep a copy of source documents
Index: Ensure re-build capability from raw documents
Config: Document embedding model version, chunk parameters
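
One way to document the config is a small manifest stored next to each snapshot. A minimal sketch; the field names and the example values are assumptions, so adapt them to your pipeline.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IndexManifest:
    """Everything needed to rebuild an index that matches the snapshot."""

    embedding_model: str
    embedding_dimensions: int
    chunk_size: int
    chunk_overlap: int

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "IndexManifest":
        return cls(**json.loads(text))


# Example values only
manifest = IndexManifest(
    embedding_model="text-embedding-3-small",
    embedding_dimensions=1536,
    chunk_size=800,
    chunk_overlap=100,
)
restored = IndexManifest.from_json(manifest.to_json())
```

Comparing the stored manifest with the running configuration before a restore catches the common failure where queries are embedded with a different model than the index.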

Recovery Procedure

1. Restore vector store from snapshot (fast)
2. If no snapshot: Re-embed from raw documents (hours)
3. Validation: Test golden dataset against restored pipeline
4. Monitoring: Closely monitor metrics in the first 24 hours
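
Step 3 can be automated with a hit-rate check over the golden dataset. A minimal sketch, assuming the golden set is a list of (question, expected source ID) pairs and `retrieve` returns the source IDs of the top-k chunks; both the function name and the 0.9 threshold are placeholders.

```python
from typing import Callable


def validate_pipeline(
    retrieve: Callable[[str, int], list[str]],
    golden: list[tuple[str, str]],
    k: int = 5,
    min_hit_rate: float = 0.9,
) -> tuple[bool, float, list[str]]:
    """Check that each golden question still retrieves its expected source."""
    if not golden:
        raise ValueError("golden dataset is empty")
    failures = []
    for question, expected_source in golden:
        if expected_source not in retrieve(question, k):
            failures.append(question)
    hit_rate = 1 - len(failures) / len(golden)
    return hit_rate >= min_hit_rate, hit_rate, failures
```

The returned list of failed questions gives you a starting point for debugging before the restored pipeline takes traffic.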

Practical tip: Plan for blue-green deployment from the start. It enables index updates and model switches without downtime. Always keep the source documents: a vector index is reproducible, the original data is not.