A RAG pipeline in a notebook is a prototype. For production you also need scaling, index management, incremental updates, cost optimization, monitoring, and disaster recovery.
| Strategy | Description | When to use |
|---|---|---|
| Sharding | Distribute data across multiple nodes | >10M vectors |
| Replicas | Read replicas for higher throughput | Many concurrent queries |
| Tiered storage | Hot/warm/cold storage by access frequency | Cost optimization |
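
As a rough illustration of the sharding row, the sketch below routes each vector ID to a shard with a stable hash. The function name `shard_for`, the ID format, and the shard count are assumptions made for this example; managed vector databases usually handle this routing internally.

```python
import hashlib


def shard_for(vector_id: str, num_shards: int) -> int:
    """Map a vector ID to a shard index in [0, num_shards).

    Uses SHA-256 rather than Python's built-in hash(), which is
    randomized per process for strings and would route the same ID
    differently on different nodes.
    """
    digest = hashlib.sha256(vector_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical IDs, grouped by the shard they would be written to
ids = [f"doc-{i}" for i in range(1000)]
shards: dict[int, list[str]] = {}
for vid in ids:
    shards.setdefault(shard_for(vid, 4), []).append(vid)
```

Plain modulo routing remaps most IDs when the shard count changes, so a cluster that is expected to grow may be better served by consistent hashing.
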
Vectors × Dimensions × 4 bytes = RAM requirement (minimum)
Example:
1M vectors × 1536 dimensions × 4 bytes = ~6 GB RAM
+ HNSW index overhead (~2x) = ~12 GB RAM
+ Safety margin (1.5x) = ~18 GB RAM recommended
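
The same arithmetic can be wrapped in a small helper for capacity planning. This is a minimal sketch: the function name and return shape are assumptions, and the 2x index overhead and 1.5x margin are the rule-of-thumb factors from the example above, not measured values.

```python
def estimate_ram_gb(
    num_vectors: int,
    dimensions: int,
    bytes_per_value: int = 4,      # float32
    index_overhead: float = 2.0,   # rule-of-thumb HNSW factor from the text
    safety_margin: float = 1.5,
) -> dict[str, float]:
    """Return raw, indexed, and recommended RAM in GB (1 GB = 10**9 bytes)."""
    raw = num_vectors * dimensions * bytes_per_value / 1e9
    indexed = raw * index_overhead
    return {"raw": raw, "indexed": indexed, "recommended": indexed * safety_margin}


estimate = estimate_ram_gb(1_000_000, 1536)
print({name: round(gb, 1) for name, gb in estimate.items()})
# {'raw': 6.1, 'indexed': 12.3, 'recommended': 18.4}
```
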
| Index | RAM | Query Speed | Recall | Build Speed |
|---|---|---|---|---|
| Flat | High | Slow | 100% | Fast |
| HNSW | High | Very fast | 95-99% | Slow |
| IVF | Medium | Fast | 90-98% | Medium |
| PQ (Product Quantization) | Low | Fast | 85-95% | Slow |
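
The recall column is measured against exact (flat) search. The sketch below shows one way to compute recall@k in pure Python. The "approximate" index here is only a stand-in that searches half of the data; it is not an implementation of HNSW, IVF, or PQ, and all names are chosen for illustration.

```python
import random


def flat_search(query, vectors, k):
    """Exact k-nearest-neighbour search: compare the query with every vector."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(query, v))
    ranked = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbours that the approximate index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(200)]
query = [random.random() for _ in range(8)]

exact = flat_search(query, vectors, k=10)
# Stand-in for an approximate index: it only searches the first half of the
# data, so it can miss true neighbours that live in the second half.
approx = flat_search(query, vectors[:100], k=10)
print(f"recall@10 = {recall_at_k(approx, exact):.2f}")
```

Running the same measurement with a real index over a sample of production queries gives a recall figure to compare against the ranges in the table.
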
# HNSW parameters
hnsw_config = {
    "m": 16,                  # Connections per node (8-64)
    "ef_construction": 200,   # Build quality (100-500)
    "ef_search": 100,         # Query quality (50-500)
}
# Higher values improve recall but cost more: m raises RAM use and build time,
# ef_construction slows index builds, ef_search slows queries.
New documents must be integrated into the index without downtime:
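def chunk_list(items: list, batch_size: int):
    """Yield consecutive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class IncrementalIndexer:
    def __init__(self, vectorstore, embedder, splitter):
        self.vectorstore = vectorstore
        self.embedder = embedder
        self.splitter = splitter  # needed by process_new_documents
        self.processed_docs = set()

    async def process_new_documents(self, documents: list):
        # Skip documents whose ID has already been indexed
        new_docs = [d for d in documents if d.id not in self.processed_docs]
        if not new_docs:
            return

        # Create chunks
        chunks = self.splitter.split_documents(new_docs)

        # Process in batches (avoids memory spikes)
        for batch in chunk_list(chunks, batch_size=100):
            await self.vectorstore.aadd_documents(batch)

        self.processed_docs.update(d.id for d in new_docs)

    async def delete_outdated(self, doc_ids: list[str]):
        for doc_id in doc_ids:
            # Deleting by metadata filter is not supported by every vector store
            await self.vectorstore.adelete(filter={"source_id": doc_id})
            self.processed_docs.discard(doc_id)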
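
The ID-based check above skips a document whose content changed under the same ID. One possible extension, sketched below, tracks a content hash per document ID so that changed documents can be deleted and re-indexed. `classify_documents` and the `(doc_id, text)` tuple shape are assumptions for illustration, not part of the indexer above.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify_documents(documents, known_hashes):
    """Split (doc_id, text) pairs into new, changed, and unchanged IDs.

    known_hashes maps doc_id -> content hash from the previous indexing run.
    """
    new, changed, unchanged = [], [], []
    for doc_id, text in documents:
        digest = content_hash(text)
        if doc_id not in known_hashes:
            new.append(doc_id)
        elif known_hashes[doc_id] != digest:
            changed.append(doc_id)  # delete old chunks, then re-index
        else:
            unchanged.append(doc_id)
    return new, changed, unchanged


known = {"a": content_hash("alpha"), "b": content_hash("beta")}
docs = [("a", "alpha"), ("b", "beta v2"), ("c", "gamma")]
print(classify_documents(docs, known))  # (['c'], ['b'], ['a'])
```
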
| Strategy | Description | Update latency |
|---|---|---|
| Real-time | Immediate update on change | Seconds |
| Batch | Periodic update (e.g., hourly) | Minutes |
| Blue-green | Build new index, then switch | Full rebuild time, with zero downtime at the switch |
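
A blue-green switch can be sketched as a wrapper that holds a reference to the live index and swaps it only after the new index passes validation. This is a simplified in-process illustration; `BlueGreenIndex`, `ToyIndex`, and the `validate` callback are hypothetical names, and many vector databases offer an alias or collection-swap feature that plays the same role.

```python
import threading


class BlueGreenIndex:
    """Serve queries from a live index while a replacement is built."""

    def __init__(self, initial_index):
        self._live = initial_index
        self._lock = threading.Lock()

    def search(self, query):
        with self._lock:
            index = self._live  # take a reference; the search runs outside the lock
        return index.search(query)

    def switch(self, new_index, validate) -> bool:
        """Promote new_index only if validate(new_index) returns True."""
        if not validate(new_index):
            return False  # keep serving the old index
        with self._lock:
            self._live = new_index
        return True


class ToyIndex:
    """Hypothetical stand-in for a real vector index (substring match)."""

    def __init__(self, docs):
        self.docs = docs

    def search(self, query):
        return [d for d in self.docs if query in d]


serving = BlueGreenIndex(ToyIndex(["rag basics"]))
green = ToyIndex(["rag basics", "rag in production"])
switched = serving.switch(green, validate=lambda idx: len(idx.search("rag")) >= 2)
print(switched, serving.search("production"))  # True ['rag in production']
```

Queries that started before the switch finish against the old index, which is why the old index should only be discarded once they have drained.
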
| Strategy | Savings |
|---|---|
| Smaller embedding model (3-small instead of 3-large) | ~60% |
| Caching identical documents | ~20-40% |
| Batch processing instead of single calls | ~10% |
| Open-source model (self-hosted) | ~80-90% |
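
The caching and batching strategies can be combined in a thin wrapper around the embedding call. The sketch below is an assumption-laden illustration: `CachedEmbedder` is a hypothetical class, `fake_embed` stands in for a paid embedding API, and the cache lives in memory rather than in a persistent store.

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function so identical texts are embedded only once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # assumed signature: list[str] -> list[list[float]]
        self.cache = {}
        self.api_calls = 0

    def embed(self, texts):
        keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
        # Collect cache misses, deduplicated, in first-seen order
        missing = {k: t for k, t in zip(keys, texts) if k not in self.cache}
        if missing:
            self.api_calls += 1  # one batched call for all misses
            vectors = self.embed_fn(list(missing.values()))
            self.cache.update(zip(missing.keys(), vectors))
        return [self.cache[k] for k in keys]


def fake_embed(texts):
    """Hypothetical stand-in for a paid embedding API."""
    return [[float(len(t))] for t in texts]


embedder = CachedEmbedder(fake_embed)
embedder.embed(["hello", "world", "hello"])     # one batched call, two texts sent
embedder.embed(["hello"])                       # served from cache, no call
print(embedder.api_calls, len(embedder.cache))  # 1 2
```
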
| Metric | Target | Alert |
|---|---|---|
| Retrieval latency P95 | < 200ms | > 500ms |
| End-to-end latency P95 | < 3s | > 5s |
| Retrieval relevance | > 0.85 | < 0.75 |
| Faithfulness | > 0.90 | < 0.80 |
| Error rate | < 1% | > 5% |
| Token usage/query | < 5000 | > 10000 |
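
The thresholds in the table can be checked programmatically. The following sketch computes a nearest-rank P95 and maps a metric value to a status. The metric keys, the intermediate `warn` state between target and alert, and the sample latencies are assumptions added for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# metric -> (target, alert threshold, higher_is_better), values from the table
THRESHOLDS = {
    "retrieval_latency_p95_ms": (200, 500, False),
    "faithfulness": (0.90, 0.80, True),
    "error_rate": (0.01, 0.05, False),
}


def status(metric, value):
    """Return 'ok', 'warn' (between target and alert), or 'alert'."""
    target, alert, higher_is_better = THRESHOLDS[metric]
    if not higher_is_better:
        # Flip signs so that "larger is better" holds for every metric
        value, target, alert = -value, -target, -alert
    if value < alert:
        return "alert"
    return "ok" if value > target else "warn"


latencies_ms = [120, 150, 180, 210, 650, 140, 160, 170, 190, 130]
p95 = percentile(latencies_ms, 95)
print(p95, status("retrieval_latency_p95_ms", p95))  # 650 alert
print(status("faithfulness", 0.85))                  # warn
```
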
1. Restore vector store from snapshot (fast)
2. If no snapshot: Re-embed from raw documents (hours)
3. Validation: Test golden dataset against restored pipeline
4. Monitoring: Closely monitor metrics in the first 24 hours
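
Step 3 can be automated with a small check that replays golden queries against the restored pipeline. This is a minimal sketch: the `retrieve(query, k)` interface, the toy corpus, and the 0.85 pass threshold are assumptions, and a real golden dataset would be far larger.

```python
def validate_restore(retrieve, golden_set, k=5, min_hit_rate=0.85):
    """Check a restored pipeline against a golden dataset.

    retrieve:   callable (query, k) -> list of document IDs (assumed interface)
    golden_set: list of (query, expected_doc_id) pairs
    Returns (passed, hit_rate).
    """
    hits = sum(1 for query, expected in golden_set if expected in retrieve(query, k))
    hit_rate = hits / len(golden_set)
    return hit_rate >= min_hit_rate, hit_rate


corpus = {
    "doc-1": "reset password",
    "doc-2": "billing cycle",
    "doc-3": "api rate limits",
}


def toy_retrieve(query, k):
    """Hypothetical retriever: rank documents by words shared with the query."""
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(words & set(corpus[d].split())))
    return ranked[:k]


golden = [
    ("how to reset my password", "doc-1"),
    ("what are the api rate limits", "doc-3"),
]
ok, rate = validate_restore(toy_retrieve, golden, k=1)
print(ok, rate)  # True 1.0
```
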
Practical tip: Plan for blue-green deployment from the start. It enables index updates and model switches without downtime. Always keep the source documents: a vector index can be rebuilt from them, but lost source data cannot be recovered from the index.
What is the main advantage of the blue-green deployment pattern for RAG pipelines?