Production Deployment

Getting from a Jupyter notebook to a production API takes more than wrapping a chain in an endpoint. This section covers LangServe and its FastAPI integration, streaming, error handling, scaling, and cost control, so that you can deploy LangChain applications that are ready for production.

LangServe

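LangServe exposes any LangChain runnable (for example a chain or an agent) as a REST API on top of FastAPI:

from fastapi import FastAPI
from langserve import add_routes

# rag_chain and agent_chain are assumed to be runnables defined elsewhere in your project.
app = FastAPI(title="Agent API")

# Each call mounts a full set of endpoints under the given path.
add_routes(app, rag_chain, path="/rag")
add_routes(app, agent_chain, path="/agent")

# Start the server with: uvicorn main:app --port 8000 (assuming this file is main.py)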

Automatically Generated Endpoints

Endpoint	Method	Description
`/rag/invoke`	POST	Synchronous call
`/rag/stream`	POST	Server-Sent Events streaming
`/rag/batch`	POST	Batch processing
`/rag/input_schema`	GET	Input schema (JSON Schema)
`/rag/playground`	GET	Interactive playground
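
To make the endpoint table concrete, here is a minimal client sketch that uses only the Python standard library. It assumes the server runs on http://localhost:8000 and that the chain accepts a dict with a `question` key; both are assumptions for illustration. The invoke endpoint expects the chain input wrapped in an `input` field and returns the result in an `output` field. The helper names `build_invoke_request` and `invoke` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local development server


def build_invoke_request(path: str, chain_input: dict) -> urllib.request.Request:
    """Build a POST request for the /invoke endpoint mounted under `path`."""
    body = json.dumps({"input": chain_input}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}{path}/invoke",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(path: str, chain_input: dict, timeout: float = 30.0):
    """Send the request and return the chain's output field."""
    request = build_invoke_request(path, chain_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["output"]


# Usage (requires the server to be running):
# answer = invoke("/rag", {"question": "What is RAG?"})
```

In practice the `RemoteRunnable` client is usually more convenient, but a raw HTTP call like this shows what non-Python clients need to send.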

Streaming

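Streaming is critical for a good user experience, because users see the first tokens immediately instead of waiting for the complete answer. Every route added with add_routes gets a `/stream` endpoint (plus `/stream_log` for intermediate steps), and RemoteRunnable consumes it on the client side: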

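# Server side: every route added with add_routes also gets a /stream endpoint.
from langserve import add_routes
add_routes(app, chain, path="/chat")

# Client side: run this in a separate process while the server is up.
import asyncio
from langserve import RemoteRunnable

remote = RemoteRunnable("http://localhost:8000/chat")

async def main():
    # astream yields chunks as soon as the server produces them.
    async for chunk in remote.astream({"question": "What is RAG?"}):
        print(chunk, end="", flush=True)

asyncio.run(main())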
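
For clients that cannot use RemoteRunnable, the stream can be read as plain Server-Sent Events. The following is a minimal parsing sketch, not LangServe's own client code. It assumes the server sends JSON payloads in events named `data` and signals completion with an event named `end`; treat those event names as assumptions and check them against your server's actual output. `parse_sse` and `collect_chunks` are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                          # a blank line ends one event
            if data_lines or event != "message":
                yield event, "\n".join(data_lines)
            event, data_lines = "message", []
        elif line.startswith(":"):              # comment or keep-alive line
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data_lines.append(value[1:] if value.startswith(" ") else value)
    if data_lines or event != "message":        # stream ended without a blank line
        yield event, "\n".join(data_lines)


def collect_chunks(lines: Iterable[str]) -> List[object]:
    """Decode the JSON payload of every `data` event until `end` arrives."""
    chunks = []
    for event, data in parse_sse(lines):
        if event == "end":
            break
        if event == "data":
            chunks.append(json.loads(data))
    return chunks


# Hand-written sample stream, so the sketch runs without a server.
sample = [
    "event: data\n", 'data: "RAG "\n', "\n",
    "event: data\n", 'data: "combines retrieval and generation."\n', "\n",
    "event: end\n", "\n",
]
print("".join(collect_chunks(sample)))  # RAG combines retrieval and generation.
```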

Error Handling

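Retry and Fallback Strategy

from openai import RateLimitError  # assumption: the primary chain calls an OpenAI model

# Retry transient errors up to 3 times, then hand over to the fallback chain.
chain_with_fallback = primary_chain.with_retry(
    retry_if_exception_type=(TimeoutError, RateLimitError),
    stop_after_attempt=3,
).with_fallbacks(
    [fallback_chain],
    exceptions_to_handle=(TimeoutError, RateLimitError),
)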
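
To show the mechanics behind this pattern without any LangChain dependency, here is a plain-Python sketch that retries a primary callable with exponential backoff and then tries each fallback once. The function name, the default delays, and the injectable `sleep` parameter are all illustrative assumptions, not part of any library.

```python
import time
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry_and_fallback(
    primary: Callable[[], T],
    fallbacks: Sequence[Callable[[], T]],
    handled: Tuple[Type[BaseException], ...] = (TimeoutError,),
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `primary` with exponential backoff, then try each fallback once."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return primary()
        except handled as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    for fallback in fallbacks:
        try:
            return fallback()
        except handled as error:
            last_error = error
    raise last_error  # every option failed with a handled error


# Demo: a primary that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"count": 0}
delays = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("slow upstream")
    return "primary answer"


def always_failing():
    raise TimeoutError("upstream down")


result = call_with_retry_and_fallback(flaky, [], sleep=delays.append)
print(result, delays)  # primary answer [0.5, 1.0]
```

Only the exception types listed in `handled` trigger a retry or fallback; anything else propagates immediately, which keeps programming errors visible.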

Circuit Breaker

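A circuit breaker temporarily stops sending requests to a failing dependency (for example the LLM provider), so that repeated failures do not pile up and the dependency gets time to recover: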

5 errors in 60 seconds → Open circuit
Wait 30 seconds → Half-open (test 1 request)
Success → Close circuit
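
The three transitions above can be sketched as a small class. This is a single-threaded illustration under the thresholds listed (5 failures in 60 seconds, 30 seconds of cooldown); the class and method names are hypothetical, and a production version would need locking so that only one trial request passes in the half-open state.

```python
import time
from collections import deque
from typing import Callable


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, window: float = 60.0,
                 cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures, self.window, self.cooldown = max_failures, window, cooldown
        self.clock = clock
        self.failures = deque()   # timestamps of recent failures
        self.opened_at = None     # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open, request rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure(trial=(state == "half-open"))
            raise
        if state == "half-open":  # the trial request succeeded: close the circuit
            self.failures.clear()
            self.opened_at = None
        return result

    def _record_failure(self, trial: bool) -> None:
        now = self.clock()
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()   # forget failures outside the window
        if trial or len(self.failures) >= self.max_failures:
            self.opened_at = now      # open (or re-open) and restart the cooldown


# Demo with a fake clock so that no real waiting is needed.
fake_now = [0.0]
breaker = CircuitBreaker(clock=lambda: fake_now[0])


def failing():
    raise TimeoutError("upstream timed out")


for _ in range(5):
    try:
        breaker.call(failing)
    except TimeoutError:
        pass
print(breaker.state)                               # open
fake_now[0] = 30.0
print(breaker.state)                               # half-open
print(breaker.call(lambda: "ok"), breaker.state)   # ok closed
```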

Scaling

Strategy	Description	When to use
Horizontal	Multiple instances behind a load balancer	Many concurrent requests
Queue-based	Celery/Redis for async processing	Long-running agent tasks
Caching	Semantic cache for frequent queries	Recurring questions
Batch	Bundle requests and process them in parallel	Bulk or offline workloads
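
As an illustration of the batch row, the sketch below processes a bundle of inputs in parallel while capping the number of requests in flight. `run_batch` is a hypothetical helper, and `fake_chain` is a stand-in for an awaited chain call such as `chain.ainvoke(...)`, so the example runs without any model.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_batch(
    inputs: Sequence[str],
    handler: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> List[str]:
    """Process all inputs in parallel, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: str) -> str:
        async with semaphore:
            return await handler(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in inputs))


async def fake_chain(question: str) -> str:
    """Hypothetical stand-in for an awaited chain call."""
    await asyncio.sleep(0.01)
    return question.upper()


results = asyncio.run(run_batch(["what is rag?", "what is an agent?"], fake_chain))
print(results)  # ['WHAT IS RAG?', 'WHAT IS AN AGENT?']
```

LangChain runnables also offer `batch` and `abatch`, which accept a `max_concurrency` setting in their config, so a hand-written helper like this is mainly useful for understanding the idea or for non-LangChain work.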

Cost Optimization

LLM costs can escalate quickly. Optimization strategies:

Model routing: Simple questions → cheaper model, complex → expensive model
Caching: Cache identical queries with an exact-match cache, and add a semantic cache based on embeddings to also catch rephrased versions of the same question
Token limits: Cap maximum tokens per request
Prompt optimization: Shorter prompts = fewer tokens = lower costs
Monitoring: Track token usage per endpoint and set alerts
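
Two of these strategies, model routing and caching of identical queries, can be sketched in a few lines. The model names, the keyword hints, and the word-count threshold below are made-up placeholders; real routing rules depend on your traffic and should be validated with evaluation data.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical model names; substitute the ones your provider offers.
CHEAP_MODEL = "small-model"
EXPENSIVE_MODEL = "large-model"
COMPLEX_HINTS = ("compare", "analyze", "step by step", "explain why")


def choose_model(question: str, max_simple_words: int = 30) -> str:
    """Route short questions without complexity hints to the cheaper model."""
    text = question.lower()
    is_complex = (
        len(text.split()) > max_simple_words
        or any(hint in text for hint in COMPLEX_HINTS)
    )
    return EXPENSIVE_MODEL if is_complex else CHEAP_MODEL


class ExactMatchCache:
    """Cache answers for identical queries after case and whitespace normalization."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def _key(question: str) -> str:
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, question: str, compute: Callable[[str], str]) -> str:
        key = self._key(question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        answer = compute(question)   # only called on a cache miss
        self._store[key] = answer
        return answer


cache = ExactMatchCache()
answer = cache.get_or_compute("What is RAG?", lambda q: f"[{choose_model(q)}] answer")
again = cache.get_or_compute("  what is  rag? ", lambda q: "never called")
print(answer, cache.hits)  # [small-model] answer 1
```

A semantic cache follows the same shape, but replaces the hash lookup with a nearest-neighbor search over query embeddings and a similarity threshold.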

Deployment Checklist

Error handling and fallbacks configured
Streaming enabled
Rate limiting implemented
Monitoring and tracing (LangSmith) active
Cost guards configured
Health check endpoint present
Input validation implemented
Secrets managed securely (not in code)
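
For the rate-limiting item, one common approach is a token bucket. The sketch below is a minimal in-process version with an injectable clock; the class name and parameters are illustrative assumptions. With several instances behind a load balancer, the counters would have to live in shared storage such as Redis instead.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Demo with a fake clock so that no real waiting is needed.
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: fake_now[0])
print([bucket.allow() for _ in range(3)])  # [True, True, False]
fake_now[0] = 1.0
print(bucket.allow())                      # True
```

An API handler would typically keep one bucket per client or API key and answer with HTTP 429 when `allow()` returns False.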

Practical tip: Deploy a minimal version early. A simple endpoint with one chain is better than a perfect local notebook. Iterate in production — with tracing and evaluation as your safety net.