Lesson 6 of 6 · 11 min read

Production Deployment

Getting from a Jupyter notebook to a production API takes several steps. This lesson covers LangServe, FastAPI integration, streaming, error handling, and scaling, so that you can deploy LangChain applications that are ready for production.

LangServe

LangServe turns any LangChain chain into a REST API:

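from fastapi import FastAPI
from langserve import add_routes

# rag_chain and agent_chain are assumed to be Runnables defined elsewhere.
app = FastAPI(title="Agent API")

# Each call mounts a full set of endpoints for the chain under the given path.
add_routes(app, rag_chain, path="/rag")
add_routes(app, agent_chain, path="/agent")

# Start the server with, for example (assuming this file is main.py):
#   uvicorn main:app --port 8000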

Automatically Generated Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /rag/invoke | POST | Synchronous call |
| /rag/stream | POST | Server-Sent Events streaming |
| /rag/batch | POST | Batch processing |
| /rag/input_schema | GET | Input schema (JSON Schema) |
| /rag/playground | GET | Interactive playground |
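
To make the table concrete, here is a minimal sketch of calling the `/rag/invoke` endpoint using only the Python standard library. The base URL `http://localhost:8000`, the `question` input key, and the helper names `build_invoke_request` and `invoke` are assumptions for illustration. LangServe expects the chain input wrapped in an `input` field and returns the result under `output`.

```python
import json
import urllib.request


def build_invoke_request(base_url: str, path: str, chain_input: dict) -> urllib.request.Request:
    """Build a POST request for a LangServe invoke endpoint (the request is not sent yet)."""
    # LangServe wraps the chain's input in an "input" field.
    body = json.dumps({"input": chain_input}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{path}/invoke",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(base_url: str, path: str, chain_input: dict, timeout: float = 30.0):
    """Send the request and return the chain's output."""
    request = build_invoke_request(base_url, path, chain_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # The chain's result is returned under the "output" key.
    return payload["output"]
```

With the server running, `invoke("http://localhost:8000", "/rag", {"question": "What is RAG?"})` would return the chain's answer. In practice the `RemoteRunnable` client is usually more convenient; the raw request is shown only to make the wire format visible.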

Streaming

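Streaming is critical for a good user experience because users see output as it is generated instead of waiting for the full response. Every route registered with add_routes exposes a /stream endpoint, and the RemoteRunnable client consumes it through stream() and astream():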

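# Server side: add_routes exposes /chat/stream automatically
from langserve import add_routes
add_routes(app, chain, path="/chat")

# Client side (runs in a separate process)
import asyncio
from langserve import RemoteRunnable

remote = RemoteRunnable("http://localhost:8000/chat")

async def main():
    # "async for" is only valid inside a coroutine
    async for chunk in remote.astream({"question": "What is RAG?"}):
        print(chunk, end="", flush=True)

asyncio.run(main())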
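
Clients that cannot use `RemoteRunnable` (for example a thin gateway) have to read the Server-Sent Events stream themselves. The following is a minimal, lenient sketch of an SSE line parser. The function name `parse_sse` is hypothetical, and the exact event names a server emits should be checked against its documentation rather than taken from this example.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from an iterable of Server-Sent Events lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
    if data:
        # Lenient: also emit an event if the stream ended without a blank line.
        yield event, "\n".join(data)
```

Each yielded `data` string can then be decoded with `json.loads` when the server sends JSON payloads. Events without any `data:` line are skipped by this sketch.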

Error Handling

Retry Strategy

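# Assumption: the primary chain calls an OpenAI model, so rate limits
# surface as openai.RateLimitError. Use your provider's exception instead.
from openai import RateLimitError

# primary_chain and fallback_chain are Runnables defined elsewhere.
# Retry the primary chain on transient errors, then switch to the fallback.
chain_with_fallback = primary_chain.with_retry(
    retry_if_exception_type=(TimeoutError, RateLimitError),
    stop_after_attempt=3,
).with_fallbacks(
    [fallback_chain],
    exceptions_to_handle=(TimeoutError, RateLimitError),
)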
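
The role of `exceptions_to_handle` is easier to see in plain Python. The sketch below is an illustration of the idea, not LangChain's implementation, and `call_with_fallbacks` is a hypothetical helper: only the listed exception types move on to the next candidate, while any other exception (such as a programming error) propagates immediately.

```python
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_fallbacks(
    candidates: Sequence[Callable[[], T]],
    exceptions_to_handle: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Try each callable in order and return the first successful result."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    first_error = None
    for candidate in candidates:
        try:
            return candidate()
        except exceptions_to_handle as error:
            # Only listed exception types trigger the next candidate;
            # anything else propagates immediately.
            if first_error is None:
                first_error = error
    # Every candidate failed with a handled exception: surface the first one.
    raise first_error
```

Keeping the handled exceptions narrow is a deliberate choice: falling back on every error would hide real bugs behind a second, possibly more expensive, model call.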

Circuit Breaker

A circuit breaker temporarily stops calling a failing dependency after repeated failures, so requests fail fast instead of piling up:

  • 5 errors in 60 seconds → Open circuit
  • Wait 30 seconds → Half-open (test 1 request)
  • Success → Close circuit
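
The three states above can be sketched as a small class. This is a minimal, single-threaded illustration under stated assumptions: the names `CircuitBreaker` and `CircuitOpenError` are hypothetical, the defaults mirror the numbers in the list, and a failed half-open test request is assumed to reopen the circuit and restart the wait.

```python
import time
from collections import deque


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=5, window=60.0, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window
        self.cooldown = cooldown
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = None    # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        now = self.clock()
        if self.opened_at is not None and now - self.opened_at < self.cooldown:
            raise CircuitOpenError("circuit is open")
        # Either closed, or the cooldown elapsed (half-open): let the call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure(now)
            raise
        # Success closes the circuit and clears the failure history.
        self.opened_at = None
        self.failures.clear()
        return result

    def _record_failure(self, now):
        if self.opened_at is not None:
            # The half-open test request failed: reopen and restart the cooldown.
            self.opened_at = now
            return
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.opened_at = now
```

A chain call would then be wrapped as `breaker.call(chain.invoke, inputs)`. A production version would also need locking for concurrent requests and shared state across instances.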

Scaling

| Strategy | Description | When to use |
|---|---|---|
| Horizontal | Multiple instances behind a load balancer | Many concurrent requests |
| Queue-based | Celery/Redis for async processing | Long-running agent tasks |
| Caching | Semantic cache for frequent queries | Recurring questions |
| Batch | Bundle requests and process in parallel | Batch processing |
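
As an illustration of the batch strategy, the sketch below processes a list of inputs in parallel while capping the number of calls in flight, which helps stay under provider rate limits. The helper name `process_batch` and the default limit of 5 are assumptions for this example.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def process_batch(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrency: int = 5,
) -> List[R]:
    """Run worker over all items in parallel, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(item) for item in items))
```

With a LangChain chain, `worker` could be `chain.ainvoke`. Runnables also offer `batch` and `abatch`, which accept a `max_concurrency` setting in their config, so a hand-written helper like this is mainly useful for code outside a chain.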

Cost Optimization

LLM costs can escalate quickly. Optimization strategies:

  • Model routing: Simple questions → cheaper model, complex → expensive model
  • Caching: Cache identical queries with an exact-match cache, and near-duplicate queries with a semantic cache based on embeddings
  • Token limits: Cap maximum tokens per request
  • Prompt optimization: Shorter prompts = fewer tokens = lower costs
  • Monitoring: Track token usage per endpoint and set alerts
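
Two of these strategies, model routing and caching, can be sketched in a few lines. Everything here is an assumption for illustration: the model names are placeholders, the routing heuristic (question length plus a few keywords) is deliberately crude, and the cache only matches identical queries after normalization. A semantic cache would compare embeddings instead.

```python
import hashlib
from typing import Callable, Dict

CHEAP_MODEL = "cheap-model"          # hypothetical model names
EXPENSIVE_MODEL = "expensive-model"


def route_model(question: str, max_simple_words: int = 20) -> str:
    """Pick a model tier with a crude length and keyword heuristic."""
    hard_markers = ("why", "compare", "explain", "analyze")
    words = question.lower().split()
    if len(words) > max_simple_words or any(w.strip("?,.") in hard_markers for w in words):
        return EXPENSIVE_MODEL
    return CHEAP_MODEL


class ExactMatchCache:
    """Cache answers for identical queries (after case and whitespace normalization)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def _key(question: str) -> str:
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, question: str, compute: Callable[[str], str]) -> str:
        key = self._key(question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        answer = compute(question)  # the LLM call is only paid for on a miss
        self._store[key] = answer
        return answer
```

Tracking `hits` alongside token usage gives a rough picture of how much the cache actually saves per endpoint.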

Deployment Checklist

  • Error handling and fallbacks configured
  • Streaming enabled
  • Rate limiting implemented
  • Monitoring and tracing (LangSmith) active
  • Cost guards configured
  • Health check endpoint present
  • Input validation implemented
  • Secrets managed securely (not in code)
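
For the secrets and cost-guard items, one common approach is to read configuration from environment variables at startup and fail fast when something is missing. The sketch below assumes hypothetical variable names (`LLM_API_KEY`, `MAX_TOKENS_PER_REQUEST`) and a hypothetical `Settings` class. Adapt them to your provider and deployment platform.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    max_tokens_per_request: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from the environment and fail fast if a secret is missing."""
    api_key = env.get("LLM_API_KEY", "")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("LLM_API_KEY is not set")
    # Hypothetical cost guard: an upper bound on tokens per request.
    max_tokens = int(env.get("MAX_TOKENS_PER_REQUEST", "1024"))
    if max_tokens <= 0:
        raise ValueError("MAX_TOKENS_PER_REQUEST must be positive")
    return Settings(api_key=api_key, max_tokens_per_request=max_tokens)
```

Calling `load_settings()` once at application startup surfaces a missing key during deployment rather than on the first user request.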

Practical tip: Deploy a minimal version early. A simple endpoint with one chain is better than a perfect local notebook. Iterate in production — with tracing and evaluation as your safety net.

Quiz

Question 1 of 3

What does LangServe do?