The path from a Jupyter notebook to a production API is long. This guide walks through LangServe, FastAPI integration, streaming, error handling, and scaling so you can deploy LangChain applications to production.
LangServe turns any LangChain Runnable, such as a chain or an agent, into a REST API served by FastAPI:
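```python
from fastapi import FastAPI
from langserve import add_routes

app = FastAPI(title="Agent API")
# rag_chain and agent_chain are Runnables defined elsewhere in your project
add_routes(app, rag_chain, path="/rag")      # mounts /rag/invoke, /rag/stream, ...
add_routes(app, agent_chain, path="/agent")  # same endpoints under /agent

if __name__ == "__main__":
    import uvicorn  # ASGI server that runs the FastAPI app
    uvicorn.run(app, host="0.0.0.0", port=8000)
```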
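Each `add_routes` call registers a set of endpoints under its path. For `/rag`:

| Endpoint | Method | Description |
|---|---|---|
| `/rag/invoke` | POST | Single synchronous call; returns the complete output |
| `/rag/stream` | POST | Streams output chunks as Server-Sent Events |
| `/rag/batch` | POST | Runs the chain on a list of inputs |
| `/rag/input_schema` | GET | Input schema (JSON Schema) |
| `/rag/playground` | GET | Interactive browser playground |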
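Because these are plain HTTP endpoints, a client does not need LangChain at all. The sketch below uses only the Python standard library to build a `POST` request for `/rag/invoke`. It assumes LangServe's request envelope `{"input": ...}` and a response that carries the result under `"output"`; the base URL and the `{"question": ...}` input shape are assumptions that depend on your server and chain, and `build_invoke_request`/`invoke` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any


def build_invoke_request(base_url: str, chain_input: Any) -> urllib.request.Request:
    """Build (but do not send) a POST request for a LangServe /invoke endpoint."""
    body = json.dumps({"input": chain_input}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/invoke",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(base_url: str, chain_input: Any, timeout: float = 30.0) -> Any:
    """Send the request and return the chain's output from the JSON response."""
    request = build_invoke_request(base_url, chain_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["output"]
```

With the server from the previous example running, `invoke("http://localhost:8000/rag", {"question": "What is RAG?"})` would return whatever the chain produces. Separating request construction from sending keeps the payload logic testable without a live server.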
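Streaming is critical for user experience: users see tokens as soon as they are generated instead of waiting for the complete answer. Besides `/stream`, which sends output chunks, LangServe offers `/stream_log` and `/stream_events` for intermediate steps. On the client, `RemoteRunnable` consumes these endpoints like a local chain: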
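```python
# Server side (e.g. server.py); `chain` is any Runnable defined elsewhere
from fastapi import FastAPI
from langserve import add_routes

app = FastAPI()
add_routes(app, chain, path="/chat")  # /chat/stream is registered automatically
```

```python
# Client side
import asyncio
from langserve import RemoteRunnable

async def main() -> None:
    remote = RemoteRunnable("http://localhost:8000/chat")
    # assumes the chain ends with a StrOutputParser, so each chunk is a string
    async for chunk in remote.astream({"question": "What is RAG?"}):
        print(chunk, end="", flush=True)

asyncio.run(main())  # async iteration must run inside an event loop
```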
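Under the hood, the `/stream` endpoint answers with Server-Sent Events: text frames made of `event:` and `data:` lines, separated by blank lines. A minimal parser for that format might look like the sketch below. It follows the general SSE line format in simplified form; the event names `data` and `end` in the sample are assumptions about what LangServe emits, so verify them against your server's actual output. `parse_sse` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from an iterable of SSE text lines.

    Simplified: ignores `id:`/`retry:` fields and comment lines.
    """
    event, data_lines = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:  # a blank line terminates one event
            if data_lines:
                yield event, "\n".join(data_lines)
            event, data_lines = "message", []
            continue
        field, _, value = line.partition(":")
        value = value[1:] if value.startswith(" ") else value  # strip one leading space
        if field == "event":
            event = value
        elif field == "data":
            data_lines.append(value)
    if data_lines:  # flush a final event that lacks a trailing blank line
        yield event, "\n".join(data_lines)


# Example: a hypothetical stream of JSON-encoded string chunks
sample: List[str] = [
    "event: data", 'data: "RAG "', "",
    "event: data", 'data: "combines retrieval and generation"', "",
    "event: end", "",
]
chunks = [json.loads(d) for e, d in parse_sse(sample) if e == "data"]
print("".join(chunks))  # → RAG combines retrieval and generation
```

In practice `RemoteRunnable` does this parsing for you; a hand-written parser is mainly useful for non-Python clients or debugging raw responses.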
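For error handling, `with_fallbacks` wraps a chain so that a backup chain runs when the primary one raises one of the listed exceptions:

```python
from openai import RateLimitError  # assumes the primary chain calls the OpenAI API

# primary_chain and fallback_chain are Runnables defined elsewhere;
# with_fallbacks() returns a RunnableWithFallbacks, so no extra import is needed
chain_with_fallback = primary_chain.with_fallbacks(
    [fallback_chain],
    exceptions_to_handle=(TimeoutError, RateLimitError),
)
```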
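Conceptually, a fallback chain tries each runnable in order and moves on only when the error is one of the listed exception types; any other error propagates immediately. The plain-Python sketch below illustrates that control flow with ordinary functions. It is a simplified illustration, not LangChain's actual `RunnableWithFallbacks` implementation, and `call_with_fallbacks`, `primary`, and `fallback` are hypothetical names.

```python
from typing import Any, Callable, Optional, Sequence, Tuple, Type


def call_with_fallbacks(
    funcs: Sequence[Callable[[Any], Any]],
    arg: Any,
    exceptions_to_handle: Tuple[Type[BaseException], ...] = (Exception,),
) -> Any:
    """Call each function in order, moving on only for the listed exception types."""
    if not funcs:
        raise ValueError("need at least one function")
    last_error: Optional[BaseException] = None
    for func in funcs:
        try:
            return func(arg)
        except exceptions_to_handle as err:  # handled: try the next function
            last_error = err
        # any other exception type propagates immediately
    raise last_error  # every function failed with a handled exception


# Hypothetical stand-ins for a primary and a fallback chain
def primary(question: str) -> str:
    raise TimeoutError("primary model timed out")


def fallback(question: str) -> str:
    return f"fallback answer to: {question}"


print(call_with_fallbacks([primary, fallback], "What is RAG?", (TimeoutError,)))
# → fallback answer to: What is RAG?
```

Restricting `exceptions_to_handle` matters: falling back on every exception would also hide genuine bugs, such as a malformed prompt, behind the backup model.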
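For repeated failures, a circuit breaker goes further: it temporarily stops calling the failing service, so requests fail fast instead of waiting on timeouts.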
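The pattern fits in a few lines of plain Python. The sketch below is a hypothetical `CircuitBreaker` class, not part of LangChain or LangServe: after `failure_threshold` consecutive failures it opens and rejects calls for `reset_timeout` seconds, then lets one trial call through (the "half-open" state). A successful trial closes the circuit; a failed one reopens it. The default threshold and timeout are illustrative assumptions.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open: failing fast")
            # Timeout elapsed: half-open, let this one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

Usage would look like `breaker.call(chain.invoke, {"question": "..."})`. This sketch is not thread-safe; a server handling concurrent requests would need a lock around the state updates or one breaker per worker.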
Common scaling strategies:

| Strategy | Description | When to use |
|---|---|---|
| Horizontal | Multiple instances behind a load balancer | Many concurrent requests |
| Queue-based | Task queue (e.g. Celery with Redis) for asynchronous processing | Long-running agent tasks |
| Caching | Semantic cache for frequent queries | Recurring questions |
| Batch | Bundle requests and process them in parallel | Bulk or offline workloads |
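A full semantic cache matches queries by embedding similarity, which requires an embedding model and a vector store. As a simpler stand-in, the sketch below uses an exact-match cache keyed on a normalized query string (lowercased, whitespace collapsed), with a time-to-live so stale answers expire. `QueryCache` and its parameters are hypothetical; replacing the exact key lookup with a nearest-neighbor search over embeddings would turn it into a semantic cache.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return " ".join(query.lower().split())


class QueryCache:
    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, answer)

    def get(self, query: str) -> Optional[str]:
        key = normalize(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self.clock() - stored_at > self.ttl:  # expired: drop it and report a miss
            del self._store[key]
            return None
        return answer

    def put(self, query: str, answer: str) -> None:
        self._store[normalize(query)] = (self.clock(), answer)

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        cached = self.get(query)
        if cached is not None:
            return cached
        answer = compute(query)  # e.g. the expensive chain call
        self.put(query, answer)
        return answer
```

An in-process dictionary only helps a single instance; with horizontal scaling, the same idea would typically move to a shared store such as Redis so all instances see the same cache.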
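LLM costs can escalate quickly, because every request is billed by the tokens it consumes. Caching recurring queries, as listed above, is one way to avoid paying twice for the same answer.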
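Costs are easy to estimate up front, since providers typically price input and output tokens separately per million (or per thousand) tokens. The hypothetical helper below does that arithmetic; the prices and request volume in the example are placeholders, not any provider's real rates, so substitute the current price list for your model.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request, given prices in USD per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Placeholder prices: $2.50 per 1M input tokens, $10.00 per 1M output tokens
per_request = estimate_cost_usd(2_000, 500, 2.50, 10.00)
monthly = per_request * 100_000  # assumed volume: 100k requests per month
print(f"${per_request:.4f} per request, ${monthly:,.2f} per month")
# → $0.0100 per request, $1,000.00 per month
```

Note how a RAG prompt's retrieved context inflates input tokens: doubling the context to 4,000 tokens in this example would raise the per-request cost by half.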
Practical tip: Deploy a minimal version early. A simple endpoint serving one chain beats a perfect notebook that only runs locally. Then iterate in production, with tracing and evaluation as your safety net.
What does LangServe do?