You cannot operate a production agent you don't understand. Observability isn't a nice-to-have; it's a prerequisite for production. LangSmith is LangChain's platform for tracing, evaluating, and debugging LLM applications.
Every LangChain run is automatically traced when LangSmith is configured:
```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls_...
export LANGCHAIN_PROJECT=my-agent
```
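Because a missing variable tends to show up only as "no traces appear", it can help to check the configuration at startup. The following is a minimal sketch using only the standard library. The helper names `missing_tracing_vars` and `tracing_enabled` are hypothetical, and treating all three variables as required is an assumption that may be stricter than LangSmith itself.

```python
import os

# The three variables from the shell snippet above.
REQUIRED_TRACING_VARS = (
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
)


def missing_tracing_vars(env=None):
    """Return the names of tracing variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_TRACING_VARS if not env.get(name)]


def tracing_enabled(env=None):
    """True only if every variable is set and the tracing flag is 'true'."""
    env = os.environ if env is None else env
    return (
        not missing_tracing_vars(env)
        and env["LANGCHAIN_TRACING_V2"].strip().lower() == "true"
    )
```

Calling `missing_tracing_vars()` with no argument inspects the real process environment. Passing a plain dict keeps the check easy to unit test.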
| Element | Details |
|---|---|
| LLM Calls | Input, output, token usage, latency, model |
| Tool Calls | Which tool, which parameters, result |
| Chain Steps | Every step of a chain with input/output |
| Retriever | Queries, found documents, relevance scores |
| Errors | Stack traces, retry attempts, fallbacks |
```text
Run: "Customer Support Agent"
├── Chain: "rag_chain"
│   ├── Retriever: "vector_search" (3 documents, 120ms)
│   ├── LLM: "claude-sonnet" (450 tokens, 890ms)
│   └── Parser: "json_output" (2ms)
├── Tool: "create_ticket" (Success, 340ms)
└── LLM: "claude-sonnet" (Final Response, 230 tokens)
```
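One common way to read a trace like the one above is to look for the slowest step first. The sketch below transcribes the example trace by hand into a small tree and finds that step. The `Step` class is a hypothetical stand-in for illustration, not LangSmith's run schema, and the final LLM call has no latency because none is shown in the trace.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One node in a trace tree (hypothetical structure for illustration)."""
    kind: str
    name: str
    latency_ms: Optional[int] = None  # None where the trace shows no latency
    children: List["Step"] = field(default_factory=list)


def walk(step):
    """Yield a step and all of its descendants, depth first."""
    yield step
    for child in step.children:
        yield from walk(child)


def slowest_step(root):
    """Return the step with the highest recorded latency, or None."""
    timed = [s for s in walk(root) if s.latency_ms is not None]
    return max(timed, key=lambda s: s.latency_ms, default=None)


# The example trace, transcribed by hand.
trace = Step("Run", "Customer Support Agent", children=[
    Step("Chain", "rag_chain", children=[
        Step("Retriever", "vector_search", 120),
        Step("LLM", "claude-sonnet", 890),
        Step("Parser", "json_output", 2),
    ]),
    Step("Tool", "create_ticket", 340),
    Step("LLM", "claude-sonnet"),  # final response: no latency shown
])

worst = slowest_step(trace)
print(f"{worst.kind} {worst.name}: {worst.latency_ms} ms")
# prints "LLM claude-sonnet: 890 ms"
```

In this trace the first LLM call dominates, so model latency would be the first thing to investigate, ahead of retrieval or the tool call.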
LangSmith enables systematic evaluation of your chains:
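```python
from langsmith import Client

# Reads the API key from the environment (LANGCHAIN_API_KEY).
client = Client()

# Create an empty dataset, then attach input/expected-output pairs to it.
dataset = client.create_dataset("customer-queries")
client.create_examples(
    inputs=[{"query": "Where is my order?"}],
    outputs=[{"expected": "Order status with tracking link"}],
    dataset_id=dataset.id,
)
```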
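When loading more than a handful of examples, the parallel `inputs` and `outputs` lists can be built from a single list of pairs. This is a minimal sketch. The helper `to_example_batches` and the second question are hypothetical and only illustrate the shape. The `query` and `expected` keys mirror the call above.

```python
def to_example_batches(pairs):
    """Split (query, expected) pairs into parallel inputs/outputs lists."""
    inputs, outputs = [], []
    for query, expected in pairs:
        query = query.strip()
        if not query:
            continue  # skip blank questions instead of uploading them
        inputs.append({"query": query})
        outputs.append({"expected": expected})
    return inputs, outputs


pairs = [
    ("Where is my order?", "Order status with tracking link"),
    # Hypothetical second example, for illustration only.
    ("How do I return an item?", "Return policy and steps to start a return"),
    ("   ", "ignored because the question is blank"),
]
inputs, outputs = to_example_batches(pairs)

# The two lists can then be passed to the same call as above:
# client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
```

Keeping question and expected answer together in one pair avoids the two lists drifting out of alignment as the dataset grows.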
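```python
from langsmith.evaluation import evaluate

# my_chain and the three evaluators are assumed to be defined elsewhere.
results = evaluate(
    my_chain.invoke,           # target: called once per dataset example
    data="customer-queries",   # name of the dataset created above
    evaluators=[
        correctness_evaluator,
        relevance_evaluator,
        helpfulness_evaluator,
    ],
)
```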
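The evaluators in the call above are not defined in this section. As a rough sketch of what one could look like, the function below scores an answer by keyword overlap with the expected description. It assumes a recent SDK version that accepts evaluators written as plain functions with `outputs` and `reference_outputs` parameters, and it assumes the chain's result arrives under an `output` key. The name `keyword_overlap_evaluator` is hypothetical, and a production setup would more likely use an LLM-as-judge evaluator.

```python
import re

# Words too generic to count as evidence of a correct answer.
STOPWORDS = {"a", "an", "the", "with", "and", "of", "to"}


def _keywords(text):
    """Lowercase alphanumeric tokens, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def keyword_overlap_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """Fraction of expected keywords that appear in the answer (0.0 to 1.0)."""
    expected = _keywords(reference_outputs.get("expected", ""))
    # Assumption: the chain's answer is stored under the "output" key.
    answer = _keywords(str(outputs.get("output", "")))
    score = len(expected & answer) / len(expected) if expected else 0.0
    return {"key": "correctness", "score": score}
```

Because the function takes and returns plain dicts, it can be unit tested without calling LangSmith, then passed in the `evaluators` list in place of `correctness_evaluator`.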
LangSmith Hub enables central prompt management:
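A minimal sketch of loading a centrally managed prompt, assuming the `pull_prompt` method of the `langsmith` client and the `name:version` identifier form, where the version is a commit hash or tag. The helper names and the prompt name `support-agent-system` are hypothetical.

```python
from typing import Any, Optional


def prompt_identifier(name: str, version: Optional[str] = None) -> str:
    """Build 'name' (latest) or 'name:version' (pinned commit hash or tag)."""
    return f"{name}:{version}" if version else name


def load_prompt(client: Any, name: str, version: Optional[str] = None) -> Any:
    """Fetch a prompt through the client's pull_prompt method.

    `client` is expected to be a langsmith Client; it is passed in rather
    than created here so the function can be tested with a stub.
    """
    return client.pull_prompt(prompt_identifier(name, version))
```

With a configured client, usage would look like `load_prompt(Client(), "support-agent-system")`. Pinning a version in production and pulling the latest only in development is one way to keep prompt changes from reaching users unreviewed.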
Automated tests on prompt or code changes:
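One way to wire this into CI is a quality gate: run the evaluation, collect the scores per metric, and fail the build if any mean drops below a threshold. The sketch below shows only the gate, using the standard library. The function `check_scores`, the thresholds, and the scores in the test are hypothetical. In a real pipeline the scores would come from an evaluation run.

```python
from statistics import mean
from typing import Dict, List


def check_scores(
    scores: Dict[str, List[float]],
    thresholds: Dict[str, float],
) -> List[str]:
    """Return one failure message per metric whose mean is below its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        values = scores.get(metric, [])
        if not values:
            failures.append(f"{metric}: no scores recorded")
            continue
        average = mean(values)
        if average < minimum:
            failures.append(f"{metric}: mean {average:.2f} < {minimum:.2f}")
    return failures


def test_no_regression():
    # Hypothetical scores; in CI these would come from an evaluation run.
    scores = {"correctness": [1.0, 0.75, 1.0], "relevance": [0.9, 0.8, 1.0]}
    thresholds = {"correctness": 0.8, "relevance": 0.8}
    assert check_scores(scores, thresholds) == []
```

Written as a pytest-style test, the gate runs on every pull request that touches a prompt or the chain code, so a regression blocks the merge instead of surfacing in production.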
Practical tip: Enable tracing from day one. The costs are minimal, but without traces you're debugging blind. Create a test dataset with at least 50 real user questions — that's your gold standard for evaluations.