Lesson 5 of 6 · 10 min read

LangSmith & Observability

You cannot operate a production agent you don't understand. LangSmith is LangChain's platform for tracing, evaluation, and debugging LLM applications. Observability isn't a nice-to-have — it's a prerequisite for production.

Tracing

Once LangSmith is configured through the following environment variables, every LangChain run is traced automatically:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls_...
export LANGCHAIN_PROJECT=my-agent
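
The same settings can also be applied from Python, which can be handy in notebooks or test setups. This is a minimal sketch, assuming the variables are set before any LangChain code runs. `configure_tracing` is a hypothetical helper that only mirrors the three export lines above, and the key shown is a placeholder.

```python
import os


def configure_tracing(api_key: str, project: str, enabled: bool = True) -> dict:
    """Set the LangSmith tracing variables for the current process.

    Hypothetical helper: it mirrors the three `export` lines and nothing more.
    """
    settings = {
        "LANGCHAIN_TRACING_V2": "true" if enabled else "false",
        "LANGCHAIN_API_KEY": api_key,
        "LANGCHAIN_PROJECT": project,
    }
    os.environ.update(settings)
    return settings


# Placeholder key: load the real one from a secret store, not from source code.
applied = configure_tracing(api_key="ls_placeholder", project="my-agent")
```

Returning the applied settings makes it easy to log which project a process is tracing to without printing the key itself.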

What Gets Traced?

Element       Details
LLM Calls     Input, output, token usage, latency, model
Tool Calls    Which tool, which parameters, result
Chain Steps   Every step of a chain with input/output
Retriever     Queries, found documents, relevance scores
Errors        Stack traces, retry attempts, fallbacks

Trace Hierarchy

Run: "Customer Support Agent"
├── Chain: "rag_chain"
│   ├── Retriever: "vector_search" (3 documents, 120ms)
│   ├── LLM: "claude-sonnet" (450 tokens, 890ms)
│   └── Parser: "json_output" (2ms)
├── Tool: "create_ticket" (Success, 340ms)
└── LLM: "claude-sonnet" (Final Response, 230 tokens)
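
To make the hierarchy concrete, the trace above can be modeled as a small tree and queried. This is an illustrative sketch only: the `Span` class and its fields are assumptions for this example, not LangSmith's actual run schema. The final LLM call has no latency in the trace above, so it is left unset here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    """One node in a trace tree (illustrative model, not LangSmith's schema)."""
    kind: str
    name: str
    latency_ms: Optional[int] = None  # None when the trace shows no latency
    tokens: int = 0
    children: List["Span"] = field(default_factory=list)


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def total_tokens(root: Span) -> int:
    """Sum token usage over every span in the tree."""
    return sum(s.tokens for s in walk(root))


def slowest(root: Span) -> Span:
    """Return the span with the highest recorded latency."""
    timed = [s for s in walk(root) if s.latency_ms is not None]
    return max(timed, key=lambda s: s.latency_ms)


# The example trace, rebuilt as nested spans.
trace = Span("Run", "Customer Support Agent", children=[
    Span("Chain", "rag_chain", children=[
        Span("Retriever", "vector_search", latency_ms=120),
        Span("LLM", "claude-sonnet", latency_ms=890, tokens=450),
        Span("Parser", "json_output", latency_ms=2),
    ]),
    Span("Tool", "create_ticket", latency_ms=340),
    Span("LLM", "claude-sonnet", tokens=230),
])

print(total_tokens(trace))   # 680
print(slowest(trace).name)   # claude-sonnet
```

Questions like "where did the tokens go?" and "which step was slowest?" are the same ones the trace view answers per run.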

Evaluation

LangSmith enables systematic evaluation of your chains:

Create Datasets

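from langsmith import Client

# Client() reads the API key from the environment (LANGCHAIN_API_KEY).
client = Client()

# Create an empty dataset, then add examples as parallel inputs/outputs lists.
dataset = client.create_dataset("customer-queries")
client.create_examples(
    inputs=[{"query": "Where is my order?"}],
    outputs=[{"expected": "Order status with tracking link"}],
    dataset_id=dataset.id,
)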
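
Because `create_examples` takes inputs and outputs as parallel lists, logged question/answer pairs usually need reshaping first. Here is a minimal sketch, assuming the raw data is a list of (query, expected) tuples. `build_examples` is a hypothetical helper, not part of the LangSmith SDK, and the sample pairs are made up.

```python
from typing import Dict, Iterable, List, Tuple


def build_examples(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Turn (query, expected) pairs into parallel inputs/outputs lists.

    Blank queries are skipped and case-insensitive duplicates are dropped,
    keeping the first occurrence.
    """
    inputs: List[Dict[str, str]] = []
    outputs: List[Dict[str, str]] = []
    seen = set()
    for query, expected in pairs:
        key = query.strip().lower()
        if not key or key in seen:
            continue
        seen.add(key)
        inputs.append({"query": query.strip()})
        outputs.append({"expected": expected.strip()})
    return inputs, outputs


raw_pairs = [
    ("Where is my order?", "Order status with tracking link"),
    ("where is my order? ", "Order status with tracking link"),  # duplicate
    ("", "ignored"),                                             # blank query
]
inputs, outputs = build_examples(raw_pairs)
# inputs and outputs can then be passed to client.create_examples(...).
```

Deduplicating before upload keeps frequently asked questions from dominating the evaluation scores.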

Define Evaluators

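from langsmith.evaluation import evaluate

# my_chain and the three evaluators are assumed to be defined elsewhere.
results = evaluate(
    my_chain.invoke,            # target: called once per dataset example
    data="customer-queries",    # name of the dataset created above
    evaluators=[
        correctness_evaluator,
        relevance_evaluator,
        helpfulness_evaluator,
    ],
)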
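
The evaluators in that list are not defined in this lesson. As a rough sketch of what one could look like, the function below takes a run and a dataset example and returns a dict with a `key` and a `score`. The name `keyword_overlap_evaluator`, the `"output"` key on the run, and the word-overlap metric are all assumptions for illustration. A real correctness check would more likely use an LLM judge or a domain-specific rule.

```python
import re
from types import SimpleNamespace


def _words(text: str) -> set:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_overlap_evaluator(run, example) -> dict:
    """Score the share of expected words that appear in the answer.

    Assumes the answer is under run.outputs["output"] and the reference
    under example.outputs["expected"]; adjust the keys to your chain.
    """
    answer = str((run.outputs or {}).get("output", ""))
    expected = str((example.outputs or {}).get("expected", ""))
    expected_words = _words(expected)
    if not expected_words:
        return {"key": "keyword_overlap", "score": 0.0}
    hits = expected_words & _words(answer)
    return {"key": "keyword_overlap", "score": len(hits) / len(expected_words)}


# Stand-ins for the run and example objects, for a local check without LangSmith.
fake_run = SimpleNamespace(
    outputs={"output": "Your order status: shipped. Tracking link attached."}
)
fake_example = SimpleNamespace(
    outputs={"expected": "Order status with tracking link"}
)
result = keyword_overlap_evaluator(fake_run, fake_example)
```

Checking an evaluator locally with stand-in objects like these catches key mismatches before a full evaluation run spends tokens.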

Prompt Versioning

LangSmith Hub enables central prompt management:

  • Versioning: Every prompt change is versioned
  • A/B testing: Test different prompt versions against each other
  • Rollback: Instantly revert to a previous version
  • Sharing: Share prompts across teams and collaborate

Regression Testing

Run automated tests whenever a prompt or the code changes:

  1. Create baseline: Measure current performance on a dataset
  2. Make change: Adjust prompt, model, or chain
  3. Regression test: Evaluate the same dataset again
  4. Compare: LangSmith shows improvements and regressions
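
The compare step can also be sketched locally. The example below assumes per-example scores from both runs have been exported as dicts keyed by example ID. `compare_scores`, the tolerance value, and the scores themselves are made up for illustration and are not LangSmith output.

```python
from typing import Dict, List


def compare_scores(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> Dict[str, List[str]]:
    """Classify each example as improved, regressed, or unchanged.

    Only examples present in both runs are compared. A change no larger
    than `tolerance` in either direction counts as unchanged.
    """
    report: Dict[str, List[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for example_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[example_id] - baseline[example_id]
        if delta > tolerance:
            report["improved"].append(example_id)
        elif delta < -tolerance:
            report["regressed"].append(example_id)
        else:
            report["unchanged"].append(example_id)
    return report


# Made-up scores for illustration only.
baseline = {"q1": 0.80, "q2": 0.60, "q3": 0.90}
candidate = {"q1": 0.82, "q2": 0.85, "q3": 0.50}
report = compare_scores(baseline, candidate)
print(report)
```

In a CI pipeline, a non-empty `report["regressed"]` list could be used to fail the build, so a prompt change that helps on average but breaks individual cases does not slip through.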

Practical tip: Enable tracing from day one. The overhead is small, and without traces you're debugging blind. Create a test dataset with at least 50 real user questions and treat it as your gold standard for evaluations.