A multi-agent system without monitoring is like a car without a dashboard — you don't know if it's working until it's too late. Observability goes beyond simple logging: you need to understand what each agent does, how long it takes, and what it costs.
| Pillar | What Is Captured | Tools |
|---|---|---|
| Logs | What happened? (Textual records) | n8n Execution Log, Loki |
| Metrics | How much? How fast? (Numbers over time) | Prometheus, Grafana |
| Traces | What path did the request take? (End-to-end path) | OpenTelemetry, Jaeger |
Implement a consistent log format for all agents:
{
"timestamp": "2026-02-20T14:30:00Z",
"pipeline_id": "abc-123",
"agent": "researcher",
"action": "execute",
"status": "completed",
"duration_ms": 4523,
"input_tokens": 250,
"output_tokens": 1200,
"model": "gpt-4o",
"cost_usd": 0.0185,
"metadata": { "sources_found": 5, "confidence": 87 }
}
| Level | Usage | Example |
|---|---|---|
| DEBUG | Agent input/output (development only) | Full prompt and response |
| INFO | Successful agent execution | "Researcher completed in 4.5s" |
| WARN | Retry or fallback triggered | "Writer retry 2/3 after timeout" |
| ERROR | Agent failure, DLQ entry | "Reviewer failed: invalid JSON" |
| FATAL | Pipeline aborted | "Circuit breaker open for all agents" |
| Metric | Description | Target |
|---|---|---|
| Agent latency (p50/p95/p99) | How long does an agent take? | p95 < 10s |
| Pipeline latency | End-to-end duration of entire pipeline | < 30s |
| Success rate | Proportion of successful executions | > 99% |
| Retry rate | How often are retries needed? | < 5% |
| Fallback rate | How often does the fallback kick in? | < 1% |
| Token consumption | Input + output tokens per pipeline | Budget-dependent |
# Agent latency
agent_execution_duration_seconds{agent="researcher", status="success"} 4.523
# Token consumption
agent_tokens_total{agent="writer", type="input"} 250
agent_tokens_total{agent="writer", type="output"} 1200
# Error counter
agent_errors_total{agent="reviewer", error_type="timeout"} 3
Cost transparency is critical in multi-agent systems — each agent consumes tokens.
| Agent | Model | Avg. Tokens/Run | Cost/Run | Runs/Day | Cost/Day |
|---|---|---|---|---|---|
| Researcher | GPT-4o | 1,500 | $0.023 | 500 | $11.50 |
| Writer | GPT-4o | 2,000 | $0.030 | 500 | $15.00 |
| Reviewer | GPT-4o-mini | 800 | $0.002 | 500 | $1.00 |
| Total | $27.50 |
For end-to-end tracing across all agents:
Pipeline Start
└── Orchestrator (span: 28.5s)
├── Researcher Agent (span: 4.5s)
│ ├── LLM Call (span: 3.8s) [model: gpt-4o, tokens: 1500]
│ └── DB Write (span: 0.2s)
├── Writer Agent (span: 8.2s)
│ ├── DB Read (span: 0.1s)
│ ├── LLM Call (span: 7.5s) [model: gpt-4o, tokens: 2000]
│ └── DB Write (span: 0.3s)
└── Reviewer Agent (span: 3.1s)
├── DB Read (span: 0.1s)
└── LLM Call (span: 2.8s) [model: gpt-4o-mini, tokens: 800]
Practical tip: Start with three metrics: agent latency, success rate, and cost per pipeline. These three alone reveal 80% of problems. Add OpenTelemetry tracing when you have more than 5 agents and cross-pipeline debugging becomes necessary.