Lesson 4 of 5 · 11 min read

Monitoring and Observability

AI systems in production are black boxes unless you observe them. Unlike traditional software, an AI model can degrade silently: answers get slower, more expensive, or less accurate without a single error being thrown. Monitoring is your insurance.

The Three Pillars of Observability

1. Metrics (What's happening?)

Quantitative data about system behavior:

Infrastructure metrics:

  • GPU utilization (target: 70–85%)
  • Memory usage (HBM and RAM)
  • Network throughput and latency
  • Request queue length

AI-specific metrics:

  • Latency (P50/P95/P99): How fast does the model respond? (Target: P95 < 2s)
  • Tokens per second: Model throughput
  • Error rate: Failed requests (target: < 0.1%)
  • Cost per request: What does a single request cost?
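
The AI-specific metrics above can be computed from raw request records. The following is a minimal sketch, not a reference implementation: `RequestRecord`, `percentile`, and `summarize` are hypothetical names, the percentile uses the simple nearest-rank method, and throughput is approximated as total output tokens divided by summed latency, which ignores concurrent requests.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One served request (hypothetical shape, for illustration only)."""
    latency_s: float    # end-to-end latency in seconds
    output_tokens: int  # tokens generated for this request
    failed: bool        # True if the request errored or timed out


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records):
    """Aggregate latency percentiles, throughput, and error rate for one time window."""
    latencies = [r.latency_s for r in records]
    total_tokens = sum(r.output_tokens for r in records)
    total_time = sum(latencies)
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
        # Assumption: requests are treated as sequential, so this understates
        # throughput on a server that handles requests concurrently.
        "tokens_per_second": total_tokens / total_time if total_time else 0.0,
        "error_rate": sum(r.failed for r in records) / len(records),
    }
```

Comparing `p95_s` against the 2-second target and `error_rate` against the 0.1% target gives a first health check for each window.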

Quality metrics:

  • User feedback score: Thumbs up/down per response
  • Hallucination rate: How often does the model fabricate facts? (manual sampling)
  • Task completion rate: How often does the AI solve the user's task?
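
Quality metrics need a little bookkeeping, because they depend on human signals. The sketch below assumes thumbs up/down counts are already collected and that reviewers label a random sample of responses by hand. All function names are hypothetical.

```python
import random


def feedback_score(thumbs_up, thumbs_down):
    """Share of positive ratings, or None if nobody has rated yet."""
    total = thumbs_up + thumbs_down
    return thumbs_up / total if total else None


def sample_for_review(response_ids, k, seed=None):
    """Draw up to k responses uniformly at random for manual hallucination review."""
    rng = random.Random(seed)
    return rng.sample(response_ids, min(k, len(response_ids)))


def hallucination_rate(review_labels):
    """Fraction of reviewed responses that a reviewer marked as containing a fabricated fact."""
    if not review_labels:
        raise ValueError("no reviewed responses")
    return sum(1 for is_hallucination in review_labels if is_hallucination) / len(review_labels)
```

Because the hallucination rate comes from a sample, it is an estimate; a larger sample gives a tighter one.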

2. Logging (What happened?)

Structured logs for debugging and audit:

Log every AI request:

  • Timestamp, user ID, session ID
  • Input prompt (anonymized if PII)
  • Model name and version
  • Output, token count, latency
  • Cost per request
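
A structured log entry with the fields listed above could look like the following sketch. The field names, the `build_log_record` helper, and the email-only redaction are assumptions for illustration; real PII handling usually needs far more than one regular expression.

```python
import json
import re
from datetime import datetime, timezone

# Deliberately simplistic: catches only email-like strings.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text):
    """Replace email-like substrings with a placeholder before logging."""
    return EMAIL_RE.sub("[EMAIL]", text)


def build_log_record(user_id, session_id, prompt, model, model_version,
                     output, token_count, latency_s, cost_usd):
    """Assemble one JSON-serializable log entry for a single AI request."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "session_id": session_id,
        "prompt": redact(prompt),
        "model": model,
        "model_version": model_version,
        "output": output,
        "token_count": token_count,
        "latency_s": latency_s,
        "cost_usd": cost_usd,
    }


record = build_log_record(
    "u-123", "s-456", "Contact me at jane@example.com", "demo-model", "1.0",
    "Sure, I will do that.", 42, 0.8, 0.0004,
)
print(json.dumps(record))  # one JSON object per line is easy to search and aggregate
```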

Log levels:

  • INFO: Every successful request
  • WARN: Slow requests (latency above the current P95), unusually high token counts
  • ERROR: Failed requests, timeouts, rate limit hits
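
The level rules above map directly onto Python's standard `logging` module. In this sketch, `choose_level` and `log_request` are hypothetical helpers. The token threshold of 4000 is an illustrative assumption, since no number is given for "high token counts".

```python
import json
import logging

logger = logging.getLogger("ai_requests")  # logger name is an assumption


def choose_level(failed, latency_s, token_count, p95_latency_s, token_warn_threshold):
    """Pick a log level: ERROR for failures, WARNING for slow or token-heavy requests."""
    if failed:
        return logging.ERROR
    if latency_s > p95_latency_s or token_count > token_warn_threshold:
        return logging.WARNING
    return logging.INFO


def log_request(record, failed, p95_latency_s, token_warn_threshold=4000):
    """Emit one request record as a JSON line at the level chosen above."""
    level = choose_level(
        failed, record["latency_s"], record["token_count"],
        p95_latency_s, token_warn_threshold,
    )
    logger.log(level, "ai_request %s", json.dumps(record))
    return level
```

The P95 value passed in should come from a recent window of metrics, so the warning threshold follows the system's actual behavior.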

3. Alerting (When to react?)

Automatic notifications for anomalies:

Critical alerts (react immediately):

  • Error rate > 5% over 5 minutes
  • Latency P95 > 10 seconds
  • GPU utilization > 95% for 10 minutes
  • Costs > 150% of daily budget

Warning alerts (check within 1h):

  • Latency increase > 50% vs. baseline
  • User feedback score drops by 20%
  • Unusually high token consumption
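
The thresholds above can be written as a small rule check over a metrics snapshot. This is a sketch under several assumptions: the `Snapshot` fields are hypothetical, "GPU utilization > 95% for 10 minutes" is modeled as the minimum utilization over the last 10 minutes, and the 20% feedback drop is read as relative to a baseline. The token-consumption warning is left out because no threshold is specified for it.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Current metric values (hypothetical shape, for illustration only)."""
    error_rate_5m: float            # fraction of failed requests, last 5 minutes
    p95_latency_s: float            # current P95 latency in seconds
    min_gpu_util_10m: float         # lowest GPU utilization (0-1), last 10 minutes
    cost_today: float               # spend so far today
    daily_budget: float             # planned spend per day
    baseline_p95_latency_s: float   # P95 latency during normal operation
    feedback_score: float           # current share of positive ratings
    baseline_feedback_score: float  # share of positive ratings during normal operation


def evaluate_alerts(s):
    """Return a list of (severity, message) tuples for every rule that fires."""
    alerts = []
    if s.error_rate_5m > 0.05:
        alerts.append(("critical", "error rate above 5% over 5 minutes"))
    if s.p95_latency_s > 10:
        alerts.append(("critical", "P95 latency above 10 seconds"))
    if s.min_gpu_util_10m > 0.95:
        alerts.append(("critical", "GPU utilization above 95% for 10 minutes"))
    if s.cost_today > 1.5 * s.daily_budget:
        alerts.append(("critical", "costs above 150% of daily budget"))
    if s.p95_latency_s > 1.5 * s.baseline_p95_latency_s:
        alerts.append(("warning", "P95 latency more than 50% above baseline"))
    if s.feedback_score < 0.8 * s.baseline_feedback_score:
        alerts.append(("warning", "user feedback score dropped by more than 20%"))
    return alerts
```

In practice, rules like these usually live in the alerting layer of the monitoring tool; the function only makes the logic explicit.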

Dashboard Recommendation

A good AI dashboard shows at a glance:

  1. Request volume (trend + current rate)
  2. Latency distribution (histogram P50/P95/P99)
  3. Error rate (time series, last 24h)
  4. Costs (cumulative today, end-of-month projection)
  5. Model distribution (how often each model is used)
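
The end-of-month projection in item 4 can be as simple as a linear extrapolation of spend so far. The sketch below assumes spending continues at the month-to-date daily average and counts the current day as a full day; `project_month_end_cost` is a hypothetical helper.

```python
import calendar
from datetime import date


def project_month_end_cost(month_to_date_cost, today):
    """Extrapolate month-to-date spend linearly to the end of the current month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_average = month_to_date_cost / today.day
    return daily_average * days_in_month


# 300 spent by June 10 at a steady pace projects to 900 for the 30-day month.
print(project_month_end_cost(300.0, date(2024, 6, 10)))
```

A linear projection is crude when traffic varies by weekday, so treat it as an early-warning number, not a forecast.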

Tools: Grafana + Prometheus (open source), Datadog (enterprise), Langfuse (AI-specific, open source).

Golden rule: What you don't measure, you can't improve. Start with 5 metrics and expand incrementally.