Lesson 5 of 6 · 10 min read

Incident Response & Rollback

When an AI agent in production makes a critical error, every second counts. OpenClaw provides automated incident detection, standardized shutdown procedures, and safe rollback strategies.

Automated Incident Detection

OpenClaw detects incidents automatically based on multiple signals:

Incident Types

| Severity | Type | Example | Auto Action |
| --- | --- | --- | --- |
| P0 (Critical) | Agent failure | Agent stops responding | Auto-shutdown |
| P0 (Critical) | Data leak | PII in public response | Immediate block |
| P1 (High) | Alignment crash | Score drops below 0.5 | Auto-pause |
| P1 (High) | Cost explosion | 10x normal consumption | Rate limiting |
| P2 (Medium) | Quality drop | Error rate above 10% | Alert + investigation |
| P3 (Low) | Performance degradation | Latency 2x above normal | Alert |
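
The table above can be read as a lookup from incident type to severity and automatic action. The following is a minimal Python sketch of that mapping, not OpenClaw's implementation: the metric names (`responding`, `pii_in_public_response`, `alignment_score`, `cost_ratio`, `error_rate`, `latency_ratio`) and the action identifiers are hypothetical, and the thresholds are taken from the Example column under the assumption that "10x" and "2x" mean "at or above that multiple of normal".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentPolicy:
    severity: str
    auto_action: str


# One entry per table row; keys and action names are illustrative only.
POLICIES = {
    "agent_failure": IncidentPolicy("P0", "auto_shutdown"),
    "data_leak": IncidentPolicy("P0", "immediate_block"),
    "alignment_crash": IncidentPolicy("P1", "auto_pause"),
    "cost_explosion": IncidentPolicy("P1", "rate_limiting"),
    "quality_drop": IncidentPolicy("P2", "alert_and_investigate"),
    "performance_degradation": IncidentPolicy("P3", "alert"),
}


def classify(metrics: dict) -> list:
    """Return the incident types whose example threshold is crossed."""
    found = []
    if not metrics.get("responding", True):
        found.append("agent_failure")
    if metrics.get("pii_in_public_response", False):
        found.append("data_leak")
    if metrics.get("alignment_score", 1.0) < 0.5:
        found.append("alignment_crash")
    if metrics.get("cost_ratio", 1.0) >= 10:  # assumed: 10x normal or more
        found.append("cost_explosion")
    if metrics.get("error_rate", 0.0) > 0.10:
        found.append("quality_drop")
    if metrics.get("latency_ratio", 1.0) >= 2:  # assumed: 2x normal or more
        found.append("performance_degradation")
    return found


if __name__ == "__main__":
    for kind in classify({"alignment_score": 0.42, "error_rate": 0.12}):
        policy = POLICIES[kind]
        print(policy.severity, kind, policy.auto_action)
```

Because the function checks rows in table order, the returned list is already sorted from most to least severe.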

Detection Rules

# incident-detection.yml
detection:
  rules:
    - name: mass-hallucination
      condition: hallucination_rate > 15% over 15m
      severity: P1
      auto_action: pause_agent
      description: "Unusually high hallucination rate"

    - name: loop-detection
      condition: same_tool_call > 10 within single_trace
      severity: P1
      auto_action: kill_trace
      description: "Agent in infinite loop"

    - name: unauthorized-data-access
      condition: data_access outside_policy_boundary
      severity: P0
      auto_action: shutdown_agent
      description: "Data access outside policy"

    - name: cascading-failure
      condition: error_count > 3 agents within 5m
      severity: P0
      auto_action: system_wide_pause
      description: "Cascading failures across multiple agents"
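
To make the rule conditions more concrete, here is a hedged Python sketch of how two of them (`loop-detection` and `mass-hallucination`) could be evaluated. It is an illustration of the logic only: how OpenClaw parses and evaluates conditions is not described in this lesson, and the function names, the representation of a tool call as a string, and the event tuples are all assumptions.

```python
from collections import Counter


def detect_loop(tool_calls, limit=10):
    """True if any identical tool call occurs more than `limit` times in one trace.

    `tool_calls` is assumed to be a list of strings, one per call in the trace.
    """
    counts = Counter(tool_calls)
    return any(n > limit for n in counts.values())


def hallucination_rate_exceeded(events, now, window_s=15 * 60, threshold=0.15):
    """True if the share of flagged responses in the last `window_s` seconds
    is above `threshold`.

    `events` is assumed to be a list of (timestamp_seconds, is_hallucination).
    """
    recent = [flag for ts, flag in events if now - window_s <= ts <= now]
    if not recent:
        return False  # no traffic in the window, nothing to judge
    return sum(recent) / len(recent) > threshold


if __name__ == "__main__":
    trace = ["order_lookup"] * 11 + ["send_reply"]
    print("loop detected:", detect_loop(trace))
```

Both checks use a strict "greater than", matching the `>` in the rule conditions, so exactly 10 repeated calls does not trigger `loop-detection`.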

Agent Shutdown Procedures

Graceful Shutdown

Graceful Shutdown: support-agent-v3
────────────────────────────────────
1. ✅ New requests rejected (redirect to fallback)
2. ✅ Running interactions completed (max 60s)
3. ✅ Open tool calls terminated
4. ✅ State persisted (for later analysis)
5. ✅ Shutdown event logged
6. ✅ Stakeholders notified
Duration: ~45 seconds
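
The six steps above follow a common drain-then-stop pattern. The sketch below shows that ordering in Python under stated assumptions: `AgentRuntime`, its fields, and the `poll` hook are hypothetical stand-ins for whatever the real runtime exposes, and logging and notification are reduced to recording an event name.

```python
import time
from dataclasses import dataclass, field


@dataclass
class AgentRuntime:
    name: str
    accepting_requests: bool = True
    active_interactions: int = 0
    open_tool_calls: int = 0
    events: list = field(default_factory=list)

    def poll(self):
        """Hypothetical hook: refresh active_interactions from the runtime."""


def graceful_shutdown(agent, drain_timeout_s=60.0, poll_interval_s=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    # 1. Stop taking new requests (redirecting to a fallback is not modeled).
    agent.accepting_requests = False
    agent.events.append("requests_rejected")

    # 2. Let running interactions finish, but wait at most drain_timeout_s.
    deadline = clock() + drain_timeout_s
    while agent.active_interactions > 0 and clock() < deadline:
        sleep(poll_interval_s)
        agent.poll()
    drained = agent.active_interactions == 0
    agent.events.append("drained" if drained else "drain_timeout")

    # 3. Terminate whatever tool calls are still open.
    agent.open_tool_calls = 0
    agent.events.append("tool_calls_terminated")

    # 4. Capture state for later analysis.
    snapshot = {
        "name": agent.name,
        "drained": drained,
        "abandoned_interactions": agent.active_interactions,
    }
    agent.events.append("state_persisted")

    # 5. and 6. Logging and notification are only recorded as events here.
    agent.events.append("shutdown_logged")
    agent.events.append("stakeholders_notified")
    return snapshot
```

Passing `sleep` and `clock` as parameters keeps the timeout logic testable without real waiting.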

Emergency Shutdown (Kill Switch)

Emergency Shutdown: support-agent-v3
──────────────────────────────────────
1. ✅ Immediate abort of ALL interactions
2. ✅ All API connections severed
3. ✅ Fallback message to all active users
4. ✅ Emergency event logged
5. ✅ P0 alert to on-call + management
Duration: <5 seconds
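
Unlike the graceful path, an emergency shutdown does not wait for work to finish. One common way to get that behavior is a shared flag that every interaction checks between steps. The sketch below illustrates the idea with `threading.Event`; the `KillSwitch` class, the `run_interaction` function, and the fallback text are hypothetical and not part of OpenClaw.

```python
import threading

# Hypothetical fallback text shown to users when the switch is tripped.
FALLBACK_MESSAGE = "This assistant is temporarily unavailable."


class KillSwitch:
    """Thread-safe flag that, once tripped, stays tripped."""

    def __init__(self):
        self._tripped = threading.Event()
        self.reason = None

    def trip(self, reason):
        self.reason = reason
        self._tripped.set()

    def is_tripped(self):
        return self._tripped.is_set()


def run_interaction(steps, kill_switch):
    """Run agent steps in order; abort with the fallback message as soon as
    the switch is found tripped before a step starts."""
    outputs = []
    for step in steps:
        if kill_switch.is_tripped():
            return FALLBACK_MESSAGE
        outputs.append(step())
    return " ".join(outputs)


if __name__ == "__main__":
    switch = KillSwitch()
    switch.trip("PII detected in output")
    print(run_interaction([lambda: "Hello"], switch))
```

Note that a check between steps cannot interrupt a step that is already running; severing API connections, as in step 2 of the procedure above, would need support from the runtime itself.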

Rollback Strategies

Prompt Rollback

# Show current prompt version history
openclaw agent prompt-history support-agent-v3

# Roll back to an earlier version
openclaw agent rollback support-agent-v3 --to-version v3.0

# Verify rollback
openclaw test run --suite support-agent-regression --quick

Configuration Rollback

OpenClaw stores every configuration state as a snapshot:

| Timestamp | Version | Change | Score |
| --- | --- | --- | --- |
| 2026-02-18 14:00 | v3.1.4 | Temperature: 0.7 → 0.3 | 0.91 |
| 2026-02-17 10:00 | v3.1.3 | New tool: order_lookup | 0.93 |
| 2026-02-15 16:00 | v3.1.2 | Prompt update | 0.89 |
| 2026-02-10 09:00 | v3.1.1 | Model: gpt-4o-mini → gpt-4o | 0.94 |

# Roll back to a specific snapshot
openclaw agent rollback support-agent-v3 --to-snapshot 2026-02-17T10:00
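
As an illustration of how a timestamp could be resolved to a snapshot, the Python sketch below returns the newest snapshot taken at or before the requested time, using the four rows from the table. Whether OpenClaw requires an exact timestamp match or resolves to the nearest earlier snapshot is not stated in this lesson, so the "at or before" rule is an assumption, as is the `snapshot_at` name.

```python
from bisect import bisect_right
from datetime import datetime

# (timestamp, version, score), oldest first, copied from the snapshot table.
SNAPSHOTS = [
    (datetime(2026, 2, 10, 9, 0), "v3.1.1", 0.94),
    (datetime(2026, 2, 15, 16, 0), "v3.1.2", 0.89),
    (datetime(2026, 2, 17, 10, 0), "v3.1.3", 0.93),
    (datetime(2026, 2, 18, 14, 0), "v3.1.4", 0.91),
]


def snapshot_at(when):
    """Return the newest snapshot taken at or before `when`."""
    times = [ts for ts, _, _ in SNAPSHOTS]
    idx = bisect_right(times, when)  # number of snapshots with ts <= when
    if idx == 0:
        raise LookupError(f"no snapshot at or before {when.isoformat()}")
    return SNAPSHOTS[idx - 1]


if __name__ == "__main__":
    ts, version, score = snapshot_at(datetime.fromisoformat("2026-02-17T10:00"))
    print(version, score)
```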

Multi-Agent Rollback

For system-wide issues, OpenClaw can roll back all agents simultaneously:

# System-wide rollback to last stable state
openclaw system rollback --to-last-stable

# Rollback with automatic regression tests
openclaw system rollback --to-last-stable --verify
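
The lesson does not define what counts as "last stable", so the following Python sketch makes an explicit assumption: a version is stable if it is older than the live version and its score is at least 0.9. Under that assumption it plans a rollback for every agent and keeps only the targets that pass a verification callback, which stands in for the regression tests run by `--verify`. The names `last_stable`, `plan_system_rollback`, and the `billing-agent` entry are hypothetical.

```python
def last_stable(history, min_score=0.9):
    """Return the newest version before the live one with score >= min_score.

    `history` is a list of (version, score), oldest first; the last entry is
    treated as the live version and is never chosen. Returns None if no
    earlier version qualifies.
    """
    for version, score in reversed(history[:-1]):
        if score >= min_score:
            return version
    return None


def plan_system_rollback(agents, verify, min_score=0.9):
    """Return (plan, failed): rollback targets per agent, plus the agents that
    have no stable target or whose target fails verification."""
    plan, failed = {}, []
    for name, history in agents.items():
        target = last_stable(history, min_score)
        if target is None or not verify(name, target):
            failed.append(name)
        else:
            plan[name] = target
    return plan, failed


if __name__ == "__main__":
    agents = {
        "support-agent-v3": [("v3.1.1", 0.94), ("v3.1.2", 0.89),
                             ("v3.1.3", 0.93), ("v3.1.4", 0.91)],
        "billing-agent": [("v1.0", 0.70), ("v1.1", 0.80)],  # made-up example
    }
    print(plan_system_rollback(agents, verify=lambda name, version: True))
```

Returning the failed agents separately makes it explicit which ones still need manual attention after a system-wide rollback.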

Post-Mortem Workflows

Automated Post-Mortem Creation

After every P0/P1 incident, OpenClaw generates a post-mortem template:

Post-Mortem: PII Leak in Support Agent
═══════════════════════════════════════
Date:          2026-02-18
Severity:      P0 — Critical
Duration:      12 minutes (14:23 – 14:35)
Impact:        3 customer interactions affected
Detected by:   OpenClaw PII scanner (automatic)
Resolved by:   Auto-shutdown + prompt rollback

Timeline:
  14:20  Prompt update v3.1.4 deployed
  14:23  First trace with PII in output
  14:24  OpenClaw PII alert triggered
  14:25  Auto-shutdown initiated
  14:27  On-call engineer notified
  14:30  Root cause identified (prompt regression)
  14:33  Rollback to v3.1.3 performed
  14:35  Agent back online, tests passed

Root Cause:
  Prompt update v3.1.4 accidentally removed the
  instruction for PII avoidance in responses.

Action Items:
  ☐ Introduce prompt review process (four-eyes principle)
  ☐ Add PII regression test to test suite
  ☐ Implement pre-deployment check for PII rules
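
The Duration line in a post-mortem can be derived from the timeline rather than typed by hand. The Python sketch below recomputes it from the example timeline above, along with time to detect and time to contain. The helper names and the choice of which timeline entries mark each milestone are assumptions made for illustration, and the arithmetic assumes all entries fall on the same day.

```python
from datetime import datetime

# (time, description) pairs copied from the example timeline.
TIMELINE = [
    ("14:20", "Prompt update v3.1.4 deployed"),
    ("14:23", "First trace with PII in output"),
    ("14:24", "OpenClaw PII alert triggered"),
    ("14:25", "Auto-shutdown initiated"),
    ("14:27", "On-call engineer notified"),
    ("14:30", "Root cause identified (prompt regression)"),
    ("14:33", "Rollback to v3.1.3 performed"),
    ("14:35", "Agent back online, tests passed"),
]


def first_time(timeline, keyword):
    """Return the time of the first entry whose description contains keyword."""
    for time_str, description in timeline:
        if keyword.lower() in description.lower():
            return time_str
    raise LookupError(keyword)


def minutes_between(start, end):
    """Whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_metrics(timeline):
    start = first_time(timeline, "First trace")
    return {
        "time_to_detect": minutes_between(start, first_time(timeline, "alert triggered")),
        "time_to_contain": minutes_between(start, first_time(timeline, "Auto-shutdown")),
        "duration": minutes_between(start, first_time(timeline, "back online")),
    }


if __name__ == "__main__":
    print(incident_metrics(TIMELINE))
```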

Key takeaway: A good incident response plan is written before the incident, not during it. Configure shutdown procedures and rollback strategies today so you're ready to act tomorrow.