Lesson 4 of 6 · 10 min read

Error Handling & Fallbacks

In multi-agent systems, errors will occur: LLM timeouts, invalid outputs, rate limits, network failures. A robust system doesn't crash; it degrades gracefully. This lesson covers the essential patterns for resilient n8n multi-agent pipelines.

Circuit Breaker Pattern

The circuit breaker prevents a faulty agent from blocking the entire system:

         ┌──────────────┐
         │    CLOSED    │ ← Normal: Requests pass through
         │ (Errors < 3) │
         └──────┬───────┘
                │ 3 consecutive errors
                ▼
         ┌──────────────┐
         │     OPEN     │ ← Blocked: Requests are rejected immediately
         │ (30s pause)  │
         └──────┬───────┘
                │ After 30 seconds
                ▼
         ┌──────────────┐
         │  HALF-OPEN   │ ← Test: One request is allowed through
         │ (1 attempt)  │
         └──────────────┘

From HALF-OPEN, a successful test request closes the circuit again; a failed one reopens it for another 30 seconds.

Implementation in n8n

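// n8n Code node: check the circuit state before calling the agent
const state = $getWorkflowStaticData('global');
const agent = 'researcher';
const failures = state[agent + '_failures'] || 0;
const lastFailure = state[agent + '_last_failure'] || 0;
const now = Date.now();

if (failures >= 3) {
  if (now - lastFailure < 30000) {
    // OPEN: reject immediately and route to the fallback
    return [{ json: { circuit: 'OPEN', fallback: true } }];
  }
  // HALF-OPEN: the 30-second pause is over, so let one test request through
  return [{ json: { circuit: 'HALF_OPEN', proceed: true } }];
}
// CLOSED: normal operation
return [{ json: { circuit: 'CLOSED', proceed: true } }];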
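
The check above reads two counters from workflow static data, but something also has to write them after each agent call. A minimal sketch of that bookkeeping follows. It assumes the same key names (`<agent>_failures`, `<agent>_last_failure`); the function name `recordResult` is hypothetical, and `state` is passed in as a plain object so the logic can be tried outside n8n.

```javascript
// Update the circuit-breaker counters after an agent call.
// `state` is a plain object; inside an n8n Code node it would be the
// object returned by $getWorkflowStaticData('global').
function recordResult(state, agent, succeeded, now = Date.now()) {
  if (succeeded) {
    // Any success resets the counter, so only consecutive errors count.
    // This also covers the single test request after the 30-second pause.
    state[agent + '_failures'] = 0;
    return state;
  }
  // A failure increments the counter and restarts the 30-second window.
  state[agent + '_failures'] = (state[agent + '_failures'] || 0) + 1;
  state[agent + '_last_failure'] = now;
  return state;
}
```

In a workflow, this would run in a Code node on both the success and the error branch of the agent call. Note that n8n persists static data only for production executions of an active workflow, not for manual test runs, so the breaker state will not carry over while testing in the editor.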

Retry Strategies

Not every error requires an immediate fallback — many are transient:

Strategy             | Wait Time              | Suitable For
Immediate retry      | 0 seconds              | Network glitches
Fixed delay          | 5 seconds              | Rate limits
Exponential backoff  | 1s → 2s → 4s → 8s      | API overload
Exponential + jitter | 1s±0.5 → 2s±1 → 4s±2   | Many parallel agents
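
The last two rows of the table can be sketched as a small delay calculation plus a retry loop. This is an illustration, not an n8n API: the names `backoffDelay` and `retryWithBackoff`, the 1000 ms base, and the ±50% jitter ratio are assumptions chosen to match the table.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, plus or
// minus a random jitter. jitterRatio = 0.5 matches the table row
// 1s±0.5 → 2s±1 → 4s±2. `random` is injectable so the result can be tested.
function backoffDelay(attempt, baseMs = 1000, jitterRatio = 0.5, random = Math.random) {
  const exponential = baseMs * 2 ** attempt;
  const jitter = exponential * jitterRatio * (random() * 2 - 1);
  return Math.round(exponential + jitter);
}

// Retry an async operation up to `maxTries` times, sleeping between attempts.
// The last error is rethrown if every attempt fails.
async function retryWithBackoff(
  operation,
  maxTries = 3,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
) {
  let lastError;
  for (let attempt = 0; attempt < maxTries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // No point in sleeping after the final attempt.
      if (attempt < maxTries - 1) await sleep(backoffDelay(attempt));
    }
  }
  throw lastError;
}
```

Jitter matters when many agents fail at the same moment: without it, they would all retry at the same instants and hit the API in synchronized bursts.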

n8n Retry Configuration

n8n offers native retry options for each node:

  • Retry on Fail: Enable per node
  • Max Tries: 3 (recommended)
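  • Wait Between Tries: 1000 ms. This is a fixed delay; the native setting does not back off exponentially, so exponential backoff needs a custom retry loop (for example with a Wait node or a Code node).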

Fallback Agents

When the primary agent fails despite retries, a fallback takes over:

Primary Agent       | Fallback Agent       | Strategy
GPT-4o Researcher   | Claude Researcher    | Model switch
Deep Research Agent | Quick Research Agent | Quality tier
AI Writer           | Template Writer      | Deterministic
Custom Agent        | Cached Response      | Last good response

Fallback Chain in n8n

Primary Agent → (Error?) → Fallback Agent 1 → (Error?) → Fallback Agent 2 → (Error?) → Default Response

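Use the n8n Error Trigger node in a dedicated error workflow to start the fallback automatically after a sub-workflow fails. For a fallback inside the same workflow, set the failing node's On Error setting to continue using the error output and connect that output to the fallback agent.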
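
The chain itself comes down to "try each agent in order, return the first success, otherwise return the default". A minimal sketch of that logic is shown below. The function name `runFallbackChain` and the `{ name, run }` agent shape are hypothetical; in n8n each `run` would correspond to a sub-workflow or agent node.

```javascript
// Try each agent in order and return the first successful result.
// `agents` is an array of { name, run }, where run(input) may be async and
// signals failure by throwing. The collected errors are returned as well,
// so they can be logged or sent to a dead letter queue.
async function runFallbackChain(agents, input, defaultResponse) {
  const errors = [];
  for (const agent of agents) {
    try {
      const output = await agent.run(input);
      return { source: agent.name, output, errors };
    } catch (error) {
      errors.push({ agent: agent.name, message: error.message });
    }
  }
  // Every agent failed: fall back to the static default response.
  return { source: 'default', output: defaultResponse, errors };
}
```

Returning `source` alongside the output lets downstream nodes see whether the answer came from the primary agent, a fallback, or the default, which is also the signal an alert on repeated fallbacks needs.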

Dead Letter Queues

Requests that fail after all retries and fallbacks land in the Dead Letter Queue (DLQ):

CREATE TABLE dead_letter_queue (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  pipeline_id UUID NOT NULL,
  agent_name TEXT NOT NULL,
  input_data JSONB NOT NULL,
  error_message TEXT,
  retry_count INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT now(),
  resolved_at TIMESTAMPTZ,
  resolution TEXT -- 'retried', 'manual', 'discarded'
);
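
Writing to this table can be sketched as a helper that turns a failed request into a parameterized INSERT. The function name `buildDlqInsert` and its argument names are hypothetical. The `{ text, values }` shape is what the node-postgres client accepts; with the n8n Postgres node, the same statement and parameters would go into the node's query fields instead.

```javascript
// Build a parameterized INSERT for the dead_letter_queue table.
// id, created_at, resolved_at and resolution are left to their defaults.
function buildDlqInsert({ pipelineId, agentName, inputData, error, retryCount = 0 }) {
  const message = error instanceof Error ? error.message : String(error);
  return {
    text:
      'INSERT INTO dead_letter_queue ' +
      '(pipeline_id, agent_name, input_data, error_message, retry_count) ' +
      'VALUES ($1, $2, $3::jsonb, $4, $5)',
    // input_data is serialized here and cast to JSONB by the database.
    values: [pipelineId, agentName, JSON.stringify(inputData), message, retryCount],
  };
}
```

Placeholders keep the agent input out of the SQL string, which matters because that input is often user-supplied text.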

Alerting

Configure alerts for critical errors:

Event                 | Channel       | Priority
Circuit breaker opens | Slack + email | High
DLQ entry             | Slack         | Medium
3× fallback in 10 min | PagerDuty     | Critical
Agent latency > 30s   | Dashboard     | Low
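
The routing in this table can be sketched as a lookup plus two threshold checks. The event type names (`circuit_open`, `dlq_entry`, `fallback`, `latency`) and the function names are hypothetical, and the channels are plain labels; sending the actual notification is left to the corresponding n8n nodes.

```javascript
// Routing rules taken from the alerting table.
const ALERT_RULES = {
  circuit_open: { channels: ['slack', 'email'], priority: 'high' },
  dlq_entry: { channels: ['slack'], priority: 'medium' },
  fallback_burst: { channels: ['pagerduty'], priority: 'critical' },
  slow_agent: { channels: ['dashboard'], priority: 'low' },
};

const WINDOW_MS = 10 * 60 * 1000; // 10 minutes
const LATENCY_LIMIT_MS = 30000; // 30 seconds

// True when at least `threshold` fallbacks fall inside the last 10 minutes.
function isFallbackBurst(fallbackTimestamps, now, threshold = 3) {
  const recent = fallbackTimestamps.filter((t) => t <= now && now - t <= WINDOW_MS);
  return recent.length >= threshold;
}

// Return the matching rule for an event, or null if no alert is needed.
function classify(event, now = Date.now()) {
  if (event.type === 'fallback') {
    return isFallbackBurst(event.recentFallbacks || [], now) ? ALERT_RULES.fallback_burst : null;
  }
  if (event.type === 'latency') {
    return event.latencyMs > LATENCY_LIMIT_MS ? ALERT_RULES.slow_agent : null;
  }
  return ALERT_RULES[event.type] || null;
}
```

A single fallback does not alert on its own here; only the burst does, which keeps the critical channel quiet during isolated failures.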

Practical tip: Implement simple retries first (3×, exponential backoff) and a default fallback. This catches 95% of errors. Add circuit breakers and DLQ when your system is running in production and you see real error patterns.