Error Handling & Fallbacks

In multi-agent systems, errors can and will occur: LLM timeouts, invalid outputs, rate limits, network failures. A robust system doesn't crash; it degrades gracefully. This chapter covers the essential patterns for resilient n8n multi-agent pipelines.

Circuit Breaker Pattern

The circuit breaker prevents a faulty agent from blocking the entire system:

         ┌──────────────┐
         │    CLOSED    │ ← Normal: Requests pass through
         │ (Errors < 3) │
         └──────┬───────┘
                │ 3 consecutive errors
                ▼
         ┌──────────────┐
         │     OPEN     │ ← Blocked: Requests are rejected immediately
         │ (30s pause)  │
         └──────┬───────┘
                │ After 30 seconds
                ▼
         ┌──────────────┐
         │  HALF-OPEN   │ ← Test: One request is allowed through
         │ (1 attempt)  │
         └──────────────┘

If the test request succeeds, the circuit returns to CLOSED and the error count resets. If it fails, the circuit returns to OPEN for another 30 seconds.

Implementation in n8n

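// n8n Code node, placed before the agent call
const state = $getWorkflowStaticData('global');
const agent = 'researcher';
const failures = state[agent + '_failures'] || 0;
const lastFailure = state[agent + '_last_failure'] || 0;
const now = Date.now();

// OPEN: 3 or more failures and the 30-second pause has not elapsed yet
if (failures >= 3 && now - lastFailure < 30000) {
  return [{ json: { circuit: 'OPEN', fallback: true } }];
}
// HALF-OPEN: the pause has elapsed, so this request goes through as the test
if (failures >= 3) {
  return [{ json: { circuit: 'HALF_OPEN', proceed: true } }];
}
return [{ json: { circuit: 'CLOSED', proceed: true } }];

This check only reads the counters. A second step after the agent call has to update researcher_failures and researcher_last_failure: increment on error, reset to 0 on success. Note that n8n saves workflow static data only for production executions of an active workflow, not for manual test runs.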
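The full state machine, including the update step, can be sketched in plain JavaScript. This is a minimal illustration, not n8n's own API: checkCircuit and recordResult are hypothetical helper names, the state keys follow the snippet above, and state is any plain object (inside n8n it could be the object returned by $getWorkflowStaticData('global')).

```javascript
// Hypothetical helpers; thresholds follow the diagram (3 failures, 30 s pause).
const FAILURE_THRESHOLD = 3;
const OPEN_MS = 30000;

// Decide whether a request may pass, based on the stored counters.
function checkCircuit(state, agent, now = Date.now()) {
  const failures = state[agent + '_failures'] || 0;
  const lastFailure = state[agent + '_last_failure'] || 0;
  if (failures < FAILURE_THRESHOLD) return 'CLOSED';
  return now - lastFailure < OPEN_MS ? 'OPEN' : 'HALF_OPEN';
}

// Record the outcome of an agent call so the next check sees it.
function recordResult(state, agent, ok, now = Date.now()) {
  if (ok) {
    state[agent + '_failures'] = 0; // a success closes the circuit
  } else {
    state[agent + '_failures'] = (state[agent + '_failures'] || 0) + 1;
    state[agent + '_last_failure'] = now; // a failure restarts the 30 s pause
  }
}

// Usage with explicit timestamps (milliseconds):
const demo = {};
for (let i = 0; i < 3; i++) recordResult(demo, 'researcher', false, 1000);
console.log(checkCircuit(demo, 'researcher', 2000));  // OPEN
console.log(checkCircuit(demo, 'researcher', 31000)); // HALF_OPEN
recordResult(demo, 'researcher', true, 31000);
console.log(checkCircuit(demo, 'researcher', 31001)); // CLOSED
```

One simplification to be aware of: in the HALF_OPEN state this sketch lets every request through until one of them reports a result. Limiting it to exactly one test request would need an additional flag in the state.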

Retry Strategies

Not every error requires an immediate fallback; many are transient:

Strategy	Wait Time	Suitable For
Immediate retry	0 seconds	Network glitches
Fixed delay	5 seconds	Rate limits
Exponential backoff	1s → 2s → 4s → 8s	API overload
Exponential + jitter	1s±0.5 → 2s±1 → 4s±2	Many parallel agents
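
The last two rows of the table can be sketched as a small delay calculation plus a retry loop. The function names backoffDelay and withRetry are hypothetical, and the ±50% jitter is an assumption read off the table (1s±0.5, 2s±1, 4s±2). The random and sleep parameters are injectable only so the sketch is easy to test.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, with +/- jitter.
function backoffDelay(attempt, baseMs = 1000, jitter = 0.5, random = Math.random) {
  const exp = baseMs * 2 ** attempt;
  // Uniform offset in [-jitter * exp, +jitter * exp)
  const offset = (random() * 2 - 1) * jitter * exp;
  return Math.round(exp + offset);
}

// Call an async function up to `maxTries` times, waiting between failed attempts.
async function withRetry(
  fn,
  maxTries = 3,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
) {
  let lastError;
  for (let attempt = 0; attempt < maxTries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      // No wait after the final attempt; the error is rethrown below.
      if (attempt < maxTries - 1) await sleep(backoffDelay(attempt));
    }
  }
  throw lastError;
}

// With the jitter pinned to its midpoint, the delays are the plain exponential series:
console.log([0, 1, 2].map((a) => backoffDelay(a, 1000, 0.5, () => 0.5))); // [ 1000, 2000, 4000 ]
```

Jitter matters when many agents fail at the same moment: without it, all of them retry in lockstep and hit the overloaded API again simultaneously.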

n8n Retry Configuration

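n8n offers native retry options in the settings of each node:

Retry On Fail: Enable per node
Max Tries: 3 (recommended)
Wait Between Tries (ms): 1000

The built-in wait is a fixed delay between tries. Exponential backoff is not a node setting; build it yourself, for example with a loop and a Wait node or inside a Code node.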

Fallback Agents

When the primary agent fails despite retries, a fallback takes over:

Primary Agent	Fallback Agent	Strategy
GPT-4o Researcher	Claude Researcher	Model switch
Deep Research Agent	Quick Research Agent	Quality tier
AI Writer	Template Writer	Deterministic
Custom Agent	Cached Response	Last good response

Fallback Chain in n8n

Primary Agent → (Error?) → Fallback Agent 1 → (Error?) → Fallback Agent 2 → (Error?) → Default Response

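Within a single workflow, set each agent node's On Error setting to continue with the error output, and connect that output to the next fallback. To react to a failed execution as a whole, such as a failed sub-workflow, use the n8n Error Trigger node: it starts a separate error workflow, which you select as the Error Workflow in the settings of the workflow that can fail.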
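
The logic of the chain can be sketched outside n8n in a few lines of JavaScript. Everything here is illustrative: runWithFallbacks is a hypothetical helper, and each agent is assumed to be an object with a name and an async run(input) function.

```javascript
// Try each agent in order and return the first successful result.
// If every agent fails, return the default response instead of throwing.
async function runWithFallbacks(agents, input, defaultResponse) {
  const errors = [];
  for (const agent of agents) {
    try {
      const output = await agent.run(input);
      return { source: agent.name, output, errors };
    } catch (err) {
      // Keep the error for logging or alerting, then move on to the next agent.
      errors.push({ agent: agent.name, message: String((err && err.message) || err) });
    }
  }
  return { source: 'default', output: defaultResponse, errors };
}

// Usage: the primary agent fails, the first fallback answers.
const chain = [
  { name: 'primary', run: async () => { throw new Error('timeout'); } },
  { name: 'fallback-1', run: async (q) => 'answer to ' + q },
];
runWithFallbacks(chain, 'query', 'Sorry, no result available.')
  .then((result) => console.log(result.source, result.errors.length)); // fallback-1 1
```

Returning the collected errors alongside the result makes it visible downstream that a fallback answered, which is useful input for the alerting rules later in this chapter.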

Dead Letter Queues

Requests that fail after all retries and fallbacks land in the Dead Letter Queue (DLQ):

CREATE TABLE dead_letter_queue (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  pipeline_id UUID NOT NULL,
  agent_name TEXT NOT NULL,
  input_data JSONB NOT NULL,
  error_message TEXT,
  retry_count INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT now(),
  resolved_at TIMESTAMPTZ,
  resolution TEXT -- 'retried', 'manual', 'discarded'
);
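
Writing to this table could look like the sketch below, which only builds a parameterized INSERT for the columns defined above. buildDlqInsert is a hypothetical helper. The returned object with text and values follows the query format used by PostgreSQL clients such as node-postgres; executing it, whether through a database client or an n8n Postgres node, is left out here.

```javascript
// Build a parameterized INSERT for the dead_letter_queue table.
// id, created_at, resolved_at and resolution are left to their defaults.
function buildDlqInsert({ pipelineId, agentName, input, error, retryCount = 0 }) {
  return {
    text:
      'INSERT INTO dead_letter_queue ' +
      '(pipeline_id, agent_name, input_data, error_message, retry_count) ' +
      'VALUES ($1, $2, $3::jsonb, $4, $5)',
    values: [
      pipelineId,
      agentName,
      JSON.stringify(input), // stored in the JSONB column
      error ? String(error.message || error) : null,
      retryCount,
    ],
  };
}

// Usage with made-up example values:
const query = buildDlqInsert({
  pipelineId: '00000000-0000-0000-0000-000000000001',
  agentName: 'researcher',
  input: { topic: 'example' },
  error: new Error('LLM timeout'),
  retryCount: 3,
});
console.log(query.values[3]); // LLM timeout
```

Passing the values as parameters rather than concatenating them into the SQL string keeps arbitrary agent input from breaking or injecting into the statement.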

Alerting

Configure alerts for critical errors:

Event	Channel	Priority
Circuit breaker opens	Slack + email	High
DLQ entry	Slack	Medium
3× fallback in 10 min	PagerDuty	Critical
Agent latency > 30s	Dashboard	Low
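
A rule such as "3× fallback in 10 min" needs a sliding time window. The following is a minimal sketch under the assumption that each fallback event calls a recorder function; createWindowCounter is a hypothetical name, and in n8n the timestamps would have to be kept somewhere persistent (for example workflow static data or a database) rather than in memory.

```javascript
// Returns a recorder that reports true once `limit` events fall inside `windowMs`.
function createWindowCounter(limit = 3, windowMs = 10 * 60 * 1000) {
  let timestamps = [];
  return function record(now = Date.now()) {
    timestamps.push(now);
    // Drop events that are older than the window.
    timestamps = timestamps.filter((t) => now - t < windowMs);
    return timestamps.length >= limit;
  };
}

// Usage with explicit timestamps (milliseconds):
const fallbackAlert = createWindowCounter();
console.log(fallbackAlert(0));      // false
console.log(fallbackAlert(60000));  // false
console.log(fallbackAlert(120000)); // true, third fallback within 10 minutes
```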

Practical tip: Implement simple retries first (3×, exponential backoff) and a default fallback. As a rule of thumb, this catches roughly 95% of errors. Add circuit breakers and a DLQ once your system is running in production and you can see real error patterns.