In multi-agent systems, errors can and will occur: LLM timeouts, invalid outputs, rate limits, network failures. A robust system doesn't crash; it degrades gracefully. This chapter covers the essential patterns for resilient n8n multi-agent pipelines.
The circuit breaker prevents a faulty agent from blocking the entire system:
┌──────────────┐
│    CLOSED    │ ← Normal: Requests pass through
│ (Errors < 3) │
└──────┬───────┘
       │ 3 consecutive errors
       ▼
┌──────────────┐
│     OPEN     │ ← Blocked: Requests are rejected immediately
│ (30s pause)  │
└──────┬───────┘
       │ After 30 seconds
       ▼
┌──────────────┐
│  HALF-OPEN   │ ← Test: One request is allowed through
│ (1 attempt)  │
└──────────────┘
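{
  "function": "const state = $getWorkflowStaticData('global');\nconst agent = 'researcher';\nconst failures = state[agent + '_failures'] || 0;\nconst lastFailure = state[agent + '_last_failure'] || 0;\nconst now = Date.now();\n\n// OPEN: threshold reached and the 30 s pause has not elapsed yet\nif (failures >= 3 && now - lastFailure < 30000) {\n  return [{ json: { circuit: 'OPEN', fallback: true } }];\n}\n// HALF-OPEN: pause elapsed, allow one test request through\nif (failures >= 3) {\n  return [{ json: { circuit: 'HALF-OPEN', proceed: true } }];\n}\n// CLOSED: fewer than 3 consecutive errors\nreturn [{ json: { circuit: 'CLOSED', proceed: true } }];"
}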
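The check above reads the `researcher_failures` and `researcher_last_failure` counters but does not show where they are written. The following is a minimal sketch of that bookkeeping, not a definitive implementation. The helper names `recordFailure`, `recordSuccess`, and `circuitState` are hypothetical, and a plain `state` object stands in for `$getWorkflowStaticData('global')` so the logic can run outside n8n. The threshold of 3 and the 30-second pause are taken from the diagram.

```javascript
// Hypothetical helpers. In n8n, `state` would be $getWorkflowStaticData('global').
const FAILURE_THRESHOLD = 3;    // 3 consecutive errors open the circuit
const OPEN_DURATION_MS = 30000; // 30 s pause before a test request

// Call after the agent node fails.
function recordFailure(state, agent, now = Date.now()) {
  state[agent + '_failures'] = (state[agent + '_failures'] || 0) + 1;
  state[agent + '_last_failure'] = now;
}

// Call after the agent node succeeds: resets the consecutive-error count.
function recordSuccess(state, agent) {
  state[agent + '_failures'] = 0;
}

// Derives the circuit state from the stored counters.
function circuitState(state, agent, now = Date.now()) {
  const failures = state[agent + '_failures'] || 0;
  const lastFailure = state[agent + '_last_failure'] || 0;
  if (failures < FAILURE_THRESHOLD) return 'CLOSED';
  return now - lastFailure < OPEN_DURATION_MS ? 'OPEN' : 'HALF-OPEN';
}

// Usage with explicit timestamps (milliseconds)
const state = {};
for (let i = 0; i < 3; i++) recordFailure(state, 'researcher', 1000);
console.log(circuitState(state, 'researcher', 2000));  // OPEN
console.log(circuitState(state, 'researcher', 40000)); // HALF-OPEN
```

With this scheme, a failed test request in the HALF-OPEN state updates the last-failure timestamp and reopens the circuit for another 30 seconds, while a successful one resets the counter and closes it.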
Not every error requires an immediate fallback; many are transient:
| Strategy | Wait Time | Suitable For |
|---|---|---|
| Immediate retry | 0 seconds | Network glitches |
| Fixed delay | 5 seconds | Rate limits |
| Exponential backoff | 1s → 2s → 4s → 8s | API overload |
| Exponential + jitter | 1s±0.5 → 2s±1 → 4s±2 | Many parallel agents |
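The last two rows of the table can be sketched in a few lines. This is an illustrative sketch, assuming a 1-second base delay and a jitter of ±50% of each delay (matching the `1s±0.5 → 2s±1 → 4s±2` row). The names `backoffDelay` and `retryWithBackoff` and their options are hypothetical, not part of n8n.

```javascript
// Delay in milliseconds before retry number `attempt` (0-based).
// jitter = 0 gives plain exponential backoff; jitter = 0.5 gives ±50 %.
function backoffDelay(attempt, { baseMs = 1000, jitter = 0, random = Math.random } = {}) {
  const delay = baseMs * 2 ** attempt;        // 1s, 2s, 4s, 8s, ...
  const spread = delay * jitter;              // maximum deviation from `delay`
  return delay + (random() * 2 - 1) * spread; // uniform in [delay - spread, delay + spread]
}

// Retries an async function up to `maxTries` times, sleeping between attempts.
async function retryWithBackoff(fn, options = {}) {
  const {
    maxTries = 3,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    ...delayOptions
  } = options;
  let lastError;
  for (let attempt = 0; attempt < maxTries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      // No need to wait after the final attempt.
      if (attempt < maxTries - 1) await sleep(backoffDelay(attempt, delayOptions));
    }
  }
  throw lastError;
}

console.log([0, 1, 2, 3].map((n) => backoffDelay(n))); // [ 1000, 2000, 4000, 8000 ]
```

Jitter matters mainly when many agents fail at the same moment: without it, they would all retry in lockstep and hit the overloaded API together again.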
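n8n offers native retry options for each node: in the node's Settings tab, enable Retry On Fail, then set Max Tries and Wait Between Tries (ms). The built-in wait is a fixed delay, so exponential backoff or jitter requires custom logic.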
When the primary agent fails despite retries, a fallback takes over:
| Primary Agent | Fallback Agent | Strategy |
|---|---|---|
| GPT-4o Researcher | Claude Researcher | Model switch |
| Deep Research Agent | Quick Research Agent | Quality tier |
| AI Writer | Template Writer | Deterministic |
| Custom Agent | Cached Response | Last good response |
Primary Agent → (Error?) → Fallback Agent 1 → (Error?) → Fallback Agent 2 → (Error?) → Default Response
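Use the n8n Error Trigger node to start the fallback automatically after a sub-workflow fails. The Error Trigger lives in a separate error workflow, which must be selected as the Error Workflow in the settings of the workflow that can fail.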
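The chain logic itself is independent of n8n and can be sketched as a loop over handlers. This is a minimal sketch under assumptions: the function name `runWithFallbacks` and the `{ name, run }` handler shape are hypothetical, and each `run` stands in for whatever actually calls the agent (for example, a sub-workflow execution).

```javascript
// Tries each handler in order and returns the first successful result.
// Falls back to `defaultResponse` when every handler throws.
async function runWithFallbacks(input, handlers, defaultResponse) {
  const errors = [];
  for (const { name, run } of handlers) {
    try {
      const output = await run(input);
      return { output, servedBy: name, errors };
    } catch (err) {
      // Remember why this agent failed, then move on to the next one.
      errors.push({ agent: name, message: err.message });
    }
  }
  return { output: defaultResponse, servedBy: 'default', errors };
}

// Usage: the primary agent fails, the first fallback answers.
const chain = [
  { name: 'primary', run: async () => { throw new Error('LLM timeout'); } },
  { name: 'fallback-1', run: async (topic) => `Short summary of ${topic}` },
];
runWithFallbacks('n8n', chain, 'Service temporarily unavailable')
  .then((result) => console.log(result.servedBy, result.errors.length)); // fallback-1 1
```

Returning `servedBy` and the collected `errors` alongside the output makes it possible to count how often fallbacks are used, which is useful input for alerting.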
Requests that fail after all retries and fallbacks land in the Dead Letter Queue (DLQ):
CREATE TABLE dead_letter_queue (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
pipeline_id UUID NOT NULL,
agent_name TEXT NOT NULL,
input_data JSONB NOT NULL,
error_message TEXT,
retry_count INTEGER DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT now(),
resolved_at TIMESTAMPTZ,
resolution TEXT -- 'retried', 'manual', 'discarded'
);
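Writing to this table from a workflow might look like the sketch below. It only builds a parameterized statement for the columns defined above. The function name `buildDlqInsert` and its argument names are hypothetical, and the `{ text, values }` shape is an assumption that fits clients accepting numbered `$1` placeholders; adapt it to whatever database node or client the pipeline uses.

```javascript
// Builds a parameterized INSERT for the dead_letter_queue table.
// id, created_at, resolved_at and resolution are left to their defaults.
function buildDlqInsert({ pipelineId, agentName, inputData, error, retryCount = 0 }) {
  return {
    text:
      'INSERT INTO dead_letter_queue ' +
      '(pipeline_id, agent_name, input_data, error_message, retry_count) ' +
      'VALUES ($1, $2, $3::jsonb, $4, $5)',
    values: [
      pipelineId,
      agentName,
      JSON.stringify(inputData),                    // stored as JSONB
      error ? String(error.message || error) : null, // error_message is nullable
      retryCount,
    ],
  };
}

// Usage with illustrative values
const entry = buildDlqInsert({
  pipelineId: '3f2b1c9e-8a4d-4e6f-9b1a-2c3d4e5f6a7b',
  agentName: 'researcher',
  inputData: { topic: 'n8n error handling' },
  error: new Error('LLM timeout'),
  retryCount: 3,
});
console.log(entry.values[3]); // LLM timeout
```

Passing values as parameters rather than concatenating them into the SQL string keeps arbitrary agent input from breaking or injecting into the statement.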
Configure alerts for critical errors:
| Event | Channel | Priority |
|---|---|---|
| Circuit breaker opens | Slack + email | High |
| DLQ entry | Slack | Medium |
| 3× fallback in 10 min | PagerDuty | Critical |
| Agent latency > 30s | Dashboard | Low |
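The "3× fallback in 10 min" rule needs a sliding time window. One way to sketch it, assuming fallback events are reported one at a time with a timestamp, is shown below. The name `createFallbackMonitor` is hypothetical, and the timestamps are kept in memory here; in n8n they would have to be persisted, for example in workflow static data.

```javascript
// Returns a function that records one fallback event and reports whether
// the threshold has been reached within the sliding window.
function createFallbackMonitor({ threshold = 3, windowMs = 10 * 60 * 1000 } = {}) {
  let timestamps = [];
  return function recordFallback(now = Date.now()) {
    // Drop events that have fallen out of the window, then add the new one.
    timestamps = timestamps.filter((t) => now - t < windowMs);
    timestamps.push(now);
    return timestamps.length >= threshold; // true: escalate (e.g. to PagerDuty)
  };
}

// Usage: three fallbacks within two minutes trigger the alert.
const recordFallback = createFallbackMonitor();
console.log(recordFallback(0));       // false
console.log(recordFallback(60000));   // false
console.log(recordFallback(120000));  // true
```

As written, the monitor keeps returning `true` for every further fallback inside the window, so a real alerting setup would likely add de-duplication to avoid paging repeatedly for the same incident.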
Practical tip: Implement simple retries first (3×, exponential backoff) and a default fallback. As a rule of thumb, this catches roughly 95% of errors. Add circuit breakers and a DLQ once your system is running in production and you can see real error patterns.