Alignment Score & Guardrails

How do you ensure your AI agents do what they're supposed to — and nothing more? OpenClaw calculates an alignment score for each agent and provides configurable guardrails that prevent undesired behavior in real time.

Defining Alignment Metrics

The alignment score is composed of several measurable dimensions:

Dimension	Description	Weight
Task Fidelity	Does the agent fulfill its defined task?	30%
Policy Compliance	Does the agent follow defined guidelines?	25%
Output Safety	Are outputs safe and appropriate?	20%
Scope Adherence	Does the agent stay within its scope?	15%
Consistency	Are responses consistent over time?	10%

Configuration

# alignment-config.yml
alignment:
  agent: support-agent-v3
  dimensions:
    task_fidelity:
      weight: 0.30
      evaluator: llm-judge
      prompt: "Did the agent correctly answer the customer inquiry?"
      sample_rate: 0.2

    policy_compliance:
      weight: 0.25
      evaluator: rule-based
      rules:
        - no_competitor_mentions
        - no_price_promises
        - escalate_legal_questions
        - use_approved_language

    output_safety:
      weight: 0.20
      evaluator: classifier
      checks: [toxicity, bias, pii_leak, hallucination]

    scope_adherence:
      weight: 0.15
      evaluator: intent-classifier
      allowed_intents: [support, billing, product_info, escalation]
      forbidden_intents: [medical_advice, legal_advice, financial_advice]

    consistency:
      weight: 0.10
      evaluator: embedding-similarity
      baseline: last_30_days
      threshold: 0.85

Automatic Alignment Scoring

OpenClaw calculates the score continuously and shows the trend:

Alignment Score: Support Agent v3.1
═══════════════════════════════════
Current:       0.91 / 1.00
7-day trend:   ████████████████████░ 0.91 (stable)
30-day trend:  ██████████████████░░░ 0.89 → 0.91 (↑)

Breakdown:
  Task Fidelity:     0.94 ✅
  Policy Compliance:  0.88 ⚠️ (2 violations this week)
  Output Safety:     0.96 ✅
  Scope Adherence:   0.90 ✅
  Consistency:       0.87 ⚠️ (slight drift)

Guardrail Configuration

Guardrails are real-time filters positioned between the agent and the end user:

Input Guardrails (before the agent)

input_guardrails:
  - name: prompt-injection-detection
    type: classifier
    action: block
    message: "This request cannot be processed."

  - name: topic-filter
    type: keyword + semantic
    blocked_topics: [weapons, illegal_activity, self_harm]
    action: block

  - name: pii-input-scan
    type: pii-detector
    action: mask_and_continue

Output Guardrails (after the agent)

output_guardrails:
  - name: hallucination-check
    type: grounded-check
    sources: [knowledge_base]
    threshold: 0.8
    action: fallback_response

  - name: toxicity-filter
    type: classifier
    threshold: 0.1
    action: block_and_alert

  - name: pii-output-scan
    type: pii-detector
    action: mask_before_delivery

  - name: competitor-mention-check
    type: keyword
    blocked_terms: [CompetitorA, CompetitorB]
    action: rephrase

Drift Detection

Alignment drift is a gradual deterioration of agent behavior:

Causes of Drift

Model updates — The LLM provider updates the model
Context changes — New documents in the knowledge base
Usage changes — New user groups with different queries
Prompt erosion — Incremental adjustments degrade quality

Drift Alerting

drift_detection:
  baseline_period: 30d
  check_interval: 1h
  alerts:
    - dimension: any
      drop: ">0.05"  # 5% drop
      severity: warning
    - dimension: any
      drop: ">0.10"  # 10% drop
      severity: critical
      action: auto_pause_agent

Key takeaway: Alignment is not a one-time setup — it requires continuous monitoring. Agents can slowly drift over weeks without individual interactions standing out. Only systematic scoring makes this drift visible.