Lesson 5 of 6·10 min read

Quality Assurance & Testing

AI agents are non-deterministic — the same input can produce different outputs. OpenClaw provides a QA framework that enables automated quality checks, regression testing, and A/B testing for agent configurations.

Automated Quality Checks

Evaluation Pipelines

OpenClaw runs automatic quality checks on every trace:

# quality-checks.yml
evaluations:
  - name: factual-accuracy
    type: llm-judge
    model: gpt-4o
    prompt: "Is the agent's response factually correct?"
    threshold: 0.9
    sample_rate: 0.1  # 10% of all traces

  - name: tone-consistency
    type: classifier
    labels: [professional, casual, inappropriate]
    expected: professional
    threshold: 0.95

  - name: response-completeness
    type: checklist
    required_elements: [greeting, answer, next_steps]
    threshold: 1.0

  - name: hallucination-detection
    type: grounded-check
    sources: [knowledge_base, tool_outputs]
    threshold: 0.95

Calculating Quality Score

OpenClaw aggregates individual checks into a quality score:

Quality Score: Support Agent v3.1
─────────────────────────────────
Factual Accuracy:       94.2% ✅
Tone Consistency:       97.8% ✅
Response Completeness:  89.1% ⚠️
Hallucination Rate:     2.1%  ✅
───────────────────────────────────
Overall Score:          93.8/100
Trend:                  +1.2 vs. previous week

Regression Testing for Prompts

Defining Test Suites

from openclaw.testing import TestSuite, TestCase

suite = TestSuite("support-agent-regression")

suite.add_case(TestCase(
    name="billing-question",
    input="How can I download my invoice?",
    expected_intent="billing",
    expected_contains=["invoice", "dashboard", "download"],
    expected_not_contains=["I'm not sure", "I don't know"],
    max_tokens=500
))

suite.add_case(TestCase(
    name="escalation-trigger",
    input="I want to cancel my contract immediately and get my money back!",
    expected_intent="escalation",
    expected_action="transfer_to_human",
    max_latency_ms=3000
))

CI/CD Integration

# .github/workflows/agent-test.yml
- name: Run Agent Regression Tests
  run: |
    openclaw test run --suite support-agent-regression
    openclaw test run --suite order-agent-regression
  env:
    OPENCLAW_API_KEY: ${{ secrets.OPENCLAW_API_KEY }}

Regression tests run automatically on:

  • Prompt changes — Every system prompt change is tested
  • Model updates — Switching to a new model version
  • Tool changes — New or modified tool definitions
  • Guardrail updates — Changes to safety rules

A/B Testing Agent Configurations

OpenClaw enables controlled experiments with agent variants:

# ab-test.yml
experiment:
  name: "prompt-v3-vs-v4"
  agent: support-agent
  traffic_split: 50/50
  duration: 7d

  variants:
    - name: control
      prompt_version: v3.1
    - name: treatment
      prompt_version: v4.0

  metrics:
    primary: task_completion_rate
    secondary: [latency_p95, cost_per_interaction, user_satisfaction]

  stopping_rules:
    - metric: error_rate
      threshold: ">5%"
      action: stop_experiment

Results Analysis

OpenClaw automatically calculates statistical significance:

MetricControl (v3.1)Treatment (v4.0)Differencep-value
Task Completion94.2%96.8%+2.6%0.003 ✅
P95 Latency2,800ms2,600ms-200ms0.041 ✅
Cost/InteractionEUR 0.012EUR 0.014+EUR 0.0020.12 ❌
User Satisfaction4.1/54.3/5+0.20.028 ✅

Performance Benchmarks

OpenClaw tracks performance benchmarks over time:

  • Baseline definition — Define a baseline at agent launch
  • Continuous benchmarking — Automatic measurement against the baseline
  • Degradation alerts — Warning when performance drops by more than X%
  • Version comparison — Side-by-side comparison of different agent versions

Best Practice: Create at least 20 regression test cases per agent covering the most common scenarios. Run these automatically on every prompt change. An agent without tests is an agent that will silently degrade.