AI agents are non-deterministic — the same input can produce different outputs. OpenClaw provides a QA framework that enables automated quality checks, regression testing, and A/B testing for agent configurations.
OpenClaw runs automatic quality checks on every trace:
# quality-checks.yml
evaluations:
- name: factual-accuracy
type: llm-judge
model: gpt-4o
prompt: "Is the agent's response factually correct?"
threshold: 0.9
sample_rate: 0.1 # 10% of all traces
- name: tone-consistency
type: classifier
labels: [professional, casual, inappropriate]
expected: professional
threshold: 0.95
- name: response-completeness
type: checklist
required_elements: [greeting, answer, next_steps]
threshold: 1.0
- name: hallucination-detection
type: grounded-check
sources: [knowledge_base, tool_outputs]
threshold: 0.95
OpenClaw aggregates individual checks into a quality score:
Quality Score: Support Agent v3.1
─────────────────────────────────
Factual Accuracy: 94.2% ✅
Tone Consistency: 97.8% ✅
Response Completeness: 89.1% ⚠️
Hallucination Rate: 2.1% ✅
───────────────────────────────────
Overall Score: 93.8/100
Trend: +1.2 vs. previous week
from openclaw.testing import TestSuite, TestCase
suite = TestSuite("support-agent-regression")
suite.add_case(TestCase(
name="billing-question",
input="How can I download my invoice?",
expected_intent="billing",
expected_contains=["invoice", "dashboard", "download"],
expected_not_contains=["I'm not sure", "I don't know"],
max_tokens=500
))
suite.add_case(TestCase(
name="escalation-trigger",
input="I want to cancel my contract immediately and get my money back!",
expected_intent="escalation",
expected_action="transfer_to_human",
max_latency_ms=3000
))
# .github/workflows/agent-test.yml
- name: Run Agent Regression Tests
run: |
openclaw test run --suite support-agent-regression
openclaw test run --suite order-agent-regression
env:
OPENCLAW_API_KEY: ${{ secrets.OPENCLAW_API_KEY }}
Regression tests run automatically on:
OpenClaw enables controlled experiments with agent variants:
# ab-test.yml
experiment:
name: "prompt-v3-vs-v4"
agent: support-agent
traffic_split: 50/50
duration: 7d
variants:
- name: control
prompt_version: v3.1
- name: treatment
prompt_version: v4.0
metrics:
primary: task_completion_rate
secondary: [latency_p95, cost_per_interaction, user_satisfaction]
stopping_rules:
- metric: error_rate
threshold: ">5%"
action: stop_experiment
OpenClaw automatically calculates statistical significance:
| Metric | Control (v3.1) | Treatment (v4.0) | Difference | p-value |
|---|---|---|---|---|
| Task Completion | 94.2% | 96.8% | +2.6% | 0.003 ✅ |
| P95 Latency | 2,800ms | 2,600ms | -200ms | 0.041 ✅ |
| Cost/Interaction | EUR 0.012 | EUR 0.014 | +EUR 0.002 | 0.12 ❌ |
| User Satisfaction | 4.1/5 | 4.3/5 | +0.2 | 0.028 ✅ |
OpenClaw tracks performance benchmarks over time:
Best Practice: Create at least 20 regression test cases per agent covering the most common scenarios. Run these automatically on every prompt change. An agent without tests is an agent that will silently degrade.