"The RAG pipeline works" isn't a sufficient statement. Without systematic evaluation, you don't know whether your pipeline hallucinates, uses irrelevant contexts, or delivers correct answers. RAG evaluation is complex — but indispensable.
| Metric | What is measured? | Question |
|---|---|---|
| Faithfulness | Is the answer faithful to the context? | Does the answer invent information? |
| Answer Relevance | Does the answer address the question? | Is the answer useful? |
| Context Precision | Are retrieved contexts relevant? | Is noise minimized? |
| Context Recall | Were all necessary contexts found? | Is important information missing? |
Question ──▶ Retriever ──▶ Context ──▶ LLM ──▶ Answer
│ │ │
│ Context Precision ◀───┘ │
│ Context Recall ◀──────┘ │
│ │
│ Answer Relevance ◀─────────────────────────┘
│ Faithfulness ◀──────── Context + Answer ───┘
RAGAS (Retrieval Augmented Generation Assessment) is the standard framework for RAG evaluation:
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall
)
# Test dataset
eval_dataset = {
"question": ["What is GDPR?", "How do you calculate ROI?"],
"answer": [generated_answer_1, generated_answer_2],
"contexts": [retrieved_contexts_1, retrieved_contexts_2],
"ground_truth": ["GDPR is...", "ROI = ..."]
}
results = evaluate(
dataset=eval_dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results)
# {'faithfulness': 0.87, 'answer_relevancy': 0.92, ...}
A golden dataset contains questions with expected answers and relevant contexts:
golden_dataset = [
{
"question": "What notice period applies during probation?",
"expected_answer": "2 weeks",
"expected_sources": ["hr/employment-contract.pdf#page=4"],
"category": "hr"
},
# ... at least 50-100 entries
]
def test_rag_pipeline():
results = []
for item in golden_dataset:
answer = rag_pipeline.invoke(item["question"])
results.append({
"question": item["question"],
"expected": item["expected_answer"],
"actual": answer,
"faithfulness": evaluate_faithfulness(answer, retrieved_context),
"relevance": evaluate_relevance(answer, item["question"])
})
avg_faithfulness = mean([r["faithfulness"] for r in results])
avg_relevance = mean([r["relevance"] for r in results])
assert avg_faithfulness > 0.85, f"Faithfulness too low: {avg_faithfulness}"
assert avg_relevance > 0.80, f"Relevance too low: {avg_relevance}"
An LLM evaluates the quality of RAG answers:
judge_prompt = """
Rate the following answer on a scale of 1-5:
Question: {question}
Context: {context}
Answer: {answer}
Criteria:
- Correctness (1-5): Does the answer match the context?
- Completeness (1-5): Are all relevant aspects covered?
- Clarity (1-5): Is the answer clearly formulated?
Return a JSON rating.
"""
Practical tip: Create a golden dataset with at least 50 questions from your real use case. Run evaluations after every pipeline change (prompts, chunk size, model). Automated evaluation is your safety net against regressions.