Lesson 5 of 6 · 10 min read

Evaluation & Quality

"The RAG pipeline works" is not a sufficient statement. Without systematic evaluation, you cannot tell whether your pipeline hallucinates, retrieves irrelevant context, or delivers correct answers. RAG evaluation is complex but indispensable.

RAG Evaluation Dimensions

The Four Core Metrics

Metric             | What is measured?                        | Key question
-------------------+------------------------------------------+-------------------------------------
Faithfulness       | Is the answer faithful to the context?   | Does the answer invent information?
Answer Relevance   | Does the answer address the question?    | Is the answer useful?
Context Precision  | Are retrieved contexts relevant?         | Is noise minimized?
Context Recall     | Were all necessary contexts found?       | Is important information missing?
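To build intuition for the two retrieval metrics, the sketch below computes plain set-based precision and recall over chunk identifiers. This is a deliberate simplification and not how RAGAS scores context precision or recall (RAGAS relies on LLM judgments rather than exact ID matching). The function name `retrieval_precision_recall` and the sample IDs are hypothetical.

```python
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall over document/chunk identifiers."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant
    # Precision: share of retrieved chunks that are relevant (low value = noise).
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant chunks that were retrieved (low value = gaps).
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 chunks retrieved, 2 of them relevant, 1 relevant chunk missed.
precision, recall = retrieval_precision_recall(
    retrieved_ids=["doc-1", "doc-2", "doc-3", "doc-4"],
    relevant_ids=["doc-1", "doc-3", "doc-7"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.50 recall=0.67
```

A retriever that returns many chunks tends to raise recall at the cost of precision, which is why the two are tracked separately.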

Evaluation Framework

Question ──▶ Retriever ──▶ Context ──▶ LLM ──▶ Answer
  │                          │                    │
  │    Context Precision ◀───┘                    │
  │    Context Recall ◀──────┘                    │
  │                                               │
  │    Answer Relevance ◀─────────────────────────┘
  │    Faithfulness ◀──────── Context + Answer ───┘

RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is a widely used open-source framework for RAG evaluation:

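from datasets import Dataset  # Hugging Face datasets library
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Test dataset. generated_answer_* and retrieved_contexts_* are placeholders for
# your pipeline's output; each "contexts" entry is a list of strings.
# Column names follow the classic RAGAS schema and may differ between versions.
eval_dataset = Dataset.from_dict({
    "question": ["What is GDPR?", "How do you calculate ROI?"],
    "answer": [generated_answer_1, generated_answer_2],
    "contexts": [retrieved_contexts_1, retrieved_contexts_2],
    "ground_truth": ["GDPR is...", "ROI = ..."],
})

# The metrics call an LLM, so a model provider must be configured before running.
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# Example output: {'faithfulness': 0.87, 'answer_relevancy': 0.92, ...}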

Automated Testing

Create a Golden Dataset

A golden dataset contains questions with expected answers and relevant contexts:

golden_dataset = [
    {
        "question": "What notice period applies during probation?",
        "expected_answer": "2 weeks",
        "expected_sources": ["hr/employment-contract.pdf#page=4"],
        "category": "hr"
    },
    # ... at least 50-100 entries
]
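Because every later test depends on this dataset, it can help to check its shape before use. The following is a minimal sketch, assuming entries use the four keys shown above; the helper names `validate_golden_dataset` and `category_counts` are hypothetical, and the second sample entry is incomplete on purpose.

```python
from collections import Counter

REQUIRED_KEYS = {"question", "expected_answer", "expected_sources", "category"}


def validate_golden_dataset(entries):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    seen_questions = set()
    for index, entry in enumerate(entries):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {index}: missing keys {sorted(missing)}")
            continue
        if not entry["question"].strip() or not entry["expected_answer"].strip():
            problems.append(f"entry {index}: empty question or expected answer")
        if not entry["expected_sources"]:
            problems.append(f"entry {index}: no expected sources")
        # Treat questions that differ only in case or outer whitespace as duplicates.
        normalized = entry["question"].strip().lower()
        if normalized in seen_questions:
            problems.append(f"entry {index}: duplicate question")
        seen_questions.add(normalized)
    return problems


def category_counts(entries):
    """Count entries per category to spot under-covered topics."""
    return Counter(entry.get("category", "uncategorized") for entry in entries)


sample = [
    {
        "question": "What notice period applies during probation?",
        "expected_answer": "2 weeks",
        "expected_sources": ["hr/employment-contract.pdf#page=4"],
        "category": "hr",
    },
    {"question": "How do you calculate ROI?", "category": "finance"},  # incomplete on purpose
]
print(validate_golden_dataset(sample))
print(category_counts(sample))
```

The per-category counts make it easier to see whether the dataset reflects the mix of questions in your real use case.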

Automated Pipeline Tests

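from statistics import mean

# Assumptions: rag_pipeline.invoke() returns a dict with "answer" and "contexts";
# evaluate_faithfulness() and evaluate_relevance() are your own scoring helpers
# that return values between 0 and 1.
def test_rag_pipeline():
    results = []
    for item in golden_dataset:
        output = rag_pipeline.invoke(item["question"])
        answer = output["answer"]
        retrieved_context = output["contexts"]
        results.append({
            "question": item["question"],
            "expected": item["expected_answer"],
            "actual": answer,
            "faithfulness": evaluate_faithfulness(answer, retrieved_context),
            "relevance": evaluate_relevance(answer, item["question"])
        })

    avg_faithfulness = mean(r["faithfulness"] for r in results)
    avg_relevance = mean(r["relevance"] for r in results)

    # Fail the test run when average quality drops below the agreed thresholds.
    assert avg_faithfulness > 0.85, f"Faithfulness too low: {avg_faithfulness:.2f}"
    assert avg_relevance > 0.80, f"Relevance too low: {avg_relevance:.2f}"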

LLM-as-Judge

An LLM evaluates the quality of RAG answers:

judge_prompt = """
Rate the following answer on a scale of 1-5:

Question: {question}
Context: {context}
Answer: {answer}

Criteria:
- Correctness (1-5): Does the answer match the context?
- Completeness (1-5): Are all relevant aspects covered?
- Clarity (1-5): Is the answer clearly formulated?

Return only a JSON object with the integer fields correctness, completeness, and clarity.
"""
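Judge output is free text, so it should be parsed and validated before the scores are used. The sketch below assumes the judge replies with a JSON object whose keys are the lowercase criterion names `correctness`, `completeness`, and `clarity`; that key naming and the helper `parse_judge_rating` are assumptions for illustration, not part of any library.

```python
import json
import re

CRITERIA = ("correctness", "completeness", "clarity")


def parse_judge_rating(raw_response):
    """Extract and validate the judge's JSON rating; raise ValueError if unusable."""
    # Judges often wrap JSON in prose or code fences, so take the text from the
    # first "{" to the last "}" and try to parse that.
    match = re.search(r"\{.*\}", raw_response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge response")
    try:
        rating = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON from judge: {exc}") from exc

    scores = {}
    for criterion in CRITERIA:
        value = rating.get(criterion)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{criterion!r} must be an integer from 1 to 5, got {value!r}")
        scores[criterion] = value
    return scores


# Hypothetical judge output with surrounding chatter.
raw = 'Here is my rating:\n{"correctness": 5, "completeness": 3, "clarity": 4}'
print(parse_judge_rating(raw))
# prints: {'correctness': 5, 'completeness': 3, 'clarity': 4}
```

Raising on malformed ratings, rather than silently defaulting to a score, keeps broken judge responses from skewing the averages.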

Advantages of LLM-as-Judge

  • Scalable: Hundreds of evaluations without human reviewers
  • Consistent: Uniform evaluation criteria
  • Fast: Results in minutes instead of days

Limitations

  • LLM bias: The evaluating LLM has its own biases, for example a preference for longer answers
  • Hallucination: The judge itself can hallucinate
  • Calibration: Judge scores are only trustworthy if you regularly compare them with human evaluations
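One simple way to run the calibration check is to have humans rate a sample of answers on the same 1-5 scale and compare the paired scores. The following sketch assumes such paired ratings exist; the function `judge_human_agreement`, the `tolerance` parameter, and the sample ratings are hypothetical.

```python
from statistics import mean


def judge_human_agreement(judge_scores, human_scores, tolerance=1):
    """Compare paired 1-5 ratings from the judge and from human reviewers."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        # Share of answers where judge and human gave the identical score.
        "exact_agreement": mean(1.0 if d == 0 else 0.0 for d in diffs),
        # Share of answers where the scores differ by at most `tolerance` points.
        "within_tolerance": mean(1.0 if abs(d) <= tolerance else 0.0 for d in diffs),
        "mean_abs_diff": mean(abs(d) for d in diffs),
        # Positive bias means the judge rates higher than humans on average.
        "bias": mean(diffs),
    }


# Hypothetical ratings for the same five answers.
judge = [5, 4, 4, 2, 5]
human = [4, 4, 3, 2, 3]
print(judge_human_agreement(judge, human))
```

A consistently positive or negative bias suggests adjusting the judge prompt or thresholds before trusting the automated scores.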

Practical tip: Create a golden dataset with at least 50 questions from your real use case. Run evaluations after every pipeline change (prompts, chunk size, model). Automated evaluation is your safety net against regressions.
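To catch regressions after a pipeline change, the scores of the new run can be compared with a stored baseline. This is a minimal sketch under stated assumptions: the function `find_regressions`, the `max_drop` tolerance of 0.02, and the sample scores are all hypothetical and should be adapted to your own metrics.

```python
def find_regressions(baseline, current, max_drop=0.02):
    """Return metrics whose score fell by more than max_drop compared with the baseline."""
    regressions = {}
    for metric, old_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            regressions[metric] = (old_score, None)  # metric is missing from the new run
        elif old_score - new_score > max_drop:
            regressions[metric] = (old_score, new_score)
    return regressions


# Hypothetical scores before and after a chunk-size change.
baseline = {"faithfulness": 0.87, "answer_relevancy": 0.92, "context_recall": 0.81}
current = {"faithfulness": 0.88, "answer_relevancy": 0.85, "context_recall": 0.80}

regressions = find_regressions(baseline, current)
for metric, (old, new) in regressions.items():
    print(f"{metric}: {old} -> {new}")
# prints: answer_relevancy: 0.92 -> 0.85
```

The small tolerance absorbs run-to-run noise from LLM-based metrics, so only meaningful drops block a change.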