Lesson 4 of 5 · 10 min read

Evaluation and Benchmarking

A fine-tuned model may feel better, but is it actually better? Without systematic evaluation, you're flying blind. This guide shows how to make model quality measurable and comparable.

Metrics by Use Case

Text Generation

Metric      | What It Measures                | Tool
BLEU        | N-gram match with reference     | SacreBLEU
ROUGE       | Recall-based overlap            | rouge-score
BERTScore   | Semantic similarity             | bert-score
Human Eval  | Human judgment (gold standard)  | Custom
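
To make "overlap with a reference" concrete, here is a minimal sketch of a ROUGE-1-style unigram score in plain Python. The function name `unigram_overlap` and the whitespace tokenization are assumptions made for illustration; the tools listed above handle tokenization, stemming, and longer n-grams and should be preferred for reported numbers.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> dict:
    """ROUGE-1-style precision, recall and F1 on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative sentence pair: 5 of the 6 reference tokens are matched.
scores = unigram_overlap("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

A score like this rewards surface overlap only, which is why embedding-based metrics and human judgment appear in the same table.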

Classification

  • Accuracy, Precision, Recall, F1 Score
  • Confusion Matrix for error analysis
  • AUC-ROC for threshold-independent comparison; the ROC curve itself helps choose a decision threshold
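
As a rough illustration of how these numbers relate, the sketch below derives the confusion-matrix counts and the four headline metrics for a binary task. `binary_report` is a hypothetical helper and the labels are made up; in practice a library such as scikit-learn would normally do this.

```python
from collections import Counter


def binary_report(y_true: list[int], y_pred: list[int]) -> dict:
    """Confusion-matrix counts plus accuracy, precision, recall and F1 for label 1."""
    counts = Counter(zip(y_true, y_pred))  # keys are (true, predicted) pairs
    tp, fp = counts[(1, 1)], counts[(0, 1)]
    fn, tn = counts[(1, 0)], counts[(0, 0)]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
report = binary_report([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1])
print(report)
```

The confusion counts are the part worth reading first: they show which kind of error dominates before any single score hides it.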

Domain-Specific

  • Medicine: Accuracy on medical benchmarks (MedQA, PubMedQA)
  • Legal: Precision in contract analysis
  • Code: pass@k on HumanEval, functional tests
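
For the code metric, pass@k is usually estimated per problem from n generated samples of which c pass the unit tests, then averaged over problems. The sketch below uses the commonly cited combinatorial estimator, 1 - C(n-c, k) / C(n, k); the function name and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    # math.comb returns 0 when k > n - c, which makes the estimate 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples for one problem, 3 of them pass.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```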

The Evaluation Setup

1. Prepare Test Set

  • Never use training data for testing (data leakage!)
  • 80/10/10 split as a starting point: Training / Validation / Test
  • Test set should reflect real-world distribution
  • Include edge cases and adversarial examples
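
The split and the leakage rule above can be sketched as follows. This is a minimal version that assumes examples are plain strings and only removes exact duplicates; near-duplicates and paraphrases need a fuzzier check. `split_dataset` and the sample data are hypothetical.

```python
import random


def split_dataset(examples: list[str], seed: int = 42) -> tuple[list[str], list[str], list[str]]:
    """Deduplicate, shuffle, and split into 80/10/10 train/validation/test."""
    unique = list(dict.fromkeys(examples))  # drop exact duplicates, keep order
    random.Random(seed).shuffle(unique)     # fixed seed keeps the split reproducible
    n_train = len(unique) * 8 // 10
    n_val = len(unique) // 10
    train = unique[:n_train]
    val = unique[n_train:n_train + n_val]
    test = unique[n_train + n_val:]
    return train, val, test


# 100 distinct examples plus two verbatim duplicates that must not leak across splits.
data = [f"example {i}" for i in range(100)] + ["example 0", "example 1"]
train, val, test = split_dataset(data)

# Leakage check: no test example may appear in the training or validation data.
assert set(test).isdisjoint(train) and set(test).isdisjoint(val)
print(len(train), len(val), len(test))
```

Deduplicating before the split matters: if duplicates are removed afterwards, one copy can already sit in training while the other sits in the test set.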

2. Define Baseline

Always compare against:

  • The base model (without fine-tuning)
  • The best prompt engineering approach
  • RAG-based solution if applicable
  • Previous fine-tuning (for updates)

3. A/B Testing

                   ┌─ Model A (Baseline) ──────────────┐
Traffic (50/50) ──┤                                     ├─ Compare
                   └─ Model B (Fine-Tuned) ────────────┘

Metrics: Accuracy, latency, user satisfaction, cost
Duration: At least 1 week, and long enough to collect a sample large enough for a statistically significant result
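
One way to check whether an observed difference between the two arms is more than noise is a two-proportion z-test on a success metric such as "answer accepted by the user". The sketch below uses only the standard library; the function name and all counts are hypothetical, and latency or cost would need a different test.

```python
from math import erf, sqrt


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical all-success or all-failure arms
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Hypothetical traffic: 1,000 requests per arm, 41% vs 46% accepted answers.
z, p = two_proportion_z_test(success_a=410, n_a=1000, success_b=460, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Deciding the sample size and significance level before the test starts avoids stopping early on a lucky streak.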

4. Regression Tests

Fine-tuning on task A can reduce performance on task B (catastrophic forgetting):

  • Before training: Create benchmark on tasks A, B, C
  • After training: Run all benchmarks again
  • Threshold: Maximum 5% degradation on other tasks (decide up front whether this means relative or absolute)
  • Fix: Diverse training data, multi-task training, regularization
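
The before/after comparison can be automated with a small gate like the one below. It assumes the 5% threshold is a relative drop and that higher scores are better; `find_regressions` and the scores are made up for illustration.

```python
def find_regressions(before: dict[str, float], after: dict[str, float],
                     max_relative_drop: float = 0.05) -> dict[str, float]:
    """Return tasks whose score dropped by more than max_relative_drop."""
    regressions = {}
    for task, old in before.items():
        new = after[task]
        # Relative drop; a zero baseline cannot regress further.
        drop = (old - new) / old if old > 0 else 0.0
        if drop > max_relative_drop:
            regressions[task] = drop
    return regressions


# Hypothetical benchmark scores before and after fine-tuning on task A.
before = {"task_a": 0.70, "task_b": 0.80, "task_c": 0.60}
after = {"task_a": 0.85, "task_b": 0.72, "task_c": 0.59}
failed = find_regressions(before, after)
print(failed)
```

Here task B loses about 10% relative to its baseline and is flagged, while the small dip on task C stays within the threshold.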

Experiment Tracking

Every training run must be documented:

Parameter      | Example
Model          | Llama 3.1 70B
Method         | QLoRA (r=16, alpha=32)
Data           | v2.3, 1,500 examples
Epochs         | 3
Learning Rate  | 2e-4
Result         | F1: 0.87, BLEU: 0.42

Tools: Weights & Biases, MLflow, Neptune
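
If none of these tools is set up yet, even an append-only log file captures the same fields. The following is a tool-agnostic sketch using only the standard library; `RunRecord`, `log_run`, and the file name `runs.jsonl` are hypothetical and not part of any of the tools named above.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class RunRecord:
    model: str
    method: str
    data_version: str
    epochs: int
    learning_rate: float
    metrics: dict = field(default_factory=dict)


def log_run(record: RunRecord, path: Path) -> None:
    """Append one run as a JSON line so earlier runs are never overwritten."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


# Values taken from the example table; a temporary directory stands in for a project folder.
log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(RunRecord("Llama 3.1 70B", "QLoRA (r=16, alpha=32)", "v2.3", 3, 2e-4,
                  {"f1": 0.87, "bleu": 0.42}), log_path)
runs = [json.loads(line) for line in log_path.read_text(encoding="utf-8").splitlines()]
print(runs)
```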

Common Evaluation Mistakes

  • ❌ Only looking at the loss curve (it says little about real-world quality)
  • ❌ No baseline comparison (everything "feels" good)
  • ❌ Test set too small or not representative
  • ❌ No regression tests after updates

Practical tip: Create an "Evaluation Playbook" with 50 test cases that you run after every training run. Automate what you can, but human evaluation remains essential for style and tone.
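
The automatable part of such a playbook could look roughly like this: each case pairs a prompt with a check function, and the runner reports the pass rate and the failing prompts. `run_playbook` and `fake_model` are hypothetical stand-ins; a real playbook would call your own inference function and hold around 50 cases.

```python
from typing import Callable


def run_playbook(generate: Callable[[str], str],
                 cases: list[tuple[str, Callable[[str], bool]]]) -> dict:
    """Run every (prompt, check) case and report the pass rate and failing prompts."""
    failures = [prompt for prompt, check in cases if not check(generate(prompt))]
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases) if cases else 0.0, "failures": failures}


# Stand-in for a real model call; replace with your inference function.
def fake_model(prompt: str) -> str:
    return "Paris" if "France" in prompt else "I don't know"


cases = [
    ("What is the capital of France?", lambda out: "Paris" in out),
    ("What is the capital of Peru?", lambda out: "Lima" in out),
]
result = run_playbook(fake_model, cases)
print(result)
```

The list of failing prompts is a natural starting point for the human review pass on style and tone.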