Evaluation and Benchmarking

A fine-tuned model may feel better, but is it actually better? Without systematic evaluation, you're flying blind. This guide shows how to make model quality measurable and comparable.

Metrics by Use Case

Text Generation

Metric	What It Measures	Tool
BLEU	N-gram precision against reference texts	SacreBLEU
ROUGE	Recall-oriented n-gram overlap with reference texts	rouge-score
BERTScore	Semantic similarity from contextual embeddings	bert-score
Human evaluation	Human judgment (gold standard)	Custom
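
To make "recall-based overlap" concrete, here is a minimal sketch of a ROUGE-1-style recall score. It is a simplified illustration, not the rouge-score library: it assumes whitespace tokenization and lowercasing, with no stemming, and the function name is hypothetical.

```python
from collections import Counter


def rouge1_recall(candidate: str, reference: str) -> float:
    """Share of reference unigrams that also occur in the candidate.

    Counts are clipped, so a word repeated in the reference only matches
    as often as it appears in the candidate.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())


# 5 of the 6 reference tokens are covered by the candidate ("is" is missing).
score = rouge1_recall("the cat sat on the mat", "the cat is on the mat")
print(f"ROUGE-1 recall: {score:.3f}")
```

For reported numbers, a maintained implementation is the safer choice, because tokenization and stemming details change the score.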

Classification

Accuracy, Precision, Recall, F1 Score
Confusion Matrix for error analysis
ROC curve for threshold selection, AUC-ROC as a threshold-independent summary
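
The classification metrics above all derive from the confusion matrix counts. The sketch below computes them by hand for a binary task. The labels are made-up illustration data and the helper name is an assumption, not taken from any particular library.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```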

Domain-Specific

Medicine: Accuracy on medical benchmarks (MedQA, PubMedQA)
Legal: Precision in contract analysis
Code: pass@k on HumanEval, functional correctness tests
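
For the code metric, pass@k is usually estimated from n sampled solutions per problem, of which c pass the unit tests. A minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), is shown below. The function name and the sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn from n, is correct.

    n: samples generated for the problem
    c: samples among them that passed the tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 10 samples for one problem, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
print(f"pass@1={p1:.3f} pass@5={p5:.3f}")
```

The benchmark score is the mean of this value over all problems.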

The Evaluation Setup

1. Prepare Test Set

Never use training data for testing (data leakage!)
80/10/10 split: Training / Validation / Test
Test set should reflect real-world distribution
Include edge cases and adversarial examples
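
The split rules above can be sketched as follows. This is a minimal example under stated assumptions: examples are hashable (plain strings here), exact duplicates are the only leakage source being handled, and the function name and seed are hypothetical.

```python
import random


def split_dataset(examples, seed=42, train=0.8, val=0.1):
    """Shuffle and split into train / validation / test (default 80/10/10)."""
    # Drop exact duplicates first so the same example cannot land in
    # both the training and the test set.
    items = list(dict.fromkeys(examples))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


# 100 distinct examples plus one duplicate that must not leak.
examples = [f"example {i}" for i in range(100)] + ["example 0"]
train_set, val_set, test_set = split_dataset(examples)

# Leakage check: no example appears in more than one split.
assert not set(train_set) & set(test_set)
assert not set(train_set) & set(val_set)
assert not set(val_set) & set(test_set)
print(len(train_set), len(val_set), len(test_set))
```

Near-duplicates (paraphrases, the same document with different formatting) are not caught by this check and need a separate similarity pass.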

2. Define Baseline

Always compare against:

The base model (without fine-tuning)
The best prompt engineering approach
RAG-based solution if applicable
Previous fine-tuning (for updates)

3. A/B Testing

                   ┌─ Model A (Baseline) ──────────────┐
Traffic (50/50) ──┤                                     ├─ Compare
                   └─ Model B (Fine-Tuned) ────────────┘

Metrics: Accuracy, latency, user satisfaction, cost
Duration: At least 1 week, and long enough to collect the traffic needed for a statistically significant comparison
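
Whether the observed difference between the two arms is statistically significant can be checked with a two-proportion z-test. The sketch below assumes a binary success signal per request (for example, a thumbs-up). The counts are invented for illustration and the function name is hypothetical.

```python
from math import erfc, sqrt


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Illustrative counts: baseline 400/1000 successes, fine-tuned 450/1000.
z, p_value = two_proportion_z_test(400, 1000, 450, 1000)
print(f"z={z:.2f} p={p_value:.4f}")
```

A common convention is to treat p below 0.05 as significant. The threshold should be fixed before the test starts, not after looking at the data.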

4. Regression Tests

Fine-tuning on task A can reduce performance on task B (catastrophic forgetting):

Before training: Create benchmark on tasks A, B, C
After training: Run all benchmarks again
Threshold: At most 5% degradation on any other task, measured against its pre-training score
Fix: Diverse training data, multi-task training, regularization
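
The before/after comparison can be automated with a small check. This sketch assumes that higher scores are better and reads the 5% threshold as a relative drop. Both are assumptions, and the task names and scores are invented.

```python
def find_regressions(before, after, max_relative_drop=0.05):
    """Return tasks whose score dropped by more than the allowed fraction."""
    regressions = {}
    for task, old_score in before.items():
        new_score = after[task]
        # Relative drop against the pre-training score; gains are negative.
        drop = (old_score - new_score) / old_score if old_score else 0.0
        if drop > max_relative_drop:
            regressions[task] = round(drop, 4)
    return regressions


# Illustrative scores: task_a was the fine-tuning target, b and c are guards.
before = {"task_a": 0.70, "task_b": 0.80, "task_c": 0.90}
after = {"task_a": 0.85, "task_b": 0.78, "task_c": 0.81}

regressions = find_regressions(before, after)
print(regressions)  # only task_c exceeds the 5% threshold
```

Running this in CI after every training run turns catastrophic forgetting from a surprise into a failed build.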

Experiment Tracking

Every training run must be documented:

Parameter	Example
Model	Llama 3.1 70B
Method	QLoRA (r=16, alpha=32)
Data	v2.3, 1,500 examples
Epochs	3
Learning Rate	2e-4
Result	F1: 0.87, BLEU: 0.42

Tools: Weights & Biases, MLflow, Neptune
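
If none of these tools is set up yet, even an append-only JSON Lines file captures the essentials. The sketch below is a stand-in for a real tracker, not the API of Weights & Biases, MLflow, or Neptune. The file name and helper functions are hypothetical, and the values mirror the example table above.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_run(path, params, metrics):
    """Append one training run as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "metrics": metrics,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_runs(path):
    """Read all logged runs back as a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A fresh temporary directory keeps the example self-contained.
log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(
    log_path,
    params={
        "model": "Llama 3.1 70B",
        "method": "QLoRA",
        "r": 16,
        "alpha": 32,
        "data": "v2.3, 1,500 examples",
        "epochs": 3,
        "learning_rate": 2e-4,
    },
    metrics={"f1": 0.87, "bleu": 0.42},
)
runs = load_runs(log_path)
best = max(runs, key=lambda run: run["metrics"]["f1"])
print(best["params"]["method"], best["metrics"])
```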

Common Evaluation Mistakes

❌ Looking only at the loss curve (it says little about real-world quality)
❌ No baseline comparison (everything "feels" good)
❌ Test set too small or not representative
❌ No regression tests after updates
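
How small is "too small"? A rough answer comes from the margin of error of a measured accuracy. The sketch below uses the normal approximation for a 95% interval. The accuracy value and test set sizes are illustrative, and the function name is hypothetical.

```python
from math import sqrt


def accuracy_margin(accuracy, n, z=1.96):
    """Approximate half-width of the confidence interval for an accuracy.

    Uses the normal approximation; z=1.96 corresponds to about 95%.
    """
    if n <= 0 or not 0 <= accuracy <= 1:
        raise ValueError("need n > 0 and accuracy between 0 and 1")
    return z * sqrt(accuracy * (1 - accuracy) / n)


# Same measured accuracy, very different certainty.
small = accuracy_margin(0.87, n=50)
large = accuracy_margin(0.87, n=1000)
print(f"n=50: +/-{small:.3f}   n=1000: +/-{large:.3f}")
```

With 50 examples the uncertainty is roughly nine percentage points, so a two-point improvement over the baseline cannot be distinguished from noise.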

Practical tip: Create an "Evaluation Playbook" with 50 test cases that you run after every training run. Automate what you can, but human evaluation remains essential for style and tone.
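
The automatable part of such a playbook can be as simple as a list of prompts with programmatic checks. The sketch below is one possible shape, not a prescribed format: the case structure, the stub model, and all prompts are hypothetical, and a real run would pass a function that calls the model under test.

```python
def run_playbook(model_fn, cases):
    """Run every case through the model and collect the failing case names."""
    if not cases:
        raise ValueError("playbook has no test cases")
    failures = [
        case["name"]
        for case in cases
        if not case["check"](model_fn(case["prompt"]))
    ]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures


def stub_model(prompt):
    """Stand-in for a real model call (assumption for this example)."""
    if "refund" in prompt.lower():
        return "Refunds are possible within 30 days."
    return "I am not sure."


cases = [
    {"name": "refund_policy", "prompt": "What is the refund window?",
     "check": lambda out: "30 days" in out},
    {"name": "non_empty", "prompt": "Hello",
     "check": lambda out: len(out.strip()) > 0},
    {"name": "shipping_time", "prompt": "How long does shipping take?",
     "check": lambda out: "business days" in out},
]

pass_rate, failures = run_playbook(stub_model, cases)
print(f"pass rate: {pass_rate:.0%}, failed: {failures}")
```

Cases that judge style or tone do not fit a simple check function and stay on the human review list.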