A fine-tuned model feels better — but is it really? Without systematic evaluation, you're flying blind. This guide shows how to make model quality measurable and comparable.
| Metric | What It Measures | Tool |
|---|---|---|
| BLEU | N-gram match with reference | SacreBLEU |
| ROUGE | Recall-based overlap | rouge-score |
| BERTScore | Semantic similarity | bert-score |
| Human evaluation | Human judgment (gold standard) | Custom |
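To make the overlap metrics in the table concrete, here is a minimal sketch of a ROUGE-1-style unigram overlap score. It is a simplified illustration, not the `rouge-score` implementation: it assumes lowercase whitespace tokenization and applies no stemming. The function name `unigram_overlap` is hypothetical.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> dict:
    """ROUGE-1-style precision, recall, and F1 on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = unigram_overlap("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

For reported numbers, prefer the libraries listed in the table, since tokenization and normalization choices change the scores.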
Always compare against:
                  ┌─ Model A (Baseline) ──────────────┐
Traffic (50/50) ──┤                                   ├─ Compare
                  └─ Model B (Fine-Tuned) ────────────┘
Metrics: Accuracy, latency, user satisfaction, costs
Duration: At least 1 week, with enough traffic for the differences to be statistically significant
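One way to decide whether an A/B difference is statistically significant is a two-proportion z-test on a binary success metric, such as "user accepted the answer". The sketch below assumes independent requests and large samples. The counts are hypothetical, and `two_proportion_z_test` is an illustrative helper, not part of any library.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("Both groups have identical all-or-nothing outcomes.")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 410/1000 successes for Model A, 460/1000 for Model B.
z, p_value = two_proportion_z_test(410, 1000, 460, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value only says the difference is unlikely to be noise. Whether the difference is large enough to justify the latency and cost of Model B is a separate decision.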
Fine-tuning on task A can reduce performance on task B (catastrophic forgetting):
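A simple guard is to score both models on a fixed set of held-out tasks and flag any task whose score drops by more than a tolerance. The sketch below assumes higher scores are better. The task names, scores, and the `tolerance` default are hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    finetuned: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return tasks whose score dropped by more than `tolerance` after fine-tuning."""
    regressions = {}
    for task, base_score in baseline.items():
        if task not in finetuned:
            continue  # Task was not re-evaluated; handle that case separately.
        drop = base_score - finetuned[task]
        if drop > tolerance:
            regressions[task] = round(drop, 4)
    return regressions


# Hypothetical scores on the target task and two held-out tasks.
baseline = {"target_task": 0.71, "general_qa": 0.80, "summarization": 0.64}
finetuned = {"target_task": 0.87, "general_qa": 0.74, "summarization": 0.63}
regressions = find_regressions(baseline, finetuned)
print(regressions)
```

Here the target task improves while `general_qa` falls outside the tolerance, which is the pattern catastrophic forgetting typically produces.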
Every training run must be documented:
| Parameter | Example |
|---|---|
| Model | Llama 3.1 70B |
| Method | QLoRA (r=16, alpha=32) |
| Data | v2.3, 1,500 examples |
| Epochs | 3 |
| Learning Rate | 2e-4 |
| Result | F1: 0.87, BLEU: 0.42 |
Tools: Weights & Biases, MLflow, Neptune
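If a tracking tool is not yet in place, the same fields can be captured with the standard library alone. The sketch below appends one JSON line per run. The `RunRecord` class and `append_run` helper are hypothetical names, and the values mirror the example table above.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class RunRecord:
    model: str
    method: str
    data: str
    epochs: int
    learning_rate: float
    results: dict = field(default_factory=dict)


def append_run(record: RunRecord, path: Path) -> None:
    """Append one run as a JSON line so the log stays easy to diff and parse."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


record = RunRecord(
    model="Llama 3.1 70B",
    method="QLoRA (r=16, alpha=32)",
    data="v2.3, 1,500 examples",
    epochs=3,
    learning_rate=2e-4,
    results={"f1": 0.87, "bleu": 0.42},
)
# Call append_run(record, Path("training_runs.jsonl")) after each training run.
```

A dedicated tracker adds dashboards, artifact storage, and run comparison on top of this, so treat the file log as a stopgap.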
Practical tip: Create an "Evaluation Playbook" with 50 test cases that you run after every training run. Automate whatever you can, but human evaluation remains essential for style and tone.
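The automated part of such a playbook can start as a small runner like the sketch below. It assumes each test case lists substrings the output must contain, which is a deliberately crude check. `run_playbook`, `fake_model`, and the case format are hypothetical.

```python
from typing import Callable


def run_playbook(model: Callable[[str], str], cases: list[dict]) -> dict:
    """Run every test case and report the pass rate plus the ids of failing cases."""
    failures = []
    for case in cases:
        output = model(case["prompt"])
        # Simplest possible check: every required substring appears in the output.
        if not all(s.lower() in output.lower() for s in case["must_contain"]):
            failures.append(case["id"])
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases) if cases else 0.0, "failures": failures}


# Stand-in for a real model call; replace it with your inference client.
def fake_model(prompt: str) -> str:
    return "Paris is the capital of France." if "France" in prompt else "I am not sure."


cases = [
    {"id": "geo-01", "prompt": "What is the capital of France?", "must_contain": ["Paris"]},
    {"id": "geo-02", "prompt": "What is the capital of Japan?", "must_contain": ["Tokyo"]},
]
report = run_playbook(fake_model, cases)
print(report)
```

Cases that judge style or tone do not fit a substring check and belong in the human review part of the playbook.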