Fine-tuning quality stands or falls with the training data. "Garbage in, garbage out" applies here more than almost anywhere else: a fine-tuned model reproduces the style, errors, and gaps of its examples, so good training data is the difference between a useful model and a useless one.
{"messages": [{"role": "system", "content": "You are a brand copywriter..."}, {"role": "user", "content": "Write a product text for..."}, {"role": "assistant", "content": "Discover..."}]}
{"messages": [{"role": "system", "content": "You are a brand copywriter..."}, {"role": "user", "content": "Describe our new..."}, {"role": "assistant", "content": "Innovation..."}]}
{"instruction": "Write a product text", "input": "Product: Smart Watch X1", "output": "The Smart Watch X1..."}
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}, {"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
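
The sample records above use three different layouts: a chat `messages` list, an `instruction`/`input`/`output` record (often called Alpaca-style), and a `conversations` list with `from`/`value` turns (often called ShareGPT-style). Training tools usually expect a single layout, so a small normalization step can help. The following is a minimal sketch, not a reference converter: the function name `to_messages`, the `ROLE_MAP` table, and the choice to join `instruction` and `input` with a blank line are assumptions made for illustration.

```python
import json

# Assumed mapping from ShareGPT-style speaker names to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def to_messages(record, system_prompt=None):
    """Normalize one training record to the chat 'messages' layout."""
    if "messages" in record:
        # Already in the target layout; returned unchanged.
        return record

    if "instruction" in record:
        # Instruction layout: instruction plus optional input form the user turn.
        user_text = record["instruction"]
        if record.get("input"):
            user_text += "\n\n" + record["input"]
        messages = [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": record["output"]},
        ]
    elif "conversations" in record:
        # Conversation layout: rename the speaker and text fields turn by turn.
        messages = [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    else:
        raise ValueError(f"Unrecognized record layout: {sorted(record)}")

    # Optionally prepend a system prompt if the record does not start with one.
    if system_prompt and (not messages or messages[0]["role"] != "system"):
        messages.insert(0, {"role": "system", "content": system_prompt})
    return {"messages": messages}


example = {
    "instruction": "Write a product text",
    "input": "Product: Smart Watch X1",
    "output": "The Smart Watch X1...",
}
print(json.dumps(to_messages(example, system_prompt="You are a brand copywriter...")))
```

Writing one converted record per line with `json.dumps` yields a JSONL file in the same shape as the first two sample records.
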
| Criterion | Description | Check |
|---|---|---|
| Correctness | Are the answers factually correct? | Expert review |
| Consistency | Same style and tone across all examples? | Style guide |
| Diversity | Do examples cover different scenarios? | Coverage matrix |
| Relevance | Do examples match the target use case? | Use-case alignment |
| Length | Do answers match the desired output length? | Token count |
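
Some of the checks in the table can be partly automated before an expert looks at the data. The sketch below covers only the mechanical ones: turn structure, answer length, and exact duplicates. It assumes records in the chat `messages` layout, uses a whitespace word count as a rough stand-in for a real token count, and the thresholds `min_words` and `max_words` are hypothetical defaults to be replaced with values that fit the target output length. Correctness and tone still need human review.

```python
def assistant_text(record):
    """Join the content of all assistant turns in one record."""
    return " ".join(
        m["content"] for m in record["messages"] if m["role"] == "assistant"
    )


def check_dataset(records, min_words=5, max_words=400):
    """Return (index, issue) pairs for records that fail a basic check."""
    issues = []
    seen = {}  # normalized answer text -> index of first occurrence
    for i, record in enumerate(records):
        roles = [m["role"] for m in record["messages"]]
        if not roles or "user" not in roles or roles[-1] != "assistant":
            issues.append((i, "needs a user turn and must end with an assistant turn"))
            continue

        answer = assistant_text(record)
        n_words = len(answer.split())  # rough proxy for token count
        if not min_words <= n_words <= max_words:
            issues.append(
                (i, f"answer length {n_words} words outside [{min_words}, {max_words}]")
            )

        # Exact-duplicate detection after lowercasing and collapsing whitespace.
        key = " ".join(answer.lower().split())
        if key in seen:
            issues.append((i, f"duplicate answer of record {seen[key]}"))
        else:
            seen[key] = i
    return issues
```

Running `check_dataset` on the loaded records returns an empty list when nothing is flagged; each flagged entry names the record index and the reason, which makes it easy to route problem records to a reviewer.
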
| Method | Quality | Cost | Scalability |
|---|---|---|---|
| Own experts | Very high | High | Low |
| Annotation services (Scale AI, Toloka) | High | Medium | High |
| LLM-assisted annotation | Medium | Low | Very high |
| Community/Crowdsourcing | Variable | Low | High |
Best practice: LLM draft plus expert review. An LLM creates a first draft, and a human expert reviews and corrects it before it enters the training set.
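
The draft-then-review workflow can be sketched as a small data structure that keeps the draft and the reviewed text separate, so only approved examples reach the training file. Everything here is illustrative: `Candidate`, `make_drafts`, `review`, and `export_approved` are hypothetical names, and `draft_fn` stands in for whatever LLM call is actually used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    prompt: str
    draft: str                    # text produced by the LLM
    final: Optional[str] = None   # text accepted or corrected by the reviewer
    approved: bool = False


def make_drafts(prompts: List[str], draft_fn: Callable[[str], str]) -> List[Candidate]:
    """Create one unreviewed candidate per prompt using the supplied drafting function."""
    return [Candidate(prompt=p, draft=draft_fn(p)) for p in prompts]


def review(candidate: Candidate, corrected_text: Optional[str] = None,
           reject: bool = False) -> Candidate:
    """Record the expert decision: reject, accept as is, or accept with corrections."""
    if reject:
        candidate.final = None
        candidate.approved = False
    else:
        candidate.final = corrected_text if corrected_text is not None else candidate.draft
        candidate.approved = True
    return candidate


def export_approved(candidates: List[Candidate]) -> List[dict]:
    """Convert approved candidates into chat-layout training records."""
    return [
        {"messages": [
            {"role": "user", "content": c.prompt},
            {"role": "assistant", "content": c.final},
        ]}
        for c in candidates
        if c.approved
    ]
```

Keeping `draft` and `final` side by side also makes it possible to measure how often reviewers change the draft, which is a useful signal for how much to trust unreviewed LLM output.
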
| Goal | Minimum | Recommended | Maximum |
|---|---|---|---|
| Style adaptation | 50 examples | 200–500 | 1,000 |
| Task specialization | 100 examples | 500–2,000 | 10,000 |
| Domain training | 500 examples | 2,000–10,000 | 100,000+ |
Practical tip: Start with 100 high-quality examples. Train, evaluate, then add new examples that target the weaknesses the evaluation reveals. 100 carefully reviewed examples beat 10,000 mediocre ones.