Preparing Training Data

The quality of a fine-tuned model stands or falls with its training data. "Garbage in, garbage out" applies here with full force: a model trained on flawed examples reproduces their flaws. Good training data is the difference between a useful model and a useless one.

Formats

JSONL (JSON Lines) — Standard Format

{"messages": [{"role": "system", "content": "You are a brand copywriter..."}, {"role": "user", "content": "Write a product text for..."}, {"role": "assistant", "content": "Discover..."}]}
{"messages": [{"role": "system", "content": "You are a brand copywriter..."}, {"role": "user", "content": "Describe our new..."}, {"role": "assistant", "content": "Innovation..."}]}
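Because every line must be a complete JSON object, a file in this format is easy to check before training. The following is a minimal sketch of such a check, not an official validator. The function name and the rule that each conversation should end with an assistant message are assumptions made for illustration.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_chat_jsonl(text):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        if not all(isinstance(m, dict) for m in messages):
            problems.append((number, "every message must be a JSON object"))
            continue
        for message in messages:
            if message.get("role") not in ALLOWED_ROLES:
                problems.append((number, f"unexpected role: {message.get('role')!r}"))
            content = message.get("content")
            if not isinstance(content, str) or not content.strip():
                problems.append((number, "empty or non-string content"))
        # Assumption: the last message is the assistant reply the model should learn.
        if messages[-1].get("role") != "assistant":
            problems.append((number, "last message is not from the assistant"))
    return problems
```

Running this over the full file before each training run catches broken lines early, when they are still cheap to fix.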

Alpaca Format (Open Source)

{"instruction": "Write a product text", "input": "Product: Smart Watch X1", "output": "The Smart Watch X1..."}
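If a dataset arrives in Alpaca format but the training pipeline expects the messages format shown earlier, a record can be converted mechanically. This is a sketch under assumptions: the function name is hypothetical, and joining "instruction" and "input" with a blank line is one reasonable choice, not a fixed rule.

```python
def alpaca_to_messages(record, system_prompt=None):
    """Convert one Alpaca-style record into a {"messages": [...]} record."""
    user_text = record["instruction"]
    # Assumption: a non-empty "input" is appended to the instruction after a blank line.
    if record.get("input"):
        user_text = f"{user_text}\n\n{record['input']}"
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}
```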

ShareGPT Format (Multi-Turn)

{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}, {"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
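ShareGPT records map onto the messages format turn by turn. The sketch below handles only the two sender names shown above, "human" and "gpt". The function name is hypothetical, and any other sender raises an error instead of being guessed at.

```python
# Mapping from the sender names shown above to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def sharegpt_to_messages(record):
    """Convert one ShareGPT-style record into a {"messages": [...]} record."""
    messages = []
    for turn in record["conversations"]:
        sender = turn["from"]
        if sender not in ROLE_MAP:
            # Assumption: unknown senders are rejected so a human can decide what they mean.
            raise ValueError(f"unknown sender: {sender!r}")
        messages.append({"role": ROLE_MAP[sender], "content": turn["value"]})
    return {"messages": messages}
```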

Ensuring Data Quality

The 5 Quality Criteria

Criterion	Description	Check
Correctness	Are the answers factually correct?	Expert review
Consistency	Same style and tone across all examples?	Style guide
Diversity	Do examples cover different scenarios?	Coverage matrix
Relevance	Do examples match the target use case?	Use-case alignment
Length	Do answers match the desired output length?	Token count
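The length criterion is the easiest one to automate. The sketch below counts whitespace-separated words as a crude stand-in for tokens; real token counts depend on the model's tokenizer and will differ. The function names and the example bounds are assumptions for illustration.

```python
def answer_lengths(records):
    """Word count of the final message in each {"messages": [...]} record."""
    # Assumption: words approximate tokens well enough to spot outliers.
    return [len(r["messages"][-1]["content"].split()) for r in records]

def length_outliers(records, low, high):
    """Indices of records whose answer length falls outside [low, high] words."""
    return [i for i, n in enumerate(answer_lengths(records)) if not low <= n <= high]
```

For example, `length_outliers(records, 20, 150)` would list every example whose answer is shorter than 20 or longer than 150 words, so those can be reviewed by hand.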

Common Quality Issues

❌ Answers pasted unedited from ChatGPT (the model learns a generic style)
❌ Contradictory answers to similar questions
❌ Too little diversity (only one topic, only one answer structure)
❌ Formatting inconsistencies (sometimes Markdown, sometimes plain text)
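Two of these issues can be flagged with simple heuristics. The sketch below finds prompts that appear more than once with different answers, and measures how many answers contain Markdown-like markup. It only catches identical prompts after normalization, not merely similar ones, and the regular expression is a rough assumption about what counts as Markdown.

```python
import re
from collections import defaultdict

# Assumption: headings, bullets, bold markers, or backticks indicate Markdown.
MARKDOWN_PATTERN = re.compile(r"(^#{1,6} |^[-*] |\*\*|`)", re.MULTILINE)

def normalize(text):
    """Lowercase the text and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())

def find_conflicting_answers(pairs):
    """Given (prompt, answer) pairs, return normalized prompts that have more than one distinct answer."""
    answers = defaultdict(set)
    for prompt, answer in pairs:
        answers[normalize(prompt)].add(normalize(answer))
    return sorted(p for p, a in answers.items() if len(a) > 1)

def markdown_share(answers):
    """Fraction of answers that contain Markdown-like markup."""
    answers = list(answers)
    if not answers:
        return 0.0
    hits = sum(1 for a in answers if MARKDOWN_PATTERN.search(a))
    return hits / len(answers)
```

A share close to 0 or 1 suggests consistent formatting; a value in between is a hint that the examples mix Markdown and plain text.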

Annotation — Who Labels?

Method	Quality	Cost	Scalability
Own experts	Very high	High	Low
Annotation services (Scale AI, Toloka)	High	Medium	High
LLM-assisted annotation	Medium	Low	Very high
Community/Crowdsourcing	Variable	Low	High

Best practice: LLM draft + expert review. An LLM creates the draft; a human expert reviews and corrects it.
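One way to organize the draft-then-review workflow is to track a status per example, so only reviewed examples reach the training set. This is a sketch only: the class, the status names, and `draft_fn` (a placeholder for whatever LLM call produces the draft) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    draft: str
    final: str = ""
    status: str = "pending"  # one of: pending, approved, corrected, rejected

def create_drafts(prompts, draft_fn):
    """Create one pending example per prompt; draft_fn is a hypothetical LLM wrapper."""
    return [Example(prompt=p, draft=draft_fn(p)) for p in prompts]

def review(example, approved, correction=None):
    """Record the expert's decision: correct the draft, accept it as-is, or reject it."""
    if correction is not None:
        example.final, example.status = correction, "corrected"
    elif approved:
        example.final, example.status = example.draft, "approved"
    else:
        example.status = "rejected"
    return example

def training_ready(examples):
    """Keep only examples a human has approved or corrected."""
    return [e for e in examples if e.status in ("approved", "corrected")]
```

The point of the design is that unreviewed drafts stay in the "pending" state and never reach training by accident.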

Augmentation — More from Less Data

Paraphrasing: LLM creates variants of existing examples
Back-translation: Translate to another language and back
Scenario variation: Same task with different contexts
Difficulty scaling: Simple and complex variants of each task
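Scenario variation in particular lends itself to a small script: one prompt template is filled with every combination of context values. The sketch below is illustrative; the function name, the template, and the slot values are assumptions. It produces prompts only, so the matching answers still have to be written or reviewed.

```python
from itertools import product

def scenario_variations(template, slots):
    """Fill the template with every combination of the given slot values."""
    names = list(slots)
    prompts = []
    for values in product(*(slots[name] for name in names)):
        prompts.append(template.format(**dict(zip(names, values))))
    return prompts

# Hypothetical slot values for illustration.
prompts = scenario_variations(
    "Write a product text for {product} aimed at {audience}.",
    {
        "product": ["Smart Watch X1", "a wireless charger"],
        "audience": ["runners", "commuters"],
    },
)
# Two products times two audiences gives four prompts.
```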

How Much Data Do I Need?

Goal	Minimum	Recommended	Maximum
Style adaptation	50 examples	200–500	1,000
Task specialization	100 examples	500–2,000	10,000
Domain training	500 examples	2,000–10,000	100,000+
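The table can be turned into a quick sanity check on dataset size. The sketch below encodes the minimum and recommended figures from the table; the function name and the wording of the returned labels are assumptions.

```python
# (minimum, recommended low, recommended high), taken from the table above.
SIZE_GUIDE = {
    "style adaptation": (50, 200, 500),
    "task specialization": (100, 500, 2_000),
    "domain training": (500, 2_000, 10_000),
}

def assess_dataset_size(goal, n_examples):
    """Compare an example count with the guideline figures for the given goal."""
    minimum, rec_low, rec_high = SIZE_GUIDE[goal]
    if n_examples < minimum:
        return "below minimum"
    if n_examples < rec_low:
        return "above minimum, below recommended"
    if n_examples <= rec_high:
        return "within recommended range"
    return "above recommended range"
```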

Practical tip: Start with 100 high-quality examples. Train, evaluate, then add examples that target the weaknesses you find. 100 perfect examples beat 10,000 mediocre ones.
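Evaluating after each training round requires examples the model has not seen during training. A minimal sketch of a reproducible split follows; the function name, the 20% evaluation share, and the fixed seed are assumptions, not recommendations from the text above.

```python
import random

def split_examples(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and return (train, evaluation) lists."""
    shuffled = list(examples)
    # A fixed seed keeps the split identical between runs.
    random.Random(seed).shuffle(shuffled)
    # At least one example is held out, which assumes the input has two or more examples.
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

With 100 examples and the default settings, this yields 80 training and 20 evaluation examples. New examples added in later rounds should go into the training portion, so results stay comparable across rounds.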