Lesson 2 of 5 · 10 min read

Preparing Training Data

The quality of a fine-tuned model stands or falls with its training data. "Garbage in, garbage out" applies here more than anywhere else: the model reproduces the style, structure, and mistakes of the examples it sees. Good training data is the difference between a useful model and a useless one.

Formats

JSONL (JSON Lines) — Standard Format

{"messages": [{"role": "system", "content": "You are a brand copywriter..."}, {"role": "user", "content": "Write a product text for..."}, {"role": "assistant", "content": "Discover..."}]}
{"messages": [{"role": "system", "content": "You are a brand copywriter..."}, {"role": "user", "content": "Describe our new..."}, {"role": "assistant", "content": "Innovation..."}]}
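A malformed line can cause a training file to be rejected or misread, so it helps to check the structure before uploading. The following is a minimal sketch using only the Python standard library. The function names are hypothetical. The allowed roles are taken from the sample above. The rule that each example must end with an assistant message is an assumption that suits most chat fine-tuning setups, not a requirement of the JSONL format itself.

```python
import json

# Roles seen in the sample above; extend this set if your provider allows more.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    # Assumption: every training example should end with the target answer.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


def validate_jsonl(text: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; blank lines are skipped."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            problems = validate_chat_line(line)
            if problems:
                report[number] = problems
    return report
```

Calling `validate_jsonl(open("train.jsonl", encoding="utf-8").read())` on a file (the file name is only an example) returns an empty dictionary when every line passes.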

Alpaca Format (Open Source)

{"instruction": "Write a product text", "input": "Product: Smart Watch X1", "output": "The Smart Watch X1..."}
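If a dataset arrives in Alpaca format but the training pipeline expects the messages format shown earlier, the records can be converted mechanically. This sketch assumes the three field names from the sample above. Joining `instruction` and `input` with a blank line is one common convention, not a fixed rule, and `alpaca_to_messages` is a hypothetical helper name.

```python
from typing import Optional


def alpaca_to_messages(record: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert one Alpaca-style record into a chat 'messages' record."""
    instruction = record["instruction"].strip()
    extra = record.get("input", "").strip()  # 'input' is often empty or absent
    # Assumption: instruction and input are joined by a blank line.
    user_content = f"{instruction}\n\n{extra}" if extra else instruction

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}
```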

ShareGPT Format (Multi-Turn)

{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}, {"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
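ShareGPT records map onto the messages format almost one to one. The sketch below assumes the speaker labels `human` and `gpt` from the sample above. The `system` entry in the mapping is an assumption for datasets that include system turns, and unknown labels raise an error so they are not silently mislabeled.

```python
# Assumed mapping from ShareGPT speaker labels to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_messages(record: dict) -> dict:
    """Convert one ShareGPT-style record into a chat 'messages' record."""
    messages = []
    for turn in record["conversations"]:
        speaker = turn["from"]
        if speaker not in ROLE_MAP:
            raise ValueError(f"unknown speaker: {speaker!r}")
        messages.append({"role": ROLE_MAP[speaker], "content": turn["value"]})
    return {"messages": messages}
```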

Ensuring Data Quality

The 5 Quality Criteria

| Criterion | Description | Check |
| --- | --- | --- |
| Correctness | Are the answers factually correct? | Expert review |
| Consistency | Same style and tone across all examples? | Style guide |
| Diversity | Do examples cover different scenarios? | Coverage matrix |
| Relevance | Do examples match the target use case? | Use-case alignment |
| Length | Do answers match the desired output length? | Token count |
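The Length criterion is the easiest one to automate. The sketch below uses a whitespace word count as a rough stand-in for a token count, which is an assumption: real token counts depend on the model's tokenizer and are usually somewhat higher. The function name and the thresholds in the usage note are illustrative only.

```python
import statistics


def length_report(answers: list[str], min_words: int, max_words: int) -> dict:
    """Summarize answer lengths and list the indices outside the target range.

    Word count is only a proxy for token count. Expects at least one answer.
    """
    counts = [len(answer.split()) for answer in answers]
    out_of_range = [
        index for index, n in enumerate(counts)
        if not min_words <= n <= max_words
    ]
    return {
        "median_words": statistics.median(counts),
        "out_of_range": out_of_range,
    }
```

For example, `length_report(answers, min_words=40, max_words=120)` lists the positions of assistant answers that fall outside a 40 to 120 word target, so they can be shortened, expanded, or removed.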

Common Quality Issues

  • ❌ Answers copied and pasted from ChatGPT (the model learns a generic style)
  • ❌ Contradictory answers to similar questions
  • ❌ Too little diversity (only one topic, only one answer structure)
  • ❌ Formatting inconsistencies (sometimes Markdown, sometimes plain text)
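Two of the issues above can be partly surfaced with simple scripts before any human review. This is a rough sketch under stated assumptions: `find_conflicts` only catches prompts that are identical after lowercasing and whitespace normalization, so it misses questions that are merely similar, and `markdown_share` relies on a small heuristic pattern rather than a real Markdown parser. Both names are hypothetical, and the results are candidates for manual review, not verdicts.

```python
import re
from collections import defaultdict

# Heuristic: headings, list bullets, bold text, or inline code suggest Markdown.
MARKDOWN_HINTS = re.compile(r"^#{1,6} |^\s*[-*] |\*\*[^*]+\*\*|`[^`]+`", re.MULTILINE)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def find_conflicts(pairs: list[tuple[str, str]]) -> list[str]:
    """Return normalized prompts that appear with more than one distinct answer."""
    answers_by_prompt = defaultdict(set)
    for prompt, answer in pairs:
        answers_by_prompt[normalize(prompt)].add(normalize(answer))
    return sorted(p for p, answers in answers_by_prompt.items() if len(answers) > 1)


def markdown_share(answers: list[str]) -> float:
    """Fraction of answers that look like Markdown (expects a non-empty list)."""
    flagged = sum(1 for answer in answers if MARKDOWN_HINTS.search(answer))
    return flagged / len(answers)
```

A `markdown_share` close to 0 or 1 suggests consistent formatting. A value in between is a hint that some examples use Markdown and others plain text.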

Annotation — Who Labels?

| Method | Quality | Cost | Scalability |
| --- | --- | --- | --- |
| Own experts | Very high | High | Low |
| Annotation services (Scale AI, Toloka) | High | Medium | High |
| LLM-assisted annotation | Medium | Low | Very high |
| Community/Crowdsourcing | Variable | Low | High |

Best practice: LLM draft + expert review. An LLM creates the draft, and a human expert reviews and corrects it before it enters the training set.
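The draft-then-review workflow can be modeled with a small data structure that keeps unreviewed drafts out of the training file. This is a minimal sketch: the class, field names, and status values are hypothetical, and the LLM drafting step itself is not shown because it depends on the provider you use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    prompt: str
    draft: str                   # produced by an LLM (drafting step not shown)
    status: str = "pending"      # "pending", "approved", or "rejected"
    final: Optional[str] = None  # text that goes into training once approved


def review(example: Example, approve: bool, corrected: Optional[str] = None) -> Example:
    """Record the expert's decision; a corrected text replaces the draft."""
    if approve:
        example.status = "approved"
        example.final = corrected if corrected is not None else example.draft
    else:
        example.status = "rejected"
    return example


def export_approved(examples: list[Example]) -> list[dict]:
    """Build chat 'messages' records from approved examples only."""
    return [
        {"messages": [
            {"role": "user", "content": e.prompt},
            {"role": "assistant", "content": e.final},
        ]}
        for e in examples
        if e.status == "approved"
    ]
```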

Augmentation — More from Less Data

  • Paraphrasing: LLM creates variants of existing examples
  • Back-translation: Translate to another language and back
  • Scenario variation: Same task with different contexts
  • Difficulty scaling: Simple and complex variants of each task
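Of the techniques above, scenario variation is the one that needs no model at all: a prompt template is filled with every combination of context values. The sketch below assumes a template with named placeholders, and the function name and slot values are made up for illustration. It generates prompts only, so the matching answers still have to be written and reviewed. Paraphrasing and back-translation would need an LLM or a translation service and are not shown.

```python
from itertools import product


def scenario_variants(template: str, slots: dict[str, list[str]]) -> list[str]:
    """Fill the template with every combination of the slot values."""
    names = list(slots)
    value_lists = [slots[name] for name in names]
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*value_lists)
    ]


prompts = scenario_variants(
    "Write a {tone} product text for {product}.",
    {"tone": ["playful", "formal"], "product": ["Smart Watch X1", "a desk lamp"]},
)
# Two tones times two products gives four prompt variants.
```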

How Much Data Do I Need?

| Goal | Minimum | Recommended | Maximum |
| --- | --- | --- | --- |
| Style adaptation | 50 examples | 200–500 | 1,000 |
| Task specialization | 100 examples | 500–2,000 | 10,000 |
| Domain training | 500 examples | 2,000–10,000 | 100,000+ |

Practical tip: Start with 100 high-quality examples. Train, evaluate, then add new examples that target the weaknesses the evaluation reveals. 100 excellent examples typically beat 10,000 mediocre ones.
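The train-then-evaluate loop only works if some examples are held back from training. Here is a minimal sketch of a reproducible split. The 20% evaluation share and the seed value are assumptions, not recommendations from this lesson, and `split_dataset` is a hypothetical helper name.

```python
import random


def split_dataset(examples: list, eval_fraction: float = 0.2, seed: int = 42):
    """Shuffle deterministically and return (train, evaluation) lists."""
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(examples)              # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]
```

Keeping the seed fixed means that examples added later do not silently move earlier evaluation items into the training set by accident, as long as the split is re-run on the same base list and the new examples are split separately.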