Dataset Preparation

When preparing your dataset for fine-tuning, keep these key principles in mind:

Dataset Size and Quality

  • Aim for at least 500-1000 representative examples
  • More high-quality examples tend to improve performance further
  • Examples in the training dataset should closely resemble the requests you expect at inference time
  • The dataset should contain a diverse set of examples (an example record format is sketched below)
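
The exact record schema depends on the fine-tuning service you are targeting, so treat the following as an illustration only: a conversational JSONL dataset is commonly structured with one example per line, for instance using a messages field (an assumed field name, not a confirmed schema):

```jsonl
{"messages": [{"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Go to Settings > Security and choose Reset password."}]}
{"messages": [{"role": "user", "content": "Can I change my billing email?"}, {"role": "assistant", "content": "Yes. Open Billing > Contact info and update the email address."}]}
```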

Train-Test Split Best Practices

When creating train and evaluation splits for your dataset:

  • For larger datasets: Use 80% for training and 20% for evaluation
  • For smaller datasets: Use 90% for training and 10% for evaluation
  • Create a split column in your dataset with the following values (see the sketch after this list):
    • train for training data
    • evaluation for evaluation data
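
As a minimal sketch of creating that split column (assuming a JSONL dataset and a hypothetical file name input_data.jsonl), you could shuffle the examples and assign 80% to train and 20% to evaluation:

```python
import json
import random

random.seed(42)  # make the split reproducible

# Hypothetical input file: one JSON example per line
with open("input_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.shuffle(examples)
cutoff = int(0.8 * len(examples))  # 80/20 split; use 0.9 for smaller datasets

for i, example in enumerate(examples):
    example["split"] = "train" if i < cutoff else "evaluation"

with open("dataset_with_split.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```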

How Evaluation Sets are Used

The evaluation set serves several important purposes:

  • Helps track model performance on unseen data during training
  • Used to select the best checkpoint for your final model
  • If no evaluation set is provided:
    • All data will be used for training
    • Only training metrics will be reported
    • The final training checkpoint will be used for inference
  • When an evaluation set is provided:
    • The model is never trained on the evaluation set
    • Progress is tracked on both training and evaluation data
    • The checkpoint with the best evaluation performance (lowest evaluation loss) is selected (illustrated in the sketch after this list)
    • Monitoring the evaluation loss helps prevent overfitting (a failure to generalize)
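
The following is only a conceptual illustration of that selection rule, with invented numbers; it is not the platform's actual implementation:

```python
# Pick the checkpoint with the lowest evaluation loss (illustrative values only).
checkpoints = [
    {"step": 500, "eval_loss": 1.42},
    {"step": 1000, "eval_loss": 1.18},
    {"step": 1500, "eval_loss": 1.23},  # eval loss rising again suggests overfitting
]

best = min(checkpoints, key=lambda c: c["eval_loss"])
print(f"Selected checkpoint at step {best['step']} (eval loss {best['eval_loss']})")
```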

Instruction Formatting

Experimentation shows that using model-specific instruction (chat) templates significantly boosts performance. To learn how to apply instruction templates to your data, see our guide on Chat Templates.
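
The exact workflow is covered in the Chat Templates guide; as a sketch of the general idea, assuming you use the Hugging Face transformers library (the model id below is a placeholder, not a recommendation), a model-specific template can be applied like this:

```python
from transformers import AutoTokenizer

# Placeholder model id; substitute the model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise support assistant."},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security and choose Reset password."},
]

# Render the conversation with the model-specific chat template.
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
```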

Large Dataset Handling

When working with large datasets (>100MB):

  • Use cloud data sources such as Amazon S3 or BigQuery instead of direct file uploads
  • Consider using pretokenized datasets for more efficient processing (see the sketch after this list)
  • Take advantage of external data source connections to stream large datasets
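
Whether pretokenized data is accepted, and in what format, depends on the platform; the sketch below only shows one way to stream and tokenize a large JSONL file with the Hugging Face datasets and transformers libraries (the file name, model id, and messages field are assumptions carried over from the earlier examples):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder model id; substitute the model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Stream the file instead of loading it fully into memory.
dataset = load_dataset(
    "json", data_files="dataset_with_split.jsonl", split="train", streaming=True
)

def pretokenize(example):
    # Render the conversation with the chat template, then tokenize it.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(pretokenize)  # applied lazily as records stream through

for record in tokenized.take(2):
    print(len(record["input_ids"]))
```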

Common Pitfalls to Avoid

  1. Data Quality Issues

    • Inconsistent formatting across examples
    • Missing or incomplete responses
    • Poor quality or irrelevant examples
  2. Dataset Size Problems

    • Too few examples for effective learning
    • Dataset too large for file upload (use external data sources instead)
  3. Format Errors

    • Incorrect JSON/JSONL formatting (see the validation sketch after this list)
    • Missing required columns
    • Inconsistent data types
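
Many of these format errors can be caught before upload with a small validation script. The sketch below assumes a JSONL dataset with messages and split fields (adapt the required fields to your own schema):

```python
import json

REQUIRED_FIELDS = {"messages", "split"}  # assumption: adjust to your schema


def validate_jsonl(path: str) -> list[str]:
    """Return a list of human-readable problems found in a JSONL file."""
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {lineno}: expected a JSON object, got {type(record).__name__}")
                continue
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                problems.append(f"line {lineno}: missing fields {sorted(missing)}")
            if record.get("split") not in {"train", "evaluation", None}:
                problems.append(f"line {lineno}: unexpected split value {record['split']!r}")
    return problems


for problem in validate_jsonl("dataset_with_split.jsonl"):
    print(problem)
```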