Dataset Preparation

When preparing your dataset for fine-tuning, keep these key principles in mind:

Dataset Size and Quality

  • Aim for at least 500-1000 representative examples
  • More high-quality examples tend to improve performance further
  • Examples in the training dataset should closely resemble the requests you expect at inference time
  • The dataset should contain a diverse set of examples (an example record format is sketched below)
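
The exact record schema depends on the fine-tuning service you are targeting, so treat the following as an illustration only: a conversational JSONL dataset is commonly structured with one example per line, for instance using a messages field (an assumed field name, not a confirmed schema):

```jsonl
{"messages": [{"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Go to Settings > Security and choose Reset password."}]}
{"messages": [{"role": "user", "content": "Can I change my billing email?"}, {"role": "assistant", "content": "Yes. Open Billing > Contact info and update the email address."}]}
```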

Train-Test Split Best Practices

When creating train and evaluation splits for your dataset:

  • For larger datasets: Use 80% for training and 20% for evaluation
  • For smaller datasets: Use 90% for training and 10% for evaluation
  • Create a split column in your dataset with the following values (see the sketch after this list):
    • train for training data
    • evaluation for evaluation data
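
As a minimal sketch of creating that split column (assuming a JSONL dataset and a hypothetical file name input_data.jsonl), you could shuffle the examples and assign 80% to train and 20% to evaluation:

```python
import json
import random

random.seed(42)  # make the split reproducible

# Hypothetical input file: one JSON example per line
with open("input_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

random.shuffle(examples)
cutoff = int(0.8 * len(examples))  # 80/20 split; use 0.9 for smaller datasets

for i, example in enumerate(examples):
    example["split"] = "train" if i < cutoff else "evaluation"

with open("dataset_with_split.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```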

How Evaluation Sets are Used

The evaluation set serves several important purposes:

  • Helps track model performance on unseen data during training
  • Used to select the best checkpoint for your final model
  • If no evaluation set is provided:
    • All data will be used for training
    • Only training metrics will be reported
    • The final training checkpoint will be used for inference
  • When an evaluation set is provided:
    • The model is never trained on the evaluation set
    • Progress is tracked on both training and evaluation data
    • The checkpoint with the best evaluation performance (lowest evaluation loss) is selected (illustrated in the sketch after this list)
    • Monitoring the evaluation loss helps prevent overfitting (a failure to generalize)
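
The following is only a conceptual illustration of that selection rule, with invented numbers; it is not the platform's actual implementation:

```python
# Pick the checkpoint with the lowest evaluation loss (illustrative values only).
checkpoints = [
    {"step": 500, "eval_loss": 1.42},
    {"step": 1000, "eval_loss": 1.18},
    {"step": 1500, "eval_loss": 1.23},  # eval loss rising again suggests overfitting
]

best = min(checkpoints, key=lambda c: c["eval_loss"])
print(f"Selected checkpoint at step {best['step']} (eval loss {best['eval_loss']})")
```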

Instruction Formatting

Experimentation shows that using model-specific instruction (chat) templates significantly boosts performance. To learn how to apply instruction templates to your data, see our guide on Chat Templates.
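
The exact workflow is covered in the Chat Templates guide; as a sketch of the general idea, assuming you use the Hugging Face transformers library (the model id below is a placeholder, not a recommendation), a model-specific template can be applied like this:

```python
from transformers import AutoTokenizer

# Placeholder model id; substitute the model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise support assistant."},
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security and choose Reset password."},
]

# Render the conversation with the model-specific chat template.
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
```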

Large Dataset Handling

When working with large datasets (>100MB):

  • Use cloud data sources such as Amazon S3 or BigQuery instead of direct file uploads
  • Consider using pretokenized datasets for more efficient processing (see the sketch after this list)
  • Take advantage of external data source connections to stream large datasets
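
Whether pretokenized data is accepted, and in what format, depends on the platform; the sketch below only shows one way to stream and tokenize a large JSONL file with the Hugging Face datasets and transformers libraries (the file name, model id, and messages field are assumptions carried over from the earlier examples):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder model id; substitute the model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Stream the file instead of loading it fully into memory.
dataset = load_dataset(
    "json", data_files="dataset_with_split.jsonl", split="train", streaming=True
)

def pretokenize(example):
    # Render the conversation with the chat template, then tokenize it.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = dataset.map(pretokenize)  # applied lazily as records stream through

for record in tokenized.take(2):
    print(len(record["input_ids"]))
```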

Common Pitfalls to Avoid

  1. Data Quality Issues

    • Inconsistent formatting across examples
    • Missing or incomplete responses
    • Poor quality or irrelevant examples
  2. Dataset Size Problems

    • Too few examples for effective learning
    • Dataset too large for file upload (use external data sources instead)
  3. Format Errors

    • Incorrect JSON/JSONL formatting (see the validation sketch after this list)
    • Missing required columns
    • Inconsistent data types
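
Many of these format errors can be caught before upload with a small validation script. The sketch below assumes a JSONL dataset with messages and split fields (adapt the required fields to your own schema):

```python
import json

REQUIRED_FIELDS = {"messages", "split"}  # assumption: adjust to your schema


def validate_jsonl(path: str) -> list[str]:
    """Return a list of human-readable problems found in a JSONL file."""
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {lineno}: expected a JSON object, got {type(record).__name__}")
                continue
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                problems.append(f"line {lineno}: missing fields {sorted(missing)}")
            if record.get("split") not in {"train", "evaluation", None}:
                problems.append(f"line {lineno}: unexpected split value {record['split']!r}")
    return problems


for problem in validate_jsonl("dataset_with_split.jsonl"):
    print(problem)
```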