# Dataset Preparation
When preparing your dataset for fine-tuning, keep these key principles in mind:

## Dataset Structure

- The dataset must be in a format compatible with Predibase
- For SFT tasks, the dataset must contain two columns: `prompt` and `completion`
- For other tasks, the dataset must contain the columns specified in the task documentation

Please see our dataset preparation guide for more details.
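
For example, an SFT dataset can be written as JSONL with one example per line. This is a minimal sketch; the rows and file name are purely illustrative:

```python
import json

# Minimal sketch of an SFT dataset with the required `prompt` and `completion` columns.
# The example rows are illustrative, not taken from Predibase documentation.
rows = [
    {"prompt": "Summarize: The quick brown fox jumps over the lazy dog.",
     "completion": "A fox jumps over a dog."},
    {"prompt": "Translate to French: Hello", "completion": "Bonjour"},
]

with open("sft_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```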

## Dataset Size and Quality

- Aim for a minimum of 500-1000 representative examples
- More high-quality examples may help increase performance
- Examples in the training dataset should be similar to the requests at inference time
- The dataset should contain a diverse set of examples

## Train-Test Split Best Practices

When creating train and evaluation splits for your dataset:

- For larger datasets: use 80% for training and 20% for evaluation
- For smaller datasets: use 90% for training and 10% for evaluation
- Create a `split` column in your dataset with the value `train` for training data and `evaluation` for evaluation data (see the sketch after this list)
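
As a minimal sketch, assuming a pandas DataFrame and the 10% evaluation ratio for smaller datasets, the `split` column can be added like this (file names are hypothetical):

```python
import pandas as pd

# Minimal sketch: randomly assign 90% of rows to "train" and 10% to "evaluation".
df = pd.read_json("sft_dataset.jsonl", lines=True)
eval_frac = 0.10  # use 0.20 for larger datasets
eval_idx = df.sample(frac=eval_frac, random_state=42).index
df["split"] = "train"
df.loc[eval_idx, "split"] = "evaluation"
df.to_json("sft_dataset_split.jsonl", orient="records", lines=True)
```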

## How Evaluation Sets are Used

The evaluation set serves several important purposes:

- Helps track model performance on unseen data during training
- Used to select the best checkpoint for your final model

If no evaluation set is provided:

- All data will be used for training
- Only training metrics will be reported
- The final training checkpoint will be used for inference

When an evaluation set is provided:

- The model is never trained on the evaluation set
- Progress is tracked on both training and evaluation data
- The checkpoint with the best evaluation performance (lowest loss) is selected
- Monitoring the evaluation loss helps prevent overfitting (lack of generalization)

Experimentation shows that using model-specific instruction (chat) templates significantly boosts performance. To learn more about how to use instruction templates with your data to improve performance, see our guide on Chat Templates.
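
As an illustration of what a chat template does (this sketch uses the Hugging Face `transformers` API, not a Predibase-specific one, and the model name is an arbitrary assumption):

```python
from transformers import AutoTokenizer

# Sketch: render a prompt with a model-specific chat template before storing it
# in the `prompt` column. The model name below is illustrative.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [{"role": "user", "content": "Summarize: The quick brown fox jumps over the lazy dog."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```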

## Large Dataset Handling

When working with large datasets (>100MB):

- Use cloud storage solutions like Amazon S3, BigQuery, etc. instead of file uploads
- Consider using pretokenized datasets for more efficient processing
- Take advantage of external data source connections for streaming large datasets
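
For instance, a large JSONL file can be staged in Amazon S3 with `boto3` and then connected as an external data source. This is a sketch; the bucket and key names are placeholders:

```python
import boto3

# Sketch: stage a large dataset in S3 instead of uploading the file directly.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="sft_dataset_split.jsonl",
    Bucket="my-training-data",                 # hypothetical bucket name
    Key="finetuning/sft_dataset_split.jsonl",  # hypothetical object key
)
```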

## Common Pitfalls to Avoid

### Data Quality Issues

- Inconsistent formatting across examples
- Missing or incomplete responses
- Poor quality or irrelevant examples

### Dataset Size Problems

- Too few examples for effective learning
- Dataset too large for file upload (use external data sources instead)

### Format Errors

- Incorrect JSON/JSONL formatting
- Missing required columns
- Inconsistent data types
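
A quick pre-upload check can catch most of these format errors. This is a minimal sketch, assuming a JSONL file with the SFT columns (adjust `REQUIRED_COLUMNS` for other tasks):

```python
import json

REQUIRED_COLUMNS = {"prompt", "completion"}  # SFT columns; adjust per task

def validate_jsonl(path: str) -> None:
    """Flag malformed JSON lines, missing columns, and non-string values."""
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                row = json.loads(line)
            except json.JSONDecodeError as e:
                print(f"line {i}: invalid JSON ({e})")
                continue
            missing = REQUIRED_COLUMNS - row.keys()
            if missing:
                print(f"line {i}: missing columns {sorted(missing)}")
            for col in REQUIRED_COLUMNS & row.keys():
                if not isinstance(row[col], str):
                    print(f"line {i}: {col!r} has type {type(row[col]).__name__}, expected str")

validate_jsonl("sft_dataset_split.jsonl")
```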