Dataset Preparation
Guidelines and best practices for preparing data and fine-tuning models
When preparing your dataset for fine-tuning, keep these key principles in mind:
Dataset Size and Quality
- Aim for a minimum of 500-1,000 representative examples
- More high-quality examples may further improve performance
- Examples in the training dataset should closely resemble the requests the model will receive at inference time
- The dataset should contain a diverse set of examples
Train-Test Split Best Practices
When creating train and evaluation splits for your dataset:
- For larger datasets: use 80% for training and 20% for evaluation
- For smaller datasets: use 90% for training and 10% for evaluation
- Create a `split` column in your dataset with the value `train` for training rows and `evaluation` for evaluation rows
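For example, here is a minimal sketch of adding a `split` column with pandas (the file names are illustrative, and the 10% holdout matches the smaller-dataset guideline above):

```python
import pandas as pd

# Load the dataset; expects one JSON object per line (JSONL).
df = pd.read_json("my_dataset.jsonl", lines=True)

# Hold out ~10% of rows for evaluation (use ~20% for larger datasets).
eval_rows = df.sample(frac=0.1, random_state=42).index

# Label every row as train, then overwrite the sampled rows.
df["split"] = "train"
df.loc[eval_rows, "split"] = "evaluation"

df.to_json("my_dataset_with_split.jsonl", orient="records", lines=True)
```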
How Evaluation Sets are Used
The evaluation set serves several important purposes:
- Helps track model performance on unseen data during training
- Used to select the best checkpoint for your final model
If no evaluation set is provided:
- All data will be used for training
- Only training metrics will be reported
- The final training checkpoint will be used for inference
When an evaluation set is provided:
- The model is never trained on the evaluation set
- Progress is tracked on both training and evaluation data
- The checkpoint with the best evaluation performance (lowest evaluation loss) is selected
- Monitoring the evaluation loss helps prevent overfitting (lack of generalization)
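The selection rule itself is straightforward: among all saved checkpoints, pick the one with the lowest evaluation loss. A toy sketch with made-up metrics (not the platform's internal implementation):

```python
# Hypothetical checkpoint metrics recorded during training.
checkpoints = [
    {"step": 100, "eval_loss": 1.42},
    {"step": 200, "eval_loss": 1.18},
    {"step": 300, "eval_loss": 1.25},
]

# Select the checkpoint with the lowest evaluation loss.
best = min(checkpoints, key=lambda c: c["eval_loss"])
print(f"Best checkpoint: step {best['step']}")  # -> step 200
```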
Instruction Formatting
Experimentation shows that using model-specific instruction (chat) templates significantly boosts performance. To learn more about applying instruction templates to your data, see our guide on Chat Templates.
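As one illustration of what a chat template does, Hugging Face tokenizers expose an `apply_chat_template` helper that wraps raw messages in a model's expected markers (the model name here is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# A single training example as role-tagged messages.
messages = [
    {"role": "user", "content": "Summarize this support ticket: ..."},
    {"role": "assistant", "content": "The customer reports ..."},
]

# Render the messages with the model-specific template instead of
# concatenating them by hand.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)  # text wrapped in the model's [INST] ... [/INST] markers
```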
Large Dataset Handling
When working with large datasets (>100MB):
- Use cloud storage solutions such as Amazon S3 or BigQuery instead of direct file uploads
- Consider using pretokenized datasets for more efficient processing
- Take advantage of external data source connections for streaming large datasets
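For instance, a large JSONL file can be staged in S3 with boto3 rather than uploaded directly (bucket and key names are hypothetical; AWS credentials are assumed to be configured in your environment):

```python
import boto3

s3 = boto3.client("s3")

# Stage the dataset in cloud storage and point your data connection at it.
s3.upload_file(
    Filename="my_dataset_with_split.jsonl",
    Bucket="my-finetuning-data",
    Key="datasets/my_dataset_with_split.jsonl",
)
```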
Common Pitfalls to Avoid
Data Quality Issues
- Inconsistent formatting across examples
- Missing or incomplete responses
- Poor-quality or irrelevant examples
Dataset Size Problems
- Too few examples for effective learning
- Dataset too large for file upload (use external data sources instead)
Format Errors
- Incorrect JSON/JSONL formatting
- Missing required columns
- Inconsistent data types
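A quick pre-upload sanity check can catch the format errors above. This sketch assumes a JSONL file; the required column names are illustrative, so match them to your dataset:

```python
import json

REQUIRED_COLUMNS = {"prompt", "completion", "split"}

with open("my_dataset_with_split.jsonl") as f:
    for i, line in enumerate(f, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError as e:
            raise ValueError(f"Line {i} is not valid JSON: {e}")
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            raise ValueError(f"Line {i} is missing columns: {missing}")

print("Dataset passed basic JSONL validation.")
```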