Fine-tuning Best Practices
Hyperparameter Tuning
Learn how to effectively tune hyperparameters for optimal model performance
Overview
Effective hyperparameter tuning is crucial for achieving optimal performance when fine-tuning language models. This guide covers key hyperparameters, their impact on training, and best practices for tuning them.
Key Hyperparameters
Learning Rate
The learning rate controls how much the model updates its weights in response to training data.
- Too high: Training may become unstable or diverge
- Too low: Training may be slow or get stuck in suboptimal solutions
- Best practice: Start with 2e-4 for LoRA and gradually decrease if training is unstable
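The stability point above can be illustrated with a toy example that is not tied to any particular framework: gradient descent on f(w) = w², where a step that is too large makes the weight grow each update (divergence) while a smaller step shrinks it toward the minimum.

```python
# Toy illustration of learning-rate stability: gradient descent on f(w) = w^2.
# Each update is w <- w - lr * f'(w). With lr too large, |w| grows every step.
def gradient_descent(lr, steps=10, w=1.0):
    for _ in range(steps):
        grad = 2 * w  # derivative of w^2
        w = w - lr * grad
    return w

stable = gradient_descent(lr=0.1)    # |w| shrinks toward the minimum at 0
unstable = gradient_descent(lr=1.5)  # |w| grows each step: training diverges
```

The same dynamic plays out in real fine-tuning, which is why lowering the learning rate is the first response to unstable loss curves.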
Batch Size
The number of training examples processed in each training step.
- Larger batches: More stable gradients but require more memory
- Smaller batches: Less memory but potentially noisier training
- Best practice: Use the largest batch size that fits in your available memory. For LoRA, use an `effective_batch_size` of 16 or 32.
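When the target effective batch size does not fit in memory, it is commonly reached through gradient accumulation. A minimal sketch (the function and parameter names here are illustrative, not this tool's actual config keys):

```python
# Effective batch size = micro-batch size * devices * accumulation steps.
# This helper computes how many micro-batches to accumulate before an
# optimizer step to hit a target effective batch size.
def accumulation_steps(effective_batch_size, micro_batch_size, num_devices=1):
    per_step = micro_batch_size * num_devices
    if effective_batch_size % per_step != 0:
        raise ValueError("effective batch size must be a multiple of micro batch * devices")
    return effective_batch_size // per_step

# e.g. a target effective batch size of 32 with micro-batches of 4 on one device
steps = accumulation_steps(32, 4)  # accumulate gradients over 8 micro-batches
```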
Number of Epochs
How many times the model sees the entire training dataset over the course of training.
- Too few: Model may underfit and not learn the task well
- Too many: Model may overfit or waste computational resources
- Best practice: Start with 3 epochs and adjust based on validation performance
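"Adjust based on validation performance" can be made concrete with a simple patience rule: keep the epoch with the best validation loss, and stop once it has not improved for a while. A sketch (the patience threshold is an assumption, not a recommended default):

```python
# Illustrative early-stopping check: returns the epoch with the best
# validation loss, stopping once `patience` epochs pass without improvement.
def epochs_to_keep(val_losses, patience=1):
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss has stopped improving
    return best_epoch

# Validation loss improves for 3 epochs, then rises: keep 3 epochs.
chosen = epochs_to_keep([1.0, 0.8, 0.7, 0.75])
```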
LoRA Rank
For LoRA fine-tuning, the rank determines the complexity of adaptations.
- Higher rank: More expressive but requires more memory
- Lower rank: More efficient but may limit adaptation capability
- Best practice: Start with rank 16 for most tasks, increase for complex tasks
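The memory trade-off follows directly from how LoRA factors the weight update: the delta for a d_out × d_in matrix is B @ A with B of shape (d_out, r) and A of shape (r, d_in), so trainable parameters grow linearly in the rank r. A quick back-of-the-envelope check:

```python
# Trainable parameters a LoRA adapter adds for one weight matrix:
# A has rank * d_in entries, B has d_out * rank entries.
def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

full = 4096 * 4096                 # full fine-tune of one 4096x4096 matrix
r16 = lora_params(4096, 4096, 16)  # rank 16: a small fraction of `full`
r64 = lora_params(4096, 4096, 64)  # rank 64: 4x the rank-16 count
```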
Systematic Tuning Process
1. Start with Defaults
   - Begin with our recommended default values
   - These work well for most common use cases
   - See the Configuration Reference for defaults
2. Monitor Key Metrics
   - Training loss
   - Validation loss
   - Task-specific metrics (accuracy, F1, etc.) through offline evaluation
3. Adjust One at a Time
   - Change only one hyperparameter at a time
   - Document the impact of each change
   - Keep track of what works and what doesn’t
4. Consider Resource Constraints
   - Balance performance gains against computational costs
   - Monitor memory usage and training time
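The process above can be sketched as a minimal experiment log. The hyperparameter names and the `run_experiment` helper are illustrative placeholders, not this tool's API; the point is that each run copies the baseline, changes exactly one value, and records the outcome:

```python
# "Adjust one at a time" as code: every experiment overrides exactly one
# hyperparameter from the baseline and is appended to a shared log.
baseline = {"learning_rate": 2e-4, "epochs": 3, "lora_rank": 16}
log = []

def run_experiment(overrides, val_loss):
    if len(overrides) != 1:
        raise ValueError("change exactly one hyperparameter per run")
    config = {**baseline, **overrides}
    # val_loss is supplied here for illustration; in practice it comes
    # from your offline evaluation of the resulting checkpoint.
    log.append({"config": config, "val_loss": val_loss})
    return config

run_experiment({"learning_rate": 1e-4}, val_loss=0.42)
best = min(log, key=lambda e: e["val_loss"])  # best run so far
```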
Common Pitfalls
- Overfitting: Watch for validation metrics diverging from training metrics
- Unstable Training: If loss spikes or becomes NaN, reduce learning rate
- Premature Optimization: Don’t tune hyperparameters before having a good baseline
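The unstable-training pitfall lends itself to a simple automated check: flag the run if the latest loss is NaN or spikes well above the recent average. A hedged sketch (the 2x spike threshold is an assumption, not a recommended value):

```python
import math

# Stability guard: returns True when the latest loss is NaN or exceeds
# `spike_factor` times the average of the earlier losses, signalling that
# the learning rate should be reduced.
def should_reduce_lr(loss_history, spike_factor=2.0):
    current = loss_history[-1]
    if math.isnan(current):
        return True
    previous = loss_history[:-1]
    if previous:
        avg = sum(previous) / len(previous)
        return current > spike_factor * avg
    return False
```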
Advanced Techniques
Learning Rate Scheduling
- Use learning rate warmup to stabilize early training
- Consider learning rate decay to fine-tune final performance
- Example configuration:
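As a hedged illustration of the shape such a schedule takes (the function and parameter names here are assumptions, not this tool's config fields; see the Configuration Reference for the supported options): the learning rate ramps linearly over a warmup period, then decays smoothly toward zero.

```python
import math

# Linear warmup followed by cosine decay: ramps from 0 to base_lr over
# `warmup_steps`, then decays to 0 by `total_steps`.
def lr_at(step, base_lr=2e-4, warmup_steps=100, total_steps=1000):
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```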
Gradient Clipping
- Helps prevent exploding gradients
- Particularly useful with larger learning rates
- Example configuration: support coming soon.
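While native support is pending, the underlying technique is standard. A general illustration of clipping by global norm, independent of any framework: if the gradient norm exceeds a threshold, every component is scaled down so the norm equals that threshold.

```python
import math

# Clip a gradient vector by its global L2 norm: gradients within the
# threshold pass through unchanged; larger ones are rescaled onto it.
def clip_by_norm(grads, max_norm=1.0):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # norm 5.0 scaled down to 1.0
```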
Next Steps
- Review the SFT Configuration Reference for detailed parameter descriptions. You can also find the parameters for continued pretraining, GRPO, and augmentation in their respective configuration files.
- Review the Hyperparameter Tuning Guide for more detailed information on how to tune hyperparameters.
- Experiment with different configurations on a small subset of your data