Fine-tuning Best Practices
Hyperparameter Tuning
Learn how to effectively tune hyperparameters for optimal model performance
Overview
Effective hyperparameter tuning is crucial for achieving optimal performance when fine-tuning language models. This guide covers key hyperparameters, their impact on training, and best practices for tuning them.
Key Hyperparameters
Learning Rate
The learning rate controls how much the model updates its weights in response to training data.
- Too high: Training may become unstable or diverge
- Too low: Training may be slow or get stuck in suboptimal solutions
- Best practice: Start with 2e-4 for LoRA and gradually decrease if training is unstable
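The stability point above can be illustrated with a toy example that is not tied to any particular framework: gradient descent on f(w) = w², where a step that is too large makes the weight grow each update (divergence) while a smaller step shrinks it toward the minimum.

```python
# Toy illustration of learning-rate stability: gradient descent on f(w) = w^2.
# Each update is w <- w - lr * f'(w). With lr too large, |w| grows every step.
def gradient_descent(lr, steps=10, w=1.0):
    for _ in range(steps):
        grad = 2 * w  # derivative of w^2
        w = w - lr * grad
    return w

stable = gradient_descent(lr=0.1)    # |w| shrinks toward the minimum at 0
unstable = gradient_descent(lr=1.5)  # |w| grows each step: training diverges
```

The same dynamic plays out in real fine-tuning, which is why lowering the learning rate is the first response to unstable loss curves.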
Batch Size
The number of training examples processed in each training step.
- Larger batches: More stable gradients but require more memory
- Smaller batches: Less memory but potentially noisier training
- Best practice: Use the largest batch size that fits in your available memory. For LoRA, use an `effective_batch_size` of 16 or 32.
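When the target effective batch size does not fit in memory, it is commonly reached through gradient accumulation. A minimal sketch (the function and parameter names here are illustrative, not this tool's actual config keys):

```python
# Effective batch size = micro-batch size * devices * accumulation steps.
# This helper computes how many micro-batches to accumulate before an
# optimizer step to hit a target effective batch size.
def accumulation_steps(effective_batch_size, micro_batch_size, num_devices=1):
    per_step = micro_batch_size * num_devices
    if effective_batch_size % per_step != 0:
        raise ValueError("effective batch size must be a multiple of micro batch * devices")
    return effective_batch_size // per_step

# e.g. a target effective batch size of 32 with micro-batches of 4 on one device
steps = accumulation_steps(32, 4)  # accumulate gradients over 8 micro-batches
```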
Number of Epochs
How many times the model sees the entire training dataset over the course of training.
- Too few: Model may underfit and not learn the task well
- Too many: Model may overfit or waste computational resources
- Best practice: Start with 3 epochs and adjust based on validation performance
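"Adjust based on validation performance" can be made concrete with a simple patience rule: keep the epoch with the best validation loss, and stop once it has not improved for a while. A sketch (the patience threshold is an assumption, not a recommended default):

```python
# Illustrative early-stopping check: returns the epoch with the best
# validation loss, stopping once `patience` epochs pass without improvement.
def epochs_to_keep(val_losses, patience=1):
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss has stopped improving
    return best_epoch

# Validation loss improves for 3 epochs, then rises: keep 3 epochs.
chosen = epochs_to_keep([1.0, 0.8, 0.7, 0.75])
```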
LoRA Rank
For LoRA fine-tuning, the rank determines the complexity of adaptations.
- Higher rank: More expressive but requires more memory
- Lower rank: More efficient but may limit adaptation capability
- Best practice: Start with rank 16 for most tasks, increase for complex tasks
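The memory trade-off follows directly from how LoRA factors the weight update: the delta for a d_out × d_in matrix is B @ A with B of shape (d_out, r) and A of shape (r, d_in), so trainable parameters grow linearly in the rank r. A quick back-of-the-envelope check:

```python
# Trainable parameters a LoRA adapter adds for one weight matrix:
# A has rank * d_in entries, B has d_out * rank entries.
def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

full = 4096 * 4096                 # full fine-tune of one 4096x4096 matrix
r16 = lora_params(4096, 4096, 16)  # rank 16: a small fraction of `full`
r64 = lora_params(4096, 4096, 64)  # rank 64: 4x the rank-16 count
```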
Systematic Tuning Process
1. Start with Defaults
   - Begin with our recommended default values
   - These work well for most common use cases
   - See the Configuration Reference for defaults
2. Monitor Key Metrics
   - Training loss
   - Validation loss
   - Task-specific metrics (accuracy, F1, etc.) through offline evaluation
3. Adjust One at a Time
   - Change only one hyperparameter at a time
   - Document the impact of each change
   - Keep track of what works and what doesn’t
4. Consider Resource Constraints
   - Balance performance gains against computational costs
   - Monitor memory usage and training time
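The process above can be sketched as a minimal experiment log. The hyperparameter names and the `run_experiment` helper are illustrative placeholders, not this tool's API; the point is that each run copies the baseline, changes exactly one value, and records the outcome:

```python
# "Adjust one at a time" as code: every experiment overrides exactly one
# hyperparameter from the baseline and is appended to a shared log.
baseline = {"learning_rate": 2e-4, "epochs": 3, "lora_rank": 16}
log = []

def run_experiment(overrides, val_loss):
    if len(overrides) != 1:
        raise ValueError("change exactly one hyperparameter per run")
    config = {**baseline, **overrides}
    # val_loss is supplied here for illustration; in practice it comes
    # from your offline evaluation of the resulting checkpoint.
    log.append({"config": config, "val_loss": val_loss})
    return config

run_experiment({"learning_rate": 1e-4}, val_loss=0.42)
best = min(log, key=lambda e: e["val_loss"])  # best run so far
```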
Common Pitfalls
- Overfitting: Watch for validation metrics diverging from training metrics
- Unstable Training: If loss spikes or becomes NaN, reduce learning rate
- Premature Optimization: Don’t tune hyperparameters before having a good baseline
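The unstable-training pitfall lends itself to a simple automated check: flag the run if the latest loss is NaN or spikes well above the recent average. A hedged sketch (the 2x spike threshold is an assumption, not a recommended value):

```python
import math

# Stability guard: returns True when the latest loss is NaN or exceeds
# `spike_factor` times the average of the earlier losses, signalling that
# the learning rate should be reduced.
def should_reduce_lr(loss_history, spike_factor=2.0):
    current = loss_history[-1]
    if math.isnan(current):
        return True
    previous = loss_history[:-1]
    if previous:
        avg = sum(previous) / len(previous)
        return current > spike_factor * avg
    return False
```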
Advanced Techniques
Learning Rate Scheduling
- Use learning rate warmup to stabilize early training
- Consider learning rate decay to fine-tune final performance
- Example configuration:
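As a hedged illustration of the shape such a schedule takes (the function and parameter names here are assumptions, not this tool's config fields; see the Configuration Reference for the supported options): the learning rate ramps linearly over a warmup period, then decays smoothly toward zero.

```python
import math

# Linear warmup followed by cosine decay: ramps from 0 to base_lr over
# `warmup_steps`, then decays to 0 by `total_steps`.
def lr_at(step, base_lr=2e-4, warmup_steps=100, total_steps=1000):
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```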
Gradient Clipping
- Helps prevent exploding gradients
- Particularly useful with larger learning rates
- Example configuration: support coming soon.
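While native support is pending, the underlying technique is standard. A general illustration of clipping by global norm, independent of any framework: if the gradient norm exceeds a threshold, every component is scaled down so the norm equals that threshold.

```python
import math

# Clip a gradient vector by its global L2 norm: gradients within the
# threshold pass through unchanged; larger ones are rescaled onto it.
def clip_by_norm(grads, max_norm=1.0):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # norm 5.0 scaled down to 1.0
```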
Next Steps
- Review the SFT Configuration Reference for detailed parameter descriptions. You can also find the parameters for continued pretraining, GRPO, and augmentation in their respective configuration files.
- Review the Hyperparameter Tuning Guide for more detailed information on how to tune hyperparameters.
- Experiment with different configurations on a small subset of your data