Fine-tuning Best Practices: Synthetic Data
Creating a High-Quality Seed Dataset
The foundation of effective synthetic data generation is a well-prepared seed
dataset. Follow these guidelines to ensure your seed data leads to high-quality
synthetic examples.
Representative Sample Selection
When selecting or creating your seed examples:
Initial Selection: Hand-pick examples from existing datasets or carefully craft new ones
Diversity: Include at least one example of each variation or pattern you want the model to learn
Balance: Avoid over-representation of any particular class or feature (a quick check is sketched after this list)
Minimum Size: Start with at least 10 high-quality examples
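For a quick read on balance and diversity, a check along these lines helps; the seed.jsonl path and the pattern field are illustrative assumptions about how you store and tag your seed examples:

```python
import json
from collections import Counter

# Load seed examples from JSONL; "pattern" is a hypothetical tag you
# might attach to each example to mark which variation it covers.
with open("seed.jsonl") as f:
    examples = [json.loads(line) for line in f]

counts = Counter(ex.get("pattern", "untagged") for ex in examples)

print(f"total seed examples: {len(examples)}")  # aim for at least 10
for pattern, n in counts.most_common():
    share = n / len(examples)
    flag = "  <- possibly over-represented" if share > 0.5 else ""
    print(f"  {pattern}: {n} ({share:.0%}){flag}")
```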
Data Quality Standards
Maintain high quality standards across your seed dataset:
Accuracy: Verify that all examples are correct and error-free
Consistency: Use uniform formatting and structure across all examples
Completeness: Ensure each example contains all necessary information (a simple per-record gate is sketched below)
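A lightweight per-record gate can enforce these standards automatically. The sketch below assumes each example is a dict with prompt and completion fields; adjust the schema to your own:

```python
# Per-record quality gate; assumes each seed example is a dict with
# "prompt" and "completion" keys (adjust to your own schema).
REQUIRED_KEYS = ("prompt", "completion")

def validate_example(ex: dict) -> list:
    """Return a list of problems found in one seed example."""
    problems = []
    for key in REQUIRED_KEYS:
        value = ex.get(key)
        if value is None:
            problems.append(f"missing field: {key}")
        elif not str(value).strip():
            problems.append(f"empty field: {key}")
    return problems

# Usage on a couple of in-memory examples:
seed = [
    {"prompt": "Summarize: ...", "completion": "A short summary."},
    {"prompt": "Classify: ..."},  # missing completion -> flagged
]
for i, ex in enumerate(seed):
    problems = validate_example(ex)
    if problems:
        print(f"example {i}: {problems}")
```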
Example Structure
Format individual examples effectively:
Overview: Include a 1-2 line description at the top of each example explaining the task
Zero-Shot Format: Keep examples self-contained, without demonstrations or multi-step instructions
Clear Boundaries: Maintain a clear separation between prompt and completion
Consistent Style: Use consistent writing style and terminology (a concrete example follows this list)
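Putting these points together, a single seed example might look like the following (the task and field names are hypothetical):

```python
# Hypothetical seed example: a short overview, a self-contained
# zero-shot prompt, and a clearly separated completion.
example = {
    "overview": "Classify a customer support ticket into one category.",
    "prompt": (
        "Classify the following support ticket as 'billing', 'technical', "
        "or 'other'.\n\n"
        "Ticket: I was charged twice for my subscription this month.\n\n"
        "Category:"
    ),
    "completion": " billing",
}
```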
Iterative Development Process
Synthetic data generation works best as an iterative process. Here’s how to
approach it:
Initial Testing Phase
Model Selection
Generate small batches with different base models
Compare quality across models
Select the most promising model for further testing
Parameter Tuning
Start with small sample sizes
Test different generation strategies
Adjust parameters based on results (a sketch of this testing loop follows)
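A minimal sketch of that testing loop, assuming a generate_batch stand-in for whatever generation API you use (the model names are placeholders):

```python
import random

CANDIDATE_MODELS = ["base-model-a", "base-model-b"]  # placeholder names

def generate_batch(model, seed, n, temperature):
    """Stand-in for your real generation API; replace the body with an
    actual call. Here it just echoes shuffled seed examples."""
    return [dict(random.choice(seed), model=model) for _ in range(n)]

seed = [{"prompt": "Classify: ...", "completion": " billing"}]

results = {}
for model in CANDIDATE_MODELS:
    # Keep batches small and settings moderate while exploring.
    results[model] = generate_batch(model, seed, n=20, temperature=0.7)

# Review batches side by side before scaling up the most promising model.
for model, batch in results.items():
    print(model, len(batch), batch[0])
```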
Refinement Loop
Quality Review
Manually review generated samples
Identify common issues or patterns
Document successful and unsuccessful examples
Dataset Improvement
Remove or fix poor examples
Refine seed dataset based on findings
Add examples to address gaps
Regeneration
Generate new samples with improved seed data
Compare results with previous iterations
Continue refining until quality goals are met (the loop is sketched below)
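The whole loop condenses to something like this sketch, where quality_score and generate_batch are hypothetical hooks for your review heuristic and generation call:

```python
def quality_score(example):
    """Hypothetical review hook: plug in your own heuristic or a
    manual 0-1 label from the review step."""
    return 1.0 if example.get("completion", "").strip() else 0.0

def refine(seed, generate_batch, rounds=3, keep_threshold=0.8, batch_size=50):
    for r in range(rounds):
        batch = generate_batch(seed, n=batch_size)
        kept = [ex for ex in batch if quality_score(ex) >= keep_threshold]
        print(f"round {r}: kept {len(kept)}/{len(batch)}")
        # Fold a few of the best generations back into the seed to
        # address gaps, then regenerate and compare with the last round.
        seed = seed + kept[:10]
    return seed
```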
Validation Strategy
Implement a robust validation process:
Quality Metrics
Compare synthetic data statistics with those of your real data
Check for unwanted patterns or biases
Verify the diversity of generated examples (one cheap proxy is sketched below)
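One cheap, dependency-free way to compare corpora is mean length plus a distinct-bigram ratio; this is a proxy, not a substitute for manual review:

```python
from statistics import mean

def distinct_2(texts):
    """Share of unique bigrams across a corpus: a cheap diversity proxy."""
    bigrams, total = set(), 0
    for t in texts:
        tokens = t.split()
        pairs = list(zip(tokens, tokens[1:]))
        bigrams.update(pairs)
        total += len(pairs)
    return len(bigrams) / max(total, 1)

def report(name, texts):
    avg_len = mean(len(t.split()) for t in texts)
    print(f"{name}: n={len(texts)}, mean_len={avg_len:.1f}, "
          f"distinct-2={distinct_2(texts):.2f}")

# Large gaps between the two corpora signal problems in the seed data.
report("real", ["the quick brown fox", "a slow red fox"])
report("synthetic", ["the quick brown fox", "the quick brown fox"])
```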
Use Case Testing
Test synthetic data in your intended application
Measure performance against objectives
Compare results with real data performance
Continuous Monitoring
Track quality over multiple generations
Monitor for degradation or drift
Adjust the process based on findings (a minimal drift check follows)
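A minimal way to track this is to log one scalar quality metric per generation and flag sharp drops; the threshold below is illustrative:

```python
history = []

def record(metric, max_drop=0.15):
    """Append a per-generation quality metric and flag sharp drops."""
    if history and metric < history[-1] - max_drop:
        print(f"warning: quality dropped {history[-1]:.2f} -> {metric:.2f}")
    history.append(metric)

# Any scalar quality measure works, e.g. the distinct-2 score above.
for score in [0.82, 0.80, 0.61, 0.79]:
    record(score)  # flags the 0.80 -> 0.61 drop
```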
Cost Optimization
Optimize your synthetic data generation costs:
Model Selection
Start with cost-effective models
Use more expensive models only when needed
Balance quality vs. cost
Strategy Selection
Use single_pass for initial testing
Reserve mixture_of_agents for when higher quality is needed
Consider the cost-benefit ratio for your use case (a rough estimate is sketched below)
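A back-of-the-envelope comparison makes the trade-off concrete. The prices below are made up, and the assumption that mixture_of_agents costs roughly one pass per agent layer is illustrative; substitute your provider's real rates and call counts:

```python
# All prices and token counts are made-up placeholders.
PRICE_PER_1K_TOKENS = {"cheap-model": 0.0002, "strong-model": 0.0030}

def estimate_cost(model, n_examples, tokens_per_example, passes=1):
    total_tokens = n_examples * tokens_per_example * passes
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]

# single_pass: one generation per example on a cheaper model
print(f"single_pass:       ${estimate_cost('cheap-model', 1000, 500):.2f}")
# mixture_of_agents: multiple coordinated calls per example, so model
# the extra passes explicitly (4 here is illustrative)
print(f"mixture_of_agents: ${estimate_cost('strong-model', 1000, 500, passes=4):.2f}")
```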
Common Pitfalls to Avoid
Dataset Issues
Too few seed examples
Inconsistent formatting
Poor quality examples
Biased or unbalanced selection (the preflight check at the end of this section catches several of these)
Quality Control
Not reviewing generated samples
Missing validation steps
Ignoring error patterns
Not documenting issues
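Several of these pitfalls can be caught mechanically before you spend generation budget. A minimal preflight sketch, assuming JSONL seed data (the path and thresholds are illustrative):

```python
import json
from collections import Counter

def preflight(path, min_examples=10):
    """Flag dataset pitfalls: too few examples, inconsistent record
    shapes, and duplicates. Thresholds are illustrative."""
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    issues = []
    if len(examples) < min_examples:
        issues.append(f"only {len(examples)} seed examples (< {min_examples})")
    shapes = Counter(tuple(sorted(ex)) for ex in examples)
    if len(shapes) > 1:
        issues.append(f"inconsistent record shapes: {list(shapes)}")
    unique = {json.dumps(ex, sort_keys=True) for ex in examples}
    if len(unique) < len(examples):
        issues.append(f"{len(examples) - len(unique)} duplicate examples")
    return issues or ["looks OK"]

print(preflight("seed.jsonl"))  # hypothetical path
```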
Next Steps
After implementing these best practices:
Generate synthetic data
Prepare for fine-tuning
Start the fine-tuning process
Evaluate your results