Creating a High-Quality Seed Dataset

The foundation of effective synthetic data generation is a well-prepared seed dataset. Follow these guidelines to ensure your seed data leads to high-quality synthetic examples.

Representative Sample Selection

When selecting or creating your seed examples:

  • Initial Selection: Hand-pick examples from existing datasets or carefully craft new ones
  • Diversity: Include at least one example of each variation or pattern you want the model to learn
  • Balance: Avoid over-representation of any particular class or feature
  • Minimum Size: Start with at least 10 high-quality examples
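The selection criteria above can be sanity-checked programmatically. A minimal sketch in Python, assuming a prompt/completion record format (the field names and examples are illustrative, not a required schema):

```python
import json
from collections import Counter

# Three hand-picked seed examples in a hypothetical prompt/completion
# format; real seed data should cover every pattern you want learned.
seed_examples = [
    {"prompt": "Classify the sentiment: 'Great service!'", "completion": "positive"},
    {"prompt": "Classify the sentiment: 'Terrible wait times.'", "completion": "negative"},
    {"prompt": "Classify the sentiment: 'It was okay, I guess.'", "completion": "neutral"},
]

def check_seed_dataset(examples, min_size=10):
    """Report size, minimum-size compliance, and class balance."""
    labels = Counter(ex["completion"] for ex in examples)
    return {
        "size": len(examples),
        "meets_min_size": len(examples) >= min_size,
        "label_counts": dict(labels),
    }

print(json.dumps(check_seed_dataset(seed_examples), indent=2))
```

Here the report would flag that three examples fall short of the 10-example minimum, while the label counts make class imbalance visible at a glance.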

Data Quality Standards

Maintain high standards of quality across your seed dataset:

  • Accuracy: Verify that all examples are correct and error-free
  • Consistency: Use uniform formatting and structure across all examples
  • Completeness: Ensure each example contains all necessary information
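These standards can be enforced with a small validation pass before generation begins. A sketch, again assuming illustrative prompt/completion field names:

```python
def validate_example(ex, required_keys=("prompt", "completion")):
    """Return a list of quality problems found in one seed example."""
    problems = []
    for key in required_keys:
        if key not in ex:
            problems.append(f"missing key: {key}")      # completeness
        elif not str(ex[key]).strip():
            problems.append(f"empty value: {key}")      # completeness
    extra = sorted(set(ex) - set(required_keys))
    if extra:
        problems.append(f"unexpected keys: {extra}")    # consistency
    return problems

dataset = [
    {"prompt": "Summarize: ...", "completion": "A short summary."},
    {"prompt": "", "completion": "Orphan answer."},
    {"prompt": "Translate: hola", "answer": "hello"},
]
issues = {i: validate_example(ex) for i, ex in enumerate(dataset) if validate_example(ex)}
print(issues)
```

Structural checks like these catch consistency and completeness problems cheaply; accuracy still requires a human (or task-specific) review of each example's content.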

Example Structure

Format individual examples effectively:

  • Overview: Include a 1-2 line description at the top of each example explaining the task
  • Zero-Shot Format: Keep examples self-contained without demonstrations or multi-step instructions
  • Clear Boundaries: Maintain clear separation between prompt and completion
  • Consistent Style: Use consistent writing style and terminology
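One way to realize this structure is a small formatting helper. The overview header and section delimiters below are illustrative choices, not a mandated format; what matters is that every example uses the same ones:

```python
def format_example(overview, prompt, completion):
    """Render one zero-shot seed example: a 1-2 line overview up top,
    then a clearly bounded prompt and completion."""
    return (
        f"# Task: {overview}\n"
        f"### Prompt\n{prompt}\n"
        f"### Completion\n{completion}\n"
    )

example = format_example(
    overview="Classify customer feedback as positive, negative, or neutral.",
    prompt="Feedback: 'The support team resolved my issue in minutes.'",
    completion="positive",
)
print(example)
```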

Iterative Development Process

Synthetic data generation works best as an iterative process. Here’s how to approach it:

Initial Testing Phase

  1. Model Selection

    • Generate small batches with different base models
    • Compare quality across models
    • Select the most promising model for further testing
  2. Parameter Tuning

    • Start with small sample sizes
    • Test different generation strategies
    • Adjust parameters based on results
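The testing phase above can be scripted so comparisons stay reproducible. The sketch below simulates generation so it runs standalone; in practice `generate_batch` would wrap your provider's API, and the model names, scoring proxy, and parameters are all placeholders:

```python
import random

def generate_batch(model, seed_examples, n=5, temperature=0.7):
    """Stand-in for a real generation call: `model`, `n`, and
    `temperature` mirror common API parameters, but this body only
    produces deterministic dummy samples for illustration."""
    rng = random.Random(f"{model}:{temperature}")
    return [rng.choice(seed_examples)["prompt"] + f" [variant {i}]" for i in range(n)]

def score_batch(samples):
    """Toy quality proxy: average sample length. Replace with a real
    review metric (human review, LLM-as-judge, task accuracy)."""
    return sum(len(s) for s in samples) / len(samples)

seeds = [{"prompt": "Classify the sentiment: 'Great service!'"}]
# Small batches from several candidate models, then pick the best scorer.
scores = {m: score_batch(generate_batch(m, seeds)) for m in ("model-a", "model-b")}
best = max(scores, key=scores.get)
print(best, scores)
```

Seeding the generator per model keeps each run repeatable, so a score difference reflects the model or parameter change rather than sampling noise.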

Refinement Loop

  1. Quality Review

    • Manually review generated samples
    • Identify common issues or patterns
    • Document successful and unsuccessful examples
  2. Dataset Improvement

    • Remove or fix poor examples
    • Refine seed dataset based on findings
    • Add examples to address gaps
  3. Regeneration

    • Generate new samples with improved seed data
    • Compare results with previous iterations
    • Continue refining until quality goals are met
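The refinement loop can be expressed as a generate / review / improve cycle. A sketch where `generate` and `review` are user-supplied callables standing in for your pipeline (assumptions, not a fixed API):

```python
def refinement_loop(seed, generate, review, max_rounds=3, target_pass_rate=0.9):
    """Generate samples, keep the ones that pass review, fold them back
    into the seed, and stop once the pass rate meets the quality goal."""
    pass_rate = 0.0
    for _ in range(max_rounds):
        samples = generate(seed)
        good = [s for s in samples if review(s)]
        pass_rate = len(good) / len(samples) if samples else 0.0
        seed = seed + good              # address gaps with passing samples
        if pass_rate >= target_pass_rate:
            break
    return seed, pass_rate

# Toy pipeline: each round echoes the seed; review rejects short strings.
seed = ["a well-formed seed example"]
final_seed, rate = refinement_loop(
    seed,
    generate=lambda s: [x + " (regenerated)" for x in s],
    review=lambda s: len(s) > 10,
)
print(len(final_seed), rate)
```

Tracking the pass rate per round gives you the comparison against previous iterations that step 3 calls for.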

Validation Strategy

Implement a robust validation process:

  1. Quality Metrics

    • Compare synthetic data statistics with real data
    • Check for unwanted patterns or biases
    • Verify diversity of generated examples
  2. Use Case Testing

    • Test synthetic data in your intended application
    • Measure performance against objectives
    • Compare results with real data performance
  3. Continuous Monitoring

    • Track quality over multiple generations
    • Monitor for degradation or drift
    • Adjust process based on findings
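The quality-metric step can start from simple distribution comparisons between synthetic output and a real reference set. A sketch with illustrative sample strings; real pipelines would add task-specific metrics on top:

```python
def diversity_ratio(samples):
    """Fraction of unique samples; values near 1.0 mean few duplicates."""
    return len(set(samples)) / len(samples) if samples else 0.0

def mean_word_count(samples):
    """Mean length in words, a cheap proxy for distribution drift."""
    return sum(len(s.split()) for s in samples) / len(samples)

real = ["the service was excellent today", "delivery arrived two days late"]
synthetic = ["the product works as described", "the product works as described",
             "shipping was faster than expected"]

report = {
    "diversity": round(diversity_ratio(synthetic), 2),
    "real_mean_words": mean_word_count(real),
    "synthetic_mean_words": round(mean_word_count(synthetic), 2),
}
print(report)
```

Recomputing the same report after each generation run also covers the continuous-monitoring step: a falling diversity ratio or drifting mean length is an early sign of degradation.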

Cost Optimization

Optimize your synthetic data generation costs:

  1. Model Selection

    • Start with cost-effective models
    • Use more expensive models only when needed
    • Balance quality vs. cost
  2. Strategy Selection

    • Use single_pass for initial testing
    • Reserve mixture_of_agents for when higher quality is needed
    • Consider cost-benefit ratio for your use case
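A rough cost estimate helps ground these trade-offs before committing to a model or strategy. The prices below are hypothetical placeholders, not real provider rates:

```python
def estimate_cost(n_samples, avg_tokens_per_sample, price_per_million_tokens):
    """Back-of-the-envelope generation cost in dollars."""
    return n_samples * avg_tokens_per_sample * price_per_million_tokens / 1_000_000

# 1,000 samples of ~500 tokens each, cheap vs. premium model
# (both prices are hypothetical).
cheap = estimate_cost(1000, 500, price_per_million_tokens=0.20)
premium = estimate_cost(1000, 500, price_per_million_tokens=3.00)
print(f"cheap: ${cheap:.2f}, premium: ${premium:.2f}")
```

Note that a mixture_of_agents run typically multiplies token usage by the number of agent calls, so fold that factor into `avg_tokens_per_sample` when comparing it against single_pass.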

Common Pitfalls to Avoid

  1. Dataset Issues

    • Too few seed examples
    • Inconsistent formatting
    • Poor quality examples
    • Biased or unbalanced selection
  2. Quality Control

    • Not reviewing generated samples
    • Missing validation steps
    • Ignoring error patterns
    • Not documenting issues

Next Steps

After implementing these best practices:

  1. Generate synthetic data
  2. Prepare for fine-tuning
  3. Start the fine-tuning process
  4. Evaluate your results