Synthetic Data Generation
Generate high-quality training data from just a few examples
Don’t have enough data for fine-tuning? Predibase offers synthetic data generation at no extra cost in both the UI and SDK, requiring as few as 10 examples to get started.
Overview
Synthetic data generation creates new training examples that mimic the patterns and style of your original data. This helps:
- Improve model accuracy through exposure to more scenarios
- Reduce overfitting with diverse training examples
- Balance class distributions by generating examples of sparse classes
- Address model weaknesses with targeted additional data
How It Works
Our data augmentation process uses OpenAI models to generate high-quality synthetic data based on your original examples. It returns a new dataset that combines the original and synthetic examples and has three columns:
- prompt: The input text used to generate a response
- completion: The generated response or continuation
- source: Origin of the example (original or synthetic)
Our current implementation generates new, synthetic prompts and their corresponding completions based on your original examples.
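To make the output shape concrete, here is a minimal sketch in plain Python of how a dataset with the prompt, completion, and source columns could be assembled. The helper name build_augmented_rows and the sample rows are hypothetical and only illustrate the schema; they are not part of the Predibase SDK.

```python
from collections import Counter


def build_augmented_rows(original, synthetic):
    """Combine original and synthetic examples, tagging each row with its source."""
    rows = []
    for source, examples in (("original", original), ("synthetic", synthetic)):
        for example in examples:
            rows.append(
                {
                    "prompt": example["prompt"],
                    "completion": example["completion"],
                    "source": source,
                }
            )
    return rows


# Hypothetical sample data, for illustration only.
original = [{"prompt": "Translate 'hello' to French.", "completion": "bonjour"}]
synthetic = [
    {"prompt": "Translate 'goodbye' to French.", "completion": "au revoir"},
    {"prompt": "Translate 'thank you' to French.", "completion": "merci"},
]

rows = build_augmented_rows(original, synthetic)
source_counts = Counter(row["source"] for row in rows)
print(source_counts)
```

Counting rows by source, as in the last lines, is a quick way to check how much of an augmented dataset is synthetic before fine-tuning on it.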
Generation Strategies
Predibase supports two approaches:
1. Single Pass
A straightforward method that creates data directly reflecting provided examples:
- Uses 1 LLM call per synthetic example
- Direct reflection of seed examples
- Faster and more cost-effective
2. Mixture of Agents
A sophisticated chain of LLM calls for higher-quality examples:
- Generate k prompt-completion candidates from each seed
- Critique candidates based on inferred context and criteria
- Synthesize final examples from candidates and critiques
This strategy uses 6 LLM calls per synthetic example, trading speed and cost for higher-quality output.
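The difference in LLM calls between the two strategies adds up quickly, so it can help to estimate the total before starting a job. The sketch below uses the per-example call counts stated above (1 and 6). The function name estimate_llm_calls and the key "mixture_of_agents" are assumptions made for illustration; only "single_pass" appears as a strategy value in this guide.

```python
# LLM calls per synthetic example, as described in this section.
# "mixture_of_agents" is an assumed identifier for the second strategy.
CALLS_PER_EXAMPLE = {"single_pass": 1, "mixture_of_agents": 6}


def estimate_llm_calls(strategy: str, num_samples: int) -> int:
    """Estimate the total number of LLM calls for a generation job."""
    if strategy not in CALLS_PER_EXAMPLE:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    return CALLS_PER_EXAMPLE[strategy] * num_samples


for strategy in CALLS_PER_EXAMPLE:
    print(strategy, estimate_llm_calls(strategy, 50))
# single_pass 50
# mixture_of_agents 300
```

This counts calls only. Actual OpenAI cost also depends on the model chosen and the number of tokens per call, which this sketch does not model.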
Implementation Guide
Prerequisites
- Seed dataset (minimum 10 rows) formatted for supervised fine-tuning in prompt-completion format.
- OpenAI API Key
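Before uploading, you may want to check a seed dataset against the requirements above. The following is a minimal sketch that assumes the seed data is stored as JSON Lines with one prompt-completion object per line. The helpers load_jsonl and validate_seed_rows are hypothetical and are not part of the Predibase SDK.

```python
import json

MIN_SEED_ROWS = 10  # minimum stated in the prerequisites
REQUIRED_KEYS = ("prompt", "completion")


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def validate_seed_rows(rows):
    """Return a list of problems found; an empty list means the rows look usable."""
    problems = []
    if len(rows) < MIN_SEED_ROWS:
        problems.append(f"need at least {MIN_SEED_ROWS} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        for key in REQUIRED_KEYS:
            value = row.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {i}: missing or empty '{key}'")
    return problems


# Hypothetical in-memory seed rows, for illustration only.
seed_rows = [
    {"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(10)
]
print(validate_seed_rows(seed_rows))  # []
```

For a file on disk, validate_seed_rows(load_jsonl("seed.jsonl")) would run the same checks, where seed.jsonl is a placeholder path.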
You will incur OpenAI API costs when generating synthetic data. We recommend starting with:
- augmentation_strategy: "single_pass"
- A smaller num_samples_to_generate (e.g. 50)
- A more cost-effective OpenAI model (e.g. gpt-4o-mini)
For more details on the available configuration options, see our AugmentationConfig documentation.
Supported Generation Models
The following OpenAI models are supported for synthetic data generation:
- gpt-4o-mini
- gpt-4o-2024-08-06
- gpt-4-turbo
- gpt-4-1106-preview
- gpt-4-0125-preview
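If you build generation settings programmatically, a small guard can catch an unsupported model name before any API cost is incurred. This sketch hard-codes the model list from this page; the function check_model is a hypothetical helper, not an SDK call, and the list would need updating if the supported models change.

```python
# Supported OpenAI models, copied from the list on this page.
SUPPORTED_MODELS = frozenset(
    {
        "gpt-4o-mini",
        "gpt-4o-2024-08-06",
        "gpt-4-turbo",
        "gpt-4-1106-preview",
        "gpt-4-0125-preview",
    }
)


def check_model(model: str) -> str:
    """Return the model name unchanged if supported, otherwise raise ValueError."""
    if model not in SUPPORTED_MODELS:
        supported = ", ".join(sorted(SUPPORTED_MODELS))
        raise ValueError(f"unsupported model {model!r}; choose one of: {supported}")
    return model


print(check_model("gpt-4o-mini"))  # gpt-4o-mini
```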
Quick Start
Initialize the Predibase Client
Connect and Reference Seed Dataset
Connect a dataset or obtain a reference to an existing dataset. If you don’t have a dataset uploaded, you can upload it from a file, or load it with Pandas and upload it from a Pandas DataFrame.
Generate Synthetic Data
Once synthetic data generation is complete, you will be able to see the augmented dataset under the Datasets tab in the UI.
Optional Operations
Download the dataset:
Load into pandas:
At this point, you can choose to modify or filter the dataset as you see fit. You can also upload it back to Predibase for fine-tuning.
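As one example of such filtering, the sketch below drops rows with empty completions and rows whose prompt duplicates an earlier one. It is written in plain Python over a list of row dicts so that it runs without extra dependencies; the same logic carries over to a Pandas data frame. The function clean_rows and the sample rows are hypothetical.

```python
def clean_rows(rows):
    """Drop rows with empty completions and rows whose prompt was already seen.

    The first occurrence of a prompt is kept, so original rows win over
    synthetic duplicates when originals come first in the list.
    """
    seen_prompts = set()
    cleaned = []
    for row in rows:
        if not row["completion"].strip():
            continue  # skip empty completions
        key = row["prompt"].strip().lower()
        if key in seen_prompts:
            continue  # skip duplicate prompts (case-insensitive)
        seen_prompts.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical augmented rows, for illustration only.
augmented = [
    {"prompt": "Define recursion.", "completion": "A function calling itself.", "source": "original"},
    {"prompt": "define recursion.", "completion": "Self-reference in code.", "source": "synthetic"},
    {"prompt": "Define iteration.", "completion": "", "source": "synthetic"},
    {"prompt": "Define a loop.", "completion": "Repeated execution.", "source": "synthetic"},
]

cleaned = clean_rows(augmented)
print(len(cleaned))  # 2
```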
Upload modified dataset:
Next Steps
- Get detailed guidance on creating synthetic data
- Prepare your dataset for fine-tuning
- Start fine-tuning with your augmented dataset
- Evaluate results to measure effectiveness