Overview
Synthetic data generation creates new training examples that mimic the patterns and style of your original data. This helps:
- Improve model accuracy through exposure to more scenarios
- Reduce overfitting with diverse training examples
- Balance class distributions by generating examples of sparse classes
- Address model weaknesses with targeted additional data
How It Works
Our data augmentation process uses OpenAI to generate high-quality synthetic data based on your original examples. The augmented dataset combines original and synthetic examples and has three columns:
- prompt: The input text used to generate a response
- completion: The generated response or continuation
- source: Origin of the example (original or synthetic)
Synthetic examples consist of prompts and their corresponding completions, generated based on your original examples.
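To make the output shape concrete, the sketch below assembles an augmented dataset from original and synthetic rows using plain Python dictionaries. The `combine` helper and the sample rows are hypothetical illustrations, not Predibase code; only the three column names and the two `source` values come from the description above.

```python
def combine(original, synthetic):
    """Merge original and synthetic rows, tagging each row with its origin."""
    rows = []
    for source, examples in (("original", original), ("synthetic", synthetic)):
        for ex in examples:
            rows.append({
                "prompt": ex["prompt"],
                "completion": ex["completion"],
                "source": source,
            })
    return rows


# Hypothetical sample rows, for illustration only.
original = [{"prompt": "Translate 'hello' to French.", "completion": "bonjour"}]
synthetic = [{"prompt": "Translate 'thank you' to French.", "completion": "merci"}]
augmented = combine(original, synthetic)
```

Keeping the `source` column makes it easy to separate the two kinds of rows later, for example to review only the generated examples.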
Generation Strategies
Predibase supports two approaches:
1. Single Pass
A straightforward method that creates data directly reflecting provided examples:
- Uses 1 LLM call per synthetic example
- Direct reflection of seed examples
- Faster and more cost-effective
2. Mixture of Agents
A sophisticated chain of LLM calls for higher-quality examples:
- Generate k prompt-completion candidates from each seed
- Critique candidates based on inferred context and criteria
- Synthesize final examples from candidates and critiques
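The difference between the two strategies can be sketched as follows. This is a conceptual illustration, not Predibase's implementation: the `llm` callable, the prompt wording, and the choice of one critique call per candidate are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical LLM interface: takes a prompt string, returns generated text.
LLM = Callable[[str], str]


def single_pass(seed: str, llm: LLM) -> str:
    """One LLM call per synthetic example, written directly from the seed."""
    return llm(f"Write a new example in the style of:\n{seed}")


def mixture_of_agents(seed: str, llm: LLM, k: int = 3) -> str:
    """Generate k candidates, critique each one, then synthesize a final example."""
    candidates: List[str] = [
        llm(f"Write candidate {i + 1} in the style of:\n{seed}") for i in range(k)
    ]
    critiques = [llm(f"Critique this candidate:\n{c}") for c in candidates]
    review = "\n".join(f"{c}\nCritique: {q}" for c, q in zip(candidates, critiques))
    return llm(f"Synthesize one final example from:\n{review}")
```

Under these assumptions, mixture of agents makes 2k + 1 calls per synthetic example where single pass makes one, which is consistent with single pass being the faster and cheaper option.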
Implementation Guide
Prerequisites
- Seed dataset (minimum 10 rows) formatted for supervised fine-tuning in prompt-completion format.
- OpenAI API Key
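Before spending API credits, it may be worth checking the seed dataset against these prerequisites locally. The sketch below is a hypothetical helper, assuming the seed rows are already loaded as a list of dictionaries with `prompt` and `completion` keys; it is not part of the Predibase SDK.

```python
MIN_SEED_ROWS = 10  # minimum stated in the prerequisites
REQUIRED_COLUMNS = ("prompt", "completion")


def validate_seed(rows):
    """Return a list of problems found in the seed dataset (empty if none)."""
    problems = []
    if len(rows) < MIN_SEED_ROWS:
        problems.append(f"need at least {MIN_SEED_ROWS} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        for col in REQUIRED_COLUMNS:
            value = row.get(col)
            # A usable cell is a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {i}: missing or empty '{col}'")
    return problems
```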
You will incur OpenAI API costs when generating synthetic data. We recommend starting with:
- augmentation_strategy: "single_pass"
- Smaller num_samples_to_generate (e.g. 50)
- More cost-effective OpenAI model (e.g. gpt-4o-mini)
For more on the options in AugmentationConfig, see our AugmentationConfig documentation.
Supported Generation Models
The following OpenAI models are supported for synthetic data generation:
- gpt-4o-mini
- gpt-4o-2024-08-06
- gpt-4-turbo
- gpt-4-1106-preview
- gpt-4-0125-preview
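A quick local check can catch an unsupported model name before a generation job is submitted. The sketch below uses a plain dictionary rather than the SDK's `AugmentationConfig`; the `base_model` key and the `"mixture_of_agents"` strategy value are assumptions, while the model list and the other two setting names come from this page.

```python
SUPPORTED_MODELS = {
    "gpt-4o-mini",
    "gpt-4o-2024-08-06",
    "gpt-4-turbo",
    "gpt-4-1106-preview",
    "gpt-4-0125-preview",
}
# "single_pass" appears on this page; "mixture_of_agents" is an assumed spelling.
STRATEGIES = {"single_pass", "mixture_of_agents"}


def check_settings(settings):
    """Raise ValueError if the settings use an unsupported model or strategy."""
    model = settings.get("base_model")  # assumed key name
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model!r}")
    strategy = settings.get("augmentation_strategy")
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if settings.get("num_samples_to_generate", 1) < 1:
        raise ValueError("num_samples_to_generate must be at least 1")
    return settings


# Low-cost starting point following the recommendations on this page.
starter = check_settings({
    "base_model": "gpt-4o-mini",
    "augmentation_strategy": "single_pass",
    "num_samples_to_generate": 50,
})
```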
Quick Start
1. Initialize the Predibase Client
2. Connect and Reference Seed Dataset
Connect a dataset or obtain a reference to an existing dataset. If you don’t have a dataset uploaded, you can upload it from a file, or load it with Pandas and upload it from a Pandas data frame.
3. Generate Synthetic Data
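Steps 1 to 3 might look like the sketch below. The code samples for these steps are not present in this text, so every SDK name used here (`Predibase(api_token=...)`, `AugmentationConfig` and its `base_model` argument, `pb.datasets.get`, and `pb.datasets.augment` with its keyword arguments) is an assumption to verify against the SDK reference. The client and config are passed in as arguments so the helper itself stays independent of how they are constructed.

```python
def generate_synthetic(pb, config, seed_name, output_name, openai_api_key):
    """Fetch the seed dataset by name and request an augmented copy.

    `pb` is assumed to be an initialized Predibase client and `config` an
    AugmentationConfig. The method names below are assumptions, not confirmed API.
    """
    seed = pb.datasets.get(seed_name)  # assumed: reference an existing dataset
    return pb.datasets.augment(        # assumed: start synthetic data generation
        config,
        dataset=seed,
        name=output_name,
        openai_api_key=openai_api_key,
    )


# Assumed usage (requires the predibase package and valid credentials):
#   from predibase import Predibase, AugmentationConfig
#   pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
#   config = AugmentationConfig(base_model="gpt-4o-mini")
#   augmented = generate_synthetic(
#       pb, config, "seed_dataset", "augmented_dataset", "<OPENAI_API_KEY>"
#   )
```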
Once generation completes, the augmented dataset is available in the Datasets tab in the UI.
4. Optional Operations
From here you can download the dataset and load it into pandas. At this point, you can choose to modify or filter the dataset as you see fit. You can also upload the modified dataset back to Predibase for fine-tuning.
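As one illustration of the modify-or-filter step, the sketch below reads a downloaded dataset, keeps only rows from one source that have a non-empty completion, and writes the result back out. It assumes the download is a JSON Lines file, which this page does not confirm, and uses only the standard library; the helper names are hypothetical.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-blank line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def filter_rows(rows, source="synthetic"):
    """Keep rows from the given source whose completion is not blank."""
    return [
        r for r in rows
        if r.get("source") == source and r.get("completion", "").strip()
    ]


def save_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

The filtered file can then be uploaded as a new dataset, leaving the unfiltered augmented dataset untouched for comparison.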
Next Steps
- Get detailed guidance on creating synthetic data
- Prepare your dataset for fine-tuning
- Start fine-tuning with your augmented dataset
- Evaluate results to measure effectiveness