Generate high-quality training data from just a few examples
Don’t have enough data for fine-tuning? Predibase offers synthetic data generation at no extra cost in both the UI and SDK, requiring as few as 10 examples to get started.
Synthetic data generation creates new training examples that mimic the patterns and style of your original data.
Our data augmentation process uses OpenAI to generate high-quality synthetic data based on your original examples. It returns a new dataset that combines both original and synthetic examples and has three columns:
- prompt: The input text used to generate a response
- completion: The generated response or continuation
- source: Origin of the example (original or synthetic)
Our current implementation generates new, synthetic prompts and their corresponding completions based on your original examples.
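As a rough illustration of the three-column layout (not the Predibase implementation), the sketch below tags each row with its origin and concatenates the two sets. The helper name `combine_examples` and the sample rows are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def combine_examples(original: List[Row], synthetic: List[Row]) -> List[Row]:
    """Return one list of rows with prompt, completion, and source columns."""
    combined = []
    for source, rows in (("original", original), ("synthetic", synthetic)):
        for row in rows:
            combined.append(
                {
                    "prompt": row["prompt"],
                    "completion": row["completion"],
                    "source": source,  # records where the example came from
                }
            )
    return combined


# Made-up rows, for illustration only.
original_rows = [{"prompt": "Translate 'hello' to French.", "completion": "bonjour"}]
synthetic_rows = [{"prompt": "Translate 'thank you' to French.", "completion": "merci"}]

augmented = combine_examples(original_rows, synthetic_rows)
```

Keeping the `source` column makes it easy to separate generated rows from the seed rows later, for example when reviewing quality.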
Predibase supports two approaches:
- A straightforward method that creates data directly reflecting the provided examples.
- A more sophisticated chain of LLM calls that uses 6 LLM calls per synthetic example for higher-quality output.
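Since the two approaches differ in how many LLM calls they make, a back-of-the-envelope estimate of the call count can help when choosing between them. In the sketch below, the 6 calls per example come from the description above; the figure of 1 call per example for the simpler method is an assumption, and `estimate_llm_calls` is a hypothetical helper.

```python
def estimate_llm_calls(num_samples: int, calls_per_sample: int) -> int:
    """Estimate total LLM calls for a synthetic data generation run."""
    if num_samples < 0 or calls_per_sample < 1:
        raise ValueError("num_samples must be >= 0 and calls_per_sample >= 1")
    return num_samples * calls_per_sample


CHAINED_CALLS_PER_SAMPLE = 6         # stated for the chained approach
ASSUMED_SIMPLE_CALLS_PER_SAMPLE = 1  # assumption: one call per example

chained_calls = estimate_llm_calls(50, CHAINED_CALLS_PER_SAMPLE)
simple_calls = estimate_llm_calls(50, ASSUMED_SIMPLE_CALLS_PER_SAMPLE)
print(f"chained: {chained_calls} calls, simple: {simple_calls} calls")
```

Actual spend also depends on the model chosen and the token length of each prompt and completion, which this estimate ignores.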
You will incur OpenAI API costs when generating synthetic data. We recommend starting with:
- augmentation_strategy: "single_pass"
- num_samples_to_generate set to a small value (e.g. 50)
- gpt-4o-mini as the model
For details on how to configure the AugmentationConfig, see our AugmentationConfig documentation.
The following OpenAI models are supported for synthetic data generation:
gpt-4o-mini
gpt-4o-2024-08-06
gpt-4-turbo
gpt-4-1106-preview
gpt-4-0125-preview
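A small pre-flight check can catch an unsupported model name before any API cost is incurred. The sketch below validates a plain dictionary of settings against the list above. It does not use the Predibase SDK: `validate_augmentation_settings` is a hypothetical helper, and the `model` key is an assumed name for the model setting.

```python
SUPPORTED_MODELS = {
    "gpt-4o-mini",
    "gpt-4o-2024-08-06",
    "gpt-4-turbo",
    "gpt-4-1106-preview",
    "gpt-4-0125-preview",
}


def validate_augmentation_settings(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    model = settings.get("model")  # "model" is an assumed key name
    if model not in SUPPORTED_MODELS:
        problems.append(f"unsupported model: {model!r}")

    num_samples = settings.get("num_samples_to_generate")
    if not isinstance(num_samples, int) or num_samples < 1:
        problems.append("num_samples_to_generate must be a positive integer")

    return problems


settings = {
    "model": "gpt-4o-mini",
    "num_samples_to_generate": 50,
    "augmentation_strategy": "single_pass",
}
problems = validate_augmentation_settings(settings)
```

The strategy value is passed through unchecked here because only "single_pass" is named in this guide.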
Initialize the Predibase Client
Connect and Reference Seed Dataset
Connect a dataset or obtain a reference to an existing dataset. If you don’t have a dataset uploaded, you can upload it from a file, or load it with Pandas and upload it from a Pandas DataFrame.
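Before uploading a seed file, it can be worth confirming that it has enough usable rows. The sketch below reads a CSV with the standard library and checks it against the 10-example figure mentioned at the top of this page. The `prompt` and `completion` column names for the seed file are an assumption based on the output schema, and `check_seed_csv` is a hypothetical helper, not part of the SDK.

```python
import csv
import io

MIN_SEED_EXAMPLES = 10  # "as few as 10 examples", per the introduction
REQUIRED_COLUMNS = ("prompt", "completion")  # assumed seed column names


def check_seed_csv(handle) -> int:
    """Count non-empty seed rows, raising ValueError if the file looks unusable."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    count = 0
    for row in reader:
        prompt = (row.get("prompt") or "").strip()
        completion = (row.get("completion") or "").strip()
        if prompt and completion:
            count += 1

    if count < MIN_SEED_EXAMPLES:
        raise ValueError(f"only {count} usable rows; expected at least {MIN_SEED_EXAMPLES}")
    return count


# In-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO(
    "prompt,completion\n" + "".join(f"question {i},answer {i}\n" for i in range(12))
)
seed_count = check_seed_csv(sample)
```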
Generate Synthetic Data
Once the synthetic data generation is complete, you will be able to see it under the Datasets tab in the UI.
Optional Operations
Download the dataset:
Load into pandas:
At this point, you can choose to modify or filter the dataset as you see fit. You can also upload it back to Predibase for fine-tuning.
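One possible cleanup pass is sketched below with the standard library rather than pandas, so it runs without extra dependencies. It drops rows with an empty completion and rows whose prompt repeats an earlier one. The filtering rules and the `clean_rows` helper are illustrative assumptions, not a Predibase recommendation.

```python
def clean_rows(rows):
    """Drop rows with empty completions and rows that repeat an earlier prompt."""
    seen_prompts = set()
    cleaned = []
    for row in rows:
        prompt = row["prompt"].strip()
        completion = row["completion"].strip()
        if not completion:
            continue  # nothing to learn from an empty completion
        key = prompt.lower()
        if key in seen_prompts:
            continue  # keep only the first occurrence of each prompt
        seen_prompts.add(key)
        cleaned.append(
            {"prompt": prompt, "completion": completion, "source": row["source"]}
        )
    return cleaned


# Made-up rows in the prompt / completion / source layout.
rows = [
    {"prompt": "Summarize: cats sleep a lot.", "completion": "Cats sleep often.", "source": "original"},
    {"prompt": "summarize: cats sleep a lot. ", "completion": "Cats nap.", "source": "synthetic"},
    {"prompt": "Summarize: dogs bark.", "completion": "  ", "source": "synthetic"},
    {"prompt": "Summarize: birds sing.", "completion": "Birds sing.", "source": "synthetic"},
]
cleaned = clean_rows(rows)
```

Because original rows come first in this sample, a synthetic row that merely restates a seed prompt is the one that gets dropped.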
Upload modified dataset: