Don’t have enough data for fine-tuning? Predibase offers synthetic data generation at no extra cost in both the UI and SDK, requiring as few as 10 examples to get started.
Overview
Synthetic data generation creates new training examples that mimic the patterns and style of your original data. This helps:
- Improve model accuracy through exposure to more scenarios
- Reduce overfitting with diverse training examples
- Balance class distributions by generating examples of sparse classes
- Address model weaknesses with targeted additional data
How It Works
Our data augmentation process uses OpenAI models to generate high-quality synthetic data based on your original examples. The result is a new dataset that combines the original and synthetic examples in three columns:
- prompt: The input text used to generate a response
- completion: The generated response or continuation
- source: Origin of the example (original or synthetic)
The process generates both new prompts and their corresponding completions based on your original examples.
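To make the output shape concrete, here is a minimal sketch of how original and synthetic rows could be combined into the three-column layout described above. The helper name combine_examples and the sample rows are hypothetical; this illustrates the schema only and is not Predibase's internal implementation.

```python
from typing import Dict, List


def combine_examples(
    original: List[Dict[str, str]],
    synthetic: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Merge original and synthetic rows, tagging each row with its origin."""
    combined = []
    for source, rows in (("original", original), ("synthetic", synthetic)):
        for row in rows:
            combined.append(
                {
                    "prompt": row["prompt"],
                    "completion": row["completion"],
                    "source": source,
                }
            )
    return combined


# Hypothetical sample rows, for illustration only.
original_rows = [{"prompt": "Translate 'hello' to French.", "completion": "bonjour"}]
synthetic_rows = [{"prompt": "Translate 'goodbye' to French.", "completion": "au revoir"}]

augmented = combine_examples(original_rows, synthetic_rows)
for row in augmented:
    print(row["source"], "|", row["prompt"], "->", row["completion"])
```

Keeping the source column makes it easy to filter the synthetic rows back out later, for example when comparing a model trained on the seed data alone against one trained on the augmented set.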
Generation Strategies
Predibase supports two approaches:
1. Single Pass
A straightforward method that creates data directly reflecting provided examples:
- Uses 1 LLM call per synthetic example
- Direct reflection of seed examples
- Faster and more cost-effective
2. Mixture of Agents
A sophisticated chain of LLM calls for higher-quality examples:
- Generate k prompt-completion candidates from each seed
- Critique candidates based on inferred context and criteria
- Synthesize final examples from candidates and critiques
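The difference in control flow between the two strategies can be sketched roughly as follows. This is an illustration under assumptions, not Predibase's implementation: the function names, the prompt wording, and the choice of one critique call per candidate are all hypothetical, and the model is replaced by a stub that only counts calls.

```python
from typing import Callable, List

# Hypothetical model interface: takes an instruction, returns generated text.
LLM = Callable[[str], str]


def single_pass(seed: str, llm: LLM) -> str:
    """One LLM call per synthetic example."""
    return llm(f"Write a new example in the style of: {seed}")


def mixture_of_agents(seed: str, llm: LLM, k: int = 3) -> str:
    """Generate k candidates, critique them, then synthesize a final example."""
    candidates: List[str] = [
        llm(f"Candidate {i + 1}: write a new example in the style of: {seed}")
        for i in range(k)
    ]
    # Assumption: one critique call per candidate.
    critiques = [llm(f"Critique this candidate: {c}") for c in candidates]
    joined = "\n".join(f"{c}\n{q}" for c, q in zip(candidates, critiques))
    return llm(f"Synthesize one final example from:\n{joined}")


class CountingStub:
    """Stand-in for a real model; counts calls and returns a placeholder reply."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, instruction: str) -> str:
        self.calls += 1
        return f"reply-{self.calls}"


stub = CountingStub()
single_pass("What is 2 + 2? -> 4", stub)
single_pass_calls = stub.calls

stub = CountingStub()
mixture_of_agents("What is 2 + 2? -> 4", stub, k=3)
moa_calls = stub.calls

print(single_pass_calls, moa_calls)
```

Under these assumptions, with k = 3 the chained strategy makes 7 calls per synthetic example against 1 for single pass, which is why single pass is the cheaper starting point. The actual number of calls Predibase makes may differ.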
Implementation Guide
Prerequisites
- Seed dataset (minimum 10 rows) formatted for supervised fine-tuning in prompt-completion format.
- OpenAI API Key
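Before uploading, it can help to check a seed dataset against these prerequisites locally. The following is a minimal sketch; validate_seed_rows is a hypothetical helper, not part of the Predibase SDK, and it checks only the row count and the presence of non-empty prompt and completion fields.

```python
from typing import Dict, List

MIN_SEED_ROWS = 10  # minimum stated in the prerequisites
REQUIRED_KEYS = ("prompt", "completion")


def validate_seed_rows(rows: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the rows look usable."""
    problems = []
    if len(rows) < MIN_SEED_ROWS:
        problems.append(f"need at least {MIN_SEED_ROWS} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        for key in REQUIRED_KEYS:
            value = row.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {i}: missing or empty '{key}'")
    return problems


# Hypothetical rows, for illustration only.
good = [{"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(10)]
bad = good[:3] + [{"prompt": "No answer here", "completion": ""}]

print(validate_seed_rows(good))
print(validate_seed_rows(bad))
```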
You will incur OpenAI API costs when generating synthetic data. We recommend starting with:
- augmentation_strategy: "single_pass"
- Smaller num_samples_to_generate (e.g. 50)
- More cost-effective OpenAI model (e.g. gpt-4o-mini)
For the full set of options, see the AugmentationConfig documentation.
Supported Generation Models
The following OpenAI models are supported for synthetic data generation:
- gpt-4o-mini
- gpt-4o-2024-08-06
- gpt-4-turbo
- gpt-4-1106-preview
- gpt-4-0125-preview
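As a rough illustration, the recommended starting values and the supported model list can be checked with a few lines of plain Python before any API costs are incurred. The settings are held in an ordinary dict here rather than an AugmentationConfig object. The key name base_model and the strategy value "mixture_of_agents" are assumptions not confirmed on this page, and check_settings is a hypothetical helper.

```python
SUPPORTED_MODELS = {
    "gpt-4o-mini",
    "gpt-4o-2024-08-06",
    "gpt-4-turbo",
    "gpt-4-1106-preview",
    "gpt-4-0125-preview",
}
# "single_pass" appears on this page; "mixture_of_agents" is an assumed spelling.
STRATEGIES = {"single_pass", "mixture_of_agents"}


def check_settings(settings: dict) -> list:
    """Return a list of problems with the proposed settings (empty if none)."""
    problems = []
    if settings.get("base_model") not in SUPPORTED_MODELS:
        problems.append(f"unsupported model: {settings.get('base_model')!r}")
    if settings.get("augmentation_strategy") not in STRATEGIES:
        problems.append(
            f"unknown strategy: {settings.get('augmentation_strategy')!r}"
        )
    n = settings.get("num_samples_to_generate")
    if not isinstance(n, int) or n <= 0:
        problems.append("num_samples_to_generate must be a positive integer")
    return problems


# Recommended low-cost starting point from the prerequisites.
starter = {
    "base_model": "gpt-4o-mini",  # assumed key name
    "augmentation_strategy": "single_pass",
    "num_samples_to_generate": 50,
}

print(check_settings(starter))
```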
Quick Start
Connect and Reference Seed Dataset
Connect a dataset or obtain a reference to an existing dataset. If you don’t
have a dataset uploaded, you can upload it from a file, or load it with
Pandas and upload it from a Pandas DataFrame.
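If you are starting from scratch, a seed file in prompt-completion format can be prepared with the standard library alone. The sketch below writes ten hypothetical rows to a CSV file and reads them back to confirm the layout. It assumes CSV is an accepted upload format, and it does not show the upload step itself, which is done through the Predibase SDK or UI.

```python
import csv
import os
import tempfile

# Hypothetical seed rows, for illustration only.
rows = [
    {"prompt": f"Summarize ticket {i}.", "completion": f"Summary of ticket {i}."}
    for i in range(10)
]

# Write the rows with a header naming the two required columns.
path = os.path.join(tempfile.mkdtemp(), "seed_dataset.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "completion"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the column names and row count.
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(len(loaded), list(loaded[0].keys()))
```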
Generate Synthetic Data
Datasets tab in the UI.
Next Steps
- Get detailed guidance on creating synthetic data
- Prepare your dataset for fine-tuning
- Start fine-tuning with your augmented dataset
- Evaluate results to measure effectiveness