Synthetic Data (Beta)
Don't have a ton of data to get started with fine-tuning? No problem! Predibase offers a method for generating a synthetic dataset in the UI and SDK (`pb.datasets.augment`) using as few as 10 rows of seed data.
What is Synthetic Data Generation?
Synthetic data generation involves creating artificial data points that statistically resemble your real data, which can help with:
- Improving model accuracy by exposing the model to a wider range of scenarios
- Reducing overfitting by providing diverse training examples
- Balancing class distributions by creating examples of sparse classes
- Generating additional data that targets known weaknesses in your model
How it works
Our data augmentation process uses your original examples to generate high-quality synthetic data with OpenAI.
The resulting augmented dataset combines the original and synthetic examples into a single dataset with three columns:
- `prompt`: The input or starting text that was used to generate a response.
- `completion`: The generated response or continuation of the prompt.
- `source`: A label indicating the origin of each row. It has two possible values:
  - `original`: The row is one of the original seed examples.
  - `synthetic`: The row was created during the augmentation process.
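For example, rows in the augmented dataset might look like this (illustrative values):

```json
{"prompt": "Summarize this review: The battery easily lasts all day.", "completion": "A positive review praising battery life.", "source": "original"}
{"prompt": "Summarize this review: The screen cracked within a week.", "completion": "A negative review about build quality.", "source": "synthetic"}
```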
Predibase currently supports two strategies for synthetic data generation: `mixture_of_agents` and `single_pass`.
Single Pass
This method is straightforward and focuses on creating data that directly reflects the provided example. It uses a single prompt that takes in a seed example and generates a new synthetic prompt and completion pair.
This approach uses 1 LLM call to generate one new synthetic example.
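As a rough illustration of the idea (not Predibase's internal implementation), a single-pass step could look like the sketch below, assuming the `openai` Python client; the prompt wording is hypothetical:

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def single_pass_generate(seed_prompt: str, seed_completion: str) -> dict:
    """One LLM call that turns one seed example into one new synthetic pair."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Here is an example prompt/completion pair:\n"
                f"PROMPT: {seed_prompt}\nCOMPLETION: {seed_completion}\n"
                "Write one new, different pair in the same style and domain. "
                'Respond as JSON: {"prompt": "...", "completion": "..."}'
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)
```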
Mixture Of Agents
This method involves using a chain of LLM calls to generate a single high-quality synthetic example that closely resembles the seed example distribution.
The algorithm follows a three-step process:
- Generate k prompt and completion candidates from a single seed example
- Critique the k generated examples based on a rich set of criteria, inferred context and dataset characteristics
- Synthesize the candidate examples and critiques to generate a final, high quality example
This process is repeated for each seed example in the seed dataset. Each of these steps may involve one or more LLM calls and uses chain-of-thought prompting under the hood to improve reasoning at each step.
This approach uses 6 LLM calls to generate one new synthetic example.
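In pseudocode, the loop looks roughly like this; the three callables are hypothetical stand-ins for Predibase's internal prompts, and how the stated 6 LLM calls are split across the steps is not documented:

```python
from typing import Callable


def mixture_of_agents(
    seed: dict,
    generate: Callable[[dict], dict],         # step 1: propose a candidate pair
    critique: Callable[[dict, dict], str],    # step 2: critique a candidate vs. the seed
    synthesize: Callable[[list, list], dict], # step 3: merge candidates + critiques
    k: int = 3,
) -> dict:
    candidates = [generate(seed) for _ in range(k)]
    critiques = [critique(c, seed) for c in candidates]
    return synthesize(candidates, critiques)

# Applied once per seed row: a 20-row seed dataset yields 20 synthetic examples per pass.
```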
Example
Prerequisites
- Seed dataset (at least 10 rows, formatted for instruction fine-tuning)
- OpenAI API Key (used to generate the synthetic examples)
Note that you will incur OpenAI API costs when calling this augment function. We recommend starting with `augmentation_strategy: "single_pass"`, a smaller `num_samples_to_generate`, and a cheaper OpenAI model.
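As a quick back-of-the-envelope check on call volume (token costs vary, so this only counts LLM calls using the per-example figures above):

```python
# single_pass: 1 LLM call per synthetic example; mixture_of_agents: 6.
num_samples_to_generate = 50
print(num_samples_to_generate * 1)  # single_pass -> 50 LLM calls
print(num_samples_to_generate * 6)  # mixture_of_agents -> 300 LLM calls
```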
Setup: Initialize your Predibase client
```python
import os

import pandas as pd

from predibase import Predibase, AugmentationConfig

# You can also read the token from an environment variable,
# e.g. os.environ["PREDIBASE_API_TOKEN"]
pb = Predibase(api_token="<YOUR_PREDIBASE_API_KEY>")
```
Connect and reference seed dataset
Connect a dataset or obtain a reference to an existing dataset. If you don't have a dataset uploaded yet, you can upload one from a file or load it into a Pandas DataFrame and upload it from there.
```python
seed_dataset = pb.datasets.get("file_uploads/seed_dataset_20_rows")
```
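For example, if your seed data is in a local JSONL file, one way to upload it is via a DataFrame (the file name here is just a placeholder):

```python
seed_df = pd.read_json("seed_dataset_20_rows.jsonl", lines=True)
seed_dataset = pb.datasets.from_pandas_dataframe(seed_df, "seed_dataset_20_rows")
```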
Call the augmentation function
Once synthetic example generation completes, the augmented dataset will be uploaded to Predibase (see the Data tab).
```python
dataset = pb.datasets.augment(
    AugmentationConfig(
        # Supported OpenAI models: gpt-4-turbo, gpt-4-1106-preview,
        # gpt-4-0125-preview, gpt-4o, gpt-4o-2024-08-06, gpt-4o-mini
        base_model="gpt-4o-mini",
        # Augmentation strategies currently supported: 'mixture_of_agents', 'single_pass'
        # augmentation_strategy="mixture_of_agents",
        # Number of samples to generate. Recommended values: 25, 50, 100, 200, 500, 1000
        num_samples_to_generate=50,
        # num_seed_samples="all",  # Optional - number of examples from the dataset to use as seeds
        # task_context="",  # Optional - context for the task to help the model; inferred from the data if not provided
    ),
    dataset=seed_dataset,
    # name="my_synthetic_data",  # Optional - a name is generated if not provided
    openai_api_key="<YOUR_OPENAI_API_KEY>",
)
```
(Optional) Download the augmented dataset to a local file to inspect/modify
```python
pb.datasets.download(dataset_ref=dataset, dest="augmented_seed_dataset_20_rows.jsonl")
```
(Optional) Load the augmented dataset into a Pandas DataFrame to inspect and make modifications
```python
augmented_df = pd.read_json("augmented_seed_dataset_20_rows.jsonl", lines=True)
```
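The `source` column makes it easy to isolate just the new rows for review, for example:

```python
synthetic_df = augmented_df[augmented_df["source"] == "synthetic"]
print(f"{len(synthetic_df)} of {len(augmented_df)} rows are synthetic")
```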
(Optional) Reupload the modified dataset to Predibase for training
```python
modified_dataset = pb.datasets.from_pandas_dataframe(augmented_df, "augmented_seed_dataset_20_rows")
```
How to create a good seed dataset
Creating a high-quality seed dataset is crucial for generating effective synthetic data. A well-prepared seed dataset serves as the foundation for synthetic data generation, ensuring that the resulting data is representative, diverse, and useful for your intended applications. Here are key steps and considerations for creating a good seed dataset:
- Select Representative Samples:
  - Initial Selection: Start by hand-picking from an existing dataset or carefully crafting these examples.
  - Diversity: Capture the full range of variations in the target data. You want at least one example of each variation.
  - Balance: Avoid over-representation of any class or feature.
- Ensure Data Quality:
  - Accuracy: Verify data correctness and remove errors.
  - Consistency: Maintain uniform data formatting and structure across all examples.
- Individual Example Preparation:
  - Overview: Write 1-2 lines at the top of each seed example that give a high-level overview of the task.
  - Zero Shot: Don't provide demonstrations of how to do the task as part of any individual example. Each example should be zero-shot (just the prompt and completion).
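Putting these together, a single seed row might look like the following (illustrative example). Note the one-line task overview at the top of the prompt and the absence of in-prompt demonstrations:

```json
{"prompt": "Classify the sentiment of a customer support email as positive, negative, or neutral.\n\nEmail: My order arrived two weeks late and the box was damaged.", "completion": "negative"}
```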
Iteration and validation of synthetic data
To ensure the highest quality synthetic data, it is crucial to adopt an iterative approach that involves continuous refinement and validation of your seed dataset and generation process.
- Feedback Loop:
  - Initially, generate a few synthetic data samples using different base models to identify which model shows the most promise (see the sketch at the end of this section).
  - Based on the initial outputs, choose the best-performing model and gradually increase the sample sizes.
  - Continuously review the generated samples, correcting mistakes or removing poor examples to improve the seed dataset.
  - Use the refined seed dataset to generate new synthetic samples, iterating this process to enhance quality progressively.
- Validation:
  - Compare the synthetic data with real-world data to ensure accuracy and relevance.
  - Assess the performance of the synthetic data against your objectives, making necessary adjustments to the seed dataset and the generation process.
  - Use metrics and benchmarks pertinent to your specific use case to validate the effectiveness of the synthetic data.
By following this iterative approach, you can refine your seed dataset and synthetic data generation process, leading to higher quality and more useful data outcomes.
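As a minimal sketch of the model-comparison step in the feedback loop (the loop structure and names are illustrative; the SDK calls are the same ones used earlier):

```python
for model in ["gpt-4o-mini", "gpt-4o"]:
    trial = pb.datasets.augment(
        AugmentationConfig(
            base_model=model,
            num_samples_to_generate=25,  # keep batches small while iterating
        ),
        dataset=seed_dataset,
        name=f"trial_{model.replace('-', '_')}",
        openai_api_key="<YOUR_OPENAI_API_KEY>",
    )
    # Download each trial for side-by-side review before scaling up
    pb.datasets.download(dataset_ref=trial, dest=f"{model}_trial.jsonl")
```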