Synthetic Data Generation

Don’t have enough data for fine-tuning? Predibase offers synthetic data generation at no extra cost in both the UI and SDK, requiring as few as 10 examples to get started.

Overview

Synthetic data generation creates new training examples that mimic the patterns and style of your original data. This helps:

Improve model accuracy through exposure to more scenarios
Reduce overfitting with diverse training examples
Balance class distributions by generating examples of sparse classes
Address model weaknesses with targeted additional data

How It Works

Our data augmentation process uses OpenAI models to generate high-quality synthetic data based on your original examples. It returns a new, augmented dataset that combines the original and synthetic examples and has three columns:

prompt: The input text used to generate a response
completion: The generated response or continuation
source: Origin of the example (original or synthetic)
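
As a rough illustration of how the source column can be used, the sketch below counts and groups rows by origin. The sample rows are invented for illustration, and the literal values "original" and "synthetic" are assumed from the column description above. It uses only the Python standard library and does not call Predibase.

```python
import json
from collections import Counter

# Hypothetical rows in the three-column format described above.
# The contents are invented for illustration only.
SAMPLE_JSONL = """\
{"prompt": "Translate 'hello' to French.", "completion": "bonjour", "source": "original"}
{"prompt": "Translate 'cat' to French.", "completion": "chat", "source": "synthetic"}
{"prompt": "Translate 'dog' to French.", "completion": "chien", "source": "synthetic"}
"""


def load_rows(jsonl_text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def split_by_source(rows):
    """Group rows by the value of their 'source' column."""
    groups = {}
    for row in rows:
        groups.setdefault(row["source"], []).append(row)
    return groups


rows = load_rows(SAMPLE_JSONL)
counts = Counter(row["source"] for row in rows)
groups = split_by_source(rows)
print(dict(counts))
```

Grouping by source this way makes it straightforward to review the synthetic rows separately before deciding which ones to keep.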

Our current implementation generates new, synthetic prompts and their corresponding completions based on your original examples.

Generation Strategies

Predibase supports two approaches:

1. Single Pass

A straightforward method that creates data directly reflecting provided examples:

Uses 1 LLM call per synthetic example
Direct reflection of seed examples
Faster and more cost-effective

2. Mixture of Agents

A sophisticated chain of LLM calls for higher-quality examples:

Generate k prompt-completion candidates from each seed
Critique candidates based on inferred context and criteria
Synthesize final examples from candidates and critiques

Uses 6 LLM calls per synthetic example for higher quality output.
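
To get a feel for how the two strategies differ in API usage, the sketch below multiplies the number of requested samples by the per-example call counts stated above (1 for Single Pass, 6 for Mixture of Agents). The helper name estimate_llm_calls is hypothetical and not part of the Predibase SDK, and the result is only a count of calls, since actual cost also depends on the model chosen and the tokens used per call.

```python
# LLM calls per synthetic example, taken from the strategy descriptions above.
CALLS_PER_EXAMPLE = {
    "single_pass": 1,
    "mixture_of_agents": 6,
}


def estimate_llm_calls(num_samples_to_generate, augmentation_strategy):
    """Estimate the total number of LLM calls for a generation run.

    Hypothetical helper for illustration; not part of the Predibase SDK.
    """
    if augmentation_strategy not in CALLS_PER_EXAMPLE:
        raise ValueError(f"unknown strategy: {augmentation_strategy!r}")
    if num_samples_to_generate < 0:
        raise ValueError("num_samples_to_generate must be non-negative")
    return num_samples_to_generate * CALLS_PER_EXAMPLE[augmentation_strategy]


for strategy in CALLS_PER_EXAMPLE:
    print(strategy, estimate_llm_calls(50, strategy))
```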

Implementation Guide

Prerequisites

Seed dataset (minimum 10 rows) formatted for supervised fine-tuning in prompt-completion format.
OpenAI API Key
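
Before uploading, it can help to check a seed dataset locally against the prerequisites above. The following is a minimal sketch, assuming the rows are available as a list of Python dicts. The helper validate_seed_rows is hypothetical, and the rule that both fields must be non-empty strings is an assumption for illustration, not Predibase's own validation logic.

```python
MIN_SEED_ROWS = 10  # minimum seed dataset size stated above
REQUIRED_COLUMNS = ("prompt", "completion")


def validate_seed_rows(rows):
    """Return a list of problems found; an empty list means the rows look usable.

    Hypothetical local pre-check; not part of the Predibase SDK.
    """
    problems = []
    if len(rows) < MIN_SEED_ROWS:
        problems.append(f"need at least {MIN_SEED_ROWS} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        for col in REQUIRED_COLUMNS:
            value = row.get(col)
            # Assumption: each field should be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {i}: missing or empty '{col}'")
    return problems


# Invented example rows, for illustration only.
seed_rows = [{"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(10)]
print(validate_seed_rows(seed_rows))
```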

You will incur OpenAI API costs when generating synthetic data. We recommend starting with:

augmentation_strategy: "single_pass"
Smaller num_samples_to_generate (e.g. 50)
More cost-effective OpenAI model (e.g. gpt-4o-mini)

For details on configuring AugmentationConfig, see the AugmentationConfig documentation.

Supported Generation Models

The following OpenAI models are supported for synthetic data generation:

gpt-4o-mini
gpt-4o-2024-08-06
gpt-4-turbo
gpt-4-1106-preview
gpt-4-0125-preview

Quick Start

Initialize the Predibase Client

import os
import pandas as pd
from predibase import Predibase, AugmentationConfig

pb = Predibase(api_token="<YOUR_PREDIBASE_API_KEY>")

Connect and Reference Seed Dataset

Connect a dataset or obtain a reference to an existing one. If you haven't uploaded a dataset yet, you can upload it from a file, or load it with Pandas and upload it from a Pandas DataFrame.

seed_dataset = pb.datasets.get("file_uploads/seed_dataset_20_rows")

Generate Synthetic Data

dataset = pb.datasets.augment(
    AugmentationConfig(
        # Model options: gpt-4-turbo, gpt-4-1106-preview, gpt-4-0125-preview,
        # gpt-4o, gpt-4o-2024-08-06, gpt-4o-mini
        base_model="gpt-4o-mini",
        # Strategy: 'mixture_of_agents' or 'single_pass'
        # augmentation_strategy="mixture_of_agents",
        # Recommended values: 25, 50, 100, 200, 500, 1000
        num_samples_to_generate=50,
        # Optional: number of seed samples to use
        # num_seed_samples="all",
        # Optional: task context (inferred if not provided)
        # task_context="",
    ),
    dataset=seed_dataset,
    # Optional: custom name
    # name="my_synthetic_data",
    openai_api_key="<YOUR_OPENAI_API_KEY>",
)

Once synthetic data generation is complete, the augmented dataset appears under the Datasets tab in the UI.

Optional Operations

Download the dataset:

pb.datasets.download(
    dataset_ref=dataset, # can be a string or a Dataset object
    dest="augmented_seed_dataset_20_rows.jsonl"
)

Load into pandas:

augmented_df = pd.read_json(
    "augmented_seed_dataset_20_rows.jsonl",
    lines=True
)

At this point, you can modify or filter the dataset as you see fit. You can also upload it back to Predibase for fine-tuning.

Upload modified dataset:

modified_dataset = pb.datasets.from_pandas_dataframe(
    augmented_df,
    "augmented_seed_dataset_20_rows"
)

Next Steps

Get detailed guidance on creating synthetic data
Prepare your dataset for fine-tuning
Start fine-tuning with your augmented dataset
Evaluate results to measure effectiveness
