Don’t have enough data for fine-tuning? Predibase offers synthetic data generation at no extra cost in both the UI and SDK, requiring as few as 10 examples to get started.

Overview

Synthetic data generation creates new training examples that mimic the patterns and style of your original data. This helps:

  • Improve model accuracy through exposure to more scenarios
  • Reduce overfitting with diverse training examples
  • Balance class distributions by generating examples of sparse classes
  • Address model weaknesses with targeted additional data

How It Works

Our data augmentation process uses OpenAI models to generate high-quality synthetic data based on your original examples. It returns a new dataset that combines the original and synthetic examples and has three columns:

  • prompt: The input text used to generate a response
  • completion: The generated response or continuation
  • source: Origin of the example (original or synthetic)

Our current implementation generates new, synthetic prompts and their corresponding completions based on your original examples.
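As a quick sanity check on an augmented dataset, you can count how many rows come from each origin. The sketch below uses only the standard library and assumes the dataset has been exported as JSON Lines with the three columns described above. It also assumes the source column holds the literal strings "original" and "synthetic". The sample rows are made up for illustration.

```python
import json
from collections import Counter


def count_by_source(jsonl_text):
    """Count rows per `source` value in JSON Lines text."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        counts[row["source"]] += 1
    return dict(counts)


# Made-up rows, only to show the assumed shape of the augmented dataset.
sample = "\n".join([
    json.dumps({"prompt": "Classify: great product", "completion": "positive", "source": "original"}),
    json.dumps({"prompt": "Classify: arrived broken", "completion": "negative", "source": "synthetic"}),
    json.dumps({"prompt": "Classify: works as described", "completion": "positive", "source": "synthetic"}),
])

print(count_by_source(sample))
```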

Generation Strategies

Predibase supports two approaches:

1. Single Pass

A straightforward method that creates data directly reflecting provided examples:

  • Uses 1 LLM call per synthetic example
  • Direct reflection of seed examples
  • Faster and more cost-effective

2. Mixture of Agents

A sophisticated chain of LLM calls for higher-quality examples:

  1. Generate k prompt-completion candidates from each seed
  2. Critique candidates based on inferred context and criteria
  3. Synthesize final examples from candidates and critiques
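The three steps above can be sketched as a small control flow. This is an illustration only, not Predibase's implementation: the function name, the prompt wording, and the llm callable are all hypothetical, and the number of calls it makes depends on k rather than matching any fixed per-example count.

```python
from typing import Callable, List


def mixture_of_agents_example(seed: str, llm: Callable[[str], str], k: int = 3) -> str:
    """Generate k candidates, critique each one, then synthesize a final example.

    `llm` is any function that maps a prompt string to a response string
    (hypothetical; swap in a real client call).
    """
    # Step 1: generate k candidate prompt-completion pairs from the seed.
    candidates: List[str] = [
        llm(f"Write a new prompt-completion pair in the style of:\n{seed}\n(variation {i + 1})")
        for i in range(k)
    ]
    # Step 2: critique each candidate against the seed.
    critiques: List[str] = [
        llm(f"Critique this candidate against the seed.\nSeed:\n{seed}\nCandidate:\n{c}")
        for c in candidates
    ]
    # Step 3: synthesize one final example from the candidates and critiques.
    notes = "\n\n".join(
        f"Candidate:\n{c}\nCritique:\n{q}" for c, q in zip(candidates, critiques)
    )
    return llm(f"Write one final prompt-completion pair using these notes:\n{notes}")


# Stub LLM that records every prompt it receives, so the flow runs offline.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response {len(calls)}"


final = mixture_of_agents_example("Q: 2+2? A: 4", fake_llm, k=2)
print(len(calls))  # 2 candidates + 2 critiques + 1 synthesis = 5
```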

This strategy uses 6 LLM calls per synthetic example to produce higher-quality output.
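Because OpenAI usage scales with the number of LLM calls, it can help to estimate the call count before choosing a strategy. The helper below is a hypothetical sketch (estimate_llm_calls is not part of the Predibase SDK). It only multiplies the per-example call counts stated above by the number of samples, and it says nothing about tokens or price.

```python
# LLM calls per synthetic example, as stated for each strategy above.
CALLS_PER_EXAMPLE = {"single_pass": 1, "mixture_of_agents": 6}


def estimate_llm_calls(num_samples_to_generate, augmentation_strategy):
    """Return the total number of LLM calls for a generation run."""
    if augmentation_strategy not in CALLS_PER_EXAMPLE:
        raise ValueError(f"Unknown strategy: {augmentation_strategy!r}")
    if num_samples_to_generate < 0:
        raise ValueError("num_samples_to_generate must be non-negative")
    return num_samples_to_generate * CALLS_PER_EXAMPLE[augmentation_strategy]


for strategy in CALLS_PER_EXAMPLE:
    print(strategy, estimate_llm_calls(50, strategy))
# single_pass 50
# mixture_of_agents 300
```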

Implementation Guide

Prerequisites

You will incur OpenAI API costs when generating synthetic data. We recommend starting with:

  • augmentation_strategy: "single_pass"
  • Smaller num_samples_to_generate (e.g. 50)
  • More cost-effective OpenAI model (e.g. gpt-4o-mini)

For details on configuring AugmentationConfig, see the AugmentationConfig documentation.

Supported Generation Models

The following OpenAI models are supported for synthetic data generation:

  • gpt-4o-mini
  • gpt-4o-2024-08-06
  • gpt-4-turbo
  • gpt-4-1106-preview
  • gpt-4-0125-preview

Quick Start

1. Initialize the Predibase Client

import os
import pandas as pd
from predibase import Predibase, AugmentationConfig

pb = Predibase(api_token="<YOUR_PREDIBASE_API_KEY>")
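If you prefer not to hardcode keys in scripts, one option is to read them from environment variables. The helper below is a hypothetical sketch using only the standard library. The variable name PREDIBASE_API_TOKEN is an assumption chosen for this example, not a name the SDK is documented here to read.

```python
import os


def read_api_token(env_var="PREDIBASE_API_TOKEN"):
    """Return the token stored in `env_var`, or raise if it is missing or empty."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token


# Usage (requires the predibase package, so it is left as a comment here):
# pb = Predibase(api_token=read_api_token())
```

The same helper can be pointed at another variable, for example read_api_token("OPENAI_API_KEY"), when supplying the OpenAI key later.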

2. Connect and Reference Seed Dataset

Connect a dataset or obtain a reference to an existing dataset. If you have not uploaded a dataset yet, you can upload it from a file, or load it with Pandas and upload it from a Pandas DataFrame.

seed_dataset = pb.datasets.get("file_uploads/seed_dataset_20_rows")

3. Generate Synthetic Data

dataset = pb.datasets.augment(
    AugmentationConfig(
        # Model options: gpt-4-turbo, gpt-4-1106-preview, gpt-4-0125-preview,
        # gpt-4o, gpt-4o-2024-08-06, gpt-4o-mini
        base_model="gpt-4o-mini",
        # Strategy: 'mixture_of_agents' or 'single_pass'
        # augmentation_strategy="mixture_of_agents",
        # Recommended values: 25, 50, 100, 200, 500, 1000
        num_samples_to_generate=50,
        # Optional: number of seed samples to use
        # num_seed_samples="all",
        # Optional: task context (inferred if not provided)
        # task_context="",
    ),
    dataset=seed_dataset,
    # Optional: custom name
    # name="my_synthetic_data",
    openai_api_key="<YOUR_OPENAI_API_KEY>",
)

Once synthetic data generation is complete, the augmented dataset appears under the Datasets tab in the UI.

4. Optional Operations

Download the dataset:

pb.datasets.download(
    dataset_ref=dataset, # can be a string or a Dataset object
    dest="augmented_seed_dataset_20_rows.jsonl"
)

Load into pandas:

augmented_df = pd.read_json(
    "augmented_seed_dataset_20_rows.jsonl",
    lines=True
)

At this point, you can choose to modify or filter the dataset as you see fit. You can also upload it back to Predibase for fine-tuning.
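As one example of such filtering, the sketch below drops rows with an empty prompt or completion and removes rows whose prompt repeats an earlier one, keeping original rows in preference to synthetic ones. It works on plain dictionaries so it runs without extra packages. The clean_rows helper, the sample rows, and the literal "original" value in the source column are assumptions made for illustration.

```python
def clean_rows(rows):
    """Drop empty rows and duplicate prompts, preferring original examples."""
    # Stable sort: rows marked "original" come first, so they win over duplicates.
    ordered = sorted(rows, key=lambda r: r.get("source") != "original")
    seen_prompts = set()
    kept = []
    for row in ordered:
        prompt = (row.get("prompt") or "").strip()
        completion = (row.get("completion") or "").strip()
        if not prompt or not completion:
            continue  # skip incomplete examples
        key = prompt.lower()
        if key in seen_prompts:
            continue  # skip prompts that repeat an earlier one (case-insensitive)
        seen_prompts.add(key)
        kept.append(row)
    return kept


# Made-up rows, only to exercise the filter.
rows = [
    {"prompt": "classify: great product", "completion": "positive", "source": "synthetic"},
    {"prompt": "Classify: great product", "completion": "positive", "source": "original"},
    {"prompt": "Classify: arrived broken", "completion": "", "source": "synthetic"},
    {"prompt": "Classify: works as described", "completion": "positive", "source": "synthetic"},
]

cleaned = clean_rows(rows)
print(len(cleaned))  # 2
```

With pandas, augmented_df.to_dict("records") produces a list in this shape, and pd.DataFrame(cleaned) turns the result back into a data frame for upload.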

Upload modified dataset:

modified_dataset = pb.datasets.from_pandas_dataframe(
    augmented_df,
    "augmented_seed_dataset_20_rows"
)

Next Steps