
pb.datasets.augment (beta)

Create a new dataset of synthetic examples, generated with OpenAI from an existing seed dataset.

OpenAI costs

Note that you will incur OpenAI API costs when calling this augment function. To keep costs down, we recommend starting with augmentation_strategy: "single_pass", a small num_samples_to_generate, and a cheaper OpenAI model.
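As a rough illustration of that recommendation, a conservative starting configuration might look like the following sketch (the parameter names mirror the full example below; the specific values are only illustrative starting points):

from predibase import AugmentationConfig

# A conservative starting point per the cost note above: single-pass strategy,
# a small sample count, and a cheaper OpenAI model.
config = AugmentationConfig(
    base_model="gpt-4o-mini",
    augmentation_strategy="single_pass",
    num_samples_to_generate=25,
)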

Parameters:

   config: AugmentationConfig
Specifies the base model to use and the number of examples to generate

   dataset: str
Seed dataset to use for augmentation

   name: str
The name of the generated dataset

   openai_api_key: str
Your OpenAI API key

Returns:

   Dataset

Example:

import pandas as pd

from predibase import Predibase, AugmentationConfig

# Initialize the Predibase client
pb = Predibase(api_token="<YOUR_PREDIBASE_API_KEY>")

# Step 1: Grab a reference to a dataset already uploaded to Predibase.
# If you don't have a dataset uploaded, load it through pandas and upload it to
# Predibase via pb.datasets.from_pandas_dataframe() or pb.datasets.from_file().
seed_dataset = pb.datasets.get("file_uploads/seed_dataset_20_rows")

# Step 2: Call the augmentation function.
# Once it finishes, the augmented dataset is uploaded to Predibase (see the Data tab).
dataset = pb.datasets.augment(
    AugmentationConfig(
        # Supported base models: gpt-4-turbo, gpt-4-1106-preview, gpt-4-0125-preview,
        # gpt-4o, gpt-4o-2024-08-06, gpt-4o-mini
        base_model="gpt-4o-mini",
        # Augmentation strategies currently supported: 'mixture_of_agents', 'single_pass'
        # augmentation_strategy="mixture_of_agents",
        # Number of samples to generate; recommended values: 25, 50, 100, 200, 500, 1000
        num_samples_to_generate=50,
        # num_seed_samples="all",  # Optional - number of examples from the dataset to use as seed samples
        # task_context="",  # Optional - context for the task to help the model; inferred from the data if not provided
    ),
    dataset=seed_dataset,
    # name="my_synthetic_data",  # Optional - a name is generated if not provided
    openai_api_key="<YOUR_OPENAI_API_KEY>",
)

# Step 3 (optional): Download the augmented dataset to a local file to inspect/modify
pb.datasets.download(dataset_ref=dataset, dest="augmented_seed_dataset_20_rows.jsonl")

# Step 4 (optional): Load the augmented dataset into a pandas DataFrame if you want to make changes
augmented_df = pd.read_json("augmented_seed_dataset_20_rows.jsonl", lines=True)

# Step 5 (optional): Re-upload the modified dataset to Predibase for training
modified_dataset = pb.datasets.from_pandas_dataframe(augmented_df, "augmented_seed_dataset_20_rows")
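
Before re-uploading, it can be worth sanity-checking the generated rows with ordinary pandas operations. A minimal sketch, assuming the augmented_df DataFrame from step 4 above (which columns you inspect will depend on your seed dataset's schema):

# Quick sanity checks on the augmented data before re-uploading
print(augmented_df.shape)             # number of generated rows and columns
print(augmented_df.columns.tolist())  # column names of the augmented dataset
print(augmented_df.head())            # spot-check the first few generated examples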