pb.datasets.augment (beta)
Create a new dataset of synthetic examples, generated with an OpenAI model from an existing seed dataset.
OpenAI costs
Note that you will incur OpenAI API costs when calling this augment function. We recommend starting with augmentation_strategy="single_pass", a smaller num_samples_to_generate, and a cheaper OpenAI model.
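As a minimal sketch of the low-cost starting point described in the note, the settings can be collected as keyword arguments before building the config. The field names are the ones used in the example on this page. The value 25 is the smallest of the recommended sample counts, and treating gpt-4o-mini as the cheaper model is an assumption to check against current OpenAI pricing.

```python
# Keyword arguments for a low-cost first run (a sketch, not an official preset).
low_cost_settings = {
    "base_model": "gpt-4o-mini",             # assumed to be the cheapest listed model
    "augmentation_strategy": "single_pass",  # the recommended starting strategy
    "num_samples_to_generate": 25,           # smallest of the recommended values
}

# With the predibase package installed, these would be passed as:
#   from predibase import AugmentationConfig
#   config = AugmentationConfig(**low_cost_settings)
```

Once the output quality looks acceptable, the sample count or strategy can be raised one setting at a time.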
Parameters:
config: AugmentationConfig
Specifies the OpenAI base model, the augmentation strategy, and the number of examples to generate
dataset: str
Seed dataset to use for augmentation
name: str
The name of the generated dataset
openai_api_key: str
Your OpenAI API key
Returns:
Dataset
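Because openai_api_key is passed as a plain string, it may be preferable to read it from the environment instead of hardcoding it in a script. The helper below is a hypothetical sketch and not part of the Predibase SDK. The environment variable names in the usage comment are assumptions as well.

```python
import os


def read_api_key(var_name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return value


# Usage (variable names are hypothetical):
#   pb = Predibase(api_token=read_api_key("PREDIBASE_API_TOKEN"))
#   dataset = pb.datasets.augment(
#       config,
#       dataset=seed_dataset,
#       openai_api_key=read_api_key("OPENAI_API_KEY"),
#   )
```

Failing early with a clear error avoids starting an augmentation run with an empty key.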
Example:
import os
import pandas as pd
from predibase import Predibase, AugmentationConfig
# Initialize Predibase client
pb = Predibase(api_token="<YOUR_PREDIBASE_API_KEY>")
# Step 1: Grab a reference to a dataset already uploaded to Predibase
# If you don't have a dataset uploaded, load it through Pandas and upload it to
# Predibase via pb.datasets.from_pandas_dataframe() or pb.datasets.from_file()
seed_dataset = pb.datasets.get("file_uploads/seed_dataset_20_rows")
# Step 2: Call the augmentation function
# Once it finishes, it will upload the augmented dataset to Predibase (see Data tab)
dataset = pb.datasets.augment(
    AugmentationConfig(
        # Options: gpt-4-turbo, gpt-4-1106-preview, gpt-4-0125-preview, gpt-4o, gpt-4o-2024-08-06, gpt-4o-mini
        base_model="gpt-4o-mini",
        # Augmentation strategies currently supported: 'mixture_of_agents', 'single_pass'
        # augmentation_strategy="mixture_of_agents",
        # Number of samples to generate. Recommended values: 25, 50, 100, 200, 500, 1000
        num_samples_to_generate=50,
        # num_seed_samples="all",  # Optional: number of examples from the dataset to use as seed samples
        # task_context="",  # Optional: context for the task to help the model; inferred from the data if not provided
    ),
    dataset=seed_dataset,
    # Optional: if not provided, a name will be generated
    # name="my_synthetic_data",
    openai_api_key="<YOUR_OPENAI_API_KEY>",
)
# Step 3 [Optional]: Download the augmented dataset to a local file to inspect or modify it
pb.datasets.download(dataset_ref=dataset, dest="augmented_seed_dataset_20_rows.jsonl")
# Step 4 [Optional]: Load the augmented dataset into a Pandas DataFrame if you want to make changes
augmented_df = pd.read_json("augmented_seed_dataset_20_rows.jsonl", lines=True)
# Step 5 [Optional]: Reupload the modified dataset to Predibase for training
modified_dataset = pb.datasets.from_pandas_dataframe(augmented_df, "augmented_seed_dataset_20_rows")
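One change that may be worth making between the download and reupload steps is dropping exact duplicate rows from the generated data. The sketch below uses only the standard library and works on raw JSONL lines. The function name and the prompt/completion column names are hypothetical, and the real columns depend on the seed dataset.

```python
import json


def dedupe_jsonl_lines(lines):
    """Parse JSONL lines and drop exact duplicate rows, keeping first occurrences."""
    seen = set()
    unique_rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        key = json.dumps(row, sort_keys=True)  # canonical form, independent of key order
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical rows; the second repeats the first with its keys in a different order.
sample = [
    '{"prompt": "a", "completion": "1"}',
    '{"completion": "1", "prompt": "a"}',
    '{"prompt": "b", "completion": "2"}',
]
rows = dedupe_jsonl_lines(sample)
print(len(rows))  # 2
```

For the downloaded file, the lines could come from open("augmented_seed_dataset_20_rows.jsonl"), and pd.DataFrame(rows) would give a DataFrame suitable for the reupload step.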