The Datasets API provides methods for uploading, retrieving, and managing
datasets for fine-tuning.
Upload Dataset
From File
pb.datasets.from_file(
path: str, # Path to the file to upload
name: str, # Name for the dataset
) -> Dataset
Upload a local file and create a dataset from its contents.
Parameters
- path: str - Path to the file to upload
- name: str - Name for the dataset
Returns
- Dataset - The created dataset object
Example 1: Upload a JSONL file
# Upload a JSONL file as a dataset
dataset = pb.datasets.from_file(
path="./data/news_articles.jsonl",
name="tldr_news",
)
Example 2: Upload a CSV file
# Upload a CSV file as a dataset
dataset = pb.datasets.from_file(
path="./data/customer_qa.csv",
name="customer_questions",
)
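Before uploading, it helps to know the shape of a JSONL dataset: one standalone JSON object per line. As a sketch (the "prompt"/"completion" field names simply mirror the DataFrame example below; use whatever schema your fine-tuning task expects), such a file can be built with the standard library:

```python
import json

# One JSON object per line; field names here are illustrative.
records = [
    {"prompt": "Summarize this article:", "completion": "This is a summary."},
    {"prompt": "Explain this concept:", "completion": "This is an explanation."},
]

with open("news_articles.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

The resulting file can then be passed as the path argument to pb.datasets.from_file.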
From Pandas DataFrame
pb.datasets.from_pandas_dataframe(
df: pd.DataFrame, # Pandas DataFrame to upload
name: str, # Name for the dataset
) -> Dataset
Upload a Pandas DataFrame and create a dataset from it.
Parameters
- df: pd.DataFrame - Pandas DataFrame to upload
- name: str - Name for the dataset
Returns
- Dataset - The created dataset object
Example
import pandas as pd
# Create a sample DataFrame
data = {
"prompt": ["Summarize this article:", "Explain this concept:"],
"completion": ["This is a summary.", "This is an explanation."]
}
df = pd.DataFrame(data)
# Upload the DataFrame as a dataset
dataset = pb.datasets.from_pandas_dataframe(
df=df,
name="sample_dataset",
)
Get Dataset
pb.datasets.get(
dataset_ref: str # Name or ID of the dataset
) -> Dataset
Retrieves and returns a dataset by its reference name or ID.
Parameters
- dataset_ref: str - Name or ID of the dataset
Returns
- Dataset - The requested dataset object
Example
# Get a dataset by name
dataset = pb.datasets.get("tldr_news")
# Print dataset details
print(f"Dataset: {dataset.name}")
print(f"Description: {dataset.description}")
print(f"Number of rows: {dataset.num_rows}")
print(f"Created at: {dataset.created_at}")
Download Dataset
pb.datasets.download(
dataset_ref: str, # Name of the dataset
dest: str = None, # Local destination path
) -> str
Download a dataset to your local machine. The dataset will be saved in JSONL format.
Parameters
- dataset_ref: str - Name of the dataset
- dest: str, optional - Local destination to download the dataset
Returns
- str - Path to the downloaded dataset
Example
# Download a dataset as JSONL
download_path = pb.datasets.download(
dataset_ref="tldr_news",
dest="./downloaded_datasets/",
)
print(f"Downloaded dataset to: {download_path}")
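The downloaded file is JSONL, so it can be parsed line by line. A minimal sketch (the sample file and load_jsonl helper are illustrative, not part of the SDK; with a real download, pass the path returned by pb.datasets.download instead):

```python
import json
from pathlib import Path

def load_jsonl(path):
    # Parse a JSONL file (the format the download produces) into dicts.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Demonstrate on a small local file standing in for a real download.
sample = Path("sample_download.jsonl")
sample.write_text('{"prompt": "Summarize:", "completion": "A summary."}\n')

rows = load_jsonl(sample)
print(rows[0]["completion"])  # -> A summary.
```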
Augment Dataset
pb.datasets.augment(
config: dict | AugmentationConfig, # The configuration for the augmentation
dataset: Dataset, # The dataset to augment
name: str = None, # The name of the generated dataset, must be unique in the connection
openai_api_key: str = None, # The OpenAI API key. If not provided, will be read from `OPENAI_API_KEY`
) -> Dataset
Generate synthetic training examples by augmenting an existing dataset using large language models (LLMs). This method creates new, diverse examples that maintain the style and characteristics of your original dataset while expanding its size and variety.
Parameters
- config: dict | AugmentationConfig - The configuration for the augmentation
- dataset: Dataset - The dataset to augment
- name: str, optional - The name of the generated dataset, must be unique in the connection
- openai_api_key: str, optional - The OpenAI API key. If not provided, will be read from
OPENAI_API_KEY
Returns
- Dataset - The augmented dataset object
Example
from predibase import AugmentationConfig
# Create an augmentation configuration
config = AugmentationConfig(
base_model="gpt-4-turbo", # Required: The OpenAI model to use
num_samples_to_generate=500, # Optional: Number of synthetic examples to generate
num_seed_samples=10, # Optional: Number of seed samples to use
augmentation_strategy="mixture_of_agents", # Optional: Augmentation strategy
task_context="Generate diverse examples for customer service questions" # Optional: Task context
)
# Get the source dataset
source_dataset = pb.datasets.get("customer_questions")
# Augment the dataset
augmented_dataset = pb.datasets.augment(
config=config,
dataset=source_dataset,
name="augmented_customer_questions",
openai_api_key="your-openai-api-key" # Optional: If not set in environment
)
# Print dataset details
print(f"Created augmented dataset: {augmented_dataset.name}")
print(f"Number of rows: {augmented_dataset.num_rows}")
print(f"Created at: {augmented_dataset.created_at}")
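Because config accepts either a dict or an AugmentationConfig, the same configuration can also be written as a plain dict (field names taken from the AugmentationConfig example above):

```python
# Equivalent plain-dict form of the AugmentationConfig example above.
config = {
    "base_model": "gpt-4-turbo",
    "num_samples_to_generate": 500,
    "num_seed_samples": 10,
    "augmentation_strategy": "mixture_of_agents",
    "task_context": "Generate diverse examples for customer service questions",
}
```

This dict can be passed directly as config=config in the pb.datasets.augment call.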
For more details about the available configuration options, see the AugmentationConfig documentation.