The Datasets API provides methods for uploading, retrieving, and managing
datasets for fine-tuning.
Upload Dataset
From File
pb.datasets.from_file(
path: str, # Path to the file to upload
name: str, # Name for the dataset
) -> Dataset
Upload a local file and create a dataset from its contents.
Parameters
- path: str - Path to the file to upload
- name: str - Name for the dataset
Returns
- Dataset - The created dataset object
Example 1: Upload a JSONL file
# Upload a JSONL file as a dataset
dataset = pb.datasets.from_file(
path="./data/news_articles.jsonl",
name="tldr_news",
)
Example 2: Upload a CSV file
# Upload a CSV file as a dataset
dataset = pb.datasets.from_file(
path="./data/customer_qa.csv",
name="customer_questions",
)
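Before uploading, it helps to know the shape of a JSONL dataset: one standalone JSON object per line. As a sketch (the "prompt"/"completion" field names simply mirror the DataFrame example below; use whatever schema your fine-tuning task expects), such a file can be built with the standard library:

```python
import json

# One JSON object per line; field names here are illustrative.
records = [
    {"prompt": "Summarize this article:", "completion": "This is a summary."},
    {"prompt": "Explain this concept:", "completion": "This is an explanation."},
]

with open("news_articles.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

The resulting file can then be passed as the path argument to pb.datasets.from_file.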
From Pandas DataFrame
pb.datasets.from_pandas_dataframe(
df: pd.DataFrame, # Pandas DataFrame to upload
name: str, # Name for the dataset
) -> Dataset
Upload a Pandas DataFrame and create a dataset from it.
Parameters
- df: pd.DataFrame - Pandas DataFrame to upload
- name: str - Name for the dataset
Returns
- Dataset - The created dataset object
Example
import pandas as pd
# Create a sample DataFrame
data = {
"prompt": ["Summarize this article:", "Explain this concept:"],
"completion": ["This is a summary.", "This is an explanation."]
}
df = pd.DataFrame(data)
# Upload the DataFrame as a dataset
dataset = pb.datasets.from_pandas_dataframe(
df=df,
name="sample_dataset",
)
Get Dataset
pb.datasets.get(
dataset_ref: str # Name or ID of the dataset
) -> Dataset
Retrieves and returns a dataset by its reference name or ID.
Parameters
- dataset_ref: str - Name or ID of the dataset
Returns
- Dataset - The requested dataset object
Example
# Get a dataset by name
dataset = pb.datasets.get("tldr_news")
# Print dataset details
print(f"Dataset: {dataset.name}")
print(f"Description: {dataset.description}")
print(f"Number of rows: {dataset.num_rows}")
print(f"Created at: {dataset.created_at}")
Download Dataset
pb.datasets.download(
dataset_ref: str, # Name of the dataset
dest: str = None, # Local destination path
) -> str
Download a dataset to your local machine. The dataset will be saved in JSONL format.
Parameters
- dataset_ref: str - Name of the dataset
- dest: str, optional - Local destination to download the dataset
Returns
- str - Path to the downloaded dataset
Example
# Download a dataset as JSONL
download_path = pb.datasets.download(
dataset_ref="tldr_news",
dest="./downloaded_datasets/",
)
print(f"Downloaded dataset to: {download_path}")
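The downloaded file is JSONL, so it can be parsed line by line. A minimal sketch (the sample file and load_jsonl helper are illustrative, not part of the SDK; with a real download, pass the path returned by pb.datasets.download instead):

```python
import json
from pathlib import Path

def load_jsonl(path):
    # Parse a JSONL file (the format the download produces) into dicts.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Demonstrate on a small local file standing in for a real download.
sample = Path("sample_download.jsonl")
sample.write_text('{"prompt": "Summarize:", "completion": "A summary."}\n')

rows = load_jsonl(sample)
print(rows[0]["completion"])  # -> A summary.
```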
Augment Dataset
pb.datasets.augment(
config: dict | AugmentationConfig, # The configuration for the augmentation
dataset: Dataset, # The dataset to augment
name: str = None, # The name of the generated dataset, must be unique in the connection
openai_api_key: str = None, # The OpenAI API key. If not provided, will be read from `OPENAI_API_KEY`
) -> Dataset
Generate synthetic training examples by augmenting an existing dataset using large language models (LLMs). This method creates new, diverse examples that maintain the style and characteristics of your original dataset while expanding its size and variety.
Parameters
- config: dict | AugmentationConfig - The configuration for the augmentation
- dataset: Dataset - The dataset to augment
- name: str, optional - The name of the generated dataset, must be unique in the connection
- openai_api_key: str, optional - The OpenAI API key. If not provided, will be read from
OPENAI_API_KEY
Returns
- Dataset - The augmented dataset object
Example
from predibase import AugmentationConfig
# Create an augmentation configuration
config = AugmentationConfig(
base_model="gpt-4-turbo", # Required: The OpenAI model to use
num_samples_to_generate=500, # Optional: Number of synthetic examples to generate
num_seed_samples=10, # Optional: Number of seed samples to use
augmentation_strategy="mixture_of_agents", # Optional: Augmentation strategy
task_context="Generate diverse examples for customer service questions" # Optional: Task context
)
# Get the source dataset
source_dataset = pb.datasets.get("customer_questions")
# Augment the dataset
augmented_dataset = pb.datasets.augment(
config=config,
dataset=source_dataset,
name="augmented_customer_questions",
openai_api_key="your-openai-api-key" # Optional: If not set in environment
)
# Print dataset details
print(f"Created augmented dataset: {augmented_dataset.name}")
print(f"Number of rows: {augmented_dataset.num_rows}")
print(f"Created at: {augmented_dataset.created_at}")
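Because config accepts either a dict or an AugmentationConfig, the same configuration can also be written as a plain dict (field names taken from the AugmentationConfig example above):

```python
# Equivalent plain-dict form of the AugmentationConfig example above.
config = {
    "base_model": "gpt-4-turbo",
    "num_samples_to_generate": 500,
    "num_seed_samples": 10,
    "augmentation_strategy": "mixture_of_agents",
    "task_context": "Generate diverse examples for customer service questions",
}
```

This dict can be passed directly as config=config in the pb.datasets.augment call.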
For more details about the available configuration options, see the AugmentationConfig documentation.