The Datasets API provides methods for uploading, retrieving, and managing datasets for fine-tuning.
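
The examples in this section assume an initialized client instance named pb. A typical setup sketch (the token value here is a placeholder):

```python
from predibase import Predibase

# The API token can also be supplied via the PREDIBASE_API_TOKEN
# environment variable instead of being passed explicitly.
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
```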

Upload Dataset

From File

pb.datasets.from_file(
    path: str,                    # Path to the file to upload
    name: str,                    # Name for the dataset
) -> Dataset

Upload a file from which a dataset will be created.

Parameters

  • path: str - Path to the file to upload
  • name: str - Name for the dataset

Returns

  • Dataset - The created dataset object

Example 1: Upload a JSONL file

# Upload a JSONL file as a dataset
dataset = pb.datasets.from_file(
    path="./data/news_articles.jsonl",
    name="tldr_news",
)

Example 2: Upload a CSV file

# Upload a CSV file as a dataset
dataset = pb.datasets.from_file(
    path="./data/customer_qa.csv",
    name="customer_questions",
)
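
For fine-tuning, each row typically pairs an input with a target output; the DataFrame example below uses prompt/completion columns. As a sketch (column names and paths here are illustrative; your task may expect a different schema), one way to build a JSONL file locally before uploading:

```python
import json

# Hypothetical records in the prompt/completion shape used by the
# DataFrame example in this document; your schema may differ.
records = [
    {"prompt": "Summarize this article:", "completion": "This is a summary."},
    {"prompt": "Explain this concept:", "completion": "This is an explanation."},
]

path = "sample.jsonl"  # illustrative path
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# The file can then be uploaded:
# dataset = pb.datasets.from_file(path=path, name="sample_dataset")
```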

From Pandas DataFrame

pb.datasets.from_pandas_dataframe(
    df: pd.DataFrame,             # Pandas DataFrame to upload
    name: str,                    # Name for the dataset
) -> Dataset

Upload a Pandas DataFrame from which a dataset will be created.

Parameters

  • df: pd.DataFrame - Pandas DataFrame to upload
  • name: str - Name for the dataset

Returns

  • Dataset - The created dataset object

Example

import pandas as pd

# Create a sample DataFrame
data = {
    "prompt": ["Summarize this article:", "Explain this concept:"],
    "completion": ["This is a summary.", "This is an explanation."]
}
df = pd.DataFrame(data)

# Upload the DataFrame as a dataset
dataset = pb.datasets.from_pandas_dataframe(
    df=df,
    name="sample_dataset",  
)

Get Dataset

pb.datasets.get(
    dataset_ref: str             # Name or ID of the dataset
) -> Dataset

Retrieve a dataset by its reference name or ID.

Parameters

  • dataset_ref: str - Name or ID of the dataset

Returns

  • Dataset - The requested dataset object

Example

# Get a dataset by name
dataset = pb.datasets.get("tldr_news")

# Print dataset details
print(f"Dataset: {dataset.name}")
print(f"Description: {dataset.description}")
print(f"Number of rows: {dataset.num_rows}")
print(f"Created at: {dataset.created_at}")

Download Dataset

pb.datasets.download(
    dataset_ref: str,     # Name of the dataset
    dest: str = None,     # Local destination path
) -> str

Download a dataset to your local machine. The dataset will be saved in JSONL format.

Parameters

  • dataset_ref: str - Name of the dataset
  • dest: str, optional - Local destination path for the downloaded dataset

Returns

  • str - Path to the downloaded dataset

Example

# Download a dataset as JSONL
download_path = pb.datasets.download(
    dataset_ref="tldr_news",
    dest="./downloaded_datasets/",
)
print(f"Downloaded dataset to: {download_path}")
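
Since downloads are saved as JSONL, the file can be loaded straight into pandas for inspection. A minimal sketch (the stand-in file below is created only for illustration; in practice, read from the path returned by download):

```python
import json
import pandas as pd

# Stand-in for a downloaded dataset file; in practice, use the path
# returned by pb.datasets.download(...)
path = "tldr_news.jsonl"
rows = [
    {"prompt": "Summarize this article:", "completion": "This is a summary."},
    {"prompt": "Explain this concept:", "completion": "This is an explanation."},
]
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# JSONL loads directly into a DataFrame with lines=True
df = pd.read_json(path, lines=True)
print(len(df))
```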

Augment Dataset

pb.datasets.augment(
    config: dict | AugmentationConfig,  # The configuration for the augmentation
    dataset: Dataset,                   # The dataset to augment
    name: str = None,                   # Optional: name of the generated dataset; must be unique in the connection
    openai_api_key: str = None,         # Optional: OpenAI API key; read from `OPENAI_API_KEY` if not provided
) -> Dataset

Generate synthetic training examples by augmenting an existing dataset using large language models (LLMs). This method creates new, diverse examples that maintain the style and characteristics of your original dataset while expanding its size and variety.

Parameters

  • config: dict | AugmentationConfig - The configuration for the augmentation
  • dataset: Dataset - The dataset to augment
  • name: str, optional - The name of the generated dataset, must be unique in the connection
  • openai_api_key: str, optional - The OpenAI API key. If not provided, will be read from OPENAI_API_KEY

Returns

  • Dataset - The augmented dataset object

Example

from predibase import AugmentationConfig

# Create an augmentation configuration
config = AugmentationConfig(
    base_model="gpt-4-turbo",  # Required: The OpenAI model to use
    num_samples_to_generate=500,  # Optional: Number of synthetic examples to generate
    num_seed_samples=10,  # Optional: Number of seed samples to use
    augmentation_strategy="mixture_of_agents",  # Optional: Augmentation strategy
    task_context="Generate diverse examples for customer service questions"  # Optional: Task context
)

# Get the source dataset
source_dataset = pb.datasets.get("customer_questions")

# Augment the dataset
augmented_dataset = pb.datasets.augment(
    config=config,
    dataset=source_dataset,
    name="augmented_customer_questions",
    openai_api_key="your-openai-api-key"  # Optional: If not set in environment
)

# Print dataset details
print(f"Created augmented dataset: {augmented_dataset.name}")
print(f"Number of rows: {augmented_dataset.num_rows}")
print(f"Created at: {augmented_dataset.created_at}")
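
Because config also accepts a plain dict, the same settings can be passed without importing AugmentationConfig. A sketch, assuming the dict keys mirror the AugmentationConfig fields shown above:

```python
# Hypothetical dict form mirroring the AugmentationConfig fields above
config = {
    "base_model": "gpt-4-turbo",
    "num_samples_to_generate": 500,
    "num_seed_samples": 10,
    "augmentation_strategy": "mixture_of_agents",
    "task_context": "Generate diverse examples for customer service questions",
}

# The dict is passed exactly where the AugmentationConfig object would go:
# augmented_dataset = pb.datasets.augment(
#     config=config,
#     dataset=pb.datasets.get("customer_questions"),
#     name="augmented_customer_questions",
# )
```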

For more details about the available configuration options, see the AugmentationConfig documentation.