pb.datasets.from_pandas_dataframe(
    df: pd.DataFrame,  # Pandas DataFrame to upload
    name: str,         # Name for the dataset
) -> Dataset
Upload a Pandas DataFrame from which a dataset will be created.

Parameters
df: pd.DataFrame - Pandas DataFrame to upload
name: str - Name for the dataset
Returns
Dataset - The created dataset object
Example
import pandas as pd

# Create a sample DataFrame
data = {
    "prompt": ["Summarize this article:", "Explain this concept:"],
    "completion": ["This is a summary.", "This is an explanation."],
}
df = pd.DataFrame(data)

# Upload the DataFrame as a dataset
dataset = pb.datasets.from_pandas_dataframe(
    df=df,
    name="sample_dataset",
)
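If your records already live in a file, you can load them with pandas first and upload the resulting DataFrame the same way. A minimal sketch, assuming a local CSV file named my_data.csv (a hypothetical path and column layout, not part of the API):

import pandas as pd

# Load records from a local CSV file (hypothetical path)
df = pd.read_csv("my_data.csv")

# Upload the loaded DataFrame as a dataset
dataset = pb.datasets.from_pandas_dataframe(df=df, name="my_csv_dataset")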
pb.datasets.get(
    dataset_ref: str,  # Name or ID of the dataset
) -> Dataset
Retrieve a dataset by its reference name or ID.

Parameters
dataset_ref: str - Name or ID of the dataset
Returns
Dataset - The requested dataset object
Example
# Get a dataset by name
dataset = pb.datasets.get("tldr_news")

# Print dataset details
print(f"Dataset: {dataset.name}")
print(f"Description: {dataset.description}")
print(f"Number of rows: {dataset.num_rows}")
print(f"Created at: {dataset.created_at}")
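Since get is often the first call in a pipeline, you may want to guard against a missing or misspelled dataset name. A minimal sketch using a generic handler; the specific exception type raised by the SDK is not documented here, so Exception is used as a placeholder:

# Guard against a missing or misspelled dataset name
try:
    dataset = pb.datasets.get("tldr_news")
except Exception as err:  # placeholder: the SDK's exact exception type is not documented here
    print(f"Could not fetch dataset: {err}")
    raise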
pb.datasets.download(
    dataset_ref: str,  # Name of the dataset
    dest: str = None,  # Local destination path
) -> str
Download a dataset to your local machine. The dataset will be saved in JSONL format.

Parameters
dataset_ref: str - Name of the dataset
dest: str, optional - Local destination path for the downloaded dataset
Returns
str - Path to the downloaded dataset
Example
# Download a dataset as JSONL, capturing the returned path
download_path = pb.datasets.download(
    dataset_ref="tldr_news",
    dest="./downloaded_datasets/",
)
print(f"Downloaded dataset to: {download_path}")
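Because the file is written in JSONL format, it can be loaded back into a DataFrame with pandas. A minimal sketch, assuming the returned download_path points at the downloaded JSONL file:

import pandas as pd

# Load the downloaded JSONL file back into a DataFrame
df = pd.read_json(download_path, lines=True)
print(df.head())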
pb.datasets.augment(
    config: dict | AugmentationConfig,  # The configuration for the augmentation
    dataset: Dataset,                   # The dataset to augment
    name: str,                          # Name of the generated dataset; must be unique in the connection
    openai_api_key: str,                # OpenAI API key; if not provided, read from `OPENAI_API_KEY`
) -> Dataset
Generate synthetic training examples by augmenting an existing dataset using large language models (LLMs). This method creates new, diverse examples that maintain the style and characteristics of your original dataset while expanding its size and variety.

Parameters
config: dict | AugmentationConfig - The configuration for the augmentation
dataset: Dataset - The dataset to augment
name: str, optional - The name of the generated dataset, must be unique in the connection
openai_api_key: str, optional - The OpenAI API key. If not provided, will be read from OPENAI_API_KEY
Returns
Dataset - The augmented dataset object
Example
from predibase import AugmentationConfig

# Create an augmentation configuration
config = AugmentationConfig(
    base_model="gpt-4-turbo",                   # Required: The OpenAI model to use
    num_samples_to_generate=500,                # Optional: Number of synthetic examples to generate
    num_seed_samples=10,                        # Optional: Number of seed samples to use
    augmentation_strategy="mixture_of_agents",  # Optional: Augmentation strategy
    task_context="Generate diverse examples for customer service questions",  # Optional: Task context
)

# Get the source dataset
source_dataset = pb.datasets.get("customer_questions")

# Augment the dataset
augmented_dataset = pb.datasets.augment(
    config=config,
    dataset=source_dataset,
    name="augmented_customer_questions",
    openai_api_key="your-openai-api-key",  # Optional: If not set in environment
)

# Print dataset details
print(f"Created augmented dataset: {augmented_dataset.name}")
print(f"Number of rows: {augmented_dataset.num_rows}")
print(f"Created at: {augmented_dataset.created_at}")
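Because config also accepts a plain dict, the same configuration can be passed without constructing an AugmentationConfig. A minimal sketch; the dict keys are assumed to mirror the AugmentationConfig field names shown above:

# Equivalent configuration passed as a plain dict
# (key names assumed to match the AugmentationConfig fields above)
config = {
    "base_model": "gpt-4-turbo",
    "num_samples_to_generate": 500,
    "num_seed_samples": 10,
    "augmentation_strategy": "mixture_of_agents",
    "task_context": "Generate diverse examples for customer service questions",
}

augmented_dataset = pb.datasets.augment(
    config=config,
    dataset=source_dataset,
    name="augmented_customer_questions_dict",
)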
For more details about the available configuration options, see the AugmentationConfig documentation.