The first step in fine-tuning your model is to prepare a training dataset — a file containing examples for the model to learn from.

Create a Dataset

You can connect datasets to Predibase via the UI or Python SDK. Datasets can be uploaded from local files or connected from external data sources like Amazon S3, Snowflake, and Databricks.

Upload Local File

To upload a local file via the SDK:

from predibase import Predibase
pb = Predibase(api_token="<API_TOKEN>")

dataset = pb.datasets.from_file("/path/to/dataset.csv", name="my_dataset")

External Data Sources

Connecting data from an external data source is an Enterprise feature that provides a few key benefits over local file uploads:

  • Larger Datasets: External data sources can handle larger datasets than the 100MB limit for file uploads.
  • Secure Access: No data is persisted in Predibase, and credentials are encrypted and stored securely.
  • Out-of-Core Processing: External data sources can be streamed to handle datasets that are too large to fit in memory.

You can connect datasets from an external data source through the Web UI. Navigate to Data > Connect Data and select the data source you want to connect.

Supported external data sources include:

  • Amazon S3
  • Snowflake
  • Databricks
  • BigQuery

Supported File Formats

For file uploads, the following file formats are supported:

  • JSON (.json)
  • JSONL (.jsonl)
  • CSV (.csv)
  • Parquet (.parquet)
  • MS Excel (.xlsx)

There is a limit of 100MB for file uploads. If your dataset is larger than 100MB, try using an external data source to connect your data to Predibase.

Dataset Schema

The required schema for the dataset depends on the specific fine-tuning task type.

Supervised Fine-Tuning

Prompt-Completion Format

For SFT tasks, your dataset must contain two columns named prompt and completion:

  • prompt: Your input prompt. It serves as the starting point or the guiding information for the model.
  • completion: The expected response that corresponds to the input provided in the “prompt” column.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.

Any additional columns in your dataset will not be used.

Example dataset:

instruct_dataset.jsonl
{"prompt": "What is your name?", "completion": "Hi, I'm Pred!"}
{"prompt": "How are you today?", "completion": "I'm doing well how are you?"}
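
A dataset in this format can be written with Python's standard json module. This sketch reproduces the example rows above; JSONL means one JSON object per line, with no enclosing array.

```python
import json

rows = [
    {"prompt": "What is your name?", "completion": "Hi, I'm Pred!"},
    {"prompt": "How are you today?", "completion": "I'm doing well how are you?"},
]

# JSONL: serialize each row as one JSON object on its own line
with open("instruct_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```

The resulting file can then be uploaded with pb.datasets.from_file as shown earlier.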

When doing supervised fine-tuning, you should add apply_chat_template=True in your fine-tuning config. This will automatically apply the appropriate chat template for the base model. Note: This is only applicable for instruction tuned models.

from predibase import SFTConfig

config = SFTConfig(
    base_model="llama-3-1-8b-instruct",
    apply_chat_template=True  # Automatically apply the model's chat template
)

Messages Format

For chat fine-tuning tasks, your dataset must contain one column named messages:

  • messages: A list of conversation turns in the JSON format used by the OpenAI chat API, each with a role and content. Each row must contain at least one user message and one assistant message. Assistant messages may include an optional weight field (0 or 1) that controls whether they are used for calculating loss (0 means no, 1 means yes; defaults to 1).
  • split (optional): Should be either train or evaluation. To learn more, check out this section.

We generally recommend saving these datasets in .jsonl format.

Example of chat dataset:

chat_dataset.jsonl
{"messages": [{"role": "system", "content": "You are a chatbot named Pred"}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hi, I'm Pred!"}]}
{"messages": [{"role": "system", "content": "You are a chatbot named Pred"}, {"role": "user", "content": "How are you?"}, {"role": "assistant", "content": "I'm doing well how are you?", "weight": 1}]}

You can specify a split column in your dataset to distinguish training and evaluation rows. For chat datasets, the split should be included in the top-level JSON object for each row alongside the messages:

chat_dataset_with_split.jsonl
{"split": "train", "messages": [{"role": "system", "content": "You are a chatbot named Pred"}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hi, I'm Pred!"}]}
{"split": "evaluation", "messages": [{"role": "system", "content": "You are a chatbot named Pred"}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hi, I'm Pred!"}]}
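
Before uploading, it can be worth checking each row against these rules. The validator below is a minimal sketch (not part of the Predibase SDK) that enforces the constraints described above: a messages list with at least one user and one assistant turn, weight restricted to 0 or 1, and an optional split column limited to train or evaluation.

```python
import json

def validate_chat_row(row):
    """Check one chat-format row against the schema described above."""
    messages = row.get("messages")
    assert isinstance(messages, list) and messages, "missing 'messages' list"
    roles = [m["role"] for m in messages]
    assert "user" in roles and "assistant" in roles, "need user and assistant turns"
    for m in messages:
        if m["role"] == "assistant" and "weight" in m:
            assert m["weight"] in (0, 1), "weight must be 0 or 1"
    if "split" in row:
        assert row["split"] in ("train", "evaluation"), "invalid split value"

row = json.loads(
    '{"messages": [{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Hi, I\'m Pred!", "weight": 1}]}'
)
validate_chat_row(row)  # raises AssertionError if the row is malformed
```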

Continued Pretraining

For CPT tasks, your dataset must contain one column named text:

  • text: Your input text. The model will learn to do next-token prediction on these inputs.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.

Any additional columns in your dataset will not be used.

Example:

completion_dataset.jsonl
{"text": "Once upon a time there was a dog."}
{"text": "The dog was very friendly."}
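
A raw text corpus can be converted into this format with the standard json module. This sketch splits a corpus on blank lines into paragraphs and adds the optional split column, holding out roughly the last 10% of rows (at least one) for evaluation; the corpus contents are illustrative.

```python
import json

corpus = """Once upon a time there was a dog.

The dog was very friendly.

Every day the dog greeted the neighbors."""

# One row per paragraph (blank-line separated)
paragraphs = [p.strip() for p in corpus.split("\n\n") if p.strip()]

# Hold out roughly the last 10% of rows (at least one) for evaluation
n_eval = max(1, len(paragraphs) // 10)
with open("completion_dataset.jsonl", "w") as f:
    for i, text in enumerate(paragraphs):
        split = "evaluation" if i >= len(paragraphs) - n_eval else "train"
        f.write(json.dumps({"text": text, "split": split}) + "\n")
```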

Reinforcement Fine-Tuning (GRPO)

For GRPO tasks, your dataset must contain one column named prompt:

  • prompt: Your input prompt. It serves as the starting point or the guiding information for the model.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.

Example dataset:

grpo_dataset.jsonl
{"prompt": "What is your name?"}
{"prompt": "How are you today?"}

While labels are not explicitly required for GRPO, you can pass in one or more columns of labels (or other data) in your dataset. These columns will then be passed to the reward functions you define so that you can use them to calculate rewards for your task.
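
To make that concrete, here is a minimal reward-function sketch. The function name, signature, and the label column are all illustrative assumptions, not the Predibase reward-function API: it simply shows how an extra dataset column can be used to score a completion.

```python
# Hypothetical reward function -- names and signature are illustrative,
# not the Predibase reward-function API.
def exact_match_reward(completion, row):
    """Return 1.0 if the completion matches the row's 'label' column
    (an extra dataset column passed through for reward calculation),
    else 0.0."""
    return 1.0 if completion.strip() == row.get("label", "").strip() else 0.0

row = {"prompt": "What is 2 + 2?", "label": "4"}
exact_match_reward("4", row)     # -> 1.0
exact_match_reward("five", row)  # -> 0.0
```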

Pretokenized Datasets for Large-Scale Training (Advanced)

For large training jobs that need to process over 1GB of training data, we suggest creating a pretokenized dataset using the HuggingFace Datasets package. Once you’ve written your dataset using the save_to_disk helper function and uploaded it to an external data source like S3, you can connect it in Predibase for fine-tuning provided it follows the schema below:

  • input_ids: The input prompt and expected response, both tokenized and concatenated together
  • labels: The labels array must have the same length as input_ids. For instruction tuning, set labels to -100 for prompt tokens and use the completion tokens as is. For continued pretraining, labels should be identical to input_ids. Any label with value -100 will be ignored during loss calculation.
  • split (optional): Should be either train or evaluation. To learn more, check out this section. We recommend skipping the split column for large-scale training.

Though the Datasets library allows for flexible schemas, to use your dataset in Predibase, it must contain columns input_ids and labels. All other columns will be ignored.
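
The label-masking rule above can be sketched as follows. The token ids are made up for illustration; a real pipeline would produce them with the base model's tokenizer and then write the columns out with the Datasets library's save_to_disk.

```python
# Illustrative token ids standing in for a real tokenizer's output
prompt_ids = [101, 2054, 2003, 102]   # tokenized prompt
completion_ids = [7632, 1010, 102]    # tokenized completion

# input_ids: prompt and completion tokens concatenated together
input_ids = prompt_ids + completion_ids

# Instruction tuning: mask prompt tokens with -100 so they are ignored
# during loss calculation; completion tokens are used as is
labels = [-100] * len(prompt_ids) + completion_ids
assert len(labels) == len(input_ids)

# Continued pretraining: labels are identical to input_ids
cpt_labels = list(input_ids)
```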

When using pretokenized datasets, the task type specified in the config is effectively ignored. Loss computation is based entirely on the labels provided. If you are unfamiliar with properly formatting a pretokenized dataset for your task, we highly encourage you to consider using a text dataset and not a pretokenized one.

Next Steps

For more guidance on preparing your datasets, including best practices and recommendations for train/evaluation splits, see our Fine-tuning Dataset Preparation Best Practices Guide.