Train and Evaluation Splits
For all of our dataset formats, you can assign an optional split column. The values in this column can be either train or evaluation. Any row with split == 'evaluation' will be used to compute evaluation metrics for the model at every checkpoint, while the train examples are used for model training.
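For example, in a JSONL dataset the split column is simply an extra field on each row. The rows below are illustrative (shown here in the prompt-completion format):

```json
{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris.", "split": "train"}
{"prompt": "What is the capital of Japan?", "completion": "The capital of Japan is Tokyo.", "split": "evaluation"}
```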
Create a Dataset
You can connect datasets to Predibase via the UI or Python SDK. Datasets can be uploaded from local files or connected from external data sources like Amazon S3, Snowflake, and Databricks.
Upload Local File
To upload a file via the SDK, you can use:
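A minimal sketch, assuming the SDK's pb.datasets.from_file helper and placeholder values for the API token and file path; check the SDK reference for the exact signature:

```python
from predibase import Predibase

# Placeholder token and path; replace with your own values.
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Upload a local file and register it as a dataset in Predibase.
dataset = pb.datasets.from_file("/path/to/dataset.jsonl", name="my_dataset")
```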
External Data Sources
Connecting data from an external data source is an Enterprise feature that provides a few key benefits over local file uploads:
- Larger Datasets: External data sources can handle larger datasets than the 100MB limit for file uploads.
- Secure Access: No data is persisted in Predibase, and credentials are encrypted and stored securely.
- Out-of-Core Processing: External data sources can be streamed to handle datasets that are too large to fit in memory.
The following external data sources are supported:
- Amazon S3
- Snowflake
- Databricks
- BigQuery
Supported File Formats
Predibase supports uploading datasets from local files, or connecting to external data sources including Amazon S3, Snowflake, and Databricks. For file uploads, the following file formats are supported:
- JSON (.json)
- JSONL (.jsonl)
- CSV (.csv)
- Parquet (.parquet)
Troubleshooting and Common Errors
Here is a list of common errors that you might run into during file uploads. The most typical issues encountered generally fall under JSONL formatting errors. Try running the code snippet below first to see if your dataset is properly formatted.
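As a first sanity check, you can try loading the file with pandas. This is a minimal sketch assuming a JSONL dataset; adjust the reader for CSV or Parquet files:

```python
import pandas as pd

# If the file is well-formed JSONL, this succeeds; otherwise it raises
# one of the errors described below.
df = pd.read_json("/path/to/dataset.jsonl", lines=True)
print(df.columns.tolist())
print(df.head())
```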
ValueError: Trailing data
- This generally occurs when a dataset formatted as a JSONL file is uploaded with a .json file extension. Try uploading the same dataset with the .jsonl file extension.
No ':' found when decoding object value
- This generally occurs with malformed JSON. Check that the dataset file is formatted correctly.
C error: Expected x fields in line y, saw z
- This generally occurs when one or more rows in the dataset contain too many or too few entries. Check the error message for the problematic line and make sure that it is formatted correctly. Also, make sure the dataset is formatted as specified in the Dataset Schema section below.
ValueError: Expected object or value
- This generally occurs with malformed JSON. Check whether the code snippet above can properly read the dataset. The problem may involve the encoding or the structure of the JSON file.
Dataset Schema
The required schema for the dataset depends on the specific fine-tuning task type.
Supervised Fine-Tuning
Prompt-Completion Format
For SFT tasks, your dataset must contain two columns named prompt and completion:
- prompt: Your input prompt. It serves as the starting point or the guiding information for the model.
- completion: The expected response that corresponds to the input provided in the “prompt” column.
- split (optional): Should be either train or evaluation. To learn more, check out the Train and Evaluation Splits section above.
instruct_dataset.jsonl
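The rows below are an illustrative example of the prompt-completion format:

```json
{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
{"prompt": "Translate to French: Good morning!", "completion": "Bonjour !"}
```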
To have the appropriate chat template for the base model applied automatically, set apply_chat_template=True in your fine-tuning config. Note: This is only applicable for instruction-tuned models.
Messages Format
For chat fine-tuning tasks, your dataset must contain one column named messages:
- messages: Conversations in a JSON-style format with defined roles, which should be familiar to users who have worked with OpenAI. Each row must contain at least one user role and one assistant role. A weight (0 or 1) can be passed in for assistant messages to determine whether or not they are used for calculating loss (0 means no, 1 means yes; defaults to 1).
- split (optional): Should be either train or evaluation. To learn more, check out the Train and Evaluation Splits section above.
Note: datasets with a messages column must be uploaded in the .jsonl format.
Example of chat dataset:
chat_dataset.jsonl
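The row below is an illustrative example of the messages format:

```json
{"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}]}
```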
Example of a chat dataset with a split column and message weights:
chat_dataset_with_split.jsonl
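Illustrative rows showing the optional split column and weights on assistant messages:

```json
{"messages": [{"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "Let me think about that.", "weight": 0}, {"role": "user", "content": "Please give the final answer."}, {"role": "assistant", "content": "2 + 2 = 4.", "weight": 1}], "split": "train"}
{"messages": [{"role": "user", "content": "Name a primary color."}, {"role": "assistant", "content": "Red is a primary color.", "weight": 1}], "split": "evaluation"}
```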
Continued Pretraining
For CPT tasks, your dataset must contain one column named text:
- text: Your input text. The model will learn to do next-token prediction on these inputs.
- split (optional): Should be either train or evaluation. To learn more, check out the Train and Evaluation Splits section above.
completion_dataset.jsonl
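An illustrative example of the continued pretraining format:

```json
{"text": "Mitochondria are membrane-bound organelles that generate most of the chemical energy needed to power a cell."}
{"text": "Photosynthesis converts light energy into chemical energy stored in glucose."}
```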
Reinforcement Fine-Tuning (GRPO)
For GRPO tasks, your dataset must contain one column named prompt:
- prompt: Your input prompt. It serves as the starting point or the guiding information for the model.
- split (optional): Should be either train or evaluation. To learn more, check out the Train and Evaluation Splits section above.
grpo_dataset.jsonl
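An illustrative example of the GRPO prompt format:

```json
{"prompt": "Solve for x: 3x + 5 = 20. Explain your reasoning, then state the final answer."}
{"prompt": "What is the next number in the sequence 2, 4, 8, 16? Explain your reasoning."}
```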
Classification
For classification, your dataset must contain the columns text and label:
- text: Your input text that is classified by the model.
- label: The label for the text in the text column. This should be a string.
- split (optional): Should be either train or evaluation. To learn more, check out the Train and Evaluation Splits section above.
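An illustrative example of the classification format:

```json
{"text": "The food was incredible and the staff were friendly.", "label": "positive", "split": "train"}
{"text": "I waited an hour and my order was still wrong.", "label": "negative", "split": "evaluation"}
```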
Text
The data in the text field is passed to the model as its prompt exactly as is. If you want to use chat messages, first apply the chat template to them using the transformers library and use the resulting string as the text field. For example:
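A minimal sketch using the transformers tokenizer; the model name is only an example, so use the base model you plan to fine-tune:

```python
from transformers import AutoTokenizer

# Example model name; use the base model you plan to fine-tune.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "user", "content": "The food was incredible and the staff were friendly."},
]

# Render the chat messages with the model's chat template and use the
# resulting string as the value of the text column.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
```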
Label
As a best practice, verify that your labels don't have any unnecessary punctuation or spaces. Also use consistent casing, since the labels positive and Positive will be treated as different labels.
A simple normalization function for the labels is:
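One possible implementation (illustrative):

```python
def normalize_label(label: str) -> str:
    # Strip surrounding whitespace and trailing punctuation, then lowercase
    # so that values like " Positive." and "positive" map to the same label.
    return label.strip().strip(".,;:!?").lower()
```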
Pretokenized Datasets for Large-Scale Training (Advanced)
For large training jobs that need to process over 1GB of training data, we suggest creating a pretokenized dataset using the HuggingFace Datasets package. Once you’ve written your dataset using the save_to_disk helper function and uploaded it to an external data source like S3, you can connect it in Predibase for fine-tuning provided it follows the schema below:
- input_ids: The input prompt and expected response, both tokenized and concatenated together.
- labels: The labels array must have the same length as input_ids. For instruction tuning, set labels to -100 for prompt tokens and use the completion tokens as is. For continued pretraining, labels should be identical to input_ids. Any label with value -100 will be ignored during loss calculation.
- split (optional): Should be either train or evaluation. To learn more, check out the Train and Evaluation Splits section above. We recommend skipping the split column for large-scale training.
When using pretokenized datasets, the task type specified in the config is
effectively ignored. Loss computation is based entirely on the labels
provided. If you are unfamiliar with properly formatting a pretokenized
dataset for your task, we highly encourage you to consider using a text
dataset and not a pretokenized one.
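If you do use a pretokenized dataset, the sketch below shows one way to build it for instruction tuning with the HuggingFace Datasets package; the model name and source columns are illustrative:

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Example tokenizer; use the tokenizer of the base model you plan to fine-tune.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def tokenize_example(example):
    # Tokenize prompt and completion separately so prompt tokens can be masked.
    prompt_ids = tokenizer(example["prompt"], add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(example["completion"], add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + completion_ids
    # Prompt tokens get -100 so they are ignored in the loss;
    # completion tokens are kept as is.
    labels = [-100] * len(prompt_ids) + completion_ids
    return {"input_ids": input_ids, "labels": labels}

raw = Dataset.from_list([
    {"prompt": "What is the capital of France?", "completion": " The capital of France is Paris."},
])

tokenized = raw.map(tokenize_example, remove_columns=["prompt", "completion"])

# Write the pretokenized dataset to disk, then upload it to an external
# data source such as S3 and connect it in Predibase.
tokenized.save_to_disk("pretokenized_dataset")
```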