The first step in fine-tuning your model is to prepare a training dataset — a file containing examples for the model to learn from.

Train and Evaluation Splits

For all of our dataset formats, you can include an optional split column. The values in this column must be either train or evaluation. Rows with split == 'evaluation' are used to compute evaluation metrics for the model at every checkpoint, while rows with split == 'train' are used for model training.
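For example, a prompt-completion dataset (see Supervised Fine-Tuning below) with a split column might look like this; the rows themselves are placeholders:
instruct_dataset_with_split.jsonl
{"prompt": "What is your name?", "completion": "Hi, I'm Pred!", "split": "train"}
{"prompt": "How are you today?", "completion": "I'm doing well, how are you?", "split": "evaluation"}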

Create a Dataset

You can connect datasets to Predibase via the UI or Python SDK. Datasets can be uploaded from local files or connected from external data sources like Amazon S3, Snowflake, and Databricks.

Upload Local File

To upload a file via the SDK, you can use:
from predibase import Predibase

# Authenticate with your Predibase API token.
pb = Predibase(api_token="<API_TOKEN>")

# Upload a local file and register it as a dataset named "my_dataset".
dataset = pb.datasets.from_file("/path/to/dataset.csv", name="my_dataset")

External Data Sources

Connecting data from an external data source is an Enterprise feature that provides a few key benefits over local file uploads:
  • Larger Datasets: External data sources can handle datasets larger than the 100MB limit for file uploads.
  • Secure Access: No data is persisted in Predibase, and credentials are encrypted and stored securely.
  • Out-of-Core Processing: External data sources can be streamed to handle datasets that are too large to fit in memory.
You can connect datasets from an external data source through the Web UI. Navigate to Data > Connect Data and select the data source you want to connect. Supported external data sources include:
  • Amazon S3
  • Snowflake
  • Databricks
  • BigQuery

Supported File Formats

Predibase supports uploading datasets from local files or connecting to external data sources, including Amazon S3, Snowflake, and Databricks. For file uploads, the following file formats are supported:
  • JSON (.json)
  • JSONL (.jsonl)
  • CSV (.csv)
  • Parquet (.parquet)
There is a limit of 100MB for file uploads. If your dataset is larger than 100MB, try using an external data source to connect your data to Predibase.
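As a quick sanity check before uploading, you can verify that your file is under the limit. A minimal sketch using only the Python standard library (the file path is a placeholder):
import os

MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # 100MB file upload limit

# If the file is too large, connect it via an external data source instead.
if os.path.getsize("/path/to/dataset.csv") > MAX_UPLOAD_BYTES:
    print("Dataset exceeds the 100MB upload limit; use an external data source.")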

Dataset Schema

The required schema for the dataset depends on the specific fine-tuning task type.

Supervised Fine-Tuning

Prompt-Completion Format

For SFT tasks, your dataset must contain two columns named prompt and completion:
  • prompt: Your input prompt. It serves as the starting point or the guiding information for the model.
  • completion: The expected response that corresponds to the input provided in the “prompt” column.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.
Any additional columns in your dataset will not be used. Example dataset:
instruct_dataset.jsonl
{"prompt": "What is your name?", "completion": "Hi, I'm Pred!"}
{"prompt": "How are you today?", "completion": "I'm doing well how are you?"}
When doing supervised fine-tuning, you should set apply_chat_template=True in your fine-tuning config. This will automatically apply the appropriate chat template for the base model. Note: this is only applicable to instruction-tuned models.
from predibase import SFTConfig

config = SFTConfig(
    base_model="llama-3-1-8b-instruct",
    apply_chat_template=True  # Automatically apply the model's chat template
)
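If you are curious what a chat template does, the rough sketch below renders a prompt/completion pair with a tokenizer's chat template using the transformers library. The Hugging Face model id is an assumption (and may require authentication), and Predibase handles this step for you when apply_chat_template=True:
from transformers import AutoTokenizer

# Assumed Hugging Face id for the instruct model; substitute your own base model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

row = {"prompt": "What is your name?", "completion": "Hi, I'm Pred!"}
messages = [
    {"role": "user", "content": row["prompt"]},
    {"role": "assistant", "content": row["completion"]},
]

# Renders the conversation with the model's role markers and special tokens.
print(tokenizer.apply_chat_template(messages, tokenize=False))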

Messages Format

For chat fine-tuning tasks, your dataset must contain one column named messages:
  • messages: Conversations in a JSON-style format with defined roles; this format will be familiar if you have worked with the OpenAI chat API. Each row must contain at least one user role and one assistant role. An optional weight (0 or 1) can be set on assistant messages to control whether they are used when calculating the loss (0 means no, 1 means yes; defaults to 1), as shown in the second example below.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.
We generally recommend saving these datasets in .jsonl format. Example of chat dataset:
chat_dataset.jsonl
{"messages": [{"role": "system", "content": "You are a chatbot named Pred"}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hi, I'm Pred!"}]}
{"messages": [{"role": "system", "content": "You are a chatbot named Pred"}, {"role": "user", "content": "How are you?"}, {"role": "assistant", "content": "I'm doing well how are you?", "weight": 1}]}
You can specify a split column in your dataset to distinguish training and evaluation rows. For chat datasets, the split should be included in the top-level JSON object for each row alongside the messages:
chat_dataset_with_split.jsonl
{"split": "train", "messages": [{"role": "system", "content": "You are a chatbot named Pred"}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hi, I'm Pred!"}]}
{"split": "evaluation", "messages": [{"role": "system", "content": "You are a chatbot named Pred"}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hi, I'm Pred!"}]}

Continued Pretraining

For CPT tasks, your dataset must contain one column named text:
  • text: Your input text. The model will learn to do next-token prediction on these inputs.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.
Any additional columns in your dataset will not be used. Example:
completion_dataset.jsonl
{"text": "Once upon a time there was dog."}
{"text": "The dog was very friendly."}

Reinforcement Fine-Tuning (GRPO)

For GRPO tasks, your dataset must contain one column named prompt:
  • prompt: Your input prompt. It serves as the starting point or the guiding information for the model.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.
Example dataset:
grpo_dataset.jsonl
{"prompt": "What is your name?"}
{"prompt": "How are you today?"}
While labels are not explicitly required for GRPO, you can pass in one or more columns of labels (or other data) in your dataset. These columns will then be passed to the reward functions you define so that you can use them to calculate rewards for your task.
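For example, a dataset with a hypothetical answer column that your reward functions can compare completions against might look like:
grpo_dataset_with_labels.jsonl
{"prompt": "What is 2 + 2?", "answer": "4"}
{"prompt": "What is the capital of France?", "answer": "Paris"}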

Classification

For classification, your dataset must contain the columns text and label:
  • text: Your input text that is classified by the model.
  • label: The label for the text in the text column. This should be a string.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.
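For illustration, a minimal classification dataset (the rows are made up) might look like:
classification_dataset.jsonl
{"text": "This film was a delightful surprise.", "label": "positive"}
{"text": "Two hours I will never get back.", "label": "negative"}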

Text

The data in the text field is passed to the model as its prompt exactly as-is. If you want to use chat messages, first apply the chat template to them using the transformers library and use the resulting string as the text field. For example:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-base-model")
messages = [
    {"role": "system", "content": "Your job is to label a movie review as either positive or negative."},
    {"role": "user", "content": "This film was a delightful surprise—sharp writing, strong performances, and visuals that kept me hooked from start to finish."},
]
# Render the messages with the model's chat template; use this string as the text column.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

Label

As a best practice, verify that your labels don't have any unnecessary punctuation or spaces. Also use consistent casing, since the labels positive and Positive will be treated as different labels. A simple normalization function for labels is:
def normalize(label: str) -> str:
    # Strip surrounding whitespace and lowercase for consistent labels.
    return label.strip().lower()
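For example, you could apply it to the label column with pandas before uploading (a sketch; the file names are placeholders):
import pandas as pd

df = pd.read_csv("classification_dataset.csv")
df["label"] = df["label"].map(normalize)  # e.g. "Positive " -> "positive"
df.to_csv("classification_dataset_clean.csv", index=False)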

Pretokenized Datasets for Large-Scale Training (Advanced)

For large training jobs that need to process over 1GB of training data, we suggest creating a pretokenized dataset using the HuggingFace Datasets package. Once you’ve written your dataset using the save_to_disk helper function and uploaded it to an external data source like S3, you can connect it in Predibase for fine-tuning provided it follows the schema below:
  • input_ids: The input prompt and expected response, both tokenized and concatenated together
  • labels: The labels array must have the same length as input_ids. For instruction tuning, set labels to -100 for prompt tokens and use the completion tokens as is. For continued pretraining, labels should be identical to input_ids. Any label with value -100 will be ignored during loss calculation.
  • split (optional): Should be either train or evaluation. To learn more, check out this section. We recommend skipping the split column for large-scale training.
Though the Datasets library allows for flexible schemas, to use your dataset in Predibase, it must contain columns input_ids and labels. All other columns will be ignored.
When using pretokenized datasets, the task type specified in the config is effectively ignored. Loss computation is based entirely on the labels provided. If you are unfamiliar with properly formatting a pretokenized dataset for your task, we highly encourage you to consider using a text dataset and not a pretokenized one.
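As a rough sketch of the schema above for instruction tuning (the model id and helper function are assumptions; adapt the tokenization to your own task):
from datasets import Dataset
from transformers import AutoTokenizer

# Assumed Hugging Face id; substitute the tokenizer for your base model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

def build_row(prompt: str, completion: str) -> dict:
    # Tokenize the prompt and completion separately, then concatenate.
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]
    # Mask prompt tokens with -100 so only completion tokens contribute to the loss.
    return {
        "input_ids": prompt_ids + completion_ids,
        "labels": [-100] * len(prompt_ids) + completion_ids,
    }

rows = [build_row("What is your name?", "Hi, I'm Pred!")]
Dataset.from_list(rows).save_to_disk("pretokenized_dataset")  # upload this directory to, e.g., S3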

Next Steps

For more guidance on preparing your datasets, including best practices and recommendations for train/evaluation splits, see our Fine-tuning Dataset Preparation Best Practices Guide.