Prepare Data
One of the necessary ingredients before you can begin fine-tuning is creating and preparing a training dataset. While gathering or constructing the dataset, keep the following principles in mind:
- The dataset should contain a diverse set of examples
- The dataset should contain at least 500-1,000 representative examples, though additional high-quality examples may help increase performance
- The examples in the training dataset should be similar to the requests at inference time
Dataset Formats
Predibase supports many connectors including File Upload, Snowflake, Databricks, Amazon S3 and others. For file uploads, the supported data formats include but are not limited to:
- JSON
- JSONL
- CSV
- Parquet
- MS Excel
There is a 100MB limit for file uploads. If your dataset is larger than 100MB, use a cloud storage connector (e.g., Amazon S3, BigQuery) to connect your data to Predibase.
Upload Dataset
You can connect your dataset to Predibase via the UI or Python SDK. To upload a file via the SDK, you can use:
from predibase import PredibaseClient

pc = PredibaseClient()  # assumes your Predibase credentials are already configured
dataset = pc.upload_file('{Path to local file}', 'Code Alpaca Dataset')
If you want to connect data from remote storage, please use the Web UI and navigate to Data > Connect Data. In the near future, we will add support for connecting data from remote storage directly in the Predibase SDK.
Troubleshooting and Common Errors
If your dataset fails to upload, a quick way to diagnose the problem is to try reading the file locally with pandas:
import pandas as pd

# Try reading the file as a single JSON document:
pd.read_json("path/to/dataset.json")

# If that fails, try reading it as JSONL (one JSON object per line):
pd.read_json("path/to/dataset.json", lines=True)
ValueError: Trailing data
- This generally occurs when a dataset formatted as a JSONL file is uploaded with a .json file extension. Try uploading the same dataset with the .jsonl file extension.
No ':' found when decoding object value
- This generally occurs with malformed JSON. Check that the dataset file is formatted correctly.
C error: Expected x fields in line y, saw z
- This generally occurs when one or more rows in the dataset contains too many or too few entries. Check the error message for the problematic line and make sure that it is formatted correctly. Also, make sure the dataset is formatted as specified in the section below (How to Structure Your Dataset)
ValueError: Expected object or value
- This generally occurs with malformed JSON. Check if the code snippet above can properly read the dataset. The problem may involve the encoding or the structure of the json file.
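If pandas cannot pinpoint the issue, a line-by-line check is a simple way to find the first malformed row in a JSONL file. Below is a minimal sketch; the file path is a placeholder for your own dataset:
import json

# Report the first line in a JSONL file that fails to parse as JSON.
with open("path/to/dataset.jsonl") as f:
    for i, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f"Line {i} is not valid JSON: {e}")
            break
    else:
        print("All lines parsed successfully.")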
How to Structure Your Dataset
We recently launched a major overhaul of our fine-tuning experience, which removes prompt template configuration from fine-tuning jobs. Instead, you will need to format your data ahead of time and upload it to Predibase. We've created this helper notebook to help you convert existing datasets to the new format.
Instruction Fine-Tuning Dataset Fields
For instruction fine-tuning, your dataset must contain two columns named prompt and completion:
- prompt: Your input prompt. It serves as the starting point or the guiding information for the model.
- completion: The expected response that corresponds to the input provided in the "prompt" column.
- split (optional): Should be either train or evaluation. To learn more, check out this section.
Any additional columns in your dataset will not be used.
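For illustration, here is one way to produce a dataset in this format as a JSONL file. The source records and field names below are hypothetical placeholders for your own data:
import json

# Hypothetical source records; replace with your own data.
records = [
    {"instruction": "Write a Python function that reverses a string.",
     "answer": "def reverse(s):\n    return s[::-1]"},
]

# Write one JSON object per line with the required prompt and completion fields.
with open("instruction_dataset.jsonl", "w") as f:
    for r in records:
        row = {"prompt": r["instruction"], "completion": r["answer"]}
        f.write(json.dumps(row) + "\n")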
(Beta) Text Completion Dataset Fields
For text completion, your dataset must contain one column named text:
- text: Your input text. The model will learn to do next-token prediction on these inputs.
- split (optional): Should be either train or evaluation. To learn more, check out this section.
Any additional columns in your dataset will not be used.
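For illustration, a minimal text completion dataset (with hypothetical content) looks like this in JSONL form:
{"text": "The quick brown fox jumps over the lazy dog.", "split": "train"}
{"text": "Pack my box with five dozen liquor jugs.", "split": "evaluation"}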
(Beta) Chat Dataset Fields
For chat fine-tuning, your dataset must contain one column named messages:
- messages: Conversations in a JSON-style format with defined roles, which should be familiar to users who have worked with the OpenAI chat format. Each row must contain at least one user role and one assistant role. An optional weight (0 or 1) can be passed for assistant messages to determine whether they are used when calculating loss (0 means no, 1 means yes; defaults to 1).
Example of chat dataset:
{"messages": [{"role": "system", "content": "You are a chatbot named Pred"}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hi, I'm Pred!"}]}
{"messages": [{"role": "system", "content": "You are a chatbot named Pred"}, {"role": "user", "content": "How are you?"}, {"role": "assistant", "content": "I'm doing well how are you?", "weight": 0}]}
To run chat fine-tuning, users should fine-tune an instruction-tuned model on a chat dataset. Predibase will automatically infer the task from the dataset type, switching from instruction tuning to chat fine-tuning.
Instruction Formatting
Experimentation shows that using model-specific instruction templates significantly boosts performance.
To learn more about how to use instruction templates with your data to improve performance, you can read more here.
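As one illustration (not a Predibase-specific template), Llama-2-chat expects prompts wrapped in its [INST] format; check the documentation for your chosen base model to find its exact template:
def format_llama2_prompt(user_prompt: str, system_prompt: str = "") -> str:
    # Wrap a raw prompt in the Llama-2 chat instruction template.
    # Illustrative only; other models use different templates.
    system_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n" if system_prompt else ""
    return f"<s>[INST] {system_block}{user_prompt} [/INST]"

print(format_llama2_prompt("Summarize the following support ticket: ..."))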
Pretokenized Data
If you have pretokenized your data and want to use it for fine-tuning, your dataset must contain two columns named input_ids and labels.
- input_ids: The input prompt and expected response, both tokenized and concatenated
- labels: The target token IDs used for loss computation; their contents vary depending on the desired loss calculation
- split (optional): Should be either train or evaluation. To learn more, check out this section.
For fine-tuning, you do not have to specify anything else. If your dataset has input_ids and labels as columns, preprocessing during fine-tuning will default to pretokenized preprocessing, EVEN IF you have the columns needed for other dataset types (i.e., prompt and completion for Instruction Fine-Tuning).
Note: when using pretokenized datasets, the task type specified in the config is effectively ignored. Loss computation is based entirely on the labels provided. If you are unfamiliar with properly formatting a pretokenized dataset for your task, we highly encourage you to consider using a text dataset and not a pretokenized one.
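As a rough sketch of how a pretokenized row might be constructed, assuming a Hugging Face tokenizer and the common convention of masking prompt tokens with -100 so that only the completion contributes to the loss (confirm this matches the loss behavior you want):
from transformers import AutoTokenizer

# Any tokenizer matching your base model; gpt2 is used here purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "Translate to French: Hello, world."
completion = " Bonjour, le monde."

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]

# input_ids: the prompt and expected response, tokenized and concatenated.
input_ids = prompt_ids + completion_ids

# labels: mask prompt tokens with -100 so loss is computed only on the completion.
labels = [-100] * len(prompt_ids) + completion_ids

row = {"input_ids": input_ids, "labels": labels}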
Train and Evaluation Splits
Once your dataset is formatted, you can optionally specify a split column.
- No Split Column: If no split column is specified in your dataset, all of your data will be used as training data and only training set metrics will be reported. The checkpoint used for inference will be the final checkpoint since there is no evaluation set to distinguish the optimal checkpoint.
- Split Column: If you do choose to specify a split column, the dataset should be divided into two parts: one for train and one for evaluation. The checkpoint used for inference will be the checkpoint with the best performance (lowest loss) on the evaluation set.
How Evaluation Sets are Used in Predibase
When you start fine-tuning, we'll give you updates on how the model is performing on both splits. This helps you see how well it's learning on the training data and performing on data it hasn't seen before (i.e. evaluation data).
To specify dataset splits in Predibase, we recommend the following (an example snippet is shown after the note below):
- Set aside 80% of your data as the training set and 20% as your evaluation set. For smaller datasets, set aside 90% for training and 10% for evaluation.
- Create a new column in your dataset called split.
- Set the split value to train for rows that should be used as training data.
- Set the split value to evaluation for rows that should be used for evaluation.
Note: train and evaluation are the only two values supported.
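Here is one way to add a split column following the 80/20 recommendation above; a minimal sketch using pandas, with placeholder file paths:
import pandas as pd

df = pd.read_json("path/to/dataset.jsonl", lines=True)

# Shuffle, then assign 80% of rows to train and the remaining 20% to evaluation.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
n_train = int(len(shuffled) * 0.8)
shuffled["split"] = ["train"] * n_train + ["evaluation"] * (len(shuffled) - n_train)

shuffled.to_json("dataset_with_split.jsonl", orient="records", lines=True)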