Prepare Data

One of the necessary ingredients before you can begin fine-tuning is creating and preparing a training dataset. While gathering or constructing the dataset, keep the following principles in mind:

  • The dataset should contain a diverse set of examples
  • The dataset should have at minimum 500-1000 representative examples, though more high-quality examples may help increase performance
  • The examples in the training dataset should be similar to the requests at inference time

Dataset Formats

Predibase supports many connectors including File Upload, Snowflake, Databricks, Amazon S3 and others. For file uploads, the supported data formats include but are not limited to:

  • JSON
  • JSONL
  • CSV
  • Parquet
  • MS Excel

There is a limit of 100MB for file uploads. If your dataset is larger than 100MB, try using cloud storage (e.g. Amazon S3 or BigQuery) to connect your data to Predibase.

Upload Dataset

You can connect your dataset to Predibase via the UI or Python SDK. To upload a file via the SDK, you can use:

from predibase import PredibaseClient

pc = PredibaseClient()  # client for your Predibase account (uses your configured API token)
dataset = pc.upload_file('{Path to local file}', 'Code Alpaca Dataset')

If you want to connect data from remote storage, please use the Web UI and navigate to Data > Connect Data. In the near future, we will be adding support to connect data from remote storage directly in the Predibase SDK.

Troubleshooting and Common Errors
Here is a list of common errors that you might run into during file uploads. The most common issues are JSON(L) formatting errors. Try running the code snippet below first to see if your dataset is properly formatted.
import pandas as pd

# For a standard JSON file:
pd.read_json("path/to/dataset.json")

# For a JSONL file (one JSON object per line):
pd.read_json("path/to/dataset.json", lines=True)
  • ValueError: Trailing data
    • This generally occurs when a dataset formatted as a JSONL file is uploaded with a .json file extension. Try uploading the same dataset with the .jsonl file extension.
  • No ':' found when decoding object value
    • This generally occurs with malformed JSON. Check that the dataset file is formatted correctly.
  • C error: Expected x fields in line y, saw z
    • This generally occurs when one or more rows in the dataset contain too many or too few entries. Check the error message for the problematic line and make sure that it is formatted correctly. Also, make sure the dataset is formatted as specified in the How to Structure Your Dataset section below.
  • ValueError: Expected object or value
    • This generally occurs with malformed JSON. Check if the code snippet above can properly read the dataset. The problem may involve the encoding or the structure of the json file.
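If pandas reports a malformed file, a quick line-by-line check can pinpoint the offending rows. The helper below is a sketch using only the standard library (the function name and file path are illustrative):

```python
import json

def find_bad_jsonl_lines(path):
    """Return (line_number, error_message) pairs for lines that fail to parse as JSON."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:  # skip blank lines
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                bad.append((i, str(e)))
    return bad
```

Running this on a JSONL file returns the exact line numbers that need fixing, which is usually faster than re-uploading and re-reading the error message.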

How to Structure Your Dataset

Convert your dataset to our new format

We just launched a major overhaul of our fine-tuning experience, which removes the need to configure prompt templates as part of your fine-tuning job. Instead, you will need to format your data ahead of time and upload it to Predibase. We've created this helper notebook to convert existing datasets to the new format.

Colab Notebook: Dataset Prep Helper

Instruction Fine-Tuning Dataset Fields

For instruction fine-tuning, your dataset must contain two columns named prompt and completion:

  • prompt: Your input prompt. It serves as the starting point or the guiding information for the model.
  • completion: The expected response that corresponds to the input provided in the "prompt" column.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.

Any additional columns in your dataset will not be used.
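As an example, a minimal instruction fine-tuning file can be written with pandas. The two rows and the output filename below are illustrative; only the prompt and completion column names come from the required schema:

```python
import pandas as pd

# Two illustrative training examples in the required schema.
df = pd.DataFrame(
    {
        "prompt": [
            "Write a Python function that reverses a string.",
            "What is the capital of France?",
        ],
        "completion": [
            "def reverse(s):\n    return s[::-1]",
            "The capital of France is Paris.",
        ],
    }
)

# JSONL: one JSON object per line.
df.to_json("instruction_dataset.jsonl", orient="records", lines=True)
```

Reading the file back with pd.read_json(..., lines=True) is a quick way to confirm it parses before uploading.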

(Beta) Text Completion Dataset Fields

For text completion, your dataset must contain a single column named text:

  • text: Your input text. The model will learn to do next-token prediction on these inputs.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.

Any additional columns in your dataset will not be used.

Instruction Formatting

Experimentation shows that using model-specific instruction templates significantly boosts performance.

To learn more about how to use instruction templates with your data to improve performance, you can read more here.

Train and Evaluation Splits

Once your dataset is formatted, you have the option to specify an optional split column.

  • No Split Column: If no split column is specified in your dataset, all of your data will be used as training data and only training set metrics will be reported. The checkpoint used for inference will be the final checkpoint since there is no evaluation set to distinguish the optimal checkpoint.
  • Split Column: If you do choose to specify a split column, the dataset should be divided into two parts: one for train and one for evaluation. The checkpoint used for inference will be the checkpoint with the best performance (lowest loss) on the evaluation set.

How Evaluation Sets are Used in Predibase

When you start fine-tuning, we'll give you updates on how the model is performing on both splits. This helps you see how well it's learning on the training data and performing on data it hasn't seen before (i.e. evaluation data).

To specify dataset splits in Predibase, our recommendation is to:

  • Set aside 80% of your data as the training set and 20% as your evaluation set. For smaller datasets, set aside 90% for training and 10% for evaluation.
  • Create a new column in your dataset called split.
  • Set the split value for rows that should be used as training data as train.
  • Set the split value for rows that should be used for evaluation as evaluation.

Note that train and evaluation are the only two values supported.
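The recommended 80/20 split above can be added with a few lines of pandas. This is a sketch; the function name, seed, and split fraction are illustrative:

```python
import pandas as pd

def add_split_column(df, train_frac=0.8, seed=42):
    """Randomly assign each row to 'train' or 'evaluation' in a new 'split' column."""
    # Shuffle so the split is random rather than positional.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * train_frac)
    shuffled["split"] = ["train"] * n_train + ["evaluation"] * (len(shuffled) - n_train)
    return shuffled
```

For smaller datasets, pass train_frac=0.9 to match the 90/10 recommendation above.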