Prepare Data

One of the necessary ingredients before you can begin fine-tuning is creating and preparing a training dataset. While gathering or constructing the dataset, aim to keep the following principles in mind:

  • The dataset should contain a diverse set of examples
  • The dataset should contain at least 50-100 examples, though more high-quality examples may further improve performance
  • The examples in the training dataset should be similar to the requests at inference time

Dataset Formats

Predibase supports many connectors, including File Upload, Snowflake, Databricks, Amazon S3, and others. For file uploads, the supported data formats include but are not limited to:

  • JSON
  • JSONL
  • CSV
  • Parquet
  • MS Excel

There is a 100MB limit for file uploads. If your dataset is larger than 100MB, use a cloud storage connector (e.g., Amazon S3 or BigQuery) to connect your data to Predibase.

Upload Dataset

You can connect your dataset to Predibase via the UI or Python SDK. To upload a file via the SDK, you can use:

from predibase import PredibaseClient

pc = PredibaseClient()  # assumes your Predibase API token is already configured
dataset = pc.upload_file('{Path to local file}', 'Code Alpaca Dataset')

If you want to connect data from remote storage, use the Web UI and navigate to Data > Connect Data. In the near future, we will add support for connecting to remote storage directly from the Predibase SDK.

How to Structure Your Dataset

Convert your dataset to our new format

We just launched a major overhaul of our fine-tuning experience, which removes prompt template configuration from fine-tuning jobs. Instead, you need to format your data ahead of time and upload it to Predibase. We've created the helper notebook below to convert existing datasets to the new format.

Colab Notebook: Dataset Prep Helper

Dataset Fields

For instruction fine-tuning, your dataset must contain two columns named prompt and completion:

  • prompt: Your input prompt. It serves as the starting point or the guiding information for the model.
  • completion: The expected response that corresponds to the input provided in the prompt column.
  • split (optional): Should be either train or evaluation. To learn more, check out this section.

Any additional columns in your dataset will not be used.
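
For illustration, a few rows of a JSONL dataset using these fields might look like the following (the prompts and completions shown are hypothetical):

{"prompt": "Write a Python function that reverses a string.", "completion": "def reverse_string(s):\n    return s[::-1]", "split": "train"}
{"prompt": "Write a SQL query that counts the rows in a table named users.", "completion": "SELECT COUNT(*) FROM users;", "split": "train"}
{"prompt": "Write a Python one-liner that sums a list of numbers.", "completion": "total = sum(numbers)", "split": "evaluation"}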

Instruction Formatting

Experimentation shows that using model-specific instruction templates significantly boosts performance.

To learn more about how to use instruction templates with your data to improve performance, you can read more here.
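
As a rough sketch, here is one way you might apply an instruction template to your prompt column before uploading. The Llama-2 [INST] format below is purely illustrative; the correct template depends on your base model, and the filenames and helper function are hypothetical:

import pandas as pd

def apply_llama2_template(prompt: str) -> str:
    # Wrap the raw prompt in Llama-2's instruction delimiters.
    return f"[INST] {prompt} [/INST]"

df = pd.read_json("dataset.jsonl", lines=True)  # hypothetical filename
df["prompt"] = df["prompt"].apply(apply_llama2_template)
df.to_json("dataset_templated.jsonl", orient="records", lines=True)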

Train and Evaluation Splits

Once your dataset is formatted, you can optionally specify a split column.

  • No Split Column: If no split column is specified in your dataset, all of your data will be used as training data and only training set metrics will be reported. The checkpoint used for inference will be the final checkpoint since there is no evaluation set to distinguish the optimal checkpoint.
  • Split Column: If you do choose to specify a split column, the dataset should be divided into two parts: one for train and one for evaluation. The checkpoint used for inference will be the checkpoint with the best performance (lowest loss) on the evaluation set.

How Evaluation Sets are Used in Predibase

When you start fine-tuning, we'll give you updates on how the model is performing on both splits. This helps you see how well it's learning on the training data and performing on data it hasn't seen before (i.e. evaluation data).

To specify dataset splits in Predibase, we recommend the following:

  • Set aside 80% of your data as the training set and 20% as your evaluation set. For smaller datasets, set aside 90% for training and 10% for evaluation.
  • Create a new column in your dataset called split.
  • Set the split value to train for rows that should be used as training data.
  • Set the split value to evaluation for rows that should be used for evaluation.

Note: train and evaluation are the only two supported values.
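
As a concrete sketch of this recommendation, the following pandas snippet shuffles a dataset and assigns an 80/20 split (the filenames are hypothetical; adjust the ratio for smaller datasets):

import pandas as pd

df = pd.read_csv("code_alpaca.csv")  # hypothetical filename

# Shuffle, then mark the first 80% of rows as train and the rest as evaluation.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
n_train = int(len(df) * 0.8)
df["split"] = ["train"] * n_train + ["evaluation"] * (len(df) - n_train)

df.to_csv("code_alpaca_with_split.csv", index=False)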