Prepare Data

Before you can begin fine-tuning, you need a training dataset. While gathering or constructing the dataset, keep the following principles in mind:

  • The dataset should contain a diverse set of examples
  • The dataset should contain at least 50-100 examples, though more high-quality examples may help increase performance
  • The examples in the training dataset should be similar to the requests at inference time

Dataset Formats

Predibase supports many connectors, including File Upload, Snowflake, Databricks, Amazon S3, and others. For file uploads, the supported data formats include, but are not limited to:

  • CSV
  • Parquet
  • HDF5
  • ORC
  • JSON
  • MS Excel
  • Python Pickle

There is a limit of 100MB for file uploads. If your dataset is larger than 100MB, use cloud storage (e.g. Amazon S3, BigQuery, etc.) to connect your data to Predibase.
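If you go the cloud storage route, a minimal sketch of staging a local Parquet file in Amazon S3 before connecting the bucket through the S3 connector might look like the following (the bucket name and object key are placeholders, not values Predibase requires):

import boto3

# Upload a local dataset file to S3 so it can be connected to Predibase
# via the Amazon S3 connector. Bucket and key names are placeholders.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="code_alpaca.parquet",      # local dataset file
    Bucket="my-finetuning-datasets",     # your S3 bucket
    Key="datasets/code_alpaca.parquet",  # object key inside the bucket
)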

Upload Dataset

You can connect your dataset to Predibase via the UI or Python SDK. To upload a file via the SDK, you can use:

dataset = pc.upload_file('{Path to local file}', 'Code Alpaca Dataset')
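For additional context, a fuller sketch might look like the example below; the PredibaseClient import and token handling follow the SDK quickstart pattern, so double-check the exact names against the SDK reference for your version:

from predibase import PredibaseClient

# Authenticates using your Predibase API token (e.g. from the
# PREDIBASE_API_TOKEN environment variable or passed explicitly).
pc = PredibaseClient()

# Upload a local file and register it as a named dataset.
dataset = pc.upload_file("code_alpaca.csv", "Code Alpaca Dataset")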

How to Structure Your Dataset

Convert your dataset to our new format

In mid-April, we will be launching a major overhaul of our fine-tuning experience, which includes removing prompt template configuration from the fine-tuning job. Instead, users will need to include prompt templates in their dataset. We've created this helper notebook to convert existing datasets to the new format.

Colab Notebook: Dataset Prep Helper

Currently Predibase supports instruction fine-tuning. (Other fine-tuning tasks such as completions and DPO are coming soon!)

For instruction fine-tuning, your dataset must contain two columns named "prompt" and "completion":

  • prompt: the fully materialized input to the model
  • completion: the desired response
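
As a concrete illustration, here is a small sketch of assembling an instruction dataset with the two required columns using pandas (the example rows and file name are placeholders):

import pandas as pd

# Each row holds the fully materialized prompt and the desired completion.
rows = [
    {
        "prompt": "Summarize the following sentence in five words or fewer:\n"
                  "The quick brown fox jumps over the lazy dog.\nSummary:",
        "completion": "Fox jumps over lazy dog.",
    },
    {
        "prompt": "Translate to French:\nGood morning.\nTranslation:",
        "completion": "Bonjour.",
    },
]

df = pd.DataFrame(rows, columns=["prompt", "completion"])
df.to_csv("instruction_dataset.csv", index=False)  # ready to upload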

How to configure the prompt_template parameter when fine-tuning (to be deprecated in mid-April):

When fine-tuning in the UI or SDK, include the following as your prompt template.

{prompt}

Deprecation Warning

In mid-April, we will be launching a major overhaul of our fine-tuning experience, which includes removing prompt template configuration from the fine-tuning job. For now, you still need to specify "{prompt}" as your prompt template; after this release, your prompt column, which contains your fully rendered prompt, will automatically be used as the single input feature.
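Until that release ships, a fine-tuning call configured this way might look roughly like the sketch below; the LLM and finetune names follow the current Python SDK examples, so verify the exact signature against the SDK reference for your version:

# Kick off instruction fine-tuning with the pass-through prompt template.
# Method and parameter names are based on current Python SDK examples and
# may differ slightly between SDK versions.
llm = pc.LLM("hf://mistralai/Mistral-7B-v0.1")
job = llm.finetune(
    prompt_template="{prompt}",   # pass the prompt column through unchanged
    target="completion",          # column containing the desired response
    dataset=dataset,              # the dataset uploaded earlier
)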

Examples

Example: Named entity recognition (CoNLL++) for Mistral-7B

prompt: Your task is a Named Entity Recognition (NER) task. Predict the category of each entity, then place the entity into the list associated with the category in an output JSON payload. Below is an example: \n Input: EU rejects German call to boycott British lamb. Output: {"person": [], "organization": ["EU"], "location": [], "miscellaneous": ["German", "British"]} \n Now, complete the task. \n Input: EU rejects German call to boycott British lamb. Output:
completion: {"person": [], "organization": ["EU"], "location": [], "miscellaneous": ["German", "British"]}

prompt: Your task is a Named Entity Recognition (NER) task. Predict the category of each entity, then place the entity into the list associated with the category in an output JSON payload. Below is an example: \n Input: EU rejects German call to boycott British lamb. Output: {"person": [], "organization": ["EU"], "location": [], "miscellaneous": ["German", "British"]} \n Now, complete the task. \n Input: Peter Blackburn Output:
completion: {"person": ["Peter Blackburn"], "organization": [], "location": [], "miscellaneous": []}
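
If you generate rows like these programmatically, one way to render the NER prompt from a reusable template is sketched below; the template string mirrors the example above, and the make_row helper is illustrative rather than part of the Predibase SDK:

import json

# One-shot NER prompt template; the {text} placeholder is filled per example.
NER_TEMPLATE = (
    "Your task is a Named Entity Recognition (NER) task. Predict the category "
    "of each entity, then place the entity into the list associated with the "
    "category in an output JSON payload. Below is an example: \n "
    "Input: EU rejects German call to boycott British lamb. "
    'Output: {"person": [], "organization": ["EU"], "location": [], '
    '"miscellaneous": ["German", "British"]} \n Now, complete the task. \n '
    "Input: {text} Output:"
)

def make_row(text, entities):
    """Render one prompt/completion pair for the NER task."""
    return {
        "prompt": NER_TEMPLATE.replace("{text}", text),
        "completion": json.dumps(entities),
    }

row = make_row(
    "Peter Blackburn",
    {"person": ["Peter Blackburn"], "organization": [],
     "location": [], "miscellaneous": []},
)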

Example with split column: Question answering (drop) for Llama chat models

Upcoming change regarding training splits

In mid-April, we will be launching a major overhaul of our fine-tuning experience, which includes increased transparency and control over how splits work.

By default, you do not need to provide a split column, and your entire dataset will be used as the training set. If you'd like to see evaluation metrics, include an evaluation set by adding a "split" column whose values are "train" or "evaluation". If an evaluation set is specified, the checkpoint with the lowest evaluation loss is selected as the best checkpoint and used for inference; otherwise, the last checkpoint is used.

prompt: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible. If you don't know the answer to a question, please don't share false information. \n <</SYS>> \n Given a passage, you need to accurately identify and extract relevant spans of text that answer specific questions. Provide concise and coherent responses based on the information present in the passage. \n ### Passage: There were 153,791 households, out of which 44,762 (29.1%) had children under the age of 18 living in them, 50,797 (33.0%) were married couples living together, 24,122 (15.7%) had a female householder with no husband present, 8,799 (5.7%) had a male householder with no wife present. There were 11,289 (7.3%) POSSLQ, and 3,442 (2.2%) same-sex partnerships. 52,103 households (33.9%) were made up of individuals and 13,778 (9.0%) had someone living alone who was 65 years of age or older. The average household size was 2.49. There were 83,718 families (54.4% of all households); the average family size was 3.27. \n ### Question: Were there more POSSLQ or same-sex partnerships? \n ### Answer:
completion: POSSLQ
split: train

prompt: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible. If you don't know the answer to a question, please don't share false information. \n <</SYS>> \n Given a passage, you need to accurately identify and extract relevant spans of text that answer specific questions. Provide concise and coherent responses based on the information present in the passage. \n ### Passage: Hoping to increase their winning streak the Texans played on home ground for an Inter-conference duel with the Cowboys. Houston took the early lead in the 1st quarter when kicker Neil Rackers hit a 24-yard field goal. Then they fell behind with RB Marion Barber getting a 1-yard TD run, followed by kicker David Buehler's 49-yard field goal. The Texans struggled further in the third quarter when QB Tony Romo completed a 15-yard TD pass to WR Roy E. Williams. Houston replied with Rackers nailing a 30-yard field goal, but Dallas continued to score when Romo found Williams again on a 63-yard TD pass. Then David Buehler made a 40-yard field goal. The Texans would finally score when QB Matt Schaub made a 7-yard TD pass to WR Kevin Walter. \n ### Question: What are the two shortest touchdown passes made? \n ### Answer:
completion: 7-yard
split: evaluation
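
To create a split like this programmatically, one simple approach (the 10% evaluation fraction and random seed are arbitrary choices) is to sample a holdout set with pandas:

import pandas as pd

# Load a dataset with prompt/completion columns, like the one sketched earlier.
df = pd.read_csv("instruction_dataset.csv")

# Hold out roughly 10% of rows for evaluation; everything else stays in train.
eval_index = df.sample(frac=0.1, random_state=42).index
df["split"] = "train"
df.loc[eval_index, "split"] = "evaluation"

df.to_csv("instruction_dataset_with_split.csv", index=False)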

FAQs

What fine-tuning tasks do you currently support?

Predibase today supports instruction fine-tuning. In the near future, we'll be adding continued pre-training (completions), chat, DPO, and other fine-tuning tasks.

Do I need to include model-specific chat templates in the prompts in my datasets? (Ex. "<s>[INST] prompt [/INST]")

In our own experiments, we found minimal performance differences from including model-specific chat templates during instruction fine-tuning, since the trained adapter learns to respond correctly with or without those additional tokens.

If you would like, you may include these model-specific templates in your prompts when fine-tuning. Different base models expect different templates, which you can find in the Prompt Engineering Guide (see the example for Mistral-7B-Instruct).

Regardless of your choice, your training and inference prompt templates must match to get the best performance. If you include model-specific templates in your training data, you will also need to include them when running inference.
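If you do include a model-specific template, one way to keep training and inference consistent is to apply the same wrapper function in both places; the sketch below uses the Mistral-7B-Instruct [INST] format mentioned above and is illustrative rather than a Predibase API:

def wrap_mistral_instruct(prompt: str) -> str:
    """Wrap a fully materialized prompt in Mistral-Instruct [INST] tags.

    Apply the same wrapper when building training rows and when prompting
    the fine-tuned adapter at inference time so the templates match.
    """
    return f"<s>[INST] {prompt} [/INST]"

wrapped = wrap_mistral_instruct("Classify the sentiment of: I love this product!")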

How many examples should I have in my dataset?

The answer depends on your task and the quality of your data. In general, we recommend at least a few hundred examples.