Dataset Requirements
What should my dataset look like?
Tabular format
Data should be in a tabular format to be used by Predibase. Typically, this means that the columns of your dataset represent the individual features in your data and the rows represent the individual records. Below is an example of a Twitter bots dataset with tabular and text features:
In this example, the columns (e.g. description, favorites_count, average_account_days) are the individual features we can choose to use for model training, and the rows are individual Twitter accounts.
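As a minimal, hypothetical illustration of this layout (the column names mirror the Twitter bots example above; the values and the is_bot target column are made up):

```python
import pandas as pd

# Hypothetical Twitter bots dataset: each row is an account, each column a feature.
df = pd.DataFrame(
    {
        "description": ["Coffee lover and data nerd", "Click here for FREE followers!!!"],
        "favorites_count": [1523, 2],
        "average_account_days": [812.4, 3.0],
        "is_bot": [0, 1],  # made-up target column for illustration
    }
)
print(df.head())
```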
Dataset Size
The Predibase platform is built for scale and is designed to handle large datasets (up to hundreds of GBs and beyond). That said, there are a few considerations to keep in mind as you connect your datasets:
- For file uploads, we support file sizes less than 100MB
- If your dataset is larger, we recommend using cloud storage (Snowflake, BigQuery, S3, etc.) to connect your data to Predibase
- For large CSV datasets (>1GB), we recommend using Parquet as the file format (a conversion sketch follows this list)
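For example, a large CSV can be converted to Parquet locally with pandas before connecting it; the file names below are placeholders, and a Parquet engine such as pyarrow must be installed:

```python
import pandas as pd

# Convert a large CSV to Parquet before connecting it to Predibase.
# File names are placeholders; requires a Parquet engine such as pyarrow.
df = pd.read_csv("my_large_dataset.csv")
df.to_parquet("my_large_dataset.parquet", index=False)
```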
How do I split my dataset?
Dataset Splitting
By default, Predibase will randomly split your dataset into train, validation, and test sets according to split probabilities, which default to 0.7, 0.1, and 0.2, respectively.
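Conceptually, this kind of probabilistic assignment looks like the following sketch (an illustration of the idea only, not Predibase's internal implementation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"feature": range(1000)})  # toy dataset

# Assign each row to train (0), validation (1), or test (2)
# using the default probabilities 0.7 / 0.1 / 0.2.
df["split"] = rng.choice([0, 1, 2], size=len(df), p=[0.7, 0.1, 0.2])
print(df["split"].value_counts(normalize=True))
```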
Apart from random splitting, there are a few other types of dataset splitting available within Predibase. All of these options can be configured while training a model under Parameters > Dataset Preprocessing > Split Options and are described below.
Fixed split
- Description: If you’d like to use a pre-defined split across experiments, provide an additional “split” column in your dataset and set the following values for each row:
- 0: train
- 1: validation
- 2: test
- Note: Your dataset must contain a train split; the validation and test splits are encouraged, but technically optional.
- How to use: Set split type to "fixed" and specify the name of your split column. By default, Predibase will use a column called "split". A minimal sketch of constructing such a column is shown after this list.
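Here is a minimal sketch of adding such a "split" column with pandas; the file name and the contiguous 70/10/20 assignment are just examples, and any assignment that fits your use case works:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("my_dataset.csv")  # placeholder file name

# 0 = train, 1 = validation, 2 = test.
# Example assignment: first 70% train, next 10% validation, last 20% test.
n = len(df)
split = np.full(n, 2)
split[: int(0.7 * n)] = 0
split[int(0.7 * n) : int(0.8 * n)] = 1
df["split"] = split

df.to_csv("my_dataset_with_split.csv", index=False)
```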
Stratified split
- Description: If you want to ensure that the distribution of a particular column is the same across splits (e.g. with imbalanced data), you can use a stratified split.
- How to use: Set split type to "stratify", specify the column to use for stratified splitting, and set the split probabilities for your train/val/test sets (the idea is sketched after this list).
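To illustrate the idea (this is not Predibase's internal mechanism), a stratified split can be built with scikit-learn's train_test_split, which preserves the label distribution in each set; the file name and the "label" column are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_dataset.csv")  # placeholder; assumes a "label" column to stratify on

# Carve out the 20% test set first, preserving the label distribution...
train_val, test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
# ...then split the remaining 80% into 70% train / 10% validation (0.1 / 0.8 = 0.125).
train, val = train_test_split(
    train_val, test_size=0.125, stratify=train_val["label"], random_state=42
)
```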
Datetime split
- Description: A common use case is splitting the data according to a datetime column when you want the splits to preserve temporal order.
- If we were to use a uniformly random split in these cases, the evaluation could give a false sense of confidence, since the model may not generalize well when the data distribution changes over time. Splitting the training data from the test data along the time dimension avoids this by showing how well the model should do on unseen data from the future.
- For datetime-based splitting, we order the data by date (ascending) and then split according to the split_probabilities. For example, if split_probabilities: [0.7, 0.1, 0.2], then the earliest 70% of the data will be used for training, the middle 10% for validation, and the last 20% for testing.
- How to use: Set split type to "datetime", specify the column to use for datetime splitting, and set the split probabilities for your train/val/test sets (see the sketch after this list).
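The order-then-slice behavior described above can be sketched roughly as follows (the file name and the created_at column are placeholders):

```python
import pandas as pd

df = pd.read_csv("my_dataset.csv", parse_dates=["created_at"])  # placeholder names

# Sort ascending by the datetime column, then slice by the split probabilities.
df = df.sort_values("created_at").reset_index(drop=True)
n = len(df)
train = df.iloc[: int(0.7 * n)]              # earliest 70%
val = df.iloc[int(0.7 * n) : int(0.8 * n)]   # middle 10%
test = df.iloc[int(0.8 * n) :]               # latest 20%
```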
Hash split
- Description: Hash split is useful for deterministically splitting on a unique ID. Even when additional rows are added to the dataset in the future, each ID will retain its original split assignment.
- This approach does not guarantee that the split proportions will be assigned exactly, but the larger the dataset, the more closely the assignment should match the given proportions.
- This strategy may be particularly useful if you need to ensure that a particular ID only appears in a single split (for example, a user ID with multiple rows, where we want to ensure the user isn't in both the train and test sets).
- How to use: Set split type to "hash", specify the column to hash on, and set the split probabilities for your train/val/test sets (a sketch of the idea follows this list).
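The idea behind a deterministic hash split can be sketched as follows; this illustrates the concept and is not necessarily the exact hash function Predibase uses:

```python
import hashlib

import pandas as pd


def hash_split(value, probabilities=(0.7, 0.1, 0.2)):
    """Deterministically map an ID to 0 (train), 1 (validation), or 2 (test)."""
    # Hash the ID to a stable integer, then scale it into [0, 1).
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    fraction = int(digest, 16) / 16 ** len(digest)
    cumulative = 0.0
    for split_index, p in enumerate(probabilities):
        cumulative += p
        if fraction < cumulative:
            return split_index
    return len(probabilities) - 1


# Toy data: the repeated user ID "u1" always lands in the same split.
df = pd.DataFrame({"user_id": ["u1", "u2", "u3", "u1"]})
df["split"] = df["user_id"].map(hash_split)
print(df)
```

Because the assignment depends only on the hashed ID, rows added later with the same ID keep the same split, and the realized proportions approach 0.7/0.1/0.2 as the number of distinct IDs grows.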
Can I upload a folder or zip file to Predibase?
Currently, we don't support uploading zip files or directories of files. If you are looking to use either of these, we recommend combining the individual files into a single tabular dataset or leveraging external connections before connecting to Predibase.
How do I use multi-modal / unstructured data?
While data should be in a tabular format, this does not mean you are limited to structured data. To use unstructured data within Predibase, ensure your dataset has columns that contain the unstructured data (text, images, audio, etc.). If your unstructured data lives in a separate location from the rest of your data, see external connections for more details.
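For example, an image classification dataset can still be tabular, with one column holding image paths (or URLs) and another holding the label; the paths and column names below are placeholders:

```python
import pandas as pd

# Unstructured data stays tabular: one column per modality.
df = pd.DataFrame(
    {
        "image_path": ["images/cat_001.jpg", "images/dog_042.jpg"],  # placeholder paths
        "caption": ["a cat sleeping on a couch", "a dog catching a frisbee"],
        "label": ["cat", "dog"],
    }
)
```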