Dataset Profiles
When you connect a dataset for the first time, Predibase computes a dataset profile using a sample of data.
Why profile your dataset?
Understanding your dataset is an essential step in every machine learning project.
Quality Assurance: Dataset profiling helps in identifying potential problems at an early stage. These can include missing values, incorrect entries, outliers, and inconsistencies which can negatively affect the performance of the machine learning models.
Data Understanding: Understanding the distribution of data and relationships between different variables in the dataset helps in gaining insights, which can be used for feature selection and feature engineering. For example, if two features are highly correlated, one of them might be redundant and could be removed to simplify the model.
Default Model Selection: Different types of data may require different preprocessing and modeling pipelines. For example, categorical data is typically encoded differently from numerical data and evaluated with different metrics, and some model configurations work better with balanced datasets than with imbalanced ones.
Where is the dataset profile?
The dataset profile can be found on the Preview tab of the Dataset Details page, or on the Model Builder page after dataset selection.
What's in a dataset profile?
Inferred type
Models trained on Predibase use Ludwig configurations. At a minimum, the config must specify the model's input and output features as well as their Ludwig types. For each column in your data, Predibase uses dataset profiling to infer which Ludwig type would be a good fit.
Data type assignments are used to establish default preprocessing, architecture, training, and evaluation policies when forming a complete training pipeline.
Hover over the type text to find out how the type was inferred.
Sometimes the inferred type may not be the best fit for a particular application or model architecture, and an alternative type may be viable. For example, a high-cardinality, long-tail category feature may be better modeled as a text feature. When overriding a type, we recommend checking the Data Preview tab to confirm that the values fit the chosen data type.
Types can be overridden manually in the Model Builder.
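As a rough illustration of the minimum config described above, the sketch below builds a Ludwig-style config as a Python dictionary and applies a type override. The column names (`age`, `product_code`, `churned`) and the helper `override_type` are hypothetical, and the override is only a config-level analogue of what the Model Builder does in the UI.

```python
import copy

# Minimal Ludwig-style config: input and output features with their types.
# Column names are hypothetical; the types are standard Ludwig feature types.
config = {
    "input_features": [
        {"name": "age", "type": "number"},
        {"name": "product_code", "type": "category"},
    ],
    "output_features": [
        {"name": "churned", "type": "binary"},
    ],
}


def override_type(cfg, name, new_type):
    """Return a copy of cfg in which the named feature has a new type."""
    updated = copy.deepcopy(cfg)
    for feature in updated["input_features"] + updated["output_features"]:
        if feature["name"] == name:
            feature["type"] = new_type
            return updated
    raise KeyError(f"no feature named {name!r}")


# Treat a high-cardinality, long-tail category column as text instead.
overridden = override_type(config, "product_code", "text")
print(overridden["input_features"][1])
```

The original `config` is left unchanged, which makes it easy to compare the inferred and overridden versions side by side.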
Missing Values
The percentage of missing values. If there are too many missing values, it may be better to exclude the feature entirely.
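A minimal sketch of how a missing-value percentage could be computed for one column. What counts as "missing" here (None, NaN, and empty or whitespace-only strings) is an assumption made for illustration, not a statement of Predibase's exact rules.

```python
import math


def missing_percentage(values):
    """Percentage of entries that are None, NaN, or blank strings."""

    def is_missing(value):
        if value is None:
            return True
        if isinstance(value, float) and math.isnan(value):
            return True
        return isinstance(value, str) and value.strip() == ""

    if not values:
        return 0.0
    return 100.0 * sum(is_missing(v) for v in values) / len(values)


# Hypothetical column with three missing entries out of eight.
column = [3.2, None, 4.1, float("nan"), "", 5.0, 2.7, 1.9]
print(missing_percentage(column))
```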
Distribution and Statistics
This includes mean, median, mode, standard deviation, and other descriptive statistics. These can provide insight into the range and variability of each feature. Statistics are computed for:
- values for number features
- frequencies of classes for category features
- sequence lengths for text features
Knowing the distribution of features helps you make informed decisions about data transformations like normalization or pruning.
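The three kinds of statistics listed above can be sketched with the Python standard library. The function names are hypothetical, and measuring text length by whitespace-separated tokens is an assumption; a real profiler may tokenize differently.

```python
import statistics
from collections import Counter


def number_stats(values):
    """Descriptive statistics over the raw values of a number feature."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # needs at least two values
        "min": min(values),
        "max": max(values),
    }


def category_stats(values):
    """Relative frequency of each class of a category feature."""
    counts = Counter(values)
    return {label: count / len(values) for label, count in counts.items()}


def text_length_stats(texts):
    """Statistics over sequence lengths (whitespace tokens) of a text feature."""
    return number_stats([len(text.split()) for text in texts])


print(number_stats([1, 2, 2, 3]))
print(category_stats(["a", "a", "b", "b"]))
print(text_length_stats(["short text", "a somewhat longer text"]))
```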
Outliers
Outliers can unduly influence some machine learning algorithms, leading to inaccurate models.
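One common way to flag outliers in a number feature is the interquartile range (IQR) rule. The sketch below is a generic illustration of that rule; it is not necessarily the method the dataset profile uses, and the multiplier `k=1.5` is a conventional default rather than a documented setting.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first/third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one suspicious entry.
data = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(data))
```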
Correlation
Knowing the relationships and correlations between features provides insight into the structure of the data. It can reveal multicollinearity, where two or more variables are highly correlated, and help prevent target leakage, where an input feature is tightly correlated with the output feature.
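As an illustration, the sketch below computes the Pearson correlation coefficient and flags features that are almost perfectly correlated with a target column. Pearson correlation, the helper names, and the `0.95` threshold are assumptions chosen for the example; the profile may use a different measure or cutoff.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def flag_correlated(columns, target, threshold=0.95):
    """Names of columns whose |r| with the target meets the threshold."""
    return [
        name for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical data: "doubled" is a rescaled copy of the target (leakage-like).
target = [2, 4, 6, 8]
columns = {"doubled": [1, 2, 3, 4], "shuffled": [4, 1, 3, 2]}
print(flag_correlated(columns, target))
```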
Cardinality
For category features, the number of distinct categories and their frequencies are important. For example, category features with high cardinality may require special handling, like establishing a rare class cutoff. Highly imbalanced category features may benefit from upsampling or regularization.
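A rare class cutoff can be sketched as follows: classes seen fewer than `min_count` times are merged into a single placeholder class, which lowers the cardinality. The function name, the `min_count` parameter, and the `<RARE>` label are hypothetical choices for this example.

```python
from collections import Counter


def collapse_rare(values, min_count=2, other_label="<RARE>"):
    """Replace classes seen fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical category column with two singleton classes.
colors = ["red", "red", "blue", "blue", "blue", "teal", "mauve"]
collapsed = collapse_rare(colors)
print(len(set(colors)), "->", len(set(collapsed)))
```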
Suitability
Whether the feature should be enabled by default. A feature may be disabled by default because it has a large percentage of missing values, appears noisy or ID-like, or is tightly correlated with the output feature (target leakage).
Hover over "enabled by default" to find out why the feature is enabled or disabled by default.
FAQ
How many samples does Predibase use for dataset profiles?
The dataset profile is computed using all examples in your dataset up to 10K examples. If your dataset is larger than 10K examples, then a random sample of 10K examples is used to calculate the profile.
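The sampling rule above can be sketched in a few lines. The function name and the `seed` parameter are hypothetical, and uniform sampling without replacement is an assumption about what "random sample" means here.

```python
import random

PROFILE_SAMPLE_SIZE = 10_000  # the 10K cap described above


def profiling_sample(rows, max_rows=PROFILE_SAMPLE_SIZE, seed=None):
    """Return all rows if there are at most max_rows, else a random subset."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, max_rows)  # uniform, without replacement


small = profiling_sample(list(range(500)))
large = profiling_sample(range(25_000), seed=0)
print(len(small), len(large))
```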
A sample of 10K examples is generally large enough to produce accurate profile statistics, even for datasets with tens or hundreds of millions of rows, provided that the dataset is not extremely varied.
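To make the sample-size reasoning concrete, the sketch below estimates two quantities for a sample of 10K rows: the 95% margin of error for an estimated class proportion, and the chance that a rare class does not appear in the sample at all. Both formulas are textbook approximations that assume simple random sampling from a much larger dataset; the function names are hypothetical.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n rows."""
    return z * math.sqrt(p * (1 - p) / n)


def prob_class_missed(p, n):
    """Probability that a class with true frequency p never appears in n rows."""
    return (1 - p) ** n


n = 10_000
# Worst case for a proportion is p = 0.5: just under one percentage point.
print(round(margin_of_error(0.5, n), 4))
# A class that makes up 0.01% of the data is absent from roughly a third of samples.
print(round(prob_class_missed(0.0001, n), 3))
```

The second number is the practical meaning of the "not extremely varied" caveat: common patterns are estimated well, while very rare classes or values may be missing from the profile.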