
Model Visualizations

Predibase offers a variety of out-of-the-box visualizations that showcase model performance. This page will explain these visualizations in more detail for those who may be unfamiliar.

Learning Curves​

πŸ’‘ Two-Line Summary: Learning curves visualize the performance of a model over many different training checkpoints. They typically show the values of one or more performance metrics (such as accuracy or loss) on the training, validation, and test sets.

  • What is a checkpoint? A checkpoint is a saved version of a model's weights and parameters at a certain point during training. This makes it possible to resume training from the saved point instead of starting over in case of interruption or failure, and to track the model's performance at different stages so you can compare checkpoints and choose the best-performing one. In Predibase, a checkpoint is saved after each evaluation over the train, validation, and test sets, usually at the end of each epoch, or more frequently if checkpoints_per_epoch or steps_per_checkpoint is set.

Here are the steps to interpret learning curves:

  1. Look at the overall shape of the curve. For a loss curve, a generally decreasing trend suggests that the model is improving as it sees more data, while a flat or increasing trend suggests that the model is not improving or may be overfitting. (For metrics where higher is better, such as accuracy, the directions are reversed.)
  2. Observe the difference between the training and validation curves. If the gap between the two is large, it suggests that the model is overfitting, as it is performing well on the training set but not on the validation set. If the gap is small, it suggests that the model is generalizing well to new data.
  3. Look for signs of underfitting or overfitting. If the model performs poorly even on the training set, it suggests that the model is underfitting and you might want to increase the capacity of the model. If the model's performance on the training set is much higher than its performance on the validation set, it suggests that the model is overfitting.
  4. Observe the rate of change of the curve. If the rate of change is high in the beginning and then slows down, it suggests that the model is quickly learning the basics of the task, but then hitting a plateau. If the rate of change is slow in the beginning and then speeds up, it suggests that the model is struggling to learn the basics of the task, but eventually makes progress.
  5. Observe the stopping point of the curve, i.e., the point at which the curve stops improving. This gives an idea of when to stop training and save the model, as we don't want to overfit.

It is important to keep in mind that learning curves can be influenced by many factors, such as the size and quality of the data, the complexity of the model, and the choice of hyperparameters. Therefore, it is important to consider the context of the problem and the specific model being used when interpreting learning curves.
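To make these steps concrete, here is a minimal, illustrative sketch in Python (not Predibase code) that plots hypothetical per-checkpoint training and validation losses and marks the checkpoint with the lowest validation loss, after which a widening gap between the two curves would signal overfitting.

```python
# Illustrative sketch with made-up loss values, not Predibase's implementation.
import matplotlib.pyplot as plt

# Hypothetical per-checkpoint loss values.
train_loss = [0.92, 0.61, 0.45, 0.36, 0.30, 0.26, 0.23, 0.21]
val_loss   = [0.95, 0.68, 0.55, 0.50, 0.49, 0.50, 0.53, 0.57]

checkpoints = range(1, len(train_loss) + 1)
# Checkpoint with the lowest validation loss: a natural stopping point.
best = min(checkpoints, key=lambda c: val_loss[c - 1])

plt.plot(checkpoints, train_loss, marker="o", label="train loss")
plt.plot(checkpoints, val_loss, marker="o", label="validation loss")
plt.axvline(best, linestyle="--", color="gray", label=f"best checkpoint ({best})")
plt.xlabel("checkpoint")
plt.ylabel("loss")
plt.legend()
plt.show()
```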

Confusion Matrix​

A confusion matrix is often used to describe the performance of a classification model on a set of test data for which the true values are known. It helps to understand the predictions made by the model and how well it is performing. The matrix shows the number of times instances of a class are correctly classified and misclassified.

High accuracy does not always indicate a good model, especially when the data is unbalanced. The confusion matrix allows us to understand the proportion of true positives and true negatives to get a better understanding of the model's performance.

πŸ’‘ Two-Line Summary: A confusion matrix shows the number of correct and incorrect predictions for each class. It helps to identify any class-specific biases or imbalances in the model's performance. A perfect classifier will produce a confusion matrix with values only on the diagonal.

Binary Confusion Matrix Plot​

πŸ’‘ Two-Line Summary: A binary confusion matrix shows the number of true positives, false positives, true negatives, and false negative predictions. It helps identify what classes the model is misclassifying. A perfect classifier will produce a confusion matrix with values only on the diagonal.

A binary confusion matrix plot is a way to visualize the performance of a binary classification model, where the model makes a prediction of one of two possible outcomes, such as "positive" or "negative".

Here are the steps to interpret a binary confusion matrix plot:

  1. The plot will have two columns ("Predicted Positive" and "Predicted Negative"), and two rows ("Actual Positive" and "Actual Negative").
  2. The upper left cell shows the number of True Positives. These are the cases where the model predicted the outcome to be positive and it was actually positive.
  3. The upper right cell shows the number of False Negatives. These are the cases where the model predicted the outcome to be negative but it was actually positive.
  4. The lower left cell shows the number of False Positives. These are the cases where the model predicted the outcome to be positive but it was actually negative.
  5. The lower right cell shows the number of True Negatives. These are the cases where the model predicted the outcome to be negative and it was actually negative.

It's important to note that a good model should have a high number of true positives and true negatives, and low numbers of false positives and false negatives.
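As an illustration, the following sketch (hypothetical labels, assuming scikit-learn is available) computes the four cells described above; the `labels=[1, 0]` argument orders the rows and columns so that the "positive" row and column come first, matching the layout above.

```python
# Illustrative only: computing the four cells of a binary confusion matrix
# from hypothetical labels with scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = positive, 0 = negative
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[1, 0], rows are actual [positive, negative] and
# columns are predicted [positive, negative].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")
```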

Multi-class Confusion Matrix Plot​

πŸ’‘ Two-Line Summary: The multiclass confusion matrix plot provides a visual representation of the accuracy of a multiclass classification model by displaying the number of correct and incorrect predictions for each class. It helps to identify any class-specific biases or imbalances in the model's performance.

A multi-class confusion matrix plot is a way to visualize the performance of a multi-class classification model, where the model predicts one of several possible outcomes. This is usually the case when the output feature is a category feature.

Here are the steps to interpret a multi-class confusion matrix plot:

  1. The plot will have multiple columns, one for each predicted class, and multiple rows, one for each actual class.
  2. Each cell in the matrix represents the number of instances that were predicted as a certain class and actually belong to a certain class.
  3. The diagonal cells represent the number of correct predictions, where the predicted class matches the actual class.
  4. The off-diagonal cells represent the number of incorrect predictions, where the predicted class does not match the actual class.
  5. With this information, we can calculate some important evaluation metrics such as accuracy, precision, recall, F1 score, etc.

Additionally, you can also look at the individual class performance by calculating the precision, recall, and F1 score for each class.
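For example, the following sketch (with a hypothetical confusion matrix, rows as actual classes and columns as predicted classes) shows how per-class precision, recall, and F1 can be derived directly from the matrix.

```python
# Illustrative sketch: per-class precision, recall, and F1 from a
# multi-class confusion matrix (rows = actual, columns = predicted).
import numpy as np

cm = np.array([
    [50,  3,  2],   # actual class 0
    [ 5, 40, 10],   # actual class 1
    [ 2,  8, 30],   # actual class 2
])

tp = np.diag(cm).astype(float)
precision = tp / cm.sum(axis=0)   # column sums = total predicted per class
recall = tp / cm.sum(axis=1)      # row sums = total actual per class
f1 = 2 * precision * recall / (precision + recall)

for i, (p, r, f) in enumerate(zip(precision, recall, f1)):
    print(f"class {i}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```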

Classes ranked by the entropy of confusion matrix row​

πŸ’‘ Two-Line Summary: The Classes ranked by the entropy of confusion matrix row plot is a visual representation of the class-wise uncertainty of a multi-class classification model, which is measured by the entropy of the predicted class probabilities. It helps to identify the most difficult classes for the model to predict.

In the context of a confusion matrix, entropy can be used to measure the uncertainty or confusion of a classifier's predictions for a particular class. This plot can be used to visualize the performance of a multi-class classification model. It plots the entropy of each row in the confusion matrix on the x-axis and the classes on the y-axis.

  1. x-axis: The x-axis represents the entropy of each row in the confusion matrix. Entropy is a measure of the uncertainty or randomness in the predictions of the model for each class. Higher entropy values indicate higher uncertainty in the model's predictions for that class.
  2. y-axis: The y-axis represents the different classes in the dataset. The classes are ranked by the entropy of their corresponding row in the confusion matrix, with the class with the highest entropy being plotted on the top and the class with the lowest entropy being plotted on the bottom.

The plot allows us to see which classes are being predicted accurately by the model and which classes are causing confusion in the predictions. Classes with high entropy indicate that the model is struggling to make accurate predictions for those classes and may require further analysis or refinement.
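As an illustration of how such a ranking can be computed, the sketch below (hypothetical confusion matrix and class names) normalizes each row of the confusion matrix into a probability distribution and ranks classes by the Shannon entropy of that distribution; this is a plausible way to build the ranking, not necessarily the exact implementation used by Predibase.

```python
# Illustrative sketch: ranking classes by the entropy of their
# (row-normalized) confusion matrix row.
import numpy as np

cm = np.array([
    [50,  3,  2],
    [20, 20, 15],
    [ 2,  8, 30],
])
class_names = ["cat", "dog", "bird"]  # hypothetical labels

row_probs = cm / cm.sum(axis=1, keepdims=True)
# Shannon entropy of each row; 0 * log(0) is treated as 0.
entropy = -np.sum(np.where(row_probs > 0, row_probs * np.log2(row_probs), 0.0), axis=1)

# Highest-entropy (most confused) classes first.
for name, h in sorted(zip(class_names, entropy), key=lambda t: -t[1]):
    print(f"{name}: entropy={h:.2f}")
```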

It's important to note that entropy is just one of the many ways to evaluate the performance of a multi-class classifier and it's important to use other evaluation metrics such as accuracy, precision, recall, F1 score, etc. to get a comprehensive understanding of the classifier performance.

ROC Curve​

πŸ’‘ Two-Line Summary: The ROC curve plots the true positive rate against the false positive rate at different classification thresholds. It provides insight into the trade-off between the true positive rate and the false positive rate and is summarized by the Area Under the Curve (AUC) AUC.

A ROC (Receiver Operating Characteristic) curve is a plot that shows the performance of a binary classification model by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. The false positive rate is the fraction of negative instances that are incorrectly classified as positive, while the true positive rate, also known as sensitivity or recall, is the fraction of positive instances that are correctly classified as positive.

Here are the steps to interpret a ROC curve plot:

  1. The x-axis represents the false positive rate (FPR) and the y-axis represents the true positive rate (TPR).
  2. The top left corner of the plot represents a perfect classifier, where the TPR is 1 and the FPR is 0.
  3. A classifier that is no better than random guessing will have a ROC curve that is a diagonal line from the bottom left corner to the top right corner of the plot.
  4. A classifier that is better than random guessing will have a ROC curve that is above the diagonal line.
  5. The closer the ROC curve is to the top left corner, the better the classifier is at distinguishing between the positive and negative classes.
  6. The AUC (Area Under the Curve) represents the overall performance of the classifier. The closer the AUC is to 1, the better the classifier is at distinguishing between the positive and negative classes.
  7. An AUC=1 indicates a perfect classifier, while an AUC=0.5 means the classifier is no better than random guessing. If AUC is < 0.5, it means that the model’s predictions are flipped, i.e., the model predicts positive for negative, and negative for positive.

It's important to note that the ROC curve gives an idea of how good the classifier is at balancing the trade-off between the true positive rate and the false positive rate, which can be especially useful in the case of imbalanced datasets.

It's also important to note that the ROC curve is only applicable to binary classification. For multi-class classification, other evaluation metrics such as accuracy, precision, recall, and F1 score should be used.
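For reference, here is a small sketch (hypothetical labels and scores, using scikit-learn) of how the FPR/TPR pairs that form a ROC curve, and the AUC that summarizes it, can be computed.

```python
# Illustrative sketch: ROC curve points and AUC from hypothetical scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]  # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```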

Precision Recall Curves​

πŸ’‘ Two-Line Summary: The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.

  1. Precision: Precision is the ratio of true positive predictions (correct positive predictions) to the total number of positive predictions made by the model. It measures how accurate the positive predictions are. Precision is calculated as: Precision = True Positives / (True Positives + False Positives)
  2. Recall: Recall is the ratio of true positive predictions to the total number of actual positive cases in the dataset. It measures how well the model is able to identify all positive cases. Recall is calculated as: Recall = True Positives / (True Positives + False Negatives)

How to Interpret:

  • A model with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels.
  • A model with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels.
  • High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

In some cases, a high-precision model may have low recall and vice versa. Therefore, it's important to consider both precision and recall when evaluating the performance of a binary classification model. The trade-off between precision and recall is often balanced by the use of metrics such as the F1 score, which is the harmonic mean of precision and recall.
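The sketch below (hypothetical labels, predictions, and scores, using scikit-learn) computes precision, recall, and F1 at a fixed threshold, plus the precision-recall pairs across thresholds that make up the curve.

```python
# Illustrative sketch: precision, recall, F1, and the precision-recall curve.
from sklearn.metrics import precision_score, recall_score, f1_score, precision_recall_curve

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                    # hard predictions at a 0.5 threshold
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]    # predicted positive-class probabilities

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Precision and recall at every threshold, as shown in the curve.
precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)
for p_t, r_t in zip(precisions, recalls):
    print(f"precision={p_t:.2f} recall={r_t:.2f}")
```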

F1 Score vs Frequency​

πŸ’‘ Two-Line Summary: The F1 score vs. frequency plot visualizes the relationship between the number of samples (frequency and the F1 score (a metric that combines precision and recall) of a model. It helps in identifying class imbalances and evaluating model performance on rare classes.

An F1 score vs. frequency plot is useful for visualizing the relationship between how often each class appears in the dataset and the F1 score the model achieves on that class. The F1 score is a metric that combines precision and recall and is used to evaluate per-class classification performance. By plotting the F1 score against class frequency, you can get a sense of how the model's performance relates to how much data it has seen for each class.

For example, if a class's F1 score tends to increase with its frequency, it suggests the model learns a class's patterns better when it sees more examples of it, and collecting more data for underrepresented classes could improve their performance. On the other hand, if F1 scores plateau or remain low even for high-frequency classes, additional samples are unlikely to lead to further improvements, and other techniques, such as regularization or feature selection, may need to be applied instead.

Here are the steps to interpret an F1 score vs frequency plot:

  1. The left-side y-axis represents the frequency of the classes in the dataset, and the right-side y-axis represents the F1 Score of the model. The x-axis has all of the class names, sorted by the frequency of samples.
  2. If a class has a high frequency in the dataset, it means that the class has a lot of instances and it's important to have a good F1 Score for that class.
  3. If a class has a low frequency in the dataset, it means that the class has a few instances and the F1 Score for that class might not be as important.
  4. A good classifier should have a high F1 Score across all classes regardless of their frequency.
  5. A classifier that performs well on classes with high frequency but poorly on classes with low frequency might not be ideal for certain use cases.
  6. It's important to analyze the plot to identify the classes that need more attention, for example, classes with a low frequency and a low F1 score, which may indicate that the classifier is not able to generalize well on those classes.

It's important to note that the F1 score is just one of the many ways to evaluate the performance of a multi-class classifier; it's important to also use other evaluation metrics such as accuracy, precision, recall, AUC, etc. to get a comprehensive understanding of the classifier's performance.
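As an illustration, the following sketch (hypothetical labels, using scikit-learn) computes each class's frequency and per-class F1 score, sorted by frequency, which is essentially the data behind this plot.

```python
# Illustrative sketch: per-class F1 score next to class frequency,
# sorted from most to least frequent class.
from collections import Counter
from sklearn.metrics import f1_score

y_true = ["cat", "dog", "cat", "bird", "dog", "cat", "bird", "cat"]
y_pred = ["cat", "dog", "dog", "bird", "dog", "cat", "cat", "cat"]

freq = Counter(y_true)
classes = [c for c, _ in freq.most_common()]                 # most frequent first
per_class_f1 = f1_score(y_true, y_pred, labels=classes, average=None)

for cls, f1 in zip(classes, per_class_f1):
    print(f"{cls}: frequency={freq[cls]}  f1={f1:.2f}")
```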

Brier Plot​

πŸ’‘ Two-Line Summary: The brier score plot is a visual representation of the accuracy of a probabilistic prediction model, where the brier score measures the difference between the predicted probabilities and actual outcomes. The plot helps to evaluate model calibration and sharpness and identify systematic prediction biases.

Say we have two models that correctly predicted that the input image is a dog. One model predicts this with a probability of 0.53 and the other with 0.87. They are both correct and have the same accuracy (assuming a 0.5 threshold for the binary classifier), but the second model feels better because it is more confident. This is what the Brier score captures.

A Brier plot is a visual tool used to evaluate the performance of a probabilistic forecast or binary classifier. The y-axis represents the Brier score, which ranges from 0 to 1, with lower scores indicating more accurate predictions. The x-axis lists the class names, ordered from the highest Brier score to the lowest.

The plot takes into account both the calibration and discrimination of a model's predictions. Calibration refers to the accuracy of the predicted probabilities, i.e., whether the predicted probabilities match the observed frequencies of events. Discrimination refers to the ability of the model to distinguish between positive and negative events.

A well-calibrated model with good discrimination will have a low Brier score, indicating that the predicted probabilities are accurate and the model is able to effectively distinguish between positive and negative events. On the other hand, a model with poor calibration or discrimination will have a high Brier score, indicating that the predicted probabilities are less accurate and the model is less effective at distinguishing between positive and negative events.

How to interpret a Brier score plot:

  • A high Brier score for a particular class means that the model's predicted probabilities for that class are far from the true outcomes. This could be due to several reasons, such as poor model fit, overfitting, or incorrect model assumptions. In other words, the model is making inaccurate predictions for that class, which could lead to poor performance on unseen data.
  • A low Brier score for a particular class means that the model's predicted probabilities for that class are close to the true outcomes. This suggests that the model is making accurate predictions for that class, which could lead to good performance on unseen data.
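To connect this back to the dog example above, here is a small sketch (hypothetical labels and probabilities, using scikit-learn) showing that confident correct predictions yield a lower Brier score than hesitant correct ones.

```python
# Illustrative sketch: the Brier score is the mean squared difference
# between predicted probabilities and actual (0/1) outcomes.
from sklearn.metrics import brier_score_loss

y_true          = [1, 0, 1, 1, 0]
confident_probs = [0.87, 0.10, 0.92, 0.85, 0.08]
hesitant_probs  = [0.53, 0.45, 0.55, 0.52, 0.48]

print(brier_score_loss(y_true, confident_probs))  # low score: correct and confident
print(brier_score_loss(y_true, hesitant_probs))   # higher score: correct but barely confident
```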

Calibration Plot​

πŸ’‘ Two-Line Summary: A calibration plot is a visual representation of the accuracy of a probabilistic prediction model, which plots the predicted probability of a positive outcome against the actual proportion of positive outcomes. It helps to evaluate consistency between the predicted probabilities and actual outcomes and identify systematic over- or under-prediction biases.

A calibration plot is a visual representation of the accuracy of a model's predicted probabilities. It is used to assess the quality of the model's predictions in terms of how well the predicted probabilities align with the true outcomes.

Here are the steps to interpret a calibration plot:

  1. The x-axis represents the predicted probability of an event occurring, while the y-axis represents the true proportion of events that actually occurred.
  2. Look for a diagonal line that runs from the bottom left corner to the top right corner of the plot. This line represents perfect calibration, where the predicted probabilities match the true proportions.
  3. Observe the location of the data points on the plot. If the points are close to the diagonal line, it indicates that the model's predicted probabilities are well-calibrated. If the points are far from the diagonal line, it indicates that the model's predicted probabilities are poorly calibrated.
  4. Check where the points are clustered on the plot. If most points are clustered toward the bottom left corner, the model mostly outputs low probabilities; if they are clustered toward the top right corner, it mostly outputs high probabilities. Whether those probabilities are too low or too high is indicated by how far the points sit above or below the diagonal.
  5. Observe the spread of the points on the plot. A narrower spread of points indicates that the model's predicted probabilities are more accurate, while a wider spread of points indicates that the model's predicted probabilities are less accurate.
  6. Observe the shape of the plot: a U shape suggests that the model is over-predicting negative cases and under-predicting positive cases, while an inverted U shape suggests that the model is under-predicting negative cases and over-predicting positive cases.
  7. If the calibration plot deviates from the ideal straight line, it indicates that the model is either overconfident or underconfident in its predictions. For example, if the curve lies below the ideal line, the model is overconfident: the predicted probabilities are higher than the actual proportion of positive events. Conversely, if the curve lies above the ideal line, the model is underconfident: the predicted probabilities are lower than the actual proportion of positive events.

Note that this is just a basic overview of how to interpret a calibration plot, and the specific interpretation may depend on the context of the problem and the model being used.
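As an illustration, the sketch below (hypothetical labels and probabilities) uses scikit-learn's calibration_curve to bin predictions and compare the mean predicted probability in each bin with the observed fraction of positives, which are exactly the x and y values of a calibration plot.

```python
# Illustrative sketch: binning predicted probabilities against observed
# outcome frequencies with scikit-learn's calibration_curve.
from sklearn.calibration import calibration_curve

y_true = [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.55, 0.7, 0.8, 0.35, 0.9, 0.15]

# fraction_of_positives is the y-axis, mean_predicted_value the x-axis;
# points on the diagonal y = x indicate good calibration.
fraction_of_positives, mean_predicted_value = calibration_curve(y_true, y_prob, n_bins=4)
for x, y in zip(mean_predicted_value, fraction_of_positives):
    print(f"predicted={x:.2f}  observed={y:.2f}")
```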

Feature Importance Plots​

Feature importance plots are important in machine learning for several reasons:

  1. Understanding the data: Feature importance plots can help to understand which features in the dataset are the most important for making predictions. This can be useful for identifying patterns or relationships in the data that might not be immediately obvious.
  2. Simplifying the model: By identifying the most important features, feature importance plots can help to simplify the model by removing less important features. This can lead to faster training times, lower memory requirements, and improved interpretability.
  3. Debugging: Feature importance plots can help to debug a model by identifying if any feature is not contributing to the prediction.
  4. Interpreting models: Feature importance plots can be used to understand how a model is making predictions. This can be particularly useful for models that are difficult to interpret, such as deep neural networks.
  5. Explainability: Feature importance plots can be used to provide insights about which features are having the most impact on the model's predictions. This can be useful for building trust in the model with stakeholders and decision-makers.

πŸ’‘ Two-Line Summary: Feature Importance Plots show the relative contribution of each feature in predicting the target feature of a machine learning model. It helps to identify the most important predictors, understand the relationship between the features and the outcome, and help in decision-making for feature selection and feature engineering.

A feature importance summary plot highlights how much each individual feature influences the model’s predictions.

The y-axis lists the feature names, ranked from the highest impact/importance to the lowest. The x-axis represents the magnitude of each feature's effect on the model's predictions.

When you find that a feature has low feature importance, it means that the feature is not contributing much to the model's predictions. There are several options for what you can do with features that have low feature importance, depending on the specific context and goals of your analysis:

  1. Remove the feature: It may be redundant or irrelevant, and removing it may simplify the model and reduce the risk of overfitting.
  2. Combine with other features: If the feature does not contribute much on its own, it may be useful when combined with other features. You can try creating new features by combining the low-importance feature with other features.
  3. Re-engineer the feature: You can try re-engineering the feature to better capture the relevant information, for example by transforming the feature, binning the feature, or aggregating the feature.
  4. Interact with other features: You can try creating interaction features that capture the relationship between the low-importance feature and other features.
  5. Get more data: You can try getting more data to better capture the relationship between the feature and the target variable, which might improve that feature’s importance.

It is important to keep in mind that feature importance can change depending on the specific model, so you should always validate the results on a held-out validation set. Additionally, it may be useful to consider the domain knowledge and the specific context of the problem when making decisions about which features to keep or remove.
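Feature importance can be estimated in several ways; as one illustrative, model-agnostic approach (not necessarily how Predibase computes it), the sketch below uses scikit-learn's permutation importance, measured on a held-out split in line with the validation advice above.

```python
# Illustrative sketch: permutation importance as one common way to
# estimate feature importance on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on the validation set and measure the drop in score.
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])
for name, score in ranked[:10]:
    print(f"{name}: {score:.4f}")
```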