
Quickstart: Supervised ML

This quickstart will show you how to train, evaluate, and deploy a supervised-ML model in Predibase. Check out the following notebook to follow along with this quickstart:

Open In Colab Notebook

Setup

If you're using the Python SDK, you'll first need to initialize your PredibaseClient object. All SDK examples below will assume this has already been done. Make sure you've installed the SDK and configured your API token as well.

import pandas as pd
import yaml
from predibase import PredibaseClient

# If you've already run `pbase login`, you don't need to provide any credentials here.
#
# If you're running in an environment where the `pbase login` command-line tool isn't available,
# you can also set your API token using `pc = PredibaseClient(token="<your token here>")`.
pc = PredibaseClient()

Dataset Preparation

Predibase can train a supervised-ML model on any table-like dataset, meaning that every feature has its own column and every example its own row.

In this example, we'll use this Rotten Tomatoes dataset, a CSV file with a variety of feature types and a binary target.

Download the data locally here.

rotten_tomatoes_df = pd.read_csv("rotten_tomatoes.csv")
rotten_tomatoes_df.head()

Your results should look something like this:

| movie_title | content_rating | genres | runtime | top_critic | review_content | recommended |
| --- | --- | --- | --- | --- | --- | --- |
| Deliver Us from Evil | R | Action & Adventure, Horror | 117.0 | TRUE | Director Scott Derrickson and his co-writer, Paul Harris Boardman, deliver a routine procedural with unremarkable frights. | 0 |
| Barbara | PG-13 | Art House & International, Drama | 105.0 | FALSE | Somehow, in this stirring narrative, Barbara manages to keep hold of her principles, and her humanity and courage, and battles to save a dissident teenage girl whose life the Communists are trying to destroy. | 1 |
| Horrible Bosses | R | Comedy | 98.0 | FALSE | These bosses cannot justify either murder or lasting comic memories, fatally compromising a farce that could have been great but ends up merely mediocre. | 0 |
| Money Monster | R | Drama | 98.0 | FALSE | A satire about television that feels like it was made by the kind of people who claim they don't even watch TV. | 0 |
| Battle Royale | NR | Action & Adventure, Art House & International, Drama, Mystery & Suspense | 114.0 | FALSE | Battle Royale is The Hunger Games not diluted for young audiences. | 1 |
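
Before uploading, it can help to sanity-check the data: confirm there are no missing values and look at the balance of the binary target. A minimal pandas sketch; the small inline frame below is an illustrative stand-in for the real CSV, using the columns shown above:

```python
import pandas as pd

# A tiny stand-in for rotten_tomatoes.csv, using the columns shown above.
df = pd.DataFrame({
    "movie_title": ["Deliver Us from Evil", "Barbara", "Horrible Bosses"],
    "content_rating": ["R", "PG-13", "R"],
    "genres": ["Action & Adventure, Horror",
               "Art House & International, Drama",
               "Comedy"],
    "runtime": [117.0, 105.0, 98.0],
    "top_critic": [True, False, False],
    "review_content": ["A routine procedural.",
                       "A stirring narrative.",
                       "Merely mediocre."],
    "recommended": [0, 1, 0],
})

# Total number of missing cells, and the class balance of the target.
missing_cells = df.isna().sum().sum()
class_balance = df["recommended"].value_counts()
```

A heavily imbalanced `recommended` column would be worth knowing about before training, since it affects which evaluation metrics are meaningful.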

Now let's connect this dataset to Predibase:

rotten_tomatoes_dataset = pc.upload_file("rotten_tomatoes.csv", "Rotten Tomatoes Reviews")

You should see the Rotten Tomatoes Reviews dataset appear on the first row of the Predibase datasets page in your UI:

Model Training

Now that we have our dataset, we can train our first model. We'll use the create_model method, which creates a new model repository for us and trains a model with the config we provide.

To start, let's use a basic config that specifies the model inputs and outputs. We'll let Predibase handle the rest:

rotten_tomatoes.yaml
input_features:
  - name: genres
    type: set
    preprocessing:
      tokenizer: comma
  - name: content_rating
    type: category
  - name: top_critic
    type: binary
  - name: runtime
    type: number
  - name: review_content
    type: text
output_features:
  - name: recommended
    type: binary

This config file tells Predibase that we want to train a model using the following input features:

  • The genres associated with the movie will be used as a set feature
  • The movie's content rating will be used as a category feature
  • Whether the review was done by a top critic or not will be used as a binary feature
  • The movie's runtime will be used as a number feature
  • The review content will be used as a text feature

This config file also tells Predibase that we want our model to have the following output features:

  • The recommended column indicates whether a movie was recommended and is used as a binary feature
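
If you prefer to build configs in code rather than by hand, the same structure can be expressed as plain Python dicts and serialized with PyYAML. A sketch using only PyYAML, with no Predibase calls:

```python
import yaml

# Build the config as plain Python structures, then serialize to YAML.
config = {
    "input_features": [
        {"name": "genres", "type": "set",
         "preprocessing": {"tokenizer": "comma"}},
        {"name": "content_rating", "type": "category"},
        {"name": "top_critic", "type": "binary"},
        {"name": "runtime", "type": "number"},
        {"name": "review_content", "type": "text"},
    ],
    "output_features": [
        {"name": "recommended", "type": "binary"},
    ],
}

# sort_keys=False keeps the input/output ordering readable.
config_yaml = yaml.safe_dump(config, sort_keys=False)
```

This is handy when generating similar configs for several targets, since you can vary the feature lists programmatically instead of editing YAML by hand.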

Now let's read our config file in and pass it to the model for training:

rotten_tomatoes_config = yaml.safe_load(
    """
input_features:
  - name: genres
    type: set
    preprocessing:
      tokenizer: comma
  - name: content_rating
    type: category
  - name: top_critic
    type: binary
  - name: runtime
    type: number
  - name: review_content
    type: text
output_features:
  - name: recommended
    type: binary
"""
)
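
A quick guard before training is to check that every feature named in the config actually exists as a column in the dataset, which catches typos early. A minimal sketch (the column set below is taken from the table shown earlier):

```python
import yaml

config = yaml.safe_load("""
input_features:
  - name: genres
    type: set
    preprocessing:
      tokenizer: comma
  - name: content_rating
    type: category
  - name: top_critic
    type: binary
  - name: runtime
    type: number
  - name: review_content
    type: text
output_features:
  - name: recommended
    type: binary
""")

# Columns present in rotten_tomatoes.csv (from the preview above).
csv_columns = {"movie_title", "content_rating", "genres", "runtime",
               "top_critic", "review_content", "recommended"}

# Every configured feature name must map to a real column.
configured = {f["name"]
              for f in config["input_features"] + config["output_features"]}
missing = configured - csv_columns
```

If `missing` is non-empty, fix the config before calling create_model rather than waiting for training to fail.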

Finally, let's pass the dataset and config to the create_model method. Here we will also specify a repository name, repository description, and model description for the first model trained. These are all optional, but will help you keep track of your models.

rotten_tomatoes_model = pc.create_model(
    repository_name="Rotten Tomatoes Recommender",
    dataset=rotten_tomatoes_dataset,
    config=rotten_tomatoes_config,
    repo_description="Predict whether a movie is recommended or not",
    model_description="Baseline Model",
)

After running this command, if you follow the provided link, you can track your model training in real-time:

Model Evaluation

Once our model has trained, we can evaluate its performance metrics both through the SDK and manually in the UI:

UI Model Evaluation

When evaluating a model in the model version view, you have a variety of tools to help you understand your model's performance.

One of the most helpful tools is the feature importance graph, which shows you how much each feature contributed to the model's predictions. This can help you understand which features are most important to your model, and which features may be unnecessary.
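
One common way to estimate this kind of importance offline is permutation importance: shuffle one feature column at a time and measure how much the model's accuracy drops. The dependency-free sketch below uses a toy hand-written "model" and made-up rows purely for illustration; it is not how Predibase computes its importance scores:

```python
import random

random.seed(0)

# Toy rows: (runtime, top_critic, recommended). Illustrative data only.
rows = [(117.0, True, 0), (105.0, False, 1), (98.0, False, 0),
        (98.0, False, 0), (114.0, False, 1), (110.0, True, 1)]

def toy_model(runtime, top_critic):
    # A hand-written rule standing in for a trained model.
    return 1 if runtime >= 105.0 and not top_critic else 0

def accuracy(data):
    return sum(toy_model(r, t) == y for r, t, y in data) / len(data)

def importance(data, idx):
    # Shuffle one feature column and measure the accuracy drop.
    shuffled = [row[idx] for row in data]
    random.shuffle(shuffled)
    permuted = [
        (shuffled[i] if idx == 0 else row[0],
         shuffled[i] if idx == 1 else row[1],
         row[2])
        for i, row in enumerate(data)
    ]
    return accuracy(data) - accuracy(permuted)

runtime_importance = importance(rows, 0)   # larger drop = more important
critic_importance = importance(rows, 1)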
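
One common way to estimate this kind of importance offline is permutation importance: shuffle one feature column at a time and measure how much the model's accuracy drops. The dependency-free sketch below uses a toy hand-written "model" and made-up rows purely for illustration; it is not how Predibase computes its importance scores:

```python
import random

random.seed(0)

# Toy rows: (runtime, top_critic, recommended). Illustrative data only.
rows = [(117.0, True, 0), (105.0, False, 1), (98.0, False, 0),
        (98.0, False, 0), (114.0, False, 1), (110.0, True, 1)]

def toy_model(runtime, top_critic):
    # A hand-written rule standing in for a trained model.
    return 1 if runtime >= 105.0 and not top_critic else 0

def accuracy(data):
    return sum(toy_model(r, t) == y for r, t, y in data) / len(data)

def importance(data, idx):
    # Shuffle one feature column and measure the accuracy drop.
    shuffled = [row[idx] for row in data]
    random.shuffle(shuffled)
    permuted = [
        (shuffled[i] if idx == 0 else row[0],
         shuffled[i] if idx == 1 else row[1],
         row[2])
        for i, row in enumerate(data)
    ]
    return accuracy(data) - accuracy(permuted)

runtime_importance = importance(rows, 0)  # larger drop = more important
critic_importance = importance(rows, 1)
```

A feature whose importance is near zero is a candidate for removal; one with a large drop is doing real work for the model.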

[Screenshot: feature importance graph]

In addition to feature importance for individual fields, we also provide feature importance down to the text token level!

SDK Model Evaluation

You can also evaluate your model's performance using the SDK. To do so, you can use the Model.evaluate method:

rotten_tomatoes_model.evaluate("recommended", rotten_tomatoes_dataset)

Deployment & Prediction

Once you're satisfied with your model's performance, you can deploy it to a REST API endpoint using the create_deployment method:

rotten_tomatoes_deployment = pc.create_deployment('rotten_tomatoes_deployment', rotten_tomatoes_model)

Now that you've deployed your model, you can easily make predictions using the Deployment.predict method:

preds_df = rotten_tomatoes_deployment[0].predict(rotten_tomatoes_df.sample(100), stream=True)
preds_df.head()

With this command, you can pass in any unseen data that you want predictions on, and Predibase will return a dataframe with the predictions for each example.
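
If you hold out labeled data, you can score the returned frame against the true labels with plain pandas. The sketch below uses a mock predictions frame; the real output schema comes from Predibase, and the column name `recommended_predictions` is illustrative only:

```python
import pandas as pd

# Mock of a predictions frame; the real column names depend on the
# model's output feature, so "recommended_predictions" is an assumption.
preds_df = pd.DataFrame({"recommended_predictions": [1, 0, 0, 1]})
truth = pd.Series([1, 0, 1, 1])

# Fraction of predictions matching the held-out labels.
accuracy = (preds_df["recommended_predictions"] == truth).mean()
```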

Conclusion

In this example, we've shown how to train, evaluate, and deploy a model using Predibase, and how to make predictions with the deployed model. For more in-depth documentation of the Predibase SDK, check out the rest of the documentation.