Quickstart: Supervised ML
This quickstart shows you how to train, evaluate, and deploy a supervised-ML model in Predibase. You can also follow along in the companion notebook.
Setup
If you're using the Python SDK, you'll first need to initialize your PredibaseClient object. All SDK examples below assume this has already been done. Make sure you've installed the SDK and configured your API token as well.
import pandas as pd
import yaml
from predibase import PredibaseClient
# If you've already run `pbase login`, you don't need to provide any credentials here.
#
# If you're running in an environment where the `pbase login` command-line tool isn't available,
# you can also set your API token using `pc = PredibaseClient(token="<your token here>")`.
pc = PredibaseClient()
Dataset Preparation
Predibase can train a supervised-ML model on any table-like dataset, meaning that every feature has its own column and every example its own row.
In this example, we'll use this Rotten Tomatoes dataset, a CSV file with a variety of feature types and a binary target.
Download the data locally here.
rotten_tomatoes_df = pd.read_csv("rotten_tomatoes.csv")
rotten_tomatoes_df.head()
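Before uploading, it can be worth a quick sanity check that the table really is one feature per column with a binary target. Here's a minimal sketch using a small synthetic stand-in for the CSV (the column names match the dataset above; the rows are illustrative only):

```python
import io
import pandas as pd

# A few rows in the same shape as rotten_tomatoes.csv (synthetic stand-in).
csv_data = io.StringIO(
    "movie_title,content_rating,genres,runtime,top_critic,review_content,recommended\n"
    'Barbara,PG-13,"Art House & International, Drama",105.0,False,A stirring narrative.,1\n'
    "Horrible Bosses,R,Comedy,98.0,False,Merely mediocre.,0\n"
)
df = pd.read_csv(csv_data)

# Confirm every feature has its own column, nothing is missing,
# and the target only takes the values 0 and 1.
print(df.dtypes)
print(df.isna().sum())
print(df["recommended"].unique())
```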
Your results should look a little something like this:
movie_title | content_rating | genres | runtime | top_critic | review_content | recommended |
---|---|---|---|---|---|---|
Deliver Us from Evil | R | Action & Adventure, Horror | 117.0 | TRUE | Director Scott Derrickson and his co-writer, Paul Harris Boardman, deliver a routine procedural with unremarkable frights. | 0 |
Barbara | PG-13 | Art House & International, Drama | 105.0 | FALSE | Somehow, in this stirring narrative, Barbara manages to keep hold of her principles, and her humanity and courage, and battles to save a dissident teenage girl whose life the Communists are trying to destroy. | 1 |
Horrible Bosses | R | Comedy | 98.0 | FALSE | These bosses cannot justify either murder or lasting comic memories, fatally compromising a farce that could have been great but ends up merely mediocre. | 0 |
Money Monster | R | Drama | 98.0 | FALSE | A satire about television that feels like it was made by the kind of people who claim they don't even watch TV. | 0 |
Battle Royale | NR | Action & Adventure, Art House & International, Drama, Mystery & Suspense | 114.0 | FALSE | Battle Royale is The Hunger Games not diluted for young audiences. | 1 |
Now let's connect this dataset to Predibase:
rotten_tomatoes_dataset = pc.upload_file("rotten_tomatoes.csv", "Rotten Tomatoes Reviews")
You should see the Rotten Tomatoes Reviews dataset appear at the top of the datasets page in the Predibase UI:
Model Training
Now that we have our dataset, we can train our first model. We'll use the create_model method, which creates a new model repository for us and trains a model with the config we provide.
To start, let's use a basic config that specifies the model inputs and outputs. We'll let Predibase handle the rest:
input_features:
- name: genres
type: set
preprocessing:
tokenizer: comma
- name: content_rating
type: category
- name: top_critic
type: binary
- name: runtime
type: number
- name: review_content
type: text
output_features:
- name: recommended
type: binary
This config file tells Predibase that we want to train a model using the following input features:
- The genres associated with the movie will be used as a set feature
- The movie's content rating will be used as a category feature
- Whether the review was done by a top critic or not will be used as a binary feature
- The movie's runtime will be used as a number feature
- The review content will be used as a text feature
This config file also tells Predibase that we want our model to have the following output features:
- The recommended column indicates whether a movie was recommended and is used as a binary feature
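To make the set feature concrete: with the comma tokenizer, a multi-genre string is split on commas into an unordered set of tokens, rather than being treated as one big category. A rough illustration of that preprocessing step (not Predibase's actual implementation):

```python
def comma_tokenize(value: str) -> set[str]:
    """Roughly what a comma tokenizer does: split on commas, strip whitespace."""
    return {token.strip() for token in value.split(",") if token.strip()}

genres = "Action & Adventure, Art House & International, Drama"
print(comma_tokenize(genres))
# Each movie ends up with a set of genre tokens; the model can learn a
# representation per token, unlike a category feature where the whole
# string would be a single value.
```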
Now let's read our config file in and pass it to the model for training:
rotten_tomatoes_config = yaml.safe_load(
"""
input_features:
- name: genres
type: set
preprocessing:
tokenizer: comma
- name: content_rating
type: category
- name: top_critic
type: binary
- name: runtime
type: number
- name: review_content
type: text
output_features:
- name: recommended
type: binary
"""
)
Finally, let's pass the dataset and config to the create_model method. Here we'll also specify a repository name, a repository description, and a model description for the first model trained. These are all optional, but they'll help you keep track of your models.
rotten_tomatoes_model = pc.create_model(
repository_name="Rotten Tomatoes Recommender",
dataset=rotten_tomatoes_dataset,
config=rotten_tomatoes_config,
repo_description="Predict whether a movie is recommended or not",
model_description="Baseline Model"
)
After running this command, if you follow the provided link, you can track your model training in real-time:
Model Evaluation
Once our model has trained, we can evaluate its performance metrics both through the SDK and manually in the UI:
UI Model Evaluation
When evaluating a model in the model version view, you have a variety of tools to help you understand your model's performance.
One of the most helpful tools is the feature importance graph, which shows you how much each feature contributed to the model's predictions. This can help you understand which features are most important to your model, and which features may be unnecessary.
# Feature importance screenshot
In addition to feature importance for individual fields, we also provide feature importance down to the text token level!
SDK Model Evaluation
You can also evaluate your model's performance using the SDK via the Model.evaluate method:
rotten_tomatoes_model.evaluate("recommended", rotten_tomatoes_dataset)
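The evaluation report includes standard binary-classification metrics such as accuracy and ROC AUC. As a reminder of what those numbers mean, here's a small self-contained computation on toy labels (this is illustrative arithmetic, not Predibase output):

```python
# Toy ground truth and predicted probabilities for the `recommended` target.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]

# Accuracy: fraction of correct predictions at a 0.5 threshold.
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ROC AUC: probability that a random positive example is scored
# above a random negative one (ties count as half).
pos = [p for t, p in zip(y_true, y_prob) if t == 1]
neg = [p for t, p in zip(y_true, y_prob) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(f"accuracy={accuracy:.3f}, roc_auc={auc:.3f}")
```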
Deployment & Prediction
Once you're satisfied with your model's performance, you can deploy it to a REST API endpoint using the create_deployment method:
rotten_tomatoes_deployment = pc.create_deployment('rotten_tomatoes_deployment', rotten_tomatoes_model)
Now that you've deployed your model, you can easily make predictions using the Deployment.predict method:
preds_df = rotten_tomatoes_deployment.predict(rotten_tomatoes_df.sample(100), stream=True)
preds_df.head()
With this command, you can pass in any unseen data that you want predictions on, and Predibase will return a dataframe with the predictions for each example.
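A common next step is to line the returned predictions up against the sampled rows to spot-check them. A minimal sketch on synthetic stand-ins (the `recommended_predictions` column name here is an assumption for illustration, not the SDK's guaranteed output schema):

```python
import pandas as pd

# Synthetic stand-ins for a sample of inputs and the predictions frame
# a deployment might return. `recommended_predictions` is a hypothetical
# column name used only for this sketch.
sample = pd.DataFrame({
    "movie_title": ["Barbara", "Horrible Bosses"],
    "recommended": [1, 0],
})
preds_df = pd.DataFrame({"recommended_predictions": [1, 1]})

# Align predictions with the sampled rows and flag mismatches.
review = sample.reset_index(drop=True).join(preds_df)
review["correct"] = review["recommended"] == review["recommended_predictions"]
print(review)
```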
Conclusion
In this example, we've shown how to train, evaluate, and deploy a model using Predibase, and how to make predictions with the deployed model. For more in-depth coverage of the Predibase SDK, check out the rest of the documentation.