Quickstart: Supervised ML
This quickstart will show you how to get started with Predibase in 15 minutes using the ATIS dataset from Kaggle.
Step 1: Upload a dataset
Your first step will be to connect data to Predibase. To access the ATIS dataset, download the CSV file from the dataset's Kaggle page after logging into your Kaggle account. For this quickstart, we'll use only the atis_intents.csv file. This dataset doesn't have a header row, so you will need to add "intent" and "message" as column names at the top of the file. The dataset has 8 categories of intents with a total of 4834 rows.
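If you would rather not edit the file by hand, the header row can be added with a short script. The following is a minimal sketch using only Python's standard library. It assumes the downloaded file is named atis_intents.csv and sits in the current directory; the output file name atis_intents_with_header.csv and the helper name add_header are placeholders chosen for this example, not part of Predibase.

```python
import csv
from pathlib import Path


def add_header(src, dst, header=("intent", "message")):
    """Copy a headerless CSV to dst with a header row prepended.

    Returns the number of data rows copied (the header is not counted).
    """
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(header)      # column names go on the first line
        count = 0
        for row in reader:           # copy every original row unchanged
            writer.writerow(row)
            count += 1
    return count


if __name__ == "__main__":
    src = Path("atis_intents.csv")   # assumed name of the Kaggle download
    if src.exists():
        n = add_header(src, "atis_intents_with_header.csv")
        print(f"Wrote {n} data rows plus a header row")
```

The printed row count gives you a quick check against the row total quoted above before you upload the file.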
Once you have the file, you can navigate to the Data tab on the left to upload it. Simply go to Data -> Connect Data -> File, upload the file and give it a name (we suggest "atis_demo" for this tutorial).
After you upload the file, you'll be brought to the Datasets tab. The dataset will appear as connecting.
Once it's fully connected, you can click on it to access the Dataset Details page. From this page, you can see metadata like “Num rows”, “Size bytes”, and all the available fields in the dataset. Right now, we don't have any models configured for this dataset, so you won't see any active model versions. Later, you will be able to either configure a new model for a given field or retrain the active version of the model for that field.
Step 2: Train a model
Now, go to the Models tab from the left-navigation menu and we'll train our first model in Predibase.
To get started, click +New Model Repository, give your model repository a name, and then click “Next: Train a model”.
After you name the model repository, you will see three sections: Setup, Model Type, and Feature Selection. For this demo, we are going to click the second option to build a custom model.
First, let’s start with Setup. When training a model, you’ll need to specify which dataset you want to use and the engine you want to train the model on (see the engine section of the docs for more details). We recommend using a training engine (called “train_engine”) for all your model training jobs, which will automatically rightsize the compute to your dataset and model. If you want to override our recommendation, you can select a different engine from the drop-down menu. If you have uploaded the CSV file and named it “atis_demo”, use the drop-down menus to select “file: file_uploads” under “Connections” and “file: atis_demo” as the corresponding Dataset.
Next, select one or more targets. You can select multiple columns as targets in Predibase to train a multi-task model, but in this case we only have one target, so let's set intent as the target.
For Model Type, as this is a text classification use case, we will select Neural Network (default). If you want to learn more about Gradient Boosted Trees and how they compare to Neural Networks, read this blog for more details.
Let’s move to the Feature Selection section. You will see all the columns in your dataset with a few options. You can adjust the Data Type of each column (which has implications for the model), and use the Active toggle to control whether the feature is used in the model. This is useful when your dataset contains features you don't want the model to use as input or learn from. In this case, we only have one feature, message, so let’s keep it active.
Notice that the Next button at the top right is now active (blue) and waiting for you to press it.
Now the model is queued. For the ATIS model, this should only take several minutes on our train engine. Grab a cup of coffee and come back soon!
While you are waiting, you can click on the Model Version (circled in red below) and watch the learning curves update as the model is training. After a few minutes, you’ll see the model’s status change to Ready, which means you can click the Model Version to check the model configs and performance metrics.
After clicking the link to the model you just trained, you will see the plot of the completed learning curve. You can dive in deeper by checking different tabs to look at the config, metrics, and plots.
For example, let’s click the Confusion Matrix tab. A confusion matrix, also known as an error matrix, is a table that visualizes the performance of a classification algorithm. It has two dimensions ("actual" and "predicted"), with the same set of classes along both. If the values on the main diagonal of the matrix are close to 1, the model predicts most examples correctly. In this plot, our first model appears to perform well on the eight categories: many of the values on the main diagonal are over 0.8, and a few are close to 1.
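To make the diagonal values concrete, here is a minimal sketch of how a normalized confusion matrix can be computed from actual and predicted labels. It assumes each row is normalized by the number of examples of the actual class, so every cell is a fraction between 0 and 1; whether the Predibase plot normalizes in exactly this way is an assumption. The function name and the tiny label lists are made up for illustration and are not results from the ATIS model.

```python
from collections import Counter


def normalized_confusion_matrix(actual, predicted):
    """Return (labels, matrix) where matrix[i][j] is the fraction of
    examples with actual class labels[i] that were predicted as labels[j]."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))   # (actual, predicted) -> count
    matrix = []
    for a in labels:
        row_total = sum(counts[(a, p)] for p in labels)
        # Guard against classes that never occur as an actual label.
        matrix.append([counts[(a, p)] / row_total if row_total else 0.0
                       for p in labels])
    return labels, matrix


if __name__ == "__main__":
    # Illustrative labels only.
    actual = ["flight", "flight", "flight", "flight", "airfare", "airfare"]
    predicted = ["flight", "flight", "flight", "airfare", "airfare", "airfare"]
    labels, matrix = normalized_confusion_matrix(actual, predicted)
    print(labels)   # ['airfare', 'flight']
    print(matrix)   # [[1.0, 0.0], [0.25, 0.75]]
```

In this toy example the diagonal entries are 1.0 and 0.75: every "airfare" example is predicted correctly, while one in four "flight" examples is mistaken for "airfare".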
Step 3: Query your model
Now you're ready for the fun part! Head to the Query tab and start testing it out. PQL queries are executed on an engine. By default, these are executed on “default_engine” which is a small CPU cluster. You can override this selection for larger queries by selecting a different engine from the dropdown menu. Make sure you've selected an engine and use the File Uploads Connection to select the atis_demo dataset.
This editor is a PQL editor, which can execute both SQL and PQL queries. Try it out and get a sense of the dataset by first writing a SQL query.
SELECT * FROM atis_demo LIMIT 5;
This will select 5 rows from the atis_demo dataset, which you can see below.
Now that we know what the dataset looks like, we can try our first predictive query. Since we trained a model to predict the intent column, we can now run the following query to get predictions as a new column added to our dataset – intent_predictions.
PREDICT intent
GIVEN SELECT * FROM atis_demo LIMIT 5;
Now scroll all the way to the right and notice the new column added to the table that contains predictions from our text classification model for what a message’s intent is.
One of the powerful things about PQL is that we can embed more advanced machine learning functionality in the language. In particular, we can write PQL queries that return both predictions as well as the confidence of the predictions or the explanations for why the model made a given prediction.
To get a confidence score alongside each prediction, add with CONFIDENCE to your PREDICT statement, as below:
PREDICT intent with CONFIDENCE
GIVEN SELECT * FROM atis_demo LIMIT 5;
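For intuition about what a confidence value can represent: in many classification models, the confidence of a prediction is the softmax probability assigned to the predicted class. The sketch below illustrates that general idea in plain Python. It is an assumption that Predibase derives its confidence values this way, and the function names and example scores are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


if __name__ == "__main__":
    # Hypothetical raw scores for two intent classes.
    label, confidence = predict_with_confidence([2.0, 0.0], ["flight", "airfare"])
    print(label, round(confidence, 3))
```

A confidence near 1 means the model assigns almost all probability to a single class, while a value near 1 divided by the number of classes means the model is close to guessing.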
To get explanations for the model's predictions, add with EXPLANATION instead:
PREDICT intent with EXPLANATION
GIVEN SELECT * FROM atis_demo LIMIT 5;
This query will add one additional column to the table, which is called intent_explanation.
If you want to see how each feature contributes to the prediction of one example, click the value in the intent_explanation column and you will see the plot below. Since this dataset only has one feature, the explanation query doesn’t give us much information.
But when you have more than one feature, you can use this to better understand how the features contribute to the prediction.
By default, you can see the top 3 most impactful features in the model's prediction from this view, but you can also click on any explanation to dive deeper and understand how each feature is driving the model to predict a greater or lower likelihood of the target variable.
For example, if we use the Titanic dataset with 8 features to predict the likelihood of passenger survival, clicking a value in the Survived_explanation column shows this plot of how impactful each feature is.
And that's all you need for our quickstart!
Step 4: Try it on your data
The most exciting part about Predibase is being able to use it directly on your data, combining SQL for data transformations with PQL for machine learning to train models and glean insights. This documentation, along with the platform in your hands, will be a great starting point to see what you can do next with deep learning.
We can't wait to hear more about what you try!