Skip to main content

PREDICT

PREDICT is used to compute predictions for one or more target features from a trained model given an input selection.

Syntax

Following is the grammar for making predictions:

PREDICT target [, target]* GIVEN input [, input ]*

target:
feature_name [ AS target_alias ]
[ WITH property [, property]* ]
[ USING model_name [ VERSION model_version] ]

property:
CONFIDENCE
| EXPLANATION [ ( algo='algorithm' ) ]
| PROBABILITIES [ OF ( 'category' [, 'category']* ) ]
| METADATA

input:
{ constant_input | range_input | query_input }

constant_input: identifier = literal_value

range_input: identifier =
ARRAY[ literal_value [, literal_value]* ]
| RANGE_INT(low, high, [, step])
| RANGE_FLOAT(low, high [, steps] [, { LINEAR | LOG }])

query_input:
SELECT feature_name [, feature_name]*
FROM dataset_name
[ WHERE where_condition ]
[ ORDER BY sort_col { ASC | DESC } [ LIMIT row_count ] ]

The * syntax in the above grammar indicates that an element can repeat zero or more times.

Target features

Predict begins with a series of target features to be predicted. For every target, there must be a corresponding feature known to Predibase. If more than one features matches, it must be disambiguated by providing a dataset and possibly connection scope.

Example of scoping to a dataset:

PREDICT dataset_name.target_feature ...

Also of note:

  • At prediction time, if any input columns overlap with target features, the inputs will be ignored.
  • Training will happen over the full dataset, not just the input features provided in the GIVEN clause, with missing values being imputed based on pre-processing configuration.

Target alias

The default prediction column name will be <target_feature_name>_predictions. Using the alias syntax, these can be renamed per column, which will also be reflect for the property columns.

For example the following query:

PREDICT Survived as pred WITH CONFIDENCE
GIVEN SELECT * FROM titainic

Will include the following aliased columns:

  • Survived_predictions -> pred
  • Survived_probability -> pred_probability

With property

The WITH clause can specify additional properties for prediction that are included as additional columns in the result set.

Following is the set of supported properties:

  • CONFIDENCE: returns a numerical score [0, 1] indicating the model’s confidence in its.
  • EXPLANATION: returns the explanation for why the model made its prediction. See below for supported algorithms.
  • PROBABILITIES: returns a list or all or specific (OF keyword) class probabilities.
  • METADATA: returns the model_name and model_version loaded for the target.

You can include one or more of these properties for each target eg:

PREDICT Survived WITH CONFIDENCE, PROBABILITIES OF ('True', 'False')
GIVEN SELECT * FROM titainic

Explanation Algorithms

Explanations are returned using one of the following algorithms which can be specified with the algo key.

  • ig explanations with the Integrated Gradients algorithm (default).
  • shap explanations with the SHAP algorithm.
  • gbm explanations for GBM models using feature importance scores of type "gain" (default for GBM models).

Example:

PREDICT is_fraud WITH EXPLANATION ( algo='shap' )
GIVEN SELECT * FROM titanic

For more information on explanations, see the Explanations section.

Using clause

Every prediction column can be produced by a specific model version if desired.

PREDICT feature_name USING model_name [ VERSION model_version]

By default, PQL will select the default model for all targets, but a specific model name and version can also be provided.

For model names with spaces or other special characters, surround the identifier with double quotes:

USING "my model name" VERSION 1

Input features

GIVEN is used to declaratively construct the input features to use for prediction.

The result of a GIVEN clause is a logical distributed DataFrame constructed from one or more inputs. All inputs are separated by a comma.

Query inputs

Simple SQL select statements can be used as input for example:

GIVEN SELECT * FROM titanic WHERE Survived=False ORDER BY Age LIMIT 10

The following syntax is used to parse the selection.

SELECT feature_name [, feature_name]*
FROM dataset_name
[ WHERE where_condition ]
[ ORDER BY sort_col { ASC | DESC } [ LIMIT row_count ] ]
  • Each feature_name identifies a feature to use as input, or * indicating all features.
  • The dataset_name from which the data will be selected, registered as a dataset.
  • Optional where_condition which must evaluate to true for a row to be selected.
  • Optional order_by_condition which specifies one or more features to order by up to a row_count limit.

Hypothetical inputs

In addition to inputs from a dataset, these can be replaced or augmented with hypothetical inputs that can be used to for testing how a prediction might respond to a combination of inputs that may not exist in your dataset.

We support a number of different input types which will be discussed in detail below:

Constant input

Constant inputs are used for one-off / real-time prediction when the inputs used for prediction are known at query time.

GIVEN age=28, first_name='John'

Using constant inputs allows us to avoid hitting the native query engine and instead going directly to the prediction system. For data stored in warehouses or data lakes, this can significantly improve query latency.

Array input

Produce an input with an array of one or more literal values surrounded by square brackets:

Example to provide three different inputs for age:

GIVEN age=ARRAY[5, 25, 50]

Range Int input

Produce a range of integer values from low, to high with an optional step value:

Example will provide age values 1 through 10:

GIVEN age=RANGE_INT(1, 10)

Example will provide age values 5, 10, 15, 20, 25 using the step value of 5:

GIVEN age=RANGE_INT(5, 25, 5)

Range Float input

Produce a range of continuous values from low, to high with an optional steps, and scale which defaults to LINEAR

Example will provide a range of values 0.1, 0.2 ... 1.0 using 10 steps to get there:

GIVEN price=RANGE_FLOAT(0.1, 1, 10)

This example attempts to vary the discount from 1% to 50% using a LOG scale with a total of 100 steps.

GIVEN discount=RANGE_FLOAT(0.01, 0.5, 100, LOG)

Combining inputs

All of these input types can be combined. Here’s an example of how this can be used in a more complex query.

GIVEN
(SELECT a, b FROM dataset1 WHERE c=1),
(SELECT * FROM dataset2 WHERE d=2),
e=ARRAY[1,2,3],
f=RANGE_INT(5,25,5),
g=RANGE_FLOAT(0.1,1,10)

In this case, the order of execution is as follows:

  1. All native / selection inputs are executed in order, with the result of the current query being joined (on index) with the previous result.
  2. All constant inputs are applied afterwards as constant columns on the result DataFrame in order.
    • If there were no native / selection inputs, then constant columns will form a new DataFrame with one row.

Examples

Following are a list of example predict queries, of which could be composed into more complex SELECT statements.

Basic prediction

The simplest form of the prediction query.

PREDICT Survived GIVEN select * from titanic

Advanced prediction

This example shows all optional clauses.

PREDICT Survived WITH confidence
USING titanic_model
GIVEN SELECT * from titanic
WHERE Survived is not null LIMIT 10

Hypothetical query

Start with a single example, then explore different perturbations of the input features to simulate different outcomes. Example output values are included as comments in the query below.

PREDICT Survived
GIVEN
(SELECT * from titanic WHERE Survived is null LIMIT 1),
PClass=1,
Sex=ARRAY['male', 'female'],
Age=RANGE_INT(50, 52), -- 50, 51, 52
Fare=RANGE_FLOAT(10, 120, 5, LOG) -- 10, 18, 34, 64, 120

Nested query

Return a subset of fields or return calculated fields which will be used to sort or limit results:

SELECT
Survived,
Pred,
CASE WHEN Survived = Pred THEN 1 ELSE 0 END as success
FROM (
PREDICT Survived as Pred
GIVEN SELECT * FROM titanic
)
ORDER BY success DESC
LIMIT 10

Compare query

The WITH METADATA property can be used in conjunction with a UNION of two prediction results to compare the default model for a given target vs version 1 of the titanic_small model as show in the following example:

WITH preds AS (
SELECT * FROM (
PREDICT Survived with confidence, metadata
GIVEN SELECT * from titanic
)
UNION ALL SELECT * FROM (
PREDICT Survived with confidence, metadata
USING titanic_small VERSION 1 -- specific model
GIVEN SELECT * FROM titanic
)
)
SELECT
Survived_model_name,
Survived_model_version,
AVG(Survived_probability) AS Survived_model_probability,
COUNT(*) as total_predictions,
SUM(CASE WHEN Survived = Survived_predictions THEN 1 ELSE 0 END) as correct_predictions
FROM preds
GROUP BY Survived_model_name, Survived_model_version