PREDICT
PREDICT
is used to compute predictions for one or more target features from a trained model given an input selection.
Syntax
Following is the grammar for making predictions:
PREDICT target [, target]* GIVEN input [, input ]*
target:
feature_name [ AS target_alias ]
[ WITH property [, property]* ]
[ USING model_name [ VERSION model_version] ]
property:
CONFIDENCE
| EXPLANATION [ ( algo='algorithm' ) ]
| PROBABILITIES [ OF ( 'category' [, 'category']* ) ]
| METADATA
input:
{ constant_input | range_input | query_input }
constant_input: identifier = literal_value
range_input: identifier =
ARRAY[ literal_value [, literal_value]* ]
| RANGE_INT(low, high, [, step])
| RANGE_FLOAT(low, high [, steps] [, { LINEAR | LOG }])
query_input:
SELECT feature_name [, feature_name]*
FROM dataset_name
[ WHERE where_condition ]
[ ORDER BY sort_col { ASC | DESC } [ LIMIT row_count ] ]
The *
syntax in the above grammar indicates that an element can repeat zero or more times.
Target features
Predict begins with a series of target features to be predicted. For every target, there must be a corresponding feature known to Predibase. If more than one features matches, it must be disambiguated by providing a dataset and possibly connection scope.
Example of scoping to a dataset:
PREDICT dataset_name.target_feature ...
Also of note:
- At prediction time, if any input columns overlap with target features, the inputs will be ignored.
- Training will happen over the full dataset, not just the input features provided in the
GIVEN
clause, with missing values being imputed based on pre-processing configuration.
Target alias
The default prediction column name will be <target_feature_name>_predictions
. Using the alias syntax, these can be renamed per column, which will also be reflect for the property columns.
For example the following query:
PREDICT Survived as pred WITH CONFIDENCE
GIVEN SELECT * FROM titainic
Will include the following aliased columns:
Survived_predictions
->pred
Survived_probability
->pred_probability
With property
The WITH
clause can specify additional properties for prediction that are included as additional columns in the result set.
Following is the set of supported properties:
CONFIDENCE
: returns a numerical score [0, 1] indicating the model’s confidence in its.EXPLANATION
: returns the explanation for why the model made its prediction. See below for supported algorithms.PROBABILITIES
: returns a list or all or specific (OF
keyword) class probabilities.METADATA
: returns themodel_name
andmodel_version
loaded for the target.
You can include one or more of these properties for each target eg:
PREDICT Survived WITH CONFIDENCE, PROBABILITIES OF ('True', 'False')
GIVEN SELECT * FROM titainic
Explanation Algorithms
Explanations are returned using one of the following algorithms which can be specified with the algo
key.
ig
explanations with the Integrated Gradients algorithm (default).shap
explanations with the SHAP algorithm.gbm
explanations for GBM models using feature importance scores of type "gain" (default for GBM models).
Example:
PREDICT is_fraud WITH EXPLANATION ( algo='shap' )
GIVEN SELECT * FROM titanic
For more information on explanations, see the Explanations section.
Using clause
Every prediction column can be produced by a specific model version if desired.
PREDICT feature_name USING model_name [ VERSION model_version]
By default, PQL will select the default model for all targets, but a specific model name and version can also be provided.
For model names with spaces or other special characters, surround the identifier with double quotes:
USING "my model name" VERSION 1
Input features
GIVEN
is used to declaratively construct the input features to use for prediction.
The result of a GIVEN
clause is a logical distributed DataFrame constructed from one or more inputs. All inputs are separated by a comma.
Query inputs
Simple SQL select statements can be used as input for example:
GIVEN SELECT * FROM titanic WHERE Survived=False ORDER BY Age LIMIT 10
The following syntax is used to parse the selection.
SELECT feature_name [, feature_name]*
FROM dataset_name
[ WHERE where_condition ]
[ ORDER BY sort_col { ASC | DESC } [ LIMIT row_count ] ]
- Each
feature_name
identifies a feature to use as input, or*
indicating all features. - The
dataset_name
from which the data will be selected, registered as a dataset. - Optional
where_condition
which must evaluate to true for a row to be selected. - Optional
order_by_condition
which specifies one or more features to order by up to arow_count
limit.
Hypothetical inputs
In addition to inputs from a dataset, these can be replaced or augmented with hypothetical inputs that can be used to for testing how a prediction might respond to a combination of inputs that may not exist in your dataset.
We support a number of different input types which will be discussed in detail below:
Constant input
Constant inputs are used for one-off / real-time prediction when the inputs used for prediction are known at query time.
GIVEN age=28, first_name='John'
Using constant inputs allows us to avoid hitting the native query engine and instead going directly to the prediction system. For data stored in warehouses or data lakes, this can significantly improve query latency.
Array input
Produce an input with an array of one or more literal values surrounded by square brackets:
Example to provide three different inputs for age:
GIVEN age=ARRAY[5, 25, 50]
Range Int input
Produce a range of integer values from low, to high with an optional step value:
Example will provide age values 1 through 10:
GIVEN age=RANGE_INT(1, 10)
Example will provide age values 5, 10, 15, 20, 25
using the step value of 5
:
GIVEN age=RANGE_INT(5, 25, 5)
Range Float input
Produce a range of continuous values from low, to high with an optional steps, and scale which defaults to LINEAR
Example will provide a range of values 0.1, 0.2 ... 1.0
using 10
steps to get there:
GIVEN price=RANGE_FLOAT(0.1, 1, 10)
This example attempts to vary the discount from 1% to 50% using a LOG
scale with a total of 100
steps.
GIVEN discount=RANGE_FLOAT(0.01, 0.5, 100, LOG)
Combining inputs
All of these input types can be combined. Here’s an example of how this can be used in a more complex query.
GIVEN
(SELECT a, b FROM dataset1 WHERE c=1),
(SELECT * FROM dataset2 WHERE d=2),
e=ARRAY[1,2,3],
f=RANGE_INT(5,25,5),
g=RANGE_FLOAT(0.1,1,10)
In this case, the order of execution is as follows:
- All native / selection inputs are executed in order, with the result of the current query being joined (on index) with the previous result.
- All constant inputs are applied afterwards as constant columns on the result DataFrame in order.
- If there were no native / selection inputs, then constant columns will form a new DataFrame with one row.
Examples
Following are a list of example predict queries, of which could be composed into more complex SELECT statements.
Basic prediction
The simplest form of the prediction query.
PREDICT Survived GIVEN select * from titanic
Advanced prediction
This example shows all optional clauses.
PREDICT Survived WITH confidence
USING titanic_model
GIVEN SELECT * from titanic
WHERE Survived is not null LIMIT 10
Hypothetical query
Start with a single example, then explore different perturbations of the input features to simulate different outcomes. Example output values are included as comments in the query below.
PREDICT Survived
GIVEN
(SELECT * from titanic WHERE Survived is null LIMIT 1),
PClass=1,
Sex=ARRAY['male', 'female'],
Age=RANGE_INT(50, 52), -- 50, 51, 52
Fare=RANGE_FLOAT(10, 120, 5, LOG) -- 10, 18, 34, 64, 120
Nested query
Return a subset of fields or return calculated fields which will be used to sort or limit results:
SELECT
Survived,
Pred,
CASE WHEN Survived = Pred THEN 1 ELSE 0 END as success
FROM (
PREDICT Survived as Pred
GIVEN SELECT * FROM titanic
)
ORDER BY success DESC
LIMIT 10
Compare query
The WITH METADATA
property can be used in conjunction with a UNION
of two prediction results to compare the default model for a given target vs version 1 of the titanic_small
model as show in the following example:
WITH preds AS (
SELECT * FROM (
PREDICT Survived with confidence, metadata
GIVEN SELECT * from titanic
)
UNION ALL SELECT * FROM (
PREDICT Survived with confidence, metadata
USING titanic_small VERSION 1 -- specific model
GIVEN SELECT * FROM titanic
)
)
SELECT
Survived_model_name,
Survived_model_version,
AVG(Survived_probability) AS Survived_model_probability,
COUNT(*) as total_predictions,
SUM(CASE WHEN Survived = Survived_predictions THEN 1 ELSE 0 END) as correct_predictions
FROM preds
GROUP BY Survived_model_name, Survived_model_version