SELECT
SELECT
can be used to obtain rows from the dataset and to filter results
returned by prediction queries.
Syntax
SELECT feature_name [, feature_name]*
FROM { dataset_name | query }
[ WHERE where_condition ]
[ ORDER BY sort_col { ASC | DESC } [ LIMIT row_count ] ]
Features
Similar to normal SQL, a *
will return all columns from the dataset, or a comma-separated
list of columns can be provided.
Prediction outputs follow the form <feature_name>_<property>[_<category>]
for the following properties:
predictions
the predicted column valuesprobability
the model confidence expressed as probability of true class fromWITH confidence
probabilities
the model probabilities for eachcategory
expressed as probability fromWITH probabilities
explanation
the model explanation fromWITH explanation
Source
The select source can be a dataset, another SQL query or a PQL PREDICT or EVALUATE query.
Where
An optional where clause that may include any of the columns from the sub-query.
Order by
Results can optionally be ordered.
Limit
Results can be limited on rows returned.
Examples
Select from dataset
Select queries can retrieve results from datasets filter on specific columns and returning a subset like so:
SELECT * FROM dataset_name where split = 'test' limit 100
Select across datasets
Select queries can fetch data from multiple datasets that can be from the current connection ( dataset_one
) or from another connection (connection_two.dataset_two
) eg:
SELECT ds1.*
FROM dataset_one ds1
INNER JOIN connection_two.dataset_two ds2 on ds1.id = ds2.id
If these queries are from the same connection then the join will be pushed down to the source system. If these queries cross connections (for example one is from a data warehouse and other is from an object store), then the join will happen in Predibase with the filter being pushed down where possible.
Select model errors
In cases where the true labels are known, nested selection can be used to slice model predictions to measure performance and understand what the models gets right / wrong.
SELECT * FROM (
PREDICT Survived
GIVEN
select * from titanic
where Survived is not null
) WHERE Survived <> Survived_predictions
Select high confidence predictions
If we want to use these predictions in production, we can decide what level of confidence we will trust based on our data / problem. For example, if we only want to work with predictions with confidence above 0.7, we can do the following:
SELECT PassengerId, Survived_predictions FROM (
PREDICT Survived
WITH confidence
GIVEN
SELECT * from titanic
where Survived is not null
limit 10
) WHERE Survived_probability > 0.7
With aggregation
In order to perform error analysis you can use a WITH
query to return one or more specific results sets which you can group by to get statistics like in the following example
WITH survived_errors as (
SELECT Survived, Survived_probabilities_True FROM (
PREDICT Survived WITH PROBABILITIES OF ('True', 'False')
GIVEN SELECT * from titanic
)
WHERE Survived <> Survived_predictions
)
SELECT
Survived,
COUNT(*) as error_count,
MIN(Survived_probabilities_True) min_prob,
AVG(Survived_probabilities_True) avg_prob,
MAX(Survived_probabilities_True) max_prob
FROM survived_errors
GROUP BY Survived
HAVING COUNT(*) > 10