Skip to main content

SELECT

SELECT can be used to obtain rows from the dataset and to filter results returned by prediction queries.

Syntax

SELECT feature_name [, feature_name]* 
FROM { dataset_name | query }
[ WHERE where_condition ]
[ ORDER BY sort_col { ASC | DESC } [ LIMIT row_count ] ]

Features

Similar to normal SQL, a * will return all columns from the dataset, or a comma-separated list of columns can be provided.

Prediction outputs follow the form <feature_name>_<property>[_<category>] for the following properties:

  • predictions the predicted column values
  • probability the model confidence expressed as probability of true class from WITH confidence
  • probabilities the model probabilities for each category expressed as probability from WITH probabilities
  • explanation the model explanation from WITH explanation

Source

The select source can be a dataset, another SQL query or a PQL PREDICT or EVALUATE query.

Where

An optional where clause that may include any of the columns from the sub-query.

Order by

Results can optionally be ordered.

Limit

Results can be limited on rows returned.

Examples

Select from dataset

Select queries can retrieve results from datasets filter on specific columns and returning a subset like so:

SELECT * FROM dataset_name where split = 'test' limit 100

Select across datasets

Select queries can fetch data from multiple datasets that can be from the current connection ( dataset_one) or from another connection (connection_two.dataset_two) eg:

SELECT ds1.*
FROM dataset_one ds1
INNER JOIN connection_two.dataset_two ds2 on ds1.id = ds2.id

If these queries are from the same connection then the join will be pushed down to the source system. If these queries cross connections (for example one is from a data warehouse and other is from an object store), then the join will happen in Predibase with the filter being pushed down where possible.

Select model errors

In cases where the true labels are known, nested selection can be used to slice model predictions to measure performance and understand what the models gets right / wrong.

SELECT * FROM (
PREDICT Survived
GIVEN
select * from titanic
where Survived is not null
) WHERE Survived <> Survived_predictions

Select high confidence predictions

If we want to use these predictions in production, we can decide what level of confidence we will trust based on our data / problem. For example, if we only want to work with predictions with confidence above 0.7, we can do the following:

SELECT PassengerId, Survived_predictions FROM (
PREDICT Survived
WITH confidence
GIVEN
SELECT * from titanic
where Survived is not null
limit 10
) WHERE Survived_probability > 0.7

With aggregation

In order to perform error analysis you can use a WITH query to return one or more specific results sets which you can group by to get statistics like in the following example

WITH survived_errors as (
SELECT Survived, Survived_probabilities_True FROM (
PREDICT Survived WITH PROBABILITIES OF ('True', 'False')
GIVEN SELECT * from titanic
)
WHERE Survived <> Survived_predictions
)
SELECT
Survived,
COUNT(*) as error_count,
MIN(Survived_probabilities_True) min_prob,
AVG(Survived_probabilities_True) avg_prob,
MAX(Survived_probabilities_True) max_prob
FROM survived_errors
GROUP BY Survived
HAVING COUNT(*) > 10