# SELECT

`SELECT`

can be used to obtain rows from the dataset and to filter results
returned by prediction queries.

## Syntax

`SELECT feature_name [, feature_name]* `

FROM { dataset_name | query }

[ WHERE where_condition ]

[ ORDER BY sort_col { ASC | DESC } [ LIMIT row_count ] ]

### Features

Similar to normal SQL, a `*`

will return all columns from the dataset, or a comma-separated
list of columns can be provided.

Prediction outputs follow the form `<feature_name>_<property>[_<category>]`

for the following properties:

`predictions`

the predicted column values`probability`

the model confidence expressed as probability of true class from`WITH confidence`

`probabilities`

the model probabilities for each`category`

expressed as probability from`WITH probabilities`

`explanation`

the model explanation from`WITH explanation`

### Source

The select source can be a dataset, another SQL query or a PQL PREDICT or EVALUATE query.

#### Where

An optional where clause that may include any of the columns from the sub-query.

#### Order by

Results can optionally be ordered.

#### Limit

Results can be limited on rows returned.

## Examples

### Select from dataset

Select queries can retrieve results from datasets filter on specific columns and returning a subset like so:

`SELECT * FROM dataset_name where split = 'test' limit 100`

### Select across datasets

Select queries can fetch data from multiple datasets that can be from the current connection ( `dataset_one`

) or from another connection (`connection_two.dataset_two`

) eg:

`SELECT ds1.*`

FROM dataset_one ds1

INNER JOIN connection_two.dataset_two ds2 on ds1.id = ds2.id

If these queries are from the same connection then the join will be pushed down to the source system. If these queries cross connections (for example one is from a data warehouse and other is from an object store), then the join will happen in Predibase with the filter being pushed down where possible.

### Select model errors

In cases where the true labels are known, nested selection can be used to slice model predictions to measure performance and understand what the models gets right / wrong.

`SELECT * FROM (`

PREDICT Survived

GIVEN

select * from titanic

where Survived is not null

) WHERE Survived <> Survived_predictions

### Select high confidence predictions

If we want to use these predictions in production, we can decide what level of confidence we will trust based on our data / problem. For example, if we only want to work with predictions with confidence above 0.7, we can do the following:

`SELECT PassengerId, Survived_predictions FROM (`

PREDICT Survived

WITH confidence

GIVEN

SELECT * from titanic

where Survived is not null

limit 10

) WHERE Survived_probability > 0.7

### With aggregation

In order to perform error analysis you can use a `WITH`

query to return one or more specific results sets which you can group by to get statistics like in the following example

`WITH survived_errors as (`

SELECT Survived, Survived_probabilities_True FROM (

PREDICT Survived WITH PROBABILITIES OF ('True', 'False')

GIVEN SELECT * from titanic

)

WHERE Survived <> Survived_predictions

)

SELECT

Survived,

COUNT(*) as error_count,

MIN(Survived_probabilities_True) min_prob,

AVG(Survived_probabilities_True) avg_prob,

MAX(Survived_probabilities_True) max_prob

FROM survived_errors

GROUP BY Survived

HAVING COUNT(*) > 10