PQL Examples
This page serves as a quick reference as you embark on your journey with PQL. These queries use the Twitter Bots dataset, but can be easily modified for your use case.
Introduction
Definitions
SQL: Structured Query Language
PQL: Predictive Query Language (an extension of SQL!)
Overview
In SQL, you typically query against a database, which contains one or more tables. Each table is identified by a name (e.g. "Customers" or "Orders") and tables contain records (rows) with data.
In PQL, you treat your datasets as individual tables that you can query and run predictions against.
- SQL/PQL keywords are NOT case sensitive: select is the same as SELECT. Upper-case is used below as a common practice for readability, but is not necessary.
- In PQL, semicolons are optional unless you are writing multiple commands in one query.
Part 1: Exploring your data (SQL Review)
SELECT keyword
Fetch the entire dataset; query time depends on the size of the dataset
SELECT * FROM twitter_bots;
Fetch sample of 20 rows
SELECT * FROM twitter_bots LIMIT 20;
Fetch specific columns from those 20 rows
SELECT description, followers_count, screen_name FROM twitter_bots LIMIT 20;
DISTINCT keyword
Determine the unique values for a given field (i.e. unique languages)
SELECT DISTINCT lang FROM twitter_bots;
Determine the count of unique values for a given field (i.e. # of unique languages)
SELECT COUNT(DISTINCT lang) FROM twitter_bots;
WHERE keyword
Filter rows using the WHERE syntax
SELECT * FROM twitter_bots WHERE verified = True;
Filter rows where verified is not equal to true (i.e. false)
SELECT * FROM twitter_bots WHERE verified <> True;
BETWEEN keyword
Filter rows where friends_count is between two values.
- Note: Numerical values don't need quotes
SELECT * FROM twitter_bots WHERE friends_count BETWEEN 100 and 1000;
LIKE keyword
Filter by rows where the description contains the word 'bot'
- Note: % indicates a wildcard.
SELECT * FROM twitter_bots WHERE description LIKE '% bot %';
IN keyword
Filter by specified values for a categorical feature
SELECT * FROM twitter_bots WHERE lang IN ('en', 'ca');
AND / OR keyword
You can also use AND, OR in your queries to combine conditions
SELECT * FROM twitter_bots WHERE friends_count < 100 AND (lang='en' OR lang='ca');
ORDER BY keyword
ORDER BY allows you to sort the results by a given field.
Default ORDER BY is ascending order ASC
SELECT * FROM twitter_bots ORDER BY account_age_days DESC;
Use Single Quotes for String Literals
In keeping with the ANSI-SQL standard, remember to use single quote ('), not double quotes ("), when referring to strings
SELECT * FROM twitter_bots WHERE lang <> 'en';
Find Degree of Imbalance
To understand how your data is distributed, you can write a query that returns the number of occurrences for each unique value.
SELECT account_type, count(*) FROM twitter_bots
GROUP BY account_type;
Part 2: Model Performance with PQL
Run inference (with default model)
Run inference using a given target (i.e. account_type)
PREDICT account_type GIVEN
SELECT * FROM twitter_bots LIMIT 20;
Run inference (with confidence)
PREDICT account_type WITH CONFIDENCE GIVEN
SELECT * FROM twitter_bots LIMIT 20;
Run inference (with explanations)
PREDICT account_type WITH EXPLANATION GIVEN
SELECT * FROM twitter_bots LIMIT 20;
Evaluate Keyword
EVALUATE allows you to view model performance metrics given different slices of your data.
Example showing loss & accuracy on entire dataset:
EVALUATE account_type GIVEN SELECT * FROM twitter_bots;
Example showing metrics for users where language was English:
EVALUATE account_type GIVEN SELECT * FROM twitter_bots WHERE lang = 'en';
Example showing metrics for users where language was NOT English:
EVALUATE account_type GIVEN SELECT * FROM twitter_bots WHERE lang <> 'en';
Drift Detection & Retraining
With PQL and the Evaluate
command, you can easily compare model performance across model versions. This can be especially useful for use-cases around drift detection and retraining.
Example 1A: Identify drift by comparing model performance across versions
with compare_performance as (
SELECT * FROM (
EVALUATE account_type
GIVEN SELECT * FROM twitter_bots WHERE created_at <= '2016-07-01'
)
UNION ALL SELECT * FROM (
EVALUATE account_type
GIVEN SELECT * FROM twitter_bots WHERE created_at > '2016-07-01' AND created_at <= '2016-12-31'
)
)
SELECT * FROM compare_performance
WHERE metric_name = 'accuracy';
Example 1B: Given performance degredation, users can retrain a new model on fresh data
RETRAIN MODEL "Twitter Bots";
Part 3: Hypothetical Queries
What if I vary the feature account_age_days?
PREDICT account_type WITH CONFIDENCE
GIVEN
(SELECT * FROM twitter_bots WHERE favourites_count>500 LIMIT 1),
account_age_days=ARRAY[1, 100, 1000, 5000]
;
What if I run inference on a new, single example?
PREDICT account_type GIVEN
description='I am a bot',
followers_count=100
friends_count=10,
lang="en",
screen_name='coolest_demo',
statuses_count=20,
verified='false',
account_age_days=100
;
Part 4: Art of the possible
Where was the model incorrect?
Query to find where the model was wrong (Predictions differ from Ground Truth)
WITH bot_incorrect AS (
SELECT * FROM (
PREDICT account_type
GIVEN SELECT * FROM twitter_bots
)
WHERE account_type <> account_type_predictions
)
SELECT *
FROM bot_incorrect;
How likely is a given example a human?
Query to rank twitter bots examples by their "human probability"
WITH results AS (
SELECT
CASE
WHEN account_type_predictions = 'human' THEN account_type_probability
ELSE 1.0 - account_type_probability
END AS human_probability,
*
FROM
(PREDICT account_type WITH CONFIDENCE GIVEN
SELECT * FROM twitter_bots
LIMIT 100)
)
SELECT *
FROM results
ORDER BY human_probability;
Where was the model most confident?
Query to get the predictions ordered by confidence
WITH results AS (
PREDICT account_type WITH CONFIDENCE
GIVEN SELECT * FROM twitter_bots
)
SELECT *
FROM results
ORDER BY account_type_probability;
Which examples were least likely to be a bot?
Order the twitter bots dataset from least likely to be a bot to most likely to be a bot (according to the model)
WITH results AS (
PREDICT account_type WITH PROBABILITIES OF ('bot')
GIVEN SELECT * FROM twitter_bots
)
SELECT *
FROM results
ORDER BY account_type_probabilities_bot;
How can I see the false positives of the model?
FP Query: Extension of last query to find false positives
Note the query below can be easily modified to find TP / TN / FN as well by replacing the WHERE clause with:
- True Positives:
WHERE account_type = 'bot' AND account_type_predictions = 'bot'
- True Negatives:
WHERE account_type = 'human' AND account_type_predictions = 'human'
- False Negatives:
WHERE account_type = 'bot' AND account_type_predictions = 'human'
WITH bot_false_positives AS (
SELECT * FROM (
PREDICT account_type WITH PROBABILITIES OF ('bot')
GIVEN SELECT * FROM twitter_bots
)
WHERE account_type = 'human' AND account_type_predictions = 'bot'
)
SELECT *
FROM bot_false_positives
ORDER BY account_type_probabilities_bot DESC
LIMIT 10;