Skip to main content

PQL Examples

This page serves as a quick reference as you embark on your journey with PQL. These queries use the Twitter Bots dataset, but can be easily modified for your use case.

Introduction

Definitions

SQL: Structured Query Language
PQL: Predictive Query Language (an extension of SQL!)

Overview

In SQL, you typically query against a database, which contains one or more tables. Each table is identified by a name (e.g. "Customers" or "Orders") and tables contain records (rows) with data.

In PQL, you treat your datasets as individual tables that you can query and run predictions against.

  • SQL/PQL keywords are NOT case sensitive: select is the same as SELECT. Upper-case is used below as a common practice for readability, but is not necessary.
  • In PQL, semicolons are optional unless you are writing multiple commands in one query.

Part 1: Exploring your data (SQL Review)

SELECT keyword

Fetch the entire dataset; query time depends on the size of the dataset

SELECT * FROM twitter_bots;

Fetch sample of 20 rows

SELECT * FROM twitter_bots LIMIT 20;

Fetch specific columns from those 20 rows

SELECT description, followers_count, screen_name FROM twitter_bots LIMIT 20;

DISTINCT keyword

Determine the unique values for a given field (i.e. unique languages)

SELECT DISTINCT lang FROM twitter_bots;

Determine the count of unique values for a given field (i.e. # of unique languages)

SELECT COUNT(DISTINCT lang) FROM twitter_bots;

WHERE keyword

Filter rows using the WHERE syntax

SELECT * FROM twitter_bots WHERE verified = True;

Filter rows where verified is not equal to true (i.e. false)

SELECT * FROM twitter_bots WHERE verified <> True;

BETWEEN keyword

Filter rows where friends_count is between two values.

  • Note: Numerical values don't need quotes
SELECT * FROM twitter_bots WHERE friends_count BETWEEN 100 and 1000;

LIKE keyword

Filter by rows where the description contains the word 'bot'

  • Note: % indicates a wildcard.
SELECT * FROM twitter_bots WHERE description LIKE '% bot %';

IN keyword

Filter by specified values for a categorical feature

SELECT * FROM twitter_bots WHERE lang IN ('en', 'ca');

AND / OR keyword

You can also use AND, OR in your queries to combine conditions

SELECT * FROM twitter_bots WHERE friends_count < 100 AND (lang='en' OR lang='ca');

ORDER BY keyword

ORDER BY allows you to sort the results by a given field.
Default ORDER BY is ascending order ASC

SELECT * FROM twitter_bots ORDER BY account_age_days DESC;

Use Single Quotes for String Literals

In keeping with the ANSI-SQL standard, remember to use single quote ('), not double quotes ("), when referring to strings

SELECT * FROM twitter_bots WHERE lang <> 'en';

Find Degree of Imbalance

To understand how your data is distributed, you can write a query that returns the number of occurrences for each unique value.

SELECT account_type, count(*) FROM twitter_bots
GROUP BY account_type;

Part 2: Model Performance with PQL

Run inference (with default model)

Run inference using a given target (i.e. account_type)

PREDICT account_type GIVEN
SELECT * FROM twitter_bots LIMIT 20;

Run inference (with confidence)

PREDICT account_type WITH CONFIDENCE GIVEN
SELECT * FROM twitter_bots LIMIT 20;

Run inference (with explanations)

PREDICT account_type WITH EXPLANATION GIVEN
SELECT * FROM twitter_bots LIMIT 20;

Evaluate Keyword

EVALUATE allows you to view model performance metrics given different slices of your data.

Example showing loss & accuracy on entire dataset:

EVALUATE account_type GIVEN SELECT * FROM twitter_bots;

Example showing metrics for users where language was English:

EVALUATE account_type GIVEN SELECT * FROM twitter_bots WHERE lang = 'en';

Example showing metrics for users where language was NOT English:

EVALUATE account_type GIVEN SELECT * FROM twitter_bots WHERE lang <> 'en';

Drift Detection & Retraining

With PQL and the Evaluate command, you can easily compare model performance across model versions. This can be especially useful for use-cases around drift detection and retraining.

Example 1A: Identify drift by comparing model performance across versions

with compare_performance as (
SELECT * FROM (
EVALUATE account_type
GIVEN SELECT * FROM twitter_bots WHERE created_at <= '2016-07-01'
)
UNION ALL SELECT * FROM (
EVALUATE account_type
GIVEN SELECT * FROM twitter_bots WHERE created_at > '2016-07-01' AND created_at <= '2016-12-31'
)
)
SELECT * FROM compare_performance
WHERE metric_name = 'accuracy';

Example 1B: Given performance degredation, users can retrain a new model on fresh data

RETRAIN MODEL "Twitter Bots";

Part 3: Hypothetical Queries

What if I vary the feature account_age_days?

PREDICT account_type WITH CONFIDENCE 
GIVEN
(SELECT * FROM twitter_bots WHERE favourites_count>500 LIMIT 1),
account_age_days=ARRAY[1, 100, 1000, 5000]
;

What if I run inference on a new, single example?

PREDICT account_type GIVEN
description='I am a bot',
followers_count=100
friends_count=10,
lang="en",
screen_name='coolest_demo',
statuses_count=20,
verified='false',
account_age_days=100
;

Part 4: Art of the possible

Where was the model incorrect?

Query to find where the model was wrong (Predictions differ from Ground Truth)

WITH bot_incorrect AS (
SELECT * FROM (
PREDICT account_type
GIVEN SELECT * FROM twitter_bots
)
WHERE account_type <> account_type_predictions
)
SELECT *
FROM bot_incorrect;

How likely is a given example a human?

Query to rank twitter bots examples by their "human probability"

WITH results AS (
SELECT
CASE
WHEN account_type_predictions = 'human' THEN account_type_probability
ELSE 1.0 - account_type_probability
END AS human_probability,
*
FROM
(PREDICT account_type WITH CONFIDENCE GIVEN
SELECT * FROM twitter_bots
LIMIT 100)
)
SELECT *
FROM results
ORDER BY human_probability;

Where was the model most confident?

Query to get the predictions ordered by confidence

WITH results AS (
PREDICT account_type WITH CONFIDENCE
GIVEN SELECT * FROM twitter_bots
)
SELECT *
FROM results
ORDER BY account_type_probability;

Which examples were least likely to be a bot?

Order the twitter bots dataset from least likely to be a bot to most likely to be a bot (according to the model)

WITH results AS (
PREDICT account_type WITH PROBABILITIES OF ('bot')
GIVEN SELECT * FROM twitter_bots
)
SELECT *
FROM results
ORDER BY account_type_probabilities_bot;

How can I see the false positives of the model?

FP Query: Extension of last query to find false positives
Note the query below can be easily modified to find TP / TN / FN as well by replacing the WHERE clause with:

  • True Positives: WHERE account_type = 'bot' AND account_type_predictions = 'bot'
  • True Negatives: WHERE account_type = 'human' AND account_type_predictions = 'human'
  • False Negatives: WHERE account_type = 'bot' AND account_type_predictions = 'human'
WITH bot_false_positives AS (
SELECT * FROM (
PREDICT account_type WITH PROBABILITIES OF ('bot')
GIVEN SELECT * FROM twitter_bots
)
WHERE account_type = 'human' AND account_type_predictions = 'bot'
)
SELECT *
FROM bot_false_positives
ORDER BY account_type_probabilities_bot DESC
LIMIT 10;