CREATE CONNECTION

CREATE CONNECTION creates a new data source from an external connection.

Syntax

Following is the grammar for creating a connection, with connection_parameters dependent on the connection type (see Connection parameters below). For blob stores, a connection URI can optionally be provided as the base path.

CREATE [ OR REPLACE ] CONNECTION [ IF NOT EXISTS ] connection_name
CONNECTION_TYPE = { adls | gcs | s3 | mysql | postgresql | redshift | snowflake | bigquery | databricks }
connection_parameters

The Predibase parser adheres to ANSI/ISO SQL standards and supports a superset of the grammar supported by the Apache Calcite parser.

When a parameter is a choice of one or more values, the alternatives are enclosed in curly brackets { }. Optional parameters are enclosed in square brackets [ ].

Strings are enclosed in single quotes, e.g. 'string', and identifiers are escaped with double quotes, e.g. "identifier".
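
For example, the optional clauses can be used as follows (a hypothetical S3 connection whose credentials are placeholders):

CREATE OR REPLACE CONNECTION my_connection
CONNECTION_TYPE = s3
AWS_ACCESS_KEY_ID = 'my_access_key_id'
AWS_SECRET_ACCESS_KEY = 'my_secret_access_key'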

Connection name

A unique name for the data source in Predibase. This name can be used to resolve datasets and features in case of ambiguity, for example in the FROM clause:

SELECT feature_name FROM connection_name.dataset_name

Connection type

The type of external data source to connect to. One of:

  • adls: Azure Data Lake Storage.
  • gcs: Google Cloud Storage.
  • s3: Amazon S3.
  • mysql: MySQL database.
  • postgresql: PostgreSQL database.
  • redshift: Amazon Redshift data warehouse.
  • snowflake: Snowflake data warehouse.
  • bigquery: Google BigQuery data warehouse.
  • databricks: Databricks SQL.

Note that local file uploads cannot be connected in this way. Please use the Predibase UI for this.

Connection parameters

The required connection parameters depend on the connection type.

ADLS

ADLS requires an access key or a connection string, and optionally a protocol, for example:

CONNECTION_TYPE = adls
[ AZURE_ACCESS_KEY = 'account_key' ]
[ AZURE_CONNECTION_STRING = 'connection_string' ]
[ AZURE_PROTOCOL = { 'abfs' } ]
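
For example, a minimal sketch of an ADLS connection authenticated with an access key (the key value is a placeholder):

CREATE CONNECTION adls_connection
CONNECTION_TYPE = adls
AZURE_ACCESS_KEY = 'my_account_key'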

GCS

GCS requires a GCP JSON token string:

CONNECTION_TYPE = gcs
GCP_TOKEN = 'json_string'
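
For example, a sketch of a GCS connection, where the token value stands in for the contents of a GCP service account JSON key:

CREATE CONNECTION gcs_connection
CONNECTION_TYPE = gcs
GCP_TOKEN = '{"type": "service_account", ...}'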

S3

S3 requires an Access Key ID and a Secret Access Key, and optionally a Role ARN:

CONNECTION_TYPE = s3
AWS_ACCESS_KEY_ID = 'access_key_id'
AWS_SECRET_ACCESS_KEY = 'secret_access_key'
[ AWS_ROLE_ARN = 'arn' ]
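
For example, a sketch of an S3 connection that also supplies the optional role ARN (all values are placeholders):

CREATE CONNECTION s3_role_connection
CONNECTION_TYPE = s3
AWS_ACCESS_KEY_ID = 'my_access_key_id'
AWS_SECRET_ACCESS_KEY = 'my_secret_access_key'
AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_role'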

Snowflake

Snowflake requires a warehouse, database, and schema name, as well as a username and password:

CONNECTION_TYPE = snowflake
WAREHOUSE_NAME = 'warehouse_name'
DATABASE_NAME = 'database_name'
SCHEMA_NAME = 'schema_name'
USERNAME = 'user'
PASSWORD = 'pwd'
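
For example, a sketch of a Snowflake connection (all values are placeholders):

CREATE CONNECTION snowflake_connection
CONNECTION_TYPE = snowflake
WAREHOUSE_NAME = 'my_warehouse'
DATABASE_NAME = 'my_database'
SCHEMA_NAME = 'public'
USERNAME = 'my_username'
PASSWORD = 'my_password'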

BigQuery

BigQuery requires a GCP token, as well as a project and dataset name:

CONNECTION_TYPE = bigquery
GCP_TOKEN = 'json_string'
PROJECT_NAME = 'project_name'
DATASET_NAME = 'dataset_name'
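
For example, a sketch of a BigQuery connection, where the token again stands in for a GCP service account JSON key and the other values are placeholders:

CREATE CONNECTION bigquery_connection
CONNECTION_TYPE = bigquery
GCP_TOKEN = '{"type": "service_account", ...}'
PROJECT_NAME = 'my_project'
DATASET_NAME = 'my_dataset'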

Databricks

Databricks SQL requires a personal access token, a server host name, and an HTTP path, and optionally a schema name:

CONNECTION_TYPE = databricks
ACCESS_TOKEN = 'access_token'
SERVER_HOST_NAME = 'server_host_name'
HTTP_PATH = 'http_path'
[ SCHEMA_NAME = 'schema_name' ]
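
For example, a sketch of a Databricks SQL connection (all values are placeholders):

CREATE CONNECTION databricks_connection
CONNECTION_TYPE = databricks
ACCESS_TOKEN = 'my_access_token'
SERVER_HOST_NAME = 'my-workspace.cloud.databricks.com'
HTTP_PATH = '/sql/1.0/warehouses/my_warehouse_id'
SCHEMA_NAME = 'default'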

MySQL, PostgreSQL, Redshift

All of these databases require a database address (host and port), a database and schema name, as well as a username and password:

CONNECTION_TYPE = { mysql | postgresql | redshift }
DATABASE_ADDRESS = 'host:port'
DATABASE_NAME = 'database_name'
SCHEMA_NAME = 'schema_name'
USERNAME = 'user'
PASSWORD = 'pwd'

Examples

Here is an example of how to create a connection for a MySQL database:

CREATE CONNECTION mysql_connection
CONNECTION_TYPE = mysql
DATABASE_ADDRESS = '127.0.0.1:3306'
DATABASE_NAME = 'my_database'
SCHEMA_NAME = 'public'
USERNAME = 'my_username'
PASSWORD = 'my_password'

Here is an example of how to create an S3 connection:

CREATE CONNECTION s3_connection
CONNECTION_TYPE = s3
AWS_ACCESS_KEY_ID = 'xxx'
AWS_SECRET_ACCESS_KEY = 'xxx'