Connections & Datasets
A connection represents credentials stored by Predibase. These can be S3 access keys, database credentials, and more. Once a connection is created, you can import objects or tables from that connection to create datasets. You can create a connection through any interface: the UI, the Python SDK, or PQL.
A dataset represents a data object in Predibase which is used for training models and running queries.
1) Navigate to the Data tab, then click the Connect Data button
2) Select a connector type
3) Enter credentials
4) Import datasets from connection
5) Wait for the dataset to finish connecting; it can then be used for training and queries
When you connect a dataset for the first time, Predibase computes a dataset profile using a sample of the data. The Dataset Detail page shows information about the dataset such as disk size, number of examples, feature information, sampled examples, and other metadata.
A common pattern for datasets containing image, audio, or similar data types is to store the image or audio files in a storage service like S3. In Predibase, you can use these unstructured data types by ensuring that the dataset references each file by URI. For example, a dataset might pair a column of file URIs with a column of labels.
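As a rough illustration of a dataset that references files by URI, the sketch below builds a small CSV with one URI column and one label column. The bucket name, file names, and column names (`image_path`, `label`) are hypothetical and chosen only for this example; Predibase does not require these particular names.

```python
import csv
import io

# Hypothetical rows: each example pairs a label with a URI that points
# at a file in object storage. Bucket and key names are made up.
rows = [
    {"image_path": "s3://my-bucket/images/cat_001.png", "label": "cat"},
    {"image_path": "s3://my-bucket/images/dog_001.png", "label": "dog"},
]

# Write the rows to an in-memory CSV so the example has no side effects.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["image_path", "label"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)

dataset_csv = buffer.getvalue()
print(dataset_csv)
```

The tabular file holds only references; the image bytes stay in the storage bucket.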
External connections allow you to provide credentials for accessing these kinds of remote files on a per-column basis. To configure an external connection, follow the steps below:
1. Connect your dataset to Predibase
First, connect the dataset that contains the URI references to Predibase using one of the data connectors.
2. Create a connection containing the credentials to your storage bucket(s).
Once your dataset is connected, create a connection to where the unstructured data (image, audio, etc.) lives. Typically, this will be a data storage provider like S3 or GCS.
- IMPORTANT: Ensure that your URI references use "native" URI formats (e.g., "s3://my-bucket/my-image.png" instead of "https://s3.us-west-1.amazonaws.com/...").
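If a dataset already contains HTTPS-style S3 URLs, one way to prepare it is to rewrite them as native `s3://` URIs before connecting the dataset. The helper below is a minimal sketch, not part of Predibase: `to_native_s3_uri` is a hypothetical function name, and it assumes standard AWS endpoints in either path style or virtual-hosted style. Query strings (for example, from presigned URLs) are dropped.

```python
import re
from urllib.parse import unquote, urlparse

# Path style: https://s3.<region>.amazonaws.com/<bucket>/<key>
_PATH_STYLE = re.compile(r"^s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com$")
# Virtual-hosted style: https://<bucket>.s3.<region>.amazonaws.com/<key>
_VIRTUAL_HOSTED = re.compile(
    r"^(?P<bucket>.+)\.s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com$"
)


def to_native_s3_uri(url: str) -> str:
    """Convert an HTTPS S3 URL to s3://bucket/key (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme == "s3":
        return url  # Already in native form.
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme in {url!r}")

    host = parsed.hostname or ""
    path = unquote(parsed.path).lstrip("/")

    if _PATH_STYLE.match(host):
        # The first path segment is the bucket; the rest is the key.
        bucket, _, key = path.partition("/")
    else:
        match = _VIRTUAL_HOSTED.match(host)
        if match is None:
            raise ValueError(f"Not a recognized S3 endpoint: {url!r}")
        bucket, key = match.group("bucket"), path

    if not bucket or not key:
        raise ValueError(f"Could not find bucket and key in {url!r}")
    return f"s3://{bucket}/{key}"


print(to_native_s3_uri(
    "https://s3.us-west-1.amazonaws.com/my-bucket/images/cat.png"
))
```

Applying a function like this to the URI column before upload keeps every reference in the same native format.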
3. Link the "external" connection from Step 2 to the dataset in Step 1
To link the external connection you just created, click into the dataset from Step 1 to open the dataset details page. On the dataset details page, find the field that contains the reference URIs and click the Create button for the field you wish to link. After clicking Create, select the desired connection in the configuration window.
You should then see the configured connection linked in the dataset schema.
Nice! You've successfully created an external connection and linked it to the relevant dataset. Predibase will now inject the appropriate credentials from the external connection into training and prediction where relevant, and remove them once the data read is complete.
The end result is you can now train and predict on the unified dataset in a seamless way.
Currently, external connections are supported only during training and prediction. This means that query results containing columns linked to an external connection may display default placeholder values. This is a display limitation only. Support for displaying thumbnails and previews of externally linked data in query results is coming soon.