Real-time Serving (Beta)
Deploying a trained model version for real-time serving hosts your model on a dedicated serving engine, enabling high-throughput inference via a hosted endpoint.
Creating a Deployment
Model deployments can be created through the UI, PQL, or the SDK. To deploy via the UI, click the Deploy button at the top right of the Model Version page for the model version you'd like to deploy.
Clicking the Deploy button brings up the "Create Deployment" modal, which asks for three inputs:
- The deployment name. This should be unique across the organization.
- The serving engine. This is the engine that will host your deployment.
- A comment. This can be anything helpful for describing the deployment in detail.
Once you have clicked Deploy, the model will begin the deployment process. Note that this can take up to 10 minutes, especially if it's the first time deploying to a particular serving engine.
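If you prefer to script deployment creation, the sketch below shows what this could look like through the Python SDK. The client entry point, the create_deployment call, and its parameters are assumptions for illustration; consult the SDK reference for the actual names.

```python
# Minimal sketch of creating a deployment via the SDK.
# All client/method names below are illustrative assumptions, not the
# documented API; consult the SDK reference for the real surface.
from predibase import PredibaseClient  # assumed SDK entry point

pc = PredibaseClient()  # assumes credentials are picked up from your environment

deployment = pc.create_deployment(  # hypothetical method
    name="fraud-scorer-v3",          # must be unique across the organization
    engine="realtime-engine-1",      # serving engine that will host the deployment
    model="fraud_model",             # model repository containing the version
    version=3,                       # model version to deploy
    comment="First real-time rollout of the v3 fraud model",
)
print(deployment)  # the deployment may take up to ~10 minutes to become ready
```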
Dropping a Deployment
Model deployments can be deleted through PQL, the SDK, or the UI. To drop a deployment through the UI, click the trash bin icon at the top right of the page.
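A corresponding SDK call might look like the sketch below; the delete_deployment method name is an assumption for illustration.

```python
# Minimal sketch of dropping a deployment via the SDK.
# Names are illustrative assumptions; check the SDK reference.
from predibase import PredibaseClient  # assumed SDK entry point

pc = PredibaseClient()
pc.delete_deployment("fraud-scorer-v3")  # hypothetical method; tears down the hosted endpoint
```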
Listing Deployments
Model deployments can be listed through PQL, the SDK, or the UI. To see all deployments for a given Model Repository, click the Deployments tab on the Model Repository page.
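Through the SDK, listing might look like the following sketch; the method and attribute names are assumptions for illustration.

```python
# Minimal sketch of listing deployments via the SDK.
# Names and attributes are illustrative assumptions; check the SDK reference.
from predibase import PredibaseClient  # assumed SDK entry point

pc = PredibaseClient()
for dep in pc.list_deployments():            # hypothetical method
    print(dep.name, dep.engine, dep.status)  # hypothetical attributes
```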
Using Deployments to Run Inference
Currently, you can run inference against deployed models through PQL or the SDK.
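Via the SDK, a prediction request could look like the sketch below. The predict method and its signature are assumptions for illustration; the input is an example feature row, not a real schema.

```python
# Minimal sketch of running inference against a deployment via the SDK.
# Names are illustrative assumptions; check the SDK reference.
import pandas as pd

from predibase import PredibaseClient  # assumed SDK entry point

pc = PredibaseClient()
rows = pd.DataFrame([
    {"amount": 129.99, "merchant": "acme"},  # example feature row
])
predictions = pc.predict("fraud-scorer-v3", rows)  # hypothetical method
print(predictions)
```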