Private Deployments
Deploy models in your own environment
Private deployments provide serverless, dedicated model instances with:
- Automatic scaling from zero to meet demand
- Full control over resources and configuration
- Higher throughput and lower latency
- Production-ready SLAs
- Pay only for what you use
Quick Start
First, install the Predibase Python SDK:
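The SDK is distributed on PyPI:

```shell
pip install -U predibase
```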
Here’s how to create and use your first serverless deployment:
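A minimal sketch of the flow. The config fields mirror the DeploymentConfig reference; the model name is illustrative, and the SDK calls (shown in comments, since they require an API token) may differ slightly by SDK version:

```python
# Sketch of a first serverless deployment (illustrative, not the official snippet).

def quickstart_config(base_model: str) -> dict:
    """Build the config payload passed to pb.deployments.create()."""
    return {
        "base_model": base_model,
        "min_replicas": 0,  # scale to zero when idle
        "max_replicas": 1,  # a single replica is enough to start
    }

cfg = quickstart_config("qwen3-8b")  # illustrative model name

# With the SDK, the calls would look like:
#   from predibase import Predibase, DeploymentConfig
#   pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
#   pb.deployments.create(name="my-first-deployment",
#                         config=DeploymentConfig(**cfg))
#   client = pb.deployments.client("my-first-deployment")
#   print(client.generate("What is a private deployment?").generated_text)

print(cfg)
```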
Deployment Options
Configure your deployment's scaling behavior to optimize for cost and performance. See the DeploymentConfig reference for the full definition and defaults.
Common configuration examples:
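Three illustrative configurations, expressed as plain dicts (field names follow the DeploymentConfig reference; the specific values are assumptions, not official defaults):

```python
# Cost-optimized: scale to zero when idle, tolerate cold starts.
cost_optimized = {
    "base_model": "qwen3-8b",
    "min_replicas": 0,
    "max_replicas": 1,
    "cooldown_time": 300,  # seconds of inactivity before scaling down
}

# Latency-optimized: keep one replica warm at all times.
always_on = {
    "base_model": "qwen3-8b",
    "min_replicas": 1,
    "max_replicas": 1,
}

# Throughput-optimized: allow scale-out under load.
high_throughput = {
    "base_model": "qwen3-8b",
    "min_replicas": 1,
    "max_replicas": 4,
    "cooldown_time": 600,
}
```

Pass any of these to `DeploymentConfig(**config)` when creating the deployment.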
GPU hardware
Your private model deployments can be run on the following GPUs and associated tiers. See our pricing.
| Accelerator | ID | Predibase Tiers | GPUs | SKU |
|---|---|---|---|---|
| 1 A10G 24GB | a10_24gb_100 | All | 1 | A10G |
| 1 L40S 48GB | l40s_48gb_100 | All | 1 | L40S |
| 1 L4 24GB | l4_24gb_100 | VPC | 1 | L4 |
| 1 A100 80GB | a100_80gb_100 | Developer, Enterprise SaaS | 1 | A100 |
| 2 A100 80GB | a100_80gb_200 | Enterprise SaaS | 2 | A100 |
| 4 A10G 24GB | a10_24gb_400 | Enterprise VPC | 4 | A10G |
| 1 H100 80GB PCIe | h100_80gb_pcie_100 | Enterprise SaaS and VPC | 1 | H100 |
| 1 H100 80GB SXM | h100_80gb_sxm_100 | Enterprise SaaS and VPC | 1 | H100 |
To deploy on H100s, multi-GPU (A100 or H100), or upgrade to Enterprise, please reach out to us at sales@predibase.com.
Managing Deployments
Monitoring Serverless Scaling
Track your deployment’s scaling behavior and resource usage:
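A small sketch of interpreting a deployment-status payload, such as the one returned by `pb.deployments.get(...)`. The field names here are assumptions rather than the SDK's exact schema:

```python
def scaling_summary(status: dict) -> str:
    """Summarize replica usage from a (hypothetical) status payload."""
    current = status["current_replicas"]
    if current == 0:
        return "scaled to zero (no GPU cost)"
    return f"{current}/{status['max_replicas']} replicas active"

# Example payload as it might appear while the deployment is idle:
status = {"current_replicas": 0, "min_replicas": 0, "max_replicas": 2}
print(scaling_summary(status))  # scaled to zero (no GPU cost)
```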
Update Configuration
Update deployment configuration with zero downtime.
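A sketch of preparing an updated config; the update call itself is commented out because it requires an authenticated client, and its exact signature may differ by SDK version:

```python
def with_updates(config: dict, **changes) -> dict:
    """Return a copy of a deployment config with the given fields overridden."""
    updated = dict(config)
    updated.update(changes)
    return updated

current = {"base_model": "qwen3-8b", "min_replicas": 0, "max_replicas": 1}
new_config = with_updates(current, max_replicas=2, cooldown_time=600)

# With the SDK (illustrative; applied without downtime):
#   pb.deployments.update(deployment_ref="my-deployment",
#                         config=UpdateDeploymentConfig(**new_config))
print(new_config)
```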
Delete Deployment
Clean up resources when they’re no longer needed. Serverless deployments incur no cost when deleted:
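The delete itself is a single SDK call (commented out here, since it requires an authenticated client); the small helper just illustrates guarding a destructive operation by name:

```python
def confirm_target(name: str, expected: str) -> bool:
    """Return True only when the deployment name matches exactly."""
    return name == expected

if confirm_target("my-deployment", "my-deployment"):
    # pb.deployments.delete("my-deployment")  # illustrative SDK call
    print("deleted: my-deployment")
```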
Features
Speculative Decoding
While any base model can be trained as a Turbo LoRA, some models require additional deployment configurations to support adapter inference properly. See which models support turbo adapters out of the box, as denoted by the “Adapter Pre-load Not Required” column.
- If the fine-tuned base model does not require the adapter to be pre-loaded, you can use your Turbo LoRA adapter as normal (via private deployments and shared endpoints).
- If the fine-tuned base model requires the adapter to be pre-loaded, you'll need to create a private deployment like so:
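A sketch of a deployment config that pre-loads a Turbo LoRA adapter. The `preloaded_adapters` field name is an assumption based on the behavior described above; check the DeploymentConfig reference for the exact key:

```python
# Hypothetical config pre-loading an adapter at deployment time.
turbo_config = {
    "base_model": "qwen3-8b",             # illustrative base model
    "preloaded_adapters": ["my-repo/1"],  # adapter repo/version to bake in (field name assumed)
    "min_replicas": 0,
    "max_replicas": 1,
}

# pb.deployments.create(name="turbo-deployment",
#                       config=DeploymentConfig(**turbo_config))
print(turbo_config["preloaded_adapters"])
```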
Prefix Caching
Improve cold start performance by caching common prefixes:
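A hypothetical config fragment enabling the feature; the exact flag name is an assumption, so consult the DeploymentConfig reference:

```python
# Hypothetical: cache KV state for shared prompt prefixes across requests.
prefix_cache_config = {
    "base_model": "qwen3-8b",
    "prefix_caching": True,  # flag name assumed, not confirmed
}
```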
Request Logging
Enable comprehensive logging to track usage patterns and optimize scaling:
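A hypothetical config fragment turning on request logging; the flag name is an assumption:

```python
# Hypothetical: log request metadata to analyze usage and tune scaling.
logging_config = {
    "base_model": "qwen3-8b",
    "request_logging_enabled": True,  # flag name assumed, not confirmed
}
```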
Reserved Capacity
Enterprise customers can reserve capacity for their private deployments. Deployments backed by reserved capacity offer the following benefits over on-demand capacity:
- Lower cost: Reserved capacity offers lower per-hour costs than on-demand capacity.
- SLA-backed uptime: Reserved capacity is backed by a standard 99.9% uptime SLA (reach out to us if you need a higher SLA).
- SLA-backed cold start latency: Reserved capacity replicas are backed by a standard 30s p50 GPU acquisition latency SLA.
If you need to reserve capacity for your private deployments, please reach out to us at sales@predibase.com.
Setting Reserved Capacity
You can assign reserved capacity when creating a private deployment:
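A sketch of assigning reserved capacity at creation time; the `reservation_id` field name is hypothetical:

```python
# Hypothetical: pin the deployment to a capacity reservation at create time.
reserved_config = {
    "base_model": "qwen3-8b",
    "reservation_id": "<YOUR_RESERVATION_ID>",  # field name assumed
    "min_replicas": 1,
    "max_replicas": 2,
}

# pb.deployments.create(name="reserved-deployment",
#                       config=DeploymentConfig(**reserved_config))
```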
You can also assign capacity to or remove capacity from an existing private deployment:
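A sketch of moving an existing deployment on or off reserved capacity. Clearing the (hypothetical) `reservation_id` field falls back to on-demand capacity:

```python
# Hypothetical update payloads; field name assumed.
assign_reserved = {"reservation_id": "<YOUR_RESERVATION_ID>"}
remove_reserved = {"reservation_id": None}  # revert to on-demand capacity

# pb.deployments.update(deployment_ref="my-deployment",
#                       config=UpdateDeploymentConfig(**assign_reserved))
```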
Viewing Reserved Capacity
You can check your reservations in the UI under Settings > Reserved Capacity.
Direct Ingress
Direct ingress is a latency optimization and security feature for VPC customers, described in VPC Overview.
Prompting With Direct Ingress
Important: Direct ingress deployments can only be accessed from hosts within your peered VPC. These deployments are not visible outside the peered VPC and cannot be accessed through alternative network paths, including the standard control plane interface.
Finding Your Direct Ingress Endpoint
To obtain your deployment’s direct ingress endpoint:
- Navigate to the Configuration tab for your deployment
- Locate the endpoint URL (format: `vpce-0123456789-01abcde.vpce-svc-012345abc.us-west-2.vpce.amazonaws.com`)
SDK Integration
Use the following Python code to connect via the SDK:
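A sketch of pointing the SDK at the direct ingress endpoint instead of the public API; the `serving_endpoint` keyword is an assumption, so check the SDK reference for the exact parameter:

```python
# Direct ingress traffic never leaves the peered VPC, so a plain-HTTP base
# URL is used (see the security note in the REST API section).
DIRECT_INGRESS = "vpce-0123456789-01abcde.vpce-svc-012345abc.us-west-2.vpce.amazonaws.com"

def ingress_base_url(endpoint: str) -> str:
    """Build the base URL for a direct ingress endpoint."""
    return f"http://{endpoint}"

# With the SDK (illustrative; keyword name assumed):
#   from predibase import Predibase
#   pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
#   client = pb.deployments.client("my-deployment",
#                                  serving_endpoint=DIRECT_INGRESS)
#   print(client.generate("Hello").generated_text)
print(ingress_base_url(DIRECT_INGRESS))
```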
REST API
To use the REST API directly, you’ll need both your direct ingress endpoint and tenant ID (found in the Settings tab).
Note: While these requests use HTTP, they remain secure because all connections are routed through AWS PrivateLink and never leave AWS networks.
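A sketch of constructing the request. The URL path shown follows Predibase's public serving-route pattern but should be confirmed against the REST API reference; the actual POST is commented out since it only works from inside the peered VPC:

```python
def generate_url(endpoint: str, tenant_id: str, deployment: str) -> str:
    """Build a direct-ingress generate URL (path pattern assumed)."""
    # Plain HTTP is acceptable: traffic stays on AWS PrivateLink.
    return f"http://{endpoint}/{tenant_id}/deployments/v2/llms/{deployment}/generate"

url = generate_url(
    "vpce-0123456789-01abcde.vpce-svc-012345abc.us-west-2.vpce.amazonaws.com",
    "<TENANT_ID>",
    "my-deployment",
)

# import requests
# resp = requests.post(
#     url,
#     json={"inputs": "What is a private deployment?"},
#     headers={"Authorization": "Bearer <PREDIBASE_API_TOKEN>"},
# )
print(url)
```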
Cost Optimization
Optimize your serverless deployment costs:
- Scale to Zero: Set `min_replicas=0` to avoid paying for GPU hours when your deployment isn't receiving requests
- Cooldown Time: Adjust based on your usage patterns
- Hardware Selection: Choose cost-effective GPUs for smaller models
- Batch Processing: Use batch inference for large workloads
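As a back-of-the-envelope illustration of how cooldown time interacts with scale-to-zero billing (the traffic pattern and the rule that each burst keeps the replica up for one cooldown window are simplifying assumptions, not a billing formula):

```python
def billed_gpu_hours(active_hours: float, bursts: int, cooldown_seconds: int) -> float:
    """Approximate billed hours: active time plus one cooldown window per burst."""
    return active_hours + bursts * cooldown_seconds / 3600

# 2 hours of real traffic spread over 10 bursts, with a 5-minute cooldown:
hours = billed_gpu_hours(2.0, 10, 300)
print(round(hours, 2))  # ~2.83 GPU hours billed
```

Shortening the cooldown reduces the idle tail after each burst, at the cost of more frequent cold starts.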
Next Steps
For more information, see: