Private Deployments
Deploy models in your own environment
Private deployments provide serverless, dedicated model instances with:
- Automatic scaling from zero to meet demand
- Full control over resources and configuration
- Higher throughput and lower latency than shared endpoints
- Production-ready SLAs
- Pay only for what you use
Quick Start
First, install the Predibase Python SDK:
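```bash
pip install -U predibase
```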
Here’s how to create and use your first serverless deployment:
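A minimal sketch, assuming an API token and the illustrative model slug `llama-3-1-8b-instruct` (check the available models list for valid identifiers):

```python
from predibase import Predibase, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Create a serverless deployment that scales to zero when idle.
pb.deployments.create(
    name="my-llama-3-1-8b",
    config=DeploymentConfig(
        base_model="llama-3-1-8b-instruct",  # illustrative model slug
        min_replicas=0,  # scale to zero when idle
        max_replicas=1,
    ),
)

# Prompt the deployment once it is ready.
client = pb.deployments.client("my-llama-3-1-8b")
response = client.generate("What is machine learning?", max_new_tokens=128)
print(response.generated_text)
```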
Deployment Options
Configure your deployment’s scaling behavior to optimize for cost and performance. See the DeploymentConfig reference for the full definition and defaults.
Common configuration examples:
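A few hedged sketches; `cooldown_time` (assumed to be in seconds) and the accelerator IDs are taken from the DeploymentConfig reference and the GPU table below:

```python
from predibase import DeploymentConfig

# Latency-sensitive: always keep one replica warm.
always_on = DeploymentConfig(
    base_model="llama-3-1-8b-instruct",
    min_replicas=1,
    max_replicas=2,
)

# Cost-sensitive: scale to zero after a quiet period.
scale_to_zero = DeploymentConfig(
    base_model="llama-3-1-8b-instruct",
    min_replicas=0,
    max_replicas=1,
    cooldown_time=3600,  # assumed seconds of inactivity before scale-down
)

# Pin the deployment to a specific accelerator (IDs in the table below).
pinned_gpu = DeploymentConfig(
    base_model="llama-3-1-8b-instruct",
    accelerator="a100_80gb_100",
)
```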
GPU hardware
Private model deployments can run on the following GPUs, depending on your Predibase tier. See our pricing for rates.
| Accelerator | ID | Predibase Tiers | GPUs | SKU |
|---|---|---|---|---|
| 1 A10G 24GB | `a10_24gb_100` | All | 1 | A10G |
| 1 L40S 48GB | `l40s_48gb_100` | All | 1 | L40S |
| 1 L4 24GB | `l4_24gb_100` | VPC | 1 | L4 |
| 1 A100 80GB | `a100_80gb_100` | Developer, Enterprise SaaS | 1 | A100 |
| 2 A100 80GB | `a100_80gb_200` | Enterprise SaaS | 2 | A100 |
| 4 A10G 24GB | `a10_24gb_400` | Enterprise VPC | 4 | A10G |
| 1 H100 80GB PCIe | `h100_80gb_pcie_100` | Enterprise SaaS and VPC | 1 | H100 |
| 1 H100 80GB SXM | `h100_80gb_sxm_100` | Enterprise SaaS and VPC | 1 | H100 |
To deploy on H100s or multi-GPU configurations (A100 or H100), or to upgrade to Enterprise, please reach out to us at sales@predibase.com.
Managing Deployments
Monitoring Serverless Scaling
Track your deployment’s scaling behavior and resource usage:
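A sketch of the relevant SDK calls; the exact fields exposed on the returned deployment object may vary by SDK version:

```python
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Inspect a single deployment's current state (status, scaling info, etc.).
deployment = pb.deployments.get("my-llama-3-1-8b")
print(deployment)

# Enumerate all deployments in your tenant.
for d in pb.deployments.list():
    print(d)
```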
Update Configuration
Update a deployment’s configuration with zero downtime:
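A sketch, assuming the `UpdateDeploymentConfig` type and `deployment_ref` parameter from the SDK; fields you omit are assumed to keep their current values:

```python
from predibase import Predibase, UpdateDeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Raise the autoscaling bounds; existing replicas keep serving while
# the new configuration rolls out.
pb.deployments.update(
    deployment_ref="my-llama-3-1-8b",
    config=UpdateDeploymentConfig(
        min_replicas=1,
        max_replicas=2,
    ),
)
```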
Delete Deployment
Clean up resources when they’re no longer needed. Serverless deployments incur no cost when deleted:
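For example:

```python
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Tear the deployment down; it stops accruing cost once deleted.
pb.deployments.delete("my-llama-3-1-8b")
```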
Features
Speculative Decoding
While a Turbo LoRA can be trained on any base model, some models require additional deployment configuration to support adapter inference properly. See which models support turbo adapters out of the box, as denoted by the “Adapter Pre-load Not Required” column.
- If the fine-tuned base model does not require the adapter to be pre-loaded, you can use your Turbo LoRA adapter as normal (via private deployments and shared endpoints).
- If the fine-tuned base model requires the adapter to be pre-loaded, you’ll need to create a private deployment like so:
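A sketch, assuming the `speculator` field of `DeploymentConfig` accepts a Turbo LoRA adapter reference (`my-repo/1` is a placeholder); confirm against the DeploymentConfig reference:

```python
from predibase import Predibase, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Pre-load the Turbo LoRA at deployment time by passing its adapter
# reference as the speculator ("my-repo/1" is a placeholder).
pb.deployments.create(
    name="my-turbo-deployment",
    config=DeploymentConfig(
        base_model="llama-3-1-8b-instruct",
        speculator="my-repo/1",
    ),
)
```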
Prefix Caching
Reduce latency for requests that share a common prompt prefix by caching and reusing the prefix’s computed state:
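A hedged sketch; `prefix_caching` is a hypothetical flag name used here for illustration, so check the DeploymentConfig reference for the exact option:

```python
from predibase import Predibase, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

pb.deployments.create(
    name="my-cached-deployment",
    config=DeploymentConfig(
        base_model="llama-3-1-8b-instruct",
        prefix_caching=True,  # hypothetical flag name; see the DeploymentConfig reference
    ),
)
```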
Request Logging
Enable comprehensive logging to track usage patterns and optimize scaling:
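A sketch, assuming a `request_logging_enabled` flag on `DeploymentConfig`:

```python
from predibase import Predibase, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

pb.deployments.create(
    name="my-logged-deployment",
    config=DeploymentConfig(
        base_model="llama-3-1-8b-instruct",
        request_logging_enabled=True,  # assumed flag name; see the DeploymentConfig reference
    ),
)
```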
Cost Optimization
Optimize your serverless deployment costs (a combined configuration sketch follows this list):
- Scale to Zero: Set `min_replicas=0` to avoid paying for GPU hours when your deployment isn’t receiving requests
- Cooldown Time: Adjust `cooldown_time` based on your usage patterns
- Hardware Selection: Choose cost-effective GPUs for smaller models
- Batch Processing: Use batch inference for large workloads
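Putting the first three levers together, a cost-optimized configuration might look like the following sketch (parameter names as assumed in the earlier examples; `cooldown_time` unit assumed to be seconds):

```python
from predibase import DeploymentConfig

# Cost-optimized: scale to zero quickly on a smaller, cheaper GPU.
cost_optimized = DeploymentConfig(
    base_model="llama-3-1-8b-instruct",  # illustrative model slug
    accelerator="a10_24gb_100",          # cost-effective single A10G
    min_replicas=0,                      # no GPU hours while idle
    max_replicas=1,
    cooldown_time=600,                   # assumed seconds of inactivity before scale-down
)
```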
Next Steps
For more information, see: