Private deployments provide serverless, dedicated model instances with:
- Automatic scaling from zero to meet demand
- Full control over resources and configuration
- Higher throughput and lower latency
- Production-ready SLAs
- Pay only for what you use
Quick Start
First, install the Predibase Python SDK:
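pip install -U predibase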
Here’s how to create and use your first serverless deployment:
from predibase import Predibase, DeploymentConfig
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
# Create a serverless deployment with auto-scaling
deployment = pb.deployments.create(
    name="my-qwen3-8b",
    config=DeploymentConfig(
        base_model="qwen3-8b",
        min_replicas=0,     # Scale to zero when idle (serverless)
        max_replicas=1,     # Scale up to 1 replica on demand
        cooldown_time=3600  # Scale down after 1 hour idle
    )
)
# Use the deployment - it will automatically scale up when needed
client = pb.deployments.client("my-qwen3-8b")
response = client.generate("What is machine learning?")
print(response.generated_text)
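generate() also accepts generation parameters; for example, max_new_tokens (used again in the Speculative Decoding example later in this guide) caps the response length. A minimal sketch with an illustrative prompt:
# Cap the response length with max_new_tokens
response = client.generate(
    "Explain overfitting in one sentence.",
    max_new_tokens=128
)
print(response.generated_text)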
Deployment Options
Configure your deployment’s scaling behavior to optimize for cost and
performance. See the DeploymentConfig
reference for the full
definition and defaults.
config = DeploymentConfig(
    # Required
    base_model="qwen3-8b",            # Name of the base model to deploy

    # Optional parameters
    accelerator=None,                 # Specific GPU accelerator to use (e.g., "a10_24gb_100")
    custom_args=None,                 # List of additional arguments passed to the model
    cooldown_time=None,               # Time in seconds before scaling down (default: 3600)
    hf_token=None,                    # Hugging Face token for private models
    min_replicas=None,                # Minimum number of replicas (default: 0)
    max_replicas=None,                # Maximum number of replicas (default: 1)
    scale_up_threshold=None,          # Request threshold for scaling up
    quantization=None,                # Quantization method to use
    uses_guaranteed_capacity=None,    # Whether to use guaranteed GPU capacity
    max_total_tokens=None,            # Maximum total tokens for context + generation
    lorax_image_tag=None,             # Specific LoRAX image version
    request_logging_enabled=None,     # Enable/disable request logging
    direct_ingress=None,              # Enable/disable direct ingress
    preloaded_adapters=None,          # List of adapter IDs to preload
    speculator=None,                  # Speculator for speculative decoding (e.g., a Turbo or Turbo LoRA adapter ID)
    prefix_caching=None,              # Enable/disable KV cache prefix optimization
    disable_adapters=None             # Disable adapters on the deployment (defaults to False)
)
Common configuration examples:
# Basic serverless deployment that scales to zero
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,     # Scale to zero when idle (minimum cost)
    max_replicas=1,     # Scale up to 1 replica under load
    cooldown_time=3600  # Scale down after 1 hour idle
)
# Always-on deployment with specific hardware and performance optimizations
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=1,               # Always keep 1 replica running
    accelerator="l40s_48gb_100",  # Use an L40S GPU
    quantization="fp8",           # Use FP8 quantization for higher throughput
)
# Autoscaling deployment with multiple pre-loaded adapters
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=1,
    max_replicas=4,
    preloaded_adapters=["my-adapter/1", "my-adapter/2"],  # Pre-load both adapters
)
GPU Hardware
Your private model deployments can be run on the following GPUs and associated
tiers. See our pricing.
| Accelerator | ID | Predibase Tiers | GPUs | SKU |
| --- | --- | --- | --- | --- |
| 1 A10G 24GB | a10_24gb_100 | All | 1 | A10G |
| 1 L40S 48GB | l40s_48gb_100 | All | 1 | L40S |
| 1 L4 24GB | l4_24gb_100 | VPC | 1 | L4 |
| 1 A100 80GB | a100_80gb_100 | Developer, Enterprise SaaS | 1 | A100 |
| 2 A100 80GB | a100_80gb_200 | Enterprise SaaS | 2 | A100 |
| 4 A10G 24GB | a10_24gb_400 | Enterprise VPC | 4 | A10G |
| 1 H100 80GB PCIe | h100_80gb_pcie_100 | Enterprise SaaS and VPC | 1 | H100 |
| 1 H100 80GB SXM | h100_80gb_sxm_100 | Enterprise SaaS and VPC | 1 | H100 |
To deploy on H100s or multi-GPU configurations (A100 or H100), or to upgrade to Enterprise, please
reach out to us at sales@predibase.com.
Managing Deployments
Monitoring Serverless Scaling
Track your deployment’s scaling behavior and resource usage:
# List all deployments
deployments = pb.deployments.list()
# Get specific deployment
deployment = pb.deployments.get("my-qwen3-8b")
print(f"Status: {deployment.status}")
print(f"Current Replicas: {deployment.current_replicas}") # 0 when scaled to zero
Update Configuration
Update deployment configuration with zero downtime.
from predibase import UpdateDeploymentConfig
# Update scaling configuration
pb.deployments.update(
    deployment_ref="my-qwen3-8b",
    config=UpdateDeploymentConfig(
        min_replicas=1,     # Change to always-on (disable scale-to-zero)
        max_replicas=2,     # Allow scaling to 2 replicas for higher load
        cooldown_time=1800  # Change cooldown time
    )
)
Delete Deployment
Clean up resources when they’re no longer needed; a deleted deployment incurs no
further cost:
pb.deployments.delete("my-qwen3-8b")
Features
Speculative Decoding
While any base model can be trained as a Turbo LoRA, some models require additional deployment
configurations to support adapter inference properly. See which models support turbo adapters
out of the box, as denoted by the “Adapter Pre-load Not Required” column.
- If the base model you fine-tuned does not require the adapter to be pre-loaded, you can use your Turbo LoRA adapter as normal (via private deployments and shared endpoints); a dynamic-loading sketch follows the snippet below.
- If the base model you fine-tuned requires the adapter to be pre-loaded, you’ll need to create a private deployment like so:
# Deploy with Turbo LoRA or Turbo adapter
pb.deployments.create(
    name="solar-pro-preview-instruct-deployment",
    config=DeploymentConfig(
        base_model="solar-pro-preview-instruct",
        preloaded_adapters=["my-repo/1"],  # Where "my-repo/1" is the adapter ID of a Turbo or Turbo LoRA
    )
)
# Since the adapter is already loaded, prompt without needing to specify the adapter dynamically
client = pb.deployments.client("solar-pro-preview-instruct-deployment")
print(client.generate("Where is the best slice shop in NYC?", max_new_tokens=100).generated_text)
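For the first case above (no pre-loading required), the adapter can instead be referenced dynamically at request time. A minimal sketch, assuming a hypothetical deployment name and adapter ID:
# Hypothetical names: "my-qwen3-8b" deployment and "my-repo/1" Turbo LoRA adapter
client = pb.deployments.client("my-qwen3-8b")
print(client.generate(
    "Where is the best slice shop in NYC?",
    adapter_id="my-repo/1",  # Load the adapter dynamically for this request
    max_new_tokens=100
).generated_text)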
Prefix Caching
Speed up requests that share a common prompt prefix by reusing cached KV state:
config = DeploymentConfig(
    base_model="qwen3-8b",
    prefix_caching=True  # Reuse the KV cache for shared prompt prefixes
)
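As an illustration, requests that repeat the same long prefix (for example, a shared system prompt) benefit the most. The deployment name and prompts below are placeholders:
# Illustrative only: the deployment name and prompts are placeholders.
client = pb.deployments.client("my-qwen3-8b")
system_prompt = "You are a support assistant for Acme Corp. Answer concisely.\n\n"

# Both requests share the same prefix, so the second can reuse cached KV state.
for question in ["How do I reset my password?", "How do I cancel my plan?"]:
    print(client.generate(system_prompt + question, max_new_tokens=64).generated_text)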
Request Logging
Enable comprehensive logging to track usage patterns and optimize scaling:
config = DeploymentConfig(
    base_model="qwen3-8b",
    request_logging_enabled=True  # Enable logging for usage analysis
)
Cost Optimization
Optimize your serverless deployment costs:
- Scale to Zero: Set `min_replicas=0` to avoid paying for GPU hours when your deployment isn’t receiving requests (see the example config after this list)
- Cooldown Time: Adjust based on your usage patterns
- Hardware Selection: Choose cost-effective GPUs for smaller models
- Batch Processing: Use batch inference for large workloads
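A cost-leaning configuration combining the first three points might look like the following sketch; the cooldown value and accelerator choice are illustrative rather than recommendations:
# Illustrative cost-optimized settings (values are assumptions; tune for your workload)
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,               # Scale to zero when idle
    max_replicas=1,               # Cap spend under load
    cooldown_time=900,            # Scale down after 15 minutes idle (tune to your traffic)
    accelerator="l40s_48gb_100",  # Cost-effective GPU that fits an 8B model
)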
Next Steps
For more information, see: