Private deployments provide serverless, dedicated model instances with:
- Automatic scaling from zero to meet demand
- Full control over resources and configuration
- Higher throughput and lower latency
- Production-ready SLAs
- Pay only for what you use
Quick Start
First, install the Predibase Python SDK:
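pip install -U predibase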
Here’s how to create and use your first serverless deployment:
from predibase import Predibase, DeploymentConfig
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
# Create a serverless deployment with auto-scaling
deployment = pb.deployments.create(
name="my-qwen3-8b",
config=DeploymentConfig(
base_model="qwen3-8b",
min_replicas=0, # Scale to zero when idle (serverless)
max_replicas=1, # Scale up to 1 replica on demand
cooldown_time=3600 # Scale down after 1 hour idle
)
)
# Use the deployment - it will automatically scale up when needed
client = pb.deployments.client("my-qwen3-8b")
response = client.generate("What is machine learning?")
print(response.generated_text)
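The generate call also accepts decoding parameters per request; for example, max_new_tokens (used again later in this guide) caps the response length:
# Cap the response length for this single request
response = client.generate("What is machine learning?", max_new_tokens=64)
print(response.generated_text)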
Deployment Options
Configure your deployment’s scaling behavior to optimize for cost and performance. See the DeploymentConfig reference for the full definition and defaults.
config = DeploymentConfig(
# Required
base_model="qwen3-8b", # Name of the base model to deploy
# Optional parameters
accelerator=None, # Specific GPU accelerator to use (e.g., "a10_24gb_100")
custom_args=None, # List of additional arguments passed to the model
cooldown_time=None, # Time in seconds before scaling down (default: 3600)
hf_token=None, # Hugging Face token for private models
min_replicas=None, # Minimum number of replicas (default: 0)
max_replicas=None, # Maximum number of replicas (default: 1)
scale_up_threshold=None, # Request threshold for scaling up
quantization=None, # Quantization method to use
uses_guaranteed_capacity=None, # Whether to use guaranteed GPU capacity
max_total_tokens=None, # Maximum total tokens for context + generation
max_num_batched_tokens=None, # Maximum tokens that can be batched together
lorax_image_tag=None, # Specific LoRAX image version
request_logging_enabled=None, # Enable/disable request logging
direct_ingress=None, # Enable/disable direct ingress
preloaded_adapters=None, # List of adapter IDs to preload
speculator=None, # Speculator for speculative decoding (e.g., a Turbo or Turbo LoRA adapter ID; see Speculative Decoding below)
prefix_caching=None, # Enable/disable KV cache prefix optimization
merge_adapter=None, # Merge the preloaded adapter into the base model (see Merging Adapters below)
disable_adapters=None # Disable adapters on the deployment (default: False)
)
Common configuration examples:
# Basic serverless deployment that scales to zero
config = DeploymentConfig(
base_model="qwen3-8b",
min_replicas=0, # Scale to zero when idle (minimum cost)
max_replicas=1, # Scale up to 1 replica under load
cooldown_time=3600 # Scale down after 1 hour idle
)
# Always-on deployment with specific hardware and performance optimizations
config = DeploymentConfig(
base_model="qwen3-8b",
min_replicas=1, # Always keep 1 replica running
accelerator="l40s_48gb_100", # Use an L40S GPU
quantization="fp8", # Use FP8 quantization for higher throughput
max_num_batched_tokens=8192, # Allow larger batches for higher throughput
)
# Autoscaling deployment with multiple pre-loaded adapters
config = DeploymentConfig(
base_model="qwen3-8b",
min_replicas=1,
max_replicas=4,
preloaded_adapters=["my-adapter/1", "my-adapter/2"], # Preload two adapter versions
)
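With multiple adapters preloaded, each request can target a specific version at inference time. A minimal sketch, assuming the client's adapter_id request parameter and the deployment name from the Quick Start:
client = pb.deployments.client("my-qwen3-8b")
# Route this request through version 1 of the preloaded adapter
# (adapter_id as a per-request parameter is an assumption of this sketch)
response = client.generate("Summarize this support ticket.", adapter_id="my-adapter/1")
print(response.generated_text)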
GPU hardware
Your private model deployments can be run on the following GPUs and associated
tiers. See our pricing.
| Accelerator | ID | Predibase Tiers | GPUs | SKU |
|---|---|---|---|---|
| 1 A10G 24GB | a10_24gb_100 | All | 1 | A10G |
| 1 L40S 48GB | l40s_48gb_100 | All | 1 | L40S |
| 1 L4 24GB | l4_24gb_100 | Enterprise VPC | 1 | L4 |
| 1 A100 80GB | a100_80gb_100 | Enterprise SaaS | 1 | A100 |
| 2 A100 80GB | a100_80gb_200 | Enterprise SaaS | 2 | A100 |
| 4 A10G 24GB | a10_24gb_400 | Enterprise VPC | 4 | A10G |
| 1 H100 80GB PCIe | h100_80gb_pcie_100 | Enterprise SaaS and VPC | 1 | H100 |
| 1 H100 80GB SXM | h100_80gb_sxm_100 | Enterprise SaaS and VPC | 1 | H100 |
To deploy on H100s or multi-GPU configurations (A100 or H100), or to upgrade to Enterprise, please reach out to us at sales@predibase.com.
Managing Deployments
Monitoring Serverless Scaling
Track your deployment’s scaling behavior and resource usage:
# List all deployments
deployments = pb.deployments.list()
# Get specific deployment
deployment = pb.deployments.get("my-qwen3-8b")
print(f"Status: {deployment.status}")
print(f"Current Replicas: {deployment.current_replicas}") # 0 when scaled to zero
Update Configuration
Update deployment configuration with zero downtime.
from predibase import UpdateDeploymentConfig
# Update scaling configuration
pb.deployments.update(
deployment_ref="my-qwen3-8b",
config=UpdateDeploymentConfig(
min_replicas=1, # Change to always-on (disable scale-to-zero)
max_replicas=2, # Allow scaling to 2 replicas for higher load
cooldown_time=1800 # Change cooldown time
)
)
Delete Deployment
Clean up resources when they’re no longer needed; a deleted deployment incurs no further cost:
pb.deployments.delete("my-qwen3-8b")
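To tidy up several disposable deployments at once, list() and delete() compose naturally. A minimal sketch, assuming each listed deployment exposes a name attribute and that disposable deployments share a "scratch-" naming prefix:
# Delete every deployment marked as disposable by its name
for d in pb.deployments.list():
    if d.name.startswith("scratch-"):  # naming convention assumed for this sketch
        pb.deployments.delete(d.name)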
Features
Speculative Decoding
While any base model can be trained as a Turbo LoRA, some models require additional deployment
configurations to support adapter inference properly. See which models support turbo adapters
out of the box, as denoted by the “Adapter Pre-load Not Required” column.
- If the base model you fine-tuned does not require the adapter to be pre-loaded, you can use your Turbo LoRA adapter as normal (via private deployments and shared endpoints).
- If the base model you fine-tuned requires the adapter to be pre-loaded, you’ll need to create a private deployment like so:
# Deploy with Turbo LoRA or Turbo adapter
pb.deployments.create(
name="solar-pro-preview-instruct-deployment",
config=DeploymentConfig(
base_model="solar-pro-preview-instruct",
preloaded_adapters=["my-repo/1"], # Where "my-repo/1" is the adapter ID of a Turbo or Turbo LoRA
)
)
# Since the adapter is already loaded, prompt without needing to specify the adapter dynamically
client = pb.deployments.client("solar-pro-preview-instruct-deployment")
print(client.generate("Where is the best slice shop in NYC?", max_new_tokens=100).generated_text)
Prefix Caching
Reuse the KV cache across requests that share a common prompt prefix to reduce time to first token:
config = DeploymentConfig(
base_model="qwen3-8b",
prefix_caching=True # Reuse KV cache for shared prompt prefixes
)
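Prefix caching pays off when many requests share a long, identical preamble. A minimal sketch, assuming the deployment from the Quick Start and an illustrative shared system prompt:
# Requests sharing this preamble can reuse its cached KV state
SYSTEM_PREAMBLE = "You are a support agent for Acme Corp. Follow these policies: ..."
client = pb.deployments.client("my-qwen3-8b")
for question in ["How do I reset my password?", "What is the refund policy?"]:
    response = client.generate(SYSTEM_PREAMBLE + "\n\n" + question, max_new_tokens=64)
    print(response.generated_text)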
(BETA) Merging Adapters
Merging an adapter can speed up inference at the cost of flexibility. Once you merge an adapter to the base model on a deployment,
that deployment cannot use any other adapters for inference.
config = DeploymentConfig(
base_model="qwen3-8b",
merge_adapter=True,
preloaded_adapters=["your-qwen3-8b-adapter/1"]
)
Please note:
- The adapter and the base model must match
- There must only be one preloaded adapter when configuring the model
- There will be a penalty on model startup time since the adapter is merged on the fly. This penalty will not be present for enterprise customers who
use “guaranteed capacity” on their deployment.
Request Logging
Enable comprehensive logging to track usage patterns and optimize scaling:
config = DeploymentConfig(
base_model="qwen3-8b",
request_logging_enabled=True # Enable logging for usage analysis
)
Reserved Capacity
Enterprise customers can reserve capacity for their private deployments. Deployments backed by reserved capacity offer the following benefits over on-demand capacity:
- Lower cost: Reserved capacity offers lower per-hour costs than on-demand capacity.
- SLA-backed uptime: Reserved capacity is backed by a standard 99.9% uptime SLA (reach out to us if you need a higher SLA).
- SLA-backed coldstart latency: Reserved capacity replicas are backed by a standard 30s p50 GPU acquisition latency SLA.
If you need to reserve capacity for your private deployments, please reach out to us at sales@predibase.com.
Setting Reserved Capacity
You can assign reserved capacity when creating a private deployment:
config = DeploymentConfig(
base_model="qwen3-8b",
uses_guaranteed_capacity=True, # Enable reserved capacity
...
)
You can also assign capacity to or remove capacity from an existing private deployment:
pb.deployments.update(
deployment_ref="my-qwen3-8b",
config=UpdateDeploymentConfig(
uses_guaranteed_capacity=True, # Enable reserved capacity
...
)
)
Viewing Reserved Capacity
You can check your reservations in the UI under Settings > Reserved Capacity.
Direct Ingress
Direct ingress is a latency optimization and security feature for VPC customers, described in
VPC Overview.
Prompting With Direct Ingress
Important: Direct ingress deployments can only be accessed from hosts within your peered VPC. These deployments are
not visible outside the peered VPC and cannot be accessed through alternative network paths, including the standard
control plane interface.
Finding Your Direct Ingress Endpoint
To obtain your deployment’s direct ingress endpoint:
- Navigate to the Configuration tab for your deployment
- Locate the endpoint URL (format:
vpce-0123456789-01abcde.vpce-svc-012345abc.us-west-2.vpce.amazonaws.com)
SDK Integration
Use the following Python code to connect via the SDK:
from predibase import Predibase
pb = Predibase()
client = pb.deployments.client('deployment-name', serving_url_override='<direct_ingress_endpoint>')
response = client.generate(prompt='hello', max_new_tokens=16)
REST API
To use the REST API directly, you’ll need both your direct ingress endpoint and tenant ID (found in the Settings tab).
Note: While these requests use HTTP, they remain secure because all connections are routed through AWS PrivateLink
and never leave AWS networks.
curl --request POST \
--url 'http://<direct_ingress_endpoint>/<tenant_id>/deployments/v2/llms/<deployment_name>/generate' \
--header 'Content-Type: application/json' \
--header 'User-Agent: <user_agent>' \
--data '{
"inputs": "hello",
"parameters": {
"max_new_tokens": 16,
"details": true
}
}'
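The same request from Python using the requests library; the endpoint, tenant ID, deployment name, and user agent are placeholders exactly as in the curl example:
import requests
# Plain HTTP is safe here: traffic stays on AWS PrivateLink (see note above)
url = "http://<direct_ingress_endpoint>/<tenant_id>/deployments/v2/llms/<deployment_name>/generate"
payload = {
    "inputs": "hello",
    "parameters": {"max_new_tokens": 16, "details": True},
}
resp = requests.post(url, json=payload, headers={"User-Agent": "<user_agent>"}, timeout=30)
print(resp.json())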
Cost Optimization
Optimize your serverless deployment costs:
- Scale to Zero: Set min_replicas=0 to avoid paying for GPU hours when your deployment isn’t receiving requests
- Cooldown Time: Adjust based on your usage patterns
- Hardware Selection: Choose cost-effective GPUs for smaller models
- Batch Processing: Use batch inference for large workloads
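Putting these levers together, a cost-optimized configuration might look like the following sketch (the cooldown value is illustrative; the accelerator ID comes from the GPU hardware table above):
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,               # Scale to zero when idle (no GPU hours while idle)
    max_replicas=1,
    cooldown_time=900,            # Scale down after 15 minutes idle (illustrative)
    accelerator="l40s_48gb_100",  # Single L40S, per the hardware table above
)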
Next Steps
For more information, see the full Predibase documentation at https://docs.predibase.com.