Private deployments provide serverless, dedicated model instances with:

  • Automatic scaling from zero to meet demand
  • Full control over resources and configuration
  • Higher throughput and lower latency
  • Production-ready SLAs
  • Pay only for what you use

Quick Start

First, install the Predibase Python SDK:

pip install -U predibase

Here’s how to create and use your first serverless deployment:

from predibase import Predibase, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Create a serverless deployment with auto-scaling
deployment = pb.deployments.create(
    name="my-qwen3-8b",
    config=DeploymentConfig(
        base_model="qwen3-8b",
        min_replicas=0,  # Scale to zero when idle (serverless)
        max_replicas=1,  # Scale up to 1 replica on demand
        cooldown_time=3600  # Scale down after 1 hour idle
    )
)

# Use the deployment - it will automatically scale up when needed
client = pb.deployments.client("my-qwen3-8b")
response = client.generate("What is machine learning?")
print(response.generated_text)
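
Because min_replicas=0, the first request after an idle period includes cold-start time while a replica spins up; subsequent requests hit the warm replica. The snippet below is a minimal sketch illustrating this, assuming the client created above; the prompts and max_new_tokens values are illustrative.

import time

# First request after idling triggers a scale-up from zero, so it includes cold-start time
start = time.perf_counter()
cold = client.generate("Give a one-sentence definition of overfitting.", max_new_tokens=64)
print(f"Cold request took {time.perf_counter() - start:.1f}s: {cold.generated_text}")

# Subsequent requests are served by the warm replica and return much faster
start = time.perf_counter()
warm = client.generate("Give a one-sentence definition of underfitting.", max_new_tokens=64)
print(f"Warm request took {time.perf_counter() - start:.1f}s: {warm.generated_text}")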

Deployment Options

Configure your deployment’s scaling behavior to optimize for cost and performance. See the DeploymentConfig reference for the full definition and defaults.

config = DeploymentConfig(
    # Required
    base_model="qwen3-8b",  # Name of the base model to deploy

    # Optional parameters
    accelerator=None,  # Specific GPU accelerator to use (e.g., "a10_24gb_100")
    custom_args=None,  # List of additional arguments passed to the model
    cooldown_time=None,  # Time in seconds before scaling down (default: 3600)
    hf_token=None,  # Hugging Face token for private models
    min_replicas=None,  # Minimum number of replicas (default: 0)
    max_replicas=None,  # Maximum number of replicas (default: 1)
    scale_up_threshold=None,  # Request threshold for scaling up
    quantization=None,  # Quantization method to use
    uses_guaranteed_capacity=None,  # Whether to use guaranteed GPU capacity
    max_total_tokens=None,  # Maximum total tokens for context + generation
    lorax_image_tag=None,  # Specific LoRAX image version
    request_logging_enabled=None,  # Enable/disable request logging
    direct_ingress=None,  # Enable/disable direct ingress
    preloaded_adapters=None,  # List of adapter IDs to preload
    speculator=None,  # Speculator for speculative decoding (e.g., a Turbo or Turbo LoRA adapter)
    prefix_caching=None,  # Enable/disable KV cache prefix optimization
    disable_adapters=None  # Disable adapters on the deployment (default: False)
)

Common configuration examples:

# Basic serverless deployment that scales to zero
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,                 # Scale to zero when idle (minimum cost)
    max_replicas=1,                 # Scale up to 1 replica under load
    cooldown_time=3600              # Scale down after 1 hour idle
)

# Always-on deployment with specific hardware and performance optimizations
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=1,                 # Always keep 1 replica running
    accelerator="l40s_48gb_100",    # Use A10 GPU
    quantization="fp8",             # Use FP8 quantization for higher throughput
)

# Autoscaling deployment with multiple pre-loaded adapters
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=1,
    max_replicas=4,
    preloaded_adapters=["my-adapter/1", "my-adapter/2"],  # Pre-loaded adapter
)

GPU hardware

Your private model deployments can be run on the following GPUs and associated tiers. See our pricing.

Accelerator         ID                    Predibase Tiers             GPUs  SKU
1 A10G 24GB         a10_24gb_100          All                         1     A10G
1 L40S 48GB         l40s_48gb_100         All                         1     L40S
1 L4 24GB           l4_24gb_100           VPC                         1     L4
1 A100 80GB         a100_80gb_100         Developer, Enterprise SaaS  1     A100
2 A100 80GB         a100_80gb_200         Enterprise SaaS             2     A100
4 A10G 24GB         a10_24gb_400          Enterprise VPC              4     A10G
1 H100 80GB PCIe    h100_80gb_pcie_100    Enterprise SaaS and VPC     1     H100
1 H100 80GB SXM     h100_80gb_sxm_100     Enterprise SaaS and VPC     1     H100

To deploy on H100s or multi-GPU configurations (A100 or H100), or to upgrade to Enterprise, please reach out to us at sales@predibase.com.
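
To pin a deployment to one of these accelerators, pass its ID via the accelerator field. A minimal sketch, assuming the qwen3-8b base model used elsewhere in this guide:

# Pin the deployment to a single A100 80GB (see the accelerator IDs in the table above)
config = DeploymentConfig(
    base_model="qwen3-8b",
    accelerator="a100_80gb_100",
    min_replicas=0,
    max_replicas=1,
)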

Managing Deployments

Monitoring Serverless Scaling

Track your deployment’s scaling behavior and resource usage:

# List all deployments
deployments = pb.deployments.list()

# Get specific deployment
deployment = pb.deployments.get("my-qwen3-8b")
print(f"Status: {deployment.status}")
print(f"Current Replicas: {deployment.current_replicas}")  # 0 when scaled to zero

Update Configuration

Update deployment configuration with zero downtime.

from predibase import UpdateDeploymentConfig

# Update scaling configuration
pb.deployments.update(
    deployment_ref="my-qwen3-8b",
    config=UpdateDeploymentConfig(
        min_replicas=1,    # Change to always-on (disable scale-to-zero)
        max_replicas=2,    # Allow scaling to 2 replicas for higher load
        cooldown_time=1800 # Change cooldown time
    )
)
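
To confirm the update took effect, re-fetch the deployment; with min_replicas=1 it should no longer scale to zero. A brief sketch using the same calls shown above:

# Re-fetch the deployment after the update
deployment = pb.deployments.get("my-qwen3-8b")
print(f"Status: {deployment.status}")
print(f"Current Replicas: {deployment.current_replicas}")  # Should stay >= 1 once the update is applied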

Delete Deployment

Clean up resources when they’re no longer needed. Serverless deployments incur no cost when deleted:

pb.deployments.delete("my-qwen3-8b")

Features

Speculative Decoding

While a Turbo LoRA can be trained for any base model, some models require additional deployment configuration to serve the adapter properly. See which models support Turbo adapters out of the box, as denoted by the “Adapter Pre-load Not Required” column.

  • If the base model you fine-tuned does not require the adapter to be pre-loaded, you can use your Turbo LoRA adapter as normal (via private deployments and shared endpoints).
  • If the base model you fine-tuned requires the adapter to be pre-loaded, you’ll need to create a private deployment like so:
# Deploy with Turbo LoRA or Turbo adapter
pb.deployments.create(
    name="solar-pro-preview-instruct-deployment",
    config=DeploymentConfig(
        base_model="solar-pro-preview-instruct",
        preloaded_adapters=["my-repo/1"], # Where "my-repo/1" is the adapter ID of a Turbo or Turbo LoRA
    )
)

# Since the adapter is already loaded, prompt without needing to specify the adapter dynamically
client = pb.deployments.client("solar-pro-preview-instruct-deployment")
print(client.generate("Where is the best slice shop in NYC?", max_new_tokens=100).generated_text)

Prefix Caching

Speed up requests that share a common prompt prefix by reusing the cached KV prefix:

config = DeploymentConfig(
    base_model="qwen3-8b",
    prefix_caching=True  # Reuse the KV cache for shared prompt prefixes
)
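
Prefix caching pays off when many requests share the same leading tokens, such as a long fixed instruction block. A minimal sketch, assuming a deployment named "my-qwen3-8b" created with the config above; the prompt template is illustrative.

client = pb.deployments.client("my-qwen3-8b")

# A long, fixed preamble shared by every request - its KV prefix can be reused across prompts
SHARED_PREFIX = (
    "You are a support assistant for an e-commerce platform. "
    "Answer concisely, cite the relevant policy section, and never invent order details.\n\n"
)

for question in ["Where is my order?", "How do I start a return?"]:
    response = client.generate(SHARED_PREFIX + question, max_new_tokens=128)
    print(response.generated_text)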

Request Logging

Enable comprehensive logging to track usage patterns and optimize scaling:

config = DeploymentConfig(
    base_model="qwen3-8b",
    request_logging_enabled=True  # Enable logging for usage analysis
)

Cost Optimization

Optimize your serverless deployment costs; a combined configuration sketch follows the list:

  1. Scale to Zero: Set min_replicas=0 to avoid paying for GPU hours when your deployment isn’t receiving requests
  2. Cooldown Time: Adjust based on your usage patterns
  3. Hardware Selection: Choose cost-effective GPUs for smaller models
  4. Batch Processing: Use batch inference for large workloads
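
A minimal sketch combining the first three points, assuming the qwen3-8b base model and an accelerator ID from the table above (confirm the model fits on the chosen GPU, possibly with quantization):

# Cost-optimized serverless configuration
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,               # 1. Scale to zero when idle
    max_replicas=1,
    cooldown_time=900,            # 2. Scale down after 15 minutes idle (tune to your traffic)
    accelerator="a10_24gb_100",   # 3. Cost-effective GPU for a smaller model (may require quantization)
)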

Next Steps

For more information, see: