Private deployments provide serverless, dedicated model instances with:

  • Automatic scaling from zero to meet demand
  • Full control over resources and configuration
  • Higher throughput and lower latency
  • Production-ready SLAs
  • Pay only for what you use

Quick Start

First, install the Predibase Python SDK:

pip install -U predibase

Here’s how to create and use your first serverless deployment:

from predibase import Predibase, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Create a serverless deployment with auto-scaling
deployment = pb.deployments.create(
    name="my-qwen3-8b",
    config=DeploymentConfig(
        base_model="qwen3-8b",
        min_replicas=0,  # Scale to zero when idle (serverless)
        max_replicas=1,  # Scale up to 1 replica on demand
        cooldown_time=3600  # Scale down after 1 hour idle
    )
)

# Use the deployment - it will automatically scale up when needed
client = pb.deployments.client("my-qwen3-8b")
response = client.generate("What is machine learning?")
print(response.generated_text)

Deployment Options

Configure your deployment’s scaling behavior to optimize for cost and performance. See the DeploymentConfig reference for the full definition and defaults.

config = DeploymentConfig(
    # Required
    base_model="qwen3-8b",  # Name of the base model to deploy

    # Optional parameters
    accelerator=None,  # Specific GPU accelerator to use (e.g., "a10_24gb_100")
    custom_args=None,  # List of additional arguments passed to the model
    cooldown_time=None,  # Time in seconds before scaling down (default: 3600)
    hf_token=None,  # Hugging Face token for private models
    min_replicas=None,  # Minimum number of replicas (default: 0)
    max_replicas=None,  # Maximum number of replicas (default: 1)
    scale_up_threshold=None,  # Request threshold for scaling up
    quantization=None,  # Quantization method to use
    uses_guaranteed_capacity=None,  # Whether to use guaranteed GPU capacity
    max_total_tokens=None,  # Maximum total tokens for context + generation
    lorax_image_tag=None,  # Specific LoRAX image version
    request_logging_enabled=None,  # Enable/disable request logging
    direct_ingress=None,  # Enable/disable direct ingress
    preloaded_adapters=None,  # List of adapter IDs to preload
    speculator=None,  # Speculator to use for speculative decoding (e.g., a Turbo or Turbo LoRA adapter ID)
    prefix_caching=None,  # Enable/disable KV cache prefix optimization
    disable_adapters=None   # Disable adapters on the deployment (default: False)
)

Common configuration examples:

# Basic serverless deployment that scales to zero
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,                 # Scale to zero when idle (minimum cost)
    max_replicas=1,                 # Scale up to 1 replica under load
    cooldown_time=3600              # Scale down after 1 hour idle
)

# Always-on deployment with specific hardware and performance optimizations
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=1,                 # Always keep 1 replica running
    accelerator="l40s_48gb_100",    # Use A10 GPU
    quantization="fp8",             # Use FP8 quantization for higher throughput
)

# Autoscaling deployment with multiple pre-loaded adapters
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=1,
    max_replicas=4,
    preloaded_adapters=["my-adapter/1", "my-adapter/2"],  # Pre-loaded adapter
)

GPU hardware

Your private model deployments can be run on the following GPUs and associated tiers. See our pricing.

Accelerator       ID                  Predibase Tiers             GPUs  SKU
1 A10G 24GB       a10_24gb_100        All                         1     A10G
1 L40S 48GB       l40s_48gb_100       All                         1     L40S
1 L4 24GB         l4_24gb_100         VPC                         1     L4
1 A100 80GB       a100_80gb_100       Developer, Enterprise SaaS  1     A100
2 A100 80GB       a100_80gb_200       Enterprise SaaS             2     A100
4 A10G 24GB       a10_24gb_400        Enterprise VPC              4     A10G
1 H100 80GB PCIe  h100_80gb_pcie_100  Enterprise SaaS and VPC     1     H100
1 H100 80GB SXM   h100_80gb_sxm_100   Enterprise SaaS and VPC     1     H100

To deploy on H100s or multi-GPU configurations (A100 or H100), or to upgrade to Enterprise, please reach out to us at sales@predibase.com.
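
The ID column above is the value you pass as accelerator in DeploymentConfig. For example, a minimal sketch pinning a deployment to a single A10G (the deployment name here is illustrative):

from predibase import Predibase, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Pin the deployment to a specific GPU tier via its accelerator ID
pb.deployments.create(
    name="my-qwen3-8b-a10g",         # Illustrative deployment name
    config=DeploymentConfig(
        base_model="qwen3-8b",
        accelerator="a10_24gb_100",  # 1x A10G 24GB, available on all tiers
        min_replicas=0,
        max_replicas=1
    )
)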

Managing Deployments

Monitoring Serverless Scaling

Track your deployment’s scaling behavior and resource usage:

# List all deployments
deployments = pb.deployments.list()

# Get specific deployment
deployment = pb.deployments.get("my-qwen3-8b")
print(f"Status: {deployment.status}")
print(f"Current Replicas: {deployment.current_replicas}")  # 0 when scaled to zero

Update Configuration

Update deployment configuration with zero downtime.

from predibase import UpdateDeploymentConfig

# Update scaling configuration
pb.deployments.update(
    deployment_ref="my-qwen3-8b",
    config=UpdateDeploymentConfig(
        min_replicas=1,    # Change to always-on (disable scale-to-zero)
        max_replicas=2,    # Allow scaling to 2 replicas for higher load
        cooldown_time=1800 # Change cooldown time
    )
)

Delete Deployment

Clean up resources when they’re no longer needed; deleted deployments incur no further cost:

pb.deployments.delete("my-qwen3-8b")

Features

Speculative Decoding

While any base model can be fine-tuned with a Turbo LoRA adapter, some models require additional deployment configuration to support adapter inference properly. See which models support Turbo adapters out of the box, as denoted by the “Adapter Pre-load Not Required” column.

  • If the fine-tuned base model does not require the adapter to be pre-loaded, you can use your Turbo LoRA adapter as normal (via private deployments and shared endpoints).
  • If the fine-tuned base model requires the adapter to be pre-loaded, you’ll need to create a private deployment like so:
# Deploy with Turbo LoRA or Turbo adapter
pb.deployments.create(
    name="solar-pro-preview-instruct-deployment",
    config=DeploymentConfig(
        base_model="solar-pro-preview-instruct",
        preloaded_adapters=["my-repo/1"], # Where "my-repo/1" is the adapter ID of a Turbo or Turbo LoRA
    )
)

# Since the adapter is already loaded, prompt without needing to specify the adapter dynamically
client = pb.deployments.client("solar-pro-preview-instruct-deployment")
print(client.generate("Where is the best slice shop in NYC?", max_new_tokens=100).generated_text)

Prefix Caching

Improve time to first token for requests that share a common prompt prefix by reusing the KV cache:

config = DeploymentConfig(
    base_model="qwen3-8b",
    prefix_caching=True  # Reuse the KV cache for shared prompt prefixes
)
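
To illustrate the kind of workload that benefits, consider many requests that share the same long instruction prefix (the prompt text below is hypothetical):

from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
client = pb.deployments.client("my-qwen3-8b")

# A long instruction prefix shared by every request (hypothetical content)
shared_prefix = (
    "You are a support assistant. Answer concisely, cite the relevant "
    "policy section, and never reveal internal notes.\n\n"
)

# With prefix_caching enabled, the shared prefix's KV cache is reused across requests
for question in ["How do I reset my password?", "What is the refund window?"]:
    response = client.generate(shared_prefix + question, max_new_tokens=100)
    print(response.generated_text)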

Request Logging

Enable comprehensive logging to track usage patterns and optimize scaling:

config = DeploymentConfig(
    base_model="qwen3-8b",
    request_logging_enabled=True  # Enable logging for usage analysis
)

Reserved Capacity

Enterprise customers can reserve capacity for their private deployments. Deployments backed by reserved capacity offer the following benefits over on-demand capacity:

  • Lower cost: Reserved capacity offers lower per-hour costs than on-demand capacity.
  • SLA-backed uptime: Reserved capacity is backed by a standard 99.9% uptime SLA (reach out to us if you need a higher SLA).
  • SLA-backed coldstart latency: Reserved capacity replicas are backed by a standard 30s p50 GPU acquisition latency SLA.

If you need to reserve capacity for your private deployments, please reach out to us at sales@predibase.com.

Setting Reserved Capacity

You can assign reserved capacity when creating a private deployment:

config = DeploymentConfig(
    base_model="qwen3-8b",
    uses_guaranteed_capacity=True,  # Enable reserved capacity
    ...
)

You can also assign capacity to or remove capacity from an existing private deployment:

pb.deployments.update(
    deployment_ref="my-qwen3-8b",
    config=UpdateDeploymentConfig(
        uses_guaranteed_capacity=True,  # Enable reserved capacity
        ...
    )
)

Viewing Reserved Capacity

You can check your reservations in the UI under Settings > Reserved Capacity.

Direct Ingress

Direct ingress is a latency optimization and security feature for VPC customers, described in VPC Overview.

Prompting With Direct Ingress

Important: Direct ingress deployments can only be accessed from hosts within your peered VPC. These deployments are not visible outside the peered VPC and cannot be accessed through alternative network paths, including the standard control plane interface.

Finding Your Direct Ingress Endpoint

To obtain your deployment’s direct ingress endpoint:

  1. Navigate to the Configuration tab for your deployment
  2. Locate the endpoint URL (format: vpce-0123456789-01abcde.vpce-svc-012345abc.us-west-2.vpce.amazonaws.com)

SDK Integration

Use the following Python code to connect via the SDK:

from predibase import Predibase

pb = Predibase()
client = pb.deployments.client('deployment-name', serving_url_override='<direct_ingress_endpoint>')
response = client.generate(prompt='hello', max_new_tokens=16)

REST API

To use the REST API directly, you’ll need both your direct ingress endpoint and tenant ID (found in the Settings tab).

Note: While these requests use HTTP, they remain secure because all connections are routed through AWS PrivateLink and never leave AWS networks.

curl --request POST \
  --url 'http://<direct_ingress_endpoint>/<tenant_id>/deployments/v2/llms/<deployment_name>/generate' \
  --header 'Content-Type: application/json' \
  --header 'User-Agent: <user_agent>' \
  --data '{
    "inputs": "hello",
    "parameters": {
        "max_new_tokens": 16,
        "details": true
    }
}'

Cost Optimization

Optimize your serverless deployment costs:

  1. Scale to Zero: Set min_replicas=0 to avoid paying for GPU hours when your deployment isn’t receiving requests (see the example after this list)
  2. Cooldown Time: Adjust based on your usage patterns
  3. Hardware Selection: Choose cost-effective GPUs for smaller models
  4. Batch Processing: Use batch inference for large workloads
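
Putting the first three together, a sketch of a cost-optimized serverless configuration (the cooldown and accelerator values are illustrative, not recommendations):

from predibase import DeploymentConfig

# Cost-optimized serverless deployment (illustrative values)
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,               # Scale to zero when idle
    max_replicas=1,               # Cap spend under load
    cooldown_time=900,            # Scale down after 15 minutes idle
    accelerator="a10_24gb_100"    # Smaller, lower-cost GPU for an 8B model
)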

Next Steps

For more information, see: