Private deployments provide serverless, dedicated model instances with:
- Automatic scaling from zero to meet demand
- Full control over resources and configuration
- Higher throughput and lower latency
- Production-ready SLAs
- Pay only for what you use
Quick Start
First, install the Predibase Python SDK:
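pip install -U predibase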
Here’s how to create and use your first serverless deployment:
from predibase import Predibase, DeploymentConfig
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
# Create a serverless deployment with auto-scaling
deployment = pb.deployments.create(
name="my-qwen3-8b",
config=DeploymentConfig(
base_model="qwen3-8b",
min_replicas=0, # Scale to zero when idle (serverless)
max_replicas=1, # Scale up to 1 replica on demand
cooldown_time=3600 # Scale down after 1 hour idle
)
)
# Use the deployment - it will automatically scale up when needed
client = pb.deployments.client("my-qwen3-8b")
response = client.generate("What is machine learning?")
print(response.generated_text)
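The generate call also accepts decoding parameters per request; for example, max_new_tokens (used again later in this guide) caps the response length:
# Cap the response length for this single request
response = client.generate("What is machine learning?", max_new_tokens=64)
print(response.generated_text)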
Deployment Options
Configure your deployment’s scaling behavior to optimize for cost and performance. See the DeploymentConfig reference for the full definition and defaults.
config = DeploymentConfig(
# Required
base_model="qwen3-8b", # Name of the base model to deploy
# Optional parameters
accelerator=None, # Specific GPU accelerator to use (e.g., "a10_24gb_100")
custom_args=None, # List of additional arguments passed to the model
cooldown_time=None, # Time in seconds before scaling down (default: 3600)
hf_token=None, # Hugging Face token for private models
min_replicas=None, # Minimum number of replicas (default: 0)
max_replicas=None, # Maximum number of replicas (default: 1)
scale_up_threshold=None, # Request threshold for scaling up
quantization=None, # Quantization method to use
uses_guaranteed_capacity=None, # Whether to use guaranteed GPU capacity
max_total_tokens=None, # Maximum total tokens for context + generation
max_num_batched_tokens=None, # Maximum tokens that can be batched together
lorax_image_tag=None, # Specific LoRAX image version
request_logging_enabled=None, # Enable/disable request logging
direct_ingress=None, # Enable/disable direct ingress
preloaded_adapters=None, # List of adapter IDs to preload
speculator=None, # Speculator for speculative decoding (e.g., a Turbo or Turbo LoRA adapter ID; see Speculative Decoding below)
prefix_caching=None, # Enable/disable KV cache prefix optimization
merge_adapter=None, # Merge the preloaded adapter into the base model (see Merging Adapters below)
disable_adapters=None # Disable adapters on the deployment (default: False)
)
Common configuration examples:
# Basic serverless deployment that scales to zero
config = DeploymentConfig(
base_model="qwen3-8b",
min_replicas=0, # Scale to zero when idle (minimum cost)
max_replicas=1, # Scale up to 1 replica under load
cooldown_time=3600 # Scale down after 1 hour idle
)
# Always-on deployment with specific hardware and performance optimizations
config = DeploymentConfig(
base_model="qwen3-8b",
min_replicas=1, # Always keep 1 replica running
accelerator="l40s_48gb_100", # Use an L40S GPU
quantization="fp8", # Use FP8 quantization for higher throughput
max_num_batched_tokens=8192, # Allow larger batches for higher throughput
)
# Autoscaling deployment with multiple pre-loaded adapters
config = DeploymentConfig(
base_model="qwen3-8b",
min_replicas=1,
max_replicas=4,
preloaded_adapters=["my-adapter/1", "my-adapter/2"], # Preload two adapter versions
)
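With multiple adapters preloaded, each request can target a specific version at inference time. A minimal sketch, assuming the client's adapter_id request parameter and the deployment name from the Quick Start:
client = pb.deployments.client("my-qwen3-8b")
# Route this request through version 1 of the preloaded adapter
# (adapter_id as a per-request parameter is an assumption of this sketch)
response = client.generate("Summarize this support ticket.", adapter_id="my-adapter/1")
print(response.generated_text)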
GPU hardware
Your private model deployments can be run on the following GPUs and associated
tiers. See our pricing.
| Accelerator | ID | Predibase Tiers | GPUs | SKU |
|---|---|---|---|---|
| 1 A10G 24GB | a10_24gb_100 | All | 1 | A10G |
| 1 L40S 48GB | l40s_48gb_100 | All | 1 | L40S |
| 1 L4 24GB | l4_24gb_100 | Enterprise VPC | 1 | L4 |
| 1 A100 80GB | a100_80gb_100 | Enterprise SaaS | 1 | A100 |
| 2 A100 80GB | a100_80gb_200 | Enterprise SaaS | 2 | A100 |
| 4 A10G 24GB | a10_24gb_400 | Enterprise VPC | 4 | A10G |
| 1 H100 80GB PCIe | h100_80gb_pcie_100 | Enterprise SaaS and VPC | 1 | H100 |
| 1 H100 80GB SXM | h100_80gb_sxm_100 | Enterprise SaaS and VPC | 1 | H100 |
To deploy on H100s or multi-GPU configurations (A100 or H100), or to upgrade to Enterprise, please reach out to us at sales@predibase.com.
Managing Deployments
Monitoring Serverless Scaling
Track your deployment’s scaling behavior and resource usage:
# List all deployments
deployments = pb.deployments.list()
# Get specific deployment
deployment = pb.deployments.get("my-qwen3-8b")
print(f"Status: {deployment.status}")
print(f"Current Replicas: {deployment.current_replicas}") # 0 when scaled to zero
Update Configuration
Update deployment configuration with zero downtime.
from predibase import UpdateDeploymentConfig
# Update scaling configuration
pb.deployments.update(
deployment_ref="my-qwen3-8b",
config=UpdateDeploymentConfig(
min_replicas=1, # Change to always-on (disable scale-to-zero)
max_replicas=2, # Allow scaling to 2 replicas for higher load
cooldown_time=1800 # Change cooldown time
)
)
Delete Deployment
Clean up resources when they’re no longer needed; a deleted deployment incurs no further cost:
pb.deployments.delete("my-qwen3-8b")
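To tidy up several disposable deployments at once, list() and delete() compose naturally. A minimal sketch, assuming each listed deployment exposes a name attribute and that disposable deployments share a "scratch-" naming prefix:
# Delete every deployment marked as disposable by its name
for d in pb.deployments.list():
    if d.name.startswith("scratch-"):  # naming convention assumed for this sketch
        pb.deployments.delete(d.name)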
Features
Speculative Decoding
While any base model can be trained as a Turbo LoRA, some models require additional deployment
configurations to support adapter inference properly. See which models support turbo adapters
out of the box, as denoted by the “Adapter Pre-load Not Required” column.
- If the base model you fine-tuned does not require the adapter to be pre-loaded, you can use your Turbo LoRA adapter as normal (via private deployments and shared endpoints).
- If the base model you fine-tuned requires the adapter to be pre-loaded, you’ll need to create a private deployment like so:
# Deploy with Turbo LoRA or Turbo adapter
pb.deployments.create(
name="solar-pro-preview-instruct-deployment",
config=DeploymentConfig(
base_model="solar-pro-preview-instruct",
preloaded_adapters=["my-repo/1"], # Where "my-repo/1" is the adapter ID of a Turbo or Turbo LoRA
)
)
# Since the adapter is already loaded, prompt without needing to specify the adapter dynamically
client = pb.deployments.client("solar-pro-preview-instruct-deployment")
print(client.generate("Where is the best slice shop in NYC?", max_new_tokens=100).generated_text)
Prefix Caching
Reuse the KV cache across requests that share a common prompt prefix to reduce time to first token:
config = DeploymentConfig(
base_model="qwen3-8b",
prefix_caching=True # Reuse KV cache for shared prompt prefixes
)
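Prefix caching pays off when many requests share a long, identical preamble. A minimal sketch, assuming the deployment from the Quick Start and an illustrative shared system prompt:
# Requests sharing this preamble can reuse its cached KV state
SYSTEM_PREAMBLE = "You are a support agent for Acme Corp. Follow these policies: ..."
client = pb.deployments.client("my-qwen3-8b")
for question in ["How do I reset my password?", "What is the refund policy?"]:
    response = client.generate(SYSTEM_PREAMBLE + "\n\n" + question, max_new_tokens=64)
    print(response.generated_text)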
(BETA) Merging Adapters
Merging an adapter can speed up inference at the cost of flexibility. Once you merge an adapter to the base model on a deployment,
that deployment cannot use any other adapters for inference.
config = DeploymentConfig(
base_model="qwen3-8b",
merge_adapter=True,
preloaded_adapters=["your-qwen3-8b-adapter/1"]
)
Please note:
- The adapter and the base model must match
- There must only be one preloaded adapter when configuring the model
- There will be a penalty on model startup time since the adapter is merged on the fly. This penalty will not be present for enterprise customers who
use “guaranteed capacity” on their deployment.
Request Logging
Enable comprehensive logging to track usage patterns and optimize scaling:
config = DeploymentConfig(
base_model="qwen3-8b",
request_logging_enabled=True # Enable logging for usage analysis
)
Reserved Capacity
Enterprise customers can reserve capacity for their private deployments. Deployments backed by reserved capacity offer the following benefits over on-demand capacity:
- Lower cost: Reserved capacity offers lower per-hour costs than on-demand capacity.
- SLA-backed uptime: Reserved capacity is backed by a standard 99.9% uptime SLA (reach out to us if you need a higher SLA).
- SLA-backed coldstart latency: Reserved capacity replicas are backed by a standard 30s p50 GPU acquisition latency SLA.
If you need to reserve capacity for your private deployments, please reach out to us at sales@predibase.com.
Setting Reserved Capacity
You can assign reserved capacity when creating a private deployment:
config = DeploymentConfig(
base_model="qwen3-8b",
uses_guaranteed_capacity=True, # Enable reserved capacity
...
)
You can also assign capacity to or remove capacity from an existing private deployment:
pb.deployments.update(
deployment_ref="my-qwen3-8b",
config=UpdateDeploymentConfig(
uses_guaranteed_capacity=True, # Enable reserved capacity
...
)
)
Viewing Reserved Capacity
You can check your reservations in the UI under Settings > Reserved Capacity.
Direct Ingress
Direct ingress is a latency optimization and security feature for VPC customers, described in
VPC Overview.
Prompting With Direct Ingress
Important: Direct ingress deployments can only be accessed from hosts within your peered VPC. These deployments are
not visible outside the peered VPC and cannot be accessed through alternative network paths, including the standard
control plane interface.
Finding Your Direct Ingress Endpoint
To obtain your deployment’s direct ingress endpoint:
- Navigate to the Configuration tab for your deployment
- Locate the endpoint URL (format:
vpce-0123456789-01abcde.vpce-svc-012345abc.us-west-2.vpce.amazonaws.com)
SDK Integration
Use the following Python code to connect via the SDK:
from predibase import Predibase
pb = Predibase()
client = pb.deployments.client('deployment-name', serving_url_override='<direct_ingress_endpoint>')
response = client.generate(prompt='hello', max_new_tokens=16)
REST API
To use the REST API directly, you’ll need both your direct ingress endpoint and tenant ID (found in the Settings tab).
Note: While these requests use HTTP, they remain secure because all connections are routed through AWS PrivateLink
and never leave AWS networks.
curl --request POST \
--url 'http://<direct_ingress_endpoint>/<tenant_id>/deployments/v2/llms/<deployment_name>/generate' \
--header 'Content-Type: application/json' \
--header 'User-Agent: <user_agent>' \
--data '{
"inputs": "hello",
"parameters": {
"max_new_tokens": 16,
"details": true
}
}'
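The same request from Python using the requests library; the endpoint, tenant ID, deployment name, and user agent are placeholders exactly as in the curl example:
import requests
# Plain HTTP is safe here: traffic stays on AWS PrivateLink (see note above)
url = "http://<direct_ingress_endpoint>/<tenant_id>/deployments/v2/llms/<deployment_name>/generate"
payload = {
    "inputs": "hello",
    "parameters": {"max_new_tokens": 16, "details": True},
}
resp = requests.post(url, json=payload, headers={"User-Agent": "<user_agent>"}, timeout=30)
print(resp.json())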
Cost Optimization
Optimize your serverless deployment costs:
- Scale to Zero: Set min_replicas=0 to avoid paying for GPU hours when your deployment isn’t receiving requests
- Cooldown Time: Adjust based on your usage patterns
- Hardware Selection: Choose cost-effective GPUs for smaller models
- Batch Processing: Use batch inference for large workloads
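Putting these levers together, a cost-optimized configuration might look like the following sketch (the cooldown value is illustrative; the accelerator ID comes from the GPU hardware table above):
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,               # Scale to zero when idle (no GPU hours while idle)
    max_replicas=1,
    cooldown_time=900,            # Scale down after 15 minutes idle (illustrative)
    accelerator="l40s_48gb_100",  # Single L40S, per the hardware table above
)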
Next Steps
For more information, see the full Predibase documentation at https://docs.predibase.com.