The DeploymentConfig class defines the parameters used to create private serverless deployments, including the base model, resource allocation, and other deployment options.

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| base_model | string | Yes | | The base model to deploy. Can be a short name from our models list or a Hugging Face model path |
| accelerator | string | No | | (Deprecated) SKU of the accelerator to use. If not specified, Predibase chooses the most suitable option |
| compute_spec | ServingComputeSpec | Yes | | Specifies the type and assignment of compute / accelerator resources for this deployment |
| cooldown_time | integer | No | 3600 | Time in seconds before scaling down idle replicas. Minimum is 10 minutes (600 seconds) |
| hf_token | string | No | | Hugging Face token for private models |
| min_replicas | integer | No | 0 | Minimum number of replicas |
| max_replicas | integer | No | 1 | Maximum number of replicas |
| scale_up_threshold | integer | No | 1 | Number of simultaneous requests before scaling up |
| quantization | string | No | | Quantization method (none, fp8, bitsandbytes-nf4). Default is based on the model and accelerator |
| uses_guaranteed_capacity | boolean | No | false | Whether to use guaranteed capacity |
| max_total_tokens | integer | No | | Maximum number of tokens per request |
| lorax_image_tag | string | No | | Tag for the LoRAX image |
| request_logging_enabled | boolean | No | false | Whether to enable request logging |
| direct_ingress | boolean | No | false | Creates a direct endpoint to the LLM, bypassing the Predibase control plane |
| preloaded_adapters | array[string] | No | | List of adapter IDs to preload when the deployment initializes |
| speculator | string | No | | Speculator to use for the deployment (auto, disabled, or the adapter ID of a Turbo or Turbo LoRA adapter) |
| prefix_caching | boolean | No | false | Whether to enable prefix caching |
| cache_model | boolean | No | false | If true, caches the HF weights of the model in a private S3 bucket (see Model caching below) |
| custom_args | array[string] | No | | Custom arguments to pass to the LoRAX launcher |

Example Usage

from predibase import DeploymentConfig

# Basic configuration
config = DeploymentConfig(
    base_model="qwen3-8b",
    accelerator="a100_80gb_100",  # deprecated; prefer compute_spec (see advanced configuration below)
    min_replicas=0,
    max_replicas=1
)

# Advanced configuration
from predibase import ServingComputeSpec, ServingComputeRequests, ComputeRequest

config = DeploymentConfig(
    base_model="qwen3-8b",
    compute_spec=ServingComputeSpec(
        region="us-west-2",  # Optional unless there are multiple regions registered in your VPC setup
        requests=ServingComputeRequests(
            inference=ComputeRequest(
                sku="a100_80gb_100",
            )
        ),
    ),
    max_total_tokens=4094,
    quantization="fp8",
    request_logging_enabled=True,
    preloaded_adapters=["my-adapter/1", "my-adapter/2"],
    prefix_caching=True
)
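
Autoscaling is controlled by min_replicas, max_replicas, scale_up_threshold, and cooldown_time. The sketch below focuses only on these parameters; the values are illustrative and other settings (such as the compute spec) are omitted for brevity.

# Autoscaling / scale-to-zero configuration
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,         # release all replicas when the deployment is idle
    max_replicas=2,
    scale_up_threshold=4,   # add a replica once 4 requests are running simultaneously
    cooldown_time=1200,     # seconds before idle replicas scale down (minimum 600)
)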

Additional Pointers

Supported models and revisions

  • For base_model, use a short name from the list of available models, or provide the path to a Hugging Face model (e.g. “meta-llama/Meta-Llama-3-8B”).
  • You can optionally pin a Hugging Face revision for the base model by passing base_model in the format model@revision, as shown in the sketch below.
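
A minimal sketch of deploying a Hugging Face model pinned to a revision; the revision and token values below are illustrative placeholders.

from predibase import DeploymentConfig

# Deploy a gated Hugging Face model at a pinned revision (model@revision format)
config = DeploymentConfig(
    base_model="meta-llama/Meta-Llama-3-8B@main",  # "main" is an illustrative revision
    hf_token="hf_xxx",  # placeholder token; only needed for private or gated models
)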

Speculative decoding

  • By default, Predibase deployments of supported models leverage speculative decoding to improve performance.
    • When speculator is set to auto, the deployment uses a pre-configured speculator based on its base model or, if none is available, one of the preloaded Turbo or Turbo LoRA adapters provided by the user.
    • Conversely, when speculator is disabled, no Turbo or Turbo LoRA adapters may be preloaded (i.e. they may not be specified in preloaded_adapters). See the sketch below.
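
A minimal sketch of the speculator settings described above; the adapter ID is a hypothetical example.

from predibase import DeploymentConfig

# Turn speculative decoding off entirely
config = DeploymentConfig(
    base_model="qwen3-8b",
    speculator="disabled",  # no Turbo or Turbo LoRA adapters may be preloaded in this mode
)

# Let "auto" fall back to a preloaded Turbo LoRA adapter ("my-turbo-lora/1" is hypothetical)
config = DeploymentConfig(
    base_model="qwen3-8b",
    speculator="auto",
    preloaded_adapters=["my-turbo-lora/1"],
)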

Model caching

Availability:
  • Model caching is available exclusively to VPC customers. Cached models are stored in a private S3 bucket with read access restricted to the VPC customer only.
Behavior:
  • Deployments always load model weights from the S3 cache, if available.
  • The cache_model parameter controls whether model weights are written to the cache.
Versioning:
  • By default, weights are cached at the latest model revision.
  • To cache a specific revision, specify the base_model parameter in the format model@revision (see the sketch below).
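
A minimal sketch of caching weights at a pinned revision (VPC deployments only; the model and revision values are illustrative).

from predibase import DeploymentConfig

# Cache the HF weights for this revision in the private S3 bucket
config = DeploymentConfig(
    base_model="meta-llama/Meta-Llama-3-8B@main",  # model@revision pins the cached weights; "main" is illustrative
    cache_model=True,  # write the weights to the cache; reads from the cache happen automatically when available
)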