The DeploymentConfig class defines the parameters used to create private serverless deployments, including the base model, resource allocation, and other deployment options.

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| base_model | string | Yes | | The base model to deploy. Can be a short name from our models list or a Hugging Face model path |
| accelerator | string | No | | (Deprecated) SKU of the accelerator to use. If not specified, Predibase chooses the most suitable option |
| compute_spec | ServingComputeSpec | Yes | | Specifies the type and assignment of compute / accelerator resources for this deployment |
| cooldown_time | integer | No | 3600 | Time in seconds before scaling down idle replicas. Minimum is 10 minutes (600 seconds) |
| hf_token | string | No | | Hugging Face token for private models |
| min_replicas | integer | No | 0 | Minimum number of replicas |
| max_replicas | integer | No | 1 | Maximum number of replicas |
| scale_up_threshold | integer | No | 1 | Number of simultaneous requests before scaling up |
| quantization | string | No | | Quantization method (none, fp8, bitsandbytes-nf4). Default is based on the model and accelerator |
| uses_guaranteed_capacity | boolean | No | false | Whether to use guaranteed capacity |
| max_total_tokens | integer | No | | Maximum number of tokens per request |
| lorax_image_tag | string | No | | Tag for the LoRAX image |
| request_logging_enabled | boolean | No | false | Whether to enable request logging |
| direct_ingress | boolean | No | false | Creates a direct endpoint to the LLM, bypassing the Predibase control plane |
| preloaded_adapters | array[string] | No | | List of adapter IDs to preload when the deployment initializes |
| speculator | string | No | | Speculator to use for the deployment (auto, disabled, or the adapter ID of a Turbo or Turbo LoRA adapter) |
| prefix_caching | boolean | No | false | Whether to enable prefix caching |
| cache_model | boolean | No | false | If true, caches the HF weights of the model in a private S3 bucket (see Model caching below) |
| custom_args | array[string] | No | | Custom arguments to pass to the LoRAX launcher |

Example Usage

from predibase import DeploymentConfig

# Basic configuration
config = DeploymentConfig(
    base_model="qwen3-8b",
    accelerator="a100_80gb_100",  # deprecated; prefer compute_spec (see advanced configuration below)
    min_replicas=0,
    max_replicas=1
)

# Advanced configuration
from predibase import ServingComputeSpec, ServingComputeRequests, ComputeRequest

config = DeploymentConfig(
    base_model="qwen3-8b",
    compute_spec=ServingComputeSpec(
        region="us-west-2",  # Optional unless there are multiple regions registered in your VPC setup
        requests=ServingComputeRequests(
            inference=ComputeRequest(
                sku="a100_80gb_100",
            )
        ),
    ),
    max_total_tokens=4094,
    quantization="fp8",
    request_logging_enabled=True,
    preloaded_adapters=["my-adapter/1", "my-adapter/2"],
    prefix_caching=True
)
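
Autoscaling is controlled by min_replicas, max_replicas, scale_up_threshold, and cooldown_time. The sketch below focuses only on these parameters; the values are illustrative and other settings (such as the compute spec) are omitted for brevity.

# Autoscaling / scale-to-zero configuration
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,         # release all replicas when the deployment is idle
    max_replicas=2,
    scale_up_threshold=4,   # add a replica once 4 requests are running simultaneously
    cooldown_time=1200,     # seconds before idle replicas scale down (minimum 600)
)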

Additional Pointers

Supported models and revisions

  • For base_model, use a short name from the list of available models, or provide the path to a Hugging Face model (e.g. “meta-llama/Meta-Llama-3-8B”).
  • You can optionally pin a Hugging Face revision for the base model by passing base_model in the format model@revision, as shown in the sketch below.
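
A minimal sketch of deploying a Hugging Face model pinned to a revision; the revision and token values below are illustrative placeholders.

from predibase import DeploymentConfig

# Deploy a gated Hugging Face model at a pinned revision (model@revision format)
config = DeploymentConfig(
    base_model="meta-llama/Meta-Llama-3-8B@main",  # "main" is an illustrative revision
    hf_token="hf_xxx",  # placeholder token; only needed for private or gated models
)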

Speculative decoding

  • By default, Predibase deployments of supported models leverage speculative decoding to improve performance.
    • When speculator is set to auto, the deployment uses a pre-configured speculator based on its base model or, if none is available, one of the preloaded Turbo or Turbo LoRA adapters provided by the user.
    • Conversely, when speculator is disabled, no Turbo or Turbo LoRA adapters may be preloaded (i.e. they may not be specified in preloaded_adapters). See the sketch below.
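
A minimal sketch of the speculator settings described above; the adapter ID is a hypothetical example.

from predibase import DeploymentConfig

# Turn speculative decoding off entirely
config = DeploymentConfig(
    base_model="qwen3-8b",
    speculator="disabled",  # no Turbo or Turbo LoRA adapters may be preloaded in this mode
)

# Let "auto" fall back to a preloaded Turbo LoRA adapter ("my-turbo-lora/1" is hypothetical)
config = DeploymentConfig(
    base_model="qwen3-8b",
    speculator="auto",
    preloaded_adapters=["my-turbo-lora/1"],
)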

Model caching

Availability:
  • Model caching is available exclusively to VPC customers. Cached models are stored in a private S3 bucket with read access restricted to the VPC customer only.
Behavior:
  • Deployments always load model weights from the S3 cache, if available.
  • The cache_model parameter controls whether model weights are written to the cache.
Versioning:
  • By default, weights are cached at the latest model revision.
  • To cache a specific revision, specify the base_model parameter in the format model@revision (see the sketch below).
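
A minimal sketch of caching weights at a pinned revision (VPC deployments only; the model and revision values are illustrative).

from predibase import DeploymentConfig

# Cache the HF weights for this revision in the private S3 bucket
config = DeploymentConfig(
    base_model="meta-llama/Meta-Llama-3-8B@main",  # model@revision pins the cached weights; "main" is illustrative
    cache_model=True,  # write the weights to the cache; reads from the cache happen automatically when available
)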