The UpdateDeploymentConfig class defines the parameters used for updating private serverless deployments. The parameters and their definitions match DeploymentConfig, except for accelerator, which cannot be changed via an update.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| base_model | string | Yes | | The base model to deploy. Can be a short name from our models list or a Hugging Face model path |
| cooldown_time | integer | No | 3600 | Time in seconds before scaling down idle replicas |
| hf_token | string | No | | Hugging Face token for private models |
| min_replicas | integer | No | 0 | Minimum number of replicas |
| max_replicas | integer | No | 1 | Maximum number of replicas |
| scale_up_threshold | integer | No | 1 | Number of queued requests before scaling up additional replicas |
| quantization | string | No | | Quantization method (none, fp8, bitsandbytes-nf4). Default is based on the model and accelerator. |
| uses_guaranteed_capacity | boolean | No | false | Whether to use guaranteed capacity |
| max_total_tokens | integer | No | | Maximum number of tokens per request |
| lorax_image_tag | string | No | | Tag for the LoRAX image |
| request_logging_enabled | boolean | No | false | Whether to enable request logging |
| direct_ingress | boolean | No | false | Creates a direct endpoint to the LLM, bypassing the Predibase control plane |
| preloaded_adapters | array[string] | No | | List of adapter IDs to preload on deployment initialization |
| speculator | string | No | | Speculator to use for the deployment (auto, disabled, or the adapter ID of a Turbo or Turbo LoRA) |
| prefix_caching | boolean | No | false | Whether to enable prefix caching |
| custom_args | array[string] | No | | Custom arguments to pass to the LoRAX launcher |

Example Usage

from predibase import Predibase, UpdateDeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Update scaling configuration
pb.deployments.update(
    deployment_ref="my-qwen3-8b",
    config=UpdateDeploymentConfig(
        min_replicas=1,    # Always-on: disable scale-to-zero
        max_replicas=2,    # Allow scaling up to 2 replicas under higher load
        cooldown_time=1800 # Scale down idle replicas after 30 minutes
    )
)
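Other parameters from the table above can be updated the same way. The sketch below assumes a deployment named "my-qwen3-8b" and an adapter ID "my-adapter/1"; both are hypothetical placeholders, so substitute names that exist in your tenant.

```python
from predibase import Predibase, UpdateDeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Enable prefix caching and preload an adapter at deployment initialization
pb.deployments.update(
    deployment_ref="my-qwen3-8b",          # hypothetical deployment name
    config=UpdateDeploymentConfig(
        prefix_caching=True,               # reuse KV cache across shared prompt prefixes
        preloaded_adapters=["my-adapter/1"],  # hypothetical adapter ID
        request_logging_enabled=True,      # log requests for this deployment
    ),
)
```

Because accelerator cannot be changed via an update, switching hardware requires deleting and recreating the deployment.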