The DeploymentConfig class defines the parameters used to create private serverless deployments. It specifies the base model, resource allocation, autoscaling behavior, and other deployment options.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| base_model | string | Yes | — | The base model to deploy. Can be a short name from our models list or a Hugging Face model path. |
| accelerator | string | No | — | Type of accelerator to use. If not specified, Predibase chooses the best suitable option. |
| cooldown_time | integer | No | 3600 | Time in seconds before scaling down idle replicas. |
| hf_token | string | No | — | Hugging Face token for private models. |
| min_replicas | integer | No | 0 | Minimum number of replicas. |
| max_replicas | integer | No | 1 | Maximum number of replicas. |
| scale_up_threshold | integer | No | 1 | Number of simultaneous requests before scaling up. |
| quantization | string | No | — | Quantization method (none, fp8, bitsandbytes-nf4). The default depends on the model and accelerator. |
| uses_guaranteed_capacity | boolean | No | false | Whether to use guaranteed capacity. |
| max_total_tokens | integer | No | — | Maximum number of tokens per request. |
| lorax_image_tag | string | No | — | Tag for the LoRAX image. |
| request_logging_enabled | boolean | No | false | Whether to enable request logging. |
| direct_ingress | boolean | No | false | Creates a direct endpoint to the LLM, bypassing the Predibase control plane. |
| preloaded_adapters | array[string] | No | — | List of adapter IDs to preload on deployment initialization. |
| speculator | string | No | — | Speculator to use for the deployment (auto, disabled, or the adapter ID of a Turbo or Turbo LoRA). |
| prefix_caching | boolean | No | false | Whether to enable prefix caching. |
| custom_args | array[string] | No | — | Custom arguments to pass to the LoRAX launcher. |

Example Usage

```python
from predibase import DeploymentConfig

# Basic configuration
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,
    max_replicas=1
)

# Advanced configuration
config = DeploymentConfig(
    base_model="qwen3-8b",
    max_total_tokens=4094,
    quantization="fp8",
    request_logging_enabled=True,
    preloaded_adapters=["my-adapter/1", "my-adapter/2"],
    prefix_caching=True
)
```
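
Deployments can also be tuned for autoscaling using the scaling-related parameters from the table above. The sketch below combines them in one configuration; the threshold and cooldown values are illustrative, not recommendations.

```python
from predibase import DeploymentConfig

# Autoscaling-focused configuration (illustrative values)
config = DeploymentConfig(
    base_model="qwen3-8b",
    min_replicas=0,         # scale to zero when idle
    max_replicas=3,         # cap on replica count
    scale_up_threshold=4,   # simultaneous requests before adding a replica
    cooldown_time=1800      # seconds of idleness before scaling down
)
```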

Additional Pointers

Supported models and revisions

  • For base_model, use a short name from the list of available models, or provide a Hugging Face model path (e.g. "meta-llama/Meta-Llama-3-8B").
  • You can optionally pin a Hugging Face revision for the base model by passing base_model in the format model@revision (see the sketch after this list).
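
For example, here is a minimal sketch pinning a Hugging Face base model to a revision. "main" is the repository's default branch; a commit SHA or tag would also work. Meta-Llama-3-8B is a gated model, so hf_token is included with a placeholder value.

```python
from predibase import DeploymentConfig

config = DeploymentConfig(
    base_model="meta-llama/Meta-Llama-3-8B@main",  # model@revision format
    hf_token="hf_...",  # placeholder; required only for private/gated models
)
```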

Speculative decoding

  • All Predibase deployments of supported models leverage speculative decoding by default to improve performance. The sketch after this list shows both speculator settings.
    • When speculator is set to auto, the deployment uses a pre-configured speculator based on its base model or, if none is available, one of the preloaded Turbo or Turbo LoRA adapters provided by the user.
    • Conversely, when speculator is disabled, no Turbo or Turbo LoRA adapters may be preloaded (i.e. they may not appear in preloaded_adapters).
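
A minimal sketch of the two explicit speculator settings; the adapter ID "my-turbo-adapter/1" is hypothetical.

```python
from predibase import DeploymentConfig

# Use a preloaded Turbo LoRA adapter as the speculator
# ("my-turbo-adapter/1" is a hypothetical adapter ID).
config = DeploymentConfig(
    base_model="qwen3-8b",
    speculator="my-turbo-adapter/1",
    preloaded_adapters=["my-turbo-adapter/1"],
)

# Turn speculative decoding off entirely; no Turbo or Turbo LoRA
# adapters may be preloaded in this mode.
config = DeploymentConfig(
    base_model="qwen3-8b",
    speculator="disabled",
)
```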