# UpdateDeploymentConfig

Configuration options for model deployments.

The `UpdateDeploymentConfig` class defines the parameters used for updating private serverless deployments. The parameters and their definitions match `DeploymentConfig`, except for `accelerator`, which cannot be changed via an update.
## Parameters

Parameter | Type | Required | Default | Description
---|---|---|---|---
base_model | string | Yes | | The base model to deploy. Can be a short name from our models list or a Hugging Face model path
cooldown_time | integer | No | 3600 | Time in seconds before scaling down idle replicas
hf_token | string | No | | Hugging Face token for private models
min_replicas | integer | No | 0 | Minimum number of replicas
max_replicas | integer | No | 1 | Maximum number of replicas
scale_up_threshold | integer | No | 1 | Number of queued requests before scaling up additional replicas
quantization | string | No | Based on the model and accelerator | Quantization method (`none`, `fp8`, `bitsandbytes-nf4`)
uses_guaranteed_capacity | boolean | No | false | Whether to use guaranteed capacity
max_total_tokens | integer | No | | Maximum number of tokens per request
lorax_image_tag | string | No | | Tag for the LoRAX image
request_logging_enabled | boolean | No | false | Whether to enable request logging
direct_ingress | boolean | No | false | Creates a direct endpoint to the LLM, bypassing the Predibase control plane
preloaded_adapters | array[string] | No | | List of adapter IDs to preload on deployment initialization
speculator | string | No | | Speculator to use for the deployment (`auto`, `disabled`, or the adapter ID of a Turbo or Turbo LoRA)
prefix_caching | boolean | No | false | Whether to enable prefix caching
cache_model | boolean | No | false | If true, caches the HF weights of the model in a private S3 bucket (see Model caching below)
custom_args | array[string] | No | | Custom arguments to pass to the LoRAX launcher
## Example Usage
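The example itself is missing from this extract. The sketch below builds a hypothetical update payload from the parameter table above; the values are illustrative, and the exact SDK call signature should be confirmed against the Predibase documentation.

```python
# Hypothetical update payload using parameters from the table above.
# With the Predibase Python SDK these would typically be passed as
# keyword arguments to UpdateDeploymentConfig (signature unverified).
update_config = {
    "cooldown_time": 7200,       # scale idle replicas down after 2 hours
    "min_replicas": 1,           # keep one replica warm to avoid cold starts
    "max_replicas": 4,
    "scale_up_threshold": 2,     # queued requests before adding a replica
    "prefix_caching": True,      # reuse KV cache across shared prompt prefixes
}

# accelerator is deliberately absent: it cannot be changed via an update.
assert "accelerator" not in update_config
```

Only the fields being changed need to appear in the payload; omitted parameters keep their current values.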
## Additional Pointers
### Model caching

**Availability:** Model caching is exclusively available to VPC customers. Cached models are stored in a private VPC bucket with read access restricted to the VPC customer only.

**Behavior:**
- Deployments always load model weights from the S3 cache, if available.
- The `cache_model` parameter controls whether model weights are written to the cache.
**Versioning:**
- By default, weights are cached at the latest model revision.
- To cache a specific revision, specify the `base_model` parameter in the format `model@revision`.
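As a concrete sketch of the `model@revision` format (the model path and revision used here are illustrative, not real pinned values):

```python
# Illustrative only: pin cached weights to a specific Hugging Face revision
# by appending "@<revision>" to base_model. "main" stands in for a commit hash.
config = {
    "base_model": "mistralai/Mistral-7B-v0.1@main",
    "cache_model": True,  # write the resolved weights to the private S3 cache
}

# Splitting on "@" recovers the model path and the pinned revision.
model, _, revision = config["base_model"].partition("@")
```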