# DeploymentConfig

Below is the class definition for `DeploymentConfig`. It inherits from `UpdateDeploymentConfig`.
```python
class DeploymentConfig(BaseModel):
    base_model: str  # required
    accelerator: str | None = Field(default=None)
    custom_args: list[str] | None = Field(default=None)
    cooldown_time: PositiveInt | None = Field(default=None)
    hf_token: str | None = Field(default=None)
    min_replicas: NonNegativeInt | None = Field(default=None)
    max_replicas: PositiveInt | None = Field(default=None)
    scale_up_threshold: PositiveInt | None = Field(default=None)
    quantization: str | None = Field(default=None)
    uses_guaranteed_capacity: bool | None = Field(default=None)
    lorax_image_tag: str | None = Field(default=None)
```
By default, Predibase sets the following values (subject to change):

- `accelerator`: Predibase chooses the most suitable option given your tier, the provided `base_model`, and availability.
- `cooldown_time`: 3600 seconds (1 hour)
- `min_replicas`: 0
- `max_replicas`: 1 (maximum of 3 for free/dev tier and 6 for enterprise)
- `scale_up_threshold`: 1
- `uses_guaranteed_capacity`: False (whether the deployment should draw from the guaranteed capacity you've purchased for the given accelerator)
- `lorax_image_tag`: Predibase uses a default commit tag for LoRAX that is updated at regular intervals. Advanced customers may make use of this parameter, but it is not recommended for general usage.
To configure autoscaling down to 0, set `min_replicas` to 0. To configure autoscaling past 1 replica, set `max_replicas` to the desired value and adjust `scale_up_threshold`. `scale_up_threshold` is the number of simultaneous requests a single LLM replica will handle before additional replicas are scaled up. The optimal value is highly dependent on your use case, and we suggest experimenting to find the number that meets your throughput needs.
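The relationship between these three knobs can be sketched in a few lines. This is not Predibase's actual autoscaler, just the arithmetic the settings imply:

```python
import math

def target_replicas(concurrent_requests: int, scale_up_threshold: int,
                    min_replicas: int, max_replicas: int) -> int:
    """Approximate replica count implied by the settings: each replica
    handles up to scale_up_threshold simultaneous requests."""
    if concurrent_requests <= 0:
        # With min_replicas=0 the deployment scales all the way to zero.
        return min_replicas
    needed = math.ceil(concurrent_requests / scale_up_threshold)
    return max(min_replicas, min(max_replicas, needed))

print(target_replicas(0, 4, 0, 3))   # 0  (scale to zero when idle)
print(target_replicas(10, 4, 0, 3))  # 3  (ceil(10 / 4) = 3)
print(target_replicas(30, 4, 0, 3))  # 3  (capped at max_replicas)
```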
If deploying a "best effort" custom base model:

- You must provide an `accelerator`.
- You must provide `hf_token` (your Hugging Face token) if deploying a private model.
Quantization is optional and dependent on the model, but available choices include: `none` (which amounts to `fp16` quantization), `fp8`, and `bitsandbytes-nf4`.
For the `base_model`, you can use the short names provided in the list of available models, or you can provide the path to a Hugging Face model (e.g. `meta-llama/Meta-Llama-3-8B`).
## Custom args
Predibase deployments are powered by LoRAX, and the `custom_args` parameter can be used to configure most of the launcher arguments available in the LoRAX CLI. The parameter accepts a list of strings equivalent to the args you would pass to the CLI, where each argument name and each value becomes its own item in the list. So:
```shell
lorax-launcher --preloaded-adapter-ids <repoName/1> <repoName/2> --compile
```
would become:
```python
custom_args = ["--preloaded-adapter-ids", "<repoName/1>", "<repoName/2>", "--compile"]
```
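If you already have a launcher command line, a small helper (illustrative, not part of the SDK) can do this conversion with `shlex`:

```python
import shlex

def cli_to_custom_args(cli: str) -> list[str]:
    """Split a lorax-launcher command line into the list form expected
    by custom_args, dropping the leading program name if present."""
    tokens = shlex.split(cli)
    if tokens and tokens[0] == "lorax-launcher":
        tokens = tokens[1:]
    return tokens

custom_args = cli_to_custom_args(
    "lorax-launcher --preloaded-adapter-ids <repoName/1> <repoName/2> --compile"
)
print(custom_args)
# ['--preloaded-adapter-ids', '<repoName/1>', '<repoName/2>', '--compile']
```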
Below is the list of available arguments:

- `adapter-id` - Name of the adapter to load. Takes the form of `repoName/repoVersion`.
- `source` - Source of the model to load.
- `adapter-source` - Source of the adapter to load.
- `revision` - Literal revision of the model if loading a Hugging Face model. Can be a specific commit id or a branch, e.g. `refs/pr/2`.
- `validation-workers` - Number of tokenizer workers used for payload validation and truncation inside the router.
- `sharded` - Whether to shard the model across multiple GPUs.
- `num-shard` - Number of shards to use if you don't want to use all GPUs on a given machine.
- `quantize` - Quantization scheme to apply to the model.
- `compile` - Whether you want to compile the model into a CUDA graph.
- `speculative-tokens` - Number of speculative tokens to generate in the model per step.
- `preloaded-adapter-ids` - List of adapter ids to preload during initialization (to avoid cold start times).
- `dtype` - Type to be forced upon the model.
- `max-concurrent-requests` - Maximum number of concurrent requests for this particular deployment.
- `max-best-of` - Maximum allowed value for `best_of` (which creates `N` generations at once and returns the best).
- `max-stop-sequences` - Maximum allowed value for `stop_sequences`.
- `max-input-length` - Maximum allowed input length (expressed in number of tokens).
- `max-total-tokens` - Maximum number of tokens that can be processed and generated at once.
- `waiting-served-ratio` - Ratio of waiting queries to running queries.
- `max-batch-prefill-tokens` - Limits the number of tokens for the prefill operation.
- `max-batch-total-tokens` - Total number of potential tokens within a batch.
- `max-waiting-tokens` - Number of tokens that can be passed before forcing the waiting queries to be put on the batch (if the size of the batch allows for it).
- `eager-prefill` - Whether to prioritize running prefill before decode.
- `prefix-caching` - Whether to use the prefix caching mechanism.
- `max-active-adapters` - Maximum number of adapters that can be placed on the GPU and accept requests at a time.
- `adapter-cycle-time-s` - Time in seconds between adapter exchanges.
- `adapter-memory-fraction` - Fraction of GPU memory set aside for loading adapters.
- `watermark-gamma`
- `watermark-delta`
- `tokenizer-config-path` - Path to the tokenizer config file.
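Putting a few of the arguments above together, a `custom_args` value might look like the following. The numbers are illustrative, not recommendations:

```python
# Hypothetical settings combining several lorax-launcher arguments.
custom_args = [
    "--max-input-length", "4096",
    "--max-total-tokens", "8192",
    "--max-concurrent-requests", "128",
    "--compile",
]

# Each flag and each value is its own string, exactly as it would appear
# on the command line; boolean flags like --compile take no value.
assert all(isinstance(arg, str) for arg in custom_args)
```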