DeploymentConfig

Below is the class definition for DeploymentConfig. It inherits from UpdateDeploymentConfig.

class DeploymentConfig(BaseModel):
    base_model: str  # required
    accelerator: str | None = Field(default=None)
    custom_args: list[str] | None = Field(default=None)
    cooldown_time: PositiveInt | None = Field(default=None)
    hf_token: str | None = Field(default=None)
    min_replicas: NonNegativeInt | None = Field(default=None)
    max_replicas: PositiveInt | None = Field(default=None)
    scale_up_threshold: PositiveInt | None = Field(default=None)
    quantization: str | None = Field(default=None)
    uses_guaranteed_capacity: bool | None = Field(default=None)
    max_total_tokens: int | None = Field(default=None)
    lorax_image_tag: str | None = Field(default=None)
    request_logging_enabled: bool | None = Field(default=None)
    direct_ingress: bool | None = Field(default=None)
    preloaded_adapters: list[str] | None = Field(default=None)
    speculator: str | None = Field(default=None)
    prefix_caching: bool | None = Field(default=None)

By default, Predibase sets the following default values (subject to change):

  • accelerator: None. If not specified, Predibase chooses the best suitable option given your tier, provided base_model, and availability.
  • cooldown_time: 3600 seconds (1 hour)
  • min_replicas: 0
  • max_replicas: 1 (Maximum of 3 for free/dev tier and 6 for enterprise)
  • scale_up_threshold: 1
  • uses_guaranteed_capacity: False (Whether the deployment should draw from the guaranteed capacity you've purchased for the given accelerator)
  • max_total_tokens: None. The maximum number of tokens the deployment will process per request (counting both input and output tokens). Also sets two other token-related parameters; see Notes below for details. If not specified, sensible defaults are picked for the given model.
  • lorax_image_tag: None. Predibase uses a default commit tag for LoRAX that is updated at regular intervals. Advanced customers may make use of this parameter, but it is not recommended for general usage.
  • request_logging_enabled: False. Only available for Enterprise SaaS customers. Please contact sales if you would like to modify your plan.
  • direct_ingress: False. Only available for VPC customers. Creates a direct endpoint to the LLM, bypassing the Predibase control plane.
  • preloaded_adapters: None. A list of adapter IDs to preload on deployment initialization. Reduces request latency by preloading adapters into memory. Only predibase-hosted adapters are supported.
  • speculator: None. Specify a Turbo or Turbo LoRA adapter to use as a speculator for the deployment. Specify <repo>/<version> to select a specific adapter, auto to have Predibase pick a reasonable default (see Notes), or disabled to use no speculator. If unspecified, the default is auto.
  • prefix_caching: None. Enables prefix caching for the deployment. Prefix caching is an experimental feature that can speed up inference by caching intermediate results. Leaving the value as None will use the default value of the deployed Lorax image (False as of Lorax 0.12.1).
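
To make the defaults and constraints above concrete, here is an illustrative, self-contained sketch (not SDK code) that builds a config as a plain dict and checks the documented rules; in the real SDK you would instead instantiate the DeploymentConfig class shown above and pass it to the deployment-creation call.

```python
# Illustrative sketch only: a plain-dict config mirroring the documented
# fields, plus a checker for the constraints described above.
def check_deployment_config(cfg: dict) -> None:
    """Validate a config dict against the documented rules."""
    if not cfg.get("base_model"):
        raise ValueError("base_model is required")
    lo = cfg.get("min_replicas")
    hi = cfg.get("max_replicas")
    if lo is not None and lo < 0:
        raise ValueError("min_replicas must be a non-negative int")
    if hi is not None and hi < 1:
        raise ValueError("max_replicas must be a positive int")
    if lo is not None and hi is not None and lo > hi:
        raise ValueError("min_replicas cannot exceed max_replicas")

# Override a few defaults; unset fields fall back to Predibase's defaults.
config = {
    "base_model": "meta-llama/Meta-Llama-3-8B",
    "min_replicas": 0,       # scale to zero when idle (the default)
    "max_replicas": 2,
    "cooldown_time": 1800,   # 30 minutes instead of the 1-hour default
}
check_deployment_config(config)
```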

Autoscaling is configured by the min_replicas, max_replicas, and scale_up_threshold parameters. If min_replicas != max_replicas, Predibase autoscales up and down between these two values based on traffic load. min_replicas is the minimum number of replicas running at any given time; if set to 0, your deployment scales down to zero replicas when there is no traffic. scale_up_threshold is the number of simultaneous requests a single LLM replica will handle before an additional replica is scaled up. The optimal value is highly use-case dependent; we suggest experimenting to find a number that meets your throughput needs.
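
As a rough sketch of this policy (not Predibase's actual scheduler), the target replica count for a given level of concurrency can be computed as:

```python
import math

def target_replicas(concurrent_requests: int,
                    min_replicas: int,
                    max_replicas: int,
                    scale_up_threshold: int) -> int:
    """Replicas needed so that no single replica handles more than
    scale_up_threshold simultaneous requests, clamped to the
    configured [min_replicas, max_replicas] range."""
    needed = math.ceil(concurrent_requests / scale_up_threshold)
    return max(min_replicas, min(max_replicas, needed))

# With min_replicas=0, max_replicas=3, scale_up_threshold=2:
target_replicas(0, 0, 3, 2)   # -> 0 (scales to zero with no traffic)
target_replicas(5, 0, 3, 2)   # -> 3 (ceil(5/2) = 3)
target_replicas(9, 0, 3, 2)   # -> 3 (capped at max_replicas)
```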

If deploying a "best effort" custom base model:

  • You must provide an accelerator
  • You must provide hf_token (your Huggingface token) if deploying a private model.

Quantization is optional and model-dependent; available choices include none (i.e. no quantization, serving the model in fp16), fp8, and bitsandbytes-nf4.

Notes
  • For base_model, you can use the short names provided in the list of available models or you can provide the path to a huggingface model (e.g. meta-llama/Meta-Llama-3-8B).
  • You can optionally specify a Hugging Face revision for the base model by specifying the base_model param in the format model@revision.
  • Setting max_total_tokens also sets two other Lorax parameters: max_batch_prefill_tokens (the max number of tokens for the prefill operation) and max_input_length (the maximum number of input tokens). Both parameters are set to max_total_tokens-1.
  • When speculator is set to auto, the deployment will use a pre-configured speculator based on its base model or, if one is not available, one of the preloaded Turbo or Turbo LoRA adapters provided by the user. Conversely, when speculator is disabled, no Turbo or Turbo LoRA adapters may be preloaded (i.e. they may not appear in preloaded_adapters).
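
Per the note above, setting max_total_tokens implies two derived Lorax parameters; the relationship can be sketched as:

```python
def derived_token_params(max_total_tokens: int) -> dict:
    """Parameters derived when max_total_tokens is set: both are
    max_total_tokens - 1, per the note above."""
    return {
        "max_batch_prefill_tokens": max_total_tokens - 1,
        "max_input_length": max_total_tokens - 1,
    }

derived_token_params(4096)
# -> {"max_batch_prefill_tokens": 4095, "max_input_length": 4095}
```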

Custom args

Predibase deployments are powered by LoRAX, and the custom_args parameter can be used to configure most of the launcher arguments available in the lorax CLI. The parameter accepts a list of strings equivalent to the args you would pass on the command line, where each argument name and each value becomes its own item in the list. So:

lorax-launcher --compile --eager-prefill true

would become:

custom_args = ["--compile", "--eager-prefill", "true"]
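
One convenient way to build this list (an illustrative helper, not part of the SDK) is to split a launcher-style command string with Python's shlex, which handles quoting the same way a shell would:

```python
import shlex

def cli_to_custom_args(cmd: str) -> list[str]:
    """Split a lorax-launcher style argument string into the
    list-of-strings form expected by custom_args."""
    return shlex.split(cmd)

custom_args = cli_to_custom_args("--compile --eager-prefill true")
# -> ["--compile", "--eager-prefill", "true"]
```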

Below is the list of available arguments (other lorax arguments can be set directly via the SDK parameters described above):

  • adapter-id - Name of the adapter to load. Takes the form of repoName/repoVersion
  • source - Source of the model to load.
  • adapter-source - Source of the adapter to load.
  • revision - Literal revision of the model if loading a huggingface model. Can be a specific commit id or a branch, e.g. refs/pr/2.
  • validation-workers - Number of tokenizer workers used for payload validation and truncation inside the router.
  • sharded - Whether to shard the model across multiple GPUs
  • quantize - Quantization method to apply to the model.
  • num-shard - Number of shards to use if you don't want to use all GPUs on a given machine.
  • compile - Whether you want to compile the model into a CUDA graph.
  • speculative-tokens - Number of speculative tokens to generate in the model per step.
  • dtype - Type to be forced upon the model
  • max-concurrent-requests - Maximum number of concurrent requests for this particular deployment.
  • max-best-of - Maximum allowed value for best-of (which creates N generations at once and returns the best)
  • max-stop-sequences - Maximum allowed value for stop-sequences
  • max-input-length - Maximum allowed input length (expressed in number of tokens)
  • max-total-tokens - Maximum number of tokens that can be processed and generated at once
  • waiting-served-ratio - Ratio of waiting queries to running queries
  • max-batch-prefill-tokens - Limits the number of tokens for the prefill operation.
  • max-batch-total-tokens - Total number of potential tokens within a batch.
  • max-waiting-tokens - Number of tokens that can be passed before forcing the waiting queries to be put on the batch (if the batch size allows it).
  • eager-prefill - Whether to prioritize running prefill before decode
  • max-active-adapters - Maximum number of adapters that can be placed on the GPU and accept requests at a time.
  • adapter-cycle-time-s - Time in seconds between adapter exchanges.
  • adapter-memory-fraction - Reservation of memory set aside for loading adapters onto the GPU.
  • watermark-gamma
  • watermark-delta
  • tokenizer-config-path - Path to the tokenizer config file.