Private Serverless Deployments

Deploying Base Models

Once you're ready for production, deploy a private instance of a base model, which can serve an unlimited number of fine-tuned adapters (via LoRAX). Predibase officially supports serving these models, and we offer best-effort support for custom base models from Hugging Face as well.

Private serverless deployments are billed by GPU-time. See available hardware and pricing.

Deployment configuration

For the base_model, use the short names provided in the list of available models.

For the full list of DeploymentConfig options, see the class reference.

pb.deployments.create(
    name="my-mistral-7b",
    config=DeploymentConfig(
        base_model="mistral-7b-instruct-v0-2",
        # cooldown_time=3600,  # Value in seconds, defaults to 43200 (12hrs)
        min_replicas=0,  # Auto-scales to 0 replicas when not in use
        max_replicas=1,
    ),
    # description="",  # Optional
)

By default, cooldown_time is set to 43200 seconds (12 hours) and min_replicas is set to 0, which means the deployment will scale down to zero replicas after 12 hours without requests.

  • If you'd like your deployment to be always-on, set min_replicas=1.
  • When getting started with testing, we recommend cooldown_time=3600.
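
Because cooldown_time is expressed in seconds, a small conversion helper can make a configuration easier to read. The sketch below is plain Python and does not call Predibase; the helper name cooldown_seconds is hypothetical, and its result would simply be passed as cooldown_time.

```python
def cooldown_seconds(hours: float = 0, minutes: float = 0) -> int:
    """Convert a human-readable idle period into seconds for cooldown_time."""
    if hours < 0 or minutes < 0:
        # Assumption: a negative idle period is never meaningful.
        raise ValueError("hours and minutes must be non-negative")
    return int(hours * 3600 + minutes * 60)


DEFAULT_COOLDOWN = cooldown_seconds(hours=12)  # the default described above
TESTING_COOLDOWN = cooldown_seconds(hours=1)   # the value suggested for testing
```

For example, cooldown_time=cooldown_seconds(minutes=10) reads more clearly than a bare 600.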

GPU capacity

We offer reserved GPU capacity in the Enterprise tier. In the Developer tier, you may see additional initialization / scale up time while you wait for the next available GPU.

Fine-tuned adapter deployments

If you are looking for a private instance of your fine-tuned adapter, we recommend deploying a base model (above) and using LoRAX to run inference on your adapter. LoRAX enables you to serve an unlimited number of adapters on a single base model.

If you would still like to have a private deployment of your fine-tuned model, we are able to serve it for you -- reach out to support@predibase.com.

Customize Compute

By default, Predibase will do automatic right-sizing to choose a suitable accelerator for the LLM you intend to deploy. You may also use a specific accelerator if you'd like.

pb.deployments.create(
    name="my-mistral-7b",
    config=DeploymentConfig(
        base_model="mistral-7b-instruct-v0-2",
        accelerator="a10_24gb_100",
        # cooldown_time=3600,  # Value in seconds, defaults to 43200 (12hrs)
        min_replicas=0,  # Auto-scales to 0 replicas when not in use
        max_replicas=1,
    ),
    # description="",  # Optional
)

Available Accelerators

The available accelerators and associated tiers are listed below. See our pricing.

Accelerator      ID              Predibase Tiers                   GPUs   GPU SKU
1 A10G 24GB      a10_24gb_100    Developer                         1      A10G
1 A100 80GB      a100_80gb_100   Developer                         1      A100
1 A100 80GB      a100_80gb_100   Enterprise (Predibase AI Cloud)   1      A100
2 A100 80GB *    a100_80gb_200   Enterprise (Predibase AI Cloud)   2      A100
4 A10G 24GB      a10_24gb_400    Enterprise (VPC)                  4      A10G

*To deploy on 2x A100s or upgrade to Enterprise, please reach out to us at sales@predibase.com.
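
To check an accelerator ID against a tier before creating a deployment, the table can be transcribed into a small lookup. This is only an illustrative sketch: the Accelerator type and both helper functions are hypothetical names, the rows are copied from the table above, and nothing here queries Predibase for live availability.

```python
from typing import List, NamedTuple


class Accelerator(NamedTuple):
    accelerator_id: str
    tier: str
    gpus: int
    gpu_sku: str


# Rows transcribed from the accelerator table in this section.
ACCELERATORS = [
    Accelerator("a10_24gb_100", "Developer", 1, "A10G"),
    Accelerator("a100_80gb_100", "Developer", 1, "A100"),
    Accelerator("a100_80gb_100", "Enterprise (Predibase AI Cloud)", 1, "A100"),
    Accelerator("a100_80gb_200", "Enterprise (Predibase AI Cloud)", 2, "A100"),
    Accelerator("a10_24gb_400", "Enterprise (VPC)", 4, "A10G"),
]


def accelerator_ids_for_tier(tier: str) -> List[str]:
    """Return the accelerator IDs listed for a tier, in table order."""
    return [a.accelerator_id for a in ACCELERATORS if a.tier == tier]


def is_listed(accelerator_id: str, tier: str) -> bool:
    """True if the table lists this accelerator ID under the given tier."""
    return accelerator_id in accelerator_ids_for_tier(tier)
```

A check such as is_listed("a100_80gb_200", "Developer") returning False is a cue to contact sales before requesting that accelerator.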

Prompting

Prompt the Base Model

Private serverless LLMs can be prompted via the Python SDK or REST API once they have been deployed.

# Specify the deployment by name (the name used when the deployment was created)
lorax_client = pb.deployments.client("my-mistral-7b")
print(lorax_client.generate("""<<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>

[INST] What is the best pizza restaurant in New York? [/INST]""", max_new_tokens=100).generated_text)

Prompt a Fine-Tuned Adapter (with LoRAX)

# Specify the private serverless deployment of the base model that was fine-tuned
lorax_client = pb.deployments.client("my-mistral-7b")

# Specify your adapter_id as "adapter-repo-name/adapter-version-number"
print(lorax_client.generate("""<<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>

[INST] What is the best pizza restaurant in New York? [/INST]""", adapter_id="adapter-repo-name/1",
max_new_tokens=100).generated_text)
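
Both examples above repeat the same prompt layout, so it can help to build the string in one place. The helper below is a hypothetical sketch (build_prompt is not part of the SDK) that simply reproduces the layout used in these examples; whether that layout is the right template for a given base model is an assumption to verify against the model's own documentation.

```python
def build_prompt(user_message: str, system_message: str = "") -> str:
    """Assemble a prompt in the same layout as the examples in this section."""
    instruction = f"[INST] {user_message.strip()} [/INST]"
    if not system_message:
        return instruction
    # System text is wrapped in <<SYS>> tags and separated by a blank line,
    # mirroring the prompts shown above.
    return f"<<SYS>>{system_message.strip()}<</SYS>>\n\n{instruction}"


prompt = build_prompt(
    "What is the best pizza restaurant in New York?",
    system_message="You are a helpful, respectful and honest assistant.",
)
```

The resulting string can then be passed as the first argument to lorax_client.generate, with or without adapter_id.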

Going to production

When you're ready for production, set force_bare_client=True. When this flag is False, the SDK runs a subprocess that queries the Predibase API and prints a helpful message if the deployment is still scaling up, which is useful for experimentation and notebooks. Setting it to True avoids these redundant API calls.

lorax_client = pb.deployments.client("my-mistral-7b", force_bare_client=True)
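
One way to avoid hard-coding the flag is to derive it from the runtime environment. The sketch below is an assumption-laden illustration: APP_ENV is a hypothetical environment variable name and client_options is a hypothetical helper; only the force_bare_client keyword comes from the text above.

```python
import os
from typing import Mapping


def client_options(environ: Mapping[str, str] = os.environ) -> dict:
    """Return keyword arguments for the client based on the environment."""
    # APP_ENV is a hypothetical variable; substitute whatever your setup uses.
    in_production = environ.get("APP_ENV", "development").lower() == "production"
    return {"force_bare_client": in_production}
```

The result could then be unpacked into the call, for example pb.deployments.client("my-mistral-7b", **client_options()), so notebooks keep the scale-up messages while production skips the extra API calls.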

Additional Methods

Update

Some configuration parameters can be updated after deployment. Fundamental configuration parameters like base model, quantization, and accelerator cannot be updated, but most common parameters like replication and cooldown time can. See deployments.update for full details.

Example update flow:

pb.deployments.get("my-mistral-7b")

# Returns a Deployment object containing
# UpdateDeploymentConfig(
#     custom_args=[], cooldown_time=43200, hf_token=None, min_replicas=0, max_replicas=1, scale_up_threshold=1
# )

pb.deployments.update(
    name="my-mistral-7b",
    config=UpdateDeploymentConfig(
        min_replicas=1,  # Increase min replicas to 1
        max_replicas=2,  # Increase max replicas to 2
        cooldown_time=600,  # Reduce cooldown time to 10 minutes
        custom_args=[],  # Remaining parameters are unchanged
        hf_token=None,
        scale_up_threshold=1,
    ),
)
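
The update flow above passes the unchanged parameters alongside the changed ones. A minimal sketch of that merge step, using plain dictionaries rather than the SDK, is shown below; merged_update and IMMUTABLE are hypothetical names, and the immutable set only reflects the three parameters this section names as not updatable.

```python
# Parameters this section describes as not updatable after deployment.
IMMUTABLE = {"base_model", "quantization", "accelerator"}


def merged_update(current: dict, **changes) -> dict:
    """Overlay changes on the current settings without mutating the input."""
    blocked = IMMUTABLE.intersection(changes)
    if blocked:
        raise ValueError(f"cannot update after deployment: {sorted(blocked)}")
    unknown = set(changes) - set(current)
    if unknown:
        # Catches misspelled parameter names before anything is sent.
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**current, **changes}


# Values as shown in the example output of pb.deployments.get above.
current = {
    "custom_args": [],
    "cooldown_time": 43200,
    "hf_token": None,
    "min_replicas": 0,
    "max_replicas": 1,
    "scale_up_threshold": 1,
}
updated = merged_update(current, min_replicas=1, max_replicas=2, cooldown_time=600)
```

Assuming the config accepts these keyword arguments as in the example above, the merged values could be passed on as UpdateDeploymentConfig(**updated).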

Delete

Deployments can be deleted via the SDK or CLI when you no longer need them.

pb.deployments.delete("my-mistral-7b")

Other helpful methods

  • List LLM Deployments - Method for fetching a list of LLM deployments
  • Get LLM Status - Method for checking your deployment's status to see whether it is ready for prompting