Private Serverless Deployments
Deploying Base Models
Once you're ready for production, deploy a private instance of a base model, which can serve an unlimited number of fine-tuned adapters (via LoRAX). Predibase officially supports serving these models, and we offer best-effort support for custom base models from Hugging Face as well.
Private serverless deployments are billed by GPU time. See available hardware and pricing.
For the base_model, use the short names provided in the list of available models.
For the full list of DeploymentConfig options, see the class reference.
- Python SDK
- CLI
from predibase import Predibase, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE API TOKEN>")

pb.deployments.create(
    name="my-mistral-7b",
    config=DeploymentConfig(
        base_model="mistral-7b-instruct-v0-2",
        # cooldown_time=3600, # Value in seconds, defaults to 3600 (1hr)
        min_replicas=0, # Auto-scales to 0 replicas when not in use
        max_replicas=1
    ),
    # description="", # Optional
)
pbase deploy llm --deployment-name llama-2-7b --model-name hf://meta-llama/Llama-2-7b-hf --engine-template llm-gpu-small --wait
By default, cooldown_time is set to 3600 seconds (1 hour) and min_replicas is set to 0, which means the deployment will scale down to zero replicas after 1 hour without requests.
- If you'd like your deployment to be always-on, set min_replicas=1.
- When getting started with testing, we recommend cooldown_time=3600.
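The interaction between min_replicas and cooldown_time can be modelled in a few lines. This is only a toy sketch of the scale-down behaviour described here, not Predibase's actual autoscaler; the function name replicas_when_idle and its current_replicas parameter are hypothetical.

```python
def replicas_when_idle(
    idle_seconds: int,
    min_replicas: int = 0,
    cooldown_time: int = 3600,
    current_replicas: int = 1,
) -> int:
    """Toy model: replicas expected to remain after `idle_seconds` without requests.

    Assumption: once the idle period reaches `cooldown_time`, the deployment
    drops to `min_replicas`; before that it keeps its current replicas.
    """
    if idle_seconds < cooldown_time:
        return current_replicas
    return min_replicas


# Default settings: still up after 30 idle minutes, scaled to zero after an hour.
print(replicas_when_idle(1800))
print(replicas_when_idle(3600))
# An always-on deployment (min_replicas=1) keeps one replica however long it idles.
print(replicas_when_idle(86400, min_replicas=1))
```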
We offer reserved GPU capacity in the Enterprise tier. In the Developer tier, you may see additional initialization / scale up time while you wait for the next available GPU.
Turbo LoRA and Turbo adapters
While any base model can be trained as a Turbo LoRA, some models require additional deployment configurations to support adapter inference properly. See adapter pre-load requirements here.
- If the base model fine-tuned does not require the adapter to be pre-loaded, you can use your Turbo LoRA adapter as normal (via private deployments and shared endpoints).
- If the base model fine-tuned requires the adapter to be pre-loaded, you'll need to create a private deployment with the following custom_args:
# Deploy with Turbo LoRA or Turbo adapter
pb.deployments.create(
    name="solar-pro-preview-instruct-deployment",
    config=DeploymentConfig(
        base_model="solar-pro-preview-instruct",
        # cooldown_time=3600, # Value in seconds, defaults to 3600 (1hr)
        min_replicas=0,
        max_replicas=1,
        custom_args=[
            "--adapter-id", "my-repo/1", # "my-repo" is your adapter repository and "1" is the version number
            "--adapter-source", "pbase",
            "--predibase-api-token", "<PREDIBASE API TOKEN>"
        ]
    )
)
# Since the adapter is already loaded, prompt without needing to specify the adapter dynamically
lorax_client = pb.deployments.client("solar-pro-preview-instruct-deployment")
print(lorax_client.generate("Where is the best slice shop in NYC?", max_new_tokens=100).generated_text)
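To keep the API token out of source files, the custom_args list can be assembled by a small helper. The sketch below is one possible approach, not part of the SDK: build_adapter_args is a hypothetical name, and reading the token from the PREDIBASE_API_TOKEN environment variable is an assumption borrowed from the REST examples.

```python
import os
from typing import List


def build_adapter_args(adapter_repo: str, version: int, api_token: str) -> List[str]:
    """Return the custom_args needed to pre-load one adapter version."""
    if not api_token:
        raise ValueError("a Predibase API token is required to pre-load the adapter")
    return [
        "--adapter-id", f"{adapter_repo}/{version}",  # "<repository>/<version number>"
        "--adapter-source", "pbase",
        "--predibase-api-token", api_token,
    ]


# Falls back to the placeholder string when the variable is not set.
token = os.environ.get("PREDIBASE_API_TOKEN", "<PREDIBASE API TOKEN>")
custom_args = build_adapter_args("my-repo", 1, token)
```

The resulting list can then be passed as custom_args=custom_args in the DeploymentConfig.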
Fine-tuned adapter deployments
If you are looking for a private instance of your fine-tuned adapter, we recommend deploying a base model (above) and using LoRAX to run inference on your adapter. LoRAX enables you to serve an unlimited number of adapters on a single base model.
If you would still like to have a private deployment of your fine-tuned model, we are able to serve it for you -- reach out to support@predibase.com.
Customize Compute
By default, Predibase will do automatic right-sizing to choose a suitable accelerator for the LLM you intend to deploy. You may also use a specific accelerator if you'd like.
- Python SDK
- CLI
pb.deployments.create(
    name="my-mistral-7b",
    config=DeploymentConfig(
        base_model="mistral-7b-instruct-v0-2",
        accelerator="a10_24gb_100",
        # cooldown_time=3600, # Value in seconds, defaults to 3600 (1hr)
        min_replicas=0, # Auto-scales to 0 replicas when not in use
        max_replicas=1
    ),
    # description="", # Optional
)
pbase deploy llm --deployment-name my-first-llm --model-name google/flan-t5-xl --engine-template llm-gpu-small
Available Accelerators
The available accelerators and associated tiers are listed below. See our pricing.
Accelerator | ID | Predibase Tiers | GPUs | SKU |
---|---|---|---|---|
1 A10G 24GB | a10_24gb_100 | All | 1 | A10G |
1 L40S 48GB | l40s_48gb_100 | All | 1 | L40S |
1 L4 24GB | l4_24gb_100 | VPC | 1 | L4 |
1 A100 80GB | a100_80gb_100 | Developer, Enterprise SaaS | 1 | A100 |
2 A100 80GB | a100_80gb_200 | Enterprise SaaS | 2 | A100 |
4 A10G 24GB | a10_24gb_400 | Enterprise VPC | 4 | A10G |
1 H100 80GB PCIe | h100_80gb_pcie_100 | Enterprise SaaS and VPC | 1 | H100 |
1 H100 80GB SXM | h100_80gb_sxm_100 | Enterprise SaaS and VPC | 1 | H100 |
To deploy on H100s, multi-GPU (A100 or H100), or upgrade to Enterprise, please reach out to us at sales@predibase.com.
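If you pass accelerator IDs around in your own scripts, it can help to validate them before calling the SDK. The lookup below simply transcribes the table above into a dictionary; the names ACCELERATORS and describe_accelerator are hypothetical, and the table remains the source of truth if the list changes.

```python
# Accelerator ID -> details, copied from the table above.
ACCELERATORS = {
    "a10_24gb_100": {"gpus": 1, "sku": "A10G", "tiers": "All"},
    "l40s_48gb_100": {"gpus": 1, "sku": "L40S", "tiers": "All"},
    "l4_24gb_100": {"gpus": 1, "sku": "L4", "tiers": "VPC"},
    "a100_80gb_100": {"gpus": 1, "sku": "A100", "tiers": "Developer, Enterprise SaaS"},
    "a100_80gb_200": {"gpus": 2, "sku": "A100", "tiers": "Enterprise SaaS"},
    "a10_24gb_400": {"gpus": 4, "sku": "A10G", "tiers": "Enterprise VPC"},
    "h100_80gb_pcie_100": {"gpus": 1, "sku": "H100", "tiers": "Enterprise SaaS and VPC"},
    "h100_80gb_sxm_100": {"gpus": 1, "sku": "H100", "tiers": "Enterprise SaaS and VPC"},
}


def describe_accelerator(accelerator_id: str) -> str:
    """Return a short description, or raise ValueError for an unknown ID."""
    try:
        info = ACCELERATORS[accelerator_id]
    except KeyError:
        known = ", ".join(sorted(ACCELERATORS))
        raise ValueError(
            f"unknown accelerator {accelerator_id!r}; expected one of: {known}"
        ) from None
    return f"{info['gpus']} x {info['sku']} (tiers: {info['tiers']})"


print(describe_accelerator("a10_24gb_100"))
```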
Prompting
Prompt the Base Model
Private serverless LLMs can be prompted via the Python SDK or REST API once they have been deployed.
- Python SDK
- REST
# Specify the deployment by name
lorax_client = pb.deployments.client("my-mistral-7b-instruct")
print(lorax_client.generate("""<<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>
[INST] What is the best pizza restaurant in New York? [/INST]""", max_new_tokens=100).generated_text)
# Export environment variables
export PREDIBASE_API_TOKEN="<YOUR TOKEN HERE>" # Settings > My Profile > Generate API Token
export PREDIBASE_TENANT_ID="<YOUR TENANT ID>" # Settings > My Profile > Overview > Tenant ID
export PREDIBASE_DEPLOYMENT="my-llama-2-7b-chat"
# query the LLM deployment
curl -d '{"inputs": "What is your name?", "parameters": {"max_new_tokens": 256}}' \
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"
Note: you can also use the /generate_stream endpoint to have the tokens streamed from the deployment.
See REST API for more parameters.
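The same REST call can be prepared from Python using only the standard library. This is a minimal sketch that mirrors the URL, headers, and body of the curl example; build_generate_request is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request

SERVING_URL = "https://serving.app.predibase.com"


def build_generate_request(tenant_id, deployment, api_token, prompt,
                           max_new_tokens=256, stream=False):
    """Build (but do not send) a POST request for /generate or /generate_stream."""
    endpoint = "generate_stream" if stream else "generate"
    url = f"{SERVING_URL}/{tenant_id}/deployments/v2/llms/{deployment}/{endpoint}"
    body = json.dumps(
        {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
    )


# Placeholder values; substitute your own tenant ID, deployment name, and token.
request = build_generate_request(
    "my-tenant-id", "my-llama-2-7b-chat", "my-api-token", "What is your name?"
)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would send it; see the REST API reference for the response format.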
Prompt a Fine-Tuned Adapter (with LoRAX)
- Python SDK
- REST
# Specify the private serverless deployment of the base model which was fine-tuned
lorax_client = pb.deployments.client("my-mistral-7b-instruct")
# Specify your adapter_id as "adapter-repo-name/adapter-version-number"
print(lorax_client.generate("""<<SYS>>You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.<</SYS>>
[INST] What is the best pizza restaurant in New York? [/INST]""", adapter_id="adapter-repo-name/1",
max_new_tokens=100).generated_text)
# Export environment variables
export PREDIBASE_API_TOKEN="<YOUR TOKEN HERE>" # Settings > My Profile > Generate API Token
export PREDIBASE_TENANT_ID="<YOUR TENANT ID>" # Settings > My Profile > Overview > Tenant ID
export PREDIBASE_DEPLOYMENT="my-llama-2-7b-chat"
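# Query the LLM deployment. Shell variables do not expand inside single quotes,
# so the token is spliced into the JSON body by closing and reopening the quotes.
curl -d '{"inputs": "What is your name?", "parameters": {"api_token": "'"${PREDIBASE_API_TOKEN}"'", "adapter_source": "pbase", "adapter_id": "<finetuned_model_repo_name>/<finetuned_model_version>", "max_new_tokens": 256}}' \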
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"
Note: you can also use the /generate_stream endpoint to have the tokens streamed from the deployment.
See REST API for more parameters.
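When the request body carries adapter parameters, building it programmatically avoids hand-written JSON and shell quoting. The sketch below reproduces the fields used in the curl example; adapter_payload is a hypothetical helper name, and only the fields shown in that example are assumed to exist.

```python
import json


def adapter_payload(prompt, adapter_repo, adapter_version, api_token,
                    max_new_tokens=256):
    """Serialize a /generate request body that targets a fine-tuned adapter."""
    parameters = {
        "api_token": api_token,
        "adapter_source": "pbase",
        # Adapter IDs take the form "<repo name>/<version number>".
        "adapter_id": f"{adapter_repo}/{adapter_version}",
        "max_new_tokens": max_new_tokens,
    }
    return json.dumps({"inputs": prompt, "parameters": parameters})


body = adapter_payload("What is your name?", "adapter-repo-name", 1, "my-api-token")
print(body)
```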
Going to production
When you're ready for production, set force_bare_client=True. When this flag is set to False, the SDK runs a subprocess that queries the Predibase API and prints a helpful message if the deployment is still scaling up, which is useful for experimentation and notebooks. Setting the flag to True avoids these redundant API calls.
lorax_client = pb.deployments.client("mistral-7b-instruct", force_bare_client=True)
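One way to avoid forgetting this flag is to derive it from the runtime environment. The sketch below is only an illustration: the APP_ENV variable and the client_options helper are hypothetical and not part of the Predibase SDK.

```python
import os


def client_options(environment: str) -> dict:
    """Return keyword arguments for pb.deployments.client for an environment name."""
    # Only production skips the status-checking subprocess.
    return {"force_bare_client": environment.strip().lower() == "production"}


# APP_ENV is an assumed variable name; use whatever your deployment tooling sets.
options = client_options(os.environ.get("APP_ENV", "development"))
print(options)
# Example use: pb.deployments.client("mistral-7b-instruct", **options)
```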
Additional Methods
Update
Some configuration parameters can be updated after deployment. Fundamental configuration parameters like base model, quantization, and accelerator cannot be updated, but most common parameters like replication and cooldown time can. See deployments.update for full details.
Example update flow:
deployment = pb.deployments.get("my-mistral-7b")
# Returns a Deployment object containing
# UpdateDeploymentConfig(
#     custom_args=[], cooldown_time=3600, hf_token=None, min_replicas=0, max_replicas=1, scale_up_threshold=1
# )
config = deployment.config
# Update some parameters
config.min_replicas = 1
config.max_replicas = 2
config.cooldown_time = 600
# This updates the min/max replicas and cooldown time while leaving the remaining parameters unchanged.
# Note that the unchanged parameters are still passed as part of the `config` struct. If we instead
# provided a struct with those parameters set to `None`, they would be reset to their default values.
pb.deployments.update(
    name="my-mistral-7b",
    config=config,
)
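The warning about None values is easy to miss, so here is a toy model of it. It does not use the SDK: ToyConfig and apply_update are hypothetical stand-ins that only mimic the "None resets to the default" rule stated in the comments above, using the default values shown in this page.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class ToyConfig:
    """Stand-in for an update config; None means 'not provided'."""
    min_replicas: Optional[int] = None
    max_replicas: Optional[int] = None
    cooldown_time: Optional[int] = None


DEFAULTS = ToyConfig(min_replicas=0, max_replicas=1, cooldown_time=3600)


def apply_update(update: ToyConfig) -> ToyConfig:
    """Resolve an update: any field left as None falls back to its default."""
    resolved = {}
    for field in fields(ToyConfig):
        value = getattr(update, field.name)
        resolved[field.name] = getattr(DEFAULTS, field.name) if value is None else value
    return ToyConfig(**resolved)


current = ToyConfig(min_replicas=1, max_replicas=2, cooldown_time=600)

# Safe: start from the current config and change one field.
safe = apply_update(replace(current, cooldown_time=1200))
# Risky: a fresh config silently resets the replica settings to their defaults.
risky = apply_update(ToyConfig(cooldown_time=1200))

print(safe)
print(risky)
```

This is why the example flow fetches deployment.config first and mutates it, rather than constructing a new config from scratch.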
Delete
Deployments can be deleted via the SDK or CLI when you no longer need them.
- Python SDK
- CLI
pb.deployments.delete("my-mistral-7b")
pbase delete llm --deployment-name my-llama-2-7b-chat
Other helpful methods
- List LLM Deployments - Fetches a list of LLM deployments
- Get LLM Status - Checks the status of your deployment to see whether it is ready for prompting
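A status check is typically wrapped in a polling loop before the first prompt is sent. The sketch below is generic and makes several assumptions: get_status is any callable you supply that returns the current status as a string, and the "ready" value, timeout, and interval are illustrative placeholders rather than documented SDK values.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    ready_status: str = "ready",  # assumed status string, adjust to the real one
    timeout: float = 600.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `get_status` until it returns `ready_status` or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if get_status() == ready_status:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Demonstration with a fake status source and no real sleeping.
fake_statuses = iter(["initializing", "initializing", "ready"])
print(wait_until_ready(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

In real use, get_status would wrap whichever status method you call on the deployment, and the default sleep would pause between polls.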