llm.deploy
Deploy Pretrained LLM
# Pick a HuggingFace pretrained LLM to deploy.
# The URI must look something like "hf://meta-llama/Llama-2-7b-hf".
llm = pc.LLM(uri)
# Asynchronous deployment
llm.deploy(deployment_name)
# Synchronous (blocking) deployment
llm.deploy(...).get()
Deploy Fine-tuned LLM
# Get the fine-tuned LLM to deploy using pc.get_model.
# For example, pc.get_model(name="Llama-2-7b-hf-code_alpaca_800", version=3)
# returns model version #3 of the model repo named "Llama-2-7b-hf-code_alpaca_800"
model = pc.get_model(model_repo_name, optional_version_number)
# Asynchronous deployment
model.deploy(deployment_name)
# Synchronous (blocking) deployment
model.deploy(...).get()
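For example, to deploy version 3 of the fine-tuned model repo mentioned above (the deployment name here is illustrative, not one prescribed by the SDK):
model = pc.get_model(name="Llama-2-7b-hf-code_alpaca_800", version=3)
model_deployment = model.deploy(deployment_name="llama-2-7b-code-alpaca").get()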
This method initiates deployment of your HuggingFaceLLM object in Predibase. The method itself is asynchronous, but users who want to track deployment progress may chain .get() onto the call, which blocks and streams incremental logs for the operation.
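As a minimal sketch of both patterns (assuming llm is the LLM object created above, and that the returned LLMDeploymentJob can be held and resolved later with .get()):
# Asynchronous: returns an LLMDeploymentJob immediately.
job = llm.deploy(deployment_name="llama-2-7b")
# Blocking: .get() waits for the deployment to finish, streaming logs, and returns an LLMDeployment.
deployment = job.get()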
Parameters:
deployment_name: str
The name of the LLM deployment. This name will appear in the Prompt editor in the UI, and you will use it when prompting the LLM deployment through the SDK via the prompt method. A combined example that passes several of these parameters is shown after this list.
NOTE: To ensure your deployment is properly reachable, the deployment name you provide must be RFC1123 subdomain compliant.
hf_token: Optional[str]
The HuggingFace token that will be used to deploy the model from the HuggingFace Hub. This is especially important to specify if the model you would like to serve is gated or private to your organization.
The default is a Predibase-managed token.
auto_suspend_seconds: Optional[int]
The duration in seconds after which your LLM deployment will automatically scale down if it doesn't receive any requests.
The default is 0 seconds, meaning the deployment will not automatically scale down, even when it receives no requests.
max_input_length: Optional[int]
The max sequence length that the model will process (in tokens). Anything beyond this limit will be truncated.
The default is 1024.
max_total_tokens: Optional[int]
The per-request token budget for the model (prompt plus generated tokens). For example, if max_total_tokens is 1000 and the prompt is 800 tokens long, the model can generate at most 200 new tokens. Because this budget applies per request, larger values consume more GPU memory per request and make batching less effective, since fewer requests fit in each batch.
The default is 2048.
max_batch_prefill_tokens: Optional[int]
An upper bound on the number of tokens for the prefill operation.
The default is 4096.
quantization_kwargs: Optional[Dict]
Quantization parameters that will be used to serve quantized models.
Here's an example of a valid dictionary:
{
"quantize": "gptq",
"parameters": [
{"name": "GPTQ_BITS", "value": "4"},
{"name": "GPTQ_GROUPSIZE", "value": "128"},
]
}
Another valid example is:
{
"quantize": "bitsandbytes",
}
The default is None.
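Putting several of these parameters together, a deployment call might look like the following sketch (the deployment name, token placeholder, and parameter values are illustrative, and all parameters are assumed to be passed as keyword arguments to deploy):
llm = pc.LLM("hf://meta-llama/Llama-2-7b-hf")
llm_deployment = llm.deploy(
    deployment_name="llama-2-7b-custom",  # must be RFC1123 subdomain compliant
    hf_token="<YOUR_HF_TOKEN>",  # needed for gated or private models
    auto_suspend_seconds=3600,  # scale down after one hour without requests
    max_input_length=1024,  # prompts longer than this are truncated
    max_total_tokens=2048,  # per-request budget for prompt + generated tokens
    max_batch_prefill_tokens=4096,  # upper bound on tokens in the prefill operation
).get()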
Returns:
llm.deploy: An LLMDeploymentJob object.
llm.deploy(...).get(): An LLMDeployment object.
Example Usage:
Deploy a pretrained LLM with the name "llama-2-7b".
llm = pc.LLM("hf://meta-llama/Llama-2-7b-hf")
llm_deployment = llm.deploy(deployment_name="llama-2-7b").get()
Predibase Deployment URI
After a deployment is initialized, Predibase will create a URI that points to it, which takes the form:
llm_deployment = pc.LLM("pb://deployments/deployment-name")
where deployment-name is the name you provided in the deploy command above.
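For instance, continuing the example above, an existing deployment can be referenced by its URI. The prompt call below is a sketch only; its exact signature is not shown in this section, so consult the prompt method's documentation.
llm_deployment = pc.LLM("pb://deployments/llama-2-7b")
# Hypothetical prompt call; the actual prompt signature may differ.
result = llm_deployment.prompt("Write a haiku about machine learning.")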
Supported OSS LLMs
See the updated list of LLMs that we support for serving.
Serving Quantized Models
If your model is quantized, you'll need to specify the correct quantization method when calling LLM.deploy. We support the quantization methods listed here. Note that, at the moment, we do not support fine-tuning post-quantized models.
Example: Serving Marcoroni-7B-v3-AWQ
llm = pc.LLM("hf://TheBloke/Marcoroni-7B-v3-AWQ")
quantization = {"quantize": "awq"}
llm_deployment = llm.deploy(deployment_name="marcoroni-7b-awq", quantization_kwargs=quantization).get()