Serving Fine-Tuned LLMs
Predibase allows serving fine-tuned LLMs either dynamically (using LoRAX) or statically within a dedicated deployment:
- LoRA eXchange (LoRAX) - using LoRAX, any number of fine-tuned LLM adapters can be served together alongside a single copy of the base model weights. This allows you to dramatically reduce the cost of serving multiple fine-tuned LLMs. With LoRAX, fine-tuned adapters are loaded dynamically at runtime based on each request's parameters.
- Static Deployments - the base model parameters and a single set of fine-tuned adapter weights are loaded together into a single deployment during initialization. This avoids the 200ms - 1s cold start time incurred when dynamically loading an adapter the first time, and ensures that no traffic for another deployment will interfere with the SLOs of your deployment. If you anticipate having enough request volume to necessitate scaling to multiple replicas of the same adapter, then you may prefer this approach.
Note that when using Serverless LLMs, all fine-tuned models are served with LoRAX.
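The SDK snippets in this section reference an initialized Predibase client as pc. Here is a minimal setup sketch, assuming the Python SDK's PredibaseClient class; the exact authentication flow (where your API token comes from) may differ, so see the SDK docs:
from predibase import PredibaseClient

# Assumes your Predibase API token has already been configured
# (for example via the SDK's standard authentication setup).
pc = PredibaseClient()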
LoRAX
- Select the deployed pretrained (base) LLM that is the same as the Huggingface model used for fine-tuning
- You may need to deploy it if you haven't already.
- You can list LLM deployments to see what LLMs are already deployed. Predibase Cloud users may use one of the serverless LLM deployments as the base deployment for LoRAX.
- Select your fine-tuned LLM using the get_model method.
- Create an adapter on the deployment.
- View the default prompt template used for fine-tuning.
- Prompt!
Full example:
base_dep = pc.LLM("pb://deployments/<deployment_name>")
model = pc.get_model("<your_finetuned_model_repo_name>")
# Attach the adapter to the (client-side) deployment object
ft_dep = base_dep.with_adapter(model)
# View prompt template used for fine-tuning
ft_dep_template = ft_dep.default_prompt_template
print(ft_dep_template)
# Now prompt!
# In the Code Alpaca example, we can see from the prompt template that our
# model was fine-tuned with a template that accepts an {instruction} and an {input}.
result = ft_dep.prompt(
{
"instruction": "Write an algorithm in Java to reverse the words in a string.",
"input": "The quick brown fox"
},
max_new_tokens=256)
print(result.response)
Fine-tuned LLMs served using LoRAX may only be queried via the SDK or REST API at this time. Because dynamic adapters don't create new deployments, they won't appear in the Predibase UI Query Editor dropdown menu; only dedicated deployments show up there.
Model Versions
You can prompt any model version from within a Model Repo that was successfully trained (status: Ready).
In this example, we're going to prompt a serverless LLM using a fine-tuned adapter model in a repo called my_model_repo at version 123:
- Python SDK
- REST
llm = pc.LLM("pb://deployments/llama-2-7b-chat")
adapter = pc.get_model("my_model_repo", version=123)
ft_llm = llm.with_adapter(adapter)
result = ft_llm.prompt("What is your name?")
print(result.response)
curl -d '{"inputs": "What is your name?", "parameters": {"adapter_id": "my_model_repo/123"}}' \
-H "Content-Type: application/json" \
-X POST https://api.predibase.com/v1/llms/llama-2-7b-chat \
-H "Authorization: Bearer ${API_TOKEN}"
See REST API for more parameters.
If no version is specified, the latest version in the repo will be used by default:
- Python SDK
- REST
llm = pc.LLM("pb://deployments/llama-2-7b-chat")
adapter = pc.get_model("my_model_repo")
ft_llm = llm.with_adapter(adapter)
result = ft_llm.prompt("What is your name?")
print(result.response)
curl -d '{"inputs": "What is your name?", "parameters": {"adapter_id": "my_model_repo"}}' \
-H "Content-Type: application/json" \
-X POST https://api.predibase.com/v1/llms/llama-2-7b-chat \
-H "Authorization: Bearer ${API_TOKEN}"
See REST API for more parameters.
Huggingface Hub
You can also prompt adapters that were trained outside of Predibase and are hosted on the Huggingface Hub.
The same restriction applies: the adapter must have been trained on a base model with the same architecture.
To prompt a model from Huggingface, set adapter_source to hub in your request parameters:
curl -d '{"inputs": "What is your name?", "parameters": {"adapter_id": "my_organization/my_adapter", "adapter_source": "hub"}}' \
-H "Content-Type: application/json" \
-X POST https://api.predibase.com/v1/llms/llama-2-7b-chat \
-H "Authorization: Bearer ${API_TOKEN}"
See REST API for more parameters.
This feature is only supported via REST at this time. Official Python SDK support for Huggingface adapters is coming soon.
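In the meantime, you can make the equivalent REST call from Python. This is a sketch using the requests library that mirrors the curl command above; the endpoint, adapter_id, and adapter_source values are taken from that example, and API_TOKEN is assumed to be set in your environment:
import os
import requests

# Mirrors the curl example above: pass the Huggingface adapter via
# adapter_id and set adapter_source to "hub".
response = requests.post(
    "https://api.predibase.com/v1/llms/llama-2-7b-chat",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['API_TOKEN']}",
    },
    json={
        "inputs": "What is your name?",
        "parameters": {
            "adapter_id": "my_organization/my_adapter",
            "adapter_source": "hub",
        },
    },
)
print(response.json())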
Static Deployments
Deploy your fine-tuned LLM on its own serving engine. Once deployed, you can use the prompt method in the SDK to query your model or use the Query Editor in the Predibase UI.
Only VPC and Premium SaaS users with the Admin role will be able to create a dedicated deployment for a fine-tuned LLM.
- Python SDK
- CLI
# Deploy the fine-tuned model (the object returned by get_model) to its own dedicated serving engine
finetuned_llm = model.deploy("llama-2-7b-finetuned").get()
result = finetuned_llm.prompt(
{
"instruction": "Write an algorithm in Java to reverse the words in a string.",
"input": "The quick brown fox"
},
max_new_tokens=256)
print(result.response)
pbase deploy llm --deployment-name llama-2-7b-finetuned --model-name pb://models/llama-2-7b-finetuned --wait
pbase prompt llm \
instruction="Write an algorithm in Java to reverse the words in a string." \
input="The quick brown fox" \
--deployment-name llama-2-7b-finetuned
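If the dedicated deployment already exists, you can also reconnect to it in a later SDK session using the same pb://deployments URI convention shown earlier. This is a sketch, assuming the deployment object returned by pc.LLM can be prompted directly, as with the object returned by deploy().get():
# Reconnect to the existing dedicated deployment by name and prompt it.
finetuned_llm = pc.LLM("pb://deployments/llama-2-7b-finetuned")
result = finetuned_llm.prompt(
    {
        "instruction": "Write an algorithm in Java to reverse the words in a string.",
        "input": "The quick brown fox"
    },
    max_new_tokens=256)
print(result.response)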