
Dedicated Deployments

Deploying Pretrained Models

Predibase supports serving pretrained models from Hugging Face.

To deploy a pretrained model:

  1. Select the LLM from Hugging Face you'd like to deploy
  2. Deploy the LLM
llm = pc.LLM("hf://meta-llama/Llama-2-7b-chat-hf")
llm_deployment = llm.deploy(deployment_name="my-llama-2-7b-chat").get()

Deploying Fine-Tuned Models

We recommend using LoRAX to dynamically prompt fine-tuned models on a common base model. However, in some cases you may want a dedicated deployment for a single fine-tuned model.

Using this method, the resulting deployment can be prompted the same way as an ordinary base model deployment, but the fine-tuned model will be used (without needing to specify the adapter_id):

finetuned_llm = model.deploy("llama-2-7b-finetuned").get()
result = finetuned_llm.prompt({
    "instruction": "Write an algorithm in Java to reverse the words in a string.",
    "input": "The quick brown fox"
})


Engine Templates

By default, Predibase automatically right-sizes the engine to suit the LLM you intend to deploy.

deployment = llm.deploy("my-first-llm").get()

Customize the Engine Template

To deploy an LLM using a specific engine template, you can use:

deployment = llm.deploy("my-first-llm", engine_template="llm-gpu-small").get()

Available Engine Templates

Engine Template | GPUs | GPU SKU | vCPUs | RAM | Disk

Predibase Deployment URI

After a deployment is initialized, Predibase creates a URI that points to it, of the form:

# Select your deployment via: llm_deployment = pc.LLM("pb://deployments/deployment-name")

where deployment-name is the name you provided in the deploy command above.
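The URI is just the pb://deployments/ prefix followed by the deployment name, so it can be built programmatically. A minimal sketch (the deployment_uri helper is illustrative, not part of the SDK):

```python
def deployment_uri(deployment_name: str) -> str:
    """Build the pb:// URI for a named Predibase deployment."""
    return f"pb://deployments/{deployment_name}"

# The deployment created earlier can then be selected by URI:
uri = deployment_uri("my-llama-2-7b-chat")
print(uri)  # pb://deployments/my-llama-2-7b-chat
```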


Prompt the Base Model

Dedicated LLMs can be prompted via the Python SDK or REST API once they have been deployed.

llm = pc.LLM("pb://deployments/my-llama-2-7b-chat")
result = llm.prompt("What is your name?", max_new_tokens=256)

Prompt a Fine-Tuned Adapter (with LoRAX)

Using LoRAX, any fine-tuned model in Predibase can be prompted immediately after training completes, provided its base model matches that of the dedicated deployment.

llm = pc.LLM("pb://deployments/my-llama-2-7b-chat")

# Attach the adapter to the (client-side) deployment object
adapter = pc.get_model(name="<finetuned_model_repo_name>", version="<finetuned_model_version>")
ft_llm = llm.with_adapter(adapter)

# View prompt template used for fine-tuning
ft_llm_template = ft_llm.default_prompt_template

result = ft_llm.prompt("What is your name?", max_new_tokens=256)

Model Versions

You can prompt any model version within a Model Repository that trained successfully (status: Ready). In the example above, we prompted a dedicated deployment using a fine-tuned adapter from the repo <finetuned_model_repo_name> at version <finetuned_model_version>.

If no version is specified, the latest version in the repo will be used by default.
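For example, both of the following calls return an adapter handle; this is a sketch assuming the same pc client used above, with the placeholder repo name left as-is:

```python
# Pin a specific trained version of the adapter
adapter_pinned = pc.get_model(name="<finetuned_model_repo_name>", version="<finetuned_model_version>")

# Omit the version to use the latest version in the repo
adapter_latest = pc.get_model(name="<finetuned_model_repo_name>")
```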

Delete a Deployment

Deployments can be deleted via the SDK to free up compute resources.
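A minimal sketch, assuming the deployment object returned by pc.LLM exposes a delete() method (check the SDK reference for the exact call):

```python
# Select the deployment by its pb:// URI, then tear it down
llm_deployment = pc.LLM("pb://deployments/my-llama-2-7b-chat")
llm_deployment.delete()  # frees the compute resources backing the deployment
```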


Other helpful methods

  • List LLM Deployments - Various options for fetching a list of LLM deployments
  • LLM Status - Methods for checking the status of your deployments and seeing whether they are ready for prompting