
Serverless Endpoints

info

VPC customers do not have access to the shared serverless deployments and should start by deploying a dedicated LLM.
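As a rough sketch, creating a dedicated deployment with the Python client might look like the snippet below. The hf:// model URI, the deployment name, and the deploy()/get() calls are illustrative assumptions rather than confirmed API details; consult the documentation on deploying an LLM for the exact workflow.

from predibase import PredibaseClient

# Initialize the Predibase client (assumes your API token is already configured)
pc = PredibaseClient()

# Assumed workflow: deploy a base model as a dedicated deployment
base_llm = pc.LLM("hf://meta-llama/Llama-2-7b-chat-hf")
dedicated_llm = base_llm.deploy(deployment_name="my-llama-2-7b-chat").get()

# Prompt the dedicated deployment the same way as a serverless endpoint
result = dedicated_llm.prompt("What is your name?", max_new_tokens=256)
print(result.response)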

Prompt Pretrained Models

Prompting a pretrained serverless model is as simple as:

from predibase import PredibaseClient

# Initialize the Predibase client (assumes your API token is already configured)
pc = PredibaseClient()

# Select your (serverless) LLM deployment
llm = pc.LLM("pb://deployments/llama-2-7b-chat")

result = llm.prompt("What is your name?", max_new_tokens=256)
print(result.response)

Prompt Fine-tuned Models (with LoRAX)

If the base model is available as one of the serverless endpoints, you can prompt your fine-tuned model immediately after training by adding the two lines shown below. If the base model is not available as a serverless endpoint, you will need to use a dedicated deployment to prompt your fine-tuned model.

llm = pc.LLM("pb://deployments/llama-2-7b-chat")

# Attach the adapter to the (client-side) deployment object
adapter = pc.get_model(name="<finetuned_model_repo_name>", version="<model_version>")
ft_llm = llm.with_adapter(adapter)

# View prompt template used for fine-tuning
ft_llm_template = ft_llm.default_prompt_template
print(ft_llm_template)

result = ft_llm.prompt("What is your name?", max_new_tokens=256)
print(result.response)

Model Versions

You can prompt any model version in a Model Repository that has trained successfully (status: Ready). In the example above, we prompted a serverless LLM using a fine-tuned adapter from the repository <finetuned_model_repo_name> at version <model_version>.

If no version is specified, the latest version in the repository will be used by default.
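For example, building on the snippet above, you can either pin a specific adapter version or omit the version argument to use the latest version in the repository; the repository name and version below are placeholder values.

llm = pc.LLM("pb://deployments/llama-2-7b-chat")

# Pin a specific adapter version from the repository
adapter_pinned = pc.get_model(name="<finetuned_model_repo_name>", version="<model_version>")

# Omit the version to use the latest version in the repository
adapter_latest = pc.get_model(name="<finetuned_model_repo_name>")

result = llm.with_adapter(adapter_latest).prompt("What is your name?", max_new_tokens=256)
print(result.response)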