Quickstart

This quickstart will show you how to prompt, fine-tune, and deploy LLMs in Predibase. We'll be following a code generation use case where our end result will be a fine-tuned Llama 2 7B model that takes in natural language as input and returns code as output.

For Python SDK users, we'd recommend using an interactive notebook environment, such as Jupyter or Google Colab.

Setup

Make sure you've installed the SDK and configured your API token.
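
If you haven't installed the SDK yet, it's typically a pip install followed by a login from the command line (a minimal sketch; see the installation docs for the exact steps in your environment):

pip install -U predibase
pbase login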

If you're using the Python SDK, you'll first need to initialize your PredibaseClient object. All SDK examples below will assume this has already been done.

from predibase import PredibaseClient

# If you've already run `pbase login`, you don't need to provide any credentials here.
#
# If you're running in an environment where the `pbase login` command-line tool isn't available,
# you can also set your API token using `pc = PredibaseClient(token="<your token here>")`.
pc = PredibaseClient()

Prompt a deployed LLM

For our code generation use case, let's first see how Llama 2 7B performs out of the box.

info

If you are in the Developer or Enterprise SaaS tier, you have access to shared serverless LLM deployments, including Llama 2 7B. If you are in a VPC environment, you'll need to deploy an LLM before you can query it.

llm_deployment = pc.LLM("pb://deployments/llama-2-7b")
result = llm_deployment.prompt("""
Below is an instruction that describes a task, paired with an input
that may provide further context. Write a response that appropriately
completes the request.

### Instruction: Write an algorithm in Java to reverse the words in a string.

### Input: The quick brown fox

### Response:
""", max_new_tokens=256)
print(result.response)

Fine-tune a pretrained LLM

Next we'll upload a dataset and fine-tune to see if we can get better performance.

The Code Alpaca dataset is used to fine-tune large language models to follow instructions and produce code from natural language. It consists of the following columns:

  • instruction: a description of the task to perform
  • input: additional context for the instruction, when required
  • output: the expected code output

Download the Code Alpaca dataset

For the sake of this quickstart, we've created a version of the Code Alpaca dataset with fewer rows so that the model trains significantly faster.

wget https://predibase-public-us-west-2.s3.us-west-2.amazonaws.com/datasets/code_alpaca_800.csv
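
If you'd like to sanity-check the columns before uploading, you can optionally peek at the file with pandas (not required for the quickstart; assumes pandas is installed):

import pandas as pd

# Inspect the downloaded dataset (optional sanity check)
df = pd.read_csv("code_alpaca_800.csv")
print(df.columns.tolist())  # expect: ['instruction', 'input', 'output']
print(df.head())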

Upload the dataset to Predibase and start fine-tuning

Next, upload the dataset to Predibase, set the prompt template to be used for fine-tuning, and start the fine-tuning job.

The fine-tuning job should take around 35-45 minutes total. Queueing time depends on how quickly we're able to acquire resources and what other jobs might be ahead in the queue. The training time itself should be around 25-30 minutes. As the model trains, you'll receive updated metrics in your notebook or terminal. You can also see metrics and visualizations in the Predibase UI.

# Upload the dataset to Predibase (estimated time: 2 minutes due to creation of Predibase dataset with dataset profile)
# If you've already uploaded the dataset before, you can skip the upload and fetch it directly with:
#   dataset = pc.get_dataset("code_alpaca_800", "file_uploads")
dataset = pc.upload_dataset("code_alpaca_800.csv")

# Define the template used to prompt the model for each example
# Note the 4-space indentation, which is necessary for the YAML templating.
prompt_template = """Below is an instruction that describes a task, paired with an input
that may provide further context. Write a response that appropriately
completes the request.

### Instruction: {instruction}

### Input: {input}

### Response:
"""

# Specify the Huggingface LLM you want to fine-tune
# Kick off a fine-tuning job on the uploaded dataset
llm = pc.LLM("hf://meta-llama/Llama-2-7b-hf")
job = llm.finetune(
    prompt_template=prompt_template,
    target="output",
    dataset=dataset,
    # repo="optional-custom-model-repository-name"
)

# Wait for the job to finish and get training updates and metrics
model = job.get()

Prompt your fine-tuned LLM

LoRA eXchange (LoRAX) allows you to prompt your fine-tuned LLM without needing to create a new deployment for each model you want to prompt. Predibase automatically loads your fine-tuned weights on top of a serverless LLM deployment on demand. While this means that there will be a small amount of additional latency, the benefit is that a single LLM deployment can support many different fine-tuned model versions without requiring additional compute.

info

Predibase provides serverless LLM deployments for Developer and Enterprise SaaS tier customers (priced per token). VPC users need to deploy their own base model.

# Since our model was fine-tuned from a Llama-2-7b base, we'll use the serverless deployment with the same model type.
base_deployment = pc.LLM("pb://deployments/llama-2-7b")

# Now we just specify the adapter to use, which is the model we fine-tuned.
# If you've lost the reference to your fine-tuned model, you can retrieve it with:
#   model = pc.get_model(name="<finetuned_model_repo_name>", version="<model_version>")
adapter_deployment = base_deployment.with_adapter(model)

# Recall that our model was fine-tuned using a template that accepts an {instruction}
# and an {input}. This template is automatically applied when prompting.
result = adapter_deployment.prompt(
    {
        "instruction": "Write an algorithm in Java to reverse the words in a string.",
        "input": "The quick brown fox"
    },
    max_new_tokens=256)

print(result.response)
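
Because LoRAX hot-swaps adapters onto the shared deployment, you can keep reusing the same adapter_deployment object for additional requests. As a quick sketch (the instructions below are purely illustrative):

# Prompt the same fine-tuned adapter with a few more example tasks
examples = [
    {"instruction": "Write a Python function to check whether a number is prime.", "input": ""},
    {"instruction": "Write a SQL query that counts the rows in a table.", "input": "Table name: orders"},
]

for example in examples:
    result = adapter_deployment.prompt(example, max_new_tokens=256)
    print(result.response)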

Create a Dedicated Deployment

If you'd like a private deployment, you can deploy a base model and, thanks to LoRAX, run inference on any number of fine-tuned adapters using that single deployment. Once deployed, you can prompt your deployment using the SDK or UI. Deploying a 7B-parameter model should take around 10 minutes.

# Deploy a dedicated deployment of the same base model the adapter was fine-tuned from
llm = pc.LLM("hf://meta-llama/Llama-2-7b-hf")
llm_deployment = llm.deploy(deployment_name="my-llama-2-7b").get()

# Get a handle to the new deployment using the pb:// URI with the name that was just used
base_deployment = pc.LLM("pb://deployments/my-llama-2-7b")

# Now we just specify the adapter to use, which is the model we fine-tuned.
# If you've lost the reference to your fine-tuned model, you can retrieve it with:
#   model = pc.get_model(name="<finetuned_model_repo_name>", version="<model_version>")
adapter_deployment = base_deployment.with_adapter(model)

# Recall that our model was fine-tuned using a template that accepts an {instruction}
# and an {input}. This template is automatically applied when prompting.
result = adapter_deployment.prompt(
    {
        "instruction": "Write an algorithm in Java to reverse the words in a string.",
        "input": "The quick brown fox"
    },
    max_new_tokens=256)

print(result.response)