If you’ve fine-tuned a classification adapter with Predibase, or want to use a sequence classification model from Hugging Face, you can use Predibase to serve and query your model.

Quick Start

First, install the Predibase Python SDK:
pip install -U predibase

Deploying a Classification Model

Below is an example of deploying an LLM preloaded with two classification adapters trained with Predibase. Each adapter contains LoRA weights and a classification head.
from predibase import Predibase, ClassificationDeploymentConfig

pb = Predibase(api_token="<API_TOKEN>") 

deployment = pb.deployments.create(
    name="my-classification-model",
    config=ClassificationDeploymentConfig(
        base_model='qwen2-5-3b-instruct',
        accelerator="a100_80gb_100",
        min_replicas=0,
        max_replicas=1,
        preloaded_adapters=["classification/1", "classification/2"],
    ),
)
You do not need to preload every adapter you plan to use, but preloading reduces the overhead of swapping adapter weights in and out during inference.
Classification deployments only support classification tasks. You cannot do next-token generation with the base model.

Querying the Model

SDK

Currently, only the REST API is supported for classification. SDK support is coming soon!

REST

You will need to provide your Tenant ID and API Token, which can be found in the Settings section of the Predibase platform. You will also need the name of the deployment you created (e.g. my-classification-model from the above example).
export TENANT_ID=...
export PB_API_TOKEN=...
export DEPLOYMENT=...
Then you can classify multiple input sequences in a single request by running:
curl -H "Content-Type: application/json" \
     -X POST https://serving.app.predibase.com/${TENANT_ID}/deployments/v2/llms/${DEPLOYMENT}/classify \
     -H "Authorization: Bearer ${PB_API_TOKEN}" \
     --data '{"input": ["prompt 1", "prompt 2"], "model": "classification/1"}'
You will get a response that looks something like:
{
  "api_token": "",
  "id": "classify-example-1234",
  "object": "list",
  "created": 1755208477,
  "model": "classification/1",
  "data": [
    {
      "api_token": "",
      "index": 0,
      "label": "label0",
      "probs": [0.7, 0.3],
      "num_classes": 2
    },
    {
      "api_token": "",
      "index": 1,
      "label": "label1",
      "probs": [0.4, 0.6],
      "num_classes": 2
    }
  ],
  "usage": {
    ...
  }
}
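Each entry in data corresponds to one input sequence, in the same order as the input list. As a quick sketch, assuming result holds the parsed JSON from the Python snippet above, you can pull out the predicted label and the top class probability for each input like this:
# Print the index, predicted label, and highest class probability for each input
for item in result["data"]:
    print(item["index"], item["label"], max(item["probs"]))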
The inference server automatically batches inputs across requests, and you can run inference with multiple adapters at the same time.
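For example, to run both preloaded adapters against the same inputs, you could fan out requests concurrently. This is a sketch reusing the url and api_token variables from the snippet above; each request selects a different adapter via the model field:
from concurrent.futures import ThreadPoolExecutor

import requests

def classify(adapter):
    # Each request targets a different adapter via the "model" field
    resp = requests.post(
        url,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        json={"input": ["prompt 1", "prompt 2"], "model": adapter},
    )
    resp.raise_for_status()
    return resp.json()

with ThreadPoolExecutor() as pool:
    results = list(pool.map(classify, ["classification/1", "classification/2"]))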