Warning: This is a beta feature. APIs and behavior may change.

Use the batch inference API to process inference requests in bulk at a lower cost.

Dataset Format

Batch inference takes a dataset as input; each row of the dataset is sent to the model as a separate request. The input dataset must conform to the OpenAI batch inference format.

Example dataset row (each row must have a unique string custom_id):

{
  "custom_id": "1",
  "method": "POST",
  "url": "/v1/chat/completions",
  "body": {
    "model": "",
    "messages": [
      {
        "role": "user",
        "content": "Formulate an equation to calculate the height of a triangle given the angle, side lengths and opposite side length."
      }
    ],
    "max_tokens": 1000
  }
}

Leave the model field as an empty string to prompt the base model without adapters. To use an adapter, specify the adapter ID in the model field; the base model itself is specified when creating the batch job.
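
For example, to route a single row through an adapter, set that row's model field to the adapter ID (the adapter name my-adapter/1 here is hypothetical; it assumes an adapter repo named my-adapter at version 1):

    "model": "my-adapter/1"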

Once your JSONL file is ready, upload it to Predibase.
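
A minimal upload sketch using the Python SDK's dataset upload helper (the local path and dataset name are placeholders; the dataset name is what you reference when creating the batch job):

from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Upload the JSONL file as a Predibase dataset.
dataset = pb.datasets.from_file(
    "path/to/my_dataset.jsonl",
    name="my_inference_dataset",
)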

For information about preparing your dataset, see Dataset Preparation.

Creating a Batch Job

Warning: Currently, only base models with up to 16B parameters are supported.

from predibase import Predibase
from predibase.beta.config import BatchInferenceServerConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Configure the batch job
config = BatchInferenceServerConfig(
    base_model="qwen3-8b",
    lorax_image_tag=None,  # Optional: pin a specific LoRAX image version
    hf_token=None,  # Optional: Hugging Face token for gated base models
    quantization=None,  # Optional: quantization method for the base model
)

# Or use an existing deployment's configuration
dep = pb.deployments.get("my_deployment")
config_from_deployment = BatchInferenceServerConfig.from_deployment(dep)

# Create the batch job, referencing the uploaded dataset by name
job = pb.beta.batch_inference.create(
    dataset="my_inference_dataset",
    server_config=config,
)

Monitoring Jobs

# Check specific job status
print(pb.beta.batch_inference.get(job).status)
print(pb.beta.batch_inference.get("<JOB_UUID>").status)

# List all batch jobs
jobs = pb.beta.batch_inference.list()
for j in jobs:
    print(j)
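
If you want to block until the job finishes, a simple polling loop works. A minimal sketch; "completed" is the terminal status documented below, while "failed" as the error state is an assumption:

import time

# Poll the job until it reaches a terminal state.
while True:
    status = pb.beta.batch_inference.get(job).status
    print(f"Job status: {status}")
    if status in ("completed", "failed"):
        break
    time.sleep(30)  # Batch jobs can take a while; poll sparingly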

Getting Results

Once the job status is completed, download the results:

pb.beta.batch_inference.download_results(job, dest="path/to/my_output.jsonl")
pb.beta.batch_inference.download_results("<JOB_UUID>", dest="path/to/my_output.jsonl")

Example result row:

{
  "custom_id": "1",
  "response": {
    "status_code": 200,
    "body": {
      "id": "null",
      "choices": [
        {
          "finish_reason": "stop",
          "index": 0,
          "logprobs": null,
          "message": {
            "content": " To calculate the height of a triangle when the angle, side lengths, and the length ...",
            "refusal": null,
            "role": "assistant",
            "audio": null,
            "function_call": null,
            "tool_calls": null
          }
        }
      ],
      "created": 1737575137,
      "model": "predibase/qwen3-8b",
      "object": "text_completion",
      "service_tier": null,
      "system_fingerprint": "0.1.0-native",
      "usage": {
        "completion_tokens": 379,
        "prompt_tokens": 30,
        "total_tokens": 409,
        "completion_tokens_details": null,
        "prompt_tokens_details": null
      }
    }
  },
  "error": null
}

Note: Result rows may not be in the same order as the input dataset. Use the custom_id field to match results with inputs.
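
To join results back to inputs, key both files by custom_id. A minimal sketch, assuming the local input and output paths used in the examples above:

import json

# Index model responses by custom_id, then walk the input file in order.
with open("path/to/my_output.jsonl") as f:
    results = {row["custom_id"]: row for row in map(json.loads, f)}

with open("path/to/my_dataset.jsonl") as f:
    for row in map(json.loads, f):
        result = results[row["custom_id"]]
        content = result["response"]["body"]["choices"][0]["message"]["content"]
        print(row["custom_id"], content[:80])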

Pricing

Batch inference is priced at $0.50 per million tokens; input and output tokens are billed at the same rate.
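
For example, a job that consumes 2 million prompt tokens and generates 1 million completion tokens is billed for 3 million tokens total: 3 × $0.50 = $1.50.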