Warning: This is a beta feature. APIs and behavior may change.
Use the batch inference API to process inference requests in bulk at a lower
cost.
Batch inference takes a dataset as input; each row is sent to the model as a
prompt. The input dataset must conform to the
OpenAI batch inference format.
Example dataset row:
{
    "custom_id": "1", # Each row must have a unique string ID
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "", # Empty string means prompt the base model without adapters
        "messages": [
            {
                "role": "user",
                "content": "Formulate an equation to calculate the height of a triangle given the angle, side lengths and opposite side length."
            }
        ],
        "max_tokens": 1000
    }
}
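If you are generating prompts programmatically, you can write the input file directly. The following is a minimal sketch using only the standard library; the prompts and the output filename are placeholders:
import json

prompts = [
    "Formulate an equation to calculate the height of a triangle given the angle, side lengths and opposite side length.",
    "Explain the Pythagorean theorem in one paragraph.",  # Placeholder prompt
]

# Write one OpenAI-batch-format request per line of the JSONL file.
with open("my_inference_dataset.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        row = {
            "custom_id": str(i),  # Must be unique per row
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "",  # Empty string targets the base model
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 1000,
            },
        }
        f.write(json.dumps(row) + "\n")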
To use an adapter, specify the adapter ID in the model field. The base model
will be specified when creating the batch job.
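For example, a row that prompts a fine-tuned adapter might set the model field as follows. The adapter ID my-adapter-repo/1 is a placeholder; Predibase adapter IDs take the form repo-name/version:
"body": {
    "model": "my-adapter-repo/1", # Placeholder adapter ID ("repo-name/version")
    ...
}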
Once your JSONL file is ready, upload it to Predibase. For information about
preparing your dataset, see Dataset Preparation.
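For example, using the Python SDK's dataset upload helper (the API token, file path, and dataset name are placeholders, and from_file is assumed to be the upload entry point in your SDK version):
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Upload the JSONL file as a Predibase dataset. The path and name below
# are placeholders; the name is what you reference when creating the job.
dataset = pb.datasets.from_file(
    "path/to/my_inference_dataset.jsonl",
    name="my_inference_dataset",
)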
Creating a Batch Job
Warning: Currently only base models up to 16B parameters are supported.
from predibase import Predibase
from predibase.beta.config import BatchInferenceServerConfig

# Instantiate the Predibase client
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Configure the batch job
config = BatchInferenceServerConfig(
    base_model="qwen3-8b",
    lorax_image_tag=None,  # Optional
    hf_token=None,         # Optional
    quantization=None,     # Optional
)

# Or use an existing deployment's configuration
dep = pb.deployments.get("my_deployment")
config_from_deployment = BatchInferenceServerConfig.from_deployment(dep)

# Create the batch job
job = pb.beta.batch_inference.create(
    dataset="my_inference_dataset",
    server_config=config,
)
Monitoring Jobs
# Check specific job status
print(pb.beta.batch_inference.get(job).status)
print(pb.beta.batch_inference.get("<JOB_UUID>").status)
# List all batch jobs
jobs = pb.beta.batch_inference.list()
for j in jobs:
    print(j)
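To block until a job finishes, you can poll its status. This is a minimal sketch; completed is the terminal status shown in this guide, while failed and canceled are assumed terminal states, so check your SDK version for the exact values:
import time

# Poll the job until it reaches a terminal state.
while True:
    status = pb.beta.batch_inference.get(job).status
    print(f"Job status: {status}")
    if status in ("completed", "failed", "canceled"):
        break
    time.sleep(30)  # Avoid polling the API too aggressively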
Getting Results
Once the job status is completed, download the results:
pb.beta.batch_inference.download_results(job, dest="path/to/my_output.jsonl")
pb.beta.batch_inference.download_results("<JOB_UUID>", dest="path/to/my_output.jsonl")
Example result row:
{
    "custom_id": "1",
    "response": {
        "status_code": 200,
        "body": {
            "id": "null",
            "choices": [
                {
                    "finish_reason": "stop",
                    "index": 0,
                    "logprobs": null,
                    "message": {
                        "content": " To calculate the height of a triangle when the angle, side lengths, and the length ...",
                        "refusal": null,
                        "role": "assistant",
                        "audio": null,
                        "function_call": null,
                        "tool_calls": null
                    }
                }
            ],
            "created": 1737575137,
            "model": "predibase/qwen3-8b",
            "object": "text_completion",
            "service_tier": null,
            "system_fingerprint": "0.1.0-native",
            "usage": {
                "completion_tokens": 379,
                "prompt_tokens": 30,
                "total_tokens": 409,
                "completion_tokens_details": null,
                "prompt_tokens_details": null
            }
        }
    },
    "error": null
}
Note: Result rows may not be in the same order as the input dataset. Use the
custom_id field to match results with inputs.
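Since ordering is not guaranteed, a small post-processing step can join results back to inputs. This is a minimal sketch; the file paths are placeholders matching the examples above:
import json

# Map each result row back to its input prompt via custom_id.
with open("path/to/my_inference_dataset.jsonl") as f:
    inputs = {row["custom_id"]: row for row in map(json.loads, f)}

with open("path/to/my_output.jsonl") as f:
    for result in map(json.loads, f):
        prompt = inputs[result["custom_id"]]["body"]["messages"][-1]["content"]
        completion = result["response"]["body"]["choices"][0]["message"]["content"]
        print(prompt[:40], "->", completion[:40])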
Pricing
Batch inference is priced at $0.50 per million tokens, with no difference
between input and output tokens.
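For example, the result row above used 409 total tokens (30 prompt + 379 completion), which would cost 409 × $0.50 / 1,000,000 ≈ $0.0002.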