Batch Inference (BETA)

⚠️ This is a beta feature. APIs and behavior may change.

Use the batch inference API to process inference requests in bulk at a lower cost.

Preparing Your Dataset

Batch inference takes in a dataset as input, and each row in the dataset will be used as a prompt to the model. The input dataset must conform to the OpenAI batch inference format.

Here's an example of what a row in your dataset might look like (formatted here for readability):

{
  "custom_id": "1",  # Each row in your dataset must have a unique string ID.
  "method": "POST",
  "url": "/v1/chat/completions",
  "body": {
    "model": "",  # Empty string means prompt the base model without any adapters applied.
    "messages": [
      {
        "role": "user",
        "content": "Formulate an equation to calculate the height of a triangle given the angle, side lengths and opposite side length."
      }
    ],
    "max_tokens": 1000
  }
}

To use an adapter for any row, specify the adapter ID in the model field shown above. The base model for your inference job is specified later, when you launch the job. Other parameters behave as described in the LoRAX OpenAI-compatible API.

Once your JSONL file is ready, you can upload it to Predibase. Now you're ready to start batch inference!
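As a minimal sketch, you could build the JSONL file in Python like this. The prompts, file name, and the adapter ID in the second row are illustrative assumptions, and the commented-out pb.datasets.from_file upload call is an assumption about the SDK's dataset upload method; confirm it against the Predibase dataset documentation for your SDK version.

import json

# Illustrative rows: the first prompts the base model, the second a hypothetical adapter.
rows = [
    {
        "custom_id": "1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "",  # Empty string: prompt the base model without adapters.
            "messages": [{"role": "user", "content": "Summarize the Pythagorean theorem."}],
            "max_tokens": 1000,
        },
    },
    {
        "custom_id": "2",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "my-adapter-repo/1",  # Hypothetical adapter ID; replace with your own.
            "messages": [{"role": "user", "content": "Explain the law of cosines."}],
            "max_tokens": 1000,
        },
    },
]

# JSONL: one JSON object per line.
with open("my_inference_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Upload call assumed to be pb.datasets.from_file; check your SDK version.
# pb.datasets.from_file("my_inference_dataset.jsonl", name="my_inference_dataset")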

Running Batch Inference

⚠️ Currently only base models up to 16B parameters are supported for batch inference.

When you launch a batch inference job, Predibase takes care of deploying your target base model and loading any necessary adapters. Note that batch jobs may not schedule immediately!

Many of the options available when creating a deployment can also be configured for batch inference. To start your inference job, use the Predibase SDK:

from predibase import Predibase
from predibase.beta.config import BatchInferenceServerConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Parameters have the same meaning as those used to create deployments.
config = BatchInferenceServerConfig(
    base_model="mistral-7b-instruct-v0-2",
    lorax_image_tag=None,  # Optional.
    hf_token=None,  # Optional.
    quantization=None,  # Optional.
)

# Or, if you want to re-use the configuration of an existing deployment:
dep = pb.deployments.get("my_deployment")
config_from_deployment = BatchInferenceServerConfig.from_deployment(dep)

job = pb.beta.batch_inference.create(
    dataset="my_inference_dataset",
    server_config=config,
)
# => Successfully requested batch inference over my_inference_dataset using mistral-7b-instruct-v0-2 as <JOB UUID>.

Checking Progress of Batch Inference Jobs

# To check the status of a specific job:
print(pb.beta.batch_inference.get(job).status)
print(pb.beta.batch_inference.get("<JOB_UUID>").status)

# To see a list of all batch inference jobs:
jobs = pb.beta.batch_inference.list()
for j in jobs:
    print(j)
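If you want to block until a job finishes, a simple polling loop like the sketch below works. The terminal status strings ("completed" and "failed") are assumptions based on the completed state described below; inspect the status values returned by the SDK to confirm them.

import time

# A minimal polling sketch. The terminal status strings are assumptions;
# check pb.beta.batch_inference.get(job).status for the exact values.
while True:
    status = pb.beta.batch_inference.get(job).status
    print(f"Current status: {status}")
    if status in ("completed", "failed"):
        break
    time.sleep(60)  # Batch jobs may not schedule immediately, so poll patiently.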

Downloading Batch Inference Results

Once your job has reached the completed state, you can download the output results like this:

pb.beta.batch_inference.download_results(job, dest="path/to/my_output.jsonl")
pb.beta.batch_inference.download_results("<JOB_UUID>", dest="path/to/my_output.jsonl")

Result rows will look something like this (formatted for readability):

{
"custom_id": "1",
"response": {
"status_code": 200,
"body": {
"id": "null",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": " To calculate the height of a triangle when the angle, side lengths, and the length ...",
"refusal": null,
"role": "assistant",
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1737575137,
"model": "predibase/Mistral-7B-Instruct-v0.2-dequantized",
"object": "text_completion",
"service_tier": null,
"system_fingerprint": "0.1.0-native",
"usage": {
"completion_tokens": 379,
"prompt_tokens": 30,
"total_tokens": 409,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
},
"error": null
}

Note that the order of rows in the downloaded results file may not match the order in the original dataset. You can use the unique custom_id to match result and input rows.
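For example, a small sketch (assuming the input and output file names used above) that joins each result back to its input row by custom_id:

import json

# Index the original prompts by custom_id, then attach each result to its input row.
# File names match the examples above; adjust paths to your own files.
inputs = {}
with open("my_inference_dataset.jsonl") as f:
    for line in f:
        row = json.loads(line)
        inputs[row["custom_id"]] = row

with open("path/to/my_output.jsonl") as f:
    for line in f:
        result = json.loads(line)
        prompt_row = inputs[result["custom_id"]]
        answer = result["response"]["body"]["choices"][0]["message"]["content"]
        print(prompt_row["body"]["messages"][-1]["content"], "->", answer[:80])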

Pricing

Batch inference is priced at $0.50 per million tokens. There is no price difference between input and output tokens.
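As a rough sketch, you can estimate the cost of a completed job from the usage fields in the downloaded results; since the flat $0.50-per-million-token rate applies to input and output tokens alike, summing total_tokens is enough. The output path is the one used in the download example above.

import json

# Sum tokens across all result rows, then apply the flat per-million-token rate.
total_tokens = 0
with open("path/to/my_output.jsonl") as f:
    for line in f:
        usage = json.loads(line)["response"]["body"]["usage"]
        total_tokens += usage["total_tokens"]

estimated_cost = total_tokens / 1_000_000 * 0.50
print(f"{total_tokens} tokens -> estimated cost ${estimated_cost:.2f}")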