
This guide shows you how to run inference against a fine-tuned model via the REST API.

  • You must provide your Predibase API token in the Authorization header of the request. You can generate one under Settings > My Profile > Generate API Token.
  • You will also need your Predibase tenant ID. You can find it under Settings > My Profile > Overview > Tenant ID.


Once you have your Predibase API token and tenant ID, set them as environment variables in your terminal.
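For example (the placeholder values are stand-ins; substitute your own token, tenant ID, and deployment name):

```shell
# Replace the placeholder values with your own credentials.
export PREDIBASE_API_TOKEN="<YOUR TOKEN HERE>"
export PREDIBASE_TENANT_ID="<YOUR TENANT ID HERE>"

# The serverless deployment name is referenced in the request below.
export PREDIBASE_DEPLOYMENT="<DEPLOYMENT NAME>"
```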


For PREDIBASE_DEPLOYMENT, you'll need the name of the serverless LLM corresponding to the base model that was fine-tuned. For adapter_id, you'll need the model repo name and version number of the adapter.


curl https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
  -X POST \
  -d '{"inputs": "What is your name?", "parameters": {"api_token": "<YOUR TOKEN HERE>", "adapter_source": "pbase", "adapter_id": "<MODEL REPO NAME>/<MODEL VERSION NUMBER>", "max_new_tokens": 128}}' \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"
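The same request can be issued from Python. The sketch below builds the URL, headers, and JSON body; it assumes the serving URL pattern `https://serving.app.predibase.com/{tenant_id}/deployments/v2/llms/{deployment}/generate` (verify against your deployment's endpoint) and the environment variables set earlier.

```python
import json
import os


def build_generate_request(prompt, adapter_id=None, max_new_tokens=128):
    """Build the URL, headers, and JSON body for a /generate call.

    The URL pattern is an assumption based on the curl example; adjust it
    if your deployment uses a different serving host.
    """
    tenant_id = os.environ["PREDIBASE_TENANT_ID"]
    deployment = os.environ["PREDIBASE_DEPLOYMENT"]
    url = (
        f"https://serving.app.predibase.com/{tenant_id}"
        f"/deployments/v2/llms/{deployment}/generate"
    )
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['PREDIBASE_API_TOKEN']}",
    }
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    if adapter_id is not None:
        # Fine-tuned adapters hosted on Predibase use the "pbase" source.
        body["parameters"]["adapter_source"] = "pbase"
        body["parameters"]["adapter_id"] = adapter_id
    return url, headers, json.dumps(body)
```

To send the request, pass the pieces to any HTTP client, e.g. `requests.post(url, headers=headers, data=body)`.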


  • When querying fine-tuned models, include the prompt template used during fine-tuning in the inputs.
  • You can also use the /generate_stream endpoint to stream tokens from the deployment. Parameters follow the same format as the LoRAX generate endpoints.
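To illustrate the first point, here is a minimal sketch of wrapping a user question in a prompt template before placing it in inputs. The template below is a hypothetical example; substitute the exact template your adapter was trained with.

```python
# Hypothetical instruction-style template; use the template from your
# own fine-tuning job, not this one.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def format_prompt(instruction: str) -> str:
    """Embed the raw user instruction in the fine-tuning template."""
    return TEMPLATE.format(instruction=instruction)


prompt = format_prompt("What is your name?")
# `prompt` is what goes in the "inputs" field of the request body.
```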

Request Parameters:

inputs: string

The prompt to be passed to the specified LLM.

parameters: Optional, json

adapter_id: Optional, string

The ID of the adapter to use for the operation. This will be of the following format: model_repo_name/model_version_number. Example: My model repo/3.

Default: no adapter used.

adapter_source: Optional, string

The source of the adapter to use for the operation. Options are pbase for fine-tuned adapters on Predibase, hub for the Hugging Face Hub, or s3.
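For example, a parameters object referencing an adapter from the Hugging Face Hub might look like the following. The adapter repo name here is hypothetical; substitute your own.

```python
import json

# Hypothetical adapter repo on the Hugging Face Hub.
params = {
    "adapter_source": "hub",
    "adapter_id": "my-org/my-lora-adapter",
    "max_new_tokens": 64,
}

# Full request body for the /generate endpoint.
payload = json.dumps({"inputs": "What is your name?", "parameters": params})
```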

best_of: Optional, integer

Generate best_of sequences and return the one with the highest token logprobs. Defaults to 1.

details: Optional, boolean

Return the token logprobs and ids of the generated tokens.

decoder_input_details: Optional, boolean

Return the token logprobs and ids of the input prompt.

do_sample: Optional, boolean

Whether or not to use sampling; use greedy decoding otherwise. Defaults to false.

max_new_tokens: Optional, integer

The maximum number of new tokens to generate. If not provided, will default to 20.

repetition_penalty: Optional, float64

The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details. Defaults to no penalty.

return_full_text: Optional, boolean

Whether to prepend the prompt to the generated text. Defaults to false.

seed: Optional, integer

The seed to use for the random number generator. If not provided, will default to a random seed.

stop: Optional, array of strings

Stop generating tokens if a member of stop_sequences is generated.

temperature: Optional, float64

Temperature is used to control the randomness of predictions. Higher values increase diversity and lower values increase determinism. Setting a temperature of 0 is useful for testing and debugging.

top_k: Optional, integer

Top-k is a sampling method where the k highest-probability vocabulary tokens are kept and the probability mass is redistributed among them.

top_p: Optional, float64

Top-p (aka nucleus sampling) is an alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass. For example, 0.2 corresponds to only the tokens comprising the top 20% probability mass being considered.
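To make the two sampling parameters above concrete, here is a minimal sketch (not Predibase's implementation) of how top-k and top-p filtering restrict the candidate token set before sampling:

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, renormalized."""
    kept = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def top_p_filter(probs, top_p):
    """Keep the smallest top-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.1, 0.1]  # token index -> probability
print(top_k_filter(probs, 2))    # keeps tokens 0 and 1, renormalized
print(top_p_filter(probs, 0.7))  # 0.5 + 0.3 >= 0.7, so tokens 0 and 1
```

Sampling then draws from the renormalized distribution over the kept tokens instead of the full vocabulary.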

truncate: Optional, integer

The number of tokens to truncate the output to. If not provided, will default to user's default truncate.

typical_p: Optional, float64

If set to a float < 1, the smallest set of the most locally typical tokens with probabilities that add up to typical_p or higher is kept for generation. See Typical Decoding for Natural Language Generation for more information.

watermark: Optional, boolean