

This guide shows you how to run inference via the REST API.

  • You must provide your API token in the Authorization header of the request. You can find this under Settings > My Profile > Generate API Token.
  • You will also need your Predibase Tenant ID. This is a unique identifier to your account. You can find this under Settings > My Profile > Overview > Tenant ID.
  • For dedicated deployments, set PREDIBASE_DEPLOYMENT to the deployment name: llama-predibase
  • For serverless LLMs, set PREDIBASE_DEPLOYMENT to the model name

Once you have your Predibase API token and Tenant ID, set them as environment variables in your terminal:

export PREDIBASE_API_TOKEN="<your-api-token>"
export PREDIBASE_TENANT_ID="<your-tenant-id>"
export PREDIBASE_DEPLOYMENT="llama-predibase"

To query your llama-predibase deployment, run the following:

curl https://serving.app.predibase.com/${PREDIBASE_TENANT_ID}/deployments/v2/llms/${PREDIBASE_DEPLOYMENT}/generate \
-d '{"inputs": "What is your name?", "parameters": {"max_new_tokens": 20, "temperature": 0.1}}' \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"

Note: you can also use the /generate_stream endpoint to stream tokens from the deployment as they are generated. The parameters follow the same format as the LoRAX generate endpoints.
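As a sketch, the same request can be composed in Python. The endpoint URL below is an assumption pieced together from the tenant ID and deployment name used above; verify it against your deployment before relying on it.

```python
import json
import os

# Read the same environment variables set in the terminal above,
# falling back to placeholders if they are unset.
tenant_id = os.environ.get("PREDIBASE_TENANT_ID", "<your-tenant-id>")
deployment = os.environ.get("PREDIBASE_DEPLOYMENT", "llama-predibase")
api_token = os.environ.get("PREDIBASE_API_TOKEN", "<your-api-token>")

# Assumed endpoint path; confirm it matches your deployment's URL.
url = (
    f"https://serving.app.predibase.com/{tenant_id}"
    f"/deployments/v2/llms/{deployment}/generate"
)
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_token}",
}
payload = {
    "inputs": "What is your name?",
    "parameters": {"max_new_tokens": 20, "temperature": 0.1},
}
body = json.dumps(payload)
# To send it: requests.post(url, headers=headers, data=body)
```

This mirrors the curl command: the `-d` payload becomes `body` and the two `-H` headers become the `headers` dict.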

Request Parameters:

inputs: string

The prompt to be passed to the specified LLM.

parameters: Optional, json

A JSON object of generation parameters; the supported fields are listed below.

adapter_id: Optional, string

The ID of the adapter to use for the operation. If not provided, no adapter will be used. This will be of the following format: model_repo_name/model_version_number.

Example: My model repo/3.

adapter_source: Optional, string

The source of the adapter to use for the operation. Options are hub (the Hugging Face Hub) or s3.

do_sample: Optional, boolean

Whether or not to use sampling; use greedy decoding otherwise. Defaults to false.
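As a toy illustration of the distinction (not the server's implementation): greedy decoding always picks the most probable token, while sampling draws a token weighted by its probability.

```python
import random

def pick_token(probs, do_sample, rng):
    # do_sample=False: greedy decoding, always return the argmax token.
    # do_sample=True: draw a token index weighted by its probability.
    if not do_sample:
        return max(range(len(probs)), key=lambda i: probs[i])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.1, 0.7, 0.2]
greedy = pick_token(probs, do_sample=False, rng=random.Random(0))
sampled = pick_token(probs, do_sample=True, rng=random.Random(0))
```

With do_sample=False the result is deterministic; with do_sample=True it varies with the seed, which is why the seed parameter below matters for reproducibility.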

max_new_tokens: Optional, int

The maximum number of new tokens to generate. If not provided, will default to 20.

best_of: Optional, integer

Generate best_of sequences and return the one with the highest token logprobs. Defaults to 1.

repetition_penalty: Optional, float64

The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details. Defaults to no penalty.

return_full_text: Optional, boolean

Whether to prepend the prompt to the generated text. Defaults to false.

stop: Optional, array of strings

Stop generating tokens when a member of stop is generated.

seed: Optional, integer

The seed to use for the random number generator. If not provided, will default to a random seed.

temperature: Optional, float64

Temperature is used to control the randomness of predictions. Higher values increase diversity and lower values increase determinism. Setting a temperature of 0 is useful for testing and debugging.
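The effect of temperature can be sketched with a toy softmax. This is an illustration only, not the deployment's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax: T < 1 sharpens the
    # distribution toward the argmax, T > 1 flattens it toward uniform.
    # (A temperature of 0 is typically treated as greedy decoding,
    # not a literal division by zero.)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.1)  # near-greedy
hot = softmax_with_temperature(logits, 2.0)   # more diverse
```

At temperature 0.1 nearly all probability mass sits on the top token; at 2.0 the mass is spread much more evenly.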

top_k: Optional, integer

Top-k is a sampling method where the k highest-probability vocabulary tokens are kept and the probability mass is redistributed among them.

top_p: Optional, float64

Top-p (aka nucleus sampling) is an alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass. For example, 0.2 corresponds to only the tokens comprising the top 20% probability mass being considered.
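A toy sketch of how top_k and top_p interact (illustrative only, not the deployment's implementation): first keep the k most probable tokens, then keep the smallest prefix of those whose cumulative probability reaches top_p, and renormalize.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    # Rank tokens by probability, keep the top_k most probable,
    # then keep the smallest prefix whose cumulative mass >= top_p.
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept = ranked[:top_k]
    cumulative, nucleus = 0.0, []
    for idx, p in kept:
        nucleus.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1.
    total = sum(p for _, p in nucleus)
    return {idx: p / total for idx, p in nucleus}

probs = [0.5, 0.2, 0.15, 0.1, 0.05]
filtered = top_k_top_p_filter(probs, top_k=4, top_p=0.6)
```

Here top_k=4 keeps the first four tokens, and top_p=0.6 then narrows them to the two tokens covering 70% of the mass, which are renormalized before sampling.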

truncate: Optional, integer

The maximum number of input tokens; longer prompts are truncated to this length. If not provided, defaults to the deployment's default truncation setting.

typical_p: Optional, float64

If set to a float < 1, the smallest set of the most locally typical tokens with probabilities that add up to typical_p or higher are kept for generation. See Typical Decoding for Natural Language Generation for more information.

watermark: Optional, boolean

Whether to add a watermark to the generated text.

decoder_input_details: Optional, boolean

Return the decoder input token logprobs and IDs.
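Putting several of these parameters together, a hypothetical full request body might look like the following; the parameter values are illustrative, and the adapter_id follows the model_repo_name/model_version_number format described above.

```python
import json

# Hypothetical request body exercising several optional parameters.
request_body = {
    "inputs": "What is your name?",
    "parameters": {
        "adapter_id": "My model repo/3",  # repo name / version number
        "do_sample": True,
        "max_new_tokens": 64,
        "repetition_penalty": 1.1,
        "return_full_text": False,
        "stop": ["\n\n"],
        "seed": 42,
        "temperature": 0.7,
        "top_k": 50,
        "top_p": 0.9,
    },
}
encoded = json.dumps(request_body)
```

The resulting string can be passed as the -d payload of the curl command shown earlier.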