REST API

This guide shows you how to run inference via the REST API.

  • You must provide an API token in the Authorization header of the request. You can find this under Settings > My Profile > Generate API Token.
  • You will also need your Predibase Tenant ID. You can find this under Settings > My Profile > Overview > Tenant ID.

Example

Once you have your Predibase API token and Tenant ID, set them as environment variables in your terminal.

export PREDIBASE_API_TOKEN="<YOUR TOKEN HERE>"
export PREDIBASE_TENANT_ID="<YOUR TENANT ID>"

For PREDIBASE_DEPLOYMENT, use the deployment that serves the same base model your adapter was fine-tuned from:

  • For shared LLMs, use the model name, e.g. "mistral-7b-instruct".
  • For private serverless deployments, use the deployment name you chose when deploying, e.g. "my-dedicated-mistral-7b".

You will also need the model repo name and version number for the adapter_id.

export PREDIBASE_DEPLOYMENT="<SERVERLESS MODEL NAME>"

curl -d '{"inputs": "What is your name?", "parameters": {"api_token": "<YOUR TOKEN HERE>", "adapter_source": "pbase", "adapter_id": "<MODEL REPO NAME>/<MODEL VERSION NUMBER>", "max_new_tokens": 128}}' \
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"

Notes

  • When querying fine-tuned models, include the prompt template used for fine-tuning in the inputs.
  • You can also use the /generate_stream endpoint to stream tokens from the deployment; the parameters follow the same format as the LoRAX generate endpoints (see the sketch below).
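
The following is a minimal streaming sketch, assuming /generate_stream accepts the same request body and returns a server-sent-event stream as in LoRAX:

curl -N https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate_stream \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${PREDIBASE_API_TOKEN}" \
  -d '{"inputs": "What is your name?", "parameters": {"max_new_tokens": 128}}'
# -N turns off curl's output buffering so tokens are printed as they stream back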

Request Parameters

inputs: string

The prompt to be passed to the specified LLM.

parameters: Optional, json

adapter_id: Optional, string

The ID of the adapter to use for the operation. This will be of the following format: model_repo_name/model_version_number. Example: My model repo/3.

Default: no adapter used.

adapter_source: Optional, string

The source of the adapter to use for the operation. Options are pbase for fine-tuned adapters on Predibase, hub for the Hugging Face Hub, or s3.
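
For example, to load an adapter directly from the Hugging Face Hub instead of Predibase, the request body might look like this (the adapter repo name is a placeholder, not a real adapter):

{"inputs": "What is your name?", "parameters": {"adapter_source": "hub", "adapter_id": "<HF_ORG>/<HF_ADAPTER_REPO>", "max_new_tokens": 128}}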

best_of: Optional, integer

Generate best_of sequences and return the one with the highest token logprobs. Defaults to 1.

details: Optional, boolean

Return the token logprobs and IDs of the generated tokens.

decoder_input_details: Optional, boolean

Return the token logprobs and IDs of the input prompt.
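
For example, to request token-level information for both the prompt and the generation, the request body might look like the following sketch (the shape of the returned details object follows the LoRAX generate response and is not reproduced here):

{"inputs": "What is your name?", "parameters": {"details": true, "decoder_input_details": true, "max_new_tokens": 16}}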

do_sample: Optional, boolean

Whether or not to use sampling; use greedy decoding otherwise. Defaults to false.

max_new_tokens: Optional, int

The maximum number of new tokens to generate. If not provided, will default to 20.

repetition_penalty: Optional, float64

The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details. Defaults to no penalty.

return_full_text: Optional, boolean

Whether to prepend the prompt to the generated text. Defaults to false.

seed: Optional, integer

The seed to use for the random number generator. If not provided, will default to a random seed.

stop: Optional, array of strings

Stop generating tokens when one of the provided stop sequences is generated.

temperature: Optional, float64

Temperature is used to control the randomness of predictions. Higher values increase diversity and lower values increase determinism. Setting a temperature of 0 is useful for testing and debugging.

top_k: Optional, integer

Top-k is a sampling method where the k highest-probability vocabulary tokens are kept and the probability mass is redistributed among them.

top_p: Optional, float64

Top-p (aka nucleus sampling) is an alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass. For example, 0.2 corresponds to only the tokens comprising the top 20% probability mass being considered.

truncate: Optional, integer

The number of tokens to truncate the output to. If not provided, defaults to the user's default truncate setting.

typical_p: Optional, float64

If set to a float < 1, the smallest set of the most locally typical tokens with probabilities that add up to typical_p or higher are kept for generation. See Typical Decoding for Natural Language Generation for more information.

watermark: Optional, boolean

Whether to add a watermark to the generated text. Defaults to false.

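Putting several of these sampling parameters together, a request might look like the following sketch (the values are illustrative, not tuned recommendations):

curl https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${PREDIBASE_API_TOKEN}" \
  -d '{"inputs": "What is your name?", "parameters": {"do_sample": true, "temperature": 0.7, "top_k": 50, "top_p": 0.9, "max_new_tokens": 64, "stop": ["\n\n"]}}'
# illustrative sampling settings; adjust them for your use case
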
Response Headers

Info: These headers should be considered a beta feature and are subject to change in the future.

  • x-total-tokens: The number of tokens in both the input prompt and the output.
  • x-prompt-tokens: The number of tokens in the prompt.
  • x-generated-tokens: The number of generated tokens.
  • x-total-time: The total time the request took in the inference server, in milliseconds.
  • x-time-per-token: The average time it took to generate each output token, in milliseconds.
  • x-queue-time: The time the request was in the internal inference server queue, in milliseconds.
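
To inspect these headers, you can have curl include the response headers in its output, for example:

curl -si https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${PREDIBASE_API_TOKEN}" \
  -d '{"inputs": "What is your name?", "parameters": {"max_new_tokens": 16}}' \
  | grep -i '^x-'   # show only the x-* response headers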