REST API
This guide shows you how to run inference via the REST API.
- You must provide an Authorization API Token as a header in the request. You can find this under Settings > My Profile > Generate API Token.
- You will also need your Predibase Tenant ID. You can find this under Settings > My Profile > Overview > Tenant ID.
Example
Once you have your Predibase API token and Tenant ID, set them as environment variables in your terminal.
export PREDIBASE_API_TOKEN="<YOUR TOKEN HERE>"
export PREDIBASE_TENANT_ID="<YOUR TENANT ID>"
Prompt Your Fine-tuned Adapter
For PREDIBASE_DEPLOYMENT, the deployment must serve the same base model that your adapter was fine-tuned from:
- For shared LLMs, use the model name, e.g. "mistral-7b-instruct".
- For private serverless deployments, use the deployment name you used to deploy, e.g. "my-dedicated-mistral-7b".
You will also need the model repo name and version number for the adapter_id.
export PREDIBASE_DEPLOYMENT="<SERVERLESS MODEL NAME>"
curl -d '{"inputs": "What is your name?", "parameters": {"api_token": "<YOUR TOKEN HERE>", "adapter_source": "pbase", "adapter_id": "<MODEL REPO NAME>/<MODEL VERSION NUMBER>", "max_new_tokens": 128}}' \
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"
Prompt Base Model
For PREDIBASE_DEPLOYMENT:
- For shared LLMs, use the model name, e.g. "mistral-7b-instruct".
- For private serverless deployments, use the deployment name you used to deploy, e.g. "my-dedicated-mistral-7b".
export PREDIBASE_DEPLOYMENT="<MODEL NAME>"
curl -d '{"inputs": "What is your name?", "parameters": {"max_new_tokens": 20, "temperature": 0.1}}' \
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"
Notes
- When querying fine-tuned models, include the prompt template used for fine-tuning in the inputs (see the example after these notes).
- You can also use the /generate_stream endpoint to have tokens streamed from the deployment (see the streaming sketch after these notes). The parameters follow the same format as the LoRAX generate endpoints.
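When including the prompt template, wrap the raw prompt in the same template text that was used during fine-tuning. Below is a minimal sketch, assuming purely for illustration that the adapter was fine-tuned with an [INST] ... [/INST] instruction format; substitute the exact template your adapter was trained with.

# Illustrative only: the [INST] wrapper below is an assumed template; use the exact
# template your adapter was fine-tuned with.
curl -d '{"inputs": "[INST] What is your name? [/INST]", "parameters": {"api_token": "<YOUR TOKEN HERE>", "adapter_source": "pbase", "adapter_id": "<MODEL REPO NAME>/<MODEL VERSION NUMBER>", "max_new_tokens": 128}}' \
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"

For streaming, the sketch below reuses the base-model request body and assumes the URL is identical to the /generate example with /generate_stream substituted at the end; curl's -N flag simply disables output buffering so tokens print as they arrive.

# Streaming sketch: same request body, sent to /generate_stream instead of /generate.
curl -N -d '{"inputs": "What is your name?", "parameters": {"max_new_tokens": 20, "temperature": 0.1}}' \
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate_stream \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"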
Request Parameters:
inputs: string
The prompt to be passed to the specified LLM.
parameters: Optional, json
adapter_id: Optional, string
The ID of the adapter to use for the operation. This will be of the following format: model_repo_name/model_version_number. Example: My model repo/3. Default: no adapter used.
adapter_source: Optional, string
The source of the adapter to use for the operation. Options are pbase for fine-tuned adapters on Predibase, hub for the Hugging Face Hub, or s3.
best_of: Optional, integer
Generate best_of sequences and return the one with the highest token logprobs. Defaults to 1.
details: Optional, boolean
Return the token logprobs and ids of the generated tokens.
decoder_input_details: Optional, boolean
Return the token logprobs and ids of the input prompt.
do_sample: Optional, boolean
Whether or not to use sampling; use greedy decoding otherwise. Defaults to false.
max_new_tokens: Optional, int
The maximum number of new tokens to generate. If not provided, will default to 20.
repetition_penalty: Optional, float64
The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details. Defaults to no penalty.
return_full_text: Optional, boolean
Whether to prepend the prompt to the generated text. Default false.
seed: Optional, integer
The seed to use for the random number generator. If not provided, will default to a random seed.
stop: Optional, array of strings
Stop generating tokens if a member of stop_sequences is generated.
temperature: Optional, float64
Temperature is used to control the randomness of predictions. Higher values increase diversity and lower values increase determinism. Setting a temperature of 0 is useful for testing and debugging.
top_k: Optional, integer
Top-k is a sampling method where the k highest-probability vocabulary tokens are kept and the probability mass is redistributed among them.
top_p: Optional, float64
Top-p (aka nucleus sampling) is an alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass. For example, 0.2 corresponds to only the tokens comprising the top 20% probability mass being considered.
truncate: Optional, integer
The number of tokens to truncate the output to. If not provided, will default to the user's default truncate setting.
typical_p: Optional, float64
If set to a float < 1, the smallest set of the most locally typical tokens with probabilities that add up to typical_p or higher is kept for generation. See Typical Decoding for Natural Language Generation for more information.
watermark: Optional, boolean
Watermarking with A Watermark for Large Language Models
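To illustrate how these parameters combine, here is a sketch of a request that sets several of them at once; the values are arbitrary examples, not recommendations.

# Arbitrary example values; adjust to your use case.
curl -d '{"inputs": "What is your name?", "parameters": {"max_new_tokens": 64, "temperature": 0.7, "top_p": 0.9, "top_k": 40, "repetition_penalty": 1.1, "stop": ["\n\n"], "seed": 42, "details": true}}' \
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"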
Response Headers
These headers should be considered a beta feature, and are subject to change in the future.
- x-total-tokens: The number of tokens in both the input prompt and the output.
- x-prompt-tokens: The number of tokens in the prompt.
- x-generated-tokens: The number of generated tokens.
- x-total-time: The total time the request took in the inference server, in milliseconds.
- x-time-per-token: The average time it took to generate each output token, in milliseconds.
- x-queue-time: The time the request was in the internal inference server queue, in milliseconds.
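To inspect these headers, you can ask curl to print the response headers, for example with -s -D - (dump received headers to stdout) and -o /dev/null (discard the body); a sketch reusing the base-model request above:

# Prints only the response headers, including x-total-tokens and related fields.
curl -s -D - -o /dev/null \
-d '{"inputs": "What is your name?", "parameters": {"max_new_tokens": 20}}' \
-H "Content-Type: application/json" \
-X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/$PREDIBASE_DEPLOYMENT/generate \
-H "Authorization: Bearer ${PREDIBASE_API_TOKEN}"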