Shared LLM Deployments

Users of Predibase Cloud can directly query one of several shared LLM deployments managed by Predibase. These deployments are multi-tenant, meaning that latency and throughput may be affected by how many concurrent users are interacting with the Predibase Cloud platform at a time.

To ensure fair allocation of these resources, we apply daily token limits to each tenant.

Available Shared LLMs

The exact list of shared deployments is subject to change over time, as they are intended primarily for demonstration purposes to explore the relative tradeoffs between different LLMs.

The current list of shared deployments is:

  • llama-2-7b, a state-of-the-art open-source model released by Meta, with a max token length of 4096.
  • llama-2-13b-chat, the instruction-tuned chat variant of the same Meta model family, with a max token length of 4096.
  • vicuna-13b, an open-source instruction-tuned model trained by fine-tuning LLaMA on 70k user-shared conversations collected from ShareGPT. Vicuna itself is released under the Apache License 2.0, but it is fine-tuned from LLaMA, which carries a non-commercial bespoke license. It has a max token length of 2048.

Comparison of Shared LLMs

LLM               Parameters  Architecture  Type                                License         Max Tokens
llama-2-7b        7 billion   LLaMA2        Auto-Regressive                     Commercial      4096
llama-2-13b-chat  13 billion  LLaMA2        Auto-Regressive, Instruction-Tuned  Commercial      4096
vicuna-13b        13 billion  LLaMA         Auto-Regressive, Instruction-Tuned  Non-commercial  2048

Daily Token Limits

Every tenant in Predibase Cloud is allotted a default of 20,000 tokens per day that can be used across any of the shared deployments hosted by Predibase. Once the token limit is reached, subsequent queries will be rejected until the next calendar day, at which point the limit resets to its original amount.

A token is the fundamental unit of text generated by a language model; it can be thought of as a piece of a word. On average, there are approximately 1.3 tokens per word in English text.
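As a rough back-of-the-envelope aid, the 1.3 tokens-per-word figure can be used to translate the daily limit into an approximate word budget. The Python sketch below is purely illustrative; the helper names and the fixed 1.3 ratio are assumptions, and actual tokenization varies by model and text.

# Rough estimate only: real tokenizers vary by model and text.
TOKENS_PER_WORD = 1.3  # approximate average for English, per the note above

def estimated_tokens(num_words: int) -> int:
    """Approximate token count for a given number of English words."""
    return round(num_words * TOKENS_PER_WORD)

def estimated_words(token_budget: int) -> int:
    """Approximate number of words that fit within a token budget."""
    return int(token_budget / TOKENS_PER_WORD)

# The default 20,000-token daily limit is roughly 15,000 words of output.
print(estimated_words(20_000))  # ~15384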

Tokens are only consumed from the Daily Token Limit when text is generated, so using a larger input context will not consume more tokens against the daily limit. The number of tokens generated is controlled via the max_new_tokens parameter of the prompting interface. The default max_new_tokens is 128, but it can be overridden in each PROMPT command. For example:

PROMPT 'What is the capital of Italy?'
OPTIONS (max_new_tokens=32)

While max_new_tokens caps the maximum number of tokens consumed per query, it does not set a minimum. In many cases, the LLM will stop generating text before reaching the max_new_tokens limit. When this happens, only the actual number of generated tokens is applied against the Daily Token Limit.
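To make that accounting concrete, the sketch below models how a single query's charge against the daily limit depends on the actual output length rather than on max_new_tokens. This is a hypothetical illustration, not part of the Predibase API; the function and variable names are assumptions.

def tokens_charged(generated_tokens: int, max_new_tokens: int) -> int:
    """Tokens counted against the daily limit for one query.

    Generation stops at max_new_tokens at the latest, but if the model
    stops earlier, only the tokens it actually produced are charged.
    """
    return min(generated_tokens, max_new_tokens)

# Example: max_new_tokens=128, but the model stops after 23 tokens.
print(tokens_charged(23, 128))  # 23 tokens charged, not 128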

Pre-flight Token Limit Check

Before executing each query, a pre-flight check is run to determine whether the query could result in exceeding the daily token limit. The check uses the following calculation:

expected_token_usage = max_new_tokens * num_templates * num_models * num_rows

If expected_token_usage plus the current daily token usage exceeds the daily token limit, then the query will be rejected before any requests are sent to the LLM deployment. You can work around this restriction by decreasing max_new_tokens or decreasing the cardinality of the request (number of rows, templates, or models prompted over).
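The same pre-flight calculation can be expressed in a few lines of code. The sketch below is an illustration of the logic described above, not Predibase's actual implementation; all names are hypothetical, and the default 20,000-token limit is taken from the Daily Token Limits section.

DAILY_TOKEN_LIMIT = 20_000  # default per-tenant daily limit

def passes_preflight(max_new_tokens: int, num_templates: int, num_models: int,
                     num_rows: int, current_daily_usage: int) -> bool:
    """Return True if the query may run without exceeding the daily token limit."""
    expected_token_usage = max_new_tokens * num_templates * num_models * num_rows
    return current_daily_usage + expected_token_usage <= DAILY_TOKEN_LIMIT

# Example: 2 templates x 1 model x 50 rows at max_new_tokens=128 could use
# up to 12,800 tokens; with 10,000 tokens already used today, the query
# would be rejected before any request reaches the LLM deployment.
print(passes_preflight(128, 2, 1, 50, 10_000))  # False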