Shared LLM Deployments
Users of Predibase Cloud can begin querying one of several shared LLM deployments managed directly by Predibase. These deployments are multi-tenant, meaning that latency and throughput may be affected by how many users are interacting with the Predibase Cloud platform at any given time.
To ensure fair allocation of these resources, we apply a daily token limit to each tenant.
Available Shared LLMs
The exact list of shared deployments is subject to change over time, as they are intended primarily for demonstration purposes to explore the relative tradeoffs between different LLMs.
The current list of shared deployments is:
- llama-2-7b, a state-of-the-art open-source model released by Meta, with a max token length of 4096.
- llama-2-13b-chat, a state-of-the-art open-source model released by Meta, instruction-tuned for chat use cases, with a max token length of 4096.
- vicuna-13b, an open-source instruction-tuned model created by fine-tuning LLaMA on 70k user-shared conversations collected from ShareGPT. Vicuna itself is released under the Apache License 2.0, but it is fine-tuned from LLaMA, which has a bespoke non-commercial license. It has a max token length of 2048.
Comparison of Shared LLMs
LLM | Parameters | Architecture | Type | License | Max Tokens |
---|---|---|---|---|---|
llama-2-7b | 7 billion | LLaMA2 | Auto-Regressive | Commercial | 4096 |
llama-2-13b-chat | 13 billion | LLaMA2 | Auto-Regressive, Instruction-Tuned | Commercial | 4096 |
vicuna-13b | 13 billion | LLaMA | Auto-Regressive, Instruction-Tuned | Non-commercial | 2048 |
Daily Token Limits
Every tenant in Predibase Cloud is allotted a default of 20,000 tokens per day that can be used across any of the shared deployments hosted by Predibase. Once the token limit is reached, subsequent queries will be rejected until the next calendar day, at which point the limit resets to its original amount.
A token in language model terms is the fundamental unit of text generated by the model; it can be thought of as a piece of a word. On average, there are approximately 1.3 tokens per word for English text.
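As a rough, back-of-the-envelope illustration (the 1.3 tokens-per-word figure is only an average, so real usage will vary with the text):

```python
# Approximate how many English words the default daily limit corresponds to.
# The numbers come from this page; the conversion is only a rough estimate.
DAILY_TOKEN_LIMIT = 20_000
TOKENS_PER_WORD = 1.3  # approximate average for English text

approx_words_per_day = DAILY_TOKEN_LIMIT / TOKENS_PER_WORD
print(f"~{approx_words_per_day:,.0f} generated words per day")  # ~15,385
```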
Tokens are only consumed from the Daily Token Limit when text is generated, so using a larger input context will not consume more tokens against the daily limit. The number of tokens generated is controlled via the `max_new_tokens` parameter to the prompting interface. The default `max_new_tokens` is 128, but it can be overridden in each prompt command. For example:
PROMPT 'What is the capital of Italy?'
OPTIONS (max_new_tokens=32)
While `max_new_tokens` enforces the maximum number of tokens consumed per query, it does not set a minimum. In many cases, the LLM will stop generating text before reaching the `max_new_tokens` limit. When this happens, only the actual number of generated tokens is counted against the Daily Token Limit, leaving more of the budget available for subsequent queries. For example, if a query with `max_new_tokens=128` stops after generating 45 tokens, only 45 tokens are deducted from the daily limit.
Pre-flight Token Limit Check
Before executing each query, a pre-flight check is run to determine whether the query could result in exceeding the daily token limit. This check uses the following calculation:
expected_token_usage = max_new_tokens * num_templates * num_models * num_rows
If `expected_token_usage` plus the current daily token usage exceeds the daily token limit, the query will be rejected before any requests are sent to the LLM deployment. You can work around this restriction by decreasing `max_new_tokens` or by decreasing the cardinality of the request (the number of rows, templates, or models prompted over).
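For illustration only, here is a minimal sketch of that pre-flight calculation in Python. The function and variable names are hypothetical and are not part of the Predibase API; the sketch simply applies the formula above against the default 20,000-token daily limit:

```python
def preflight_check(max_new_tokens: int, num_templates: int, num_models: int,
                    num_rows: int, tokens_used_today: int,
                    daily_token_limit: int = 20_000) -> bool:
    """Return True if the query may proceed, False if it would be rejected."""
    # Worst case: every (template, model, row) combination generates the full
    # max_new_tokens, so the check assumes the maximum possible usage.
    expected_token_usage = max_new_tokens * num_templates * num_models * num_rows
    return tokens_used_today + expected_token_usage <= daily_token_limit

# Example: prompting 1 template over 1 model and 200 rows with max_new_tokens=128
# would be rejected (128 * 1 * 1 * 200 = 25,600 > 20,000), but lowering
# max_new_tokens to 64 (or the row count to 150) brings it under the limit.
print(preflight_check(128, 1, 1, 200, tokens_used_today=0))  # False
print(preflight_check(64, 1, 1, 200, tokens_used_today=0))   # True
```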