Overview
Predibase provides the interfaces and infrastructure to fine-tune and serve open-source Large Language Models (LLMs). In this section, we will cover how to easily get started with inference.
Definitions
- Model: A pretrained base model that you can deploy and query (e.g. llama-2-7b, mistral-7b)
- Adapter: A set of (LoRA) weights produced by the fine-tuning process to specialize a base model
Inference Options
There are two main ways to run inference on Predibase:
- Private Serverless Deployments: Predibase can host nearly any open-source LLM on your behalf using dedicated hardware ranging from A10Gs to multiple H100s.
- Shared Endpoints: Predibase hosts the most popular base models, which can be queried directly or fine-tuned (via Adapters). These endpoints are intended for experimentation and fast iteration and are subject to rate limits.
See our pricing page for more details.
LoRAX (LoRA eXchange): Serving fine-tuned models at scale
LoRAX is an open-source framework released by the team at Predibase that allows users to serve up to hundreds of fine-tuned models (i.e. adapters) on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency. You can choose to use LoRAX with our shared endpoints or private serverless deployments.
Private Serverless Features
Autoscaling
Predibase offers seamless autoscaling: a deployment scales down to 0 replicas when it is not in use and automatically scales up to multiple replicas to meet demand. Min and max replicas can be configured in the UI or SDK.
Monitoring
After creating your deployment, you can view live-updating charts in the UI by going to your private deployment's page > Health tab. The available metrics and graphs include:
- Requests per second
- Throughput (generated tokens per second)
- LoRAX inference time
- Queue duration
- Replicas
- GPU utilization
Deployment Statuses
Private Serverless Deployments that are created in Predibase can be in any of the following states:
- Pending — Deployment record has been created, but deployment has not been fully created yet
- Initializing — The first replica is in the process of being spun up
- Ready — At least 1 replica is up and live
- Standby — 0 replicas are up but deployment is ready to scale up on request
- Stopped — 0 replicas are up and deployment will not scale up until moved to Standby
- Errored — 0 replicas are up and last Initializing state led to an error
- Updating — at least 1 replica is up and either:
  - it needs to be re-initialized following a config change, OR
  - the LLM is in the process of being re-initialized
- Deleted — The deployment has been deleted
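Client code that polls a deployment often needs to branch on these states. The following is a minimal sketch that encodes the list above as an enum; the class and helper names are hypothetical and are not part of the Predibase SDK, and the grouping simply restates which states the list describes as having replicas up.

```python
from enum import Enum


class DeploymentStatus(Enum):
    """Hypothetical enum mirroring the documented deployment states."""
    PENDING = "Pending"
    INITIALIZING = "Initializing"
    READY = "Ready"
    STANDBY = "Standby"
    STOPPED = "Stopped"
    ERRORED = "Errored"
    UPDATING = "Updating"
    DELETED = "Deleted"


# States the list describes as having at least 1 replica up.
_LIVE = {DeploymentStatus.READY, DeploymentStatus.UPDATING}


def has_live_replica(status: DeploymentStatus) -> bool:
    """True if the state is documented as having at least one replica up."""
    return status in _LIVE


def scales_up_on_request(status: DeploymentStatus) -> bool:
    """True only for Standby: 0 replicas, but ready to scale up on request."""
    return status is DeploymentStatus.STANDBY
```

A caller might send requests right away when `has_live_replica` is true, expect a cold start when `scales_up_on_request` is true, and otherwise wait or surface an error.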
OpenAI-compatible API
For users migrating from OpenAI, Predibase supports OpenAI-compatible endpoints, so existing code written against the OpenAI SDK can be pointed at Predibase as a drop-in replacement. Learn more here.
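To illustrate what "OpenAI-compatible" means at the HTTP level, here is a sketch that builds (but does not send) a chat-completions request in the OpenAI wire format using only the standard library. The base URL is a placeholder, not a real Predibase address, and the value expected in the `model` field is an assumption; check the linked documentation for the actual endpoint and model naming.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_token: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style POST <base_url>/chat/completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    base_url="https://example.invalid/v1",  # placeholder, NOT a real Predibase URL
    api_token="<PREDIBASE API TOKEN>",
    model="mistral-7b-instruct",  # assumed value for illustration
    prompt="What are some popular tourist spots in San Francisco?",
)
# Sending it would look like: urllib.request.urlopen(request, timeout=30)
```

In practice the OpenAI SDK builds this request for you once its base URL and API key are pointed at the compatible endpoint.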
Billing
For private serverless deployments, we offer usage-based pricing billed by the second. You'll only be charged for replicas that are scaled up. For example:
- If you configure your deployment to scale to 0 when idle (min_replicas=0), you won't be billed while it's scaled down.
- If you configure your deployment to scale up to 2 replicas under increased load (max_replicas=2), you will be billed at 2x the single-replica price while both replicas are running.
See our latest pricing.
Shared endpoints are provided for testing and experimentation and are free to use with rate limits.
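As a worked example of per-second, per-replica billing for private serverless deployments, the sketch below prices one hour of usage. The price is a made-up figure chosen for easy arithmetic, and the function assumes every replica is billed at the same rate; see the pricing page for real numbers.

```python
from decimal import Decimal


def billed_cost(intervals, price_per_replica_second):
    """Price usage given as (duration_seconds, replicas) intervals.

    Only scaled-up replicas are charged, so an interval with 0 replicas costs nothing.
    """
    replica_seconds = sum(duration * replicas for duration, replicas in intervals)
    return replica_seconds * price_per_replica_second


# One hour of wall-clock time:
# 10 min scaled to zero, 45 min on 1 replica, 5 min on 2 replicas.
usage = [(600, 0), (2700, 1), (300, 2)]
HYPOTHETICAL_PRICE = Decimal("0.001")  # made-up dollars per replica-second

cost = billed_cost(usage, HYPOTHETICAL_PRICE)
```

Here the hour contains 0 + 2700 + 600 = 3300 billable replica-seconds, so the idle period adds nothing and the 5 minutes on 2 replicas count double.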
Timeouts and Retries
For production applications, we recommend setting up client-side timeouts and retries, which you can configure based on your requirements. The Predibase and LoRAX SDKs have automatic retries built in, but you can override those settings to suit your needs.
We also apply automatic network-level retries for network errors.
Here's an example of how to configure client-side retries using the Predibase SDK.
import requests
from requests.adapters import HTTPAdapter, Retry

from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE API TOKEN>")
prompting_client = pb.deployments.client("mistral-7b-instruct")

# Configure HTTP Session object
session = requests.Session()
retries = Retry(
    total=5,  # 5 retries total
    backoff_factor=1,  # Base factor for exponential back-off between retries
    status_forcelist=[  # Retry on server errors
        500,
        501,
        502,
        503,
        504,
    ],
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# Configure the client to use the session
prompting_client.session = session

response = prompting_client.generate("[INST] What are some popular tourist spots in San Francisco? [/INST]")
print(response.generated_text)
For server-side timeouts, we can work with you to set them up if you require them.
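On the client side, retries can also be bounded by an overall time budget so that a caller never waits longer than it can afford. The following is a generic, SDK-independent sketch of that idea; the function name, its parameters, and the choice of which exceptions to retry are assumptions for illustration, and the per-request timeout itself is left to whatever `fn` does internally.

```python
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0, deadline_seconds=30.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, clock=time.monotonic):
    """Call fn() with exponential back-off, giving up once the time budget is spent."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on as error:
            last_error = error
        delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
        out_of_attempts = attempt == max_attempts - 1
        out_of_time = (clock() - start) + delay > deadline_seconds
        if out_of_attempts or out_of_time:
            break
        sleep(delay)
    raise last_error
```

A call such as `call_with_retries(lambda: prompting_client.generate(prompt))` would wrap a single generation request; `sleep` and `clock` are injectable so the behavior can be tested without real waiting.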