Predibase supports a range of embedding models for generating text embeddings and powering similarity search. This guide helps you:
- Find available models for your use case
- Understand model capabilities and requirements
- Choose between different model options
Quick Start
First, install the Predibase Python SDK:
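pip install -U predibase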
Creating Private Deployments
For production use cases, create your own private embedding model deployment:
Base Model Only
from predibase import Predibase, DeploymentConfig
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
# Create a production deployment with UAE-Large
deployment = pb.deployments.create(
    name="my-embedding-model",
    config=DeploymentConfig(
        base_model="WhereIsAI/UAE-Large-V1",  # High-performance embedding model
        min_replicas=1,   # Always keep one replica running
        max_replicas=2,   # Scale up to 2 replicas under load
        accelerator="a10_24gb_100",  # Uses an A10G GPU
        speculator="disabled",
        disable_adapters=True,  # Set to True if you don't plan to serve adapters
        max_total_tokens=512  # This model requires max_total_tokens=512 to start on an A10G GPU
    )
)
With Adapters
You can also create deployments with adapters; we recommend preloading them to avoid the overhead of dynamically swapping large portions of the model during inference:
# Create the deployment with preloaded adapters
deployment = pb.deployments.create(
    name="my-embedding-model",
    config=DeploymentConfig(
        base_model="predibase/stella_en_400M_v5_vllm",
        min_replicas=0,
        max_replicas=1,
        accelerator="a10_24gb_100",
        speculator="disabled",
        preloaded_adapters=["adapter_id/1", "adapter_id/2", "adapter_id/3", "adapter_id/4", "adapter_id/5"],  # Preload multiple adapters
        task="embed",  # Specify the task
        truncate_dim=256  # Specify the output embedding dimension
    )
)
Running Inference
Python SDK
Base Model
# Generate embeddings for a single chunk of data with the base model
text = "Generate embeddings using your dedicated deployment."
response = pb.embeddings.create(model="my-embedding-model", input=text)
print(f"Generated {len(response.data[0].embedding)}-dimensional embedding")
# Process a batch of documents with the base model
documents = [
    "First document for embedding",
    "Second document with different content",
    "Third document to process in batch",
]
batch_embeddings = [
    pb.embeddings.create(model="my-embedding-model", input=doc).data[0].embedding
    for doc in documents
]
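Since similarity search is one of the main uses called out at the top of this guide, here is a minimal sketch of ranking the batch above against a query by cosine similarity. It assumes numpy as an extra dependency; the query string is illustrative, and documents and batch_embeddings come from the snippet above.
import numpy as np

# Embed the query with the same deployment so the vectors are comparable
query = "Which document mentions batch processing?"
query_embedding = pb.embeddings.create(model="my-embedding-model", input=query).data[0].embedding

# Cosine similarity between the query and each document embedding
doc_matrix = np.array(batch_embeddings)
query_vec = np.array(query_embedding)
scores = doc_matrix @ query_vec / (np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec))

# Print documents from most to least similar
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {documents[idx]}")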
Predibase Adapters
# Generate inference via the OpenAI embeddings API
import openai
base_url = "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1"
api_token = "<PREDIBASE_API_TOKEN>"
# initialize the OpenAI client
client = openai.OpenAI(base_url=base_url, api_key=api_token)
# List available adapters (optional)
models = client.models.list()
print("AVAILABLE MODELS:")
for model in models.data:
    print(model.id)
# Generate embeddings with a Predibase-trained adapter
responses = client.embeddings.create(
    input=[
        "Generate embeddings using your dedicated deployment.\n",
    ],
    model="my-repo/1",  # <ADAPTER_REPO>/<ADAPTER_VERSION>
)
print(responses)
HF Adapters
import os
os.environ["HUGGINGFACE_HUB_TOKEN"] = "<YOUR HUGGINGFACE TOKEN>" # Required for private adapters
# Generate inference via the OpenAI embeddings API
import openai
base_url = "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1"
api_token = "<PREDIBASE_API_TOKEN>"
# initialize the OpenAI client
client = openai.OpenAI(base_url=base_url, api_key=api_token)
# List available adapters (optional)
models = client.models.list()
print("AVAILABLE MODELS:")
for model in models.data:
    print(model.id)
# Generate embeddings with a Hugging Face adapter
responses = client.embeddings.create(
    input=[
        "Generate embeddings using your dedicated deployment.\n",
    ],
    model="<org>/<adapter>",  # Hugging Face adapter path
)
print(responses)
REST API
Base Model
# An empty "model" string targets the base model
curl -X POST "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1/embeddings" \
  -H "Authorization: Bearer <PREDIBASE_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Generate embeddings using your dedicated deployment.\n"
    ],
    "model": ""
  }'
Predibase Adapters
# Set "model" to <ADAPTER_REPO>/<ADAPTER_VERSION>, e.g. my-repo/1
curl -X POST "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1/embeddings" \
  -H "Authorization: Bearer <PREDIBASE_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Generate embeddings using your dedicated deployment.\n"
    ],
    "model": "my-repo/1"
  }'
HF Adapters
# Configure HF API token for private adapters
export HUGGINGFACE_HUB_TOKEN=<YOUR HUGGINGFACE TOKEN> # Required for private adapters
# Set "model" to the Hugging Face adapter path
curl -X POST "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1/embeddings" \
  -H "Authorization: Bearer <PREDIBASE_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Generate embeddings using your dedicated deployment.\n"
    ],
    "model": "<org>/<adapter>"
  }'
Supported Models
The following embedding models are officially supported for deployment on Predibase:
| Model Name | Architecture | Output Dimensions | License | Always-On Shared Endpoint |
|---|---|---|---|---|
| WhereIsAI/UAE-Large-V1 | BERT | 1024 | MIT | ❌ |
| dunzhang/stella_en_1.5B_v5 | Qwen | 1024 | Apache 2.0 | ❌ |
| distilbert-base-uncased | DistilBERT | 768 | Apache 2.0 | ❌ |
We can add new models to our catalog on a case-by-case basis. If you have a specific model in mind, please reach out to us at support@predibase.com.
Model Details
BERT-based Models
Best for: High-quality embeddings with proven architecture
- WhereIsAI/UAE-Large-V1
  - Strong performance on similarity tasks
  - 1024-dimensional embeddings
  - MIT license
  - Efficient inference
- distilbert-base-uncased
  - Compressed BERT architecture
  - 768-dimensional embeddings
  - Apache 2.0 license
  - Fast inference speed
Qwen-based Models
Best for: State-of-the-art embedding quality
- dunzhang/stella_en_1.5B_v5
  - Large model with 1.5B parameters
  - 1024-dimensional embeddings
  - Apache 2.0 license
  - Advanced semantic understanding
Best Practices
- Input Size
  - Check model documentation for maximum input length
  - Consider truncating or chunking long inputs (see the sketch after this list)
  - Balance context against performance
- Batch Processing
  - Implement custom batching for large datasets (see the sketch after this list)
  - Monitor memory usage during batch processing
  - Consider async processing for large workloads
- Deployment Configuration
  - Use auto-scaling for cost optimization
  - Monitor performance metrics
  - Choose an appropriate GPU for your workload
- Model Selection
  - Weigh embedding dimensions against quality
  - Match model size to hardware capabilities
  - Evaluate licensing requirements
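To make the input-size and batching guidance above concrete, here is a minimal sketch that chunks long texts and embeds them in fixed-size batches, reusing the pb client and deployment from the Quick Start. chunk_text, embed_in_batches, max_words, and batch_size are hypothetical names and parameters, the whitespace splitter is a stand-in for a real tokenizer (check your model's documented maximum input length), and the list-style input assumes OpenAI-compatible batch support.
def chunk_text(text, max_words=300):
    """Naive whitespace chunker; swap in a tokenizer-based splitter to respect true token limits."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed_in_batches(texts, model="my-embedding-model", batch_size=16):
    """Embed texts in fixed-size batches to bound request size and client memory."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Assumes the endpoint accepts a list input, as in the OpenAI embeddings API;
        # if yours does not, fall back to one create() call per text
        response = pb.embeddings.create(model=model, input=batch)
        embeddings.extend(item.embedding for item in response.data)
    return embeddings

# Chunk every document, then embed all chunks batch by batch
chunks = [chunk for doc in documents for chunk in chunk_text(doc)]
chunk_embeddings = embed_in_batches(chunks)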