Predibase supports a range of embedding models for generating text embeddings and running similarity search. This guide helps you:

  • Find available models for your use case
  • Understand model capabilities and requirements
  • Choose between different model options

Quick Start

First, install the Predibase Python SDK:

pip install -U predibase

Creating Private Deployments

For production use cases, create your own private embedding model deployment:

Base Model Only

from predibase import Predibase, DeploymentConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Create a production deployment with UAE-Large
deployment = pb.deployments.create(
    name="my-embedding-model",
    config=DeploymentConfig(
        base_model="WhereIsAI/UAE-Large-V1",  # High-performance embedding model
        min_replicas=1,     # Always keep one replica running
        max_replicas=2,     # Scale up to 2 replicas under load
        accelerator="a10_24gb_100",  # A10G GPU (24 GB)
        speculator="disabled",
        disable_adapters=True,  # Set True if you won't serve adapters
        max_total_tokens=512    # This model must cap max tokens at 512 to start on an A10G GPU
    )
)
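
Deployment creation can take several minutes. Below is a minimal sketch of polling until the deployment is ready; the exact accessor and status values are assumptions here, so check the SDK reference for your version:

import time

# Poll the deployment until it reports ready (status field and values are assumed)
while True:
    deployment = pb.deployments.get("my-embedding-model")
    print(f"Status: {deployment.status}")
    if deployment.status == "ready":
        break
    time.sleep(30)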

With Adapters

You can also create deployments with adapters; we recommend preloading them to avoid the overhead of dynamically swapping large portions of the model during inference:

# Create the deployment with preloaded adapters
deployment = pb.deployments.create(
    name="my-embedding-model",
    config=DeploymentConfig(
        base_model="predibase/stella_en_400M_v5_vllm", 
        min_replicas=0,  
        max_replicas=1,    
        accelerator="a10_24gb_100",
        speculator="disabled",
        preloaded_adapters=["adapter_id/1","adapter_id/2","adapter_id/3","adapter_id/4","adapter_id/5"], # Preload multiple adapters
        task='embed', # Specify the task
        truncate_dim=256 # Specify the output dimension 
    )
)
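
With min_replicas=0, the deployment scales down to zero replicas when idle, so the first request after a scale-down incurs a cold start.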

Running Inference

Python SDK

Base Model

# Generate embeddings for a single chunk of data with the base model
text = "Generate embeddings using your dedicated deployment."
response = pb.embeddings.create(model="my-embedding-model", input=text)
print(f"Generated {len(response.data[0].embedding)}-dimensional embedding")

# Process a batch of documents with the base model
documents = [
    "First document for embedding",
    "Second document with different content",
    "Third document to process in batch"
]
batch_embeddings = [
    pb.embeddings.create(model="my-embedding-model", input=doc).data[0].embedding
    for doc in documents
]  # One request per document
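
Since these embeddings are typically used for similarity search, here is a minimal sketch of ranking the batch against a query with cosine similarity; it assumes numpy is installed and reuses documents and batch_embeddings from above:

import numpy as np

# Embed the query with the same deployment
query = np.array(
    pb.embeddings.create(model="my-embedding-model", input="search query").data[0].embedding
)
docs = np.array(batch_embeddings)

# Cosine similarity = dot product of L2-normalized vectors
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
scores = docs_n @ query_n

# Print documents from most to least similar
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:.3f}  {documents[i]}")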

Predibase Adapters

# Query the deployment via the OpenAI-compatible embeddings API
import openai

base_url = "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1"
api_token = "<PREDIBASE_API_TOKEN>"

# Initialize the OpenAI client
client = openai.OpenAI(base_url=base_url, api_key=api_token)

# List available adapters (optional)
models = client.models.list()
print("AVAILABLE MODELS:")
for model in models.data:
    print(model.id)

# Generate embeddings with a Predibase adapter (<ADAPTER_REPO>/<ADAPTER_VERSION>)
responses = client.embeddings.create(
    input=[
        "Generate embeddings using your dedicated deployment.\n",
    ],
    model="my-repo/1",
)

print(responses)
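
The response follows the standard OpenAI embeddings schema, so each vector is available at data[i].embedding:

embedding = responses.data[0].embedding
print(f"Adapter embedding has {len(embedding)} dimensions")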

HF Adapters

import os
os.environ["HUGGINGFACE_HUB_TOKEN"] = "<YOUR HUGGINGFACE TOKEN>" # Required for private adapters

# Query the deployment via the OpenAI-compatible embeddings API
import openai

base_url = "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1"
api_token = "<PREDIBASE_API_TOKEN>"

# Initialize the OpenAI client
client = openai.OpenAI(base_url=base_url, api_key=api_token)

# List available adapters (optional)
models = client.models.list()
print("AVAILABLE MODELS:")
for model in models.data:
    print(model.id)

# Generate embeddings with a Hugging Face adapter
responses = client.embeddings.create(
    input=[
        "Generate embeddings using your dedicated deployment.\n",
    ],
    model="<org>/<adapter>",  # Hugging Face adapter path
)

print(responses)

REST API

Base Model
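
Leave the model field empty to target the base model: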

curl -X POST "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1/embeddings" \
-H "Authorization: Bearer <PREDIBASE_API_TOKEN>" \
-H "Content-Type: application/json" \
-d '{
  "input": [
    "Generate embeddings using your dedicated deployment.\n"
  ],
  "model": "" 
}'

Predibase Adapters

Set the model field to the adapter reference in the form <ADAPTER_REPO>/<ADAPTER_VERSION>:

curl -X POST "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1/embeddings" \
-H "Authorization: Bearer <PREDIBASE_API_TOKEN>" \
-H "Content-Type: application/json" \
-d '{
  "input": [
    "Generate embeddings using your dedicated deployment.\n"
  ],
  "model": "my-repo/1"
}'

HF Adapters

Set the model field to the Hugging Face adapter path (<org>/<adapter>); a token is required for private adapters:

# Configure HF API token (required for private adapters)
export HUGGINGFACE_HUB_TOKEN=<YOUR HUGGINGFACE TOKEN>

curl -X POST "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1/embeddings" \
-H "Authorization: Bearer <PREDIBASE_API_TOKEN>" \
-H "Content-Type: application/json" \
-d '{
  "input": [
    "Generate embeddings using your dedicated deployment.\n"
  ],
  "model": "<org>/<adapter>"
}'

Supported Models

The following embedding models are officially supported for deployment on Predibase:

Model Name                  | Architecture | Output Dimensions | License    | Always-On Shared Endpoint
WhereIsAI/UAE-Large-V1      | BERT         | 1024              | MIT        |
dunzhang/stella_en_1.5B_v5  | Qwen         | 1024              | Apache 2.0 |
distilbert-base-uncased     | DistilBERT   | 768               | Apache 2.0 |

We can add new models to our catalog on a case-by-case basis. If you have a specific model in mind, please reach out to us at support@predibase.com.

Model Details

BERT-based Models

Best for: High-quality embeddings with proven architecture

  • WhereIsAI/UAE-Large-V1

    • Strong performance on similarity tasks
    • 1024-dimensional embeddings
    • MIT license
    • Efficient inference
  • distilbert-base-uncased

    • Compressed BERT architecture
    • 768-dimensional embeddings
    • Apache 2.0 license
    • Fast inference speed

Qwen-based Models

Best for: State-of-the-art embedding quality

  • dunzhang/stella_en_1.5B_v5
    • Large model with 1.5B parameters
    • 1024-dimensional embeddings
    • Apache 2.0 license
    • Advanced semantic understanding

Best Practices

  1. Input Size

    • Check model documentation for maximum input length
    • Consider truncating or chunking long inputs (see the sketch after this list)
    • Balance between context and performance
  2. Batch Processing

    • Implement custom batching for large datasets
    • Monitor memory usage during batch processing
    • Consider async processing for large workloads
  3. Deployment Configuration

    • Use auto-scaling for cost optimization
    • Monitor performance metrics
    • Choose appropriate GPU based on workload
  4. Model Selection

    • Consider embedding dimensions vs. quality
    • Match model size to hardware capabilities
    • Evaluate licensing requirements
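
For items 1 and 2 above, here is a minimal sketch of word-based chunking for inputs that exceed the model's context window. The 512-token cap follows the UAE-Large deployment example; word counts only approximate tokens, so leave headroom or use the model's tokenizer for exact limits:

# Split long text into overlapping word-based chunks
def chunk_text(text: str, max_words: int = 350, overlap: int = 50):
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

long_document = " ".join(["token"] * 2000)  # stand-in for a long input
chunks = chunk_text(long_document)

# Embed each chunk separately; consider async or batched requests for large corpora
chunk_embeddings = [
    pb.embeddings.create(model="my-embedding-model", input=chunk).data[0].embedding
    for chunk in chunks
]
print(f"Embedded {len(chunks)} chunks")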