Predibase supports a range of embedding models for generating text embeddings and powering similarity search. This guide helps you:
- Find available models for your use case
- Understand model capabilities and requirements
- Choose between different model options
Quick Start
First, install the Predibase Python SDK:
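pip install -U predibase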
Creating Private Deployments
For production use cases, create your own private embedding model deployment:
Base Model Only
from predibase import Predibase, DeploymentConfig
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")
# Create a production deployment with UAE-Large
deployment = pb.deployments.create(
    name="my-embedding-model",
    config=DeploymentConfig(
        base_model="WhereIsAI/UAE-Large-V1",  # High-performance embedding model
        min_replicas=1,   # Always keep one replica running
        max_replicas=2,   # Scale up to 2 replicas under load
        accelerator="a10_24gb_100",  # Uses an A10G GPU
        speculator="disabled",
        disable_adapters=True,  # Set to True if you don't plan to serve adapters
        max_total_tokens=512  # This model requires max_total_tokens=512 to start on an A10G GPU
    )
)
With Adapters
You can also create deployments with adapters; we recommend preloading them to avoid the overhead of dynamically swapping large portions of the model during inference:
# Create the deployment with preloaded adapters
deployment = pb.deployments.create(
    name="my-embedding-model",
    config=DeploymentConfig(
        base_model="predibase/stella_en_400M_v5_vllm",
        min_replicas=0,
        max_replicas=1,
        accelerator="a10_24gb_100",
        speculator="disabled",
        preloaded_adapters=["adapter_id/1", "adapter_id/2", "adapter_id/3", "adapter_id/4", "adapter_id/5"],  # Preload multiple adapters
        task="embed",  # Specify the task
        truncate_dim=256  # Specify the output embedding dimension
    )
)
Running Inference
Python SDK
Base Model
# Generate embeddings for a single chunk of data with the base model
text = "Generate embeddings using your dedicated deployment."
response = pb.embeddings.create(model="my-embedding-model", input=text)
print(f"Generated {len(response.data[0].embedding)}-dimensional embedding")
# Process a batch of documents with the base model
documents = [
    "First document for embedding",
    "Second document with different content",
    "Third document to process in batch",
]
batch_embeddings = [
    pb.embeddings.create(model="my-embedding-model", input=doc).data[0].embedding
    for doc in documents
]
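Since similarity search is one of the main uses called out at the top of this guide, here is a minimal sketch of ranking the batch above against a query by cosine similarity. It assumes numpy as an extra dependency; the query string is illustrative, and documents and batch_embeddings come from the snippet above.
import numpy as np

# Embed the query with the same deployment so the vectors are comparable
query = "Which document mentions batch processing?"
query_embedding = pb.embeddings.create(model="my-embedding-model", input=query).data[0].embedding

# Cosine similarity between the query and each document embedding
doc_matrix = np.array(batch_embeddings)
query_vec = np.array(query_embedding)
scores = doc_matrix @ query_vec / (np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec))

# Print documents from most to least similar
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {documents[idx]}")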
Predibase Adapters
# Generate inference via the OpenAI embeddings API
import openai
base_url = "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1"
api_token = "<PREDIBASE_API_TOKEN>"
# initialize the OpenAI client
client = openai.OpenAI(base_url=base_url, api_key=api_token)
# List available adapters (optional)
models = client.models.list()
print("AVAILABLE MODELS:")
for model in models.data:
    print(model.id)
# Generate embeddings with a Predibase-trained adapter
responses = client.embeddings.create(
    input=[
        "Generate embeddings using your dedicated deployment.\n",
    ],
    model="my-repo/1",  # <ADAPTER_REPO>/<ADAPTER_VERSION>
)
print(responses)
HF Adapters
import os
os.environ["HUGGINGFACE_HUB_TOKEN"] = "<YOUR HUGGINGFACE TOKEN>" # Required for private adapters
# Generate inference via the OpenAI embeddings API
import openai
base_url = "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1"
api_token = "<PREDIBASE_API_TOKEN>"
# initialize the OpenAI client
client = openai.OpenAI(base_url=base_url, api_key=api_token)
# List available adapters (optional)
models = client.models.list()
print("AVAILABLE MODELS:")
for model in models.data:
    print(model.id)
# Generate embeddings with a Hugging Face adapter
responses = client.embeddings.create(
    input=[
        "Generate embeddings using your dedicated deployment.\n",
    ],
    model="<org>/<adapter>",  # Hugging Face adapter path
)
print(responses)
REST API
Base Model
# An empty "model" string targets the base model
curl -X POST "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1/embeddings" \
  -H "Authorization: Bearer <PREDIBASE_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Generate embeddings using your dedicated deployment.\n"
    ],
    "model": ""
  }'
Predibase Adapters
# Set "model" to <ADAPTER_REPO>/<ADAPTER_VERSION>, e.g. my-repo/1
curl -X POST "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1/embeddings" \
  -H "Authorization: Bearer <PREDIBASE_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Generate embeddings using your dedicated deployment.\n"
    ],
    "model": "my-repo/1"
  }'
HF Adapters
# Configure HF API token for private adapters
export HUGGINGFACE_HUB_TOKEN=<YOUR HUGGINGFACE TOKEN> # Required for private adapters
# Set "model" to the Hugging Face adapter path
curl -X POST "https://serving.app.predibase.com/<TENANT_ID>/deployments/v2/llms/<DEPLOYMENT_NAME>/v1/embeddings" \
  -H "Authorization: Bearer <PREDIBASE_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Generate embeddings using your dedicated deployment.\n"
    ],
    "model": "<org>/<adapter>"
  }'
Supported Models
The following embedding models are officially supported for deployment on Predibase:
| Model Name | Architecture | Output Dimensions | License | Always-On Shared Endpoint |
|---|---|---|---|---|
| WhereIsAI/UAE-Large-V1 | BERT | 1024 | MIT | ❌ |
| dunzhang/stella_en_1.5B_v5 | Qwen | 1024 | Apache 2.0 | ❌ |
| distilbert-base-uncased | DistilBERT | 768 | Apache 2.0 | ❌ |
We can add new models to our catalog on a case-by-case basis. If you have a specific model in mind, please reach out to us at support@predibase.com.
Model Details
BERT-based Models
Best for: High-quality embeddings with proven architecture
- WhereIsAI/UAE-Large-V1
  - Strong performance on similarity tasks
  - 1024-dimensional embeddings
  - MIT license
  - Efficient inference
- distilbert-base-uncased
  - Compressed BERT architecture
  - 768-dimensional embeddings
  - Apache 2.0 license
  - Fast inference speed
Qwen-based Models
Best for: State-of-the-art embedding quality
- dunzhang/stella_en_1.5B_v5
  - Large model with 1.5B parameters
  - 1024-dimensional embeddings
  - Apache 2.0 license
  - Advanced semantic understanding
Best Practices
- Input Size
  - Check model documentation for maximum input length
  - Consider truncating or chunking long inputs (see the sketch after this list)
  - Balance context against performance
- Batch Processing
  - Implement custom batching for large datasets (see the sketch after this list)
  - Monitor memory usage during batch processing
  - Consider async processing for large workloads
- Deployment Configuration
  - Use auto-scaling for cost optimization
  - Monitor performance metrics
  - Choose an appropriate GPU for your workload
- Model Selection
  - Weigh embedding dimensions against quality
  - Match model size to hardware capabilities
  - Evaluate licensing requirements
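To make the input-size and batching guidance above concrete, here is a minimal sketch that chunks long texts and embeds them in fixed-size batches, reusing the pb client and deployment from the Quick Start. chunk_text, embed_in_batches, max_words, and batch_size are hypothetical names and parameters, the whitespace splitter is a stand-in for a real tokenizer (check your model's documented maximum input length), and the list-style input assumes OpenAI-compatible batch support.
def chunk_text(text, max_words=300):
    """Naive whitespace chunker; swap in a tokenizer-based splitter to respect true token limits."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed_in_batches(texts, model="my-embedding-model", batch_size=16):
    """Embed texts in fixed-size batches to bound request size and client memory."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Assumes the endpoint accepts a list input, as in the OpenAI embeddings API;
        # if yours does not, fall back to one create() call per text
        response = pb.embeddings.create(model=model, input=batch)
        embeddings.extend(item.embedding for item in response.data)
    return embeddings

# Chunk every document, then embed all chunks batch by batch
chunks = [chunk for doc in documents for chunk in chunk_text(doc)]
chunk_embeddings = embed_in_batches(chunks)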