Overview

Vision Language Models (VLMs) extend traditional Large Language Models (LLMs) by incorporating visual inputs alongside text. While LLMs process purely linguistic context, VLMs can jointly reason about images and text by transforming pixel data into embeddings that align with text representations.

Vision Language Model support is currently in beta. If you encounter any issues, please reach out at support@predibase.com.

Key Applications

Visual Recognition & Reasoning

Identify objects, activities, and structured elements within images for document analysis and product categorization

Image Captioning

Generate descriptive captions for accessibility, social media, and content tagging

Visual Q&A

Answer questions about image content for customer support, education, and smart image search

Multimodal Generation

Create text from visual prompts or combine text and images for synthetic datasets

Fine-tuning Process

Dataset Requirements

To fine-tune VLMs on Predibase, your dataset must contain three columns:

  • prompt: The text input/question about the image
  • completion: The expected response/answer
  • images: The image data as a bytestream

Currently, we support one image input per prompt-completion pair. Support for multiple images and multi-turn chat is coming soon.

Dataset Preparation

If you have a raw dataset in this format:

Prompt: "What is this a picture of?"
Completion: "A dog"
Image: <URL_TO_IMAGE>

You can use the following code to format your dataset for Predibase fine-tuning:

processor.py
from transformers import AutoProcessor
from PIL import Image
import pandas as pd
import requests
import io

# Make sure you have access to the base model you want to fine-tune,
# e.g. model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_id = "<huggingface-id-of-your-base-vlm>"
processor = AutoProcessor.from_pretrained(model_id)

def to_bytes(image_url):
    if isinstance(image_url, list):
        image_url = image_url[0]

    if isinstance(image_url, str):
        image = Image.open(requests.get(image_url, stream=True).raw)
        image = image.convert("RGB")
    else:
        image = image_url

    img_byte_array = io.BytesIO()
    image.save(img_byte_array, format='PNG')
    return img_byte_array.getvalue()

def format_row(row):
    prompt = row["prompt"]
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": processor.image_token + prompt,
                }
            ]
        }
    ]
    try:
        formatted_prompt = processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
    except Exception:
        # Fall back to a simple format if the processor does not define a chat template
        formatted_prompt = processor.image_token + prompt + processor.tokenizer.eos_token

    completion = row["completion"]
    image_bytes = to_bytes(row["images"])

    return {
        "prompt": formatted_prompt,
        "completion": completion,
        "images": image_bytes
    }

# train_dataset is a pandas DataFrame with "prompt", "completion", and "images" columns,
# e.g. train_dataset = pd.read_csv("path/to/raw_dataset.csv")
train_dataset = train_dataset.apply(format_row, axis=1, result_type="expand")

The formatted text inputs should each contain an image token (e.g. <|image|>), and the images column should contain raw byte strings.
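Before uploading, it can help to sanity-check the formatted output. The snippet below is a minimal sketch that reuses processor and train_dataset from processor.py above; validate_row is a hypothetical helper, not part of the SDK:

def validate_row(row):
    # Every prompt must contain the processor's image token
    assert processor.image_token in row["prompt"], "prompt is missing the image token"
    # Completions should be non-empty text
    assert isinstance(row["completion"], str) and row["completion"], "completion must be non-empty text"
    # Images must be raw byte strings
    assert isinstance(row["images"], bytes) and len(row["images"]) > 0, "images must contain raw bytes"

train_dataset.apply(validate_row, axis=1)
print(f"Validated {len(train_dataset)} rows")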

Uploading Your Dataset

Because VLM datasets can be large, there are two ways to get them into Predibase:

  1. S3 Upload (Recommended)

    • Upload datasets to S3 first
    • Connect to Predibase via the UI
    • Supports datasets up to 1 GB
  2. Direct File Upload

    dataset = pb.datasets.from_file("path/to/local/file", name="doc_vqa_test")

During beta, we support datasets up to 200 rows for VLM fine-tuning. Support for larger datasets and more models (like Qwen-VL) is coming very soon.
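For the direct-upload path, one workable pattern is to serialize the formatted DataFrame from processor.py to a CSV file and upload that file. This is a minimal sketch; the file path and dataset name are placeholders:

from predibase import Predibase

pb = Predibase(api_token="...")

# Write the formatted columns (prompt, completion, images) to disk.
# Note: pandas stores the byte strings as their Python repr; the evaluation
# script below reverses this with ast.literal_eval.
train_dataset.to_csv("doc_vqa_train.csv", index=False)

# Upload the file as a Predibase dataset
dataset = pb.datasets.from_file("doc_vqa_train.csv", name="doc_vqa_train")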

Training Configuration

VLM fine-tuning follows the same process as regular LLM fine-tuning, with two key differences:

  1. You must select a VLM as the base model (see supported VLMs)
  2. Your dataset must follow the format described above

Currently, we only support:

  • SFT task type (GRPO coming soon!)
  • LoRA adapters (Turbo and Turbo LoRA coming soon)
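Kicking off the job then looks the same as text-only fine-tuning. The sketch below assumes the SDK's standard repo/adapter creation flow (pb.repos.create and pb.adapters.create with a FinetuningConfig); the repo name and description are placeholders, and the config options mirror the regular fine-tuning docs:

from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="...")

# Create (or reuse) an adapter repository for the fine-tuned weights
repo = pb.repos.create(name="my-repo", exists_ok=True)

# Launch an SFT + LoRA fine-tuning job on a supported VLM base model
adapter = pb.adapters.create(
    config=FinetuningConfig(base_model="llama-3-2-11b-vision-instruct"),
    dataset=dataset,  # the VLM dataset uploaded above
    repo=repo,
    description="VLM fine-tuning example",
)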

Inference

Image Input Formats

There are three ways to include images in your prompts:

URL
prompt = "![](https://example.com/image.png) What is this a picture of?"
Local File
import base64

with open("image.png", "rb") as f:
    byte_string = base64.b64encode(f.read()).decode()

prompt = f"![](data:image/png;base64,{byte_string}) What is this an image of?"
Byte String
import base64

byte_string = your_byte_string  # the raw image bytes, e.g. from your dataset's images column
encoded_byte_string = base64.b64encode(byte_string).decode()

prompt = f"![](data:image/png;base64,{encoded_byte_string}) What is this an image of?"

For best results, place the image before the text in your prompts. This allows the text tokens to properly attend to the image tokens, leading to better understanding and more accurate responses.

Basic Generation

Use either our shared endpoints or create a private deployment:

from predibase import Predibase

pb = Predibase(api_token="...")

client = pb.deployments.client("llama-3-2-11b-vision-instruct")
response = client.generate(
    prompt,
    adapter_id="my-repo/1",
    max_new_tokens=128
)
print(response.generated_text)
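If you issue many single-image queries, a small wrapper that builds the image-first prompt from the URL format shown above keeps call sites tidy. ask_about_image is a hypothetical convenience helper, not part of the SDK:

def ask_about_image(client, image_url, question, adapter_id=None, max_new_tokens=128):
    # Put the image before the text so the text tokens can attend to the image tokens
    prompt = f"![]({image_url}) {question}"
    response = client.generate(
        prompt,
        adapter_id=adapter_id,
        max_new_tokens=max_new_tokens,
    )
    return response.generated_text

print(ask_about_image(client, "https://example.com/image.png", "What is this a picture of?", adapter_id="my-repo/1"))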

Batch Evaluation

For evaluating your model on a test dataset, you can use this helper script (it assumes the Llama 3.2 <|image|> and <|eot_id|> tokens; adjust the string handling for other base models):

evaluation.py
import pandas as pd
from predibase import Predibase
import base64
import ast

pb = Predibase(api_token="...")

def evaluate_vlm(dataset_path, deployment_name, adapter_id, output_file):
    base_img_str = "![](data:image/png;base64,{})"

    # Load and preprocess dataset
    dataset = pd.read_csv(dataset_path)
    dataset.rename(columns={"completion": "ground_truth"}, inplace=True)

    # Extract raw prompt
    dataset["raw_prompt"] = dataset["prompt"].apply(
        lambda x: x.split("<|image|>")[1].split("<|eot_id|>")[0]
    )

    # Format images
    dataset['images'] = dataset['images'].apply(ast.literal_eval)
    dataset['images'] = dataset['images'].apply(
        lambda x: base64.b64encode(x).decode('utf-8')
    )
    dataset['images'] = dataset['images'].apply(
        lambda x: base_img_str.format(x)
    )

    # Update prompts with formatted images
    for index, row in dataset.iterrows():
        dataset.at[index, 'prompt'] = row['prompt'].replace(
            "<|image|>",
            row['images']
        )

    dataset.drop(columns=['images'], inplace=True)

    # Run inference
    client = pb.deployments.client(deployment_name, force_bare_client=True)
    adapter_completions = []

    for index, row in dataset.iterrows():
        try:
            completion = client.generate(
                row['prompt'],
                adapter_id=adapter_id,
                max_new_tokens=128,
                temperature=1
            )
            adapter_completions.append(completion.generated_text)
        except Exception as e:
            print(f"Failed at index {index}: {e}")
            adapter_completions.append(None)

    # Format output
    dataset.drop(columns=['prompt'], inplace=True)
    dataset['adapter_completions'] = adapter_completions
    dataset = dataset[['raw_prompt', 'ground_truth', 'adapter_completions']]

    # Save results
    dataset.to_csv(output_file, index=False)
    return dataset


# Example usage
results = evaluate_vlm(
    dataset_path="path/to/dataset.csv",
    deployment_name="llama-3-2-11b-vision-instruct",
    adapter_id="my-repo/1",
    output_file="evaluation_results.csv"
)
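With the results saved, a quick exact-match comparison gives a rough quality signal; for free-form answers you will likely want a fuzzier metric (e.g. token-level F1 or an LLM judge). A minimal sketch over the returned DataFrame:

# Rough exact-match accuracy between adapter outputs and ground truth
matches = (
    results["adapter_completions"].fillna("").str.strip().str.lower()
    == results["ground_truth"].astype(str).str.strip().str.lower()
)
print(f"Exact match: {matches.mean():.2%} ({matches.sum()}/{len(results)})")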

For any questions about vision-language models or support for larger datasets and additional models, please reach out to our team at support@predibase.com.