Overview
Vision Language Models (VLMs) extend traditional Large Language Models (LLMs) by
incorporating image inputs alongside text. While LLMs process purely linguistic
context, VLMs can jointly reason about images and text by transforming pixel
data into embeddings that align with text representations.
Vision Language Model support is currently in beta. If you encounter any
issues, please reach out at support@predibase.com.
Key Applications
Image Recognition & Reasoning: Identify objects, activities, and structured elements within images for document analysis and product categorization
Image Captioning: Generate descriptive captions for accessibility, social media, and content tagging
Vision Q&A: Answer questions about image content for customer support, education, and smart image search
Multimodal Generation: Create text from vision prompts or combine text and images for synthetic datasets
Fine-tuning Process
Dataset Requirements
To fine-tune VLMs on Predibase, your dataset must contain one column:
messages: Conversations between a user and an assistant (see the example below)
We currently do NOT support tool calling with VLMs.
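In practice, each value in the messages column is a list of user and assistant turns, where the user turn can mix image and text content. A minimal sketch of one row's value, matching the format used in the examples that follow (the URL and wording here are placeholders):

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/dog.jpg"}},
            {"type": "text", "text": "What is this a picture of?"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "A dog"},
        ],
    },
]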
Dataset Preparation
If you have a raw dataset in this format:
Prompt: "What is this a picture of?"
Completion: "A dog"
Image: <URL_TO_IMAGE>
Use the following code to format it for Predibase fine-tuning:
from PIL import Image
import pandas as pd
import requests
import io

def format_row(row):
    # Convert one raw (Prompt, Image, Completion) row into the chat "messages" format
    prompt = row["Prompt"]
    image = row["Image"]
    completion = row["Completion"]
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image
                    }
                },
                {
                    "type": "text",
                    "text": prompt,
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": completion
                }
            ]
        }
    ]
    return {
        "messages": messages
    }

# Apply row-wise; result_type="expand" turns the returned dict into a "messages" column
train_dataset = train_dataset.apply(format_row, axis=1, result_type="expand")
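Once formatted, the DataFrame can be written to disk so it can be uploaded in the next section. A minimal sketch, assuming your raw data lives in a CSV with Prompt, Completion, and Image columns and that a JSON Lines file is an acceptable upload format (the file names below are placeholders):

import pandas as pd

# Load the raw dataset (placeholder path) and apply the formatting function above
train_dataset = pd.read_csv("raw_dataset.csv")
train_dataset = train_dataset.apply(format_row, axis=1, result_type="expand")

# One JSON object per line preserves the nested "messages" structure of each row
train_dataset.to_json("formatted_dataset.jsonl", orient="records", lines=True)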
If you want to use a base64 encoded image instead of an image URL, you can use the following (assuming you have local filepaths in your “Image” column):
from PIL import Image
import pandas as pd
import requests
import io
import base64

def encode_image(image_path):
    # Read a local image file and return its base64-encoded contents as a string
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def format_row(row):
    prompt = row["Prompt"]
    image = row["Image"]
    encoded_image = encode_image(image)
    completion = row["Completion"]
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encoded_image}"
                    }
                },
                {
                    "type": "text",
                    "text": prompt,
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": completion
                }
            ]
        }
    ]
    return {
        "messages": messages
    }

train_dataset = train_dataset.apply(format_row, axis=1, result_type="expand")
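The snippet above hardcodes a JPEG MIME type in the data URL. If your local files mix formats (for example PNG and JPEG), one option is to derive the MIME type from the file extension instead; a small sketch using the standard library, not part of the original example:

import base64
import mimetypes

def encode_image_as_data_url(image_path):
    # Guess the MIME type from the file extension, falling back to JPEG
    mime_type, _ = mimetypes.guess_type(image_path)
    mime_type = mime_type or "image/jpeg"
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"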
Uploading Your Dataset
Due to the size of VLM datasets, we recommend two approaches:
S3 Upload (Recommended)
Upload datasets to S3 first
Connect to Predibase via the UI
Supports datasets up to 1 GB
Direct File Upload
dataset = pb.datasets.from_file("path/to/local/file", name="doc_vqa_test")
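The pb object above is the Predibase Python SDK client; if you have not created one yet, a minimal sketch (assuming the standard Predibase constructor, with a placeholder API token):

from predibase import Predibase

# Initialize the SDK client with your Predibase API token (placeholder)
pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")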
Inference
For inference, we recommend using the OpenAI-compatible chat completions API:
from openai import OpenAI

# Initialize client
api_token = "<PREDIBASE_API_TOKEN>"
tenant_id = "<PREDIBASE_TENANT_ID>"
model_name = "<DEPLOYMENT_NAME>"  # Ex. "qwen2-5-vl-7b-instruct"
adapter = "<ADAPTER_REPO_NAME>/<VERSION_NUMBER>"  # Ex. "adapter-repo/1" (optional)
base_url = f"https://serving.app.predibase.com/{tenant_id}/deployments/v2/llms/{model_name}/v1"

client = OpenAI(
    api_key=api_token,
    base_url=base_url
)

# Chat completion
completion = client.chat.completions.create(
    model=adapter,  # Use empty string "" for base model
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is this an image of?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "<IMAGE_URL_HERE>"
                    }
                }
            ]
        }
    ],
    max_tokens=100
)

print(completion.choices[0].message.content)
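If your test images are local files rather than URLs, the image_url field can also carry a base64 data URL, mirroring the fine-tuning examples above (assuming your deployment accepts base64-encoded images; the file path below is a placeholder):

import base64

# Encode a local image (placeholder path) as a base64 data URL
with open("path/to/local/image.jpg", "rb") as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

completion = client.chat.completions.create(
    model=adapter,  # Use empty string "" for base model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is this an image of?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
                },
            ],
        }
    ],
    max_tokens=100,
)
print(completion.choices[0].message.content)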