Overview

Vision Language Models (VLMs) extend traditional Large Language Models (LLMs) by incorporating visual inputs alongside text. While LLMs process purely linguistic context, VLMs can jointly reason about images and text by transforming pixel data into embeddings that align with text representations.

Vision Language Model support is currently in beta. If you encounter any issues, please reach out at support@predibase.com.

Key Applications

Visual Recognition & Reasoning

Identify objects, activities, and structured elements within images for document analysis and product categorization

Image Captioning

Generate descriptive captions for accessibility, social media, and content tagging

Visual Q&A

Answer questions about image content for customer support, education, and smart image search

Multimodal Generation

Create text from visual prompts or combine text and images for synthetic datasets

Fine-tuning Process

Dataset Requirements

To fine-tune VLMs on Predibase, your dataset must contain three columns:

  • prompt: The text input/question about the image
  • completion: The expected response/answer
  • images: The image data as a bytestream

Currently, we support one image input per prompt-completion pair. Support for multiple images and multi-turn chat is coming soon.

Dataset Preparation

If you have a raw dataset in this format:

Prompt: "What is this a picture of?"
Completion: "A dog"
Image: <URL_TO_IMAGE>

You can use the following code to format your dataset for Predibase fine-tuning:

processor.py
from transformers import AutoProcessor
from PIL import Image
import pandas as pd
import requests
import io

# Make sure you have access to the base model you want to fine-tune,
# e.g. model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_id = "<huggingface-id-of-your-base-vlm>"
processor = AutoProcessor.from_pretrained(model_id)

def to_bytes(image_url):
    if isinstance(image_url, list):
        image_url = image_url[0]

    if isinstance(image_url, str):
        image = Image.open(requests.get(image_url, stream=True).raw)
        image = image.convert("RGB")
    else:
        image = image_url

    img_byte_array = io.BytesIO()
    image.save(img_byte_array, format='PNG')
    return img_byte_array.getvalue()

def format_row(row):
    prompt = row["prompt"]
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": processor.image_token + prompt,
                }
            ]
        }
    ]
    try:
        formatted_prompt = processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
    except Exception:
        # Fall back to a simple format if the processor does not define a chat template
        formatted_prompt = processor.image_token + prompt + processor.tokenizer.eos_token

    completion = row["completion"]
    image_bytes = to_bytes(row["images"])

    return {
        "prompt": formatted_prompt,
        "completion": completion,
        "images": image_bytes
    }

# train_dataset is a pandas DataFrame with "prompt", "completion", and "images" columns,
# e.g. train_dataset = pd.read_csv("path/to/raw_dataset.csv")
train_dataset = train_dataset.apply(format_row, axis=1, result_type="expand")

The formatted text inputs should each contain an image token (e.g. <|image|>), and the images column should contain raw byte strings.
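Before uploading, it can help to sanity-check the formatted output. The snippet below is a minimal sketch that reuses processor and train_dataset from processor.py above; validate_row is a hypothetical helper, not part of the SDK:

def validate_row(row):
    # Every prompt must contain the processor's image token
    assert processor.image_token in row["prompt"], "prompt is missing the image token"
    # Completions should be non-empty text
    assert isinstance(row["completion"], str) and row["completion"], "completion must be non-empty text"
    # Images must be raw byte strings
    assert isinstance(row["images"], bytes) and len(row["images"]) > 0, "images must contain raw bytes"

train_dataset.apply(validate_row, axis=1)
print(f"Validated {len(train_dataset)} rows")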

Uploading Your Dataset

Because VLM datasets can be large, there are two ways to get them into Predibase:

  1. S3 Upload (Recommended)

    • Upload datasets to S3 first
    • Connect to Predibase via the UI
    • Supports datasets up to 1 GB
  2. Direct File Upload

    dataset = pb.datasets.from_file("path/to/local/file", name="doc_vqa_test")

During beta, we support datasets up to 200 rows for VLM fine-tuning. Support for larger datasets and more models (like Qwen-VL) is coming very soon.
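For the direct-upload path, one workable pattern is to serialize the formatted DataFrame from processor.py to a CSV file and upload that file. This is a minimal sketch; the file path and dataset name are placeholders:

from predibase import Predibase

pb = Predibase(api_token="...")

# Write the formatted columns (prompt, completion, images) to disk.
# Note: pandas stores the byte strings as their Python repr; the evaluation
# script below reverses this with ast.literal_eval.
train_dataset.to_csv("doc_vqa_train.csv", index=False)

# Upload the file as a Predibase dataset
dataset = pb.datasets.from_file("doc_vqa_train.csv", name="doc_vqa_train")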

Training Configuration

VLM fine-tuning follows the same process as regular LLM fine-tuning, with two key differences:

  1. You must select a VLM as the base model (see supported VLMs)
  2. Your dataset must follow the format described above

Currently, we only support:

  • SFT task type (GRPO coming soon!)
  • LoRA adapters (Turbo and Turbo LoRA coming soon)
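Kicking off the job then looks the same as text-only fine-tuning. The sketch below assumes the SDK's standard repo/adapter creation flow (pb.repos.create and pb.adapters.create with a FinetuningConfig); the repo name and description are placeholders, and the config options mirror the regular fine-tuning docs:

from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="...")

# Create (or reuse) an adapter repository for the fine-tuned weights
repo = pb.repos.create(name="my-repo", exists_ok=True)

# Launch an SFT + LoRA fine-tuning job on a supported VLM base model
adapter = pb.adapters.create(
    config=FinetuningConfig(base_model="llama-3-2-11b-vision-instruct"),
    dataset=dataset,  # the VLM dataset uploaded above
    repo=repo,
    description="VLM fine-tuning example",
)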

Inference

Image Input Formats

There are three ways to include images in your prompts:

URL
prompt = "![](https://example.com/image.png) What is this a picture of?"
Local File
import base64

with open("image.png", "rb") as f:
    byte_string = base64.b64encode(f.read()).decode()

prompt = f"![](data:image/png;base64,{byte_string}) What is this an image of?"
Byte String
import base64

byte_string = your_byte_string  # the raw image bytes, e.g. from your dataset's images column
encoded_byte_string = base64.b64encode(byte_string).decode()

prompt = f"![](data:image/png;base64,{encoded_byte_string}) What is this an image of?"

For best results, place the image before the text in your prompts. This allows the text tokens to properly attend to the image tokens, leading to better understanding and more accurate responses.

Basic Generation

Use either our shared endpoints or create a private deployment:

from predibase import Predibase

pb = Predibase(api_token="...")

client = pb.deployments.client("llama-3-2-11b-vision-instruct")
response = client.generate(
    prompt,
    adapter_id="my-repo/1",
    max_new_tokens=128
)
print(response.generated_text)
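If you issue many single-image queries, a small wrapper that builds the image-first prompt from the URL format shown above keeps call sites tidy. ask_about_image is a hypothetical convenience helper, not part of the SDK:

def ask_about_image(client, image_url, question, adapter_id=None, max_new_tokens=128):
    # Put the image before the text so the text tokens can attend to the image tokens
    prompt = f"![]({image_url}) {question}"
    response = client.generate(
        prompt,
        adapter_id=adapter_id,
        max_new_tokens=max_new_tokens,
    )
    return response.generated_text

print(ask_about_image(client, "https://example.com/image.png", "What is this a picture of?", adapter_id="my-repo/1"))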

Batch Evaluation

For evaluating your model on a test dataset, you can use this helper script (it assumes the Llama 3.2 <|image|> and <|eot_id|> tokens; adjust the string handling for other base models):

evaluation.py
import pandas as pd
from predibase import Predibase
import base64
import ast

pb = Predibase(api_token="...")

def evaluate_vlm(dataset_path, deployment_name, adapter_id, output_file):
    base_img_str = "![](data:image/png;base64,{})"

    # Load and preprocess dataset
    dataset = pd.read_csv(dataset_path)
    dataset.rename(columns={"completion": "ground_truth"}, inplace=True)

    # Extract raw prompt
    dataset["raw_prompt"] = dataset["prompt"].apply(
        lambda x: x.split("<|image|>")[1].split("<|eot_id|>")[0]
    )

    # Format images
    dataset['images'] = dataset['images'].apply(ast.literal_eval)
    dataset['images'] = dataset['images'].apply(
        lambda x: base64.b64encode(x).decode('utf-8')
    )
    dataset['images'] = dataset['images'].apply(
        lambda x: base_img_str.format(x)
    )

    # Update prompts with formatted images
    for index, row in dataset.iterrows():
        dataset.at[index, 'prompt'] = row['prompt'].replace(
            "<|image|>",
            row['images']
        )

    dataset.drop(columns=['images'], inplace=True)

    # Run inference
    client = pb.deployments.client(deployment_name, force_bare_client=True)
    adapter_completions = []

    for index, row in dataset.iterrows():
        try:
            completion = client.generate(
                row['prompt'],
                adapter_id=adapter_id,
                max_new_tokens=128,
                temperature=1
            )
            adapter_completions.append(completion.generated_text)
        except Exception as e:
            print(f"Failed at index {index}: {e}")
            adapter_completions.append(None)

    # Format output
    dataset.drop(columns=['prompt'], inplace=True)
    dataset['adapter_completions'] = adapter_completions
    dataset = dataset[['raw_prompt', 'ground_truth', 'adapter_completions']]

    # Save results
    dataset.to_csv(output_file, index=False)
    return dataset


# Example usage
results = evaluate_vlm(
    dataset_path="path/to/dataset.csv",
    deployment_name="llama-3-2-11b-vision-instruct",
    adapter_id="my-repo/1",
    output_file="evaluation_results.csv"
)
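With the results saved, a quick exact-match comparison gives a rough quality signal; for free-form answers you will likely want a fuzzier metric (e.g. token-level F1 or an LLM judge). A minimal sketch over the returned DataFrame:

# Rough exact-match accuracy between adapter outputs and ground truth
matches = (
    results["adapter_completions"].fillna("").str.strip().str.lower()
    == results["ground_truth"].astype(str).str.strip().str.lower()
)
print(f"Exact match: {matches.mean():.2%} ({matches.sum()}/{len(results)})")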

For any questions about vision-language models or support for larger datasets and additional models, please reach out to our team at support@predibase.com.