Fine-Tuning Visual Language Models (Beta)

What are Visual Language Models?

Visual Language Models (VLMs) extend the capabilities of traditional Large Language Models (LLMs) by incorporating visual inputs alongside text. While LLMs generate or comprehend text based on purely linguistic context, VLMs take this a step further by jointly processing images and text. Essentially, they transform raw pixel data into a form that can be aligned with text embeddings, enabling the model to reason across both modalities.

The core architecture is typically transformer-based, similar to that of LLMs. However, instead of attending only to sequences of text tokens, VLMs also attend to visual features extracted from images by vision encoders (such as CNNs, Vision Transformers, or SigLIP). This enables them to understand complex associations between what they "see" in an image and what they "read" in a caption or query.
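For intuition only, here is a minimal sketch (in PyTorch, not Predibase internals) of that pattern: features from a vision encoder are projected into the text embedding space and attended to alongside the text tokens. All module names and dimensions below are illustrative.

import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy example: fuse projected vision features with text token embeddings."""

    def __init__(self, vocab_size=32000, d_model=512, d_vision=768):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Align vision-encoder features with the text embedding space.
        self.vision_proj = nn.Linear(d_vision, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vision_features):
        # vision_features: (batch, n_patches, d_vision) from a vision encoder
        text = self.token_embed(text_ids)            # (batch, seq, d_model)
        vision = self.vision_proj(vision_features)   # (batch, n_patches, d_model)
        fused = torch.cat([vision, text], dim=1)     # joint image + text sequence
        return self.lm_head(self.transformer(fused))

logits = TinyVLM()(torch.randint(0, 32000, (1, 16)), torch.randn(1, 196, 768))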

(Figure: Visual Language Model architecture)

Common Use-Cases for Visual Language Models (VLMs)

  1. Visual Recognition & Image Reasoning: Identifying objects, activities, or structured elements within images, enhancing tasks like document analysis and product categorization.
  2. Image Captioning: Generating descriptive captions for accessibility, social media, and content tagging.
  3. Visual Question Answering (VQA): Answering questions about an image's content, useful in customer support, education, and smarter image search.
  4. Multimodal Content Generation: Creating text based on visual prompts or combining text and images for synthetic datasets.
  5. Image-Based Search: Enabling more intuitive search engines by combining visual and text-based queries.

Fine-Tuning VLMs in Predibase

Dataset Preparation

To fine-tune VLMs on Predibase, your dataset must contain 3 columns:

  • prompt
  • completion
  • images

In particular, Predibase currently supports only one image input per prompt-completion pair and requires the image to be a bytestream. We're working on adding support for multiple image inputs, as well as multi-turn support for chat-based datasets.

If your dataset currently uses the OpenAI chat completions format, you can convert it to the Predibase-compatible fine-tuning format using the code below:

Raw dataset:

Prompt: "What is this a picture of?"
Completion: "A dog"
Image: <URL_TO_IMAGE>
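As a pandas DataFrame, a raw dataset like this might look like the following sketch (the values are placeholders):

import pandas as pd

# Hypothetical raw dataset: one image URL per prompt-completion pair.
train_dataset = pd.DataFrame({
    "prompt": ["What is this a picture of?"],
    "completion": ["A dog"],
    "images": ["<URL_TO_IMAGE>"],  # converted to PNG bytes by the code below
})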

To apply the chat template, run the following code. This step is needed to insert the model-specific image token into the text input:

from transformers import AutoProcessor
from PIL import Image

import pandas as pd
import requests
import io

# Use the Hugging Face id of the base VLM you plan to fine-tune (placeholder below).
model_id = "<base-vlm-model-id>"
processor = AutoProcessor.from_pretrained(model_id)

def to_bytes(image_url):
    # Download the image, convert it to RGB, and re-encode it as PNG bytes.
    image = Image.open(requests.get(image_url, stream=True).raw)
    image = image.convert("RGB")
    img_byte_array = io.BytesIO()
    image.save(img_byte_array, format="PNG")
    return img_byte_array.getvalue()

def format_row(row):
    prompt = row["prompt"]
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt,
                },
                {
                    "type": "image",
                    "text": None,
                    "index": 0,
                },
            ],
        }
    ]
    # Render the chat template so the model-specific image token is inserted
    # into the text input.
    formatted_prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    completion = row["completion"]
    image_bytes = to_bytes(row["images"])

    return {"prompt": formatted_prompt, "completion": completion, "images": image_bytes}

# Convert every row of the raw dataset (prompt, completion, image URL) in place.
train_dataset: pd.DataFrame = train_dataset.apply(format_row, axis=1, result_type="expand")

Once you run this code, each text input should contain an image token (this varies by model, but it should look something like <|image|>), and the images column should contain raw PNG byte strings.

You can save this dataset in Parquet format via dataset.to_parquet() and upload it to Predibase. Since the feature is currently in beta, uploads for VLM fine-tuning are limited to relatively small datasets (~1000 rows). We're working on supporting larger datasets in the coming weeks.
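For example, a minimal sketch using the Predibase Python SDK (the file and dataset names are placeholders; check the SDK reference for the exact upload call):

from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Write the formatted dataset to Parquet...
train_dataset.to_parquet("vlm_train.parquet")

# ...and upload it to Predibase under a dataset name of your choosing.
dataset = pb.datasets.from_file("vlm_train.parquet", name="vlm_train")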

Training

Training a VLM through either the UI or the SDK follows the same process as training an LLM in Predibase. The only differences are:

  1. You must select a VLM as the base model.
  2. Your dataset must be formatted correctly as above.

That's all it takes!

Note: We currently only support LoRA-based fine-tuning for VLMs. Stay tuned for Turbo and Turbo LoRA support!
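For reference, kicking off a fine-tuning job from the SDK follows the same pattern as for LLMs. A minimal sketch, with placeholder repo, dataset, and base-model names:

from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Create (or reuse) an adapter repository; the name is a placeholder.
repo = pb.repos.create(name="vlm-demo", exists_ok=True)

# Start a LoRA fine-tuning job on the uploaded dataset with a VLM base model.
adapter = pb.adapters.create(
    config=FinetuningConfig(base_model="<vlm-base-model>"),
    dataset="vlm_train",  # the dataset uploaded earlier
    repo=repo,
    description="VLM fine-tuning example",
)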

Inference

When running inference, we suggest referencing the image by URL in your request using Markdown image syntax. For example:

"What is this a picture of? ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)"