Overview
Vision Language Models (VLMs) extend traditional Large Language Models (LLMs) by incorporating image inputs alongside text. While LLMs process purely linguistic context, VLMs can jointly reason about images and text by transforming pixel data into embeddings that align with text representations.
Vision Language Model support is currently in beta. If you encounter any
issues, please reach out at support@predibase.com.
Key Applications
Image Recognition & Reasoning
Identify objects, activities, and structured elements within images for
document analysis and product categorization
Image Captioning
Generate descriptive captions for accessibility, social media, and content
tagging
Vision Q&A
Answer questions about image content for customer support, education, and
smart image search
Multimodal Generation
Create text from vision prompts or combine text and images for synthetic
datasets
Fine-tuning Process
Dataset Requirements
To fine-tune VLMs on Predibase, your dataset must contain one column:

- messages: Conversations between a user and an assistant
We currently do NOT support tool calling with VLMs
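For illustration, a single messages entry could look like the sketch below. The OpenAI-style content layout (an image_url part plus a text part in the user turn) is an assumption made here for illustration; consult the dataset format reference for the exact schema Predibase expects.

```python
# A hypothetical example of one row's "messages" value: a user turn that
# carries an image plus a question, followed by the assistant's answer.
# The image_url / text content structure is an assumption, not the
# authoritative Predibase schema.
example_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
            {"type": "text", "text": "What product is shown in this image?"},
        ],
    },
    {"role": "assistant", "content": "A pair of wireless over-ear headphones."},
]
```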
Dataset Preparation
If you have a raw dataset that is not already in the messages format, you can convert it with a small preprocessing script such as processor.py.
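As a rough sketch, a processor.py along the following lines could turn a raw CSV of image paths, questions, and answers into the messages format shown above. The column names (image_path, question, answer), file names, and the base64 data-URL image encoding are all assumptions for illustration; adapt them to your own raw data.

```python
# A minimal sketch of a processor.py that converts a raw CSV of
# (image_path, question, answer) rows into a dataset with a single
# "messages" column. Column names, file names, and the data-URL image
# encoding are assumptions made for illustration.
import base64
import json

import pandas as pd


def encode_image(path: str) -> str:
    """Read a local image and return it as a base64 data URL (assumed encoding)."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"


def to_messages(row: pd.Series) -> list:
    """Build one user/assistant conversation pairing the image with its Q&A."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": encode_image(row["image_path"])}},
                {"type": "text", "text": row["question"]},
            ],
        },
        {"role": "assistant", "content": row["answer"]},
    ]


if __name__ == "__main__":
    raw = pd.read_csv("raw_dataset.csv")  # hypothetical raw dataset
    with open("vlm_dataset.jsonl", "w") as out:
        for _, row in raw.iterrows():
            out.write(json.dumps({"messages": to_messages(row)}) + "\n")
```

The resulting vlm_dataset.jsonl (a hypothetical output name) can then be uploaded as described in the next section.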
Uploading Your Dataset
Due to the size of VLM datasets, we recommend two approaches:

- S3 Upload (Recommended)
  - Upload datasets to S3 first (see the sketch after this list)
  - Connect to Predibase via the UI
  - Supports datasets up to 1 GB
- Direct File Upload
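As a sketch of the S3 route, the upload step itself can be done with boto3. The bucket name, object key, and local file path below are placeholders, and AWS credentials are assumed to be configured in your environment (environment variables, ~/.aws, or an IAM role). Once the file is in S3, connect the bucket to Predibase through the UI as described above.

```python
# A minimal sketch of the "upload datasets to S3 first" step using boto3.
# Bucket name, object key, and file path are placeholders for illustration.
import boto3

s3 = boto3.client("s3")

# Upload the prepared JSONL dataset to your bucket; afterwards, connect the
# bucket to Predibase via the UI.
s3.upload_file(
    Filename="vlm_dataset.jsonl",        # local file produced by processor.py
    Bucket="my-training-data-bucket",    # placeholder bucket name
    Key="datasets/vlm_dataset.jsonl",    # placeholder object key
)
```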