Learn how to fine-tune Vision Language Models (VLMs) to process both images and text
Vision Language Models (VLMs) extend traditional Large Language Models (LLMs) by incorporating image inputs alongside text. While LLMs process purely linguistic context, VLMs can jointly reason about images and text by transforming pixel data into embeddings that align with text representations.
Vision Language Model support is currently in beta. If you encounter any issues, please reach out at support@predibase.com.
With VLMs you can:

- Identify objects, activities, and structured elements within images for document analysis and product categorization
- Generate descriptive captions for accessibility, social media, and content tagging
- Answer questions about image content for customer support, education, and smart image search
- Create text from vision prompts, or combine text and images to build synthetic datasets
To fine-tune VLMs on Predibase, your dataset must contain one column:

- `messages`: conversations between a user and an assistant

We currently do NOT support tool calling with VLMs.
If you have a raw dataset in this format:
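(The example below is illustrative: it assumes a simple tabular dataset with an `Image` column of image URLs plus `Question` and `Answer` columns. Your column names may differ.)

```csv
Image,Question,Answer
https://example.com/images/cat.jpg,What animal is in the photo?,A tabby cat sitting on a windowsill.
https://example.com/images/receipt.png,What is the total on this receipt?,$42.17
```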
Use the following code to format it for Predibase fine-tuning:
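A minimal sketch using pandas, assuming the hypothetical `Image`/`Question`/`Answer` columns above. Each row becomes a two-turn conversation in the OpenAI-style chat format, with the image and question as content parts of the user turn and the target answer as the assistant turn:

```python
import pandas as pd

df = pd.read_csv("raw_dataset.csv")  # hypothetical raw dataset from above

def to_messages(row):
    # One training example = one user/assistant exchange. The user turn
    # carries both the image (by URL) and the question as content parts.
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": row["Image"]}},
                {"type": "text", "text": row["Question"]},
            ],
        },
        {"role": "assistant", "content": row["Answer"]},
    ]

df["messages"] = df.apply(to_messages, axis=1)

# Keep only the required column and write one JSON record per line.
df[["messages"]].to_json("vlm_dataset.jsonl", orient="records", lines=True)
```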
If you want to use base64-encoded images instead of image URLs, you can use the following (assuming your “Image” column contains local file paths):
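A sketch of the swap: each local file is read and embedded as a base64 data URI, after which the `to_messages` helper above works unchanged.

```python
import base64
import mimetypes

def to_data_uri(path):
    # Read a local image and encode it as a base64 data URI so it can be
    # inlined in the message instead of referenced by URL.
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

df["Image"] = df["Image"].apply(to_data_uri)
df["messages"] = df.apply(to_messages, axis=1)
```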
Due to the size of VLM datasets, we recommend one of two upload approaches:

1. S3 Upload (Recommended)
2. Direct File Upload
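For direct file upload, here is a sketch using the Predibase Python SDK's `pb.datasets.from_file` helper (the exact signature may differ across SDK versions; S3 buckets are typically connected through the Predibase UI instead):

```python
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Upload the formatted JSONL produced above as a Predibase dataset.
dataset = pb.datasets.from_file("vlm_dataset.jsonl", name="vlm_dataset")
```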
For inference, we suggest using the OpenAI-compatible chat completions API:
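A sketch using the `openai` client against a Predibase deployment's OpenAI-compatible endpoint. The base URL pattern, tenant ID, deployment name, and adapter reference below are placeholders to adapt to your account:

```python
from openai import OpenAI

# Placeholders: substitute your tenant ID, deployment name, and API token.
client = OpenAI(
    base_url="https://serving.app.predibase.com/<tenant-id>/deployments/v2/llms/<deployment-name>/v1",
    api_key="<PREDIBASE_API_TOKEN>",
)

response = client.chat.completions.create(
    # To route to a fine-tuned adapter, pass its reference, e.g. "<repo>/<version>".
    model="<adapter-repo>/<version>",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/images/cat.jpg"}},
                {"type": "text", "text": "What animal is in the photo?"},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```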