Vision Language Models (VLMs) extend traditional Large Language Models (LLMs) by
incorporating image inputs alongside text. While LLMs process purely linguistic
context, VLMs can jointly reason about images and text by transforming pixel
data into embeddings that align with text representations.
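As an illustration only (not the actual model code), the typical pattern is a vision encoder that produces patch embeddings, followed by a projection layer that maps them into the text embedding space so the language model can attend over image and text tokens together. The module names and dimensions below are hypothetical:

```python
import torch
import torch.nn as nn

class ToyVLMProjector(nn.Module):
    """Hypothetical sketch: project vision-encoder outputs into the text embedding space."""

    def __init__(self, vision_dim: int = 1024, text_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_patch_embeddings: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # Map image patch embeddings into the same space as the text embeddings,
        # then concatenate so the LLM can attend over both modalities jointly.
        projected = self.proj(image_patch_embeddings)          # (batch, num_patches, text_dim)
        return torch.cat([projected, text_embeddings], dim=1)  # (batch, num_patches + seq_len, text_dim)
```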
Vision Language Model support is currently in beta. If you encounter any
issues, please reach out at support@predibase.com.
If you want to use a base64-encoded image instead of an image URL, you can use the following (assuming your "Image" column contains local file paths):
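The sketch below is one way to do it, assuming a pandas DataFrame `df` whose "Image" column holds local paths to JPEG or PNG files; it replaces each path with a base64 data URI.

```python
import base64
import mimetypes

import pandas as pd

def to_base64_data_uri(path: str) -> str:
    """Read a local image file and return it as a base64 data URI."""
    mime_type, _ = mimetypes.guess_type(path)
    mime_type = mime_type or "image/png"  # fall back if the type can't be guessed
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Assumes `df` is a pandas DataFrame with local file paths in its "Image" column.
df["Image"] = df["Image"].apply(to_base64_data_uri)
```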