Vision Language Models (VLMs) extend traditional Large Language Models (LLMs) by
incorporating visual inputs alongside text. While LLMs process purely linguistic
context, VLMs can jointly reason about images and text by transforming pixel
data into embeddings that align with text representations.
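To make this concrete, here is a toy sketch of the fusion step, illustrative only and not how any particular model (or Predibase) is implemented: image patch features are projected into the same embedding space as text tokens, and the decoder then attends over the combined sequence. The class name and dimensions (ToyVLMFusion, patch_dim, d_model, vocab_size) are arbitrary placeholders.

import torch
import torch.nn as nn

class ToyVLMFusion(nn.Module):
    """Illustrative only: project image features into the text embedding space."""

    def __init__(self, patch_dim=768, d_model=512, vocab_size=32000):
        super().__init__()
        self.vision_proj = nn.Linear(patch_dim, d_model)     # align image patches with text embeddings
        self.text_embed = nn.Embedding(vocab_size, d_model)  # standard LLM token embeddings

    def forward(self, patch_features, token_ids):
        img_emb = self.vision_proj(patch_features)  # (num_patches, d_model)
        txt_emb = self.text_embed(token_ids)        # (num_tokens, d_model)
        # Image embeddings come first, so text tokens can attend to them in the decoder
        return torch.cat([img_emb, txt_emb], dim=0)

# Example: 196 image patches followed by 8 text tokens
fusion = ToyVLMFusion()
combined = fusion(torch.randn(196, 768), torch.randint(0, 32000, (8,)))
print(combined.shape)  # torch.Size([204, 512])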
Vision Language Model support is currently in beta. If you encounter any
issues, please reach out at support@predibase.com.
There are three ways to include images in your prompts:
URL
prompt = " What is this a picture of?"
Local File
import base64

# Read the image from disk and base64-encode it
with open("image.png", "rb") as f:
    byte_string = base64.b64encode(f.read()).decode()

# Embed the encoded image as a data URI using markdown image syntax
prompt = f"![](data:image/png;base64,{byte_string}) What is this an image of?"
Byte String
import base64

# Base64-encode image bytes that are already in memory
byte_string = your_byte_string
encoded_byte_string = base64.b64encode(byte_string).decode()

# Embed the encoded image as a data URI; adjust the MIME type to match your image format
prompt = f"![](data:image/png;base64,{encoded_byte_string}) What is this an image of?"
For best results, place the image before the text in your prompts. This allows the text tokens to properly attend to the image tokens, leading to better understanding and more accurate responses.
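As a minimal end-to-end sketch, a prompt built this way can be sent to a deployed vision-language model with the Predibase Python SDK. The deployment name below is borrowed from the evaluation script further down and the image URL is a placeholder; substitute your own.

from predibase import Predibase

pb = Predibase(api_token="...")

# Connect to a deployed vision-language model (example deployment name)
client = pb.deployments.client("llama-3-2-11b-vision-instruct")

# Image markup first, question text second
prompt = "![](https://example.com/image.png) What is this a picture of?"

response = client.generate(prompt, max_new_tokens=128)
print(response.generated_text)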
To evaluate your model on a test dataset, use the following helper script:
evaluation.py
import pandas as pd
from predibase import Predibase
import base64
import ast

pb = Predibase(api_token="...")

def evaluate_vlm(dataset_path, deployment_name, adapter_id, output_file):
    # Markdown image markup with a base64 data URI; adjust the MIME type if your images are not PNGs
    base_img_str = "![](data:image/png;base64,{})"

    # Load and preprocess dataset
    dataset = pd.read_csv(dataset_path)
    dataset.rename(columns={"completion": "ground_truth"}, inplace=True)

    # Extract raw prompt
    dataset["raw_prompt"] = dataset["prompt"].apply(
        lambda x: x.split("<|image|>")[1].split("<|eot_id|>")[0]
    )

    # Format images
    dataset['images'] = dataset['images'].apply(ast.literal_eval)
    dataset['images'] = dataset['images'].apply(
        lambda x: base64.b64encode(x).decode('utf-8')
    )
    dataset['images'] = dataset['images'].apply(
        lambda x: base_img_str.format(x)
    )

    # Update prompts with formatted images
    for index, row in dataset.iterrows():
        dataset.at[index, 'prompt'] = row['prompt'].replace(
            "<|image|>", row['images']
        )
    dataset.drop(columns=['images'], inplace=True)

    # Run inference
    client = pb.deployments.client(deployment_name, force_bare_client=True)
    adapter_completions = []
    for index, row in dataset.iterrows():
        try:
            completion = client.generate(
                row['prompt'],
                adapter_id=adapter_id,
                max_new_tokens=128,
                temperature=1
            )
            adapter_completions.append(completion.generated_text)
        except Exception as e:
            print(f"Failed at index {index}: {e}")
            adapter_completions.append(None)

    # Format output
    dataset.drop(columns=['prompt'], inplace=True)
    dataset['adapter_completions'] = adapter_completions
    dataset = dataset[['raw_prompt', 'ground_truth', 'adapter_completions']]

    # Save results
    dataset.to_csv(output_file, index=False)
    return dataset

# Example usage
results = evaluate_vlm(
    dataset_path="path/to/dataset.csv",
    deployment_name="llama-3-2-11b-vision-instruct",
    adapter_id="my-repo/1",
    output_file="evaluation_results.csv"
)
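Once the script finishes, you can score the saved results however your task requires. As a minimal example, assuming exact string match is a meaningful metric for your task (the file name and column names come from the script above):

import pandas as pd

# Load the results written by the helper script
results = pd.read_csv("evaluation_results.csv")

# Compare adapter completions against the ground-truth completions
matches = (
    results["adapter_completions"].astype(str).str.strip()
    == results["ground_truth"].astype(str).str.strip()
)
print(f"Exact-match accuracy: {matches.mean():.2%}")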
For any questions about vision-language models or support for larger datasets and additional models, please reach out to our team at support@predibase.com.