This directory contains examples for working with vision-language models that can process both images and text.
Demonstrates using vision models through Ollama backend with chat interface.
Key Features:
- Loading and processing images
- Using vision models for image understanding
- Chat-based interaction with images
Shows how to use OpenAI-compatible vision models (including local VLLM servers).
Examples using LiteLLM backend for vision model access.
Sample image used in the examples for testing vision capabilities.
- Multimodal Input: Combining text and images in prompts
- Vision Understanding: Asking questions about image content
- Backend Flexibility: Using different backends (Ollama, OpenAI, LiteLLM) for vision
- Image Processing: Loading and formatting images for LLM consumption
from mellea import start_session
from mellea.stdlib.components import Message
# Load image
with open("pointing_up.jpg", "rb") as f:
image_data = f.read()
# Create session with vision model
m = start_session(model_id="llava:7b")
# Ask about the image
response = m.chat(
Message(
role="user",
content="What do you see in this image?",
images=[image_data]
)
)- Ollama: granite3.2-vision, llava, bakllava, llava-phi3, moondream, qwen2.5vl:7b
- OpenAI: gpt-4-vision-preview, gpt-4o
- LiteLLM: Various vision models through unified interface
Pull a vision-capable model before running these examples:
ollama pull granite3.2-vision # ~2.4 GB — primary recommended model
ollama pull qwen2.5vl:7b # ~4.7 GB — used in vision_openai_examples.py- See
test/backends/test_vision_*.pyfor more examples - See
mellea/stdlib/components/chat.pyfor Message API