Demonstrates multimodal image understanding using a local vision model. The program loads a GGUF language model and its companion multimodal projector (mmproj), reads an image file from disk, attaches it to the conversation, and asks the model to describe what it sees. Everything runs offline.
You need two GGUF files: the language model and its vision projector (mmproj).
Gemma 3 (recommended) — ggml-org/gemma-3
| Model | Size | Links |
|---|---|---|
| Gemma 3 4B | ~2.4 GB (Q4_K_S) | model + mmproj |
| Gemma 3 12B | ~7.1 GB (Q4_K_M) | model + mmproj |
Other vision models — ggml-org/multimodal GGUFs
Any model supported by llama.cpp's mtmd library works: LLaVA, MiniCPM-V, Qwen-VL, InternVL, Pixtral, etc.
make
# Requires a vision-capable GGUF model and its mmproj file
./local-vision model.gguf mmproj.gguf photo.jpg

If no image path is given, it defaults to image.jpg in the current directory.
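
The default-path behavior described above could be sketched as follows. This is a minimal illustration, not the program's actual source: the helper name `image_path_from_args` is hypothetical, and the argument order (model, mmproj, image) is taken from the usage line above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pick the image path from the command line.
 * Assumes argv[1] is the model, argv[2] the mmproj file, and argv[3]
 * the optional image path, matching the usage line above. */
const char *image_path_from_args(int argc, char **argv) {
    assert(argv != NULL);
    if (argc > 3 && argv[3] != NULL && argv[3][0] != '\0') {
        return argv[3];          /* image path given explicitly */
    }
    return "image.jpg";          /* fall back to the documented default */
}
```

Calling it from `main` as `image_path_from_args(argc, argv)` keeps the fallback in one place, so the rest of the program never has to check whether the third argument was supplied.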
- Configuring a multimodal local model with `adam_settings_set_local` and `adam_settings_set_mmproj`
- Loading and attaching image data with `adam_history_attach`
- Auto-detecting image media type from file extension
- Reading binary files and passing raw bytes to the library
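
The last two items can be sketched in plain C without touching the library. The helpers below (`ext_equals`, `media_type_for_path`, `read_file`) are hypothetical names, and the set of recognized extensions is an assumption for this sketch; the example program may handle a different set or fall back differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive comparison of two ASCII strings. */
int ext_equals(const char *a, const char *b) {
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Map a file extension to an image media type.
 * Returns NULL when the extension is missing or not in this (assumed) table. */
const char *media_type_for_path(const char *path) {
    assert(path != NULL);
    const char *dot = strrchr(path, '.');
    const char *slash = strrchr(path, '/');
    /* Ignore dots that belong to a directory name, e.g. "dir.v2/file". */
    if (dot == NULL || (slash != NULL && dot < slash)) return NULL;
    const char *ext = dot + 1;
    if (ext_equals(ext, "jpg") || ext_equals(ext, "jpeg")) return "image/jpeg";
    if (ext_equals(ext, "png"))  return "image/png";
    if (ext_equals(ext, "gif"))  return "image/gif";
    if (ext_equals(ext, "webp")) return "image/webp";
    if (ext_equals(ext, "bmp"))  return "image/bmp";
    return NULL;
}

/* Read a whole file in binary mode.
 * Returns a malloc'd buffer (caller frees) and stores its size in *out_len;
 * returns NULL on any failure. */
unsigned char *read_file(const char *path, size_t *out_len) {
    FILE *f = fopen(path, "rb");   /* "rb": no newline translation */
    if (f == NULL) return NULL;
    unsigned char *buf = NULL;
    if (fseek(f, 0, SEEK_END) == 0) {
        long size = ftell(f);
        if (size >= 0 && fseek(f, 0, SEEK_SET) == 0) {
            buf = malloc((size_t)size + 1);  /* +1 so an empty file still allocates */
            if (buf != NULL && fread(buf, 1, (size_t)size, f) != (size_t)size) {
                free(buf);                   /* short read: treat as failure */
                buf = NULL;
            }
            if (buf != NULL) *out_len = (size_t)size;
        }
    }
    fclose(f);
    return buf;
}
```

The buffer, its length, and the detected media type are the pieces one would then hand to `adam_history_attach`. That call is left out here because its exact signature is not shown in this document.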