
Local Vision

Demonstrates multimodal image understanding using a local vision model. The program loads a GGUF language model and its companion multimodal projector (mmproj), reads an image file from disk, attaches it to the conversation, and asks the model to describe what it sees. Everything runs offline.

Models

You need two GGUF files: the language model and its vision projector (mmproj).

Gemma 3 (recommended) — ggml-org/gemma-3

| Model | Size | Links |
| --- | --- | --- |
| Gemma 3 4B | ~2.4 GB (Q4_K_S) | model + mmproj |
| Gemma 3 12B | ~7.1 GB (Q4_K_M) | model + mmproj |

Other vision models: ggml-org/multimodal GGUFs

Any model supported by llama.cpp's mtmd library works: LLaVA, MiniCPM-V, Qwen-VL, InternVL, Pixtral, etc.

Build & Run

make

# Requires a vision-capable GGUF model and its mmproj file
./local-vision model.gguf mmproj.gguf photo.jpg

If no image path is given, the program defaults to image.jpg in the current directory.
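
The argument handling described above could look roughly like the following. This is a minimal sketch, not the example's actual source: the `vision_args` struct and `parse_vision_args` helper are hypothetical names, and the assumption is that the model and mmproj paths are required while the image path is optional.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical container for the three command-line paths. */
typedef struct {
    const char *model_path;   /* GGUF language model */
    const char *mmproj_path;  /* GGUF multimodal projector */
    const char *image_path;   /* image to describe */
} vision_args;

/* Fill `out` from argv. Returns 0 on success, -1 if the two required
 * paths are missing. The image path falls back to "image.jpg". */
static int parse_vision_args(int argc, char **argv, vision_args *out) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s model.gguf mmproj.gguf [image]\n",
                argc > 0 ? argv[0] : "local-vision");
        return -1;
    }
    out->model_path = argv[1];
    out->mmproj_path = argv[2];
    out->image_path = (argc > 3) ? argv[3] : "image.jpg";
    return 0;
}
```

Keeping the fallback in one place means the rest of the program can treat `image_path` as always set.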

What It Demonstrates

  • Configuring a multimodal local model with adam_settings_set_local and adam_settings_set_mmproj
  • Loading and attaching image data with adam_history_attach
  • Auto-detecting image media type from file extension
  • Reading binary files and passing raw bytes to the library
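
The last two points can be illustrated with plain standard C. The sketch below is an assumption about how such helpers might be written, not the example's actual code: `media_type_from_path` and `read_file_bytes` are hypothetical names, and the list of recognized extensions is illustrative. The call that hands the bytes to the library (adam_history_attach) is deliberately left out, since its signature is not shown in this README.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive equality using only standard C (no POSIX strcasecmp). */
static int equals_ignore_case(const char *a, const char *b) {
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Hypothetical helper: map a file extension to an image media type.
 * Returns NULL when there is no extension or it is not recognized. */
static const char *media_type_from_path(const char *path) {
    const char *dot = strrchr(path, '.');
    if (!dot) return NULL;
    const char *ext = dot + 1;
    if (equals_ignore_case(ext, "jpg") || equals_ignore_case(ext, "jpeg"))
        return "image/jpeg";
    if (equals_ignore_case(ext, "png")) return "image/png";
    if (equals_ignore_case(ext, "gif")) return "image/gif";
    if (equals_ignore_case(ext, "webp")) return "image/webp";
    if (equals_ignore_case(ext, "bmp")) return "image/bmp";
    return NULL;
}

/* Hypothetical helper: read a whole file in binary mode into a heap buffer.
 * On success returns the buffer (caller frees) and stores its length in
 * *out_len; returns NULL on open, read, or allocation failure. */
static unsigned char *read_file_bytes(const char *path, size_t *out_len) {
    FILE *f = fopen(path, "rb"); /* "rb" avoids newline translation */
    if (!f) return NULL;

    unsigned char *buf = NULL;
    size_t len = 0, cap = 0;
    for (;;) {
        if (len == cap) { /* grow the buffer geometrically when full */
            size_t new_cap = cap ? cap * 2 : 65536;
            unsigned char *tmp = realloc(buf, new_cap);
            if (!tmp) {
                free(buf);
                fclose(f);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        size_t n = fread(buf + len, 1, cap - len, f);
        len += n;
        if (n == 0) break; /* end of file or read error */
    }
    if (ferror(f)) {
        free(buf);
        fclose(f);
        return NULL;
    }
    fclose(f);
    *out_len = len;
    return buf;
}
```

Reading in chunks rather than seeking to the end to learn the size keeps the helper working on streams where the size is not known up front. The resulting buffer, its length, and the detected media type are the three pieces a caller would then pass on when attaching the image.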