Local Vision

Demonstrates multimodal image understanding using a local vision model. The program loads a GGUF language model and its companion multimodal projector (mmproj), reads an image file from disk, attaches it to the conversation, and asks the model to describe what it sees. Everything runs offline.

Models

You need two GGUF files: the language model and its vision projector (mmproj).

Gemma 3 (recommended): ggml-org/gemma-3

Model	Size	Links
Gemma 3 4B	~2.4 GB (Q4_K_S)	model + mmproj
Gemma 3 12B	~7.1 GB (Q4_K_M)	model + mmproj

Other vision models: ggml-org/multimodal GGUFs

Any model supported by llama.cpp's mtmd library works: LLaVA, MiniCPM-V, Qwen-VL, InternVL, Pixtral, etc.

Build & Run

```sh
make

# Requires a vision-capable GGUF model and its mmproj file
./local-vision model.gguf mmproj.gguf photo.jpg
```

If no image path is given, it defaults to image.jpg in the current directory.
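The argument handling described above can be sketched in plain C. This is a hypothetical illustration, not the code in main.c. The helper names `resolve_image_path` and `check_args` are invented here, and the argument order (model, mmproj, optional image) is taken from the run command shown above.

```c
#include <stdio.h>

/* Hypothetical helper (not taken from main.c): returns the image path
 * from argv[3], or falls back to "image.jpg" in the current directory
 * when no image argument is given. */
static const char *resolve_image_path(int argc, char **argv)
{
    return argc > 3 ? argv[3] : "image.jpg";
}

/* Hypothetical helper: the model and mmproj paths are required.
 * Returns 0 if both are present; otherwise prints usage to stderr
 * and returns 1. */
static int check_args(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s model.gguf mmproj.gguf [image]\n",
                argc > 0 ? argv[0] : "local-vision");
        return 1;
    }
    return 0;
}
```

With this layout, `./local-vision model.gguf mmproj.gguf` would read `image.jpg`, while adding a third argument overrides it.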

What It Demonstrates

- Configuring a multimodal local model with adam_settings_set_local and adam_settings_set_mmproj
- Loading and attaching image data with adam_history_attach
- Auto-detecting image media type from file extension
- Reading binary files and passing raw bytes to the library
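The last two items, media-type detection and binary file reading, can be sketched in standard C with no dependency on the adam library. This is a hypothetical illustration rather than the code in main.c: the helper names and the extension-to-MIME-type table are assumptions, and the final step of handing the buffer to `adam_history_attach` is deliberately left out because its signature is not documented in this README.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive equality of two NUL-terminated strings. */
static int ext_equals(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Hypothetical helper: guesses an image media type from the file
 * extension. The table below is an assumption. Returns NULL when the
 * extension is missing or unrecognized. */
static const char *media_type_for_path(const char *path)
{
    const char *slash = strrchr(path, '/');
    const char *dot = strrchr(path, '.');
    /* No extension, or the last '.' belongs to a directory name. */
    if (dot == NULL || (slash != NULL && dot < slash))
        return NULL;
    const char *ext = dot + 1;
    if (ext_equals(ext, "jpg") || ext_equals(ext, "jpeg")) return "image/jpeg";
    if (ext_equals(ext, "png"))  return "image/png";
    if (ext_equals(ext, "gif"))  return "image/gif";
    if (ext_equals(ext, "webp")) return "image/webp";
    return NULL;
}

/* Hypothetical helper: reads a whole file in binary mode. Returns a
 * malloc'd buffer (caller frees) and stores its length in *out_len,
 * or returns NULL on any failure. */
static unsigned char *read_file_bytes(const char *path, size_t *out_len)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0 || fseek(f, 0, SEEK_SET) != 0) { fclose(f); return NULL; }
    /* Allocate at least one byte so an empty file still yields a buffer. */
    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf == NULL) { fclose(f); return NULL; }
    size_t n = fread(buf, 1, (size_t)size, f);
    fclose(f);
    if (n != (size_t)size) { free(buf); return NULL; }
    *out_len = n;
    return buf;
}
```

Opening the file with `"rb"` keeps the bytes untouched on platforms that translate line endings in text mode. Returning NULL for an unknown extension lets the caller reject unsupported files instead of attaching them under a guessed type.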
