[mtmd] Add PaliGemma2 support (SigLIP + Gemma2 backbone) #22528
shichiachi3-cyber wants to merge 5 commits into ggml-org:master
Conversation
Adds PaliGemma2 (SigLIP + Gemma2 backbone) to llama.cpp / libmtmd:

- gguf-py: add PALIGEMMA2 to VisionProjectorType and tensor mapping
- convert_hf_to_gguf.py: add PaliGemma2VisionModel (mmproj) and PaliGemma2TextModel (Gemma2 with language_model.* prefix strip)
- clip-impl.h: add PROJECTOR_TYPE_PALIGEMMA2 enum + TN_MM_INP_PROJ_B
- clip-model.h: add mm_input_proj_b for the linear projector bias
- clip.cpp: weight loading, n_patches (no pooling), n_mmproj_embd
- siglip.cpp: add PALIGEMMA2 projector branch (direct linear, no pool)
- mtmd.cpp: enable non-causal attention for image prefix tokens

PaliGemma2 uses a simpler projector than Gemma3 (linear only, no pool2d or soft_emb_norm). Image prefix tokens require bidirectional attention, handled via the existing llama_set_causal_attn(false) path already used by Gemma3 (PR ggml-org#12615). Supports paligemma2-3b-pt-224 (256 tokens) and paligemma2-3b-pt-448 (1024 tokens) out of the box.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Max Shih (Recursia Lab) <shichiachi3@gmail.com>
Extend convert_hf_to_gguf.py for PaliGemmaForConditionalGeneration:

- Gemma2Model.modify_tensors: strip the language_model.* prefix and skip non-LM tensors (vision_tower.*, multi_modal_projector.*) so the model can be converted from PaliGemmaForConditionalGeneration checkpoints; safe for standalone Gemma2ForCausalLM models (no-op when the prefix is absent)
- Gemma2Model.set_vocab: fall back to _set_vocab_gpt2 when tokenizer.model is absent, matching Gemma3Model behaviour (PaliGemma2 ships tokenizer.json)
- PaliGemma2VisionModel: MmprojModel for the SigLIP encoder + linear projector
- PaliGemma2TextModel: registered under PaliGemmaForConditionalGeneration with explicit GEMMA2 arch for documentation; actual conversion is routed through Gemma2Model via the text_config.architectures re-dispatch path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Max Shih (Recursia Lab) <shichiachi3@gmail.com>
Four fixes found during end-to-end testing:

1. clip.cpp load_hparams: add a PROJECTOR_TYPE_PALIGEMMA2 case (n_merge=1, bilinear resize); this was previously falling into the default throw
2. mtmd.cpp init_vision: add a PROJECTOR_TYPE_PALIGEMMA2 case using mtmd_image_preprocessor_fixed_size, with no img_beg/img_end wrapper tokens (PaliGemma2 injects the image as raw prefix tokens)
3. siglip.cpp projector: remove an incorrect ggml_transpose on mm_input_proj_w; the weight is stored as [in=1152, out=2304] in ggml format, so ggml_mul_mat(w, cur) already computes w^T*cur correctly
4. clip.cpp set_input: add PROJECTOR_TYPE_PALIGEMMA2 to the "do nothing" case (standard SigLIP, no extra position inputs)

After these fixes, end-to-end inference runs successfully:

- SigLIP encodes a 224px image → 256 tokens ✓
- Projector maps 1152d → 2304d (Gemma2 hidden size) ✓
- Gemma2 LM receives and processes image + text tokens ✓
- EXIT 0, no assertions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Max Shih (Recursia Lab) <shichiachi3@gmail.com>
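For orientation, here is a minimal sketch of what the projector amounts to, assuming the tensor names from the PR description (mm_input_proj_w, mm_input_proj_b) and omitting the surrounding graph-builder plumbing in siglip.cpp; this is an illustration, not the verbatim diff:

```cpp
#include "ggml.h"

// PaliGemma2 projector: a single linear map from the SigLIP hidden size
// (1152) to the Gemma2 hidden size (2304). No pool2d, no soft_emb_norm.
static ggml_tensor * build_paligemma2_projector(
        ggml_context * ctx0,
        ggml_tensor  * cur,              // [1152, n_patches] SigLIP output
        ggml_tensor  * mm_input_proj_w,  // stored [in=1152, out=2304] in ggml layout
        ggml_tensor  * mm_input_proj_b)  // optional bias (TN_MM_INP_PROJ_B)
{
    // ggml_mul_mat(w, cur) computes w^T * cur, so no explicit ggml_transpose
    // is needed (fix 3 above).
    cur = ggml_mul_mat(ctx0, mm_input_proj_w, cur); // -> [2304, n_patches]
    if (mm_input_proj_b) {
        cur = ggml_add(ctx0, cur, mm_input_proj_b);
    }
    return cur;
}
```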
…2 PT)
mtmd-cli previously called exit(1) when a model has no chat template,
which blocked PaliGemma2 PT (and similar pre-trained VLMs) from running.
Add raw_prompt mode: when no chat template is available and no
--chat-template override is given, warn instead of exiting and bypass
common_chat_format_single, passing the user content directly to
mtmd_tokenize. This matches the pre-training format of PaliGemma2 PT:
[256 image tokens] \n [task_prefix] [completion...]
Example (caption):
llama-mtmd-cli -m paligemma2-text.gguf --mmproj paligemma2-mmproj.gguf \
--image photo.jpg -p $'\ncaption en '
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Max Shih (Recursia Lab) <shichiachi3@gmail.com>
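For readers skimming the diff, a minimal sketch of the fallback described above; the helper and variable names here are illustrative, and only the control flow (warn and bypass common_chat_format_single rather than exit(1)) comes from the commit:

```cpp
#include <cstdio>
#include <string>

// Stand-in for the existing common_chat_format_single path (hypothetical).
std::string apply_chat_template(const std::string & user_prompt);

// Hypothetical shape of the mtmd-cli prompt path with raw_prompt mode.
std::string build_prompt(const std::string & user_prompt,
                         bool model_has_template,
                         const std::string & template_override) {
    if (!model_has_template && template_override.empty()) {
        // previously: exit(1); now: warn and pass the prompt through untouched
        fprintf(stderr, "warning: model has no chat template; "
                        "running in raw prompt mode\n");
        return user_prompt; // goes straight to mtmd_tokenize
    }
    return apply_chat_template(user_prompt); // existing templated path
}
```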
…ROJ_B

- convert_hf_to_gguf.py: remove PaliGemma2TextModel (never instantiated; convert_hf main() re-dispatches PaliGemmaForConditionalGeneration to Gemma2Model via text_config.architectures = ["Gemma2ForCausalLM"])
- clip-impl.h: simplify TN_MM_INP_PROJ_B to a plain string constant
- clip.cpp: use TN_MM_INP_PROJ_B directly without string_format()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Max Shih (Recursia Lab) <shichiachi3@gmail.com>
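A sketch of what the simplified constant might look like; the exact string literal is an assumption following llama.cpp's existing mm.input_projection naming, and only the change from a string_format() template to a plain constant is taken from the commit:

```cpp
// clip-impl.h (sketch): plain string constant, usable directly without
// string_format(). The literal value here is an assumption, not from the diff.
#define TN_MM_INP_PROJ_B "mm.input_projection.bias"
```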
Hi @shichiachi3-cyber, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Thanks for the heads up from the bot. To clarify explicitly: this PR was developed with AI coding assistance (Claude Sonnet 4.6 / Anthropic), and the commit messages include Co-Authored-By trailers accordingly. As for my role as the human author: the AI served as a coding pair-programmer, implementing changes I directed, not making autonomous decisions. Hope this clarifies things. Happy to answer any questions.
Overview
Adds native support for PaliGemma2 (https://huggingface.co/collections/google/paligemma2-release-67500e1e1dbfdd4dee27ba48), a SigLIP vision encoder + Gemma2 backbone, to llama.cpp / libmtmd. PaliGemma2 is Google's vision-language model family (3B/10B/28B, 224px/448px) designed for OCR, VQA, image captioning, object detection, and document understanding.
Architecture
PaliGemma2 is simpler than Gemma3 vision:

- The projector is a single linear layer (no pool2d, no soft_emb_norm)
- Image prefix tokens use bidirectional attention, via the existing llama_set_causal_attn(false) path (see the sketch below)
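For illustration, a minimal sketch (not the verbatim mtmd.cpp code) of how that path works; lctx and the batch handling around it are assumed:

```cpp
// Image prefix tokens must attend to each other, so causal masking is
// disabled around the image batch, using the same mechanism Gemma3 relies on
// (PR ggml-org#12615). llama_set_causal_attn() is the existing llama.h API.
llama_set_causal_attn(lctx, false); // bidirectional attention for the image prefix
// ... llama_decode() the image embedding batch here ...
llama_set_causal_attn(lctx, true);  // restore causal attention for text generation
```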
Supported variants:
| Variant | Resolution | Image tokens |
|---|---|---|
| paligemma2-3b-pt-224, paligemma2-3b-mix-224, … | 224px | 256 (16×16) |
| paligemma2-3b-pt-448, paligemma2-10b-pt-448, … | 448px | 1024 (32×32) |
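As a sanity check, the token counts in the table follow from SigLIP's 14-pixel patch size; the patch size is a property of the SigLIP-So400m encoder, stated here as background rather than taken from this diff:

```cpp
// Patch-grid arithmetic for the two supported resolutions.
// Patch size 14 is the SigLIP-So400m default (assumption, not from the diff).
constexpr int patch_size = 14;
static_assert((224 / patch_size) * (224 / patch_size) == 256,  "224px -> 16x16 grid = 256 tokens");
static_assert((448 / patch_size) * (448 / patch_size) == 1024, "448px -> 32x32 grid = 1024 tokens");
```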
Changes
GGUF conversion (convert_hf_to_gguf.py)

- PaliGemma2VisionModel: MmprojModel for the SigLIP encoder + linear projector
- Gemma2Model: strip the language_model.* prefix and skip vision tensors (vision_tower.*, multi_modal_projector.*) so the text model converts directly from PaliGemmaForConditionalGeneration checkpoints

GGUF types (gguf-py/)

- Add PALIGEMMA2 to VisionProjectorType and the corresponding tensor mapping

libmtmd (tools/mtmd/)

- clip-impl.h: PROJECTOR_TYPE_PALIGEMMA2 enum + TN_MM_INP_PROJ_B constant
- clip-model.h / clip.cpp: mm_input_proj_b and its weight loading; n_patches (no pooling), n_mmproj_embd
- siglip.cpp: PALIGEMMA2 projector branch (direct linear, no pool)
- mtmd.cpp: fixed-size image preprocessing, no img_beg/img_end wrapper tokens, non-causal attention for the image prefix tokens
Usage
Step 1: Convert (requires copying tokenizer.model from gemma-2-2b if it is not present)
python convert_hf_to_gguf.py google/paligemma2-3b-mix-224 \
--mmproj --outfile pali2-mmproj.gguf
python convert_hf_to_gguf.py google/paligemma2-3b-mix-224 \
--outtype bf16 --outfile pali2-text.gguf
Step 2: Run inference (prefix completion format, no chat template)
./llama-mtmd-cli -m pali2-text.gguf --mmproj pali2-mmproj.gguf \
--image photo.jpg -p $'\nanswer en What is in this image? '
Caption
./llama-mtmd-cli -m pali2-text.gguf --mmproj pali2-mmproj.gguf \
--image photo.jpg -p $'\ncaption en '
Notes
PaliGemma2 uses a raw prefix-completion prompt format, so no chat template is required; the CLI now warns instead of exiting when no template is found.
Testing
Verified with paligemma2-3b-pt-224 and paligemma2-3b-mix-224 on CPU (GCP e2, no GPU).
cc @ngxson (SigLIP/Gemma3 author)