
[mtmd] Add PaliGemma2 support (SigLIP + Gemma2 backbone)#22528

Open
shichiachi3-cyber wants to merge 5 commits into ggml-org:master from shichiachi3-cyber:feat/paligemma2-support

Conversation

@shichiachi3-cyber

Overview

Adds native support for PaliGemma2 (SigLIP vision encoder + Gemma2 backbone) to llama.cpp / libmtmd.
Model collection: https://huggingface.co/collections/google/paligemma2-release-67500e1e1dbfdd4dee27ba48

PaliGemma2 is Google's vision-language model family (3B/10B/28B, 224px/448px) designed for OCR, VQA, image
captioning, object detection, and document understanding.

Architecture

PaliGemma2's vision stack is simpler than Gemma3's:

  • Vision encoder: SigLIP (already in llama.cpp via siglip.cpp)
  • Projector: single nn.Linear(in=1152, out=hidden_size) with bias — no pooling, no RMS norm
  • LM backbone: Gemma2 (already in llama.cpp)
  • Attention: image prefix uses bidirectional attention (same path as Gemma3, via
    llama_set_causal_attn(false))
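
For illustration, a minimal ggml sketch of this projector (function name is illustrative; mm_input_proj_w / mm_input_proj_b are the fields added in this PR, shapes are for the 3B 224px variant):

  // Sketch only: PaliGemma2's projector is a single linear layer with bias.
  // embeddings (SigLIP output): ggml shape [n_embd = 1152, n_tokens = 256]
  // w = mm_input_proj_w: [1152, 2304]; b = mm_input_proj_b: [2304]
  static struct ggml_tensor * paligemma2_project(
          struct ggml_context * ctx,
          struct ggml_tensor  * embeddings,
          struct ggml_tensor  * w,
          struct ggml_tensor  * b) {
      // ggml_mul_mat(w, x) computes w^T * x, so the weight is used as stored
      // (no explicit transpose; see the third commit below)
      struct ggml_tensor * cur = ggml_mul_mat(ctx, w, embeddings);
      cur = ggml_add(ctx, cur, b); // add the bias (Gemma3's projector has none)
      return cur; // [2304, 256]: one Gemma2-sized embedding per image patch
  }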

Supported variants:

┌────────────────────────────────────────────────┬────────────┬──────────────┐
│ Variant │ Resolution │ Image tokens │
├────────────────────────────────────────────────┼────────────┼──────────────┤
│ paligemma2-3b-pt-224, paligemma2-3b-mix-224, … │ 224px │ 256 (16×16) │
├────────────────────────────────────────────────┼────────────┼──────────────┤
│ paligemma2-3b-pt-448, paligemma2-10b-pt-448, … │ 448px │ 1024 (32×32) │
└────────────────────────────────────────────────┴────────────┴──────────────┘
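
The token counts follow from the SigLIP patch grid; the encoder here is the So400m variant with a 14px patch size. A hedged sketch (helper name is illustrative):

  // image_size / patch_size gives the side of the patch grid; with no
  // pooling, every patch becomes exactly one image token:
  //   224 / 14 = 16 -> 16 * 16 =  256 tokens
  //   448 / 14 = 32 -> 32 * 32 = 1024 tokens
  static int n_image_tokens(int image_size, int patch_size /* 14 */) {
      const int per_side = image_size / patch_size;
      return per_side * per_side;
  }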

Changes

GGUF conversion (convert_hf_to_gguf.py)

  • Add PaliGemma2VisionModel (MmprojModel) for PaliGemmaForConditionalGeneration
  • Extend Gemma2Model.modify_tensors to strip language_model.* prefix from PaliGemma2 checkpoints
  • Extend Gemma2Model.set_vocab to fall back to _set_vocab_gpt2 when tokenizer.model is absent

GGUF types (gguf-py/)

  • Add PALIGEMMA2 = "paligemma2" to VisionProjectorType
  • Add multi_modal_projector.linear to V_MM_INP_PROJ tensor mapping

libmtmd (tools/mtmd/)

  • clip-impl.h: PROJECTOR_TYPE_PALIGEMMA2 enum + TN_MM_INP_PROJ_B constant
  • clip-model.h: mm_input_proj_b field (projector has bias)
  • clip.cpp: load_hparams (n_merge=1, bilinear), n_patches (no pooling), n_mmproj_embd, set_input, weight
    loading
  • siglip.cpp: PROJECTOR_TYPE_PALIGEMMA2 projector branch (direct linear + bias, no pool/norm)
  • mtmd.cpp: init_vision (fixed-size preprocessor), mtmd_decode_use_non_causal (bidirectional; see the
    sketch after this list)
  • mtmd-cli.cpp: warn instead of exit(1) when no chat template; use raw prompt mode for PT/mix models
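
For the mtmd.cpp item above, a hedged sketch of the caller-side pattern (mctx / lctx names are illustrative; this mirrors the existing Gemma3 path):

  // mtmd_decode_use_non_causal() now also returns true for PALIGEMMA2, so
  // the image prefix is decoded with bidirectional attention:
  if (mtmd_decode_use_non_causal(mctx)) {
      llama_set_causal_attn(lctx, false); // image tokens attend to each other
  }
  // ... decode the batch of image embeddings ...
  llama_set_causal_attn(lctx, true);      // restore causal attention for text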

Usage

Step 1: Convert (copy tokenizer.model from google/gemma-2-2b if it is not present)

python convert_hf_to_gguf.py google/paligemma2-3b-mix-224 \
--mmproj --outfile pali2-mmproj.gguf
python convert_hf_to_gguf.py google/paligemma2-3b-mix-224 \
--outtype bf16 --outfile pali2-text.gguf

Step 2: Run inference (prefix completion format, no chat template)

./llama-mtmd-cli -m pali2-text.gguf --mmproj pali2-mmproj.gguf \
--image photo.jpg -p $'\nanswer en What is in this image? '

Caption

./llama-mtmd-cli -m pali2-text.gguf --mmproj pali2-mmproj.gguf \
--image photo.jpg -p $'\ncaption en '

Notes

  • PaliGemma2 uses prefix completion format (not chat/instruction format). The --chat-template flag is not
    required; the CLI now warns instead of exiting when no template is found.
  • tokenizer.model is not included in PaliGemma2's HF repo; use one from google/gemma-2-2b (same vocabulary).
  • This PR closes Add PaliGemma Support #7553, which attempted PaliGemma support using the pre-libmtmd architecture.

Testing

Verified with paligemma2-3b-pt-224 and paligemma2-3b-mix-224 on CPU (GCP e2, no GPU):

  • GGUF conversion: mmproj 833MB, text 5.2GB ✓
  • SigLIP shape: [1, 256, 1152], matches HuggingFace transformers output ✓
  • Projector shape: [256, 2304], numerically verified against HF weights ✓
  • 448px: GGUF position_embd.weight [1152, 1024] (1024 tokens) ✓
  • paligemma2-3b-mix-224 correctly identifies colors: red→"red", blue→"blue, 0" ✓
  • Full pipeline: EXIT 0, image encoded in ~8s (CPU), 256 tokens decoded ✓

cc @ngxson (SigLIP/Gemma3 author)

shichiachi3-cyber and others added 5 commits April 29, 2026 22:36
Adds PaliGemma2 (SigLIP + Gemma2 backbone) to llama.cpp / libmtmd:

- gguf-py: add PALIGEMMA2 to VisionProjectorType and tensor mapping
- convert_hf_to_gguf.py: add PaliGemma2VisionModel (mmproj) and
  PaliGemma2TextModel (Gemma2 with language_model.* prefix strip)
- clip-impl.h: add PROJECTOR_TYPE_PALIGEMMA2 enum + TN_MM_INP_PROJ_B
- clip-model.h: add mm_input_proj_b for linear projector bias
- clip.cpp: weight loading, n_patches (no pooling), n_mmproj_embd
- siglip.cpp: add PALIGEMMA2 projector branch (direct linear, no pool)
- mtmd.cpp: enable non-causal attention for image prefix tokens

PaliGemma2 uses a simpler projector than Gemma3 (linear only, no
pool2d or soft_emb_norm). Image prefix tokens require bidirectional
attention, handled via the existing llama_set_causal_attn(false) path
already used by Gemma3 (PR ggml-org#12615).

Supports paligemma2-3b-pt-224 (256 tokens) and paligemma2-3b-pt-448
(1024 tokens) out of the box.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Max Shih (Recursia Lab) <shichiachi3@gmail.com>
Extend convert_hf_to_gguf.py for PaliGemmaForConditionalGeneration:

- Gemma2Model.modify_tensors: strip language_model.* prefix and skip
  non-LM tensors (vision_tower.*, multi_modal_projector.*) so the model
  can be converted from PaliGemmaForConditionalGeneration checkpoints;
  safe for standalone Gemma2ForCausalLM models (no-op when prefix absent)
- Gemma2Model.set_vocab: fall back to _set_vocab_gpt2 when tokenizer.model
  is absent, matching Gemma3Model behaviour (PaliGemma2 ships tokenizer.json)
- PaliGemma2VisionModel: MmprojModel for the SigLIP + linear projector
- PaliGemma2TextModel: registered under PaliGemmaForConditionalGeneration
  with explicit GEMMA2 arch for documentation; actual conversion is routed
  through Gemma2Model via the text_config.architectures re-dispatch path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Max Shih (Recursia Lab) <shichiachi3@gmail.com>
Four fixes found during end-to-end testing:

1. clip.cpp load_hparams: add PROJECTOR_TYPE_PALIGEMMA2 case
   (n_merge=1, bilinear resize) — was falling into default throw

2. mtmd.cpp init_vision: add PROJECTOR_TYPE_PALIGEMMA2 case
   using mtmd_image_preprocessor_fixed_size, no img_beg/img_end
   wrapper tokens (PaliGemma2 injects image as raw prefix tokens)

3. siglip.cpp projector: remove incorrect ggml_transpose on
   mm_input_proj_w; weight stored as [in=1152, out=2304] in ggml
   format — ggml_mul_mat(w, cur) already computes w^T*cur correctly

4. clip.cpp set_input: add PROJECTOR_TYPE_PALIGEMMA2 to the
   "do nothing" case (standard SigLIP, no extra position inputs)

After these fixes, end-to-end inference runs successfully:
  - SigLIP encodes 224px image → 256 tokens ✓
  - Projector maps 1152d → 2304d (Gemma2 hidden size) ✓
  - Gemma2 LM receives and processes image + text tokens ✓
  - EXIT 0, no assertions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Max Shih (Recursia Lab) <shichiachi3@gmail.com>
…2 PT)

mtmd-cli previously called exit(1) when a model has no chat template,
which blocked PaliGemma2 PT (and similar pre-trained VLMs) from running.

Add raw_prompt mode: when no chat template is available and no
--chat-template override is given, warn instead of exit and bypass
common_chat_format_single, passing the user content directly to
mtmd_tokenize. This matches the pre-training format of PaliGemma2 PT:

  [256 image tokens] \n [task_prefix] [completion...]

Example (caption):
  llama-mtmd-cli -m paligemma2-text.gguf --mmproj paligemma2-mmproj.gguf \
    --image photo.jpg -p $'\ncaption en '

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Max Shih (Recursia Lab) <shichiachi3@gmail.com>
…ROJ_B

- convert_hf_to_gguf.py: remove PaliGemma2TextModel (never instantiated;
  convert_hf main() re-dispatches PaliGemmaForConditionalGeneration to
  Gemma2Model via text_config.architectures = ["Gemma2ForCausalLM"])
- clip-impl.h: simplify TN_MM_INP_PROJ_B to a plain string constant
- clip.cpp: use TN_MM_INP_PROJ_B directly without string_format()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Max Shih (Recursia Lab) <shichiachi3@gmail.com>
@shichiachi3-cyber shichiachi3-cyber requested review from a team and CISC as code owners April 29, 2026 17:04
@ggml-gh-bot

ggml-gh-bot Bot commented Apr 29, 2026

Hi @shichiachi3-cyber, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@shichiachi3-cyber
Author

Thanks for the heads-up from the bot. To clarify explicitly:

This PR was developed with AI coding assistance (Claude Sonnet 4.6 / Anthropic). The commit messages include
Co-Authored-By: Claude Sonnet 4.6 for transparency.

My role as the human author:

  • Architectural decisions and design choices (following Gemma3 vision as template, choosing fixed_size
    preprocessor, bidirectional attention approach)
  • Debugging and root-cause analysis of all 8 runtime issues encountered during testing (incorrect transpose
    in siglip.cpp, missing switch cases in load_hparams/init_vision/set_input, etc.)
  • End-to-end testing and validation (GGUF conversion, SigLIP shape verification against HuggingFace
    transformers, 224px/448px token count verification)
  • All commits are DCO-signed by me (Signed-off-by: Max Shih (Recursia Lab))

The AI served as a coding pair-programmer — implementing changes I directed, not making autonomous decisions.
All code was reviewed before commit.

Hope this clarifies things. Happy to answer any questions.


Labels

examples, python
