[OpenVINO] Support GLM-Edge-V models #1791
Open
openvino-agent wants to merge 3 commits into
GLM-Edge-V is a vision-language model whose config reports model_type="glm" (same as the text-only GLM decoder) but carries a nested `vision_config` (SigLIP encoder + conv/GLU adapter with learned boi/eoi parameters). It merges 578 image embeddings into the text stream by replacing `boi_token_id` (59256) placeholders, Gemma3-style.

Implementation:
- model_patcher.py: GlmEdgeVImageEmbeddingsModelPatcher (vision branch as a standalone graph) and GlmEdgeVLMModelPatcher (bypass the baked-in vision merge, run the decoder stack on inputs_embeds with a stateful KV cache).
- model_configs.py: GLMEdgeVOpenVINOConfig, registered for `glm` + the image-text-to-text task only (text-only GLM export path is untouched). The text-embeddings/language submodels reuse the plain GLM text-generation config.
- modeling_visual_language.py: _OVGlmEdgeVForCausalLM runtime class (vision embeddings, masked_scatter merge on boi_token_id, chat-template preprocessing) and dispatch entry for model_type "glm".
- utils.py: is_multi_modal_text_generation_model() gates multimodal export routing on (model_type, vision_config) so "glm" only routes multimodal when a vision tower is present; wired into convert.py and __main__.py.

Docs: add GLM-Edge-V to the supported models list.

Tests: add glm_edge_v fixture + arch to the visual-causal-LM integration tests (verified token-for-token against transformers on a tiny-random model).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
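The routing rule described for utils.py can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `routes_multimodal`, the two registries, and the placeholder `vision_config` contents are all hypothetical. The point is only that a model type shared by text-only and vision checkpoints is routed to the multimodal path when, and only when, a nested `vision_config` is present.

```python
from types import SimpleNamespace

# Hypothetical registries, for illustration only (not taken from the PR).
ALWAYS_MULTIMODAL = {"llava", "gemma3"}  # model types that are always vision-language
VISION_GATED = {"glm"}                   # model types shared by text-only and VLM checkpoints


def routes_multimodal(config) -> bool:
    """Return True when export should take the multimodal path (hypothetical helper)."""
    model_type = getattr(config, "model_type", None)
    if model_type in ALWAYS_MULTIMODAL:
        return True
    if model_type in VISION_GATED:
        # Shared model_type: only a nested vision_config marks the VLM variant.
        return getattr(config, "vision_config", None) is not None
    return False


# Stand-in config objects; the vision_config payload is a placeholder.
text_only_glm = SimpleNamespace(model_type="glm")
glm_edge_v = SimpleNamespace(model_type="glm", vision_config={"placeholder": True})

print(routes_multimodal(text_only_glm), routes_multimodal(glm_edge_v))
```

Gating on the pair rather than on `model_type` alone is what keeps the existing text-only GLM export path untouched.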
The default export loads the bf16 checkpoint and traces it through the 16-bit path (`__make_16bit_traceable`: ModuleExtension wrapping + bf16 boi/eoi parameters). This corrupted the numerically-fragile SigLIP vision tower (image-feature max-abs diff vs transformers ~1.37), producing OpenVINO outputs that diverged from transformers despite the language model being fine. transformers runs bf16/fp16 fine because PyTorch upcasts bf16 matmuls/norms to fp32 internally; OpenVINO's 16-bit trace does not, so the vision features drift.

Force fp32 export for GLM-Edge-V (model_type="glm" + vision_config). The vision tower is then exact (diff 4e-5) and an fp32 export matches transformers token-for-token; the language model is still int8-compressed by the default size-based weight compression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
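To make the accumulation argument concrete, here is a toy sketch in plain Python. It is not what PyTorch or OpenVINO execute: `to_bf16` is a hand-written bfloat16 rounding helper, the "wide" accumulator is an ordinary Python float standing in for fp32, and the input values are arbitrary. It only shows that a dot product over bf16 inputs stays close to the exact result when sums are kept wide, and drifts when every partial sum is rounded back to bf16.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(xs, ws, narrow_accumulator: bool) -> float:
    """Dot product over bf16-rounded inputs with a wide or bf16 accumulator."""
    acc = 0.0
    for x, w in zip(xs, ws):
        term = to_bf16(x) * to_bf16(w)
        # Narrow path: round every partial sum back to bf16.
        acc = to_bf16(acc + term) if narrow_accumulator else acc + term
    return acc


# Arbitrary toy inputs: 1000 products of 0.1 * 0.1, exact sum 10.0.
xs = [0.1] * 1000
ws = [0.1] * 1000
exact = 10.0

err_wide = abs(dot(xs, ws, narrow_accumulator=False) - exact)
err_narrow = abs(dot(xs, ws, narrow_accumulator=True) - exact)
print(f"wide accumulator error:   {err_wide:.4f}")
print(f"narrow accumulator error: {err_narrow:.4f}")
```

The narrow accumulator stops growing once a single term is smaller than half the spacing between adjacent bfloat16 values at the current sum, which is the same kind of loss a vision tower with long reductions is sensitive to.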
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
Set up environment:
Export the model:
optimum-cli export openvino -m zai-org/glm-edge-v-2b ./glm-edge-v-2b --task=image-text-to-text --trust-remote-code
Run inference with the model:
Fixes # (issue)
Before submitting