[OpenVINO] Support GLM-Edge-V models #1791

Open
openvino-agent wants to merge 3 commits into huggingface:main from openvino-agent:worktree-support-glm-edge-v
@openvino-agent

Contributor

What does this PR do?

Set up environment:

pip install transformers==4.48.0
pip install optimum-intel@

Export the model:

optimum-cli export openvino -m zai-org/glm-edge-v-2b ./glm-edge-v-2b --task=image-text-to-text --trust-remote-code

Run inference:

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer
from optimum.intel.openvino import OVModelForVisualCausalLM

# Load the locally exported model; the Hub id is kept for reference.
# model_id = "zai-org/glm-edge-v-2b"
model_id = "./glm-edge-v-2b"
url = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
image = Image.open(requests.get(url, stream=True).raw)

messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "describe this image"}]}]

processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Run inference in f32 and disable dynamic quantization.
f32_config = {
    "INFERENCE_PRECISION_HINT": "f32",
    "KV_CACHE_PRECISION": "f32",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": 0,
}
model = OVModelForVisualCausalLM.from_pretrained(model_id, trust_remote_code=True, ov_config=f32_config)

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, tokenize=True, return_tensors="pt"
)

generate_kwargs = {
    **inputs,
    "pixel_values": torch.tensor(processor(image).pixel_values),
}
output = model.generate(**generate_kwargs, max_new_tokens=100)
print(tokenizer.decode(output[0][len(inputs["input_ids"][0]):], skip_special_tokens=True))

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

rkazants and others added 3 commits June 14, 2026 19:48
GLM-Edge-V is a vision-language model whose config reports model_type="glm"
(same as the text-only GLM decoder) but carries a nested `vision_config`
(SigLIP encoder + conv/GLU adapter with learned boi/eoi parameters). It merges
578 image embeddings into the text stream by replacing `boi_token_id` (59256)
placeholders, Gemma3-style.
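
The placeholder replacement described above can be sketched with `torch.Tensor.masked_scatter`. This is a minimal illustration on toy tensors, not the code from this PR: the helper name `merge_image_embeddings` and the tensor sizes are assumptions made for the example, and only the placeholder id 59256 comes from the commit message.

```python
import torch

BOI_TOKEN_ID = 59256  # placeholder id quoted in the commit message


def merge_image_embeddings(input_ids, inputs_embeds, image_embeds, boi_token_id=BOI_TOKEN_ID):
    """Overwrite placeholder rows of inputs_embeds with image embeddings (hypothetical helper)."""
    # Boolean mask of placeholder positions, shape (batch, seq_len).
    placeholder_mask = input_ids == boi_token_id
    n_slots = int(placeholder_mask.sum())
    n_image_rows = image_embeds.reshape(-1, image_embeds.shape[-1]).shape[0]
    if n_slots != n_image_rows:
        raise ValueError(f"{n_slots} placeholders but {n_image_rows} image embeddings")
    # Expand the mask over the hidden dimension so whole rows are replaced.
    mask = placeholder_mask.unsqueeze(-1).expand_as(inputs_embeds)
    # masked_scatter copies source elements, in order, into the True positions.
    return inputs_embeds.masked_scatter(mask, image_embeds.to(inputs_embeds.dtype))


# Toy sizes (assumed): 5 tokens, 2 of them placeholders, hidden size 4.
input_ids = torch.tensor([[1, BOI_TOKEN_ID, BOI_TOKEN_ID, 7, 8]])
inputs_embeds = torch.zeros(1, 5, 4)
image_embeds = torch.arange(8, dtype=torch.float32).reshape(1, 2, 4)
merged = merge_image_embeddings(input_ids, inputs_embeds, image_embeds)
```

In the real model the prompt would carry 578 placeholders per image rather than 2; the count check guards against a prompt and an image tensor that disagree.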

Implementation:
- model_patcher.py: GlmEdgeVImageEmbeddingsModelPatcher (vision branch as a
  standalone graph) and GlmEdgeVLMModelPatcher (bypass the baked-in vision merge,
  run the decoder stack on inputs_embeds with a stateful KV cache).
- model_configs.py: GLMEdgeVOpenVINOConfig, registered for `glm` + the
  image-text-to-text task only (text-only GLM export path is untouched). The
  text-embeddings/language submodels reuse the plain GLM text-generation config.
- modeling_visual_language.py: _OVGlmEdgeVForCausalLM runtime class (vision
  embeddings, masked_scatter merge on boi_token_id, chat-template preprocessing)
  and dispatch entry for model_type "glm".
- utils.py: is_multi_modal_text_generation_model() gates multimodal export
  routing on (model_type, vision_config) so "glm" only routes multimodal when a
  vision tower is present; wired into convert.py and __main__.py.
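
The routing rule in the last bullet can be sketched as follows. This is a hypothetical stand-in, not the body of `is_multi_modal_text_generation_model()`: the function name `routes_as_multimodal` and both lookup tables are assumptions made for illustration. The idea it shows is that an ambiguous `model_type` such as "glm" is treated as multimodal only when the config carries a `vision_config`.

```python
from types import SimpleNamespace

# Hypothetical table: model types that are always exported as multimodal.
ALWAYS_MULTIMODAL = {"llava", "gemma3"}
# Hypothetical table: model types shared by text-only and vision variants.
MULTIMODAL_IF_VISION_CONFIG = {"glm"}


def routes_as_multimodal(config):
    """Decide whether a config should take the multimodal export path (illustrative)."""
    model_type = getattr(config, "model_type", None)
    if model_type in ALWAYS_MULTIMODAL:
        return True
    if model_type in MULTIMODAL_IF_VISION_CONFIG:
        # Only route multimodal when a vision tower is actually configured.
        return getattr(config, "vision_config", None) is not None
    return False


# Stand-in configs: a text-only GLM and a GLM-Edge-V-like config.
text_only_glm = SimpleNamespace(model_type="glm")
glm_edge_v = SimpleNamespace(model_type="glm", vision_config={"hidden_size": 8})
```

With these stand-ins, `routes_as_multimodal(text_only_glm)` is false and `routes_as_multimodal(glm_edge_v)` is true, which mirrors the statement that the text-only GLM export path is untouched.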

Docs: add GLM-Edge-V to the supported models list.
Tests: add glm_edge_v fixture + arch to the visual-causal-LM integration tests
(verified token-for-token against transformers on a tiny-random model).
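
A token-for-token check of the kind mentioned above could look like the sketch below. The helper `first_token_mismatch` is hypothetical and is not taken from the test suite; it only assumes that both backends return plain sequences of token ids.

```python
from itertools import zip_longest


def first_token_mismatch(reference_ids, candidate_ids):
    """Return the index of the first differing token, or None if the sequences match."""
    # zip_longest pads the shorter sequence with None, so a length
    # difference is reported as a mismatch at the first missing position.
    for index, (ref, cand) in enumerate(zip_longest(reference_ids, candidate_ids)):
        if ref != cand:
            return index
    return None
```

Reporting the first mismatch index, rather than a plain boolean, shows how far generation agreed before diverging, which helps separate early vision-feature errors from late sampling noise.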

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The default export loads the bf16 checkpoint and traces it through the 16-bit
path (`__make_16bit_traceable`: ModuleExtension wrapping + bf16 boi/eoi
parameters). This corrupted the numerically-fragile SigLIP vision tower
(image-feature max-abs diff vs transformers ~1.37), producing OpenVINO outputs
that diverged from transformers despite the language model being fine.

transformers runs bf16/fp16 fine because PyTorch upcasts bf16 matmuls/norms to
fp32 internally; OpenVINO's 16-bit trace does not, so the vision features drift.

Force fp32 export for GLM-Edge-V (model_type="glm" + vision_config). The vision
tower is then exact (diff 4e-5) and an fp32 export matches transformers
token-for-token; the language model is still int8-compressed by the default
size-based weight compression.
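
As a toy illustration of why keeping intermediates in 16-bit can drift, the sketch below sums the same values twice: once with a wide accumulator and once rounding the running sum to bfloat16 after every addition. It is not a reproduction of the OpenVINO trace or of the SigLIP tower; `to_bf16` and `accumulate` are helpers written for this example, and Python's float64 stands in for a wider accumulator.

```python
import struct


def to_bf16(x):
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then drop the low 16 bits of the float32 pattern.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(values, round_fn):
    """Sum values, applying round_fn to the running total after each addition."""
    acc = 0.0
    for value in values:
        acc = round_fn(acc + value)
    return acc


# Inputs are already bf16-representable, so only the accumulator differs.
values = [to_bf16(0.01)] * 1000
wide_sum = accumulate(values, lambda x: x)  # wide accumulator (float64 here)
bf16_sum = accumulate(values, to_bf16)      # accumulator rounded to bf16
drift = abs(wide_sum - bf16_sum)
```

The bf16 running sum stops growing once each increment is smaller than half the spacing between neighbouring bf16 values, so `drift` ends up large even though every input is identical in both runs.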

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
