[OpenVINO] Support GLM-Edge-V models by openvino-agent · Pull Request #1791 · huggingface/optimum-intel

openvino-agent · 2026-06-15T07:46:05Z

What does this PR do?

Set up environment:

pip install transformers==4.48.0
pip install optimum-intel@

Export the model:

optimum-cli export openvino -m zai-org/glm-edge-v-2b ./glm-edge-v-2b --task=image-text-to-text --trust-remote-code

Run inference:

import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer
from optimum.intel.openvino import OVModelForVisualCausalLM

# Load from the local directory produced by the export step above.
# model_id = "zai-org/glm-edge-v-2b"
model_id = "./glm-edge-v-2b"
url = "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"
image = Image.open(requests.get(url, stream=True).raw)

messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "describe this image"}]}]

processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Request fp32 inference and an fp32 KV cache, and disable dynamic quantization.
f32_config = {
    "INFERENCE_PRECISION_HINT": "f32",
    "KV_CACHE_PRECISION": "f32",
    "DYNAMIC_QUANTIZATION_GROUP_SIZE": 0,
}
model = OVModelForVisualCausalLM.from_pretrained(model_id, trust_remote_code=True, ov_config=f32_config)

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, tokenize=True, return_tensors="pt"
)

generate_kwargs = {
    **inputs,
    "pixel_values": torch.tensor(processor(image).pixel_values),
}
output = model.generate(**generate_kwargs, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][len(inputs["input_ids"][0]):], skip_special_tokens=True))

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

GLM-Edge-V is a vision-language model whose config reports model_type="glm" (same as the text-only GLM decoder) but carries a nested `vision_config` (SigLIP encoder + conv/GLU adapter with learned boi/eoi parameters). It merges 578 image embeddings into the text stream by replacing `boi_token_id` (59256) placeholders, Gemma3-style.

Implementation:
- model_patcher.py: GlmEdgeVImageEmbeddingsModelPatcher (vision branch as a standalone graph) and GlmEdgeVLMModelPatcher (bypass the baked-in vision merge, run the decoder stack on inputs_embeds with a stateful KV cache).
- model_configs.py: GLMEdgeVOpenVINOConfig, registered for `glm` + the image-text-to-text task only (text-only GLM export path is untouched). The text-embeddings/language submodels reuse the plain GLM text-generation config.
- modeling_visual_language.py: _OVGlmEdgeVForCausalLM runtime class (vision embeddings, masked_scatter merge on boi_token_id, chat-template preprocessing) and dispatch entry for model_type "glm".
- utils.py: is_multi_modal_text_generation_model() gates multimodal export routing on (model_type, vision_config) so "glm" only routes multimodal when a vision tower is present; wired into convert.py and __main__.py.

Docs: add GLM-Edge-V to the supported models list.

Tests: add glm_edge_v fixture + arch to the visual-causal-LM integration tests (verified token-for-token against transformers on a tiny-random model).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
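The placeholder merge described in this commit message can be illustrated with a minimal pure-Python sketch. This is not the PR's code: the runtime class is described as using a masked_scatter on tensors, whereas the hypothetical helper `merge_image_embeddings` below works on plain lists. The toy token IDs other than 59256 and the 2-dimensional embeddings are assumptions made only for the example.

```python
BOI_TOKEN_ID = 59256  # placeholder ID quoted in the PR description


def merge_image_embeddings(input_ids, text_embeds, image_embeds,
                           placeholder_id=BOI_TOKEN_ID):
    """Return a copy of text_embeds with placeholder rows swapped for image rows.

    Hypothetical helper: a list-based stand-in for a tensor masked_scatter.
    """
    # Positions in the sequence that are reserved for image embeddings.
    positions = [i for i, tok in enumerate(input_ids) if tok == placeholder_id]
    if len(positions) != len(image_embeds):
        raise ValueError(
            f"{len(positions)} placeholders but {len(image_embeds)} image embeddings"
        )
    # Copy the rows so the caller's text embeddings are left untouched.
    merged = [list(row) for row in text_embeds]
    for pos, image_row in zip(positions, image_embeds):
        merged[pos] = list(image_row)
    return merged


# Toy sequence: two text tokens around two image placeholders.
input_ids = [11, BOI_TOKEN_ID, BOI_TOKEN_ID, 12]
text_embeds = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.2, 0.2]]
image_embeds = [[1.0, 1.5], [2.0, 2.5]]

merged = merge_image_embeddings(input_ids, text_embeds, image_embeds)
print(merged)
```

Going by the description above, a real prompt would carry 578 such placeholders per image, one for each image embedding. The length check mirrors the constraint that the number of placeholders must equal the number of image embeddings.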

The default export loads the bf16 checkpoint and traces it through the 16-bit path (`__make_16bit_traceable`: ModuleExtension wrapping + bf16 boi/eoi parameters). This corrupted the numerically fragile SigLIP vision tower (image-feature max-abs diff vs transformers ~1.37), producing OpenVINO outputs that diverged from transformers despite the language model being fine.

transformers runs bf16/fp16 fine because PyTorch upcasts bf16 matmuls/norms to fp32 internally; OpenVINO's 16-bit trace does not, so the vision features drift.

Force fp32 export for GLM-Edge-V (model_type="glm" + vision_config). The vision tower is then exact (diff 4e-5) and an fp32 export matches transformers token-for-token; the language model is still int8-compressed by the default size-based weight compression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
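To make the upcasting argument in this commit message concrete, here is a toy sketch of why accumulating in 16 bits can drift while accumulating at higher precision does not. It reproduces neither the OpenVINO trace nor the SigLIP tower: the helper `to_bf16`, the input value 0.01, and the count of 1000 are assumptions chosen for illustration, and Python's double-precision floats stand in for an fp32 accumulator.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even); finite inputs only."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


values = [to_bf16(0.01)] * 1000  # inputs already stored in bf16

# Narrow path: every partial sum is rounded back to bf16.
narrow = 0.0
for v in values:
    narrow = to_bf16(narrow + v)

# Wide path: accumulate at higher precision, round once at the end.
wide = to_bf16(sum(values))

exact = sum(values)
print(f"exact={exact:.4f} narrow={narrow:.4f} wide={wide:.4f}")
```

In the narrow path the running total stops growing once the spacing between adjacent bf16 values exceeds twice the addend, because each addition then rounds back to the previous total. The wide path only pays the rounding cost once, which is the behavior the commit message attributes to PyTorch's internal upcasting.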

HuggingFaceDocBuilderDev · 2026-06-15T16:07:00Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

rkazants and others added 3 commits June 14, 2026 19:48

Add recommendation to add trust_remote_code

d60e438

openvino-agent wants to merge 3 commits into huggingface:main from openvino-agent:worktree-support-glm-edge-v

3 participants
