Add OpenVINO export support for glm4_moe_lite (GLM-4.7-Flash) #1699

Draft

openvino-agent wants to merge 1 commit into huggingface:main from openvino-agent:support/glm4_moe_lite

Conversation

@openvino-agent

What does this PR do?

Adds native OpenVINO IR export support for the glm4_moe_lite model type, which powers the GLM-4.7-Flash family of models (e.g. THUDM/GLM-4.7-Flash).

Architecture overview

Glm4MoeLiteForCausalLM is a decoder-only transformer (available in Transformers ≥ 5.0) that combines:

  • Multi-head Latent Attention (MLA) — the same low-rank (LoRA-style) KV compression used in MiniCPM3/DeepSeek, with separate qk_nope_head_dim/qk_rope_head_dim key dimensions and a smaller v_head_dim (cache shapes sketched after this list)
  • Hybrid MLP layers — alternating dense (Glm4MoeLiteMLP) and sparse Mixture-of-Experts (Glm4MoeLiteMoE / Glm4MoeLiteNaiveMoe) layers
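
For export, the practical consequence of MLA is an asymmetric KV cache: keys and values have different head dimensions. A sketch of the per-layer cache shapes the dummy input generator has to produce (dimension names from the description above; the concrete values are placeholders, not taken from the real config):

```python
# Illustrative MLA KV-cache shapes; the numbers are placeholders,
# real values come from the model config.
batch_size, num_kv_heads, past_len = 1, 4, 16
qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128

k_head_dim = qk_nope_head_dim + qk_rope_head_dim  # nope and rope parts concatenated
key_cache_shape = (batch_size, num_kv_heads, past_len, k_head_dim)    # (1, 4, 16, 192)
value_cache_shape = (batch_size, num_kv_heads, past_len, v_head_dim)  # (1, 4, 16, 128)
```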

Changes

optimum/exporters/openvino/model_configs.py

  • Import Glm4MoeLitePatcher
  • Add Glm4MoeLiteOpenVINOConfig, registered under glm4_moe_lite for the text-generation and text-generation-with-past tasks. It inherits from MiniCPM3OpenVINOConfig to reuse OVMiniCPM3DummyPastKeyValuesGenerator, which already produces the MLA-style KV cache shapes (key head dim = qk_nope_head_dim + qk_rope_head_dim, value head dim = v_head_dim); a registration sketch follows below.
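
A minimal sketch of what the registration looks like, assuming the register_in_tasks_manager decorator and the patch_model_for_export hook used by the other configs in this file; this mirrors the file's pattern rather than reproducing the exact diff:

```python
# Sketch; assumes the existing imports of model_configs.py
# (MiniCPM3OpenVINOConfig, Glm4MoeLitePatcher) are in scope.
from optimum.exporters.tasks import TasksManager

register_in_tasks_manager = TasksManager.create_register("openvino", overwrite_existing=True)

@register_in_tasks_manager(
    "glm4_moe_lite", *["text-generation", "text-generation-with-past"], library_name="transformers"
)
class Glm4MoeLiteOpenVINOConfig(MiniCPM3OpenVINOConfig):
    # Reuses OVMiniCPM3DummyPastKeyValuesGenerator via the parent class,
    # so the MLA cache shapes above come for free.
    def patch_model_for_export(self, model, model_kwargs=None):
        return Glm4MoeLitePatcher(self, model, model_kwargs=model_kwargs)
```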

optimum/exporters/openvino/model_patcher.py

  • Add glm4_moe_lite_naive_moe_forward — a fully vectorized replacement for Glm4MoeLiteNaiveMoe.forward. The original implementation drives a Python for loop over the active experts via nonzero(), which is incompatible with torch.jit.trace because the loop count varies per input. The patch replaces it with batched matrix multiplications (torch.bmm) over all experts simultaneously, producing the same graph regardless of routing decisions (see the sketch after this list).
  • Add Glm4MoeLitePatcher — patches the experts sub-module inside every sparse MoE layer during export and restores the original on exit (sketched below).
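
The following is a self-contained sketch of the vectorization idea only — stacked 3-D expert weights, a single-activation MLP instead of the model's real expert MLP, and made-up names — not the actual replacement function:

```python
import torch
import torch.nn.functional as F

def vectorized_moe(hidden_states, w_up, w_down, router_logits, top_k):
    """Trace-friendly MoE: run every expert on every token, then mask.

    hidden_states: (T, H); w_up: (E, H, I); w_down: (E, I, H).
    The real Glm4MoeLiteNaiveMoe expert MLP is simplified here.
    """
    T, H = hidden_states.shape
    E = w_up.shape[0]

    # Top-k routing, renormalized over the selected experts.
    routing = F.softmax(router_logits, dim=-1)      # (T, E)
    topk_w, topk_idx = routing.topk(top_k, dim=-1)  # (T, k)
    topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)

    # Dense (T, E) routing matrix: zero where an expert was not picked.
    # No nonzero()/loop, so the traced graph is input-independent.
    dense_w = torch.zeros(T, E, dtype=hidden_states.dtype, device=hidden_states.device)
    dense_w.scatter_(1, topk_idx, topk_w)

    # All experts on all tokens at once: (E, T, H) x (E, H, I) -> (E, T, I).
    x = hidden_states.unsqueeze(0).expand(E, -1, -1)
    expert_out = torch.bmm(F.silu(torch.bmm(x, w_up)), w_down)  # (E, T, H)

    # Weighted sum over experts; unrouted experts contribute zero.
    return torch.einsum("te,eth->th", dense_w, expert_out)
```

Unrouted experts still incur compute, which is the usual price of a static graph; for tracing purposes that trade-off is intentional.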

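The patcher itself follows the enter/exit convention of the other patchers in model_patcher.py. In this sketch the base class, the layer.mlp.experts attribute path, and the MethodType binding are all assumptions about the actual code:

```python
import types
from optimum.exporters.onnx.model_patcher import DecoderModelPatcher  # assumed base class

class Glm4MoeLitePatcher(DecoderModelPatcher):
    def __enter__(self):
        super().__enter__()
        for layer in self._model.model.layers:
            experts = getattr(layer.mlp, "experts", None)  # only sparse MoE layers have experts
            if experts is not None:
                experts._orig_forward = experts.forward
                # Bind the vectorized forward from this PR in place of the original.
                experts.forward = types.MethodType(glm4_moe_lite_naive_moe_forward, experts)

    def __exit__(self, exc_type, exc_value, traceback):
        super().__exit__(exc_type, exc_value, traceback)
        for layer in self._model.model.layers:
            experts = getattr(layer.mlp, "experts", None)
            if experts is not None and hasattr(experts, "_orig_forward"):
                experts.forward = experts._orig_forward
                del experts._orig_forward
```
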
Tests

  • tests/openvino/utils_tests.py — model ID entry for glm4_moe_lite and expected INT8 node count (42)
  • tests/openvino/test_decoder.py — add glm4_moe_lite to SUPPORTED_ARCHITECTURES (Transformers ≥ 5.0) and EXPECTED_NUM_SDPA
  • tests/openvino/test_export.py — add glm4_moe_lite export test (Transformers ≥ 5.0)
  • tests/openvino/test_exporters_cli.py — add CLI export test for text-generation-with-past (an example command follows this list)
  • tests/openvino/test_quantization.py — add to SUPPORTED_ARCHITECTURES_WITH_AUTO_COMPRESSION
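
For reference, the kind of command the CLI test exercises (model ID from the description above; the output directory name is arbitrary):

```bash
optimum-cli export openvino --model THUDM/GLM-4.7-Flash --task text-generation-with-past glm4_moe_lite_ov
```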

Docs

  • docs/source/openvino/models.mdx — list Glm4MoeLite (GLM-4.7-Flash) as a supported model

Before submitting

  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

- Add Glm4MoeLiteOpenVINOConfig in model_configs.py that inherits from
  MiniCPM3OpenVINOConfig to reuse the MLA-style PKV dummy generator
  (key head dim = qk_nope_head_dim + qk_rope_head_dim, value head dim = v_head_dim)
- Add Glm4MoeLitePatcher in model_patcher.py with a vectorized replacement
  for Glm4MoeLiteNaiveMoe.forward to avoid dynamic control flow (nonzero +
  loop over experts) that breaks torch.jit.trace
- Add glm4_moe_lite test entries to test_decoder.py, test_export.py,
  test_exporters_cli.py, test_quantization.py, and utils_tests.py
- Update docs/source/openvino/models.mdx to list Glm4MoeLite support