
[Feat] Support GLM-4.7 MTP in vLLM-ATOM plugin #722

Open

kliuae wants to merge 19 commits into ROCm:main from kliuae:kliuae/plugin_enable_glm4_mtp_merge

Conversation

@kliuae (Contributor) commented May 8, 2026

Motivation

This PR builds on top of the MTP framework in #557 and adds MTP support for the GLM-4.7 model to vLLM-ATOM.
Currently this PR contains the changes from #557, and it will become more concise once that PR is upstreamed.

Technical Details

  • Register Glm4MoeMTPModel (a registration sketch follows this list)
  • Add glm4_moe_mtp modeling
  • Fix double application of RoPE in MHA when ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=0
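
For context, registering an out-of-tree architecture with vLLM typically goes through ModelRegistry from a plugin entry point. The sketch below is illustrative only; the module path and entry-point function are assumptions, not the actual vLLM-ATOM code:

```python
# Hypothetical sketch of registering an out-of-tree MTP architecture
# with vLLM from a plugin. The module path and class location are
# illustrative assumptions, not the real vLLM-ATOM layout.
from vllm import ModelRegistry

def register_models():
    # A lazy "module:Class" string reference avoids importing the model
    # (and initializing the GPU runtime) at plugin-load time.
    ModelRegistry.register_model(
        "Glm4MoeMTPModel",
        "atom.models.glm4_moe_mtp:Glm4MoeMTPModel",  # assumed path
    )
```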

Test Plan

Accuracy test with lm_eval

Model: zai-org/GLM-4.7-FP8

Server command:

ATOM_DISABLE_VLLM_PLUGIN=0 \
ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=0 \
VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 \
  vllm serve zai-org/GLM-4.7-FP8 \
  -tp 8 \
  --max-num-seqs 1024 \
  --gpu-memory-utilization 0.9 \
  --no-enable-prefix-caching \
  --disable-uvicorn-access-log \
  --trust-remote-code \
  --load-format fastsafetensors \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --kv-cache-dtype fp8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1
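
(For reference, on vLLM builds without the dotted config syntax, the same speculative settings should be expressible as JSON: `--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'`.)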

lm_eval command

lm_eval --model local-completions \
  --model_args model=zai-org/GLM-4.7-FP8,base_url=http://localhost:8000/v1/completions,num_concurrent=64,tokenized_requests=False \
  --tasks gsm8k \
  --num_fewshot 5

Test Result

gsm8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9439|±  |0.0063|
|     |       |strict-match    |     5|exact_match|↑  |0.9439|±  |0.0063|

Submission Checklist

whx-sjtu and others added 14 commits April 23, 2026 10:49
@wuhuikx wuhuikx requested review from ganyi1996ppo and whx-sjtu May 10, 2026 03:18
kliuae and others added 3 commits May 11, 2026 07:40
@zejunchen-zejun (Collaborator)

Hi, @kliuae

Could you help resolve the conflicts here? Meanwhile, GLM-4.7 MTP should also be added to the atom-vllm nightly and benchmark workflows.

Thank you

kliuae added 2 commits May 14, 2026 06:10

logger.info(f"Construct ATOM model {model_arch} for vLLM plugin mode")
self.model = model_cls(self.atom_config)
self._adapt_mtp_layers_for_vllm()
Contributor:

You might need to skip this if the glm4 mtp layer doesn't need to mask input_embedding according to positions.

Contributor (Author):

Thanks for pointing this out. Like DeepSeek MTP, GLM-4 MTP masks its inputs at position 0, so it can use this path. (A sketch of the masking pattern follows.)
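
For readers unfamiliar with the pattern: in DeepSeek-style MTP there is no previous token to condition on at sequence start, so the input embedding rows at position 0 are zeroed before being combined with the hidden states. A minimal sketch of the idea, with illustrative names rather than the plugin's actual code:

```python
import torch

def mask_mtp_inputs(inputs_embeds: torch.Tensor,
                    positions: torch.Tensor) -> torch.Tensor:
    # DeepSeek-MTP-style masking: zero the embedding wherever the token
    # position is 0, since the MTP head has no previous token there.
    # GLM-4 MTP follows the same convention, so the shared path applies.
    inputs_embeds = inputs_embeds.clone()
    inputs_embeds[positions == 0] = 0
    return inputs_embeds
```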

self.model = model_cls(self.atom_config)
self._adapt_mtp_layers_for_vllm()
# Mirror nested attributes required by vLLM speculative decoding.
self._expose_spec_decode_attrs()
Contributor:

You might also need to skip this if the glm4 mtp layers don't need to share lm_head weights with the main model.

Contributor (Author):

lm_head is not shared between the MTP module and the main model in GLM-4, but since its inner predictor doesn't carry an lm_head, the syncing of lm_head won't be triggered. I think the current logic here doesn't affect GLM-4 MTP. (A sketch of the guard behavior is below.)
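
To make that concrete, here is a hypothetical sketch of the kind of guarded attribute mirroring being discussed, assuming the sync only fires when the drafter actually exposes an lm_head (attribute names are illustrative, not the plugin's real code):

```python
def _expose_spec_decode_attrs(self):
    # Mirror nested attributes that vLLM's speculative decoding expects
    # to find on the top-level model. "predictor" is an assumed name for
    # the inner MTP module. GLM-4 MTP's predictor carries no lm_head, so
    # the guarded sync below is a no-op for it, matching the discussion.
    predictor = getattr(self.model, "predictor", None)
    if predictor is not None and hasattr(predictor, "lm_head"):
        self.lm_head = predictor.lm_head
```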
