
[Feat] Support GLM-4.7 MTP in vLLM-ATOM plugin #722

Open

kliuae wants to merge 19 commits into ROCm:main from kliuae:kliuae/plugin_enable_glm4_mtp_merge

Conversation

@kliuae (Contributor) commented May 8, 2026

Motivation

This PR builds on top of the MTP framework in #557 and adds MTP support for the GLM-4.7 model to vLLM-ATOM.
Currently this PR contains the changes from #557, and it will become more concise once that PR is upstreamed.

Technical Details

  • Register Glm4MoeMTPModel (a registration sketch follows this list)
  • Add glm4_moe_mtp modeling
  • Fix double application of RoPE in MHA when ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=0
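
For context, registering an out-of-tree architecture with vLLM typically goes through ModelRegistry from a plugin entry point. The sketch below is illustrative only; the module path and entry-point function are assumptions, not the actual vLLM-ATOM code:

```python
# Hypothetical sketch of registering an out-of-tree MTP architecture
# with vLLM from a plugin. The module path and class location are
# illustrative assumptions, not the real vLLM-ATOM layout.
from vllm import ModelRegistry

def register_models():
    # A lazy "module:Class" string reference avoids importing the model
    # (and initializing the GPU runtime) at plugin-load time.
    ModelRegistry.register_model(
        "Glm4MoeMTPModel",
        "atom.models.glm4_moe_mtp:Glm4MoeMTPModel",  # assumed path
    )
```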

Test Plan

Accuracy test with lm_eval

Model: zai-org/GLM-4.7-FP8

Server command:

ATOM_DISABLE_VLLM_PLUGIN=0 \
ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=0 \
VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 \
  vllm serve zai-org/GLM-4.7-FP8 \
  -tp 8 \
  --max-num-seqs 1024 \
  --gpu-memory-utilization 0.9 \
  --no-enable-prefix-caching \
  --disable-uvicorn-access-log \
  --trust-remote-code \
  --load-format fastsafetensors \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --kv-cache-dtype fp8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1
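
(For reference, on vLLM builds without the dotted config syntax, the same speculative settings should be expressible as JSON: `--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'`.)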

lm_eval command

lm_eval --model local-completions \
  --model_args model=zai-org/GLM-4.7-FP8,base_url=http://localhost:8000/v1/completions,num_concurrent=64,tokenized_requests=False \
  --tasks gsm8k \
  --num_fewshot 5

Test Result

gsm8k

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9439|±  |0.0063|
|     |       |strict-match    |     5|exact_match|↑  |0.9439|±  |0.0063|

Submission Checklist

whx-sjtu and others added 14 commits April 23, 2026 10:49
@wuhuikx wuhuikx requested review from ganyi1996ppo and whx-sjtu May 10, 2026 03:18
kliuae and others added 3 commits May 11, 2026 07:40
@zejunchen-zejun (Collaborator)

Hi, @kliuae

Could you help resolve the conflicts here? Meanwhile, GLM-4.7 MTP should also be added to the atom-vllm nightly and benchmark workflows.

Thank you

kliuae added 2 commits May 14, 2026 06:10

logger.info(f"Construct ATOM model {model_arch} for vLLM plugin mode")
self.model = model_cls(self.atom_config)
self._adapt_mtp_layers_for_vllm()
Contributor:

You might need to skip this if the glm4 mtp layer doesn't need to mask input_embedding according to positions.

Contributor (Author):

Thanks for pointing this out. Like DeepSeek MTP, GLM-4 MTP masks its inputs at position 0, so it can use this path. (A sketch of the masking pattern follows.)
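
For readers unfamiliar with the pattern: in DeepSeek-style MTP there is no previous token to condition on at sequence start, so the input embedding rows at position 0 are zeroed before being combined with the hidden states. A minimal sketch of the idea, with illustrative names rather than the plugin's actual code:

```python
import torch

def mask_mtp_inputs(inputs_embeds: torch.Tensor,
                    positions: torch.Tensor) -> torch.Tensor:
    # DeepSeek-MTP-style masking: zero the embedding wherever the token
    # position is 0, since the MTP head has no previous token there.
    # GLM-4 MTP follows the same convention, so the shared path applies.
    inputs_embeds = inputs_embeds.clone()
    inputs_embeds[positions == 0] = 0
    return inputs_embeds
```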

self.model = model_cls(self.atom_config)
self._adapt_mtp_layers_for_vllm()
# Mirror nested attributes required by vLLM speculative decoding.
self._expose_spec_decode_attrs()
Contributor:

You might also need to skip this if the glm4 mtp layers don't need to share lm_head weights with the main model.

Contributor (Author):

lm_head is not shared between the MTP module and the main model in GLM-4, but since its inner predictor doesn't carry an lm_head, the syncing of lm_head won't be triggered. I think the current logic here doesn't affect GLM-4 MTP. (A sketch of the guard behavior is below.)
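
To make that concrete, here is a hypothetical sketch of the kind of guarded attribute mirroring being discussed, assuming the sync only fires when the drafter actually exposes an lm_head (attribute names are illustrative, not the plugin's real code):

```python
def _expose_spec_decode_attrs(self):
    # Mirror nested attributes that vLLM's speculative decoding expects
    # to find on the top-level model. "predictor" is an assumed name for
    # the inner MTP module. GLM-4 MTP's predictor carries no lm_head, so
    # the guarded sync below is a no-op for it, matching the discussion.
    predictor = getattr(self.model, "predictor", None)
    if predictor is not None and hasattr(predictor, "lm_head"):
        self.lm_head = predictor.lm_head
```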
