[Feat] Support GLM-4.7 MTP in vLLM-ATOM plugin#722
Conversation
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: kliuae-amd <kuanfliu@amd.com>
Hi @kliuae, could you help resolve the conflicts here? Meanwhile, GLM 4.7 MTP should also be added to the atom-vllm nightly and benchmark workflows. Thank you!
logger.info(f"Construct ATOM model {model_arch} for vLLM plugin mode")
self.model = model_cls(self.atom_config)
self._adapt_mtp_layers_for_vllm()
You might need to skip this if the glm4 mtp layer doesn't need to mask input embeddings according to positions.
Thanks for pointing this out. Like deepseek mtp, glm4 mtp masks inputs at position 0, so it can use this.
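For readers unfamiliar with this MTP convention: the first token of each sequence has no preceding hidden state to combine with, so DeepSeek-style MTP layers (and, per the comment above, GLM-4 MTP as well) zero out the input embedding wherever the position index is 0. A minimal NumPy sketch of that masking, with a hypothetical helper name:

```python
import numpy as np

def mask_first_position_embeddings(inputs_embeds: np.ndarray,
                                   positions: np.ndarray) -> np.ndarray:
    """Zero the embedding rows whose position index is 0.

    Hypothetical standalone helper for illustration only; the actual
    masking lives inside the MTP layer's forward pass.
    """
    out = inputs_embeds.copy()
    out[positions == 0] = 0.0
    return out

embeds = np.ones((4, 3))            # 4 tokens, hidden size 3
positions = np.array([0, 1, 2, 0])  # two sequences packed together
masked = mask_first_position_embeddings(embeds, positions)
```

After the call, rows 0 and 3 (the sequence-initial tokens) are zeroed while the rest are untouched, which is why `_adapt_mtp_layers_for_vllm` can be applied uniformly here.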
self.model = model_cls(self.atom_config)
self._adapt_mtp_layers_for_vllm()
# Mirror nested attributes required by vLLM speculative decoding.
self._expose_spec_decode_attrs()
You might also need to skip this if the glm4 mtp layers don't need to share lm_head weights with the main model.
lm_head is not shared between the mtp and the main model in glm4, but since its inner predictor doesn't carry an lm_head, the lm_head syncing won't be triggered. I think the current logic here doesn't affect glm4 mtp.
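The argument in this reply amounts to a structural guard: the attribute-mirroring step only syncs lm_head when the inner predictor actually carries one, so for GLM-4 MTP (whose predictor has no lm_head) it is a no-op. A hypothetical sketch of that guard, with all class and method names invented for illustration:

```python
class _PredictorWithoutHead:
    """Mirrors the GLM-4 MTP case: no lm_head attribute."""

class _PredictorWithHead:
    """Mirrors the shared-head case discussed in the review comment."""
    lm_head = "tied-head"

class MTPWrapper:
    """Hypothetical sketch of the lm_head-mirroring behavior."""

    def __init__(self, predictor):
        self.predictor = predictor
        self.lm_head = None
        self._expose_spec_decode_attrs()

    def _expose_spec_decode_attrs(self):
        # Only mirror lm_head when the inner predictor carries one;
        # for GLM-4 MTP it does not, so this sync is silently skipped.
        if hasattr(self.predictor, "lm_head"):
            self.lm_head = self.predictor.lm_head

glm4_like = MTPWrapper(_PredictorWithoutHead())   # lm_head stays None
shared_like = MTPWrapper(_PredictorWithHead())    # lm_head is mirrored
```

Under this reading, the existing `_expose_spec_decode_attrs` logic is safe to run unconditionally, which matches the conclusion above.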
Motivation
This PR builds on top of the MTP framework in #557 and adds MTP support for the GLM-4.7 model to vLLM-ATOM.
Currently this PR contains the changes from #557 and will become more concise once that PR is upstreamed.
Technical Details
- Adds `Glm4MoeMTPModel` (`glm4_moe_mtp` modeling)
- `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=0`

Test Plan
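The PR's exact server command is not shown here; as a hypothetical illustration only, the `ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=0` environment variable from the Technical Details would be set when launching the server, e.g.:

```shell
# Illustrative only: serve flags and speculative-decoding config
# used in the actual test are not reproduced in this PR excerpt.
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=0 \
  vllm serve zai-org/GLM-4.7-FP8
```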
Accuracy test with lm_eval
Model: zai-org/GLM-4.7-FP8
Server command:
lm_eval command
Test Result
gsm8k
Submission Checklist