[Bugfix] Enable MTP for the official Qwen3.5 NVFP4 checkpoint by mmangkad · Pull Request #38650 · vllm-project/vllm

mmangkad · 2026-03-31T17:34:56Z

Purpose

Enable MTP for the official Qwen3.5 NVFP4 checkpoint, which currently fails to initialize with MTP because the Qwen3.5 MTP branch is stored in BF16 rather than modelopt_fp4.

Test Plan

8xB300

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 8 \
  --language-model-only \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

Test Result

python tests/evals/gsm8k/gsm8k_eval.py

Running GSM8K evaluation: 1319 questions, 5-shot
Evaluating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [01:38<00:00, 13.44it/s]

Results:
Accuracy: 0.879
Invalid responses: 0.024
Total latency: 98.163 s
Questions per second: 13.437
Total output tokens: 193341
Output tokens per second: 1969.593
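
As a quick sanity check, the two throughput figures above can be recomputed from the reported totals. This is only a sketch using the numbers printed by the run; the variable names are illustrative, and small differences in the last digits are expected because the reported latency is itself rounded.

```python
# Figures copied from the evaluation output above.
num_questions = 1319
total_latency_s = 98.163
total_output_tokens = 193341

# Throughput is simply the total count divided by wall-clock latency.
questions_per_second = num_questions / total_latency_s
output_tokens_per_second = total_output_tokens / total_latency_s

print(f"{questions_per_second:.3f} questions/s")
print(f"{output_tokens_per_second:.3f} output tokens/s")
```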

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>

gemini-code-assist

Code Review

This pull request introduces a specialized quantization configuration handler for Qwen3.5 MTP models, ensuring that NVFP4 checkpoints use bf16 weights for the MTP branch. The implementation uses a helper function and a try...finally block to temporarily override the global configuration during layer initialization. Feedback suggests improving the safety of this implementation by using a shallow copy of the configuration object instead of mutating shared state, which would also simplify the code by removing the need for manual state restoration.
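
The difference between the two approaches in this review can be sketched as follows. This is a minimal illustration, not the code in this PR: `QuantConfig`, `ModelConfig`, `describe_layers`, and both builder functions are hypothetical stand-ins, not vLLM classes or APIs.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantConfig:
    """Hypothetical stand-in for a quantization config (not a vLLM class)."""
    method: str


@dataclass
class ModelConfig:
    """Hypothetical stand-in for the shared configuration object."""
    quant_config: Optional[QuantConfig]


def describe_layers(cfg: ModelConfig) -> str:
    """Pretend layer construction: report which weight format would be used."""
    return "bf16" if cfg.quant_config is None else cfg.quant_config.method


def build_mtp_with_override(shared: ModelConfig) -> str:
    """Mutate the shared config, then restore it in a finally block."""
    saved = shared.quant_config
    shared.quant_config = None  # other readers of `shared` see this too
    try:
        return describe_layers(shared)
    finally:
        shared.quant_config = saved  # manual restoration is required


def build_mtp_with_copy(shared: ModelConfig) -> str:
    """Work on a shallow copy so the shared config is never mutated."""
    local = copy.copy(shared)
    local.quant_config = None  # only the local copy is changed
    return describe_layers(local)


shared = ModelConfig(quant_config=QuantConfig(method="modelopt_fp4"))
print(build_mtp_with_override(shared), shared.quant_config.method)
print(build_mtp_with_copy(shared), shared.quant_config.method)
```

Both variants leave the shared object unchanged afterwards, but only the copy-based one never exposes a temporarily modified config to other code and needs no restoration step.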

vadiklyutiy · 2026-03-31T21:02:33Z

I think it should be fixed on checkpoint level.

There is exclude_modules in hf_quant_config.json where MTP layers should be added.

mmangkad · 2026-04-01T01:59:46Z

> I think it should be fixed on checkpoint level.
>
> There is exclude_modules in hf_quant_config.json where MTP layers should be added.

@vadiklyutiy exactly, hf_quant_config.json already excludes the MTP path, but vllm still initializes Qwen3.5 MTP with modelopt_fp4 and applies that quant config during MTP construction. Because of that, the official Qwen3.5 NVFP4 checkpoint cannot start with method="mtp" before this change.

gaby · 2026-04-01T04:38:15Z

👍 looking forward to this

vadiklyutiy · 2026-04-01T07:24:51Z

this is the fix from modelopt team NVIDIA/Model-Optimizer#1124 (review)

mmangkad · 2026-04-01T10:41:41Z

> this is the fix from modelopt team NVIDIA/Model-Optimizer#1124 (review)

The issue here is still on the vllm side: vllm does not currently respect the MTP exclusion when constructing Qwen3.5 MTP, so a re-export alone would not fix this.

ZJY0516 · 2026-04-02T05:37:15Z

> I think it should be fixed on checkpoint level.
>
> There is exclude_modules in hf_quant_config.json where MTP layers should be added.

We can accept this workaround temporarily, because I think many users need this.

vadiklyutiy · 2026-04-02T11:52:49Z

> > this is the fix from modelopt team NVIDIA/Model-Optimizer#1124 (review)
>
> The issue here is still on the vllm side: vllm does not currently respect the MTP exclusion when constructing Qwen3.5 MTP, so a re-export alone would not fix this.

I think the problem is not that vLLM ignores the exclusion, but rather that mtp.fc.weight is missing in the exclusion list, no?

vadiklyutiy · 2026-04-02T15:10:12Z

> > I think it should be fixed on checkpoint level.
> > There is exclude_modules in hf_quant_config.json where MTP layers should be added.
>
> We can accept this workaround temporarily, because I think many users need this

I am OK with accepting the workaround temporarily, but this seems overcomplicated because it assumes that vLLM doesn't respect the exclusion list, when in fact it does (see is_layer_excluded() in modelopt.py).
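
The disagreement above comes down to whether a given module name matches the exclusion list at all. A minimal sketch of that kind of check follows; it is not vLLM's `is_layer_excluded()` implementation, and the helper name `is_excluded`, the glob-style matching, and the example patterns are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_excluded(module_name: str, exclude_patterns: Iterable[str]) -> bool:
    """Return True if the module name matches any glob-style pattern."""
    return any(fnmatchcase(module_name, pattern) for pattern in exclude_patterns)


# Hypothetical exclusion list: it covers the MTP decoder layers
# but has no entry for the MTP projection `mtp.fc`.
exclude_patterns = ["lm_head", "mtp.layers.*"]

for name in ("mtp.layers.0.mlp.gate", "mtp.fc"):
    status = "excluded" if is_excluded(name, exclude_patterns) else "quantized"
    print(f"{name}: {status}")
```

Under these assumptions a loader that honors the list correctly would still try to build `mtp.fc` as a quantized layer, which is the scenario described in the comment above.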

vadiklyutiy · 2026-04-02T17:22:32Z

#38832 is a simpler and smaller change that fixes only the problematic place.

Compressed-tensors NVFP4 Qwen3.5 MoE checkpoints store the MTP layer's fused expert weights as unquantized BF16 tensors (`mtp.layers.X.mlp.experts.{down,gate_up}_proj`, shape `[num_experts, ...]`) while the rest of the model is NVFP4-quantized per-expert per-projection. However, the per-expert MTP linears are not listed in the compressed-tensors `quantization_config.ignore` field. vLLM ends up constructing the MTP `FusedMoE` quantized (registering `w13_weight_packed` / `w2_weight_packed`), and weight loading fails:

KeyError: 'layers.0.mlp.experts.w2_weight' in Qwen3_5MultiTokenPredictor.load_fused_expert_weights

This mirrors the existing `mtp.fc` workaround (PR vllm-project#38832) but for the experts. We extend the active CT `ignore` list with every per-expert MTP linear before constructing `self.layers`, so the FusedMoE picks `UnquantizedFusedMoEMethod` and registers BF16 `w13_weight` / `w2_weight` matching the checkpoint.

Note: this is complementary to (not duplicative of) PR vllm-project#27608, which fixes the orthogonal CT-loader bug that `get_quant_method` doesn't honor `ignore` for FusedMoE. Even with vllm-project#27608 landed, an affected checkpoint would still crash because its `ignore` list is missing the per-expert MTP entries entirely. Once both vllm-project#27608 and corrected checkpoint metadata are in place, this workaround can be removed.

Repro / impact (DGX Spark, GB10, BS=1, concurrency=1, prefix=32768, ISL=2048, OSL=1024):

| K | Before patch | After patch (out tput) |
|---|--------------|------------------------|
| 0 | 63.08 t/s    | 63.08 t/s (unaffected) |
| 1 | crash        | 71.08 t/s              |
| 3 | crash        | 84.81 t/s              |
| 5 | crash        | 87.76 t/s              |

Spec config also requires `moe_backend in {triton, flashinfer_trtllm, flashinfer_cutlass, aiter}` for the unquantized MTP MoE; `marlin` rejects unquantized FusedMoE. This is unrelated and not changed here.

Drive-by: update stale PR reference in the existing mtp.fc workaround comment (vllm-project#38650 was closed unmerged; vllm-project#38832 is the merged fix).

Assisted-by: Claude
Signed-off-by: Serge Panev <spanev@nvidia.com>
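
The idea of extending an `ignore` list with every per-expert MTP linear can be sketched as below. This is an illustration only, not the code from that PR: the function `extend_ignore`, the name template `mtp.layers.{layer}.mlp.experts.{expert}.{proj}`, and the projection names `gate_proj` / `up_proj` / `down_proj` are assumptions about how per-expert modules might be named.

```python
from typing import List

# Assumed per-expert projection names (hypothetical, for illustration).
EXPERT_PROJECTIONS = ("gate_proj", "up_proj", "down_proj")


def extend_ignore(ignore: List[str], num_mtp_layers: int, num_experts: int) -> List[str]:
    """Return a new ignore list that also covers every per-expert MTP linear.

    Existing entries are kept in order and duplicates are not added,
    so calling the function twice yields the same result.
    """
    extended = list(ignore)
    seen = set(extended)
    for layer in range(num_mtp_layers):
        for expert in range(num_experts):
            for proj in EXPERT_PROJECTIONS:
                name = f"mtp.layers.{layer}.mlp.experts.{expert}.{proj}"
                if name not in seen:
                    seen.add(name)
                    extended.append(name)
    return extended


# Small example: one MTP layer with two experts.
for entry in extend_ignore(["lm_head"], num_mtp_layers=1, num_experts=2):
    print(entry)
```

Returning a new list rather than appending in place keeps the original metadata untouched, which makes the workaround easy to remove once the checkpoint's own `ignore` list is corrected.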

mmangkad added 3 commits March 31, 2026 11:36

upd

ecc2aca

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>

Merge branch 'vllm-project:main' into enable-qwen3p5-nvfp4-mtp

6398b31

Merge branch 'vllm-project:main' into enable-qwen3p5-nvfp4-mtp

442a64b

mmangkad requested review from sighingnow and vadiklyutiy as code owners March 31, 2026 17:34

mergify Bot added the qwen (Related to Qwen models) and bug (Something isn't working) labels Mar 31, 2026

gemini-code-assist Bot reviewed Mar 31, 2026

Comment thread vllm/model_executor/models/qwen3_5_mtp.py

acyngel approved these changes Mar 31, 2026

ZJY0516 added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 2, 2026

vadiklyutiy mentioned this pull request Apr 2, 2026

[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 #38832

Merged

ZJY0516 removed the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 2, 2026

mmangkad closed this Apr 2, 2026

mmangkad deleted the enable-qwen3p5-nvfp4-mtp branch April 2, 2026 18:25

Kh4L mentioned this pull request May 7, 2026

[Bugfix] Extend compressed-tensors ignore for Qwen3.5 MTP experts #41994

Open

2 tasks

mmangkad wants to merge 3 commits into vllm-project:main from mmangkad-dev:enable-qwen3p5-nvfp4-mtp


5 participants
