
[Bugfix] Enable MTP for the official Qwen3.5 NVFP4 checkpoint #38650

Closed
mmangkad wants to merge 3 commits into vllm-project:main from mmangkad-dev:enable-qwen3p5-nvfp4-mtp

@mmangkad
Contributor

@mmangkad mmangkad commented Mar 31, 2026

Purpose

Enable MTP for the official Qwen3.5 NVFP4 checkpoint, which currently fails to initialize with MTP because the Qwen3.5 MTP branch is stored in BF16 rather than modelopt_fp4.

Test Plan

8xB300

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 8 \
  --language-model-only \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

Test Result

python tests/evals/gsm8k/gsm8k_eval.py

Running GSM8K evaluation: 1319 questions, 5-shot
Evaluating: 100%|██████████| 1319/1319 [01:38<00:00, 13.44it/s]

Results:
Accuracy: 0.879
Invalid responses: 0.024
Total latency: 98.163 s
Questions per second: 13.437
Total output tokens: 193341
Output tokens per second: 1969.593
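The throughput figures in the log can be cross-checked from the totals it reports. The sketch below recomputes them using only numbers copied from the log; the variable names are illustrative, and because the logged latency is rounded to three decimals, the recomputed rates should only be expected to agree with the log to within rounding.

```python
# Figures copied from the evaluation log above.
num_questions = 1319
total_latency_s = 98.163
total_output_tokens = 193341

# Derived rates; the log's own values were presumably computed from an
# unrounded latency, so tiny differences in the last digit are possible.
questions_per_second = num_questions / total_latency_s
output_tokens_per_second = total_output_tokens / total_latency_s
mean_tokens_per_question = total_output_tokens / num_questions

print(f"{questions_per_second:.3f} questions/s")
print(f"{output_tokens_per_second:.3f} output tokens/s")
print(f"{mean_tokens_per_question:.1f} output tokens per question")
```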

@mergify mergify Bot added qwen Related to Qwen models bug Something isn't working labels Mar 31, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a specialized quantization configuration handler for Qwen3.5 MTP models, ensuring that NVFP4 checkpoints use bf16 weights for the MTP branch. The implementation uses a helper function and a try...finally block to temporarily override the global configuration during layer initialization. Feedback suggests improving the safety of this implementation by using a shallow copy of the configuration object instead of mutating shared state, which would also simplify the code by removing the need for manual state restoration.
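The trade-off described in the review summary can be sketched as follows. This is a minimal illustration and not the code from the PR: `QuantConfig`, `ModelConfig`, `build_with_mutation`, and `build_with_copy` are hypothetical names, and setting `quant_config` to `None` is only a stand-in for "build this branch unquantized".

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class QuantConfig:
    """Hypothetical stand-in for a quantization config object."""
    method: str = "modelopt_fp4"
    exclude_modules: List[str] = field(default_factory=list)


@dataclass
class ModelConfig:
    """Hypothetical stand-in for configuration shared across the model."""
    quant_config: Optional[QuantConfig] = None


def build_with_mutation(shared: ModelConfig, build: Callable):
    """Override the shared config in place, then restore it afterwards."""
    saved = shared.quant_config
    shared.quant_config = None  # None means "unquantized" in this sketch
    try:
        return build(shared)
    finally:
        # Restoration must run even if build() raises.
        shared.quant_config = saved


def build_with_copy(shared: ModelConfig, build: Callable):
    """Override the field on a shallow copy; the shared object is untouched."""
    local = copy.copy(shared)
    local.quant_config = None
    return build(local)


shared = ModelConfig(quant_config=QuantConfig())
seen = build_with_copy(shared, lambda cfg: cfg.quant_config)
print(seen, shared.quant_config.method)
```

With the copy-based variant there is no window in which other code can observe the temporarily modified shared object, and no restore step to get wrong. A shallow copy still shares nested objects, so it is only safe when the override rebinds a field rather than mutating something the field points to.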

Comment thread vllm/model_executor/models/qwen3_5_mtp.py
Comment thread vllm/model_executor/models/qwen3_5_mtp.py
@vadiklyutiy
Member

I think it should be fixed at the checkpoint level.

There is an exclude_modules list in hf_quant_config.json where the MTP layers should be added.

@mmangkad
Contributor Author

mmangkad commented Apr 1, 2026

> I think it should be fixed at the checkpoint level.
>
> There is an exclude_modules list in hf_quant_config.json where the MTP layers should be added.

@vadiklyutiy exactly, hf_quant_config.json already excludes the MTP path, but vllm still initializes Qwen3.5 MTP with modelopt_fp4 and applies that quant config during MTP construction. Because of that, the official Qwen3.5 NVFP4 checkpoint cannot start with method="mtp" before this change.

@gaby

gaby commented Apr 1, 2026

👍 looking forward to this

@vadiklyutiy
Member

This is the fix from the ModelOpt team: NVIDIA/Model-Optimizer#1124 (review)

@mmangkad
Contributor Author

mmangkad commented Apr 1, 2026

> This is the fix from the ModelOpt team: NVIDIA/Model-Optimizer#1124 (review)

The issue here is still on the vllm side: vllm does not currently respect the MTP exclusion when constructing Qwen3.5 MTP, so a re-export alone would not fix this.

@ZJY0516
Member

ZJY0516 commented Apr 2, 2026

> I think it should be fixed at the checkpoint level.
>
> There is an exclude_modules list in hf_quant_config.json where the MTP layers should be added.

We can accept this workaround temporarily, because I think many users need this.

@ZJY0516 ZJY0516 added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 2, 2026
@vadiklyutiy
Member

vadiklyutiy commented Apr 2, 2026

> > This is the fix from the ModelOpt team: NVIDIA/Model-Optimizer#1124 (review)
>
> The issue here is still on the vllm side: vllm does not currently respect the MTP exclusion when constructing Qwen3.5 MTP, so a re-export alone would not fix this.

I think the problem is not that vLLM ignores the exclusion, but rather that mtp.fc.weight is missing in the exclusion list, no?

@vadiklyutiy
Member

> > I think it should be fixed at the checkpoint level.
> > There is an exclude_modules list in hf_quant_config.json where the MTP layers should be added.
>
> We can accept this workaround temporarily, because I think many users need this.

I am OK with accepting a workaround temporarily, but this change seems overcomplicated: it assumes that vLLM doesn't respect the exclusion list, but it does (see is_layer_excluded() in modelopt.py).
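The disagreement in this thread is easier to follow with a toy model of an exclusion check. The sketch below is not vLLM's `is_layer_excluded()`: the `is_excluded` helper, the wildcard matching rule, and the contents of `exclude_modules` are all assumptions chosen for illustration. It shows how a loader that does honor the list would still quantize `mtp.fc` if that entry were simply missing.

```python
from fnmatch import fnmatchcase


def is_excluded(module_name, exclude_patterns):
    """Return True if module_name matches any exclusion pattern.

    Hypothetical matcher using shell-style wildcards; the real matching
    rules in vLLM may differ.
    """
    return any(fnmatchcase(module_name, p) for p in exclude_patterns)


# Hypothetical list that covers the MTP layers but omits mtp.fc.
exclude_modules = ["lm_head", "mtp.layers.*"]

for name in ["mtp.layers.0.mlp.gate", "mtp.fc", "model.layers.3.mlp.gate"]:
    kind = "unquantized" if is_excluded(name, exclude_modules) else "quantized"
    print(f"{name}: built as {kind}")
```

Under these assumptions the loader behaves correctly for everything the list names, and the only fix needed is an extra entry covering `mtp.fc`, either in the checkpoint metadata or added at load time.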

@vadiklyutiy
Member

#38832 is a simpler and smaller change that fixes only the problematic place.

@ZJY0516 ZJY0516 removed the ready ONLY add when PR is ready to merge/full CI is needed label Apr 2, 2026
@mmangkad mmangkad closed this Apr 2, 2026
@mmangkad mmangkad deleted the enable-qwen3p5-nvfp4-mtp branch April 2, 2026 18:25
Kh4L added a commit to Kh4L/vllm that referenced this pull request May 7, 2026
Compressed-tensors NVFP4 Qwen3.5 MoE checkpoints store the MTP layer's
fused expert weights as BF16 unquantized tensors
(`mtp.layers.X.mlp.experts.{down,gate_up}_proj`, shape `[num_experts, ...]`)
while the rest of the model is NVFP4-quantized per-expert per-projection.
However the per-expert MTP linears are not listed in the
compressed-tensors `quantization_config.ignore` field. vLLM ends up
constructing the MTP `FusedMoE` quantized (registering `w13_weight_packed`
/ `w2_weight_packed`), and weight loading fails:

  KeyError: 'layers.0.mlp.experts.w2_weight'
    in Qwen3_5MultiTokenPredictor.load_fused_expert_weights

This mirrors the existing `mtp.fc` workaround (PR vllm-project#38832) but for the
experts. We extend the active CT `ignore` list with every per-expert MTP
linear before constructing `self.layers`, so the FusedMoE picks
`UnquantizedFusedMoEMethod` and registers BF16 `w13_weight` /
`w2_weight` matching the checkpoint.
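As an illustration of the ignore-list extension this commit message describes, the sketch below enumerates per-expert MTP linear names and appends them to an existing list. It is not the code from the commit: the helper names, the per-expert naming scheme `mtp.layers.{layer}.mlp.experts.{expert}.{proj}`, the projection names, and the starting list are all assumptions.

```python
def mtp_expert_ignore_entries(num_mtp_layers, num_experts,
                              projections=("gate_proj", "up_proj", "down_proj")):
    """Enumerate per-expert MTP linear names (hypothetical naming scheme)."""
    return [
        f"mtp.layers.{layer}.mlp.experts.{expert}.{proj}"
        for layer in range(num_mtp_layers)
        for expert in range(num_experts)
        for proj in projections
    ]


def extend_ignore(ignore, extra):
    """Return a new list: ignore plus the extra names not already in it."""
    present = set(ignore)
    return list(ignore) + [name for name in extra if name not in present]


ignore = ["lm_head", "mtp.fc"]  # illustrative starting list
ignore = extend_ignore(ignore, mtp_expert_ignore_entries(1, 2))
print(ignore)
```

Returning a new list rather than appending in place keeps the original metadata untouched, which matters if the same config object is reused elsewhere.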

Note: this is complementary to (not duplicative of) PR vllm-project#27608, which
fixes the orthogonal CT-loader bug that `get_quant_method` doesn't
honor `ignore` for FusedMoE. Even with vllm-project#27608 landed, an affected
checkpoint would still crash because its `ignore` list is missing the
per-expert MTP entries entirely. Once both vllm-project#27608 and corrected
checkpoint metadata are in place, this workaround can be removed.

Repro / impact (DGX Spark, GB10, BS=1, concurrency=1, prefix=32768,
ISL=2048, OSL=1024):

  | K | Before patch  | After patch (out tput) |
  |---|---------------|------------------------|
  | 0 | 63.08 t/s     | 63.08 t/s (unaffected) |
  | 1 | crash         | 71.08 t/s              |
  | 3 | crash         | 84.81 t/s              |
  | 5 | crash         | 87.76 t/s              |
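The relative gain in the table can be worked out directly from its throughput column. The sketch below uses only the numbers from the table; reading K as the number of speculative tokens is an assumption, since the commit message does not define it.

```python
# Output throughput in tokens/s, copied from the "After patch" column.
baseline_tps = 63.08  # K = 0 row
after_patch_tps = {1: 71.08, 3: 84.81, 5: 87.76}

# Speedup of each K relative to the K = 0 row.
speedups = {k: tps / baseline_tps for k, tps in after_patch_tps.items()}
for k, s in speedups.items():
    print(f"K={k}: {s:.2f}x over K=0")
```

This works out to roughly 1.13x, 1.34x, and 1.39x for K = 1, 3, and 5, so most of the gain in this single-request setup is already reached by K = 3.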

Spec config also requires `moe_backend in {triton, flashinfer_trtllm,
flashinfer_cutlass, aiter}` for the unquantized MTP MoE; `marlin`
rejects unquantized FusedMoE. This is unrelated and not changed here.

Drive-by: update stale PR reference in the existing mtp.fc workaround
comment (vllm-project#38650 was closed unmerged; vllm-project#38832 is the merged fix).

Assisted-by: Claude
Signed-off-by: Serge Panev <spanev@nvidia.com>

Labels

bug Something isn't working qwen Related to Qwen models


5 participants