[Bugfix] Enable MTP for the official Qwen3.5 NVFP4 checkpoint #38650
mmangkad wants to merge 3 commits into
Code Review
This pull request introduces a specialized quantization configuration handler for Qwen3.5 MTP models, ensuring that NVFP4 checkpoints use bf16 weights for the MTP branch. The implementation uses a helper function and a try...finally block to temporarily override the global configuration during layer initialization. Feedback suggests improving the safety of this implementation by using a shallow copy of the configuration object instead of mutating shared state, which would also simplify the code by removing the need for manual state restoration.
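The two patterns contrasted in the review can be sketched as follows. This is a minimal illustration, not the PR's code: `QuantConfig`, its `method` and `ignore` fields, and both helper functions are hypothetical stand-ins for whatever shared configuration object the PR actually overrides.

```python
import copy
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class QuantConfig:
    """Hypothetical stand-in for a shared quantization config object."""
    method: str = "modelopt_fp4"
    ignore: list = field(default_factory=list)


@contextmanager
def temporarily_unquantized(shared: QuantConfig):
    """Mutate-and-restore pattern: the shared object is changed in place."""
    saved = shared.method
    shared.method = "none"
    try:
        yield shared
    finally:
        # Manual restoration; runs even if layer construction raises.
        shared.method = saved


def unquantized_copy(shared: QuantConfig) -> QuantConfig:
    """Copy pattern: the shared object is never modified."""
    local = copy.copy(shared)           # shallow copy of the config object
    local.method = "none"               # rebinding affects only the copy
    local.ignore = list(shared.ignore)  # avoid aliasing the mutable list
    return local
```

With the copy pattern, anything else that reads the shared object during layer initialization never observes the override, and there is no state to restore afterwards.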
I think it should be fixed at the checkpoint level. There is
@vadiklyutiy exactly,
👍 looking forward to this
This is the fix from the ModelOpt team: NVIDIA/Model-Optimizer#1124 (review)
The issue here is still on the vLLM side: vLLM does not currently respect the MTP exclusion when constructing Qwen3.5 MTP, so a re-export alone would not fix this.
We can accept this workaround temporarily, because I think many users need this.
I think the problem is not that vLLM ignores the exclusion, but rather that
I am OK with accepting the workaround temporarily, but it seems overcomplicated here because it assumes that vLLM doesn't respect the exclusion list, when in fact it does (see
#38832 is a somewhat simpler and smaller change that fixes only the problematic place.
Compressed-tensors NVFP4 Qwen3.5 MoE checkpoints store the MTP layer's
fused expert weights as BF16 unquantized tensors
(`mtp.layers.X.mlp.experts.{down,gate_up}_proj`, shape `[num_experts, ...]`)
while the rest of the model is NVFP4-quantized per-expert per-projection.
However the per-expert MTP linears are not listed in the
compressed-tensors `quantization_config.ignore` field. vLLM ends up
constructing the MTP `FusedMoE` quantized (registering `w13_weight_packed`
/ `w2_weight_packed`), and weight loading fails:
    KeyError: 'layers.0.mlp.experts.w2_weight'
      in Qwen3_5MultiTokenPredictor.load_fused_expert_weights
This mirrors the existing `mtp.fc` workaround (PR vllm-project#38832) but for the
experts. We extend the active CT `ignore` list with every per-expert MTP
linear before constructing `self.layers`, so the FusedMoE picks
`UnquantizedFusedMoEMethod` and registers BF16 `w13_weight` /
`w2_weight` matching the checkpoint.
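Extending the `ignore` list as described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this commit: the helper names are hypothetical, and the per-expert name pattern `mtp.layers.{layer}.mlp.experts.{expert}.{proj}` with projections `gate_proj`, `up_proj`, and `down_proj` is assumed rather than taken from the checkpoint.

```python
def mtp_expert_ignore_entries(
    num_mtp_layers,
    num_experts,
    prefix="mtp.layers",
    projections=("gate_proj", "up_proj", "down_proj"),  # assumed names
):
    """Build one ignore entry per (layer, expert, projection) triple."""
    return [
        f"{prefix}.{layer}.mlp.experts.{expert}.{proj}"
        for layer in range(num_mtp_layers)
        for expert in range(num_experts)
        for proj in projections
    ]


def extend_ignore(ignore, extra):
    """Append entries that are not already present, preserving order."""
    seen = set(ignore)
    for name in extra:
        if name not in seen:
            ignore.append(name)
            seen.add(name)
    return ignore
```

Making the extension idempotent keeps the list from growing if the model is constructed more than once against the same config object.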
Note: this is complementary to (not duplicative of) PR vllm-project#27608, which
fixes the orthogonal CT-loader bug that `get_quant_method` doesn't
honor `ignore` for FusedMoE. Even with vllm-project#27608 landed, an affected
checkpoint would still crash because its `ignore` list is missing the
per-expert MTP entries entirely. Once both vllm-project#27608 and corrected
checkpoint metadata are in place, this workaround can be removed.
Repro / impact (DGX Spark, GB10, BS=1, concurrency=1, prefix=32768,
ISL=2048, OSL=1024):
| K | Before patch | After patch (out tput) |
|---|---------------|------------------------|
| 0 | 63.08 t/s | 63.08 t/s (unaffected) |
| 1 | crash | 71.08 t/s |
| 3 | crash | 84.81 t/s |
| 5 | crash | 87.76 t/s |
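For reference, the relative gain over the K=0 baseline can be derived from the table with a short calculation. The numbers are copied from the table; the function name is just for illustration. The ratios come out to roughly 1.13x, 1.34x, and 1.39x for K=1, 3, and 5.

```python
BASELINE_TPS = 63.08  # K=0 output throughput from the table (t/s)
AFTER_PATCH_TPS = {1: 71.08, 3: 84.81, 5: 87.76}  # K -> t/s after the patch


def speedup_over_baseline(tps, baseline=BASELINE_TPS):
    """Return output throughput relative to the K=0 baseline."""
    return tps / baseline


for k, tps in AFTER_PATCH_TPS.items():
    print(f"K={k}: {speedup_over_baseline(tps):.2f}x")
```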
Spec config also requires `moe_backend in {triton, flashinfer_trtllm,
flashinfer_cutlass, aiter}` for the unquantized MTP MoE; `marlin`
rejects unquantized FusedMoE. This is unrelated and not changed here.
Drive-by: update stale PR reference in the existing mtp.fc workaround
comment (vllm-project#38650 was closed unmerged; vllm-project#38832 is the merged fix).
Assisted-by: Claude
Signed-off-by: Serge Panev <spanev@nvidia.com>
Purpose
Enable MTP for the official Qwen3.5 NVFP4 checkpoint, which currently fails to initialize with MTP because the Qwen3.5 MTP branch is stored in BF16 rather than modelopt_fp4.
Test Plan
8xB300
vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 8 \
  --language-model-only \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
Test Result