npu qwen3.5 megatron padding_free fix #50
Conversation
Code Review
This pull request introduces a helper function, resolve_gdn_attention_mask, to centralize attention mask logic and add support for NPU devices. This function is then integrated into the forward passes of the Qwen3-Next and Qwen3.5 models. A critical typo was identified in qwen3_next.py where an undefined function resolve_hf_attention_mask is called instead of the new helper, which would lead to a runtime error.
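A minimal sketch of what such a resolver could look like (illustrative only, not the PR's exact code; the NPU check via `torch.npu.is_available()` and the kwargs keys are assumptions based on the description above):

```python
import torch


def resolve_gdn_attention_mask(kwargs):
    """Pick the mask the GDN non-THD path should consume (illustrative sketch)."""
    # Assumption: torch_npu patches `torch.npu` onto torch when installed.
    if hasattr(torch, "npu") and torch.npu.is_available():
        mask_2d = kwargs.get("attention_mask_2d")
        if mask_2d is not None:
            return mask_2d
    # Legacy fallback: collapse the 4-D boolean mask to a per-token validity mask.
    attention_mask = kwargs.get("attention_mask")
    if attention_mask is not None:
        attention_mask = (~attention_mask).sum(dim=(1, 2)) > 0
    return attention_mask
```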
```diff
-attention_mask = kwargs.get('attention_mask')
-if attention_mask is not None:
-    attention_mask = (~attention_mask).sum(dim=(1, 2)) > 0
+attention_mask = resolve_hf_attention_mask(kwargs)
```
The function call resolve_hf_attention_mask(kwargs) appears to be a typo. The function defined earlier in this file (at line 62) is named resolve_gdn_attention_mask. Using the incorrect name will result in a NameError at runtime.
Suggested change:

```diff
-attention_mask = resolve_hf_attention_mask(kwargs)
+attention_mask = resolve_gdn_attention_mask(kwargs)
```
Pull request overview
Updates Qwen3.5 / Qwen3-Next GatedDeltaNet (GDN) non-THD attention-mask handling to avoid NPU failures when padding_free=False, by preferring attention_mask_2d on NPU and falling back to the existing attention_mask-derived behavior otherwise.
Changes:
- Add a shared `resolve_gdn_attention_mask()` helper that selects `attention_mask_2d` on NPU (falling back to the legacy mask logic otherwise).
- Use `resolve_gdn_attention_mask()` in the Qwen3.5 MM-GPT GDN non-THD forward path (a call-site sketch follows this list).
- Update the Qwen3-Next GDN non-THD forward path to use a resolver (but it currently references an undefined function).
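As a rough illustration of the call-site change (the wrapper function and module attribute below are hypothetical, not the repo's code), the non-THD path swaps its inline mask derivation for a single resolver call:

```python
def gdn_non_thd_forward(self, hidden_states, **kwargs):
    # Previously the mask was derived inline from kwargs['attention_mask'];
    # now one resolver call prefers attention_mask_2d on NPU and falls back
    # to the legacy derivation otherwise.
    attention_mask = resolve_gdn_attention_mask(kwargs)
    return self.gated_delta_net(hidden_states, attention_mask=attention_mask)
```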
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/mcore_bridge/model/mm_gpts/qwen3_5.py | Switch the non-THD GDN path to use the shared mask resolver. |
| src/mcore_bridge/model/gpts/qwen3_next.py | Introduce `resolve_gdn_attention_mask()` and update the non-THD GDN mask-resolution call site. |
```diff
-attention_mask = kwargs.get('attention_mask')
-if attention_mask is not None:
-    attention_mask = (~attention_mask).sum(dim=(1, 2)) > 0
+attention_mask = resolve_hf_attention_mask(kwargs)
```
resolve_hf_attention_mask is referenced here but is not defined anywhere in this module (or elsewhere in the repo). This will raise a NameError at runtime. Use the newly introduced resolve_gdn_attention_mask(kwargs) here (or add/rename the helper consistently if a different resolver is intended).
Suggested change:

```diff
-attention_mask = resolve_hf_attention_mask(kwargs)
+attention_mask = resolve_gdn_attention_mask(kwargs)
```
thanks! Please run it through lint.
Problem
For Qwen3.5 / Qwen3-Next on NPU, `padding_free=False` may fail in the GDN non-THD path, while `padding_free=True` works only because the THD path derives its mask from `cu_seqlens_q` internally and bypasses the broken external mask chain.
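As an aside on why the THD path is unaffected: packed-sequence boundaries can be recovered from `cu_seqlens_q` alone, so that path never touches the external mask. A small illustrative snippet (not repo code), assuming the usual cumulative-lengths convention:

```python
import torch

cu_seqlens_q = torch.tensor([0, 3, 7, 12])         # three packed sequences, no padding
seq_lens = cu_seqlens_q[1:] - cu_seqlens_q[:-1]    # tensor([3, 4, 5])
# Which packed sequence each token belongs to; boundaries come from cu_seqlens alone.
seq_idx = torch.repeat_interleave(torch.arange(len(seq_lens)), seq_lens)
print(seq_idx)  # tensor([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2])
```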
Cause

The GDN non-THD branch reads the external mask from kwargs, but on NPU the correct mask for this path is `attention_mask_2d`. Without preferring that field, the model may consume an incompatible mask.
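A shape sketch of the mismatch (assumed shapes, illustration only): the legacy reduction expects a 4-D mask and collapses it back to a per-token 2-D mask, whereas `attention_mask_2d` is already in that per-token form.

```python
import torch

# 2-D padding mask: True where the token is real.
attention_mask_2d = torch.tensor([[1, 1, 1, 0, 0],
                                  [1, 1, 1, 1, 1]], dtype=torch.bool)  # (batch, seq)
# Simplified (non-causal) 4-D mask, Megatron convention: True means "masked out".
attention_mask_4d = ~(attention_mask_2d[:, None, None, :]
                      & attention_mask_2d[:, None, :, None])           # (batch, 1, seq, seq)
# Legacy non-THD reduction from the diff: recover per-token validity from the 4-D mask.
recovered = (~attention_mask_4d).sum(dim=(1, 2)) > 0                   # (batch, seq)
assert torch.equal(recovered, attention_mask_2d)
# Note: sum(dim=(1, 2)) assumes four dims; applying it to a 2-D mask would raise.
```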
Fix

This PR updates the Qwen GDN non-THD path to prefer `attention_mask_2d` on NPU and fall back to the original `attention_mask` behavior otherwise.
Notes

This is the model-side half of the fix. It is intended to work together with the `ms-swift` trainer-side change that preserves `attention_mask_2d`.