[dev] [follow-up] Qwen3.5 support: MoE aux loss padding_mask#4776
Draft
wplf wants to merge 2 commits into
Conversation
Adds a standalone VLM training playground under ``examples/multimodal_dev/`` with Qwen3.5-VL end-to-end.

Highlights
- Model-agnostic entry point (``pretrain_multimodal.py``) with a ``MODEL_REGISTRY``, so adding a new architecture is just a registry entry plus a backing module (a sketch of the pattern follows below).
- Qwen3.5-VL model: vision encoder, MRoPE, decoder, factory, specs, and configurations covering the proxy / 9B / 397B-A17B variants.
- Datasets: mock data and a CORD-V2 VLM dataset, with THD pack/pad in the collate function.
- THD + CP support consolidated in ``forward_step.py`` and the model layer (uses MRoPE THD pre-computation and ``cu_seqlens_q_padded`` CP partitioning).
- Run script + README, plus tests for MRoPE parity, CP correctness, CP support, and THD correctness / e2e.

Also gates the torch DataLoader vanilla-collate path on the new ``use_vanilla_collate_fn`` arg (a one-line change to ``megatron/training/datasets/data_samplers.py``) so CORD-V2 works under BSHD.

Functional dependency: the new model arch sets ``mrope_interleaved=True`` in its config and relies on the core MRoPE interleaved layout introduced in a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
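A sketch of how such a registry-driven entry point can look. Everything below is an illustrative assumption (the decorator, the `qwen3_5_vl_9b` key, the `model_provider` signature), not the actual contents of ``pretrain_multimodal.py``:

```python
# Hypothetical sketch of a registry-driven entry point; names and signatures
# are illustrative and may differ from the real pretrain_multimodal.py.
from typing import Callable, Dict

MODEL_REGISTRY: Dict[str, Callable] = {}


def register_model(name: str) -> Callable:
    """Register a model factory so adding an architecture is one registry entry."""
    def _wrap(factory: Callable) -> Callable:
        MODEL_REGISTRY[name] = factory
        return factory
    return _wrap


@register_model("qwen3_5_vl_9b")  # hypothetical variant key
def build_qwen35_vl_9b(args):
    """Build the Qwen3.5-VL 9B variant (vision encoder + MRoPE decoder)."""
    ...


def model_provider(args):
    """Entry point: look up the requested architecture and build it."""
    return MODEL_REGISTRY[args.model_name](args)
```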
…ecoder
`MultimodalModel.forward()` was calling `self.language_model(...)`
without a `padding_mask` argument. With `padding_mask=None`, the
language decoder's MoE router skips three masking sites that are
documented as MoE-only (a simplified illustration follows this list):
- `apply_z_loss(logits, padding_mask=None)` — z-loss averages over
all tokens including collate padding.
- `_apply_aux_loss(..., with_padding_mask=padding_mask is not None)`
— aux load-balancing loss is computed over all tokens, so padded
positions dilute the signal and bias balancing toward whatever the
model emits for input_id=0 at those slots.
- `_apply_expert_bias(routing_map, padding_mask=None)` — expert-bias
EMA accumulates routing statistics from padded tokens.
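To see the dilution concretely, here is a generic Switch-style load-balancing loss with an optional padding mask. This is a simplified stand-in, not Megatron's `_apply_aux_loss`; the top-1 routing and the shapes are assumptions:

```python
import torch

def aux_loss(router_probs, routing_map, padding_mask=None):
    """Switch-style load-balancing loss: num_experts * sum(frac_tokens * frac_probs).

    router_probs: [T, E] softmax outputs; routing_map: [T, E] bool top-k picks;
    padding_mask: [T] bool, True at collate-padded positions (dropped when given).
    """
    if padding_mask is not None:
        keep = ~padding_mask
        router_probs, routing_map = router_probs[keep], routing_map[keep]
    frac_tokens = routing_map.float().mean(dim=0)  # fraction of tokens sent to each expert
    frac_probs = router_probs.mean(dim=0)          # mean router probability per expert
    return router_probs.shape[1] * torch.sum(frac_tokens * frac_probs)

# 6 real tokens + 2 collate-padded tokens, 4 experts, top-1 routing.
T, E = 8, 4
probs = torch.softmax(torch.randn(T, E), dim=-1)
routing = torch.zeros(T, E, dtype=torch.bool)
routing[torch.arange(T), probs.argmax(dim=-1)] = True
pad = torch.tensor([False] * 6 + [True] * 2)
print(aux_loss(probs, routing))       # averages over the padded slots too
print(aux_loss(probs, routing, pad))  # padded slots excluded from the statistics
```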
Both code paths in `pack_or_pad_batch` introduce padded positions:
BSHD pads each sample to `target_seqlens` with input_id=0 /
label=-100 / loss_mask=0; THD pads each sample's length to a multiple
of `divisible_by` so `cu_seqlens_q_padded` differs from `cu_seqlens_q`.
Fix:
- Build `padding_mask` at collate time in both branches (BSHD: ``[B,
target_seqlens]``; THD: ``[1, T]``). True marks collate-padded
positions only — distinct from `loss_mask`, so SFT prompt tokens
(which carry `loss_mask=0` but are real tokens) still participate
in routing. A sketch of both branches follows this list.
- Thread `padding_mask` through `forward_step` → `MultimodalModel.forward`
→ `_cp_split_for_forward` → `GPTModel.forward`, mirroring how
`loss_mask` is handled. CP split uses the same BSHD zigzag /
THD `tex.thd_get_partitioned_indices` index as the other tensors.
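A minimal sketch of the two collate-time masks, assuming hypothetical helper names and argument layouts; only the shapes and the True-means-collate-padding convention come from the description above:

```python
import torch

def build_bshd_padding_mask(real_lens, target_seqlens):
    """[B, target_seqlens] bool, True past each sample's real length."""
    positions = torch.arange(target_seqlens)
    return positions.unsqueeze(0) >= torch.as_tensor(real_lens).unsqueeze(1)

def build_thd_padding_mask(cu_seqlens_padded, real_lens):
    """[1, T] bool, True between cu_seqlens_padded[i] + real_lens[i] and cu_seqlens_padded[i+1]."""
    total = int(cu_seqlens_padded[-1])
    mask = torch.zeros(1, total, dtype=torch.bool)
    for i, real_len in enumerate(real_lens):
        start = int(cu_seqlens_padded[i]) + int(real_len)
        end = int(cu_seqlens_padded[i + 1])
        mask[0, start:end] = True
    return mask

# e.g. two samples of length 5 and 12, each padded to a multiple of divisible_by=8:
# cu_seqlens_q        = [0, 5, 17]   (real boundaries)
# cu_seqlens_q_padded = [0, 8, 24]   (padded boundaries)
build_thd_padding_mask(torch.tensor([0, 8, 24]), [5, 12])
```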
Dense Qwen3.5-VL variants are unaffected — `padding_mask` is only
consumed inside MoE layers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: BestJuly <19769279+BestJuly@users.noreply.github.com>
Qwen3.5 support series
This is a follow-up to the 5-PR series adding Qwen3.5-VL support; it lands on top of #4751 (the example), not the core changes.
Dev PRs:
Main mirror: opened separately against `main` as a sibling.

Why
In `examples/multimodal_dev/models/base.py`, `MultimodalModel.forward()` was calling `self.language_model(...)` without a `padding_mask` argument. `GPTModel.forward()` documents `padding_mask` as "Only used for MoE layers to exclude padding tokens from routing computations." With `padding_mask=None` the router skips three masking sites that affect MoE numerics (all in `megatron/core/transformer/moe/router.py`):

- `apply_z_loss(logits, padding_mask=None)` (line 526)
- `_apply_aux_loss(..., with_padding_mask=padding_mask is not None)` (line 736)
- `_apply_expert_bias(routing_map, padding_mask=None)` (line 604)

Both data paths in `pack_or_pad_batch` (`forward_step.py`) introduce padding:

- BSHD: each sample is padded to `target_seqlens` with `input_ids=0`, `labels=-100`, `loss_mask=0`.
- THD: each sample's length is padded to a multiple of `divisible_by`; `cu_seqlens_q_padded` differs from `cu_seqlens_q`.

For MoE variants (`proxy`, `35b_a3b`, `35b_a3b_light`, `122b_a10b`, `397b_a17b`) this means the load-balancing signal is diluted by padded positions whose router logits don't reflect any real token.

What this PR does
- Build `padding_mask` at collate time in both branches of `pack_or_pad_batch`:
  - BSHD: `[B, target_seqlens]` bool, `True` past each sample's real length.
  - THD: `[1, T]` bool, `True` between `cu_seqlens_padded[i] + real_len[i]` and `cu_seqlens_padded[i+1]`.
- Thread `padding_mask` through `forward_step` → `MultimodalModel.forward` → `_cp_split_for_forward` → `language_model.forward`, mirroring how `loss_mask` is handled. CP split uses the same BSHD zigzag / THD `tex.thd_get_partitioned_indices` index as the rest (see the threading sketch below).
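A toy sketch of the kwarg threading and the shared CP-split index. Class and method names echo the PR, but the bodies and signatures below are stand-ins rather than the code introduced in #4751:

```python
# Illustrative only: ToyMultimodalModel stands in for the real MultimodalModel.
import torch

class ToyLanguageModel:
    def forward(self, input_ids, padding_mask=None):
        # padding_mask is only consumed by MoE routers; dense paths ignore it,
        # which is why a None default keeps existing callers working.
        if padding_mask is not None:
            print("router sees", int((~padding_mask).sum()), "real tokens")
        return input_ids

class ToyMultimodalModel:
    def __init__(self):
        self.language_model = ToyLanguageModel()

    def _cp_split_for_forward(self, cp_index, *tensors):
        # The same per-rank index (BSHD zigzag / THD partition) is applied to
        # every sequence-shaped tensor, padding_mask included, so shards stay aligned.
        return tuple(None if t is None else t[:, cp_index] for t in tensors)

    def forward(self, input_ids, loss_mask, padding_mask=None, cp_index=None):
        if cp_index is not None:
            input_ids, loss_mask, padding_mask = self._cp_split_for_forward(
                cp_index, input_ids, loss_mask, padding_mask)
        return self.language_model.forward(input_ids, padding_mask=padding_mask)

# One padded sequence of length 8; this CP rank keeps positions 0-3.
model = ToyMultimodalModel()
model.forward(torch.arange(8).unsqueeze(0), torch.ones(1, 8),
              padding_mask=torch.tensor([[False] * 6 + [True] * 2]),
              cp_index=torch.arange(4))
```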
Why not derive from `loss_mask` or `labels == -100`?

For SFT data, prompt tokens carry `loss_mask=0` / `label=-100` but are real tokens that should still participate in routing. Folding them into `padding_mask` would under-route real activations — a different bug. The collate-time mask only marks tokens added by the padder.
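A toy example of the distinction (token values invented for illustration):

```python
import torch

# One SFT sample: 3 prompt tokens, 2 answer tokens, padded to length 8.
#                           prompt       answer    collate pad
input_ids    = torch.tensor([11, 12, 13,  21, 22,   0,  0,  0])
loss_mask    = torch.tensor([ 0,  0,  0,   1,  1,   0,  0,  0])          # loss only on the answer
padding_mask = torch.tensor([ 0,  0,  0,   0,  0,   1,  1,  1]).bool()   # padder-added tokens only

# Deriving padding_mask from loss_mask would also mask the three prompt
# tokens out of routing even though they are real activations.
assert padding_mask.sum() == 3 and (loss_mask == 0).sum() == 6
```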
Dependency

Depends on #4751 (PR-5: the Qwen3.5-VL example) — `MultimodalModel` and `pack_or_pad_batch` are introduced there. The diff vs `dev` therefore shows PR-5 + this fix; reviewers should compare against #4751's tip for the isolated `padding_mask` delta.

Risk

- `padding_mask` is only consumed inside MoE layers.
- `_cp_split_for_forward` and `MultimodalModel.forward` gain one optional kwarg with a `None` default; existing callers are unaffected.

🤖 Generated with Claude Code