feat: adding PP and CP for nemotron v3 models#2316
Conversation
… fix Source-only snapshot for adil/adil-test (YAMLs intentionally excluded). - NemotronV3 pipeline-parallel + MTP wiring (model.py, mtp.py, layers.py, state_dict_adapter.py, common/mtp/mtp.py). - Loss path: PipelineCausalLMLoss, calculate_mtp_loss with cu_seqlens / seq_idx threading, MoEAuxLossAutoScaler-aware grad scaling in train_ft. - Attention preprocess threading for THD + CP (attention/utils.py). - MoE parallelizer: iterate transformer and MTP blocks together for CP hook attach and FSDP wrapping. THD collator fix: - process_input_for_thd: gate trailing-pack-pad absorption on trailing_pad <= max_seqlen; for larger pads, extend cu_seqlens with dummy max_seqlen-sized slots and regenerate position_ids per-slot. - split_batch_into_thd_chunks: emit cu_seqlens_padded whenever any chunk needs it (any-vs-all), absorbed chunks fall back to cu_seqlens. - NemotronV3Model.forward and NemotronHForCausalLM MTP-kwargs: strip -1000 sentinel right-pad from cu_seqlens / cu_seqlens_padded before they reach TE attention / mamba searchsorted. Tests: - tests/unit_tests/distributed/test_thd_utils.py: dummy-slot extension coverage, mixed-chunk short+full coverage; updated existing tests to reflect cu_seqlens emit semantics (real cumsum, not padded). - tests/unit_tests/models/nemotron_v3/test_nemotron_v3_mtp.py + test_nemotron_v3_pp_mtp.py: list-form mtp_layers_block_type resolution, PP-stage metas, MTP placement on last stage, sentinel filter on trimmed mid-stages. Dataset: - nemo_automodel/components/datasets/llm/pre_rendered_chat_dataset.py: new dataset for pre-rendered chat data. Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Not used by the THD/PP/MTP changes on this branch; keep it out of the remote until it ships with its own dedicated PR. Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…contract The THD collator's trailing-pad absorption grows ``cu_seqlens[-1]`` to ``total_tokens`` so TE can skip the ``pad_between_seqs=True`` path. But ``max_seqlen`` was computed BEFORE the absorption from the unpadded ``seq_lens`` — so we handed TE a ``cu_seqlens`` containing a slot wider than the ``max_seqlen`` value we claimed. That is explicit UB per TE's documented contract (``fused_attn.py:152-159``, ``fused_attn.h:548-551``: ``max_seqlen_q ... may be >= max(seqlen_q_i)``; smaller-than-actual is undefined). cuDNN-fused-attn-bwd happened not to crash on the small-overshoot case (128 vs 112) but did crash on the larger captured "short microbatch" (576 vs 112). Fix: defer the ``max_seqlen`` computation to after the absorption block and source it from the FINAL ``cu_seqlens`` slot widths. With the value now truthful, absorption is contract-clean for any trailing-pad size, so the previous defensive dummy-slot extension is no longer needed. Verified with a single-GPU TE probe (te_thd_repro_MINIMAL.py): feeding ``cu_seqlens=[0,112,224,336,448,1024]`` with truthful ``max_seqlen=576`` runs cleanly; the same layout with the lied ``max_seqlen=112`` still crashes. Smoke tests pass at PP=4 EP=2 CP=1 and PP=4 EP=2 CP=2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
``cu_seqlens`` and ``cu_seqlens_padded`` are constructed inside nested ``if seq_lens is not None``/``if seq_lens_padded is not None`` blocks, so ``cu_seqlens_padded`` can only be non-None when ``cu_seqlens`` is also non-None — the ``else cu_seqlens_padded`` fallback is unreachable. Falling back to the padded variant would also be semantically wrong (memory offsets vs. compute extent, per TE's docs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…tics After the trailing-pad absorption + max_seqlen fix, ``cu_seqlens`` is emitted from ``seq_lens`` (real lengths), not ``seq_lens_padded``. The docstrings still said the opposite and the example output value was wrong. Also document the optional ``cu_seqlens_padded`` and ``max_seqlen`` keys that the function emits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Keep the comments that explain WHY (TE contract, the absorption gate, the pad_between_seqs flip on cu_seqlens_padded presence). Drop the ones that just restate WHAT the surrounding code already shows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
The cu_seqlens → seq_idx derivation used the default ``side="left"`` for
``torch.searchsorted``. By that contract a position equal to a boundary
(``t == cu_seqlens[k]`` — the FIRST token of sub-seq k) gets classified
as sub-seq k-1 instead of k, because cu_seqlens[1:][k-1] = cu_seqlens[k]
is ≥ t and side="left" picks the leftmost insertion point. That is an
off-by-one at every internal sub-seq boundary.
Two production sites were affected:
* ``loss/mtp.py:calculate_mtp_loss`` — derives seq_idx from cu_seqlens
to mask MTP labels whose rolled source crosses a sub-seq boundary.
Without right=True, the boundary token's mask is wrong.
* ``models/nemotron_v3/layers.py:NemotronV3Mamba2Mixer.forward`` —
derives seq_idx for mamba's SSD scan state-reset. Without right=True,
the state reset happens one token late at every boundary, so the
first token of a new sub-seq sees the previous sub-seq's accumulated
state.
Fix is a one-line change at each site: add ``right=True`` to
``torch.searchsorted``.
Tests:
* tests/unit_tests/loss/test_mtp_cross_boundary.py — 7 tests pinning
the MTP cross-boundary mask behavior against a hand-coded reference.
* tests/unit_tests/distributed/test_seq_idx_from_cu_seqlens.py — 34
tests covering deterministic edge cases, randomized exhaustive
verification against a brute-force linear scan, production-site
pinning, and a static-analysis guard so reverts to side="left" fail.
Smoke verification: 8-GPU PP=4 EP=2 CP=1 hellaswag smoke produces a
loss trajectory bit-identical to pre-fix (hellaswag's pad-to-max-length
data has long pad runs between sub-seqs that absorb the bug's effect
on this dataset; the fix matters more for tightly-packed data).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Drops the env-var-gated instrumentation added during the
nemotron_v3 PP+MTP + THD-collator bug hunt:
* NEMO_PP_PACK_DEBUG — per-microbatch packing/PP info prints from
train_ft.py, nemotron_v3/{model,layers}.py, and common/mtp/mtp.py.
* NEMO_CU_DEBUG — cu_seqlens print inside the model forward.
* NEMO_DUMP_BATCH_DIR — per-step per-rank batch tensor dump for
offline replay.
* NEMO_REPLAY_BATCH_DIR / NEMO_REPLAY_BATCH_STEP /
NEMO_REPLAY_OVERRIDE_STEP / NEMO_REPLAY_OVERRIDE_WHAT — offline
batch replay + mixed-field override hooks.
* NEMO_COPY_MB0_TO_ALL — duplicate microbatch 0 across all
microbatch slots.
Also cleans up the few helper variables (``_seq_idx_source``,
``cu_seqlens_source``) that only existed to populate those prints,
and removes the now-unused ``import os`` from layers.py and model.py.
Net: -291 LOC across 4 files. 8-GPU PP=4 EP=2 CP=1 hellaswag smoke
is bit-identical to pre-cleanup (10/10 steps, val 8.9692).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…ition_ids
At MTP depth k, the slot at position t embeds the token originally at
position ``t + k + 1``. That is implemented in ``MTPModule.forward``
by cumulatively rolling ``cur_input_ids`` and ``cur_position_ids`` left
by one at each depth iteration. The label-side rolling on the loss path
is already covered by ``tests/unit_tests/loss/test_mtp_cross_boundary.py``;
this adds the matching coverage on the model side.
Six new tests using a recording sublayer to intercept what reaches the
inner block at each depth:
* test_cumulative_input_ids_rolling_1d / _position_ids_rolling_1d /
_rolling_2d_batch — depth k receives inputs rolled by -(k+1).
* test_multi_sublayer_per_depth_sees_same_rolled_inputs — when
pattern_length > 1, all sublayers of a depth share the same rolled
inputs (no further intra-depth rolling); embed_input is only on
sublayer 0.
* test_precomputed_embed_inputs_path_skips_token_rolling — when the
caller pre-computes embeddings (PP final-stage path), the embed
rolling is skipped but position_ids still rolls.
* test_trailing_positions_zero_after_roll — roll is left-shift with
zeros, not wrap-around; the last k+1 slots are 0 at depth k.
CPU-only, no GPU required, ~4s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Keep the WHY notes that explain non-obvious behavior (the ``right=True`` boundary semantics for searchsorted, the PP-chunking shape contract, and the cross-sub-seq mask rationale). Drop the WHAT-narrating comments — the surrounding code already says what it does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…forward Two comment blocks in ``NemotronV3Model.forward`` and the MTP-kwargs path claimed PyTorch PP could not chunk ``cu_seqlens`` and ``broadcasts kwargs unchunked``. That was true earlier when ``cu_seqlens`` flowed as 1D ``[num_seqs+1]``; after ``split_batch_into_thd_chunks``, ``cu_seqlens`` is emitted stacked as ``[num_microbatches, max_K]`` and PP ``tensor_split``s along dim 0, yielding ``[1, max_K]`` per microbatch. The normalization code (squeeze + sentinel strip + max_seqlen flatten) is unchanged — the shapes it handles are exactly what PP produces today; only the comments were misdescribing the mechanism. Also reframes the cu_seqlens rebuild-from-attention_mask block as the non-THD-collater path (which is the real reason it exists) rather than the now-incorrect "PP can't chunk" rationale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Keep WHY-explaining notes (PP tuple arity contract, sentinel-strip rationale, FSDP2 lazy-init constraint, CP context forwarding rationale, THD-vs-BSHD backend gating). Drop WHAT-narrating prose that just restates what the next few lines do. Net -106 LOC in model.py. Source behavior unchanged; verified by ``pytest tests/unit_tests/models/nemotron_v3/`` (3 pre-existing ``seq_idx``-tail arity test failures unchanged from prior runs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Third instance of the cu_seqlens → seq_idx off-by-one. The PP last-stage seq_idx tail (used by ``PipelineCausalLMLoss`` for the MTP cross-boundary mask) derived per-token sub-seq ids via ``searchsorted`` with the default ``side="left"``, which mis-classifies every internal boundary token (``position == cu_seqlens[k]``) as belonging to sub-seq ``k-1`` instead of ``k``. The loss function's own ``right=True`` fix only protects its cu_seqlens-derivation fallback (mtp.py:87); when an explicit ``seq_idx`` is supplied (the PP path), the loss uses it verbatim. So this builder was the active source of off-by-one masks at sub-seq boundaries for THD+MTP+PP training. Matches the prior fixes in ``loss/mtp.py`` and ``models/nemotron_v3/layers.py``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…alLM.forward The trim pass dropped the per-parameter Args section that's on main. Restore it (with the new ``*mtp_embed_inputs`` positional plus the PP-aware tuple Returns), keep the existing high-level summary and PP-awareness paragraph from the recent trim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…ocstring
Two small corrections:
* The summary paragraph claimed the last-PP-stage return was
``(logits, *mtp_per_depth_h)``, but with MTP enabled the actual return
appends a ``seq_idx`` tail (``1 + D + 1`` arity), as the Returns block
just below already states. Defer to the Returns section instead of
duplicating (and contradicting) it.
* The THD-squeeze sentence over-claimed: the squeeze only happens when
the attention backend is TE; SDPA / flex paths stay in BSHD even with
``qkv_format='thd'``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Drop the multi-line preamble explaining why we iterate both backbone and MTP attention blocks — ``_iter_transformer_and_mtp_blocks`` is named self-descriptively. The non-obvious caveat (uninitialized CP state → illegal-memory-access) belongs in the helper's docstring, not at every call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>
…aths * Fix `NemotronV3Mamba2Mixer.forward` seq_idx builder: when the input is BSHD with B > 1 (default-collater validation), the cu_seqlens derived from the 2D attention_mask is a global cumsum across rows, not per-row. The mamba_ssm kernel asserts `seq_idx.shape == (B, S)` and treats each row independently, so build seq_idx as `zeros(B, S, int32)` for that case and keep the THD `searchsorted(right=True)` path for the 1D pack case. * Fix `calculate_mtp_loss`: when labels arrive 1D `[T]` from `process_input_for_thd` (non-PP THD path), squeeze the synthetic batch axis from each per-depth hidden tensor so FusedLinearCrossEntropy / cut_cross_entropy's `hidden_states.shape[:-1] == labels.shape` invariant holds. Adds unit tests for both code paths. Signed-off-by: Adil Asif <adasif@nvidia.com>
`_convert_single_merged_expert_to_hf_split_experts` now returns non-contiguous strided views into the model's grouped DTensor local storage whenever the source is a model DTensor with plain (non-DTensor) per-expert splits (`ep_shard == 1`), instead of allocating fresh per-expert contiguous copies. DCP's `target.copy_(source)` writes safetensors data directly through the views into the model's grouped storage, and `_from_hf_w_merged_experts` skips the rebuild for the corresponding native key (tracked in a new `_inplace_loaded_native_keys` set on the adapter). For Ultra-class MoE checkpoints (Nemotron Ultra V3, 560B params, 512 experts) this avoids tens of GB of per-expert scratch on top of the already-materialized model: the load path no longer OOMs at `to_hf` on memory-tight nodes. Save callers must materialize the views before serializing — `safetensors.torch.save` rejects non-contiguous tensors — so `save_model` now calls a new `_materialize_to_hf_views_for_save` helper that walks the dict and calls `.contiguous()` per tensor with `empty_cache` between iterations, keeping the save-time transient bounded to a single expert weight instead of allocating the full grouped set. Adds five `TestInplaceLoadViews` unit tests covering the gated and non-gated paths, the non-DTensor fallback, write-through to model storage, and from_hf rebuild-skip. Signed-off-by: Adil Asif <adasif@nvidia.com>
…acking At local_batch_size>1, multiple packs are concatenated and each pack carries its own trailing pad, so cu_seqlens (real lengths) differs from cu_seqlens_padded (padded slot lengths) at MULTIPLE entries — the single-trailing-pad absorption in thd_utils does not apply, and hidden states / labels are in the PADDED layout. Three coupled fixes: - mamba (nemotron_v3/layers.py): build seq_idx from cu_seqlens_padded (fallback cu_seqlens) so the SSD scan segments the padded hidden states correctly. - MTP loss (loss/mtp.py): total_len = labels.shape[-1] so the searchsorted seq_idx spans the full padded token axis. - train_ft (non-PP): pass cu_seqlens into calculate_mtp_loss so cross-sub-sequence label rolls are masked, matching the PP path. Verified on the full 560B model (4 nodes, ep32, deepep, MTP) at mbs=4: dropping the first two reproduces a 'tensor a (8064) must match tensor b (8192)' crash in the MTP cross_seq mask (real vs padded token count), confirming all three are necessary at mbs>1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Adil Asif <adasif@nvidia.com>
The shared experts run on a side CUDA stream to overlap the grouped-expert dispatch comm. The forward used forward-only wait_stream barriers with no record_stream, so the backward had no mirror fences: the side-stream grad raced with main-stream grad accumulation and the allocator could recycle in-use buffers, corrupting gradients (grad-norm explosion 1e7+ -> deepep deadlock / CUDA launch failure) at >=8192 tokens/microbatch on H100 (timing/hardware-dependent). Wrap the side-stream segment in two identity autograd Functions (_SharedExpertStreamFork on x, _SharedExpertStreamJoin on z applied after experts() launches) that fence BOTH passes and record_stream the cross-stream tensors. shared_experts stays in the normal autograd graph, so FSDP2 post-backward reduce-scatter hooks and gradient accumulation are unaffected; expert parallelism (routed experts, main stream) is untouched. Now that the overlap is backward-correct, remove the disable_shared_expert_overlap escape hatch (BackendConfig field + gating branch + its two unit tests); the overlap is unconditional. Validated mbs=4 on the full 560B model (4 nodes, ep32, deepep, MTP, AC): overlap-on grad matches the sequential trajectory exactly; moe unit tests pass (148). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Adil Asif <adasif@nvidia.com>
|
/claude review |
There was a problem hiding this comment.
Light code review — PP and CP for nemotron v3
Comprehensive PR with thorough test coverage (7 new/updated test files). The shared-expert stream overlap fix (_SharedExpertStreamFork/_SharedExpertStreamJoin autograd Functions) and the in-place MoE view-based load optimization are well-designed.
One latent bug flagged inline: the cu_seqlens derivation from a 2D attention_mask at model.py:188-189 produces a global cross-batch cumulative sum that is semantically incorrect for B > 1. Currently masked by downstream guards, but two secondary consumers (PP _seq_idx_tail, MTP kwargs forwarding) read it without a B > 1 guard. See inline comment for details and suggested fix.
Everything else looks clean — the THD cu_seqlens/cu_seqlens_padded split, the trailing-pad absorption heuristic, the MTP cross-boundary label masking, and the PP stage meta/forward contract are all correctly wired.
Behavior-preserving cleanups to the non-PP THD/MTP path and surrounding
gating logic:
- _uses_thd_collater: drop the unreachable __name__/__module__ fallback and
the string-path match. ConfigNode resolves any *_fn key (collate_fn ends in
_fn) from its dotted-path string to the actual callable at load time, so the
value is always the function, never a string — a single identity check
suffices.
- MTP loss (non-PP): inline cu_seqlens=batch.get("cu_seqlens") at the call
site instead of hoisting a _mtp_cu_seqlens copy. filter_forward_kwargs keeps
cu_seqlens (the model forward takes **kwargs) and **batch does not pop it,
so the value is identical.
- PP cu_seqlens->loss block: rename _cu/_loss_fn_obj to cu_seqlens/pp_loss_fn,
drop the redundant shape[0]==1 and isinstance guards (squeeze(0) is a no-op
when dim 0 != 1), and flatten the nested conditionals. Comment is now
model-agnostic and states the constraint (one cu_seqlens encodes a single
shared layout, correct only at one pack/microbatch per step; the seq_idx
tail handles differing per-microbatch boundaries).
- Remove the dead _te_attn / _model_or_cfg locals: leftovers from the old gate
that required TE attention on this rank; the gate is now THD-collater based.
- moe/state_dict_mixin: genericize two illustrative comments (no behavior
change).
Validated on the full 560B model (4 nodes, EP32, deepep, MTP, THD packing) at
mbs=4: a clean 25-step run, loss 2.24 -> 1.61, grad_norm 0.11-0.32 (no
explosion), no crash/NaN/OOM.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Adil Asif <adasif@nvidia.com>
PR #2316 review flagged a latent bug: the non-THD fallback that recovers cu_seqlens from a 2D attention_mask used seq_lens.cumsum(0), which at B>1 is a global cross-row offset (a single flattened-stream descriptor applied across independent batch rows). mamba was already guarded, but the PP seq_idx tail and the MTP kwargs forwarding consumed it unguarded. - model.py: gate the derivation on attention_mask.shape[0] == 1. At B>1 the rows are independent sequences, so cu_seqlens is simply not synthesized and every consumer treats each row as one sequence (mamba per-row seq_idx, the PP tail's no-mask sentinel, no MTP forwarding). - layers.py: restructure the mamba seq_idx build so B>1 BSHD always gets per-row zeros independent of cu_seqlens, preserving the prior mamba behavior (no None-vs-zeros change) while the derivation gate keeps the bad cu_seqlens out of the other two consumers. Stale-test fixes surfaced by running the full unit suite on GPU: - test_nemotron_v3_pp_mtp.py: the final PP stage now appends an int32 [B,S] seq_idx tail (logits, *mtp_per_depth_h, seq_idx); update the 3 arity/shape assertions to expect it (it predates the tail). - moe/test_layers.py, moe/test_latent_projection.py: the shared-expert fork/join calls Tensor.record_stream with the mocked torch.cuda.Stream; no-op _record_stream_safe in the existing stream-mock block so the Mock isn't passed to record_stream (test-only; no source change). Verified on 4xH100 node: tests/unit_tests/moe + tests/unit_tests/models/ nemotron_v3 = 640 passed, 3 skipped, 0 failed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Adil Asif <adasif@nvidia.com>
Resolve calculate_mtp_loss conflict: keep this branch's mbs>1 THD seq_idx/cu_seqlens derivation + cross-boundary label masking alongside main's new mtp_per_depth_logits path and 3-way loss branch (logits / FusedLinearCrossEntropy hidden-states / lm_head). Signed-off-by: Adil Asif <adasif@nvidia.com>
|
/ok to test 6c49fd5 |
|
/claude review |
| dispatcher_num_sms: int = 20 | ||
| dispatcher_share_token_dispatcher: bool = True | ||
| dispatcher_async_dispatch: bool = False | ||
| disable_shared_expert_overlap: bool = False |
Resolve conflicts in loss/mtp.py and recipes/llm/train_ft.py by keeping both sides: Alexandros' typed MTPLossConfig (scaling_factor/ignore_index) alongside main's PP/CP seq_idx + cu_seqlens plumbing from #2316. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Resolve conflicts in loss/mtp.py and recipes/llm/train_ft.py by keeping both sides: Alexandros' typed MTPLossConfig (scaling_factor/ignore_index) alongside main's PP/CP seq_idx + cu_seqlens plumbing from #2316. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Conflicts resolved: - checkpoint/checkpointing.py: main moved _is_geq_torch_2_9 / CheckpointingConfig / SaveConsolidatedMode into checkpoint/config.py and added dtype-mapping normalization; kept main's import-based structure and preserved the nemotron-singlegpu-lora module NOTE. Dropped the now-redundant local _is_geq_torch_2_9 definition. - models/nemotron_v3/model.py: kept both independent method sets — this branch's gradient_checkpointing_enable/disable and main's PP-stage helpers (#2316). - examples/llm_finetune/gpt_oss/gpt_oss_20b_single_gpu.yaml: kept deleted (this branch removed it together with its test-config references; nothing in the merged tree references it — only the _peft variant remains). Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
What does this PR do ?
Adds pipeline parallelism (PP) and context parallelism (CP) support for the NemotronV3/H (hybrid mamba + attention + MoE) model family, including Multi-Token Prediction (MTP) and THD sequence packing, plus the supporting MoE checkpoint and shared-expert changes these models need.
Changelog
Pipeline parallelism (NemotronV3/H)
(hidden_states, *mtp_embed_inputs); the last stage returns(logits, *mtp_per_depth_h, seq_idx). The trailing[B, S] int32seq_idxbinds each microbatch's sub-sequence layout to its own loss call via the PP runtime's output→loss contract, so the wiring is schedule-agnostic.Context parallelism (NemotronV3/H)
context_parallel()+ SDPA), with THD-format CP batch sharding.Multi-Token Prediction (MTP)
k+1masking.PipelineCausalLMLossadds the MTP auxiliary CE on the last stage; supports both hidden-state and pre-computed-logits inputs.seq_idx(preferred) with acu_seqlensfallback for models that don't emit aseq_idxtail.THD sequence packing (
thd_utils)cu_seqlensplus a separatecu_seqlens_paddedandmax_seqlenthat honor Transformer Engine's THD contract; trailing-pad absorption for the single-pack case.seq_idxderived from the padded layout (searchsorted(..., right=True)) so the SSD scan segments packed bins correctly at micro-batch > 1.cu_seqlens-from-2D-attention_maskrecovery on batch size == 1 (a single flattened stream); at B>1 each row is an independent sequence, so cu_seqlens is left unset and each consumer treats every row on its own.MoE supporting changes
_SharedExpertStreamFork/_SharedExpertStreamJoinidentity autograd functions add bidirectional stream fencing +record_streamso the side-stream overlap is correct in both forward and backward.to_hfexpert splitting uses in-place strided views into the model's grouped storage (avoids large per-expert scratch on memory-tight nodes), with a save-time materialization helper for safetensors.Recipe (
train_ft)cu_seqlensinto the non-PP MTP loss.Before your PR is "Ready for review"
Pre checks:
seq_idx-from-cu_seqlens, MTP cross-boundary masking, MTP rolling, MoE state-dict incl. LoRA and in-place views, and context-parallel for attention / mamba / hybrid)Additional Information