feat: adding PP and CP for nemotron v3 models by adil-a · Pull Request #2316 · NVIDIA-NeMo/Automodel

adil-a · 2026-05-25T23:30:55Z

What does this PR do ?

Adds pipeline parallelism (PP) and context parallelism (CP) support for the NemotronV3/H (hybrid mamba + attention + MoE) model family, including Multi-Token Prediction (MTP) and THD sequence packing, plus the supporting MoE checkpoint and shared-expert changes these models need.

Changelog

Pipeline parallelism (NemotronV3/H)

Per-stage forward contract: mid stages return (hidden_states, *mtp_embed_inputs); the last stage returns (logits, *mtp_per_depth_h, seq_idx). The trailing [B, S] int32 seq_idx binds each microbatch's sub-sequence layout to its own loss call via the PP runtime's output→loss contract, so the wiring is schedule-agnostic.
MTP embed-input propagation between stages and last-stage-only MTP head ownership.

Context parallelism (NemotronV3/H)

CP for both attention and the Mamba2 mixer (TE p2p / DualChunkSwap and DTensor context_parallel() + SDPA), with THD-format CP batch sharding.

Multi-Token Prediction (MTP)

MTP head with optional repeated-layer reuse; per-depth label rolling with trailing-k+1 masking.
PipelineCausalLMLoss adds the MTP auxiliary CE on the last stage; supports both hidden-state and pre-computed-logits inputs.
Cross-sub-sequence label-roll masking via seq_idx (preferred) with a cu_seqlens fallback for models that don't emit a seq_idx tail.

THD sequence packing (thd_utils)

Emit real cu_seqlens plus a separate cu_seqlens_padded and max_seqlen that honor Transformer Engine's THD contract; trailing-pad absorption for the single-pack case.
Mamba seq_idx derived from the padded layout (searchsorted(..., right=True)) so the SSD scan segments packed bins correctly at micro-batch > 1.
Gate the cu_seqlens-from-2D-attention_mask recovery on batch size == 1 (a single flattened stream); at B>1 each row is an independent sequence, so cu_seqlens is left unset and each consumer treats every row on its own.

MoE supporting changes

Backward-correct shared-expert stream overlap: _SharedExpertStreamFork / _SharedExpertStreamJoin identity autograd functions add bidirectional stream fencing + record_stream so the side-stream overlap is correct in both forward and backward.
Checkpoint to_hf expert splitting uses in-place strided views into the model's grouped storage (avoids large per-expert scratch on memory-tight nodes), with a save-time materialization helper for safetensors.

Recipe (train_ft)

Gate THD/cu_seqlens processing on the dataset being THD-packed rather than on TE attention being present on a rank, so PP stages without attention layers still build cu_seqlens downstream; plumb cu_seqlens into the non-PP MTP loss.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests? (unit + functional: PP/MTP, THD utils, seq_idx-from-cu_seqlens, MTP cross-boundary masking, MTP rolling, MoE state-dict incl. LoRA and in-place views, and context-parallel for attention / mamba / hybrid)
Did you add or update any necessary documentation?

Additional Information

Enables PP and CP for the NemotronV3/H model family. CI (incl. the L0 unit suites and the L2 context-parallel functional tests) is green.

… fix Source-only snapshot for adil/adil-test (YAMLs intentionally excluded). - NemotronV3 pipeline-parallel + MTP wiring (model.py, mtp.py, layers.py, state_dict_adapter.py, common/mtp/mtp.py). - Loss path: PipelineCausalLMLoss, calculate_mtp_loss with cu_seqlens / seq_idx threading, MoEAuxLossAutoScaler-aware grad scaling in train_ft. - Attention preprocess threading for THD + CP (attention/utils.py). - MoE parallelizer: iterate transformer and MTP blocks together for CP hook attach and FSDP wrapping. THD collator fix: - process_input_for_thd: gate trailing-pack-pad absorption on trailing_pad <= max_seqlen; for larger pads, extend cu_seqlens with dummy max_seqlen-sized slots and regenerate position_ids per-slot. - split_batch_into_thd_chunks: emit cu_seqlens_padded whenever any chunk needs it (any-vs-all), absorbed chunks fall back to cu_seqlens. - NemotronV3Model.forward and NemotronHForCausalLM MTP-kwargs: strip -1000 sentinel right-pad from cu_seqlens / cu_seqlens_padded before they reach TE attention / mamba searchsorted. Tests: - tests/unit_tests/distributed/test_thd_utils.py: dummy-slot extension coverage, mixed-chunk short+full coverage; updated existing tests to reflect cu_seqlens emit semantics (real cumsum, not padded). - tests/unit_tests/models/nemotron_v3/test_nemotron_v3_mtp.py + test_nemotron_v3_pp_mtp.py: list-form mtp_layers_block_type resolution, PP-stage metas, MTP placement on last stage, sentinel filter on trimmed mid-stages. Dataset: - nemo_automodel/components/datasets/llm/pre_rendered_chat_dataset.py: new dataset for pre-rendered chat data. Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Not used by the THD/PP/MTP changes on this branch; keep it out of the remote until it ships with its own dedicated PR. Signed-off-by: adil-a <adil.asif2000@hotmail.com>

copy-pr-bot · 2026-05-25T23:30:59Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…contract The THD collator's trailing-pad absorption grows ``cu_seqlens[-1]`` to ``total_tokens`` so TE can skip the ``pad_between_seqs=True`` path. But ``max_seqlen`` was computed BEFORE the absorption from the unpadded ``seq_lens`` — so we handed TE a ``cu_seqlens`` containing a slot wider than the ``max_seqlen`` value we claimed. That is explicit UB per TE's documented contract (``fused_attn.py:152-159``, ``fused_attn.h:548-551``: ``max_seqlen_q ... may be >= max(seqlen_q_i)``; smaller-than-actual is undefined). cuDNN-fused-attn-bwd happened not to crash on the small-overshoot case (128 vs 112) but did crash on the larger captured "short microbatch" (576 vs 112). Fix: defer the ``max_seqlen`` computation to after the absorption block and source it from the FINAL ``cu_seqlens`` slot widths. With the value now truthful, absorption is contract-clean for any trailing-pad size, so the previous defensive dummy-slot extension is no longer needed. Verified with a single-GPU TE probe (te_thd_repro_MINIMAL.py): feeding ``cu_seqlens=[0,112,224,336,448,1024]`` with truthful ``max_seqlen=576`` runs cleanly; the same layout with the lied ``max_seqlen=112`` still crashes. Smoke tests pass at PP=4 EP=2 CP=1 and PP=4 EP=2 CP=2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

``cu_seqlens`` and ``cu_seqlens_padded`` are constructed inside nested ``if seq_lens is not None``/``if seq_lens_padded is not None`` blocks, so ``cu_seqlens_padded`` can only be non-None when ``cu_seqlens`` is also non-None — the ``else cu_seqlens_padded`` fallback is unreachable. Falling back to the padded variant would also be semantically wrong (memory offsets vs. compute extent, per TE's docs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

…tics After the trailing-pad absorption + max_seqlen fix, ``cu_seqlens`` is emitted from ``seq_lens`` (real lengths), not ``seq_lens_padded``. The docstrings still said the opposite and the example output value was wrong. Also document the optional ``cu_seqlens_padded`` and ``max_seqlen`` keys that the function emits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Keep the comments that explain WHY (TE contract, the absorption gate, the pad_between_seqs flip on cu_seqlens_padded presence). Drop the ones that just restate WHAT the surrounding code already shows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

The cu_seqlens → seq_idx derivation used the default ``side="left"`` for ``torch.searchsorted``. By that contract a position equal to a boundary (``t == cu_seqlens[k]`` — the FIRST token of sub-seq k) gets classified as sub-seq k-1 instead of k, because cu_seqlens[1:][k-1] = cu_seqlens[k] is ≥ t and side="left" picks the leftmost insertion point. That is an off-by-one at every internal sub-seq boundary. Two production sites were affected: * ``loss/mtp.py:calculate_mtp_loss`` — derives seq_idx from cu_seqlens to mask MTP labels whose rolled source crosses a sub-seq boundary. Without right=True, the boundary token's mask is wrong. * ``models/nemotron_v3/layers.py:NemotronV3Mamba2Mixer.forward`` — derives seq_idx for mamba's SSD scan state-reset. Without right=True, the state reset happens one token late at every boundary, so the first token of a new sub-seq sees the previous sub-seq's accumulated state. Fix is a one-line change at each site: add ``right=True`` to ``torch.searchsorted``. Tests: * tests/unit_tests/loss/test_mtp_cross_boundary.py — 7 tests pinning the MTP cross-boundary mask behavior against a hand-coded reference. * tests/unit_tests/distributed/test_seq_idx_from_cu_seqlens.py — 34 tests covering deterministic edge cases, randomized exhaustive verification against a brute-force linear scan, production-site pinning, and a static-analysis guard so reverts to side="left" fail. Smoke verification: 8-GPU PP=4 EP=2 CP=1 hellaswag smoke produces a loss trajectory bit-identical to pre-fix (hellaswag's pad-to-max-length data has long pad runs between sub-seqs that absorb the bug's effect on this dataset; the fix matters more for tightly-packed data). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Drops the env-var-gated instrumentation added during the nemotron_v3 PP+MTP + THD-collator bug hunt: * NEMO_PP_PACK_DEBUG — per-microbatch packing/PP info prints from train_ft.py, nemotron_v3/{model,layers}.py, and common/mtp/mtp.py. * NEMO_CU_DEBUG — cu_seqlens print inside the model forward. * NEMO_DUMP_BATCH_DIR — per-step per-rank batch tensor dump for offline replay. * NEMO_REPLAY_BATCH_DIR / NEMO_REPLAY_BATCH_STEP / NEMO_REPLAY_OVERRIDE_STEP / NEMO_REPLAY_OVERRIDE_WHAT — offline batch replay + mixed-field override hooks. * NEMO_COPY_MB0_TO_ALL — duplicate microbatch 0 across all microbatch slots. Also cleans up the few helper variables (``_seq_idx_source``, ``cu_seqlens_source``) that only existed to populate those prints, and removes the now-unused ``import os`` from layers.py and model.py. Net: -291 LOC across 4 files. 8-GPU PP=4 EP=2 CP=1 hellaswag smoke is bit-identical to pre-cleanup (10/10 steps, val 8.9692). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

…ition_ids At MTP depth k, the slot at position t embeds the token originally at position ``t + k + 1``. That is implemented in ``MTPModule.forward`` by cumulatively rolling ``cur_input_ids`` and ``cur_position_ids`` left by one at each depth iteration. The label-side rolling on the loss path is already covered by ``tests/unit_tests/loss/test_mtp_cross_boundary.py``; this adds the matching coverage on the model side. Six new tests using a recording sublayer to intercept what reaches the inner block at each depth: * test_cumulative_input_ids_rolling_1d / _position_ids_rolling_1d / _rolling_2d_batch — depth k receives inputs rolled by -(k+1). * test_multi_sublayer_per_depth_sees_same_rolled_inputs — when pattern_length > 1, all sublayers of a depth share the same rolled inputs (no further intra-depth rolling); embed_input is only on sublayer 0. * test_precomputed_embed_inputs_path_skips_token_rolling — when the caller pre-computes embeddings (PP final-stage path), the embed rolling is skipped but position_ids still rolls. * test_trailing_positions_zero_after_roll — roll is left-shift with zeros, not wrap-around; the last k+1 slots are 0 at depth k. CPU-only, no GPU required, ~4s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Keep the WHY notes that explain non-obvious behavior (the ``right=True`` boundary semantics for searchsorted, the PP-chunking shape contract, and the cross-sub-seq mask rationale). Drop the WHAT-narrating comments — the surrounding code already says what it does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

…forward Two comment blocks in ``NemotronV3Model.forward`` and the MTP-kwargs path claimed PyTorch PP could not chunk ``cu_seqlens`` and ``broadcasts kwargs unchunked``. That was true earlier when ``cu_seqlens`` flowed as 1D ``[num_seqs+1]``; after ``split_batch_into_thd_chunks``, ``cu_seqlens`` is emitted stacked as ``[num_microbatches, max_K]`` and PP ``tensor_split``s along dim 0, yielding ``[1, max_K]`` per microbatch. The normalization code (squeeze + sentinel strip + max_seqlen flatten) is unchanged — the shapes it handles are exactly what PP produces today; only the comments were misdescribing the mechanism. Also reframes the cu_seqlens rebuild-from-attention_mask block as the non-THD-collater path (which is the real reason it exists) rather than the now-incorrect "PP can't chunk" rationale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Keep WHY-explaining notes (PP tuple arity contract, sentinel-strip rationale, FSDP2 lazy-init constraint, CP context forwarding rationale, THD-vs-BSHD backend gating). Drop WHAT-narrating prose that just restates what the next few lines do. Net -106 LOC in model.py. Source behavior unchanged; verified by ``pytest tests/unit_tests/models/nemotron_v3/`` (3 pre-existing ``seq_idx``-tail arity test failures unchanged from prior runs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Third instance of the cu_seqlens → seq_idx off-by-one. The PP last-stage seq_idx tail (used by ``PipelineCausalLMLoss`` for the MTP cross-boundary mask) derived per-token sub-seq ids via ``searchsorted`` with the default ``side="left"``, which mis-classifies every internal boundary token (``position == cu_seqlens[k]``) as belonging to sub-seq ``k-1`` instead of ``k``. The loss function's own ``right=True`` fix only protects its cu_seqlens-derivation fallback (mtp.py:87); when an explicit ``seq_idx`` is supplied (the PP path), the loss uses it verbatim. So this builder was the active source of off-by-one masks at sub-seq boundaries for THD+MTP+PP training. Matches the prior fixes in ``loss/mtp.py`` and ``models/nemotron_v3/layers.py``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

…alLM.forward The trim pass dropped the per-parameter Args section that's on main. Restore it (with the new ``*mtp_embed_inputs`` positional plus the PP-aware tuple Returns), keep the existing high-level summary and PP-awareness paragraph from the recent trim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

…ocstring Two small corrections: * The summary paragraph claimed the last-PP-stage return was ``(logits, *mtp_per_depth_h)``, but with MTP enabled the actual return appends a ``seq_idx`` tail (``1 + D + 1`` arity), as the Returns block just below already states. Defer to the Returns section instead of duplicating (and contradicting) it. * The THD-squeeze sentence over-claimed: the squeeze only happens when the attention backend is TE; SDPA / flex paths stay in BSHD even with ``qkv_format='thd'``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

Drop the multi-line preamble explaining why we iterate both backbone and MTP attention blocks — ``_iter_transformer_and_mtp_blocks`` is named self-descriptively. The non-obvious caveat (uninitialized CP state → illegal-memory-access) belongs in the helper's docstring, not at every call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: adil-a <adil.asif2000@hotmail.com>

…aths * Fix `NemotronV3Mamba2Mixer.forward` seq_idx builder: when the input is BSHD with B > 1 (default-collater validation), the cu_seqlens derived from the 2D attention_mask is a global cumsum across rows, not per-row. The mamba_ssm kernel asserts `seq_idx.shape == (B, S)` and treats each row independently, so build seq_idx as `zeros(B, S, int32)` for that case and keep the THD `searchsorted(right=True)` path for the 1D pack case. * Fix `calculate_mtp_loss`: when labels arrive 1D `[T]` from `process_input_for_thd` (non-PP THD path), squeeze the synthetic batch axis from each per-depth hidden tensor so FusedLinearCrossEntropy / cut_cross_entropy's `hidden_states.shape[:-1] == labels.shape` invariant holds. Adds unit tests for both code paths. Signed-off-by: Adil Asif <adasif@nvidia.com>

`_convert_single_merged_expert_to_hf_split_experts` now returns non-contiguous strided views into the model's grouped DTensor local storage whenever the source is a model DTensor with plain (non-DTensor) per-expert splits (`ep_shard == 1`), instead of allocating fresh per-expert contiguous copies. DCP's `target.copy_(source)` writes safetensors data directly through the views into the model's grouped storage, and `_from_hf_w_merged_experts` skips the rebuild for the corresponding native key (tracked in a new `_inplace_loaded_native_keys` set on the adapter). For Ultra-class MoE checkpoints (Nemotron Ultra V3, 560B params, 512 experts) this avoids tens of GB of per-expert scratch on top of the already-materialized model: the load path no longer OOMs at `to_hf` on memory-tight nodes. Save callers must materialize the views before serializing — `safetensors.torch.save` rejects non-contiguous tensors — so `save_model` now calls a new `_materialize_to_hf_views_for_save` helper that walks the dict and calls `.contiguous()` per tensor with `empty_cache` between iterations, keeping the save-time transient bounded to a single expert weight instead of allocating the full grouped set. Adds five `TestInplaceLoadViews` unit tests covering the gated and non-gated paths, the non-DTensor fallback, write-through to model storage, and from_hf rebuild-skip. Signed-off-by: Adil Asif <adasif@nvidia.com>

…acking At local_batch_size>1, multiple packs are concatenated and each pack carries its own trailing pad, so cu_seqlens (real lengths) differs from cu_seqlens_padded (padded slot lengths) at MULTIPLE entries — the single-trailing-pad absorption in thd_utils does not apply, and hidden states / labels are in the PADDED layout. Three coupled fixes: - mamba (nemotron_v3/layers.py): build seq_idx from cu_seqlens_padded (fallback cu_seqlens) so the SSD scan segments the padded hidden states correctly. - MTP loss (loss/mtp.py): total_len = labels.shape[-1] so the searchsorted seq_idx spans the full padded token axis. - train_ft (non-PP): pass cu_seqlens into calculate_mtp_loss so cross-sub-sequence label rolls are masked, matching the PP path. Verified on the full 560B model (4 nodes, ep32, deepep, MTP) at mbs=4: dropping the first two reproduces a 'tensor a (8064) must match tensor b (8192)' crash in the MTP cross_seq mask (real vs padded token count), confirming all three are necessary at mbs>1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Adil Asif <adasif@nvidia.com>

The shared experts run on a side CUDA stream to overlap the grouped-expert dispatch comm. The forward used forward-only wait_stream barriers with no record_stream, so the backward had no mirror fences: the side-stream grad raced with main-stream grad accumulation and the allocator could recycle in-use buffers, corrupting gradients (grad-norm explosion 1e7+ -> deepep deadlock / CUDA launch failure) at >=8192 tokens/microbatch on H100 (timing/hardware-dependent). Wrap the side-stream segment in two identity autograd Functions (_SharedExpertStreamFork on x, _SharedExpertStreamJoin on z applied after experts() launches) that fence BOTH passes and record_stream the cross-stream tensors. shared_experts stays in the normal autograd graph, so FSDP2 post-backward reduce-scatter hooks and gradient accumulation are unaffected; expert parallelism (routed experts, main stream) is untouched. Now that the overlap is backward-correct, remove the disable_shared_expert_overlap escape hatch (BackendConfig field + gating branch + its two unit tests); the overlap is unconditional. Validated mbs=4 on the full 560B model (4 nodes, ep32, deepep, MTP, AC): overlap-on grad matches the sequential trajectory exactly; moe unit tests pass (148). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Adil Asif <adasif@nvidia.com>

adil-a · 2026-06-01T16:36:09Z

/claude review

claude

Light code review — PP and CP for nemotron v3

Comprehensive PR with thorough test coverage (7 new/updated test files). The shared-expert stream overlap fix (_SharedExpertStreamFork/_SharedExpertStreamJoin autograd Functions) and the in-place MoE view-based load optimization are well-designed.

One latent bug flagged inline: the cu_seqlens derivation from a 2D attention_mask at model.py:188-189 produces a global cross-batch cumulative sum that is semantically incorrect for B > 1. Currently masked by downstream guards, but two secondary consumers (PP _seq_idx_tail, MTP kwargs forwarding) read it without a B > 1 guard. See inline comment for details and suggested fix.

Everything else looks clean — the THD cu_seqlens/cu_seqlens_padded split, the trailing-pad absorption heuristic, the MTP cross-boundary label masking, and the PP stage meta/forward contract are all correctly wired.

Behavior-preserving cleanups to the non-PP THD/MTP path and surrounding gating logic: - _uses_thd_collater: drop the unreachable __name__/__module__ fallback and the string-path match. ConfigNode resolves any *_fn key (collate_fn ends in _fn) from its dotted-path string to the actual callable at load time, so the value is always the function, never a string — a single identity check suffices. - MTP loss (non-PP): inline cu_seqlens=batch.get("cu_seqlens") at the call site instead of hoisting a _mtp_cu_seqlens copy. filter_forward_kwargs keeps cu_seqlens (the model forward takes **kwargs) and **batch does not pop it, so the value is identical. - PP cu_seqlens->loss block: rename _cu/_loss_fn_obj to cu_seqlens/pp_loss_fn, drop the redundant shape[0]==1 and isinstance guards (squeeze(0) is a no-op when dim 0 != 1), and flatten the nested conditionals. Comment is now model-agnostic and states the constraint (one cu_seqlens encodes a single shared layout, correct only at one pack/microbatch per step; the seq_idx tail handles differing per-microbatch boundaries). - Remove the dead _te_attn / _model_or_cfg locals: leftovers from the old gate that required TE attention on this rank; the gate is now THD-collater based. - moe/state_dict_mixin: genericize two illustrative comments (no behavior change). Validated on the full 560B model (4 nodes, EP32, deepep, MTP, THD packing) at mbs=4: a clean 25-step run, loss 2.24 -> 1.61, grad_norm 0.11-0.32 (no explosion), no crash/NaN/OOM. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Adil Asif <adasif@nvidia.com>

PR #2316 review flagged a latent bug: the non-THD fallback that recovers cu_seqlens from a 2D attention_mask used seq_lens.cumsum(0), which at B>1 is a global cross-row offset (a single flattened-stream descriptor applied across independent batch rows). mamba was already guarded, but the PP seq_idx tail and the MTP kwargs forwarding consumed it unguarded. - model.py: gate the derivation on attention_mask.shape[0] == 1. At B>1 the rows are independent sequences, so cu_seqlens is simply not synthesized and every consumer treats each row as one sequence (mamba per-row seq_idx, the PP tail's no-mask sentinel, no MTP forwarding). - layers.py: restructure the mamba seq_idx build so B>1 BSHD always gets per-row zeros independent of cu_seqlens, preserving the prior mamba behavior (no None-vs-zeros change) while the derivation gate keeps the bad cu_seqlens out of the other two consumers. Stale-test fixes surfaced by running the full unit suite on GPU: - test_nemotron_v3_pp_mtp.py: the final PP stage now appends an int32 [B,S] seq_idx tail (logits, *mtp_per_depth_h, seq_idx); update the 3 arity/shape assertions to expect it (it predates the tail). - moe/test_layers.py, moe/test_latent_projection.py: the shared-expert fork/join calls Tensor.record_stream with the mocked torch.cuda.Stream; no-op _record_stream_safe in the existing stream-mock block so the Mock isn't passed to record_stream (test-only; no source change). Verified on 4xH100 node: tests/unit_tests/moe + tests/unit_tests/models/ nemotron_v3 = 640 passed, 3 skipped, 0 failed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Adil Asif <adasif@nvidia.com>

Resolve calculate_mtp_loss conflict: keep this branch's mbs>1 THD seq_idx/cu_seqlens derivation + cross-boundary label masking alongside main's new mtp_per_depth_logits path and 3-way loss branch (logits / FusedLinearCrossEntropy hidden-states / lm_head). Signed-off-by: Adil Asif <adasif@nvidia.com>

adil-a · 2026-06-01T18:44:43Z

/ok to test 6c49fd5

adil-a · 2026-06-01T18:44:54Z

/claude review

hemildesai · 2026-06-02T21:44:13Z

    dispatcher_num_sms: int = 20
    dispatcher_share_token_dispatcher: bool = True
    dispatcher_async_dispatch: bool = False
-    disable_shared_expert_overlap: bool = False


Can we keep this?

Resolve conflicts in loss/mtp.py and recipes/llm/train_ft.py by keeping both sides: Alexandros' typed MTPLossConfig (scaling_factor/ignore_index) alongside main's PP/CP seq_idx + cu_seqlens plumbing from #2316. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

Conflicts resolved: - checkpoint/checkpointing.py: main moved _is_geq_torch_2_9 / CheckpointingConfig / SaveConsolidatedMode into checkpoint/config.py and added dtype-mapping normalization; kept main's import-based structure and preserved the nemotron-singlegpu-lora module NOTE. Dropped the now-redundant local _is_geq_torch_2_9 definition. - models/nemotron_v3/model.py: kept both independent method sets — this branch's gradient_checkpointing_enable/disable and main's PP-stage helpers (#2316). - examples/llm_finetune/gpt_oss/gpt_oss_20b_single_gpu.yaml: kept deleted (this branch removed it together with its test-config references; nothing in the merged tree references it — only the _peft variant remains). Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

adil-a added 2 commits May 25, 2026 16:06

chore(datasets): drop pre_rendered_chat_dataset.py from branch

cf8f3de

Not used by the THD/PP/MTP changes on this branch; keep it out of the remote until it ships with its own dedicated PR. Signed-off-by: adil-a <adil.asif2000@hotmail.com>

adil-a and others added 19 commits May 25, 2026 19:15

comments

aa9238e

Signed-off-by: adil-a <adil.asif2000@hotmail.com>

claude Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread nemo_automodel/components/models/nemotron_v3/model.py

claude Bot reviewed Jun 1, 2026

View reviewed changes

adil-a and others added 3 commits June 1, 2026 17:11

copy-pr-bot Bot temporarily deployed to nemo-ci June 1, 2026 22:52 Inactive

copy-pr-bot Bot temporarily deployed to public June 1, 2026 22:52 Inactive

copy-pr-bot Bot temporarily deployed to public June 1, 2026 22:55 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 1, 2026 22:58 Inactive

copy-pr-bot Bot temporarily deployed to public June 1, 2026 23:02 Inactive

adil-a marked this pull request as ready for review June 2, 2026 21:34

adil-a requested review from HuiyingLi, ZhiyuLi-Nvidia, akoumpa, athitten, hemildesai, pthombre and zyzhou5 as code owners June 2, 2026 21:35

adil-a enabled auto-merge (squash) June 4, 2026 00:23

hemildesai approved these changes Jun 4, 2026

View reviewed changes

adil-a merged commit 987ba24 into main Jun 4, 2026
78 checks passed

adil-a deleted the adil/adil-test branch June 4, 2026 00:31

adil-a mentioned this pull request Jun 4, 2026

#2316 verification #2404

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: adding PP and CP for nemotron v3 models#2316

feat: adding PP and CP for nemotron v3 models#2316
adil-a merged 25 commits into
mainfrom
adil/adil-test

adil-a commented May 25, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 25, 2026

Uh oh!

adil-a commented Jun 1, 2026

Uh oh!

Uh oh!

claude Bot left a comment

Uh oh!

adil-a commented Jun 1, 2026

Uh oh!

adil-a commented Jun 1, 2026

Uh oh!

hemildesai Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

adil-a commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Changelog

Before your PR is "Ready for review"

Additional Information

Uh oh!

copy-pr-bot Bot commented May 25, 2026

Uh oh!

adil-a commented Jun 1, 2026

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Light code review — PP and CP for nemotron v3

Uh oh!

adil-a commented Jun 1, 2026

Uh oh!

adil-a commented Jun 1, 2026

Uh oh!

hemildesai Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

adil-a commented May 25, 2026 •

edited

Loading