SALMAutomodel - long-context support (chunking, AC) and batch of fixes #15648
Adds an optional AIStore GetBatch path to the multimodal conversation adapters (``NeMoMultimodalConversationJsonlAdapter`` and ``NeMoMultimodalConversationShareGPTJsonlAdapter``) and to ``SALMDataset``. When ``USE_AIS_GET_BATCH=true``, adapters build URL-backed cuts (no tar open) and the dataset uses ``AudioSamples(use_batch_loader=True)`` to issue a single batched fetch per minibatch. ``collate_conversation_audio_fault_tolerant`` is refactored to delegate loading/collation to ``AudioSamples`` and drop any conversation whose cuts didn't survive, preserving the legacy fault-tolerant semantics.
DeepEP's dispatch collective has a hardcoded 100s in-kernel barrier timeout; first-iteration rank skew (FSDP2 all-gather, kernel autotune, DeepEP Buffer construction, JIT compile) can exceed that budget and kill an otherwise-healthy run. Adds ``distributed_warmup_barriers`` (default 0 = disabled) to ``SALMAutomodel``. When enabled:

- ``on_fit_start`` runs a dummy forward between WORLD ``dist.barrier()`` calls (via ``warmup_distributed_forward``), migrating cold-start work to NCCL's tunable watchdog regime. Uses ``randn * 0.02`` input and ``seq_len=512`` to avoid gate tie-breaking routing all tokens to the lowest-index EP rank, and installs per-MoE-layer forward hooks that ``torch.cuda.synchronize()`` + ``dist.barrier()`` so per-layer skew cannot accumulate across 30+ MoE blocks into a single >100s dispatch.
- ``training_step`` / ``validation_step`` issue a pre-step WORLD barrier for the first N steps to absorb per-batch dataloader / straggler jitter.
- ``validation_step`` also barriers when a dataset is exhausted on this rank but not others, keeping MoE forward calls aligned across ranks.
Three latent bugs blocked any run using a ``MultiLayerProjectionConnector`` (or ``QformerConnector``) modality adapter:

- ``AudioPerceptionModule.encoder`` wasn't a property, so when the encoder was wrapped by ``ConformerMultiLayerFeatureExtractor`` into ``encoder_multilayer`` it was no longer reachable as ``.encoder``. ``training_step`` (and any downstream freeze / FSDP traversal code) would miss it. Redirects to ``encoder_multilayer.encoder`` when present, otherwise falls back to the directly-registered submodule.
- ``setup_speech_encoder`` loaded the pretrained ASR state dict as-is, but with a multilayer adapter the encoder weights live under ``encoder_multilayer.encoder.*`` — so every pretrained Canary weight was silently dropped by ``strict=False``. Remap ``encoder.*`` → ``encoder_multilayer.encoder.*`` before loading.
- ``setup_speech_encoder`` synced ``perception.output_dim`` to ``llm.config.hidden_size`` but left ``modality_adapter.output_dim`` alone. When a connector carries its own output projection (``MultiLayerProjectionConnector``), the outer ``perception.proj`` is ``nn.Identity()``, so the inner ``Linear(..., 4096)`` fed 4096-dim audio embeddings into a 2688-dim LLM (Nemotron-3-Nano-30B-A3B), crashing at ``replace_placeholders_and_build_targets`` with a shape mismatch. Also propagate ``hidden_size`` into ``modality_adapter.output_dim`` when present so the connector's inner projection matches the LLM.

Diagnosed from failed run ``nano-v3-canary-v2-asr-granary1p1-lr1e-4-10k-fixdp-mlproj-4node``.
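A minimal sketch of the ``encoder.*`` → ``encoder_multilayer.encoder.*`` remap described in the second bullet above; the function name and call site are illustrative, not the actual patch:

```python
import torch.nn as nn


def remap_pretrained_encoder_keys(state_dict: dict, perception: nn.Module) -> dict:
    """Illustrative remap: when the encoder is wrapped into `encoder_multilayer`,
    pretrained keys saved as `encoder.*` must be renamed to
    `encoder_multilayer.encoder.*`, otherwise `strict=False` silently drops them."""
    if not hasattr(perception, "encoder_multilayer"):
        return state_dict  # legacy layout: keys already match, nothing to do
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith("encoder."):
            # "encoder.layers.0.w" -> "encoder_multilayer.encoder.layers.0.w"
            remapped["encoder_multilayer." + key] = value
        else:
            remapped[key] = value
    return remapped
```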
Nemotron V3 (``FSDPNemotronHForCausalLM``) ignores the 2D ``attention_mask`` and requires a precomputed 4D padding-aware causal mask via ``causal_mask_mapping["full_attention"]`` — attention blocks consume the mapping, and the 2D mask is only used by Mamba blocks during prefill. Without this, padded speech/text positions were silently treated as unmasked by attention, producing a slight degradation without crashing.

- ``prepare_inputs`` precomputes ``(B, 1, T, T) = tril(ones) & pad.unsqueeze`` under ``full_attention`` when the LLM accepts it, and returns it in the batch dict.
- ``forward`` takes an optional ``causal_mask_mapping`` kwarg and only forwards it to ``self.llm(...)`` when the backbone accepts it (so vanilla HF LLMs stay compatible).
- ``configure_model`` sets ``_llm_accepts_causal_mask_mapping`` by inspecting the loaded LLM's forward signature.
- ``training_step`` / ``validation_step`` thread the mask through to ``self(...)``.

Addresses review comment: Nemotron V3 training path missing Automodel's required ``causal_mask_mapping``. Ported from NeMo-abl1-causalmask.
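A sketch of the padding-aware 4D mask construction described above (``tril(ones) & pad.unsqueeze``); the helper name and the commented call site are illustrative:

```python
import torch


def build_full_attention_mask(attention_mask_2d: torch.Tensor) -> dict:
    """Build a (B, 1, T, T) boolean padding-aware causal mask from a (B, T)
    padding mask, roughly as described for `prepare_inputs` above."""
    bsz, seq_len = attention_mask_2d.shape
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=attention_mask_2d.device)
    )                                                          # (T, T) lower-triangular
    pad = attention_mask_2d.bool().unsqueeze(1).unsqueeze(1)   # (B, 1, 1, T)
    mask = causal.unsqueeze(0).unsqueeze(0) & pad              # (B, 1, T, T)
    return {"full_attention": mask}


# illustrative call site (hypothetical names):
# batch["causal_mask_mapping"] = build_full_attention_mask(batch["attention_mask"])
```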
Previously ``training_step`` normalized CE by each rank's local labeled-token count (``sum_CE / num_frames``). With variable-length speech batches every rank contributes a differently-scaled gradient; FSDP then averages those already-normalized gradients, which is not the same as optimizing the true global mean token loss. Produces a small but systematic objective mismatch, especially with heterogeneous audio lengths. Mirror Automodel's training recipe: all-reduce ``num_frames`` across the DP process group (``self._get_moe_dp_group()``) and scale the per-rank loss by ``dp_size / num_frames_global`` so that FSDP's gradient averaging yields ``sum(rank_CE_sum) / num_frames_global``. Logged ``loss`` stays on the same (local per-token) scale as before — ``loss_display = loss_sum / num_frames`` is a detached, logging-only value — so existing dashboards remain comparable. Also logs ``num_frames_global`` alongside the existing ``num_frames``. Addresses review comment: SALM normalized CE by local token count while Automodel normalizes by the global DP count. Ported from NeMo-abl2-lossnorm.
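A sketch of the rescaling above, assuming ``loss_sum`` and ``num_frames`` are already tensors on the current device and ``dp_group`` is the group returned by ``self._get_moe_dp_group()``; names are illustrative:

```python
import torch
import torch.distributed as dist


def rescale_loss_for_fsdp_averaging(
    loss_sum: torch.Tensor, num_frames: torch.Tensor, dp_group
) -> torch.Tensor:
    """All-reduce the labeled-token count over the DP group and scale the local
    CE sum so that FSDP's gradient averaging yields
    sum(rank_CE_sum) / num_frames_global."""
    num_frames_global = num_frames.clone()
    dist.all_reduce(num_frames_global, group=dp_group)  # sum over DP ranks
    dp_size = dist.get_world_size(group=dp_group)
    # FSDP averages gradients (divides by dp_size); pre-multiplying by dp_size
    # and dividing by the global token count cancels that division.
    return loss_sum * dp_size / num_frames_global
```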
``MoEAuxLossAutoScaler`` multiplies aux-loss-derived gradients by ``main_loss_backward_scale`` during backward. FSDP's all-reduce then divides every gradient by ``dp_group_size``, so aux-loss grads end up under-scaled by ``dp_group_size`` relative to the intended per-token scaling that ``MoE.forward`` applies in ``MoEAuxLossAutoScaler.apply(weights, aux_loss * weights.shape[0])``. Mirror Automodel's non-PP recipe (``train_ft.py:1456-1458``): at ``on_fit_start`` set ``main_loss_backward_scale = dp_group_size`` via ``_get_moe_dp_group`` (``include_cp=True`` semantics) so FSDP's division cancels out and the net aux-loss gradient scale is 1. No-op when ``nemo_automodel`` isn't installed or ``aux_loss_coeff == 0`` (the scaler is never applied in the gate).
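A hedged sketch of the ``on_fit_start`` hook described above; the import path and the exact way the scale attribute is stored are assumptions for illustration only:

```python
import torch
import torch.distributed as dist


def scale_moe_aux_loss_for_fsdp(dp_group) -> None:
    """Make the aux-loss backward scale equal to the DP group size so FSDP's
    gradient averaging (division by dp_group_size) cancels out and the net
    aux-loss gradient scale is 1."""
    try:
        # module path is an assumption, not the actual nemo_automodel layout
        from nemo_automodel.components.moe.utils import MoEAuxLossAutoScaler
    except ImportError:
        return  # no-op when nemo_automodel isn't installed
    dp_group_size = dist.get_world_size(group=dp_group)
    MoEAuxLossAutoScaler.main_loss_backward_scale = torch.tensor(float(dp_group_size))
```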
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
StatelessTimer._check_time_remaining (and PreemptionCallback.on_train_batch_end) save the -last checkpoint from inside PTL's on_train_batch_end hook. That hook fires BEFORE TrainingEpochLoop.advance() calls batch_progress.increment_completed(), but the batch's optim step has already advanced global_step. The saved state ends up with batch_progress.current.completed one behind optim_progress.step.total.completed. On resume, PTL's reset_on_restart rewinds batch_progress, PTL replays the in-flight batch, and its optim step runs a second time — leaking +1 global_step per resume. Observed with num-runs=2 chained training: a clean sweep showed epoch-end saves at 1000..5000, then 6001, 7001, 8001, 9001 — the +1 introduced at the resume persists through the rest of training, and max_steps trips mid-epoch so the final epoch save never happens. Fix: flush the in-flight batch via batch_progress.increment_completed() in both StatelessTimer and PreemptionCallback before saving, so the saved state is self-consistent and resume does not replay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
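A simplified sketch of the flush-before-save fix described above; the PTL attribute paths follow the description and standard Lightning loop internals, not the exact patch:

```python
def flush_in_flight_batch_and_save(trainer, checkpoint_path: str) -> None:
    """Mark the in-flight batch as completed before saving the `-last`
    checkpoint, so the saved loop state is self-consistent and resume does not
    replay the batch (which would advance global_step a second time)."""
    batch_progress = trainer.fit_loop.epoch_loop.batch_progress
    if batch_progress.current.completed < batch_progress.current.processed:
        # the batch's optimizer step already ran; count it as completed
        batch_progress.increment_completed()
    trainer.save_checkpoint(checkpoint_path)
```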
The old nemo_automodel.components.distributed.device_mesh.create_device_mesh calls _flatten() but discards the return value, leaving device_mesh._flatten_mapping empty. The new MoE parallelizer resolves dp_shard_cp via that mapping, so SALM training blew up with a KeyError on "dp_shard_cp" after pulling the latest Automodel main. Switch to the migrated helper in mesh_utils, which populates _flatten_mapping the same way as every other Automodel caller. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aining" This reverts commit fc85688.
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
… off grad if final layer is skipped for checkpoint compat Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Factor the chunking logic into nemo/collections/speechlm2/parts/encoder_chunking.py and wire it into both SALM and SALMAutomodel behind an optional encoder_chunk_size_seconds config key. The helper splits long audio rows on the time axis, runs them through the perception forward as a padded chunk batch, and concatenates the per-row embeddings back before the LLM forward, so the LLM-facing shapes are identical to the non-chunked path. Leaving encoder_chunk_size_seconds unset (or null) preserves the legacy behavior byte-for-byte, which keeps SALM 100% backwards compatible with existing configs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
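A simplified sketch of the chunking idea (not the actual ``encoder_chunking.py`` helper): it assumes the perception module takes ``input_signal`` / ``input_signal_length`` and returns per-chunk embeddings and lengths, and it returns a per-row list rather than re-padding into a batch:

```python
import torch


def chunked_perception_forward(perception, audio, audio_lens, chunk_size_samples):
    """Split each audio row into fixed-size chunks along the time axis, run all
    chunks through the perception module as one padded batch, then concatenate
    each row's (unpadded) chunk embeddings so downstream shapes match the
    non-chunked path."""
    chunks, chunk_lens, row_ids = [], [], []
    for row, (wav, n) in enumerate(zip(audio, audio_lens)):
        for start in range(0, int(n), chunk_size_samples):
            piece = wav[start : min(start + chunk_size_samples, int(n))]
            chunks.append(piece)
            chunk_lens.append(piece.shape[0])
            row_ids.append(row)
    # pad all chunks into a single (num_chunks, max_len) batch
    padded = torch.zeros(len(chunks), max(chunk_lens), dtype=audio.dtype, device=audio.device)
    for i, piece in enumerate(chunks):
        padded[i, : piece.shape[0]] = piece
    lens = torch.tensor(chunk_lens, device=audio.device)
    embs, emb_lens = perception(input_signal=padded, input_signal_length=lens)
    # re-concatenate each row's chunk embeddings, dropping padded frames
    out = []
    for row in range(audio.shape[0]):
        parts = [embs[i, : emb_lens[i]] for i in range(len(chunks)) if row_ids[i] == row]
        out.append(torch.cat(parts, dim=0))
    return out  # list of per-row (T_row, D) embeddings; the real helper re-pads into a batch
```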
The DeepEP 100s in-kernel timeouts that motivated this change were actually caused by very slow Lustre NFS lock files during JIT compilation, not by first-iteration rank skew. The warmup path adds complexity without addressing the root cause. Removes the on_fit_start dummy-forward warmup, the per-step WORLD barrier gate (_maybe_warmup_barrier), the validation-step barrier on dataset exhaustion, the maybe_barrier()/warmup_distributed_forward() helpers, the _AutomodelMoE import, and the distributed_warmup_barriers config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: pzelasko <pzelasko@users.noreply.github.com>
SALMAutomodel - batch of fixes → SALMAutomodel - long-context support (chunking, AC) and batch of fixes
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
/ok to test

@pzelasko, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

/ok to test 1ef31c0

/ok to test 1ef31c0

/ok to test b9528d5

[🤖]: Hi @pzelasko 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
self.encoder = encoder
self.num_layers = len(encoder.layers)
self.layer_idx_list = []
self.include_final_output = False
How about exposing include_final_output as another input param, since it's not captured by the layer_idx_list? Using -1 in layer_idx_list may also be misleading since people usually think the -1 index refers to the last layer output, but not the final model output.
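A hypothetical sketch of what the reviewer's suggestion could look like — ``include_final_output`` as an explicit constructor argument instead of a ``-1`` sentinel in ``layer_idx_list``; this is an illustration of the suggestion, not the actual class:

```python
import torch.nn as nn


class ConformerMultiLayerFeatureExtractor(nn.Module):
    def __init__(self, encoder, layer_idx_list, include_final_output: bool = False):
        super().__init__()
        self.encoder = encoder
        self.num_layers = len(encoder.layers)
        # only real layer indices live here; the final model output is requested
        # via the dedicated flag instead of a -1 sentinel index
        self.layer_idx_list = list(layer_idx_list)
        self.include_final_output = include_final_output
```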
audios.extend(conv_audios)
all_cuts.extend(conv_cuts)
ids = []
for cut in conversation.list_cuts():
Not a big issue, but could you add some docstring here to explain what this is doing? Might be useful if others want to modify the adapter
What does this PR do ?
A batch of correctness and feature fixes for the `speechlm2` SALM / SALMAutomodel training path (long-audio support, MoE/FSDP loss scaling, preemption resume, multi-layer perception, AIStore data loading, and assorted plumbing).

Collection: speechlm2 (with small touch-ups in `asr/modules/conformer_encoder.py`, `common/data/lhotse/text_adapters.py`, `utils/callbacks/preemption.py`, `utils/exp_manager.py`).

Changelog
- `nemo/collections/speechlm2/parts/encoder_chunking.py` wired into both SALM and SALMAutomodel via an optional `encoder_chunk_size_seconds` config key. Splits long audio rows on the time axis and concatenates per-row embeddings before the LLM forward; LLM-facing shapes are unchanged. Leaving the key unset is byte-for-byte backwards compatible.
- `training_step` now all-reduces `num_frames` across the DP group and rescales loss by `dp_size / num_frames_global` so FSDP gradient averaging yields the true global mean token loss instead of an average of locally-normalized per-rank losses.
- `MoEAuxLossAutoScaler.main_loss_backward_scale = dp_group_size` at `on_fit_start` so FSDP's per-grad division cancels out and the net aux-loss gradient scale is 1 (mirrors Automodel's non-PP recipe).
- `global_step` drift on wall-time / preemption resume: `StatelessTimer` and `PreemptionCallback` now flush the in-flight batch via `batch_progress.increment_completed()` before saving `-last`, so resume does not replay the last batch and leak +1 step per resume.
- `MultiLayerProjectionConnector` / `QformerConnector` path in SALMAutomodel: make `AudioPerceptionModule.encoder` a property that resolves through `encoder_multilayer`, remap pretrained `encoder.*` → `encoder_multilayer.encoder.*` on load (was being silently dropped by `strict=False`), and propagate `hidden_size` into `modality_adapter.output_dim` so connector inner projections match the LLM dim.
- Perception activation checkpointing (… `[..., -1]` slicing) and disable grad on skipped final layers for checkpoint compatibility; `tests/collections/speechlm2/test_perception_activation_checkpointing.py`.
- `USE_AIS_GET_BATCH=true` path in `NeMoMultimodalConversationJsonlAdapter` / `...ShareGPTJsonlAdapter` and `SALMDataset` (single batched fetch per minibatch, no tar open). `collate_conversation_audio_fault_tolerant` refactored to delegate to `AudioSamples` while preserving fault-tolerant semantics.
- Config updates in `salm_automodel.yaml`.
- `create_device_mesh` import fix: switch to `mesh_utils.create_device_mesh` so `_flatten_mapping` is populated and the new MoE parallelizer can resolve `dp_shard_cp` (was crashing with `KeyError("dp_shard_cp")` after Automodel main bumped).

Usage
Long-audio chunking is opt-in via the model config:
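The original config snippet is not reproduced here; a hedged example of what the opt-in could look like when driven from Python, where the nesting of the key under `model` and the chunk size value are assumptions (only the key name `encoder_chunk_size_seconds` comes from the PR):

```python
from omegaconf import OmegaConf

# load the experiment config and enable chunking; nesting under `model`
# and the 40-second value are illustrative assumptions
cfg = OmegaConf.load("salm_automodel.yaml")
cfg.model.encoder_chunk_size_seconds = 40.0
# leaving the key unset (or null) keeps the legacy non-chunked path
```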
AIStore GetBatch is enabled via env var on the data side:
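A hedged example of the env-var opt-in; the only name taken from the PR is `USE_AIS_GET_BATCH`, and the comment reflects the behavior described in the changelog:

```python
import os

# must be set before the adapters / SALMDataset dataloader are constructed
os.environ["USE_AIS_GET_BATCH"] = "true"
# with it set, the conversation adapters build URL-backed cuts and SALMDataset
# uses AudioSamples(use_batch_loader=True) for one batched fetch per minibatch
```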
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- Did you write any new necessary tests? (`tests/collections/speechlm2/test_salm.py`, `test_salm_automodel.py`, `test_perception_activation_checkpointing.py`, `test_parallel.py`, `_chunking_helpers.py`, `tests/collections/common/test_lhotse_multimodal_ais_get_batch.py`, `tests/core_ptl/test_ptl_stateless_timer.py`)
- Did you add or update any necessary documentation? (`docs/source/speechlm2/{configs,models}.rst`)
- Does the PR affect components that are optional to install? (optional `nemo_automodel` paths are guarded — MoE aux-loss fix is a no-op when not installed)

PR Type:
Additional Information