Commit 919c945

Authored by pzelasko, claude, stevehuang52, and claude[bot]
SALM with NeMo Automodel integration for Nemotron Nano V3 LLM backbone (#15447)
* WIP: bringing Yifan's changes to main
* Add workaround for exp_manager issue
* Support reading indexed JSONL datasets with ShareGPT format
* Support reading indexed tarred datasets with ShareGPT format
* Refactor for compactness
* Fixes for real-life data
* Fixes for real-life data
* Fixes for real-life data
* Fixes for missing wids-meta.json
* Fixes for tarfile edge cases
* Fixes for real-world tar files
* Move SALM LLM init to configure_model
* Fix: delayed perception init
* Add AutomodelParallelStrategy for Automodel LLM support
* Replace HF Automodel with NeMo Automodel for SALM's LLM backbone
* Update SALM default config with new options
* Init fixes
* Fix dtype initialization
* Fix mesh selection for speech encoder
* Fix for mismatched device_mesh axis names in gradient clipping - use Automodel's utility
* Fix for using embed_tokens in FSDP context before running forward on full LLM
* Definitive fix for using embed_tokens outside of LLM with FSDP
* This version actually works with Automodel
* Fix from_pretrained with transformers v5
* Fix from_pretrained with transformers v5
* Fix generate/eval
* Fix to_hf
* Fixes for AutoTokenizer decoding in v5
* Flag to run configure_model() at the end of __init__ for safetensors-converted models
* Preliminary: support distributed models in to_hf.py
* Fix passing Automodel kwargs
* Fix
* Enable inference with model parallelism
* Fix for Lightning save_hyperparameters() call
* Fix for loading into DTensor
* Accelerate loading DTensor
* Accelerate loading DTensor
* Accelerate loading DTensor
* Fix for PE buffers not in ckpt (essentially strict=False)
* Add Nemotron Nano V3 prompt formatter with <think> reasoning support

  Implements NemotronNanoV3PromptFormatter (NAME="nemotron-nano-v3") using a ChatML-style <|im_start|>/<|im_end|> template with an encode_dialog override that handles: auto-inserting an empty system turn, truncating thinking from history, prepending <think></think> for non-thinking assistant turns, and a dynamic inference prefix (thinking on/off). Includes Lhotse Cut integration via registered_prompt_format_fn. Verified against HF apply_chat_template for nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (both string and token match).

* Fix
* Automodel LoRA support
* Fixes for model parallel
* LoRA fix
* Small ckpt conversion/inference fix
* Separate SALM and SALMAutomodel into independent classes

  Restore salm.py to its original HF Transformers + PEFT LoRA implementation from main, and extract the NeMo Automodel-based implementation into a new SALMAutomodel class in salm_automodel.py. This keeps both backends available and independent, with scripts auto-detecting the model class from config.json.

  - salm.py: restored from main (eager init, HF PEFT, move_embedding)
  - salm_automodel.py: new file with SALMAutomodel (deferred init, Automodel LoRA)
  - salm_train.py: selects the model class via the model.use_nemo_automodel config key
  - salm_eval.py / salm_generate.py: auto-detect the model class from config.json
  - salm_automodel.yaml: new config for SALMAutomodel training
  - Tests split into test_salm.py (CPU) and test_salm_automodel.py (CUDA)
  - New functional test SPEECHLM_Automodel_Training_SALM.sh

* Fix linters
* Add SALMAutomodel docs and speechlm2 pip extra

  Add documentation for SALMAutomodel (the NeMo Automodel variant of SALM) across all speechlm2 doc pages: intro, models, configs, and training_and_scaling. Create a pip install nemo-toolkit[speechlm2] extra that composes speechlm2-only (nemo_automodel git dep) + asr + tts.

* Add SALMAutomodel tutorial notebook and fix EP/FSDP2 docs

  Add tutorials/speechlm2/SpeechLM_With_NeMo_Automodel.ipynb covering the full pipeline: data download, training, checkpoint conversion, and evaluation with the Nemotron Nano V3 MoE backbone on 2 GPUs. Fix docs to clarify that Expert Parallelism reuses the FSDP2 data-parallel axis — dense layers are sharded via FSDP2 while MoE layers use EP on the same GPUs, not a separate dimension.

* Fix uv torch index conflict for speechlm2 extra

  The docs CI runs `uv sync --all-extras --all-groups`, which resolves the speechlm2 extra pulling nemo_automodel from git. uv treats git source deps as workspace members and applies their [tool.uv.sources], causing a conflict: Automodel maps torch to per-platform indexes while NeMo defaulted to PyPI for all platforms. Add a matching [tool.uv.sources] entry for torch to pyproject.toml and regenerate uv.lock with nemo_automodel included.

* Remove direction arg
* Apply isort and black reformatting
* Fix linter
* Fix tests
* Fixes
* Fixes for trust_remote_code
* Apply isort and black reformatting
* Add explicit enable_thinking support to SALM eval paths
* Apply isort and black reformatting
* Fix inference with ep_size=1 for Automodel models
* Fixes
* Fixes for inference and tutorial
* Apply isort and black reformatting
* Remove deprecated activation_checkpointing parameter everywhere
* Fix CI
* Fix to_hf.py crash when run without torchrun

  Guard dist.init_process_group on RANK env var presence so the script works with plain `python` (single-file checkpoints) as well as `torchrun` (distributed checkpoints).

* Apply suggestions from code review
* Add flashoptim support and bf16-automodel half-precision setup
* Apply isort and black reformatting
* Patch flashoptim handling of unevenly sharded state dicts
* Reproducibility fix
* Refactor AutomodelPrecision to FlashPrecision to enable re-use by other collections in subsequent PRs
* Apply isort and black reformatting
* Address code review
* Disable linter
* Fix test
* Fix for torch.compile config
* Apply isort and black reformatting
* Fix tests
* Dataloader DP rank patch for Automodel's device_mesh
* Apply isort and black reformatting
* Fix sloppy fix
* Fix CI HF tokenizer download issue
* Add tests for correct DP rank resolution in the dataloader
* Apply isort and black reformatting
* xfail tests with corrupted tokenizer in CI
* Update test PyTorch version safeguard
* Fix new peft version requiring newer torchao than available in CI container
* Fixes
* Bump Automodel pin for transformers compat

---------

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: pzelasko <pzelasko@users.noreply.github.com>
Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: pzelasko <pzelasko@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
1 parent 73a5e7d · commit 919c945

61 files changed: 10,327 additions & 2,296 deletions

docs/source/features/mixed_precision.rst

Lines changed: 66 additions & 2 deletions
@@ -5,7 +5,7 @@ Mixed Precision Training
 
 Mixed precision training enhances computational efficiency by conducting operations in low-precision
 format while selectively maintaining critical data in single-precision. NeMo supports FP16 and BF16
-precision via PyTorch Lightning, in both mixed and true half-precision modes.
+precision via PyTorch Lightning, in mixed, true, and flash half-precision modes.
 
 Precision Modes
 ---------------
@@ -23,6 +23,16 @@ PyTorch Lightning provides two categories of half-precision training:
   but requires the model to be numerically stable in half-precision.
   SpeechLM2 models use ``"bf16-true"`` by default for training.
 
+**Flash Precision** (``"bf16-flash"`` / ``"fp16-flash"``):
+  The model also runs in half-precision, but NeMo avoids Lightning's global
+  default-dtype override and autocast context. This mode is intended for use
+  with FlashOptim, a library of drop-in optimizers that reduces training
+  memory by shrinking optimizer states, master weights, and gradients. In
+  practice, this may be a better fit than AMP / mixed precision when
+  optimizer-state memory or checkpoint size is the bottleneck, and may lead to
+  improved convergence compared to Lightning's true half-precision, as it keeps
+  track of the residual between half- and full-precision weights.
+
 Configuration
 -------------
 
@@ -37,6 +47,8 @@ In YAML (with Hydra):
     # precision: "16-mixed" # FP16 mixed precision
     # precision: "bf16-true" # True BF16 half precision
     # precision: "fp16-true" # True FP16 half precision
+    # precision: "bf16-flash" # BF16 flash precision
+    # precision: "fp16-flash" # FP16 flash precision
 
 In Python:
 
@@ -71,7 +83,7 @@ the substring ``"audio"`` is kept in its original precision (typically FP32). Al
 tensors are cast to the target half-precision dtype.
 
 This plugin is used automatically when you launch training with NeMo's ``resolve_trainer_cfg``
-utility (used by all NeMo example training scripts). When the trainer config specifies
+utility (used by many NeMo example training scripts). When the trainer config specifies
 ``precision: "bf16-true"`` or ``precision: "fp16-true"``, ``resolve_trainer_cfg`` replaces
 the precision setting with the ``HalfPrecisionForAudio`` plugin:
 
@@ -94,3 +106,55 @@ If you construct the trainer manually, you can install the plugin directly:
         devices=2,
         accelerator="gpu",
     )
+
+FlashPrecision
+--------------
+
+NeMo provides the ``FlashPrecision`` plugin (in
+``nemo.utils.trainer_utils``) primarily for FlashOptim-backed training.
+According to the official FlashOptim README, FlashOptim provides drop-in
+optimizer replacements that reduce training memory by compressing optimizer
+states, master weights, and gradients while preserving the standard PyTorch
+optimizer API.
+
+FlashOptim generally expects the model parameters to already be in bf16/fp16,
+while the optimizer manages reduced-precision state and master-weight
+correction internally. ``FlashPrecision`` fits that model: it preserves the
+same audio-aware input casting behavior as ``HalfPrecisionForAudio``, but does
+not enter autocast and does not change PyTorch's global default dtype. This
+avoids layering Lightning's global precision policy on top of FlashOptim's own
+reduced-precision optimizer behavior.
+
+When the trainer config specifies ``precision: "bf16-flash"`` or
+``precision: "fp16-flash"``, ``resolve_trainer_cfg`` replaces the precision
+setting with the ``FlashPrecision`` plugin:
+
+.. code-block:: python
+
+    from nemo.utils.trainer_utils import resolve_trainer_cfg
+
+    # In YAML: trainer.precision = "bf16-flash"
+    trainer = pl.Trainer(**resolve_trainer_cfg(cfg.trainer))
+
+If you construct the trainer manually, you can install the plugin directly:
+
+.. code-block:: python
+
+    from nemo.utils.trainer_utils import FlashPrecision
+
+    trainer = pl.Trainer(
+        plugins=[FlashPrecision("bf16-flash")],
+        devices=2,
+        accelerator="gpu",
+    )
+
+If you are going to use ``FlashPrecision``, make sure to also set up a ``flashoptim`` optimizer, e.g.:
+
+.. code-block:: yaml
+
+    optimizer:
+      _target_: flashoptim.FlashAdamW
+      lr: 1e-4
+      betas: [0.9, 0.999]
+      weight_decay: 5e-2
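For illustration, here is how the two halves of this setup could be wired up directly in Python. This is a minimal sketch that assumes ``flashoptim.FlashAdamW`` accepts the standard AdamW arguments shown in the YAML above; it is not a verified end-to-end recipe.

.. code-block:: python

    import torch
    import lightning.pytorch as pl
    from flashoptim import FlashAdamW  # optimizer class named in the YAML above
    from nemo.utils.trainer_utils import FlashPrecision

    # FlashPrecision installs the audio-aware input casting but skips autocast
    # and the global default-dtype override, leaving precision management to
    # the FlashOptim optimizer.
    trainer = pl.Trainer(
        plugins=[FlashPrecision("bf16-flash")],
        devices=2,
        accelerator="gpu",
    )

    # FlashOptim expects parameters already in half precision; the optimizer
    # keeps compressed state and the half/full weight residual internally.
    model = torch.nn.Linear(1024, 1024).to(torch.bfloat16)
    optimizer = FlashAdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=5e-2)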

docs/source/index.rst

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ NVIDIA NeMo Speech Developer Docs
     </a>
     <a class="task-card" href="speechlm2/intro.html">
       <h3>🧠 Speech Language Models</h3>
-      <p>Audio-aware LLMs that understand and generate speech. Speech-to-text, speech-to-speech, and more.</p>
+      <p>Audio-aware LLMs that understand and generate speech. Use HuggingFace Transformers, or NeMo Automodel for efficient MoE and model parallelism. Speech-to-text, speech-to-speech, and more.</p>
       <strong>Quick Start →</strong>
     </a>
     <a class="task-card" href="audio/intro.html">

docs/source/speechlm2/configs.rst

Lines changed: 128 additions & 2 deletions
@@ -40,7 +40,10 @@ See the `SALM paper <https://arxiv.org/abs/2310.09424>`_ for more details.
     pretrained_llm: "TinyLlama/TinyLlama_v1.1" # HF model path
     pretrained_asr: "stt_en_fastconformer_hybrid_large_streaming_80ms" # NeMo checkpoint name
     pretrained_weights: True # Whether to load weights or just architecture
-
+
+    # Fine-tune from a previous training checkpoint (weights only, fresh optimizer)
+    init_from_checkpoint: null # path to .ckpt, DCP dir, or HF dir
+
     # Special token settings
     audio_locator_tag: "<audio>" # Tag to replace with audio embeddings
@@ -94,6 +97,68 @@ See the `SALM paper <https://arxiv.org/abs/2310.09424>`_ for more details.
       dropout_pre_encoder: 0
       dropout_emb: 0.0
 
+SALMAutomodel Configuration
+----------------------------
+
+The SALMAutomodel configuration extends the SALM configuration with NeMo Automodel
+support. The key difference is ``use_nemo_automodel: true`` and the use of
+``AutomodelParallelStrategy`` instead of ``DDPStrategy``.
+
+The example below shows a configuration for training with NVIDIA Nemotron Nano V3
+MoE as the LLM backbone, with Expert Parallelism across 8 GPUs:
+
+.. code-block:: yaml
+
+    model:
+      use_nemo_automodel: true  # Selects SALMAutomodel in salm_train.py
+      pretrained_llm: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
+      pretrained_asr: "nvidia/canary-1b-flash"
+      pretrained_weights: True
+
+      freeze_params:
+        - "^llm\\..+$"
+        - "^perception\\.preprocessor\\..+$"
+        - "^perception\\.encoder\\..+$"
+      prevent_freeze_params: []
+
+      # LoRA uses Automodel-native format (not HF PEFT):
+      # lora:
+      #   dim: 128
+      #   alpha: 256
+      #   dropout: 0.01
+      #   target_modules: ["q_proj", "v_proj"]
+
+      perception:
+        target: nemo.collections.speechlm2.modules.perception.AudioPerceptionModule
+        output_dim: 2048
+        modality_adapter:
+          _target_: nemo.collections.speechlm2.modules.perception.IdentityConnector
+          d_model: 1024
+
+    trainer:
+      strategy:
+        _target_: nemo.collections.speechlm2.parts.parallel.AutomodelParallelStrategy
+        ep_size: 8  # Expert Parallelism across 8 GPUs for MoE
+        # tp_size: 1
+        # dp_size: null  # inferred
+
+NeMo Automodel applies MoE-specific optimizations automatically when an MoE model
+is detected:
+
+* **Grouped GEMM** — fuses expert computations into a single batched matrix multiply
+  for higher GPU throughput.
+* **DeepEP** (Deep Expert Parallelism) — efficient all-to-all expert routing across
+  GPUs, minimizing communication overhead for MoE layers.
+
+Note the differences from the SALM configuration:
+
+* ``model.use_nemo_automodel: true`` — selects ``SALMAutomodel`` in the training script.
+* ``model.pretrained_llm`` can point to MoE models like ``nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16``.
+* ``trainer.strategy._target_`` uses ``AutomodelParallelStrategy`` instead of ``ModelParallelStrategy``.
+* ``ep_size`` controls Expert Parallelism on the FSDP data-parallel axis — dense layers are sharded via FSDP2, while MoE layers use EP for expert routing on the same GPUs.
+* LoRA config uses ``dim``/``alpha`` keys (Automodel native) instead of ``r``/``lora_alpha`` (HF PEFT).
+* No ``embed_tokens`` freeze pattern — embeddings stay inside the LLM.
+
 DuplexS2SModel Configuration
 -----------------------------
 
@@ -264,6 +329,7 @@ Model Parameters
 - **pretrained_llm**: Path to the pretrained HuggingFace LLM
 - **pretrained_asr**: Name of the pretrained NeMo ASR model used for perception
 - **pretrained_audio_codec**: Path to the pretrained audio codec model (for speech generation)
+- **init_from_checkpoint**: Path to a training checkpoint to initialize model weights from (see :ref:`fine-tuning-from-checkpoint` below)
 - **freeze_params**: Regex patterns of parameters to freeze during training
 - **audio_loss_weight/text_loss_weight**: Weighting of different loss components
 
@@ -291,6 +357,7 @@ Example Configuration Files
 Example configurations for all model types can be found in the example directory:
 
 - SALM: `examples/speechlm2/conf/salm.yaml`
+- SALMAutomodel: `examples/speechlm2/conf/salm_automodel.yaml`
 - DuplexS2SModel: `examples/speechlm2/conf/s2s_duplex.yaml`
 - DuplexS2SSpeechDecoderModel: `examples/speechlm2/conf/s2s_duplex_speech_decoder.yaml`
 - DuplexSTTModel: `examples/speechlm2/conf/duplex_stt.yaml`
@@ -307,6 +374,10 @@ You can use these configurations with the training scripts by specifying the con
         --config-path=conf \
         --config-name=salm
 
+    # Train SALMAutomodel
+    python examples/speechlm2/salm_train.py \
+        --config-name=salm_automodel
+
 You can also override configuration values from the command line:
 
 .. code-block:: bash
@@ -316,4 +387,59 @@ You can also override configuration values from the command line:
         --config-name=salm \
         model.pretrained_llm="different/llm/path" \
         trainer.max_steps=1000 \
-        data.train_ds.batch_size=8
+        data.train_ds.batch_size=8
+
+.. _fine-tuning-from-checkpoint:
+
+Fine-Tuning from a Previous Checkpoint
+---------------------------------------
+
+To start a new training run initialized from a previous checkpoint — with a fresh
+optimizer, LR scheduler, and step counter — set ``model.init_from_checkpoint``:
+
+.. code-block:: yaml
+
+    model:
+      init_from_checkpoint: /path/to/checkpoints/step=6375.ckpt
+
+Or pass it as a Hydra override:
+
+.. code-block:: bash
+
+    python examples/speechlm2/salm_train.py \
+        --config-name=salm_automodel \
+        ++model.init_from_checkpoint=/path/to/checkpoints/step=6375.ckpt
+
+This differs from ``exp_manager.resume_from_checkpoint``, which restores the
+**full** training state (optimizer, scheduler, step counter) to continue an
+interrupted run. ``init_from_checkpoint`` only loads model weights, giving you a
+clean starting point for fine-tuning on different data or with different
+hyperparameters.
+
+Supported Checkpoint Formats
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Three checkpoint formats are supported:
+
+* **Distributed checkpoints (DCP)**: Directories with a ``.metadata`` file, produced
+  by ``ModelParallelStrategy`` / ``AutomodelParallelStrategy``. This is the default
+  format when training with FSDP2 or TP. DCP loading handles automatic resharding
+  when the parallelism configuration differs between the source and target runs.
+
+* **HuggingFace model directories**: Directories containing ``model.safetensors``,
+  such as the output of ``to_hf.py``.
+
+* **Single-file checkpoints**: Standard ``.ckpt`` or ``.pt`` files with a
+  ``state_dict`` key.
+
+The model architecture is still defined by ``pretrained_llm`` and ``pretrained_asr``
+(needed for config and tokenizer initialization), but all weights are overridden by
+the checkpoint.
+
+This feature works with both ``SALM`` and ``SALMAutomodel``.
+
+.. note::
+   ``init_from_checkpoint`` requires the source and target models to use the
+   same model class (e.g., both ``SALMAutomodel``). Cross-model loading
+   (e.g., ``SALM`` checkpoint into ``SALMAutomodel``) will encounter state dict
+   key mismatches because the two classes structure the embedding layer differently.
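To make the three formats concrete, here is a minimal sketch of how a loader could tell them apart on disk, following the markers described above (``.metadata`` for DCP directories, ``model.safetensors`` for HF directories, a ``state_dict`` key for single files). The helper name is hypothetical; the actual ``init_from_checkpoint`` code may differ.

.. code-block:: python

    from pathlib import Path

    def detect_checkpoint_format(path: str) -> str:
        """Hypothetical helper mirroring the format rules described above."""
        p = Path(path)
        if p.is_dir():
            if (p / ".metadata").exists():
                # Distributed checkpoint (DCP) directory produced by
                # ModelParallelStrategy / AutomodelParallelStrategy.
                return "dcp"
            if (p / "model.safetensors").exists():
                # HuggingFace model directory, e.g. the output of to_hf.py.
                return "hf"
            raise ValueError(f"Unrecognized checkpoint directory: {p}")
        # Single-file .ckpt / .pt checkpoint holding a 'state_dict' key.
        return "single_file"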

docs/source/speechlm2/intro.rst

Lines changed: 61 additions & 7 deletions
@@ -4,16 +4,20 @@ SpeechLM2
 .. note::
    The SpeechLM2 collection is still in active development and the code is likely to keep changing.
 
+.. note::
+   Install with ``pip install nemo-toolkit[speechlm2]`` to get all required dependencies including NeMo Automodel.
 
+SpeechLM2 refers to a collection that augments pre-trained Large Language Models (LLMs) with speech understanding and generation capabilities.
 
-SpeechLM2 refers to a collection that augments pre-trained Large Language Models (LLMs) with speech understanding and generation capabilities.
-
-This collection is designed to be compact, efficient, and to support easy swapping of different LLMs backed by HuggingFace AutoModel.
+This collection is designed to be compact, efficient, and to support easy swapping of different LLMs backed by HuggingFace AutoModel or NeMo Automodel.
 It has first-class support for using dynamic batch sizes via Lhotse and various model parallelism techniques (e.g., FSDP2, Tensor Parallel, Sequence Parallel) via PyTorch DTensor API.
 
 We currently support six main model types:
 
-* **SALM** (Speech-Augmented Language Model) - a simple but effective approach to augmenting pre-trained LLMs with speech understanding capabilities.
+* **SALM** (Speech-Augmented Language Model) - a simple but effective approach to augmenting pre-trained LLMs with speech understanding capabilities. Available in two variants:
+
+  * ``SALM`` — uses HuggingFace Transformers for the LLM backbone with optional HF PEFT LoRA.
+  * ``SALMAutomodel`` — uses `NeMo Automodel <https://github.com/NVIDIA-NeMo/Automodel>`_ for the LLM backbone with native LoRA, advanced parallelism (FSDP2, TP, SP, EP via ``AutomodelParallelStrategy``), and MoE optimizations (Grouped GEMM, DeepEP) for efficient training with models like `NVIDIA Nemotron Nano V3 <https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16>`_.
 * **DuplexS2SModel** - a full-duplex speech-to-speech model with an ASR encoder, directly predicting discrete audio codes.
 * **DuplexS2SSpeechDecoderModel** - a variant of DuplexS2SModel with a separate transformer decoder for speech generation.
 * **DuplexEARTTS** - a ready-to-use duplex text-to-speech model that supports user interruption via a special text interruption token.
@@ -71,7 +75,7 @@ You can run inference using the loaded pretrained SALM model:
     prompt = [{"role": "user", "content": f"{model.audio_locator_tag}"}]
 
     # Generate response
-    with torch.no_grad():
+    with torch.inference_mode():
         output = model.generate(
             prompts=[prompt],
             audios=audio_signal,
@@ -83,6 +87,43 @@ You can run inference using the loaded pretrained SALM model:
     response = model.tokenizer.ids_to_text(output[0])
     print(f"Model response: {response}")
 
+SALMAutomodel
+*************
+
+``SALMAutomodel`` is the NeMo Automodel variant of SALM. It enables efficient training of
+Speech LLMs with MoE architectures like `NVIDIA Nemotron Nano V3 <https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16>`_
+using MoE-specific optimizations (Grouped GEMM, DeepEP). It uses deferred initialization
+(``configure_model()``) and supports distributed training and inference via
+``AutomodelParallelStrategy``.
+
+.. code-block:: python
+
+    import torch
+    import nemo.collections.speechlm2 as slm
+    from nemo.collections.speechlm2.parts.parallel import setup_distributed
+
+    # Initialize distributed and create an Automodel-compatible device mesh with EP=2.
+    # setup_distributed delegates mesh creation to nemo_automodel, which builds
+    # the full (pp, dp_replicate, dp_shard, cp, tp) mesh with MoE submeshes.
+    strategy = setup_distributed(ep_size=2)
+
+    # Load a pretrained SALMAutomodel with the Automodel device mesh
+    model = slm.models.SALMAutomodel.from_pretrained(
+        "path/to/checkpoint",
+        device_mesh=strategy.device_mesh,
+        distributed_config=strategy.distributed_config,
+        moe_config=strategy.moe_config,
+        moe_mesh=strategy.moe_mesh,
+    ).eval()
+
+    # Inference is identical to SALM
+    with torch.inference_mode():
+        output = model.generate(
+            prompts=[prompt],
+            audios=audio_signal,
+            audio_lens=audio_len,
+        )
+
 DuplexS2SModel
 **************
 
@@ -310,22 +351,35 @@ Alternatively, you can train a model using the provided training scripts in the
         --config-path=examples/speechlm2/conf \
         --config-name=salm
 
-    # For SALM inference/evaluation
+    # For SALM inference/evaluation
     python examples/speechlm2/salm_eval.py \
         pretrained_name=/path/to/checkpoint \
        inputs=/path/to/test_manifest \
        batch_size=64 \
        max_new_tokens=128 \
        output_manifest=generations.jsonl
 
+To train the SALMAutomodel variant (with NeMo Automodel backend), use the ``salm_automodel`` config:
+
+.. code-block:: bash
+
+    # Train SALMAutomodel with NVIDIA Nemotron Nano V3 MoE backbone on 8 GPUs
+    torchrun --nproc_per_node=8 examples/speechlm2/salm_train.py \
+        --config-name=salm_automodel \
+        model.pretrained_llm=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
+
+The ``salm_automodel.yaml`` config sets ``model.use_nemo_automodel: true``, which selects the
+``SALMAutomodel`` class. This variant supports ``AutomodelParallelStrategy`` for FSDP2/TP/EP
+parallelism and MoE optimizations (Grouped GEMM, DeepEP).
+
 For more detailed information on training at scale, model parallelism, and SLURM-based training, see :doc:`training and scaling <training_and_scaling>`.
 
 Collection Structure
 --------------------
 
 The speechlm2 collection is organized into the following key components:
 
-- **Models**: Contains implementations of DuplexS2SModel, DuplexS2SSpeechDecoderModel, DuplexSTTModel, SALM, DuplexEARTTS, and the inference-only NemotronVoiceChat.
+- **Models**: Contains implementations of DuplexS2SModel, DuplexS2SSpeechDecoderModel, DuplexSTTModel, SALM, SALMAutomodel, DuplexEARTTS, and the inference-only NemotronVoiceChat.
 - **Modules**: Contains audio perception and speech generation modules.
 - **Data**: Includes dataset classes and data loading utilities.
 
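The eval and generate scripts auto-detect the model class from a checkpoint's ``config.json``. As a rough sketch of that dispatch (the exact key layout in ``config.json`` is an assumption here, mirroring the ``model.use_nemo_automodel`` flag used in training):

.. code-block:: python

    import json
    from pathlib import Path

    import nemo.collections.speechlm2 as slm

    def select_model_class(checkpoint_dir: str):
        """Pick SALM or SALMAutomodel from the saved config (key layout assumed)."""
        cfg = json.loads((Path(checkpoint_dir) / "config.json").read_text())
        if cfg.get("use_nemo_automodel", False):
            return slm.models.SALMAutomodel
        return slm.models.SALM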
