
Commit a5fd5b2

yueshen2016 and danielkorzekwa
authored and committed
[OMNIML-3232] Support full TE spec for NemotronH HF-to-Megatron import (#884)
## What does this PR do?

**Type of change:** new feature

**Overview:** Enable full TE spec support for NemotronH (Mamba hybrid) models during HF-to-Megatron weight import via `import_mcore_gpt_from_hf`.

Previously, importing HF weights into a Megatron model built with the full TE spec (`TELayerNormColumnParallelLinear`, `TEGroupedMLP`, etc.) failed for NemotronH models due to two issues:

1. **Grouped expert prefix bug**: The `experts.linear_fc1/fc2` import rules had a hard-coded `mtp.layers.{}` prefix, which was only correct for MTP layers. When regular decoder MoE layers use `TEGroupedMLP` (via the full TE spec), the importer generated incorrect HF keys (e.g., `mtp.layers.27.mixer.experts.0.up_proj.weight` instead of `backbone.layers.27.mixer.experts.0.up_proj.weight`).
2. **Fused layer norm loading**: In the full TE spec, layer norms are fused into `TELayerNormColumnParallelLinear` modules as `layer_norm_weight`. The importer's `_name_remapping` would crash trying to load `layer_norm_weight` from a non-existent HF path (e.g., `backbone.layers.X.mixer.in_proj.layer_norm_weight`), when the actual HF norm weight lives at `backbone.layers.X.norm.weight`.

### Changes

**`mcore_nemotron.py`**:

- Fixed the grouped expert prefix from `mtp.layers.{}` to `backbone.layers.{}`. The `_grouped_mlp_merging` function already handles the `backbone` → `mtp` replacement when `is_mtp=True`, so both decoder and MTP layers work correctly.
- Added `mapping={"layer_norm_weight": None}` to the `in_proj` and `linear_fc1` rules to skip `layer_norm_weight` during `_name_remapping` (it is loaded separately via `fused_norm`).
- Added a `fused_norm` rule (`NameRemapping("backbone.layers.{}.norm.weight")`) to load HF norm weights into fused TE modules.

**`megatron_importer.py`**:

- Added a `source_key is None` check in `_name_remapping` to skip keys mapped to `None` in the mapping dict (keeps the existing value instead of crashing on a missing HF key).
- Added fused norm loading in `_import_mamba_layer`: after loading `in_proj`, loads `layer_norm_weight` from HF via the `fused_norm` rule when `layer.norm` is `IdentityOp`.
- Added fused norm loading in `_import_transformer_layer`: loads `layer_norm_weight` into `linear_qkv` (when `input_layernorm` is `IdentityOp`) and into `linear_fc1` (when `pre_mlp_layernorm` is `IdentityOp`).

## Usage

The full TE spec is enabled via the `--full-te-spec` flag on the Megatron-LM side (separate PR). On the ModelOpt side, no user-facing changes are needed -- the import rules automatically handle both local spec and full TE spec models.

```bash
# Convert HF checkpoint to Megatron with full TE spec (megatron-lm side)
unset MLM_MODEL_CKPT && export MLM_MODEL_SAVE=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16_mlm && export HF_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
export PP=2
export MLM_EXTRA_ARGS="--full-te-spec"
bash convert.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

# Quantize the converted checkpoint (megatron-lm side)
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16_mlm
export MLM_MODEL_SAVE=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm
export HF_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
bash quantize.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 FP8_DEFAULT_CFG

# Generate
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm && ./generate.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

# MMLU
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm && export MLM_EXTRA_ARGS="--fraction 0.05 --disable-tqdm" && ./mmlu.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```

## Testing

- Tested end-to-end: HF → Megatron conversion → FP8 quantization → inference (generate) → MMLU evaluation with Nemotron-3-Nano-30B-A3B-BF16.
- Verified the resulting model structure matches Megatron-Bridge's TE spec output (`TELayerNormColumnParallelLinear`, `TEGroupedMLP`, `IdentityOp` norms, etc.).
- Verified the quantized model produces coherent text generation outputs.
- Verified backward compatibility: all changes are no-ops for existing local-spec pipelines (guarded by `IdentityOp` checks, `hasattr` checks, and `"fused_norm" in self.rules` checks).

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes -- all changes are guarded by conditions that only activate for full TE spec models. Local-spec models follow the exact same code paths as before.
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

## Additional Information

Companion megatron-lm changes (separate PR):

- `megatron/core/post_training/modelopt/mamba/model_specs.py`: Added a `use_full_te_spec` parameter to return the canonical `mamba_stack_spec` from `mamba_layer_specs.py`.
- `megatron/post_training/model_builder.py`: Passes `use_full_te_spec=args.full_te_spec` to `get_mamba_stack_modelopt_spec`.
- `megatron/post_training/arguments.py`: Added the `--full-te-spec` CLI flag.
- `examples/post_training/modelopt/convert_model.py`: Skip the `moe_grouped_gemm=False` override when `--full-te-spec` is set.

## Summary by CodeRabbit

- **New Features**
  - Added support for loading fused normalization weights during model import.
- **Bug Fixes**
  - Improved weight mapping logic to correctly skip redundant layer norm weights in specialized model architectures.
- **Refactor**
  - Reorganized expert model parallel configuration paths for better compatibility with mixed parallel processing settings.

Signed-off-by: James Shen <yueshen@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
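The `source_key is None` convention described above can be illustrated with a minimal sketch. This is a hypothetical, simplified stand-in for the importer's `_name_remapping` logic (the function name `remap_state_dict` and the plain-dict interface are illustrative, not the actual ModelOpt API): a mapping value of `None` means "skip this key", keeping the existing Megatron value instead of looking up a non-existent HF tensor.

```python
def remap_state_dict(megatron_state: dict, hf_state: dict, mapping: dict) -> dict:
    """Simplified sketch of the None-mapping skip behavior."""
    out = {}
    for key, val in megatron_state.items():
        source_key = mapping.get(key, key)
        if source_key is None:
            # Fused TE modules: layer_norm_weight is loaded separately via the
            # "fused_norm" rule, so keep the existing value instead of crashing.
            out[key] = val
            continue
        out[key] = hf_state[source_key]
    return out

megatron = {"weight": "w_old", "layer_norm_weight": "ln_old"}
hf = {"weight": "w_hf"}  # note: no layer_norm_weight entry in the HF checkpoint
result = remap_state_dict(megatron, hf, {"layer_norm_weight": None})
# "weight" is taken from HF; "layer_norm_weight" keeps its existing value.
```

Without the `None` check, the lookup `hf_state["layer_norm_weight"]` would raise `KeyError`, which mirrors the crash this PR fixes for full TE spec models.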
1 parent 47ced39 commit a5fd5b2

File tree: 2 files changed (+55, −5 lines)


modelopt/torch/export/plugins/mcore_nemotron.py

Lines changed: 16 additions & 5 deletions
```diff
@@ -58,16 +58,25 @@
     "D": NameRemapping("backbone.layers.{}.mixer.D", REPLICATE),
     "dt_bias": NameRemapping("backbone.layers.{}.mixer.dt_bias", REPLICATE),
     "conv1d": NameRemapping("backbone.layers.{}.mixer.conv1d.", REPLICATE),
-    "in_proj": NameRemapping("backbone.layers.{}.mixer.in_proj.", COL_TP),
+    # Mapping layer_norm_weight to None tells _name_remapping to skip it;
+    # the fused layer_norm_weight is loaded separately via the "fused_norm" rule.
+    "in_proj": NameRemapping(
+        "backbone.layers.{}.mixer.in_proj.", COL_TP | {"mapping": {"layer_norm_weight": None}}
+    ),
     "out_proj": NameRemapping("backbone.layers.{}.mixer.out_proj.", ROW_TP),
     # Attention
     "input_layernorm": NameRemapping("backbone.layers.{}.norm.", REPLICATE),
     "linear_qkv": QKVMerging("backbone.layers.{}.mixer.", COL_TP),
     "linear_proj": NameRemapping("backbone.layers.{}.mixer.o_proj.", ROW_TP),
     # MLP
     "pre_mlp_layernorm": NameRemapping("backbone.layers.{}.norm.", REPLICATE),
-    "linear_fc1": NameRemapping("backbone.layers.{}.mixer.up_proj.", COL_TP),
+    "linear_fc1": NameRemapping(
+        "backbone.layers.{}.mixer.up_proj.", COL_TP | {"mapping": {"layer_norm_weight": None}}
+    ),
     "linear_fc2": NameRemapping("backbone.layers.{}.mixer.down_proj.", ROW_TP),
+    # Fused layer norm: loads the HF norm weight into fused TELayerNormColumnParallelLinear
+    # modules (in_proj, linear_qkv, linear_fc1) when using the TE spec.
+    "fused_norm": NameRemapping("backbone.layers.{}.norm.weight"),
     # MoE
     "router": NameRemapping(
         "backbone.layers.{}.mixer.gate.", {"mapping": {"expert_bias": "e_score_correction_bias"}}
@@ -92,12 +101,14 @@
     "mtp.hnorm": NameRemapping("mtp.layers.{}.hnorm.", {"is_mtp": True}),
     "mtp.eh_proj": NameRemapping("mtp.layers.{}.eh_proj.", {"is_mtp": True}),
     "mtp.final_layernorm": NameRemapping("mtp.layers.{}.final_layernorm.", {"is_mtp": True}),
-    # Grouped local experts in MTP
+    # Grouped local experts (used for TEGroupedMLP in both decoder and MTP layers).
+    # The prefix uses "backbone" for regular decoder layers; when called from MTP
+    # context (is_mtp=True), _grouped_mlp_merging replaces "backbone" with "mtp".
     "experts.linear_fc1": GroupedMLPMerging(
-        "mtp.layers.{}.mixer.experts.{{}}.up_proj", COL_ETP | {"is_mtp": True}
+        "backbone.layers.{}.mixer.experts.{{}}.up_proj", COL_ETP
     ),
     "experts.linear_fc2": GroupedMLPMerging(
-        "mtp.layers.{}.mixer.experts.{{}}.down_proj", ROW_ETP | {"is_mtp": True}
+        "backbone.layers.{}.mixer.experts.{{}}.down_proj", ROW_ETP
     ),
 }
```
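The prefix handling in the hunk above can be sketched as follows. This is a hypothetical helper, not the actual `_grouped_mlp_merging` implementation: the rule's template uses `backbone` by default, and the `backbone` → `mtp` substitution happens only when importing MTP layers, so a single rule serves both decoder and MTP contexts. Note the escaped `{{}}` in the template: the first `format` fills the layer index, the second fills the expert index.

```python
def expert_hf_key(template: str, layer_id: int, expert_id: int, is_mtp: bool = False) -> str:
    """Build the HF checkpoint key for one grouped expert (illustrative sketch)."""
    # For MTP layers, swap the "backbone" prefix for "mtp" before formatting.
    prefix = template.replace("backbone", "mtp") if is_mtp else template
    # First format fills the layer index; the escaped {{}} survives as {} and
    # is filled by the second format with the expert index.
    return prefix.format(layer_id).format(expert_id)

TEMPLATE = "backbone.layers.{}.mixer.experts.{{}}.up_proj"
print(expert_hf_key(TEMPLATE, 27, 0))              # regular decoder MoE layer
print(expert_hf_key(TEMPLATE, 0, 0, is_mtp=True))  # MTP layer
```

With the old hard-coded `mtp.layers.{}` prefix, the decoder case would have produced the wrong `mtp.layers.27.…` key described in the PR overview.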

modelopt/torch/export/plugins/megatron_importer.py

Lines changed: 39 additions & 0 deletions
```diff
@@ -200,6 +200,12 @@ def _name_remapping(
             state_dict[key] = val
         else:
             source_key = mapping.get(key, key)
+            # A mapping value of None means "skip this key" (keep existing value).
+            # This is used for fused TE modules where layer_norm_weight is loaded
+            # separately from a different HF path.
+            if source_key is None:
+                state_dict[key] = val
+                continue
             # For bias tensors in ROW_TP layers, don't use parallel config to avoid sharding
             # since bias should always be replicated, not sharded
             if (
@@ -537,6 +543,15 @@ def _import_mamba_layer(self, layer, layer_id, layer_pbar):
         self.rules["in_proj"](layer.mixer.in_proj, layer_id)
         self.rules["out_proj"](layer.mixer.out_proj, layer_id)

+        # TE spec: layer norm is fused into in_proj (TELayerNormColumnParallelLinear).
+        # Load the fused layer_norm_weight from the HF norm path.
+        if (
+            isinstance(layer.norm, IdentityOp)
+            and hasattr(layer.mixer.in_proj, "layer_norm_weight")
+            and "fused_norm" in self.rules
+        ):
+            self.rules["fused_norm"](layer.mixer.in_proj.layer_norm_weight, layer_id)
+
     def _import_transformer_layer(self, layer, layer_id, layer_pbar, is_mtp: bool = False):
         if not isinstance(layer.input_layernorm, IdentityOp):
             self.rules["input_layernorm"](layer.input_layernorm, layer_id, is_mtp=is_mtp)
@@ -578,6 +593,18 @@ def _import_transformer_layer(self, layer, layer_id, layer_pbar, is_mtp: bool =
                 attention.core_attention.softmax_offset, layer_id, is_mtp=is_mtp
             )

+        # TE spec: input_layernorm is fused into linear_qkv (TELayerNormColumnParallelLinear).
+        # Load the fused layer_norm_weight from the HF norm path.
+        if (
+            isinstance(layer.input_layernorm, IdentityOp)
+            and hasattr(attention, "linear_qkv")
+            and hasattr(attention.linear_qkv, "layer_norm_weight")
+            and "fused_norm" in self.rules
+        ):
+            self.rules["fused_norm"](
+                attention.linear_qkv.layer_norm_weight, layer_id, is_mtp=is_mtp
+            )
+
         if not isinstance(layer.pre_mlp_layernorm, IdentityOp):
             self.rules["pre_mlp_layernorm"](layer.pre_mlp_layernorm, layer_id, is_mtp=is_mtp)

@@ -671,6 +698,18 @@ def _import_transformer_layer(self, layer, layer_id, layer_pbar, is_mtp: bool =
             self.rules["linear_fc1"](layer.mlp.linear_fc1, layer_id, is_mtp=is_mtp)
             self.rules["linear_fc2"](layer.mlp.linear_fc2, layer_id, is_mtp=is_mtp)

+            # TE spec: pre_mlp_layernorm is fused into linear_fc1
+            # (TELayerNormColumnParallelLinear).
+            # Load the fused layer_norm_weight from the HF norm path.
+            if (
+                isinstance(layer.pre_mlp_layernorm, IdentityOp)
+                and hasattr(layer.mlp.linear_fc1, "layer_norm_weight")
+                and "fused_norm" in self.rules
+            ):
+                self.rules["fused_norm"](
+                    layer.mlp.linear_fc1.layer_norm_weight, layer_id, is_mtp=is_mtp
+                )
+
     def _import_state_dict(self):
         model = self.model
         layer_pbar = tqdm(model.decoder.layers, disable=self.disable_tqdm)
```
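The backward-compatibility claim in the PR (all changes are no-ops for local-spec pipelines) rests on the three-part guard used in each of the hunks above. The sketch below isolates that guard with stand-in classes (`FusedLinear` and `PlainLinear` are placeholders, not real Megatron or TE types): with the local spec, the norm is a real module and the fused path never triggers; with the full TE spec, the norm is `IdentityOp` and the linear module carries `layer_norm_weight`.

```python
class IdentityOp:  # placeholder for megatron.core.transformer's IdentityOp
    pass

class FusedLinear:  # stand-in for a fused TELayerNormColumnParallelLinear
    layer_norm_weight = "ln"

class PlainLinear:  # stand-in for a local-spec linear without a fused norm
    pass

def needs_fused_norm(norm, linear, rules) -> bool:
    """True only when the TE-spec fused-norm path should load the weight."""
    return (
        isinstance(norm, IdentityOp)          # norm was replaced by an identity
        and hasattr(linear, "layer_norm_weight")  # linear carries the fused norm
        and "fused_norm" in rules             # an import rule exists for it
    )

rules = {"fused_norm": object()}
assert needs_fused_norm(IdentityOp(), FusedLinear(), rules)   # full TE spec: load
assert not needs_fused_norm(object(), PlainLinear(), rules)   # local spec: no-op
```

All three conditions must hold before `self.rules["fused_norm"]` runs, which is why existing local-spec imports follow exactly the same code paths as before this change.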
