Commit a5fd5b2
[OMNIML-3232] Support full TE spec for NemotronH HF-to-Megatron import (#884)
## What does this PR do?
**Type of change:** new feature
**Overview:** Enable full TE spec support for NemotronH (Mamba hybrid)
models during HF-to-Megatron weight import via
`import_mcore_gpt_from_hf`.
Previously, importing HF weights into a Megatron model built with the
full TE spec (`TELayerNormColumnParallelLinear`, `TEGroupedMLP`, etc.)
failed for NemotronH models due to two issues:
1. **Grouped expert prefix bug**: The `experts.linear_fc1/fc2` import
rules had a hard-coded `mtp.layers.{}` prefix, which was only correct
for MTP layers. When regular decoder MoE layers use `TEGroupedMLP` (via
the full TE spec), the importer generated incorrect HF keys (e.g.,
`mtp.layers.27.mixer.experts.0.up_proj.weight` instead of
`backbone.layers.27.mixer.experts.0.up_proj.weight`).
2. **Fused layer norm loading**: In the full TE spec, layer norms are
fused into `TELayerNormColumnParallelLinear` modules as
`layer_norm_weight`. The importer's `_name_remapping` would crash trying
to load `layer_norm_weight` from a non-existent HF path (e.g.,
`backbone.layers.X.mixer.in_proj.layer_norm_weight`), when the actual HF
norm weight lives at `backbone.layers.X.norm.weight`.
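The prefix bug can be illustrated with a tiny sketch. `expert_key` below is a hypothetical helper (not the actual importer API) that mirrors how a rule's prefix template is formatted with the layer index:

```python
def expert_key(prefix_template: str, layer_idx: int, expert_idx: int) -> str:
    """Build the HF checkpoint key for one grouped expert's up_proj weight."""
    return f"{prefix_template.format(layer_idx)}.mixer.experts.{expert_idx}.up_proj.weight"

# Buggy: decoder layer 27 gets an MTP-style key that does not exist in the HF checkpoint.
buggy = expert_key("mtp.layers.{}", 27, 0)
# Fixed: regular decoder layers live under "backbone.layers" in the HF checkpoint.
fixed = expert_key("backbone.layers.{}", 27, 0)
```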
### Changes
**`mcore_nemotron.py`**:
- Fixed grouped expert prefix from `mtp.layers.{}` to
`backbone.layers.{}`. The `_grouped_mlp_merging` function already
handles `backbone` → `mtp` replacement when `is_mtp=True`, so both
decoder and MTP layers work correctly.
- Added `mapping={"layer_norm_weight": None}` to `in_proj` and
`linear_fc1` rules to skip `layer_norm_weight` during `_name_remapping`
(loaded separately via `fused_norm`).
- Added `fused_norm` rule
(`NameRemapping("backbone.layers.{}.norm.weight")`) to load HF norm
weights into fused TE modules.
**`megatron_importer.py`**:
- Added `source_key is None` check in `_name_remapping` to skip keys
mapped to `None` in the mapping dict (keeps existing value instead of
crashing on missing HF key).
- Added fused norm loading in `_import_mamba_layer`: after loading
`in_proj`, loads `layer_norm_weight` from HF via `fused_norm` rule when
`layer.norm` is `IdentityOp`.
- Added fused norm loading in `_import_transformer_layer`: loads
`layer_norm_weight` into `linear_qkv` (when `input_layernorm` is
`IdentityOp`) and into `linear_fc1` (when `pre_mlp_layernorm` is
`IdentityOp`).
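The two importer changes can be sketched as follows. This is a minimal standalone approximation (plain dicts instead of modules and tensors; `IdentityOp`, `name_remapping_load`, and `load_fused_norm` are illustrative stand-ins, not the real signatures):

```python
class IdentityOp:
    """Stand-in for Megatron's IdentityOp (norm fused into the following linear)."""

def name_remapping_load(hf_state_dict: dict, module_params: dict, mapping: dict) -> dict:
    """Sketch of _name_remapping: a key mapped to None keeps its current value."""
    loaded = {}
    for param_name, current_value in module_params.items():
        source_key = mapping.get(param_name, param_name)
        if source_key is None:
            # New check: explicitly skipped key -- keep the existing value
            # instead of crashing on a missing HF key.
            loaded[param_name] = current_value
        else:
            loaded[param_name] = hf_state_dict[source_key]
    return loaded

def load_fused_norm(layer_norm, rules: dict, hf_state_dict: dict, layer_idx: int):
    """Load the HF norm weight for a fused TE module, guarded so that
    local-spec models (a real norm module, no fused_norm rule) are no-ops."""
    if isinstance(layer_norm, IdentityOp) and "fused_norm" in rules:
        return hf_state_dict[rules["fused_norm"].format(layer_idx)]
    return None
```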
## Usage
The full TE spec is enabled via the `--full-te-spec` flag on the
Megatron-LM side (separate PR). On the ModelOpt side, no user-facing
changes are needed -- the import rules automatically handle both local
spec and full TE spec models.
```bash
# Convert HF checkpoint to Megatron with full TE spec (megatron-lm side)
unset MLM_MODEL_CKPT
export MLM_MODEL_SAVE=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16_mlm
export HF_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
export PP=2
export MLM_EXTRA_ARGS="--full-te-spec"
bash convert.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
# Quantize the converted checkpoint (megatron-lm side)
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16_mlm
export MLM_MODEL_SAVE=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm
export HF_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
bash quantize.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 FP8_DEFAULT_CFG
# Generate
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm
./generate.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
# MMLU
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec --fraction 0.05 --disable-tqdm"
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm
./mmlu.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```
## Testing
- Tested end-to-end: HF → Megatron conversion → FP8 quantization →
inference (generate) → MMLU evaluation with
Nemotron-3-Nano-30B-A3B-BF16.
- Verified the resulting model structure matches Megatron-Bridge's TE
spec output (`TELayerNormColumnParallelLinear`, `TEGroupedMLP`,
`IdentityOp` norms, etc.).
- Verified quantized model produces coherent text generation outputs.
- Verified backward compatibility: all changes are no-ops for existing
local-spec pipelines (guarded by `IdentityOp` checks, `hasattr` checks,
and `"fused_norm" in self.rules` checks).
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes -- all changes are
guarded by conditions that only activate for full TE spec models. Local
spec models follow the exact same code paths as before.
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
No
## Additional Information
Companion megatron-lm changes (separate PR):
- `megatron/core/post_training/modelopt/mamba/model_specs.py`: Added
`use_full_te_spec` parameter to return canonical `mamba_stack_spec` from
`mamba_layer_specs.py`.
- `megatron/post_training/model_builder.py`: Passes
`use_full_te_spec=args.full_te_spec` to `get_mamba_stack_modelopt_spec`.
- `megatron/post_training/arguments.py`: Added `--full-te-spec` CLI
flag.
- `examples/post_training/modelopt/convert_model.py`: Skip
`moe_grouped_gemm=False` override when `--full-te-spec` is set.
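For orientation, the new CLI flag described above would look roughly like the following argparse sketch. The flag name comes from the PR description; the help text and the actual Megatron-LM parser wiring are assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--full-te-spec",
    action="store_true",
    help="Build the Mamba stack from the canonical TE layer specs "
    "(TELayerNormColumnParallelLinear, TEGroupedMLP) instead of the local spec.",
)

# argparse converts "--full-te-spec" to the attribute name "full_te_spec".
args = parser.parse_args(["--full-te-spec"])
```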
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Added support for loading fused normalization weights during model
import.
* **Bug Fixes**
* Improved weight mapping logic to correctly skip redundant layer norm
weights in specialized model architectures.
* **Refactor**
* Reorganized expert model parallel configuration paths for better
compatibility with mixed parallel processing settings.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: James Shen <yueshen@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
File tree: 2 files changed (+55, -5 lines) in `modelopt/torch/export/plugins`.