## Summary
Split out of #1501 so the pack=True calibration packing change can land
independently. This PR carries the pruning + export-side fixes.
**Pruning bug fixes**
- Register `HybridModel` (parent of `MambaModel` in modern Megatron-LM)
under a new `HAS_HYBRID` flag so `mcore_minitron` actually prunes
Nemotron-H et al. Previously `HybridModel` instances fell through
`convert_to_dynamic`, got `freeze()`-ed (collapsing `hidden_size` /
`num_layers` to a single choice), and produced unloadable saved
checkpoints with mixed pruned/unpruned dims.
- Replace the `isinstance(MambaModel)` gate in `_get_hybrid_pattern_key`
with attribute-presence detection so both `MambaModel` (still using
`hybrid_override_pattern`) and plain `HybridModel`
(`hybrid_layer_pattern`) are handled uniformly.
- Track `in_features` as a dynamic attribute on
`_DynamicTEQKVLayerNormColumnParallelLinear` so TE's forward-time
`inp_shape[-1] == in_features` assertion holds when `hidden_size` is
pruned.
- Dedupe MambaModel / HybridModel divisor dict into `_HYBRID_DIVISORS`.
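
The attribute-presence detection mentioned above could look roughly like the sketch below. It is only an illustration, not the actual `_get_hybrid_pattern_key` code: the helper name `get_hybrid_pattern_key`, the stand-in objects, the example pattern string, and the lookup order are all assumptions. Only the two attribute names come from the description, and whether they live on the model or on its config is not stated here.

```python
from types import SimpleNamespace

# Attribute names taken from the PR description; the lookup order is an assumption.
_PATTERN_ATTRS = ("hybrid_layer_pattern", "hybrid_override_pattern")


def get_hybrid_pattern_key(model):
    """Return the name of the hybrid-pattern attribute that ``model`` carries.

    Hypothetical helper: it checks which attribute is present instead of
    testing ``isinstance(model, MambaModel)``.
    """
    for attr in _PATTERN_ATTRS:
        if getattr(model, attr, None) is not None:
            return attr
    raise AttributeError(f"{type(model).__name__} has none of {_PATTERN_ATTRS}")


# Stand-ins for MambaModel / HybridModel instances (illustration only;
# the pattern string is a made-up placeholder).
mamba_like = SimpleNamespace(hybrid_override_pattern="M-M*-")
hybrid_like = SimpleNamespace(hybrid_layer_pattern="M-M*-")

print(get_hybrid_pattern_key(mamba_like))
print(get_hybrid_pattern_key(hybrid_like))
```

Keying on the attribute rather than the class means a subclass that still uses the older attribute name needs no special case.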
**Fused-TE-spec import/export for GPT-family**
- Importer: prefer per-context keys (`fused_input_layernorm`,
`fused_pre_mlp_layernorm`); fall back to legacy `fused_norm` for
Nemotron-H back-compat. **Raise `KeyError`** when a fused-TE model has
neither rule registered — the branch only fires when the model uses
fused `TELayerNormColumnParallelLinear`, so a missing rule is
unambiguously a plugin misconfig that would otherwise ship a
chance-accuracy checkpoint.
- Exporter: mirror the same fallback chain in `_get_fused_norm_weight`
so GPT-family models round-trip cleanly back to HF.
- Add the new rules to Qwen3, Qwen2.5, Llama, Llama4 (MoE-only, only
`fused_input_layernorm`), DeepSeek, GptOss (MoE-only, only
`fused_input_layernorm`) import and export mappings.
- Preserve TE `_extra_state` from the existing module state dict (don't
blank to `None`) at both call sites in the importer.
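
The fallback chain shared by the importer and exporter can be sketched as below. This assumes the plugin rules behave like a plain dict, and `resolve_fused_norm_rule` is a hypothetical name rather than a function from the codebase. Only the three key names and the `KeyError` behaviour are taken from the description above.

```python
def resolve_fused_norm_rule(rules, context_key):
    """Pick the rule used to load or export a fused layernorm weight.

    ``rules`` is a plain dict standing in for a plugin's rule mapping
    (hypothetical shape). ``context_key`` is a per-context key such as
    "fused_input_layernorm" or "fused_pre_mlp_layernorm".
    """
    if context_key in rules:
        return rules[context_key]
    if "fused_norm" in rules:  # legacy key, kept for Nemotron-H back-compat
        return rules["fused_norm"]
    # Neither key is registered: treat it as a plugin misconfiguration
    # and fail loudly instead of silently skipping the weight.
    raise KeyError(
        f"no rule registered for {context_key!r} or legacy 'fused_norm'"
    )


# Placeholder rule values; real rules would be mapping objects.
gpt_rules = {
    "fused_input_layernorm": "input-rule",
    "fused_pre_mlp_layernorm": "mlp-rule",
}
legacy_rules = {"fused_norm": "legacy-rule"}

print(resolve_fused_norm_rule(gpt_rules, "fused_pre_mlp_layernorm"))
print(resolve_fused_norm_rule(legacy_rules, "fused_input_layernorm"))
```

Raising instead of returning `None` is the design point here: a silently skipped layernorm weight still produces a loadable checkpoint, just one that evaluates at chance.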
**Misc**
- `megatron_prefill`: call `.contiguous()` on the logits slice before
`broadcast_from_last_pipeline_stage`. The broadcast asserts contiguity,
which fails when sequence parallelism (SP) pads `seq_length` to a
multiple of the tensor-parallel size (TP).
- `megatron_mmlu`: accept `mmlu_dataset` kwarg so callers can point at a
local copy of `cais/mmlu`.
- `warn_rank_0`: auto-bump `stacklevel` by 1 inside the wrapper so
callers' warnings point at user code, not at the wrapper frame.
- `tools/launcher/examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml`: bump
`mmlu_lower_bound` 0.68 → 0.75 (validated end-to-end with the fused-norm
import fix).
- CHANGELOG: bug-fix entry for the importer; date correction on the 0.44
entry.
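
The `stacklevel` bump can be illustrated with the standard `warnings` module alone. The wrapper below is a hypothetical stand-in for `warn_rank_0` (the rank check is left out, and the name `warn_wrapper` is an assumption); it only shows why adding 1 makes the warning point at the caller.

```python
import warnings


def warn_wrapper(message, stacklevel=1, **kwargs):
    """Hypothetical stand-in for a rank-aware warning wrapper.

    Adds 1 to ``stacklevel`` so the warning is attributed to the caller
    of this wrapper rather than to the wrapper's own frame.
    """
    warnings.warn(message, stacklevel=stacklevel + 1, **kwargs)


def user_code():
    warn_wrapper("something looks off")  # the warning should point here


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    user_code()

# The recorded line number is the call site inside user_code().
print(caught[0].lineno, caught[0].message)
```

Without the bump, a caller passing the default `stacklevel=1` would see every warning attributed to the line inside the wrapper.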
## Consumer
Megatron-LM PR NVIDIA/Megatron-LM#4807 —
`prune.py` / `mmlu.py` consume these APIs and currently carry inline
workarounds (WARs) for the released 0.44. Once 0.45 ships and the
modelopt pin is bumped, those workarounds collapse to one-liners.
Related: #1501 (calibration packing).
## Summary by CodeRabbit
* **Bug Fixes**
  * Importer/exporter now correctly load fused LayerNorm weights for
    GPT-family models, preferring context-specific fused keys with a legacy
    fallback.
* **New Features**
  * Hybrid Mamba/HybridModel support added for pruning/NAS workflows.
  * MMLU evaluation accepts a customizable dataset path (default:
    "cais/mmlu").
* **Improvements**
  * Extended export/import mappings and state handling across DeepSeek,
    GPT, Llama, Qwen; ensured last-stage logits are contiguous.
* **Documentation**
  * Updated changelog entry and release date adjustment.
---------
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
CHANGELOG.rst: 5 additions, 1 deletion
@@ -26,7 +26,11 @@ Changelog
 - DeepSeek PTQ (``examples/deepseek/ptq.py``) now defaults to native top-k calibration with post-hoc per-layer peer-max sync of expert ``input_quantizer.amax``; the all-experts path is preserved behind ``--calib_all_experts``.
 - Add NVFP4 W4A16 weight-only quantization (``w4a16_nvfp4``): FP4 weights with group_size=16, BF16 activations, no calibration forward pass required. Use ``mtq.W4A16_NVFP4_CFG`` or ``--qformat w4a16_nvfp4`` in ``hf_ptq.py``. vLLM deployment support is in progress.
 
-0.44 (2026-05-18)
+**Bug Fixes**
+
+- Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.