Support Mixed precision & Static MSE in MCore; Nemotron Super v3 NVFP4 recipe (#1521)
### What does this PR do?
Type of change: New features + Bug fixes
Mixed Precision and MSE support in MCore PTQ
- Support mixed-precision export in MCore by detecting mixed-precision
layers in the HF quant config
- Restore static quantizers during MCore checkpoint restore as
`NVFP4QTensor` rather than `TensorQuantizer`, which can trigger max
calibration; static quantizers must skip max calibration on restore.
This fixes a bug in MCore export for MSE.
- Fix dynamic block quantizer detection when `block_sizes` is
dict-backed.
- Add a YAML quantization recipe that roughly mirrors Nemotron 3 Super
NVFP4 `hf_quant_config.json`
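The mixed-precision detection described above can be illustrated with a minimal sketch. It assumes the per-layer `quant_algo` values sit under `quantization.quantized_layers` in `hf_quant_config.json`; the helper names and the layer names in the example are hypothetical and are not taken from this PR.

```python
from collections import Counter


def summarize_quant_algos(hf_quant_config: dict) -> Counter:
    """Count layers per quant_algo (assumed layout: quantization.quantized_layers)."""
    layers = hf_quant_config.get("quantization", {}).get("quantized_layers", {})
    return Counter(cfg.get("quant_algo") for cfg in layers.values())


def is_mixed_precision(hf_quant_config: dict) -> bool:
    """Treat a config as mixed precision when layers use more than one quant_algo."""
    return len(summarize_quant_algos(hf_quant_config)) > 1


# Hypothetical layer names, for illustration only.
example_config = {
    "quantization": {
        "quantized_layers": {
            "backbone.layers.1.mixer.experts.0.up_proj": {
                "quant_algo": "NVFP4",
                "group_size": 16,
            },
            "backbone.layers.1.mixer.shared_experts.up_proj": {
                "quant_algo": "FP8",
            },
        }
    }
}

if __name__ == "__main__":
    print(dict(summarize_quant_algos(example_config)))
    print(is_mixed_precision(example_config))
```

A config without a `quantized_layers` section is reported as not mixed, so in this sketch a uniform-precision checkpoint needs no special handling.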
Export bug fixes
- Copy `.py` files correctly from the original HF checkpoint (e.g. for
the reasoning parser)
## Super recipe
Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
hf_quant_config.json:
- MoE routed experts (mixer.experts.<N>.{up,down}_proj): NVFP4 W4A4
with MSE weight calibration, group_size 16
- MoE shared experts (mixer.shared_experts.{up,down}_proj): FP8
per-tensor
- Mamba mixer linears (mixer.{in,out}_proj): FP8 per-tensor
- KV cache: FP8
- Rest: not quantized
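As a rough illustration of the layer-to-format mapping listed above, the sketch below resolves a layer name to a format using glob patterns. The rule table, the `format_for` helper, and the `backbone.layers.*` prefixes are hypothetical; the actual YAML recipe in this PR may express the same rules differently.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table derived from the list above; first match wins.
RECIPE_RULES = [
    ("*mixer.experts.*.up_proj", "NVFP4"),
    ("*mixer.experts.*.down_proj", "NVFP4"),
    ("*mixer.shared_experts.up_proj", "FP8"),
    ("*mixer.shared_experts.down_proj", "FP8"),
    ("*mixer.in_proj", "FP8"),
    ("*mixer.out_proj", "FP8"),
]

# The KV cache is configured separately from the per-layer rules.
KV_CACHE_FORMAT = "FP8"


def format_for(layer_name: str):
    """Return the format for a layer name, or None if it stays unquantized."""
    for pattern, fmt in RECIPE_RULES:
        if fnmatchcase(layer_name, pattern):
            return fmt
    return None


if __name__ == "__main__":
    for name in (
        "backbone.layers.3.mixer.experts.7.up_proj",
        "backbone.layers.3.mixer.shared_experts.down_proj",
        "backbone.layers.0.mixer.in_proj",
        "lm_head",
    ):
        print(name, "->", format_for(name))
```

Layers that match no rule return `None`, which corresponds to the "not quantized" default in the list.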
### Usage
```python
# Add a code snippet demonstrating how to use this
```
### Testing
Tested on Nemotron model
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->
### Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **New Features**
* Added NVFP4 (4-bit) quantization checkpoint restore and export support
for Megatron-Core models
* Added tokenizer file export capability in model checkpoints
* Extended quantization support for expert-parallel distributed training
* Introduced new PTQ recipes for Nemotron-3-Super models with
mixed-precision quantization
* **Bug Fixes**
* Fixed FP8 and FP4 hardware compatibility detection on non-CUDA systems
* Improved offline Hugging Face Hub access handling with better error
messaging
* Enhanced calibration validation for mixture-of-experts models
* Fixed amax (absolute maximum) validation for static block quantizers
* **Documentation**
* Updated expert weight quantization configuration documentation
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jenny Chen <jennifchen@nvidia.com>
File: CHANGELOG.rst (4 additions, 0 deletions)
@@ -36,10 +36,14 @@ Changelog
 - The ``nemotron-sft-agentic-v2`` registered dataset (added in #1498) now uses only the ``search`` split. The previously configured ``interactive_agent`` and ``tool_calling`` splits contain content-level defects (heterogeneous schema and a malformed JSON row, respectively) that cause pyarrow's streaming JSON reader to fail deterministically.
 - Add shared Megatron-Core calibration forward loop: ``modelopt.torch.utils.plugins.megatron_calibration.get_megatron_calibration_forward_loop`` produces the ``forward_loop`` callable expected by ``mtq.quantize`` / ``mtp.prune``. Replaces the bespoke calibration loops in Megatron-LM and Megatron-Bridge for quantization and pruning with a single canonical implementation.
 - Add ``pack=True`` mode to ``get_dataset_dataloader`` (Megatron-LM pretraining-style global-stream document packing): all raw samples concatenated EOS-separated into one token stream, sliced into uniform ``max_sample_length`` rows. Used by the shared megatron calibration loop.
+- Support Megatron-Core checkpoint restore and export for MSE ``NVFP4StaticQuantizer``.
+- Add mixed-precision FP8 + NVFP4 export for Megatron-Core: per-layer ``quant_algo`` recorded under ``quantized_layers`` in ``hf_quant_config.json``, PP-aware ``kv_cache_dtype`` gather, fused-QKV exclude split into per-HF-name ``q/k/v_proj`` entries.
 - Add quantized ``nn.Embedding`` support. ``nn.Embedding`` is now registered in ``QuantModuleRegistry`` and exposes ``weight_quantizer`` (embedding table), ``output_quantizer`` (lookup activations), and a permanently disabled ``input_quantizer`` placeholder — embedding inputs are integer indices and cannot be fake-quantized, so direct ``enable*()`` calls raise. ``export_hf_checkpoint`` packs quantized embedding weights alongside Linear layers. Embedding quantizers are opt-in (``parent_class: nn.Embedding`` disabled by default).

 **Bug Fixes**

+- In Megatron-Core only do EP amax sync for routed expert weights if ``sync_expert_weight_amax=True``. Previously EP amax sync would sync routed expert weights across EP ranks even when ``sync_expert_weight_amax`` was False.
 - Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.