Support Mixed precision & Static MSE in MCore; Nemotron Super v3 NVFP4 recipe by jenchen13 · Pull Request #1521 · NVIDIA/Model-Optimizer

jenchen13 · 2026-05-19T18:24:56Z

What does this PR do?

Type of change: New recipe + Bug Fixes

MCore and MSE fixes

Support mixed-precision export in MCore by detecting mixed-precision layers in the HF quant config.
Restore the static quantizer during MCore checkpoint restore as NVFP4QTensor rather than TensorQuantizer. TensorQuantizer can call max calibrate, which must be skipped for a static quantizer during restore. This fixes a bug during MCore export for MSE.
Skip MSE calibration for any non-NVFP4 quantization format.
Fix dynamic block quantizer detection when block_sizes is dict-backed.
Add a YAML quantization recipe that roughly mirrors the Nemotron 3 Super NVFP4 hf_quant_config.json.
Export bug fixes
Copy .py files properly from the original HF checkpoint (for the reasoning parser, etc.).
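
The dict-backed `block_sizes` detection and the "MSE only for NVFP4" rule listed above can be illustrated with a minimal sketch. The helper names `block_quant_kind` and `wants_mse` are hypothetical and are not ModelOpt APIs. The config shape (an integer axis key plus `type` and `scale_bits`, with `e2m1`/`e4m3` strings) follows the recipe entries quoted in the review of this PR, and treating a missing `type` as static is an assumption.

```python
from typing import Any, Optional


def block_quant_kind(block_sizes: Optional[dict]) -> Optional[str]:
    """Return "dynamic", "static", or None for a dict-backed block_sizes config."""
    if not isinstance(block_sizes, dict):
        return None
    # Integer keys (for example -1) name the blocked axes; string keys carry options.
    has_block_axis = any(isinstance(k, int) and v for k, v in block_sizes.items())
    if not has_block_axis:
        return None
    # Assumption: a missing "type" means static block quantization.
    return "dynamic" if block_sizes.get("type") == "dynamic" else "static"


def wants_mse(cfg: dict) -> bool:
    """Hypothetical filter: run MSE only for static-block NVFP4 (e2m1 values, e4m3 scales)."""
    block_sizes: Any = cfg.get("block_sizes")
    is_nvfp4 = (
        cfg.get("num_bits") == "e2m1"
        and isinstance(block_sizes, dict)
        and block_sizes.get("scale_bits") == "e4m3"
    )
    return bool(is_nvfp4) and block_quant_kind(block_sizes) == "static"


nvfp4_static = {"num_bits": "e2m1", "block_sizes": {-1: 16, "type": "static", "scale_bits": "e4m3"}}
nvfp4_dynamic = {"num_bits": "e2m1", "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": "e4m3"}}
fp8_per_tensor = {"num_bits": "e4m3"}

for cfg in (nvfp4_static, nvfp4_dynamic, fp8_per_tensor):
    print(block_quant_kind(cfg.get("block_sizes")), wants_mse(cfg))
```

Reading `type` out of the dictionary, rather than assuming an attribute, is the point of the sketch: a dict-backed config carries the dynamic/static flag next to the axis keys.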

Super recipe

Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 hf_quant_config.json:

MoE routed experts (mixer.experts..{up,down}_proj): NVFP4 W4A4 weight MSE, group_size 16
MoE shared experts (mixer.shared_experts.{up,down}_proj): FP8 per-tensor
Mamba mixer linears (mixer.{in,out}_proj): FP8 per-tensor
KV cache: FP8
rest: not quantized
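
As a rough illustration of the per-layer mapping listed above, the sketch below resolves a module name to a format with glob patterns. The pattern strings, the `format_for` helper, and the example module names are hypothetical and only mirror the bullet list; the KV-cache setting is not modeled, and the real recipe is the YAML file added in this PR.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern table mirroring the list above; the first match wins.
RECIPE = [
    ("*mixer.experts.*", "NVFP4 W4A4 (weight MSE, group_size 16)"),
    ("*mixer.shared_experts.*", "FP8 per-tensor"),
    ("*mixer.in_proj", "FP8 per-tensor"),
    ("*mixer.out_proj", "FP8 per-tensor"),
]


def format_for(module_name: str) -> str:
    """Return the quantization format for a module, or "not quantized" if nothing matches."""
    for pattern, fmt in RECIPE:
        if fnmatchcase(module_name, pattern):
            return fmt
    return "not quantized"


# Illustrative module names only; they are not taken from the model definition.
for name in (
    "backbone.layers.3.mixer.experts.7.up_proj",
    "backbone.layers.3.mixer.shared_experts.down_proj",
    "backbone.layers.0.mixer.in_proj",
    "lm_head",
):
    print(name, "->", format_for(name))
```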

Usage

# Add a code snippet demonstrating how to use this

Testing

TODO test in HF and MCore PTQ on Nemotron model

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

New Features
- Added NVFP4 W4A16 weight-only quantization format support
- FP8+NVFP4 mixed-precision export with per-layer quantization metadata tracking
- New Nemotron-3-Super-120B-A12B PTQ recipe configurations
- Enhanced Megatron-Core checkpoint restore and export for NVFP4 quantization
- Offline Hugging Face Hub support for model exports
Bug Fixes
- Fixed Megatron-Core expert parallel amax synchronization for routed expert weights

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

### What does this PR do?

Type of change: Bug fix

This PR enables `auto_quantize` for Megatron expert parallel MoE flows by including the expert model parallel group when aggregating scores and costs and when synchronizing selected recipes. It also derives the search budget from the no-quant candidate costs in `candidate_stats`, so sharded expert layers use global candidate costs instead of local module weights.

### Usage

```python
model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 8.0},
    quantization_formats=[mtq.NVFP4_DEFAULT_CFG, mtq.FP8_DEFAULT_CFG],
    data_loader=data_loader,
    forward_step=forward_step,
)
```

### Testing

- Focused Megatron EP test from local log: `python -m pytest tests/gpu_megatron/torch/quantization/plugins/test_megatron.py::test_auto_quantize_moe_ep -xvs` in NGC PyTorch 26.01 (`1 passed` in 134.37s).
- Added unit coverage for deriving the auto_quantize budget from no-quant candidate costs.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: N/A

### Additional Information

Base branch: `jennifchen/super_nvfp4_recipe`.
Signed-off-by: realAsma <akuriparambi@nvidia.com> Signed-off-by: Jenny Chen <jennifchen@nvidia.com> Co-authored-by: Jenny Chen <jennifchen@nvidia.com>
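
The budget derivation described in this commit message can be sketched as follows. The layout of `candidate_stats`, the `NO_QUANT` key, and the `search_budget` helper are hypothetical; the sketch also assumes that costs are weight sizes at 16-bit precision and that an `effective_bits` constraint scales the unquantized total by `effective_bits / 16`.

```python
NO_QUANT = "NONE"  # hypothetical key for the unquantized candidate


def search_budget(candidate_stats: dict, effective_bits: float) -> float:
    """Derive the cost budget from the no-quant candidate costs of every searchable layer."""
    # Using the (already globally aggregated) candidate costs avoids reading local
    # module weights, which would undercount layers sharded across expert-parallel ranks.
    total_no_quant = sum(stats["costs"][NO_QUANT] for stats in candidate_stats.values())
    return total_no_quant * effective_bits / 16.0


# Illustrative numbers only.
candidate_stats = {
    "decoder.layers.0.attn": {"costs": {NO_QUANT: 1024.0, "FP8": 512.0, "NVFP4": 288.0}},
    "decoder.layers.0.experts": {"costs": {NO_QUANT: 4096.0, "FP8": 2048.0, "NVFP4": 1152.0}},
}
budget = search_budget(candidate_stats, effective_bits=8.0)
print(budget)
```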

coderabbitai · 2026-05-19T18:25:14Z

Caution

Review failed

Failed to post review comments

📝 Walkthrough

This PR extends NVFP4 static-block quantization with calibration validation and state restoration; adds distributed expert-parallelism support to auto-quantize including format consensus across EP ranks; implements per-layer mixed-precision quantization metadata recording and export for Megatron-Core models targeting Hugging Face format; and introduces HF Hub offline mode support with two new Nemotron-3-Super-120B PTQ recipe configurations.

Changes

NVFP4 Static Quantizer and Expert Parallelism Integration

Layer / File(s)	Summary
Block Quantization Detection and Backend Availability `modelopt/torch/quantization/nn/modules/tensor_quantizer.py`, `modelopt/torch/quantization/backends/utils.py`	Refactors block quantization predicates with unified `is_block_quant` property; adds CUDA availability checks to `fp8_compatible()` and `fp4_compatible()` before device capability assertions.
NVFP4 Amax Validation and Static Quantizer Promotion `modelopt/torch/export/quant_utils.py`, `modelopt/torch/quantization/conversion.py`	Detects invalid amax values (non-finite, negative) during NVFP4 calibration; introduces `maybe_promote_nvfp4_static_quantizer()` to upgrade TensorQuantizer to NVFP4StaticQuantizer during restore when saved state indicates NVFP4 static quantizer.
Quantizer State Restoration and Megatron Sharding `modelopt/torch/quantization/plugins/custom.py`, `modelopt/torch/quantization/plugins/megatron.py`	Restores NVFP4 static quantizer state in custom and Megatron plugins; validates saved state completeness; adjusts Megatron sharding to treat replicated `global_amax` as non-sharded across column/row parallel dimensions.
MoE Calibration Validation and Expert Parallelism Amax Sync `modelopt/torch/quantization/model_calib.py`, `modelopt/torch/quantization/plugins/megatron.py`	Validates TP compatibility for static-block NVFP4 quantizers; checks MoE calibration completeness across DP/EP/TP groups; conditionally gates EP amax synchronization for routed-expert weights via `sync_expert_weight_amax` flag; refactors MSE calibration bootstrap to target static-weight quantizers.
Expert Parallelism Support in AutoQuantize `modelopt/torch/quantization/algorithms.py`	Extends auto-quantize distributed synchronization to aggregate scores and costs across `expert_model_parallel_group`; adds EP consistency check for quantization format selection; extends quantization grouping regexes for NemotronH MoE expert patterns; refactors total weight-size computation to use candidate stats instead of module introspection.
NVFP4 Weight Export and Scaling Factor Computation `modelopt/torch/quantization/qtensor/nvfp4_tensor.py`	Introduces `_get_static_global_amax()` helper to resolve NVFP4 global_amax from either attribute name; computes weight scaling factors with FP8 overflow clamping for per-block scales.
Per-Layer Mixed-Precision Quantization Metadata Export `modelopt/torch/export/unified_export_megatron.py`	Records per-layer quantization format and block-size metadata during all export mapping paths; gathers per-layer configs and KV-cache dtype from all ranks; builds HF quantization config with per-layer and KV-cache settings when mixed-precision metadata is present; falls back to simpler config when not available.
Hugging Face Hub Offline Mode and Remote Code Handling `modelopt/torch/export/plugins/hf_checkpoint_utils.py`	Detects HF Hub offline mode via `HF_HUB_OFFLINE` environment variable; downloads Python sidecar files using `snapshot_download` with `local_files_only` when offline; handles missing offline cache with informative RuntimeError.
Megatron-Core Nemotron Export and PTQ Recipes `modelopt/torch/export/plugins/mcore_nemotron.py`, `modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`, `modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`	Updates `core_attention` mapping to specify key/value scale parameter names; adds two PTQ recipe configurations for Nemotron-3-Super-120B with mixed-precision NVFP4/FP8 quantization (MSE-based and max-calibration variants).
Dataset Utilities and Documentation `modelopt/torch/utils/dataset_utils.py`, `CHANGELOG.rst`, `examples/specdec_bench/specdec_bench/datasets/speed.py`	Adds `get_dataloader_from_dataset()` helper for distributed dataset loading; updates CHANGELOG documenting NVFP4 checkpoint restore, mixed-precision export, PTQ recipes, expert parallelism, and HF offline support; reformats metadata comprehension.
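
The "FP8 overflow clamping for per-block scales" mentioned in the table above can be illustrated with plain floats. This is a sketch under the commonly used NVFP4 two-level scaling convention, which is an assumption here and is not taken from this PR's code: the global scale is `global_amax / (6 * 448)`, where 6 is the largest E2M1 magnitude and 448 the largest finite E4M3 value, and each per-block scale is `block_amax / 6 / global_scale`. Rounding the scale to FP8 is not modeled, and `nvfp4_block_scales` is a hypothetical name.

```python
E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def nvfp4_block_scales(block_amax, global_amax):
    """Return (global_scale, per_block_scales), clamping each block scale to the E4M3 range."""
    global_scale = global_amax / (E2M1_MAX * E4M3_MAX)
    scales = []
    for amax in block_amax:
        scale = amax / E2M1_MAX / global_scale
        # A block amax larger than global_amax (for example a stale global value)
        # would exceed 448, so clamp instead of overflowing the FP8 scale.
        scales.append(min(scale, E4M3_MAX))
    return global_scale, scales


global_scale, scales = nvfp4_block_scales([12.0, 3.0, 24.0], global_amax=12.0)
print(global_scale, scales)
```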

Comprehensive Test Coverage

Layer / File(s)	Summary
NVFP4 Static Export Round-Trip Tests `tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py`, `tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py`	Adds extensive CPU tests validating NVFP4 static export under extreme per-block amax conditions (overflow/underflow/corner-cases), verifies finite scales and dequantized values, equivalence with dynamic export, and manual FP4 grid alignment; adds CUDA test for FP8 scale overflow clamping.
Hugging Face Hub Offline Mode Tests `tests/unit/torch/export/test_hf_checkpoint_utils.py`	Tests offline mode detection and cached snapshot usage with `local_files_only=True`, plus error handling when offline cache is missing.
Mixed-Precision Export and Expert Parallelism Tests `tests/gpu_megatron/torch/export/test_unified_export_megatron.py`, `tests/gpu_megatron/torch/quantization/plugins/test_megatron.py`	Validates new HF quantization config format in export, adds QKV slicing exclude_modules record verification, and exercises auto-quantize under expert parallelism with EP consistency validation.
Auto-Quantize Infrastructure and Bootstrap Updates `tests/_test_utils/torch/quantization/quantize_common.py`, `tests/unit/torch/quantization/test_autoquant.py`, `tests/unit/torch/quantization/test_mse_calibrator.py`, `tests/unit/torch/quantization/plugins/test_fused_experts.py`	Updates `auto_quantize_helper` signature for customizable calibration data and callbacks; adds budget cost calculation test using candidate stats; updates MSE calibrator test for sweep-style configuration; updates bootstrap test to use renamed static-weight quantizer bootstrap helper.
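
The offline-mode behavior summarized in the tables above (detect `HF_HUB_OFFLINE`, then fetch the Python sidecar files with `local_files_only`) might look roughly like the sketch below. The helper names are hypothetical, the set of accepted truthy strings is an assumption, and the returned dictionary only shows keyword arguments that a caller could forward to `huggingface_hub.snapshot_download`; no download is performed here.

```python
import os
from typing import Mapping, Optional

_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}  # assumed truthy spellings


def hf_hub_offline(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the HF_HUB_OFFLINE environment variable is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("HF_HUB_OFFLINE", "").strip().upper() in _TRUE_VALUES


def sidecar_download_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Keyword arguments for fetching only *.py files, using the local cache when offline."""
    return {"allow_patterns": ["*.py"], "local_files_only": hf_hub_offline(env)}


print(sidecar_download_kwargs({"HF_HUB_OFFLINE": "1"}))
print(sidecar_download_kwargs({}))
```

When the cache is empty in offline mode, the PR summary says the export raises an informative RuntimeError; that error path is not reproduced in this sketch.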

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1518: Both PRs modify GPT-family TE fused-norm handling in modelopt/torch/export/unified_export_megatron.py with _get_fused_norm_weight(..., primary_key) refactoring and corresponding export path updates.
NVIDIA/Model-Optimizer#1405: Main PR's quantization search changes update QuantRecipe to use typed QuantizerCfgEntry objects, aligning with PR #1405's schematization of config entries in modelopt/torch/quantization/config.py.
NVIDIA/Model-Optimizer#1522: Both PRs extend export/quantization codepaths to recognize W4A16_NVFP4 and add DATASET_COMBOS dataset fan-out logic in modelopt/torch/utils/dataset_utils.py.

Suggested reviewers

kevalmorabia97
sychen52
meenchen
vishalpandya1990
cjluo-nv

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 67.69% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main changes: adding mixed-precision export support in MCore, static MSE support, and a new Nemotron Super v3 NVFP4 quantization recipe.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	No security anti-patterns detected. No unsafe torch.load/numpy.load, hardcoded trust_remote_code, eval/exec on untrusted input, nosec comments, or non-permissive dependencies found.

jenchen13 · 2026-05-19T18:26:55Z

A continuation of #1363

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

modelopt/torch/export/unified_export_megatron.py (1)

818-828: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Treat QUANTIZATION_NONE as unquantized when building exclude_modules.

This branch only records excludes for qformat is None, but the same method immediately returns early on qformat == QUANTIZATION_NONE, and _qkv_slicing() already treats both values the same. As written, any normal module reported as QUANTIZATION_NONE will skip the HF ignore list even though it is still unquantized.
Suggested fix
-        if qformat is None and "norm" not in prefix:
+        if qformat in (None, QUANTIZATION_NONE) and "norm" not in prefix:
             self._record_excluded_module(prefix)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/unified_export_megatron.py` around lines 818 - 828, The
code currently only calls _record_excluded_module(prefix) when qformat is None,
but QUANTIZATION_NONE should be treated the same; update the branch in
unified_export_megatron.py (the block around qformat, QUANTIZATION_NONE,
_get_weight_bias, and _record_excluded_module) so that if qformat is None or
qformat == QUANTIZATION_NONE (and "norm" not in prefix) you record the module as
excluded before the early return; keep the existing early return for
QUANTIZATION_NONE but ensure the exclude is recorded first and keep
compatibility with _qkv_slicing behavior.

modelopt/torch/quantization/algorithms.py (1)

765-782: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Recompute the persisted score/cost after recipe synchronization.

After best_format is replaced by the DP/TP/EP-synchronized value, best_constraints and best_scores are still accumulated from the local solver choice. On ranks that did not originate the synchronized format, self.best["constraints"] / self.best["score"] can end up describing a different recipe than the one actually activated and checkpointed.

Suggested fix

         for name, best_hparam_recipe_info in best_recipe_info.items():
             # Solvers could give different solutions for the same layer across DP/TP/EP groups even though
             # the scores and costs are the same. Lets make sure the same recipe is selected across DP/TP/EP
             _ps = self.model.get_submodule(name.split(".quant_recipe")[0]).parallel_state
             best_format = DistributedProcessGroup.get_dist_syncd_obj(
                 best_hparam_recipe_info["format"],
                 [
                     _ps.data_parallel_group,
                     _ps.tensor_parallel_group,
                     _ps.expert_model_parallel_group,
                 ],
                 lambda a: a[0],
             )

             best_recipe[name] = best_format
-            get_hparam(self.model, name).active = best_format
-            best_constraints += best_hparam_recipe_info["costs"]
-            best_scores += best_hparam_recipe_info["scores"]
+            hparam = get_hparam(self.model, name)
+            hparam.active = best_format
+            best_constraints += hparam.get_cost(best_format)
+            best_scores += hparam.get_score(best_format)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/algorithms.py` around lines 765 - 782, The loop
currently accumulates best_constraints and best_scores from
best_hparam_recipe_info before replacing the local solver's format with the
DP/TP/EP-synchronized best_format; update the code so that after you set
best_recipe[name] = best_format and get_hparam(self.model, name).active =
best_format you recompute and add the costs and scores that correspond to the
actually activated best_format (not the original
best_hparam_recipe_info["format"]); locate the mapping of format->costs/scores
that the solver produced for the layer (referencing best_recipe_info,
best_hparam_recipe_info and get_hparam) and use that entry to increment
best_constraints and best_scores (and keep
self.best["constraints"]/self.best["score"] consistent with the activated
recipe).
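
The mismatch described in this comment can be reproduced with a toy example. Everything below is hypothetical (the `per_layer` table, the format names, and `totals_after_sync`); it only shows why totals should be summed over the synchronized formats rather than over each rank's local solver choice.

```python
def totals_after_sync(per_layer: dict, synced_formats: dict) -> tuple:
    """Sum cost and score for the formats that were actually activated after synchronization."""
    total_cost = 0.0
    total_score = 0.0
    for name, fmt in synced_formats.items():
        total_cost += per_layer[name]["cost"][fmt]
        total_score += per_layer[name]["score"][fmt]
    return total_cost, total_score


# Illustrative per-format cost and score for one layer.
per_layer = {
    "layer0": {
        "cost": {"FP8": 8.0, "NVFP4": 4.5},
        "score": {"FP8": 0.1, "NVFP4": 0.3},
    }
}
local_choice = {"layer0": "FP8"}     # what this rank's solver picked
synced_choice = {"layer0": "NVFP4"}  # what the group agreed on (taken from another rank)

stale = totals_after_sync(per_layer, local_choice)
consistent = totals_after_sync(per_layer, synced_choice)
print(stale, consistent)
```

On a rank whose local choice differs from the synchronized one, the stale totals describe a recipe that was never activated, which is the inconsistency the comment flags.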

🧹 Nitpick comments (1)

tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py (1)
32-42: ⚡ Quick win

Add one regression that uses only the restored _global_amax path.

The implementation change specifically supports static quantizers restored with _global_amax, but this helper only seeds global_amax, so the new restore path is still untested. A single round-trip case that sets _global_amax directly would keep the actual bugfix from regressing.

As per coding guidelines, tests/**/*.py: Write focused unit tests during development and curate production tests to be lean, documenting expected behavior, protecting against regressions, and flagging backward-incompatible changes.

Also applies to: 45-70
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py` around lines
32 - 42, Add a focused unit test that exercises the restored _global_amax code
path: create an NVFP4StaticQuantizer via the existing helper
_make_static_quantizer (or directly instantiate NVFP4StaticQuantizer), set the
private attribute _global_amax (not global_amax) to a tensor value, perform the
export/import (or the same round‑trip flow used elsewhere in this test file) and
assert the quantizer restores using the _global_amax path (e.g., resulting
amax/global_amax behavior matches expected values). Ensure the test is small,
documents the expected behavior, and only validates the single round‑trip
regression scenario so the `_global_amax` restore remains covered.
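
A minimal stand-in for the two lookup paths makes the requested regression concrete. `get_static_global_amax` below is a hypothetical simplification, not the `_get_static_global_amax()` helper from the PR; the order of preference between `global_amax` and `_global_amax` is an assumption, and plain floats replace tensors.

```python
from types import SimpleNamespace


def get_static_global_amax(quantizer):
    """Resolve the global amax from either attribute name, preferring `global_amax`."""
    for attr in ("global_amax", "_global_amax"):
        value = getattr(quantizer, attr, None)
        # Compare against None explicitly; a real tensor has no unambiguous truth value.
        if value is not None:
            return value
    raise ValueError("static NVFP4 quantizer has no global amax")


calibrated = SimpleNamespace(global_amax=2.0, _global_amax=None)
restored_only = SimpleNamespace(_global_amax=3.5)  # the path the review asks to cover

print(get_static_global_amax(calibrated), get_static_global_amax(restored_only))
```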

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 47-79: The routed-expert weight quantizers in this max-calib
recipe (entries with quantizer_name: '*mixer.experts.*weight_quantizer' and
'*mlp.experts*weight_quantizer') are set to type: dynamic but must be static for
a fair max-vs-MSE comparison; update those two quantizer blocks to use type:
static (leave the corresponding input_quantizer blocks as-is) so only the weight
quantizers for routed experts switch from dynamic to static.

In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 30-32: The calibration comment is misleading about FP8 scale
selection: update the comment near the calibration block that mentions "FP8
per-tensor scales" and "NVFP4 weights" (the lines describing MSE searches) to
explicitly state that only NVFP4 weight block scales are selected via MSE while
non-NVFP4 FP8 formats skip MSE and use the stack's default scaling method; edit
the text to clarify that FP8 per-tensor scales for non-NVFP4 are not
MSE-searched to avoid confusion for recipe users.

In `@modelopt/torch/quantization/plugins/custom.py`:
- Around line 148-153: The current check treats incomplete tail blocks as
invalid; instead compute blocks per row as ceil(weight.shape[-1] / block_size)
and total expected_blocks = (weight.numel() // weight.shape[-1]) *
blocks_per_row so padded trailing blocks count toward the expected amax length.
In the validation around quantizer.block_sizes / block_size, replace
expected_blocks = weight.numel() // block_size with rows = weight.numel() //
weight.shape[-1]; blocks_per_row = math.ceil(weight.shape[-1] / block_size) (or
integer ceil via (N + block_size - 1)//block_size); expected_blocks = rows *
blocks_per_row, then return amax.numel() == expected_blocks and
global_amax.numel() == 1, allowing restored `_amax` that includes padded tail
blocks.

In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 88-99: The TP>1 guard is too broad because it triggers for any
fake static-block quantizer; change the check that builds offending to only
consider NVFP4 static-block quantizers by requiring both
leaf.is_static_block_quant and that the leaf reports the NVFP4 format (e.g.,
leaf.format == "NVFP4" or the project’s NVFP4 enum/attribute — replace with the
actual attribute used in your quantizer objects) when iterating over leaves (the
variables/functions involved: weight_quantizer, SequentialQuantizer, leaves,
is_static_block_quant, offending, tp_group.world_size()); keep the rest of the
logic and the NotImplementedError unchanged.

In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py`:
- Around line 45-65: The test is comparing config.json's quantization_config to
the raw HF wrapper (hf_quant_config_dict) instead of the converted serving
format; change the test to use the converted structure (call
convert_hf_quant_config_format on hf_quant_config_dict or otherwise use the same
transformation used when producing config_dict) before asserting and before
indexing fields like "quant_algo", "ignore", and "config_groups"; update
references in the verification block so quant_config_dict refers to the
converted result (not the original hf_quant_config_dict) and then perform the
existing assertions and kv_cache checks against that converted object.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 864054fb-7e5a-459d-9bc8-f15b0be42e2b

📥 Commits

Reviewing files that changed from the base of the PR and between 8f1529a and bd2e8e9.

📒 Files selected for processing (26)

CHANGELOG.rst
examples/specdec_bench/specdec_bench/datasets/speed.py
modelopt/torch/export/plugins/hf_checkpoint_utils.py
modelopt/torch/export/plugins/mcore_nemotron.py
modelopt/torch/export/quant_utils.py
modelopt/torch/export/unified_export_megatron.py
modelopt/torch/quantization/algorithms.py
modelopt/torch/quantization/backends/utils.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/conversion.py
modelopt/torch/quantization/model_calib.py
modelopt/torch/quantization/nn/modules/tensor_quantizer.py
modelopt/torch/quantization/plugins/custom.py
modelopt/torch/quantization/plugins/megatron.py
modelopt/torch/quantization/qtensor/nvfp4_tensor.py
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
tests/_test_utils/torch/quantization/quantize_common.py
tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py
tests/gpu_megatron/torch/export/test_unified_export_megatron.py
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
tests/unit/torch/export/test_hf_checkpoint_utils.py
tests/unit/torch/quantization/plugins/test_fused_experts.py
tests/unit/torch/quantization/test_autoquant.py
tests/unit/torch/quantization/test_mse_calibrator.py
tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py

coderabbitai · 2026-05-19T18:35:07Z

+    - quantizer_name: '*mixer.experts.*weight_quantizer'
+      enable: true
+      cfg:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+        num_bits: e2m1
+    - quantizer_name: '*mixer.experts.*input_quantizer'
+      enable: true
+      cfg:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+        num_bits: e2m1
+    # Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
+    - quantizer_name: '*mlp.experts*weight_quantizer'
+      enable: true
+      cfg:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+        num_bits: e2m1
+    - quantizer_name: '*mlp.experts*input_quantizer'
+      enable: true
+      cfg:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+        num_bits: e2m1


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep routed-expert weight blocks static in the max-calib variant.

This recipe says it differs from the MSE recipe by calibration method, but routed-expert weight quantizers are set to type: dynamic, which changes quantization behavior and undermines the max-vs-MSE comparison.

Proposed fix

     - quantizer_name: '*mixer.experts.*weight_quantizer'
       enable: true
       cfg:
         block_sizes:
-          type: dynamic
+          type: static
           scale_bits: e4m3
         num_bits: e2m1
@@
     - quantizer_name: '*mlp.experts*weight_quantizer'
       enable: true
       cfg:
         block_sizes:
-          type: dynamic
+          type: static
           scale_bits: e4m3
         num_bits: e2m1

coderabbitai · 2026-05-19T18:35:07Z

+# Calibration: weight MSE with FP8-scale sweep over the 128 e4m3 scale values
+# (NVFP4 weights use static block scales selected by MSE; FP8 per-tensor scales
+# are also chosen via MSE search instead of plain amax).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the calibration comment for FP8 layers.

The comment says FP8 per-tensor scales are selected via MSE search, but this stack skips MSE for non-NVFP4 formats. This is misleading for recipe users.

Proposed fix

-# (NVFP4 weights use static block scales selected by MSE; FP8 per-tensor scales
-# are also chosen via MSE search instead of plain amax).
+# (NVFP4 routed-expert weights use static block scales selected by MSE;
+# non-NVFP4 layers, such as FP8 per-tensor, follow the non-MSE path.)

coderabbitai · 2026-05-19T18:35:08Z

+            block_sizes = getattr(quantizer, "block_sizes", None)
+            block_size = block_sizes.get(-1) if isinstance(block_sizes, dict) else None
+            if block_size is None or weight.shape[-1] % block_size != 0:
+                return False
+            expected_blocks = weight.numel() // block_size
+            return amax.numel() == expected_blocks and global_amax.numel() == 1


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle padded trailing blocks when validating restored NVFP4 state.

Static block quantization already pads the tail block during setup, so a restored _amax can be complete even when weight.shape[-1] % block_size != 0. Returning False here forces max_calibrate() and overwrites the saved MSE-derived scales for those layers.

Suggested fix

             block_sizes = getattr(quantizer, "block_sizes", None)
             block_size = block_sizes.get(-1) if isinstance(block_sizes, dict) else None
-            if block_size is None or weight.shape[-1] % block_size != 0:
+            if block_size is None:
                 return False
-            expected_blocks = weight.numel() // block_size
+            rows = weight.numel() // weight.shape[-1]
+            expected_blocks = rows * ((weight.shape[-1] + block_size - 1) // block_size)
             return amax.numel() == expected_blocks and global_amax.numel() == 1
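
The arithmetic behind this suggestion can be checked without tensors. The sketch below compares the original block count with the padded, ceil-based count for a last dimension that is not a multiple of the block size; the function names and shapes are illustrative only.

```python
def floor_block_count(shape, block_size):
    """Original check: total elements divided by the block size (ignores tail padding)."""
    numel = 1
    for dim in shape:
        numel *= dim
    return numel // block_size


def padded_block_count(shape, block_size):
    """Suggested check: each row contributes ceil(last_dim / block_size) blocks."""
    rows = 1
    for dim in shape[:-1]:
        rows *= dim
    blocks_per_row = (shape[-1] + block_size - 1) // block_size  # integer ceil
    return rows * blocks_per_row


for shape in ((4, 32), (4, 40)):
    print(shape, floor_block_count(shape, 16), padded_block_count(shape, 16))
```

For a `(4, 40)` weight with block size 16, each row is padded to three blocks, so a complete restored amax holds 12 entries, while the floor-based count expects 10. The two counts agree only when the last dimension divides evenly.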

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/plugins/custom.py` around lines 148 - 153, The current check treats incomplete tail blocks as invalid; instead compute blocks per row as ceil(weight.shape[-1] / block_size) and total expected_blocks = (weight.numel() // weight.shape[-1]) * blocks_per_row so padded trailing blocks count toward the expected amax length. In the validation around quantizer.block_sizes / block_size, replace expected_blocks = weight.numel() // block_size with rows = weight.numel() // weight.shape[-1]; blocks_per_row = math.ceil(weight.shape[-1] / block_size) (or integer ceil via (N + block_size - 1)//block_size); expected_blocks = rows * blocks_per_row, then return amax.numel() == expected_blocks and global_amax.numel() == 1, allowing restored `_amax` that includes padded tail blocks.
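The difference between the two block counts can be checked with plain integer arithmetic. The following is a minimal sketch, not the code in `custom.py`: the helper names `floor_block_count` and `padded_block_count` and the example shapes are hypothetical, and shapes are passed as tuples so no tensor library is needed.

```python
import math


def floor_block_count(shape, block_size):
    """Block count as in the original check: total elements // block_size."""
    return math.prod(shape) // block_size


def padded_block_count(shape, block_size):
    """Block count when each row's tail block is padded up to block_size."""
    last_dim = shape[-1]
    rows = math.prod(shape) // last_dim
    blocks_per_row = (last_dim + block_size - 1) // block_size  # integer ceil
    return rows * blocks_per_row


# Hypothetical weight shape whose last dimension is not a multiple of 16.
shape, block_size = (4, 40), 16
print(floor_block_count(shape, block_size))   # 160 // 16
print(padded_block_count(shape, block_size))  # 4 rows * ceil(40 / 16)
```

For a last dimension that is already a multiple of the block size, both helpers agree, so the ceil-based count only changes the result for layers with a padded tail block.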

coderabbitai · 2026-05-19T18:35:08Z

+        leaves = (
+            list(weight_quantizer)
+            if isinstance(weight_quantizer, SequentialQuantizer)
+            else [weight_quantizer]
+        )
+        if any(leaf.is_static_block_quant for leaf in leaves):
+            offending.append((name, tp_group.world_size()))
+    if offending:
+        raise NotImplementedError(
+            "Static-block NVFP4 weight quantization (e.g. MSE) is not supported with TP > 1. Please re-run with TP=1. "
+            f"Offending modules (showing first 5 of {len(offending)}): {offending[:5]}"
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Narrow this TP guard to NVFP4-static weights only.

is_static_block_quant is true for every fake static-block format, not just NVFP4. With this predicate, TP>1 AWQ/INT4 block-quantized models now hit this new NotImplementedError, even though the message and PR scope are NVFP4/MSE-specific.

Suggested fix

-        if any(leaf.is_static_block_quant for leaf in leaves):
+        if any(leaf.is_nvfp4_static for leaf in leaves):
             offending.append((name, tp_group.world_size()))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/plugins/megatron.py` around lines 88 - 99, The TP>1 guard is too broad because it triggers for any fake static-block quantizer; change the check that builds offending to only consider NVFP4 static-block quantizers by requiring both leaf.is_static_block_quant and that the leaf reports the NVFP4 format (e.g., leaf.format == "NVFP4" or the project’s NVFP4 enum/attribute — replace with the actual attribute used in your quantizer objects) when iterating over leaves (the variables/functions involved: weight_quantizer, SequentialQuantizer, leaves, is_static_block_quant, offending, tp_group.world_size()); keep the rest of the logic and the NotImplementedError unchanged.
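The effect of narrowing the predicate can be illustrated with stand-in objects. This is a sketch under stated assumptions: `FakeQuantizer` and `find_offending` are hypothetical and are not ModelOpt classes or functions, and the two boolean flags simply mirror the attribute names used in the review comment.

```python
from dataclasses import dataclass


@dataclass
class FakeQuantizer:
    """Hypothetical stand-in for a weight quantizer leaf; not the ModelOpt class."""

    is_static_block_quant: bool
    is_nvfp4_static: bool


def find_offending(modules, tp_world_size, nvfp4_only=True):
    """Return (name, tp_world_size) for modules the TP > 1 guard would reject."""
    offending = []
    for name, leaves in modules.items():
        if nvfp4_only:
            hit = any(leaf.is_nvfp4_static for leaf in leaves)
        else:
            hit = any(leaf.is_static_block_quant for leaf in leaves)
        if hit:
            offending.append((name, tp_world_size))
    return offending


# Made-up module names and flag combinations, for illustration only.
modules = {
    "experts.up_proj": [FakeQuantizer(True, True)],   # NVFP4 static (e.g. MSE)
    "attn.qkv_proj": [FakeQuantizer(True, False)],    # INT4 block quantization
    "mamba.in_proj": [FakeQuantizer(False, False)],   # per-tensor FP8
}

broad = find_offending(modules, tp_world_size=2, nvfp4_only=False)
narrow = find_offending(modules, tp_world_size=2, nvfp4_only=True)
print(broad)
print(narrow)
```

With the broad predicate the INT4 block-quantized module is also reported, while the narrow predicate reports only the NVFP4 static module.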

coderabbitai · 2026-05-19T18:35:08Z

+    # Make sure config.json and hf_quant_config.json use the same serving config.
+    assert config_dict["quantization_config"] == hf_quant_config_dict

    # Verify config.json
    if kv_cache_quant_cfg:
        assert config_dict["quantization_config"]["kv_cache_scheme"]["num_bits"] == 8

    # Verify hf_quant_config.json
    if quant_config:
-        quant_config_dict = hf_quant_config_dict["quantization"]
+        quant_config_dict = hf_quant_config_dict
        quant_type = quant_config_dict["quant_algo"]
        assert (
            quant_type in quant_config
        )  # quant config str is subset of quant config e.g. NVFP4 -> NVFP4_DEFAULT_CFG
-        assert len(quant_config_dict["exclude_modules"]) > 1  # Dynamically added exclude modules
+        assert len(quant_config_dict["ignore"]) > 1  # Dynamically added exclude modules
        if quant_type == "NVFP4":
-            assert quant_config_dict["group_size"] == 16
+            assert quant_config_dict["config_groups"]["group_0"]["weights"]["group_size"] == 16

        if kv_cache_quant_cfg:
-            assert quant_config_dict["kv_cache_quant_algo"] == KV_CACHE_FP8
+            assert quant_config_dict["kv_cache_scheme"]["num_bits"] == 8



⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert against the converted serving config, not the raw HF wrapper.

hf_quant_config.json is still written as {"producer": ..., "quantization": ...}, while config.json["quantization_config"] gets the output of convert_hf_quant_config_format(...). This helper now compares unlike objects and then indexes quant_algo / ignore / config_groups at the wrong level, so the quantized export cases will fail or validate the wrong structure.

Suggested fix

+from modelopt.torch.export.convert_hf_config import convert_hf_quant_config_format
+
 def _verify_model_quant_config(
     export_dir: Path, quant_config: str | None = None, kv_cache_quant_cfg: str | None = None
 ):
     """Verify config.json and hf_quant_config.json"""
     config_dict = json.load(open(export_dir / "config.json"))
     hf_quant_config_dict = json.load(open(export_dir / "hf_quant_config.json"))
     # Make sure config.json and hf_quant_config.json use the same serving config.
-    assert config_dict["quantization_config"] == hf_quant_config_dict
+    assert config_dict["quantization_config"] == convert_hf_quant_config_format(
+        hf_quant_config_dict
+    )

     # Verify config.json
     if kv_cache_quant_cfg:
         assert config_dict["quantization_config"]["kv_cache_scheme"]["num_bits"] == 8

     # Verify hf_quant_config.json
     if quant_config:
-        quant_config_dict = hf_quant_config_dict
+        quant_config_dict = hf_quant_config_dict["quantization"]
         quant_type = quant_config_dict["quant_algo"]

As per coding guidelines, tests/**/*.py: Write focused unit tests during development and curate production tests to be lean, documenting expected behavior, protecting against regressions, and flagging backward-incompatible changes.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py` around lines 45 - 65, The test is comparing config.json's quantization_config to the raw HF wrapper (hf_quant_config_dict) instead of the converted serving format; change the test to use the converted structure (call convert_hf_quant_config_format on hf_quant_config_dict or otherwise use the same transformation used when producing config_dict) before asserting and before indexing fields like "quant_algo", "ignore", and "config_groups"; update references in the verification block so quant_config_dict refers to the converted result (not the original hf_quant_config_dict) and then perform the existing assertions and kv_cache checks against that converted object.
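The mismatch between the raw wrapper and the converted config can be reproduced with toy dictionaries. This is a rough sketch only: `toy_convert` is a hypothetical stand-in for `convert_hf_quant_config_format`, its field mapping is inferred from the assertions in the diff above and may not match the real converter, and all values in `hf_quant_config` are made up.

```python
def toy_convert(wrapper):
    """Hypothetical stand-in for convert_hf_quant_config_format; real mapping may differ."""
    q = wrapper["quantization"]
    converted = {
        "quant_algo": q["quant_algo"],
        "ignore": list(q["exclude_modules"]),
        "config_groups": {"group_0": {"weights": {"group_size": q["group_size"]}}},
    }
    if q.get("kv_cache_quant_algo") == "FP8":
        converted["kv_cache_scheme"] = {"num_bits": 8}
    return converted


# Made-up wrapper in the {"producer": ..., "quantization": ...} layout.
hf_quant_config = {
    "producer": {"name": "modelopt"},
    "quantization": {
        "quant_algo": "NVFP4",
        "exclude_modules": ["lm_head", "router"],
        "group_size": 16,
        "kv_cache_quant_algo": "FP8",
    },
}
config = {"quantization_config": toy_convert(hf_quant_config)}

# Comparing the raw wrapper against the converted config cannot match.
raw_matches = config["quantization_config"] == hf_quant_config
# Converting first compares like with like.
converted_matches = config["quantization_config"] == toy_convert(hf_quant_config)
print(raw_matches, converted_matches)
```

The same sketch shows why indexing `quant_algo` directly on the wrapper fails: in this layout the key lives under `"quantization"`, not at the top level.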

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

github-actions · 2026-05-21T00:50:30Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1521/
Built to branch `gh-pages` at 2026-05-21 00:50 UTC. Preview will be ready when the GitHub Pages deployment is complete.

jenchen13 and others added 5 commits May 14, 2026 13:08

support NVFP4 MSE and mixed precision in mcore; super nvfp4 recipe

211c6b3

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

minor fixes; check block quant moe completeness

8dff087

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

add mcore autoquant MOE rule

cc4a570

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

support EP in autoquant scoring

60ca49c

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

jenchen13 requested review from a team as code owners May 19, 2026 18:24

jenchen13 requested review from h-guo18, jingyu-ml and kaix-nv May 19, 2026 18:24

jenchen13 mentioned this pull request May 19, 2026

Support Mixed precision & Static MSE in MCore; Nemotron Super v3 NVFP4 recipe #1363

Closed

jenchen13 requested review from meenchen and realAsma May 19, 2026 18:27

coderabbitai Bot reviewed May 19, 2026


merge main in, lm head quant in mcore, w4a16 mcore

3e85e70

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

jenchen13 requested review from a team as code owners May 21, 2026 00:33

jenchen13 requested review from ajrasane and kevalmorabia97 May 21, 2026 00:33

jenchen13 added 2 commits May 20, 2026 17:46

merge main

5057ee0

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

dataset util for mcore quantize

6b1965a

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

kevalmorabia97 requested review from ChenhanYu and yeyu-nvidia and removed request for a team, ajrasane, jingyu-ml and kaix-nv May 21, 2026 11:58


