Support Mixed precision & Static MSE in MCore; Nemotron Super v3 NVFP4 recipe #1521

Open
jenchen13 wants to merge 8 commits into main from feature/mcore_mse_mixed_precision

@jenchen13
Contributor

@jenchen13 jenchen13 commented May 19, 2026

What does this PR do?

Type of change: New recipe + Bug Fixes

MCore and MSE fixes

  • Support mixed-precision export in MCore by detecting mixed-precision layers in the HF quant config.
  • Restore static quantizers during MCore checkpoint restore as NVFP4QTensor rather than TensorQuantizer. TensorQuantizer can call max calibrate, which must be skipped for static quantizers during restore. This fixes a bug during MCore export for MSE.
  • Skip MSE calibration for any non-NVFP4 quantization format.
  • Fix dynamic block quantizer detection when block_sizes is dict-backed.
  • Add a YAML quantization recipe that roughly mirrors the Nemotron 3 Super NVFP4 hf_quant_config.json.

Export bug fixes

  • Copy .py files properly from the original HF checkpoint (for the reasoning parser, etc.).

Super recipe

Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 hf_quant_config.json:

  • MoE routed experts (mixer.experts..{up,down}_proj): NVFP4 W4A4 weight MSE, group_size 16
  • MoE shared experts (mixer.shared_experts.{up,down}_proj): FP8 per-tensor
  • Mamba mixer linears (mixer.{in,out}_proj): FP8 per-tensor
  • KV cache: FP8
  • Rest: not quantized

Usage

# Add a code snippet demonstrating how to use this

Testing

TODO test in HF and MCore PTQ on Nemotron model

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Added NVFP4 W4A16 weight-only quantization format support
    • FP8+NVFP4 mixed-precision export with per-layer quantization metadata tracking
    • New Nemotron-3-Super-120B-A12B PTQ recipe configurations
    • Enhanced Megatron-Core checkpoint restore and export for NVFP4 quantization
    • Offline Hugging Face Hub support for model exports
  • Bug Fixes

    • Fixed Megatron-Core expert parallel amax synchronization for routed expert weights

Review Change Stack

jenchen13 and others added 5 commits May 14, 2026 13:08
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
### What does this PR do?

Type of change: Bug fix

This PR enables `auto_quantize` for Megatron expert parallel MoE flows
by including the expert model parallel group when aggregating scores and
costs and when synchronizing selected recipes. It also derives the
search budget from the no-quant candidate costs in `candidate_stats`, so
sharded expert layers use global candidate costs instead of local module
weights.

### Usage

```python
model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 8.0},
    quantization_formats=[mtq.NVFP4_DEFAULT_CFG, mtq.FP8_DEFAULT_CFG],
    data_loader=data_loader,
    forward_step=forward_step,
)
```

### Testing

- Focused Megatron EP test from local log: `python -m pytest
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py::test_auto_quantize_moe_ep
-xvs` in NGC PyTorch 26.01 (`1 passed` in 134.37s).
- Added unit coverage for deriving the auto_quantize budget from
no-quant candidate costs.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
- Did you get Claude approval on this PR?: N/A

### Additional Information

Base branch: `jennifchen/super_nvfp4_recipe`.

Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: Jenny Chen <jennifchen@nvidia.com>
Co-authored-by: Jenny Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested review from a team as code owners May 19, 2026 18:24
@coderabbitai
Contributor

coderabbitai Bot commented May 19, 2026

Caution

Review failed

Failed to post review comments

📝 Walkthrough

This PR extends NVFP4 static-block quantization with calibration validation and state restoration; adds distributed expert-parallelism support to auto-quantize including format consensus across EP ranks; implements per-layer mixed-precision quantization metadata recording and export for Megatron-Core models targeting Hugging Face format; and introduces HF Hub offline mode support with two new Nemotron-3-Super-120B PTQ recipe configurations.

Changes

NVFP4 Static Quantizer and Expert Parallelism Integration

Layer / File(s) Summary
Block Quantization Detection and Backend Availability
modelopt/torch/quantization/nn/modules/tensor_quantizer.py, modelopt/torch/quantization/backends/utils.py
Refactors block quantization predicates with unified is_block_quant property; adds CUDA availability checks to fp8_compatible() and fp4_compatible() before device capability assertions.
NVFP4 Amax Validation and Static Quantizer Promotion
modelopt/torch/export/quant_utils.py, modelopt/torch/quantization/conversion.py
Detects invalid amax values (non-finite, negative) during NVFP4 calibration; introduces maybe_promote_nvfp4_static_quantizer() to upgrade TensorQuantizer to NVFP4StaticQuantizer during restore when saved state indicates NVFP4 static quantizer.
Quantizer State Restoration and Megatron Sharding
modelopt/torch/quantization/plugins/custom.py, modelopt/torch/quantization/plugins/megatron.py
Restores NVFP4 static quantizer state in custom and Megatron plugins; validates saved state completeness; adjusts Megatron sharding to treat replicated global_amax as non-sharded across column/row parallel dimensions.
MoE Calibration Validation and Expert Parallelism Amax Sync
modelopt/torch/quantization/model_calib.py, modelopt/torch/quantization/plugins/megatron.py
Validates TP compatibility for static-block NVFP4 quantizers; checks MoE calibration completeness across DP/EP/TP groups; conditionally gates EP amax synchronization for routed-expert weights via sync_expert_weight_amax flag; refactors MSE calibration bootstrap to target static-weight quantizers.
Expert Parallelism Support in AutoQuantize
modelopt/torch/quantization/algorithms.py
Extends auto-quantize distributed synchronization to aggregate scores and costs across expert_model_parallel_group; adds EP consistency check for quantization format selection; extends quantization grouping regexes for NemotronH MoE expert patterns; refactors total weight-size computation to use candidate stats instead of module introspection.
NVFP4 Weight Export and Scaling Factor Computation
modelopt/torch/quantization/qtensor/nvfp4_tensor.py
Introduces _get_static_global_amax() helper to resolve NVFP4 global_amax from either attribute name; computes weight scaling factors with FP8 overflow clamping for per-block scales.
Per-Layer Mixed-Precision Quantization Metadata Export
modelopt/torch/export/unified_export_megatron.py
Records per-layer quantization format and block-size metadata during all export mapping paths; gathers per-layer configs and KV-cache dtype from all ranks; builds HF quantization config with per-layer and KV-cache settings when mixed-precision metadata is present; falls back to simpler config when not available.
Hugging Face Hub Offline Mode and Remote Code Handling
modelopt/torch/export/plugins/hf_checkpoint_utils.py
Detects HF Hub offline mode via HF_HUB_OFFLINE environment variable; downloads Python sidecar files using snapshot_download with local_files_only when offline; handles missing offline cache with informative RuntimeError.
Megatron-Core Nemotron Export and PTQ Recipes
modelopt/torch/export/plugins/mcore_nemotron.py, modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml, modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
Updates core_attention mapping to specify key/value scale parameter names; adds two PTQ recipe configurations for Nemotron-3-Super-120B with mixed-precision NVFP4/FP8 quantization (MSE-based and max-calibration variants).
Dataset Utilities and Documentation
modelopt/torch/utils/dataset_utils.py, CHANGELOG.rst, examples/specdec_bench/specdec_bench/datasets/speed.py
Adds get_dataloader_from_dataset() helper for distributed dataset loading; updates CHANGELOG documenting NVFP4 checkpoint restore, mixed-precision export, PTQ recipes, expert parallelism, and HF offline support; reformats metadata comprehension.
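
The offline-mode row in the table above describes switching to cached files when HF_HUB_OFFLINE is set. As a rough illustration only (the PR's actual helper in hf_checkpoint_utils.py is not shown on this page), the sketch below reads the environment variable and derives the `local_files_only` flag that `huggingface_hub.snapshot_download` accepts. The function names and the set of truthy spellings are assumptions made for this example.

```python
import os

# Assumed truthy spellings for the environment variable (illustrative only).
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def hf_hub_offline(environ=None) -> bool:
    """Return True when HF_HUB_OFFLINE is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get("HF_HUB_OFFLINE", "").strip().upper() in _TRUE_VALUES


def snapshot_kwargs(environ=None) -> dict:
    """Keyword arguments one might forward to snapshot_download (hypothetical helper)."""
    return {
        "allow_patterns": ["*.py"],  # only the Python sidecar files
        "local_files_only": hf_hub_offline(environ),
    }


print(snapshot_kwargs({"HF_HUB_OFFLINE": "1"}))
print(snapshot_kwargs({}))
```

Passing the environment as a parameter keeps the check easy to unit test without mutating `os.environ`.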

Comprehensive Test Coverage

Layer / File(s) Summary
NVFP4 Static Export Round-Trip Tests
tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py, tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py
Adds extensive CPU tests validating NVFP4 static export under extreme per-block amax conditions (overflow/underflow/corner-cases), verifies finite scales and dequantized values, equivalence with dynamic export, and manual FP4 grid alignment; adds CUDA test for FP8 scale overflow clamping.
Hugging Face Hub Offline Mode Tests
tests/unit/torch/export/test_hf_checkpoint_utils.py
Tests offline mode detection and cached snapshot usage with local_files_only=True, plus error handling when offline cache is missing.
Mixed-Precision Export and Expert Parallelism Tests
tests/gpu_megatron/torch/export/test_unified_export_megatron.py, tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
Validates new HF quantization config format in export, adds QKV slicing exclude_modules record verification, and exercises auto-quantize under expert parallelism with EP consistency validation.
Auto-Quantize Infrastructure and Bootstrap Updates
tests/_test_utils/torch/quantization/quantize_common.py, tests/unit/torch/quantization/test_autoquant.py, tests/unit/torch/quantization/test_mse_calibrator.py, tests/unit/torch/quantization/plugins/test_fused_experts.py
Updates auto_quantize_helper signature for customizable calibration data and callbacks; adds budget cost calculation test using candidate stats; updates MSE calibrator test for sweep-style configuration; updates bootstrap test to use renamed static-weight quantizer bootstrap helper.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1518: Both PRs modify GPT-family TE fused-norm handling in modelopt/torch/export/unified_export_megatron.py with _get_fused_norm_weight(..., primary_key) refactoring and corresponding export path updates.
  • NVIDIA/Model-Optimizer#1405: Main PR's quantization search changes update QuantRecipe to use typed QuantizerCfgEntry objects, aligning with PR #1405's schematization of config entries in modelopt/torch/quantization/config.py.
  • NVIDIA/Model-Optimizer#1522: Both PRs extend export/quantization codepaths to recognize W4A16_NVFP4 and add DATASET_COMBOS dataset fan-out logic in modelopt/torch/utils/dataset_utils.py.

Suggested reviewers

  • kevalmorabia97
  • sychen52
  • meenchen
  • vishalpandya1990
  • cjluo-nv
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 67.69% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding mixed-precision export support in MCore, static MSE support, and a new Nemotron Super v3 NVFP4 quantization recipe. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | No security anti-patterns detected. No unsafe torch.load/numpy.load, hardcoded trust_remote_code, eval/exec on untrusted input, nosec comments, or non-permissive dependencies found. |

@jenchen13
Contributor Author

A continuation of #1363

@jenchen13 jenchen13 requested review from meenchen and realAsma May 19, 2026 18:27
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
modelopt/torch/export/unified_export_megatron.py (1)

818-828: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Treat QUANTIZATION_NONE as unquantized when building exclude_modules.

This branch only records excludes for qformat is None, but the same method immediately returns early on qformat == QUANTIZATION_NONE, and _qkv_slicing() already treats both values the same. As written, any normal module reported as QUANTIZATION_NONE will skip the HF ignore list even though it is still unquantized.

Suggested fix
-        if qformat is None and "norm" not in prefix:
+        if qformat in (None, QUANTIZATION_NONE) and "norm" not in prefix:
             self._record_excluded_module(prefix)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/export/unified_export_megatron.py` around lines 818 - 828, The
code currently only calls _record_excluded_module(prefix) when qformat is None,
but QUANTIZATION_NONE should be treated the same; update the branch in
unified_export_megatron.py (the block around qformat, QUANTIZATION_NONE,
_get_weight_bias, and _record_excluded_module) so that if qformat is None or
qformat == QUANTIZATION_NONE (and "norm" not in prefix) you record the module as
excluded before the early return; keep the existing early return for
QUANTIZATION_NONE but ensure the exclude is recorded first and keep
compatibility with _qkv_slicing behavior.
modelopt/torch/quantization/algorithms.py (1)

765-782: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Recompute the persisted score/cost after recipe synchronization.

After best_format is replaced by the DP/TP/EP-synchronized value, best_constraints and best_scores are still accumulated from the local solver choice. On ranks that did not originate the synchronized format, self.best["constraints"] / self.best["score"] can end up describing a different recipe than the one actually activated and checkpointed.

Suggested fix
         for name, best_hparam_recipe_info in best_recipe_info.items():
             # Solvers could give different solutions for the same layer across DP/TP/EP groups even though
             # the scores and costs are the same. Lets make sure the same recipe is selected across DP/TP/EP
             _ps = self.model.get_submodule(name.split(".quant_recipe")[0]).parallel_state
             best_format = DistributedProcessGroup.get_dist_syncd_obj(
                 best_hparam_recipe_info["format"],
                 [
                     _ps.data_parallel_group,
                     _ps.tensor_parallel_group,
                     _ps.expert_model_parallel_group,
                 ],
                 lambda a: a[0],
             )

             best_recipe[name] = best_format
-            get_hparam(self.model, name).active = best_format
-            best_constraints += best_hparam_recipe_info["costs"]
-            best_scores += best_hparam_recipe_info["scores"]
+            hparam = get_hparam(self.model, name)
+            hparam.active = best_format
+            best_constraints += hparam.get_cost(best_format)
+            best_scores += hparam.get_score(best_format)
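
To make the mismatch concrete, here is a toy sketch with plain dictionaries. It is not the solver's data model: the layer names, the candidate table, and the integer costs and scores are all hypothetical. It only shows that totals accumulated from the local choice can differ from totals for the synchronized recipe that actually gets activated.

```python
# Hypothetical per-layer candidates: format name -> (cost, score).
CANDIDATES = {
    "layer0": {"NVFP4": (4, 30), "FP8": (8, 10)},
    "layer1": {"NVFP4": (4, 20), "FP8": (8, 5)},
}


def totals(choice, table=CANDIDATES):
    """Sum cost and score for the format chosen on each layer."""
    cost = sum(table[name][fmt][0] for name, fmt in choice.items())
    score = sum(table[name][fmt][1] for name, fmt in choice.items())
    return cost, score


# What this rank's solver picked versus what synchronization activated.
local_choice = {"layer0": "NVFP4", "layer1": "FP8"}
synced_choice = {"layer0": "NVFP4", "layer1": "NVFP4"}

print("from local choice: ", totals(local_choice))
print("from synced choice:", totals(synced_choice))
```

Only the second pair describes the recipe that is active after synchronization, which is the quantity the persisted state should report.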
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/algorithms.py` around lines 765 - 782, The loop
currently accumulates best_constraints and best_scores from
best_hparam_recipe_info before replacing the local solver's format with the
DP/TP/EP-synchronized best_format; update the code so that after you set
best_recipe[name] = best_format and get_hparam(self.model, name).active =
best_format you recompute and add the costs and scores that correspond to the
actually activated best_format (not the original
best_hparam_recipe_info["format"]); locate the mapping of format->costs/scores
that the solver produced for the layer (referencing best_recipe_info,
best_hparam_recipe_info and get_hparam) and use that entry to increment
best_constraints and best_scores (and keep
self.best["constraints"]/self.best["score"] consistent with the activated
recipe).
🧹 Nitpick comments (1)
tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py (1)

32-42: ⚡ Quick win

Add one regression that uses only the restored _global_amax path.

The implementation change specifically supports static quantizers restored with _global_amax, but this helper only seeds global_amax, so the new restore path is still untested. A single round-trip case that sets _global_amax directly would keep the actual bugfix from regressing.

As per coding guidelines, tests/**/*.py: Write focused unit tests during development and curate production tests to be lean, documenting expected behavior, protecting against regressions, and flagging backward-incompatible changes.

Also applies to: 45-70
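
A possible shape for such a regression, sketched without the real quantizer classes: the walkthrough mentions a `_get_static_global_amax()` helper that resolves the value from either attribute name, but its signature and precedence are not shown on this page. The stand-in below (`resolve_global_amax`, a hypothetical name) uses `SimpleNamespace` objects in place of quantizers and assumes the public attribute is checked first.

```python
from types import SimpleNamespace


def resolve_global_amax(quantizer):
    """Return the global amax from whichever attribute holds it, else None.

    Assumption for this sketch: `global_amax` takes precedence over `_global_amax`.
    """
    for attr in ("global_amax", "_global_amax"):
        value = getattr(quantizer, attr, None)
        if value is not None:
            return value
    return None


calibrated = SimpleNamespace(global_amax=6.0)  # seeded the way the existing helper does
restored = SimpleNamespace(_global_amax=6.0)   # only the restored, private attribute
empty = SimpleNamespace()

print(resolve_global_amax(calibrated), resolve_global_amax(restored), resolve_global_amax(empty))
```

A real test would replace the stand-ins with an NVFP4StaticQuantizer and compare the exported scales for the two cases.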

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py` around lines
32 - 42, Add a focused unit test that exercises the restored _global_amax code
path: create an NVFP4StaticQuantizer via the existing helper
_make_static_quantizer (or directly instantiate NVFP4StaticQuantizer), set the
private attribute _global_amax (not global_amax) to a tensor value, perform the
export/import (or the same round‑trip flow used elsewhere in this test file) and
assert the quantizer restores using the _global_amax path (e.g., resulting
amax/global_amax behavior matches expected values). Ensure the test is small,
documents the expected behavior, and only validates the single round‑trip
regression scenario so the `_global_amax` restore remains covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 47-79: The routed-expert weight quantizers in this max-calib
recipe (entries with quantizer_name: '*mixer.experts.*weight_quantizer' and
'*mlp.experts*weight_quantizer') are set to type: dynamic but must be static for
a fair max-vs-MSE comparison; update those two quantizer blocks to use type:
static (leave the corresponding input_quantizer blocks as-is) so only the weight
quantizers for routed experts switch from dynamic to static.

In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 30-32: The calibration comment is misleading about FP8 scale
selection: update the comment near the calibration block that mentions "FP8
per-tensor scales" and "NVFP4 weights" (the lines describing MSE searches) to
explicitly state that only NVFP4 weight block scales are selected via MSE while
non-NVFP4 FP8 formats skip MSE and use the stack's default scaling method; edit
the text to clarify that FP8 per-tensor scales for non-NVFP4 are not
MSE-searched to avoid confusion for recipe users.

In `@modelopt/torch/quantization/plugins/custom.py`:
- Around line 148-153: The current check treats incomplete tail blocks as
invalid; instead compute blocks per row as ceil(weight.shape[-1] / block_size)
and total expected_blocks = (weight.numel() // weight.shape[-1]) *
blocks_per_row so padded trailing blocks count toward the expected amax length.
In the validation around quantizer.block_sizes / block_size, replace
expected_blocks = weight.numel() // block_size with rows = weight.numel() //
weight.shape[-1]; blocks_per_row = math.ceil(weight.shape[-1] / block_size) (or
integer ceil via (N + block_size - 1)//block_size); expected_blocks = rows *
blocks_per_row, then return amax.numel() == expected_blocks and
global_amax.numel() == 1, allowing restored `_amax` that includes padded tail
blocks.

In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 88-99: The TP>1 guard is too broad because it triggers for any
fake static-block quantizer; change the check that builds offending to only
consider NVFP4 static-block quantizers by requiring both
leaf.is_static_block_quant and that the leaf reports the NVFP4 format (e.g.,
leaf.format == "NVFP4" or the project’s NVFP4 enum/attribute — replace with the
actual attribute used in your quantizer objects) when iterating over leaves (the
variables/functions involved: weight_quantizer, SequentialQuantizer, leaves,
is_static_block_quant, offending, tp_group.world_size()); keep the rest of the
logic and the NotImplementedError unchanged.

In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py`:
- Around line 45-65: The test is comparing config.json's quantization_config to
the raw HF wrapper (hf_quant_config_dict) instead of the converted serving
format; change the test to use the converted structure (call
convert_hf_quant_config_format on hf_quant_config_dict or otherwise use the same
transformation used when producing config_dict) before asserting and before
indexing fields like "quant_algo", "ignore", and "config_groups"; update
references in the verification block so quant_config_dict refers to the
converted result (not the original hf_quant_config_dict) and then perform the
existing assertions and kv_cache checks against that converted object.

---

Outside diff comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 818-828: The code currently only calls
_record_excluded_module(prefix) when qformat is None, but QUANTIZATION_NONE
should be treated the same; update the branch in unified_export_megatron.py (the
block around qformat, QUANTIZATION_NONE, _get_weight_bias, and
_record_excluded_module) so that if qformat is None or qformat ==
QUANTIZATION_NONE (and "norm" not in prefix) you record the module as excluded
before the early return; keep the existing early return for QUANTIZATION_NONE
but ensure the exclude is recorded first and keep compatibility with
_qkv_slicing behavior.

In `@modelopt/torch/quantization/algorithms.py`:
- Around line 765-782: The loop currently accumulates best_constraints and
best_scores from best_hparam_recipe_info before replacing the local solver's
format with the DP/TP/EP-synchronized best_format; update the code so that after
you set best_recipe[name] = best_format and get_hparam(self.model, name).active
= best_format you recompute and add the costs and scores that correspond to the
actually activated best_format (not the original
best_hparam_recipe_info["format"]); locate the mapping of format->costs/scores
that the solver produced for the layer (referencing best_recipe_info,
best_hparam_recipe_info and get_hparam) and use that entry to increment
best_constraints and best_scores (and keep
self.best["constraints"]/self.best["score"] consistent with the activated
recipe).

---

Nitpick comments:
In `@tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py`:
- Around line 32-42: Add a focused unit test that exercises the restored
_global_amax code path: create an NVFP4StaticQuantizer via the existing helper
_make_static_quantizer (or directly instantiate NVFP4StaticQuantizer), set the
private attribute _global_amax (not global_amax) to a tensor value, perform the
export/import (or the same round‑trip flow used elsewhere in this test file) and
assert the quantizer restores using the _global_amax path (e.g., resulting
amax/global_amax behavior matches expected values). Ensure the test is small,
documents the expected behavior, and only validates the single round‑trip
regression scenario so the `_global_amax` restore remains covered.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 864054fb-7e5a-459d-9bc8-f15b0be42e2b

📥 Commits

Reviewing files that changed from the base of the PR and between 8f1529a and bd2e8e9.

📒 Files selected for processing (26)
  • CHANGELOG.rst
  • examples/specdec_bench/specdec_bench/datasets/speed.py
  • modelopt/torch/export/plugins/hf_checkpoint_utils.py
  • modelopt/torch/export/plugins/mcore_nemotron.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_megatron.py
  • modelopt/torch/quantization/algorithms.py
  • modelopt/torch/quantization/backends/utils.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/conversion.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/nn/modules/tensor_quantizer.py
  • modelopt/torch/quantization/plugins/custom.py
  • modelopt/torch/quantization/plugins/megatron.py
  • modelopt/torch/quantization/qtensor/nvfp4_tensor.py
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
  • tests/_test_utils/torch/quantization/quantize_common.py
  • tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py
  • tests/gpu_megatron/torch/export/test_unified_export_megatron.py
  • tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
  • tests/unit/torch/export/test_hf_checkpoint_utils.py
  • tests/unit/torch/quantization/plugins/test_fused_experts.py
  • tests/unit/torch/quantization/test_autoquant.py
  • tests/unit/torch/quantization/test_mse_calibrator.py
  • tests/unit/torch/quantization/test_nvfp4_static_export_cpu.py

Comment on lines +47 to +79
    - quantizer_name: '*mixer.experts.*weight_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
    - quantizer_name: '*mixer.experts.*input_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
    # Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
    - quantizer_name: '*mlp.experts*weight_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
    - quantizer_name: '*mlp.experts*input_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep routed-expert weight blocks static in the max-calib variant.

This recipe says it differs from the MSE recipe by calibration method, but routed-expert weight quantizers are set to type: dynamic, which changes quantization behavior and undermines the max-vs-MSE comparison.

Proposed fix
     - quantizer_name: '*mixer.experts.*weight_quantizer'
       enable: true
       cfg:
         block_sizes:
-          type: dynamic
+          type: static
           scale_bits: e4m3
         num_bits: e2m1
@@
     - quantizer_name: '*mlp.experts*weight_quantizer'
       enable: true
       cfg:
         block_sizes:
-          type: dynamic
+          type: static
           scale_bits: e4m3
         num_bits: e2m1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`
around lines 47 - 79, The routed-expert weight quantizers in this max-calib
recipe (entries with quantizer_name: '*mixer.experts.*weight_quantizer' and
'*mlp.experts*weight_quantizer') are set to type: dynamic but must be static for
a fair max-vs-MSE comparison; update those two quantizer blocks to use type:
static (leave the corresponding input_quantizer blocks as-is) so only the weight
quantizers for routed experts switch from dynamic to static.
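
One way to check a recipe for this kind of drift is to match quantizer names against the glob patterns and flag routed-expert weight quantizers whose block type is not static. The sketch below is illustrative only: it assumes glob-style matching comparable to Python's `fnmatch` with the last matching entry winning, and the module names are hypothetical examples modeled on the patterns in the excerpt.

```python
from fnmatch import fnmatchcase

# (pattern, block type) pairs copied from the max-calib excerpt under review.
RECIPE = [
    ("*mixer.experts.*weight_quantizer", "dynamic"),
    ("*mixer.experts.*input_quantizer", "dynamic"),
    ("*mlp.experts*weight_quantizer", "dynamic"),
    ("*mlp.experts*input_quantizer", "dynamic"),
]

# Hypothetical quantizer names, for illustration only.
NAMES = [
    "backbone.layers.1.mixer.experts.0.up_proj.weight_quantizer",
    "backbone.layers.1.mixer.shared_experts.up_proj.weight_quantizer",
    "decoder.layers.0.mlp.experts.local_experts.0.linear_fc1.weight_quantizer",
    "decoder.layers.0.mlp.experts.local_experts.0.linear_fc1.input_quantizer",
]


def block_type_for(name, recipe):
    """Block type of the last matching entry (assumed precedence), or None."""
    matched = None
    for pattern, block_type in recipe:
        if fnmatchcase(name, pattern):
            matched = block_type
    return matched


def non_static_weight_quantizers(names, recipe):
    """Weight quantizers matched by the recipe whose block type is not static."""
    return [
        n for n in names
        if n.endswith("weight_quantizer")
        and block_type_for(n, recipe) not in (None, "static")
    ]


print(non_static_weight_quantizers(NAMES, RECIPE))
```

With the proposed fix applied (the two weight entries switched to static), the same check returns an empty list while the input quantizers stay dynamic.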

Comment on lines +30 to +32
# Calibration: weight MSE with FP8-scale sweep over the 128 e4m3 scale values
# (NVFP4 weights use static block scales selected by MSE; FP8 per-tensor scales
# are also chosen via MSE search instead of plain amax).

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the calibration comment for FP8 layers.

The comment says FP8 per-tensor scales are selected via MSE search, but this stack skips MSE for non-NVFP4 formats. This is misleading for recipe users.

Proposed fix
-# (NVFP4 weights use static block scales selected by MSE; FP8 per-tensor scales
-# are also chosen via MSE search instead of plain amax).
+# (NVFP4 routed-expert weights use static block scales selected by MSE;
+# non-NVFP4 layers, such as FP8 per-tensor, follow the non-MSE path.)
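
For readers unfamiliar with the distinction, the sketch below contrasts a max-based block scale with an MSE-selected one on the FP4 E2M1 value grid. It is a simplified illustration, not the calibrator in this stack: it sweeps a short, hypothetical list of multipliers instead of the 128 e4m3 scale values mentioned in the recipe comment, keeps the scale in full precision, and uses made-up weights.

```python
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 magnitudes


def fake_quant(block, scale):
    """Quantize-dequantize a block onto the FP4 grid with one shared scale."""
    out = []
    for x in block:
        mag = min(abs(x) / scale, FP4_GRID[-1])  # clamp to the largest grid value
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        out.append((q if x >= 0 else -q) * scale)
    return out


def mse(block, scale):
    """Mean squared error between a block and its fake-quantized version."""
    deq = fake_quant(block, scale)
    return sum((x - y) ** 2 for x, y in zip(block, deq)) / len(block)


def max_scale(block):
    """Max calibration: map the block's amax to the top of the grid."""
    amax = max(abs(x) for x in block)
    return amax / FP4_GRID[-1] or 1.0  # fall back to 1.0 for an all-zero block


def mse_scale(block, multipliers=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """MSE calibration over a small, hypothetical candidate list."""
    base = max_scale(block)
    return min((base * m for m in multipliers), key=lambda s: mse(block, s))


# Made-up block of 16 weights with one outlier.
block = [0.11, -0.07, 0.02, 0.31, -0.25, 0.05, 0.09, -0.12,
         0.04, 0.18, -0.03, 0.07, -0.21, 0.01, 0.15, 1.20]
print("max scale error:", mse(block, max_scale(block)))
print("mse scale error:", mse(block, mse_scale(block)))
```

Because the candidate list includes the max-based scale itself (multiplier 1.0), the MSE-selected scale can never do worse than max calibration on the block it was fitted to.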
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around
lines 30 - 32, The calibration comment is misleading about FP8 scale selection:
update the comment near the calibration block that mentions "FP8 per-tensor
scales" and "NVFP4 weights" (the lines describing MSE searches) to explicitly
state that only NVFP4 weight block scales are selected via MSE while non-NVFP4
FP8 formats skip MSE and use the stack's default scaling method; edit the text
to clarify that FP8 per-tensor scales for non-NVFP4 are not MSE-searched to
avoid confusion for recipe users.

Comment on lines +148 to +153
block_sizes = getattr(quantizer, "block_sizes", None)
block_size = block_sizes.get(-1) if isinstance(block_sizes, dict) else None
if block_size is None or weight.shape[-1] % block_size != 0:
    return False
expected_blocks = weight.numel() // block_size
return amax.numel() == expected_blocks and global_amax.numel() == 1

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle padded trailing blocks when validating restored NVFP4 state.

Static block quantization already pads the tail block during setup, so a restored _amax can be complete even when weight.shape[-1] % block_size != 0. Returning False here forces max_calibrate() and overwrites the saved MSE-derived scales for those layers.

Suggested fix
             block_sizes = getattr(quantizer, "block_sizes", None)
             block_size = block_sizes.get(-1) if isinstance(block_sizes, dict) else None
-            if block_size is None or weight.shape[-1] % block_size != 0:
+            if block_size is None:
                 return False
-            expected_blocks = weight.numel() // block_size
+            rows = weight.numel() // weight.shape[-1]
+            expected_blocks = rows * ((weight.shape[-1] + block_size - 1) // block_size)
             return amax.numel() == expected_blocks and global_amax.numel() == 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/plugins/custom.py` around lines 148 - 153, The
current check treats incomplete tail blocks as invalid; instead compute blocks
per row as ceil(weight.shape[-1] / block_size) and total expected_blocks =
(weight.numel() // weight.shape[-1]) * blocks_per_row so padded trailing blocks
count toward the expected amax length. In the validation around
quantizer.block_sizes / block_size, replace expected_blocks = weight.numel() //
block_size with rows = weight.numel() // weight.shape[-1]; blocks_per_row =
math.ceil(weight.shape[-1] / block_size) (or integer ceil via (N + block_size -
1)//block_size); expected_blocks = rows * blocks_per_row, then return
amax.numel() == expected_blocks and global_amax.numel() == 1, allowing restored
`_amax` that includes padded tail blocks.
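
The ceil-based count suggested above can be sketched in plain Python. This is a minimal illustration, assuming padding is applied per row along the last dimension as the comment describes; `expected_amax_blocks` is a hypothetical helper name, not a function in the repository.

```python
def expected_amax_blocks(weight_shape, block_size):
    """Count per-block amax entries when each row's tail block is padded."""
    last_dim = weight_shape[-1]
    rows = 1
    for dim in weight_shape[:-1]:
        rows *= dim  # all leading dimensions flatten into rows
    # Integer ceil: a partial tail block still needs its own amax entry.
    blocks_per_row = (last_dim + block_size - 1) // block_size
    return rows * blocks_per_row


# Last dim divisible by the block size: matches numel // block_size.
divisible = expected_amax_blocks((4, 32), 16)
# Last dim not divisible: the padded tail block is counted once per row.
padded = expected_amax_blocks((4, 40), 16)
print(divisible, padded)
```

For the `(4, 40)` case the floor-based formula `weight.numel() // block_size` gives 10, while the padded layout holds 12 entries, which is the mismatch the comment warns about.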

Comment on lines +88 to +99
    leaves = (
        list(weight_quantizer)
        if isinstance(weight_quantizer, SequentialQuantizer)
        else [weight_quantizer]
    )
    if any(leaf.is_static_block_quant for leaf in leaves):
        offending.append((name, tp_group.world_size()))
if offending:
    raise NotImplementedError(
        "Static-block NVFP4 weight quantization (e.g. MSE) is not supported with TP > 1. Please re-run with TP=1. "
        f"Offending modules (showing first 5 of {len(offending)}): {offending[:5]}"
    )

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Narrow this TP guard to NVFP4-static weights only.

is_static_block_quant is true for every fake static-block format, not just NVFP4. With this predicate, TP>1 AWQ/INT4 block-quantized models now hit this new NotImplementedError, even though the message and PR scope are NVFP4/MSE-specific.

Suggested fix
-        if any(leaf.is_static_block_quant for leaf in leaves):
+        if any(leaf.is_nvfp4_static for leaf in leaves):
             offending.append((name, tp_group.world_size()))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/plugins/megatron.py` around lines 88 - 99, The
TP>1 guard is too broad because it triggers for any fake static-block quantizer;
change the check that builds offending to only consider NVFP4 static-block
quantizers by requiring both leaf.is_static_block_quant and that the leaf
reports the NVFP4 format (e.g., leaf.format == "NVFP4" or the project’s NVFP4
enum/attribute — replace with the actual attribute used in your quantizer
objects) when iterating over leaves (the variables/functions involved:
weight_quantizer, SequentialQuantizer, leaves, is_static_block_quant, offending,
tp_group.world_size()); keep the rest of the logic and the NotImplementedError
unchanged.
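
The effect of the narrower predicate can be sketched with stub objects. This is an illustration only: `StubQuantizer` and `find_offending` are hypothetical stand-ins, the `is_nvfp4_static` attribute is taken from the suggested fix and not verified against the real quantizer class, and the explicit `tp_world_size <= 1` early return is an assumption about where the TP check lives.

```python
from dataclasses import dataclass


@dataclass
class StubQuantizer:
    """Hypothetical stand-in for a weight quantizer leaf (not the ModelOpt class)."""

    is_static_block_quant: bool
    is_nvfp4_static: bool


def find_offending(named_leaves, tp_world_size):
    """Collect (name, tp_size) for modules that have an NVFP4-static leaf."""
    if tp_world_size <= 1:
        return []  # assumption: the guard only matters with TP > 1
    return [
        (name, tp_world_size)
        for name, leaves in named_leaves.items()
        if any(leaf.is_nvfp4_static for leaf in leaves)
    ]


modules = {
    # Static-block NVFP4 weight quantizer (e.g. MSE-calibrated).
    "experts.up_proj": [StubQuantizer(True, True)],
    # Static-block but not NVFP4 (e.g. an INT4 block format).
    "attn.qkv": [StubQuantizer(True, False)],
}
offending = find_offending(modules, tp_world_size=2)
print(offending)
```

With the broader `is_static_block_quant` predicate both modules would be flagged; with the narrower one only the NVFP4-static module is.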

Comment on lines +45 to 65
# Make sure config.json and hf_quant_config.json use the same serving config.
assert config_dict["quantization_config"] == hf_quant_config_dict

# Verify config.json
if kv_cache_quant_cfg:
    assert config_dict["quantization_config"]["kv_cache_scheme"]["num_bits"] == 8

# Verify hf_quant_config.json
if quant_config:
    quant_config_dict = hf_quant_config_dict["quantization"]
    quant_config_dict = hf_quant_config_dict
    quant_type = quant_config_dict["quant_algo"]
    assert (
        quant_type in quant_config
    )  # quant config str is subset of quant config e.g. NVFP4 -> NVFP4_DEFAULT_CFG
    assert len(quant_config_dict["exclude_modules"]) > 1  # Dynamically added exclude modules
    assert len(quant_config_dict["ignore"]) > 1  # Dynamically added exclude modules
    if quant_type == "NVFP4":
        assert quant_config_dict["group_size"] == 16
        assert quant_config_dict["config_groups"]["group_0"]["weights"]["group_size"] == 16

    if kv_cache_quant_cfg:
        assert quant_config_dict["kv_cache_quant_algo"] == KV_CACHE_FP8
        assert quant_config_dict["kv_cache_scheme"]["num_bits"] == 8

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert against the converted serving config, not the raw HF wrapper.

hf_quant_config.json is still written as {"producer": ..., "quantization": ...}, while config.json["quantization_config"] gets the output of convert_hf_quant_config_format(...). This helper now compares unlike objects and then indexes quant_algo / ignore / config_groups at the wrong level, so the quantized export cases will fail or validate the wrong structure.

Suggested fix
+from modelopt.torch.export.convert_hf_config import convert_hf_quant_config_format
+
 def _verify_model_quant_config(
     export_dir: Path, quant_config: str | None = None, kv_cache_quant_cfg: str | None = None
 ):
     """Verify config.json and hf_quant_config.json"""
     config_dict = json.load(open(export_dir / "config.json"))
     hf_quant_config_dict = json.load(open(export_dir / "hf_quant_config.json"))
     # Make sure config.json and hf_quant_config.json use the same serving config.
-    assert config_dict["quantization_config"] == hf_quant_config_dict
+    assert config_dict["quantization_config"] == convert_hf_quant_config_format(
+        hf_quant_config_dict
+    )

     # Verify config.json
     if kv_cache_quant_cfg:
         assert config_dict["quantization_config"]["kv_cache_scheme"]["num_bits"] == 8

     # Verify hf_quant_config.json
     if quant_config:
-        quant_config_dict = hf_quant_config_dict
+        quant_config_dict = hf_quant_config_dict["quantization"]
         quant_type = quant_config_dict["quant_algo"]

As per coding guidelines, tests/**/*.py: Write focused unit tests during development and curate production tests to be lean, documenting expected behavior, protecting against regressions, and flagging backward-incompatible changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py` around lines
45 - 65, The test is comparing config.json's quantization_config to the raw HF
wrapper (hf_quant_config_dict) instead of the converted serving format; change
the test to use the converted structure (call convert_hf_quant_config_format on
hf_quant_config_dict or otherwise use the same transformation used when
producing config_dict) before asserting and before indexing fields like
"quant_algo", "ignore", and "config_groups"; update references in the
verification block so quant_config_dict refers to the converted result (not the
original hf_quant_config_dict) and then perform the existing assertions and
kv_cache checks against that converted object.
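
The level mismatch can be seen with a toy dictionary. This sketch only mirrors the `{"producer": ..., "quantization": ...}` wrapper shape named in the comment; the field values are made up, and `read_quant_algo` is a hypothetical helper, not part of the test file.

```python
# Illustrative wrapper only; the producer and quantization values are invented.
hf_quant_config = {
    "producer": {"name": "example-producer"},
    "quantization": {"quant_algo": "NVFP4"},
}


def read_quant_algo(cfg):
    """Return quant_algo, or None when the key is absent at this level."""
    return cfg.get("quant_algo")


# Reading from the wrapper itself misses the key, because it is nested.
top_level = read_quant_algo(hf_quant_config)
# Reading from the "quantization" entry finds it.
nested = read_quant_algo(hf_quant_config["quantization"])
print(top_level, nested)
```

In the real test a plain `quant_config_dict["quant_algo"]` lookup on the wrapper would raise `KeyError` instead of returning `None`.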

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested review from a team as code owners May 21, 2026 00:33
jenchen13 added 2 commits May 20, 2026 17:46
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@github-actions

PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1521/

Built to branch gh-pages at 2026-05-21 00:50 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@kevalmorabia97 kevalmorabia97 requested review from ChenhanYu and yeyu-nvidia and removed request for a team, ajrasane, jingyu-ml and kaix-nv May 21, 2026 11:58
2 participants