
Add nvfp4_omlp_only config and simplify the config.py #973

Merged
cjluo-nv merged 12 commits into main from chenjiel/fix_config on Mar 6, 2026

Conversation

@cjluo-nv
Collaborator

@cjluo-nv cjluo-nv commented Mar 4, 2026

What does this PR do?

Type of change: new feature

  1. Add nvfp4_omlp_only config (== nvfp4_mlp_only + o_proj quantization)
  2. Add block-sparse MoE support to the MLP-only config
  3. Simplify config.py
  4. Update the README in llm_ptq to mention these two configs for better accuracy

Usage

huggingface_script.sh ... --quant nvfp4_omlp_only
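What nvfp4_omlp_only means in config terms can be sketched in plain Python, based on the quantizer patterns shown in the diff later in this PR. This is an illustrative reconstruction, not the actual modelopt source; the variable names mirror the PR's helpers but are reproduced here only for explanation.

```python
# NVFP4 quantizer: FP4 (E2M1) values with FP8 (E4M3) dynamic scales
# over 16-element blocks, as in the patterns quoted in this PR's diff.
_nvfp4_quantizer = {
    "num_bits": (2, 1),
    "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
    "enable": True,
}

# MLP-only: quantize only layers matching *mlp* (this also covers MoE
# expert MLPs); attention projections stay unquantized.
_nvfp4_mlp_only_quant_cfg = {
    "*mlp*weight_quantizer": dict(_nvfp4_quantizer),
    "*mlp*input_quantizer": dict(_nvfp4_quantizer),
}

# OMLP-only = MLP-only plus the attention output projection (o_proj);
# the sensitive QKV projections remain unquantized.
nvfp4_omlp_only_quant_cfg = {
    **_nvfp4_mlp_only_quant_cfg,
    "*o_proj*weight_quantizer": dict(_nvfp4_quantizer),
    "*o_proj*input_quantizer": dict(_nvfp4_quantizer),
}

print(sorted(nvfp4_omlp_only_quant_cfg))
```

In the real config, this `quant_cfg` dict is paired with `"algorithm": "max"` and merged with the disabled-by-default quantizer entries.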

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, using torch.load(..., weights_only=True), avoiding pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other source, did you follow IP policy in CONTRIBUTING.md?: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added nvfp4_omlp_only quantization format for NVFP4, enabling selective quantization of MLP and output projection layers while preserving attention QKV projection accuracy.
  • Changed

    • pass_through_bwd now defaults to True; set to False if using STE with zeroed outlier gradients for better QAT accuracy.
  • Documentation

    • Updated post-training quantization guidance with NVFP4-specific configuration recommendations and usage examples.
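The `pass_through_bwd` change in the summary above can be illustrated with a toy straight-through-estimator backward. This is a hedged sketch with a made-up function name, not modelopt's actual implementation: it only shows the difference between passing the upstream gradient through unchanged and zeroing it for clipped outliers.

```python
def fake_quant_grad(x, grad_out, amax, pass_through_bwd=True):
    """Toy backward of a fake-quantize op with clipping range [-amax, amax].

    Hypothetical illustration: pass_through_bwd=True lets the gradient
    flow unchanged even for outliers; False is the classic clipped STE,
    which zeroes the gradient where |x| exceeds amax.
    """
    if pass_through_bwd:
        return grad_out  # new default: gradient passes through untouched
    return grad_out if abs(x) <= amax else 0.0  # STE with zeroed outliers

print(fake_quant_grad(5.0, 1.0, amax=4.0))                          # 1.0
print(fake_quant_grad(5.0, 1.0, amax=4.0, pass_through_bwd=False))  # 0.0
```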

cjluo-nv added 6 commits March 4, 2026 18:05
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested review from a team as code owners March 4, 2026 18:35
@cjluo-nv cjluo-nv requested a review from Edwardf0t1 March 4, 2026 18:35
@coderabbitai
Contributor

coderabbitai Bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

This pull request introduces a new NVFP4 quantization variant (nvfp4_omlp_only) that quantizes output projections alongside MLP layers, refactors the quantization config to reduce duplication via shared quantizer specifications, changes the default pass_through_bwd behavior to True, and updates documentation and examples to reflect these additions.

Changes

Cohort / File(s) — Summary

  • Core Quantization Configuration — modelopt/torch/quantization/config.py
    Introduces a shared `_nvfp4_quantizer` spec to eliminate duplication across NVFP4 configs. Adds a new helper `_nvfp4_mlp_only_quant_cfg` and public configs `NVFP4_MLP_ONLY_CFG` and `NVFP4_OMLP_ONLY_CFG`. Changes the `QuantizerAttributeConfig.pass_through_bwd` default from False to True. Consolidates multiple inline quantizer definitions.
  • Documentation and Changelog — CHANGELOG.rst, examples/llm_ptq/README.md
    Adds changelog entries for the nvfp4_omlp_only quantization format and the pass_through_bwd default change. Updates PTQ documentation with explicit NVFP4 high-accuracy guidance and configuration recommendations for the nvfp4_mlp_only and nvfp4_omlp_only variants.
  • Example Scripts and Configuration — examples/llm_ptq/hf_ptq.py, examples/llm_ptq/scripts/huggingface_example.sh
    Registers nvfp4_omlp_only in the `QUANT_CFG_CHOICES` dictionary and updates qformat validation lists. Adds nvfp4_omlp_only to the shell script's quantization-format loop and error messages.
  • Quantization Utility Updates — examples/llm_ptq/example_utils.py
    Removes conditional quantization disables previously applied to attention projections for specific model types (qwen3moe, qwen3next, minimax, deepseek), enabling full quantization coverage by default.
  • Test Coverage — tests/gpu/torch/export/test_fsdp2_export.py
    Adds `mtq.NVFP4_OMLP_ONLY_CFG` to test parameter lists for weight-update and fuse-layer export test scenarios.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 passed
  • Description Check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed. The title accurately summarizes the two main changes: adding the nvfp4_omlp_only config and simplifying config.py, which are the primary modifications across the changeset.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 83.33%, above the required threshold of 80.00%.
  • Security Anti-Patterns — ✅ Passed. The PR introduces NVFP4 quantization configuration refactoring with no critical security anti-patterns detected: no unsafe torch.load, numpy.load, eval/exec, hardcoded trust_remote_code, or nosec bypasses.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch chenjiel/fix_config

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
examples/llm_ptq/README.md (1)

72-72: Tighten phrasing in the NVFP4 accuracy note.

On Line 72, “only” is repeated in a way that reads awkwardly. Consider simplifying the sentence for clarity.

✍️ Suggested wording tweak
-> *For higher NVFP4 PTQ accuracy, we recommend using `mtq.NVFP4_MLP_ONLY_CFG` or `mtq.NVFP4_OMLP_ONLY_CFG` instead of `mtq.NVFP4_DEFAULT_CFG`. `NVFP4_MLP_ONLY_CFG` applies NVFP4 quantization only to MLP (and MoE) layers, leaving attention layers unquantized. `NVFP4_OMLP_ONLY_CFG` additionally quantizes the `o_proj` layer. Both preserve accuracy in the sensitive attention QKV projections while still providing significant compression.*
+> *For higher NVFP4 PTQ accuracy, we recommend using `mtq.NVFP4_MLP_ONLY_CFG` or `mtq.NVFP4_OMLP_ONLY_CFG` instead of `mtq.NVFP4_DEFAULT_CFG`. `NVFP4_MLP_ONLY_CFG` applies NVFP4 quantization to MLP (and MoE) layers, leaving attention layers unquantized. `NVFP4_OMLP_ONLY_CFG` additionally quantizes the `o_proj` layer. Both preserve accuracy in the sensitive attention QKV projections while still providing significant compression.*
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/README.md` at line 72, The sentence in the NVFP4 accuracy
note repeats "only" awkwardly; reword the sentence that describes
mtq.NVFP4_MLP_ONLY_CFG and mtq.NVFP4_OMLP_ONLY_CFG so it reads smoothly—for
example: "mtq.NVFP4_MLP_ONLY_CFG applies NVFP4 quantization to MLP (and MoE)
layers while leaving attention layers unquantized; mtq.NVFP4_OMLP_ONLY_CFG also
quantizes the o_proj layer." Update the README entry replacing the current
phrasing for mtq.NVFP4_MLP_ONLY_CFG and mtq.NVFP4_OMLP_ONLY_CFG (which compares
against mtq.NVFP4_DEFAULT_CFG) accordingly.
modelopt/torch/quantization/config.py (1)

399-407: Avoid sharing one quantizer dict across multiple keys. Use separate copies instead.

The _nvfp4_quantizer dict is currently referenced by both *weight_quantizer and *input_quantizer keys. While the current codebase doesn't perform in-place mutations on these configs, this pattern is inconsistent with other quantizer configs in the file (INT8_DEFAULT_CFG, FP8_DEFAULT_CFG, etc., which use inline separate dicts for each key). Using copy.deepcopy() prevents potential issues from future code modifications and aligns with the defensive copying pattern already used elsewhere in the codebase (e.g., in docstring examples with copy.deepcopy(mtq.INT4_AWQ_CFG)).

🔧 Suggested fix
+import copy
@@
 NVFP4_DEFAULT_CFG = {
     "quant_cfg": {
-        "*weight_quantizer": _nvfp4_quantizer,
-        "*input_quantizer": _nvfp4_quantizer,
+        "*weight_quantizer": copy.deepcopy(_nvfp4_quantizer),
+        "*input_quantizer": copy.deepcopy(_nvfp4_quantizer),
         **_default_disabled_quantizer_cfg,
     },
     "algorithm": "max",
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 399 - 407, The
_nvfp4_quantizer dict is being shared between NVFP4_DEFAULT_CFG's
"*weight_quantizer" and "*input_quantizer"; change the assignment so each key
gets its own copy (e.g., use copy.deepcopy(_nvfp4_quantizer) or construct two
separate dict literals) rather than referencing the same _nvfp4_quantizer
object, updating NVFP4_DEFAULT_CFG so "*weight_quantizer" and "*input_quantizer"
point to independent dicts to avoid accidental cross-mutation.
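The aliasing hazard this comment describes can be reproduced in a few lines of plain Python. The names below are illustrative, not modelopt's actual config objects; the point is only that two keys referencing the same dict see each other's in-place mutations, while `copy.deepcopy` gives each key an independent copy.

```python
import copy

shared = {"num_bits": (2, 1), "enable": True}

# Both keys reference the SAME dict object.
aliased_cfg = {"*weight_quantizer": shared, "*input_quantizer": shared}
aliased_cfg["*weight_quantizer"]["enable"] = False  # in-place mutation...
print(aliased_cfg["*input_quantizer"]["enable"])    # ...leaks to the other key: False

# With deepcopy, each key owns an independent dict (note: `shared` was
# already mutated above, so both copies start with enable=False).
safe_cfg = {
    "*weight_quantizer": copy.deepcopy(shared),
    "*input_quantizer": copy.deepcopy(shared),
}
safe_cfg["*weight_quantizer"]["enable"] = True
print(safe_cfg["*input_quantizer"]["enable"])       # unaffected: still False
```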
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Line 90: The new qformat key "nvfp4_omlp_only" (mapped to
mtq.NVFP4_OMLP_ONLY_CFG) is exposed but not included in the auto-quantize
allowlist, causing assertions when using --qformat with it; update the allowlist
used by the auto-quantize validation (the list/variable that enumerates
supported qformats) to include the string "nvfp4_omlp_only" so validation
recognizes mtq.NVFP4_OMLP_ONLY_CFG as allowed.

In `@modelopt/torch/quantization/config.py`:
- Around line 627-634: The new public config NVFP4_OMLP_ONLY_CFG is not included
in the choices collection, breaking name-based validation/discovery; update the
choices object (the variable named choices where other configs are listed) to
include an entry for "NVFP4_OMLP_ONLY_CFG" (or the appropriate key name used for
configs in that list) mapping to the NVFP4_OMLP_ONLY_CFG symbol so it is
discoverable/validatable alongside the other quant config entries.

---

Nitpick comments:
In `@examples/llm_ptq/README.md`:
- Line 72: The sentence in the NVFP4 accuracy note repeats "only" awkwardly;
reword the sentence that describes mtq.NVFP4_MLP_ONLY_CFG and
mtq.NVFP4_OMLP_ONLY_CFG so it reads smoothly—for example:
"mtq.NVFP4_MLP_ONLY_CFG applies NVFP4 quantization to MLP (and MoE) layers while
leaving attention layers unquantized; mtq.NVFP4_OMLP_ONLY_CFG also quantizes the
o_proj layer." Update the README entry replacing the current phrasing for
mtq.NVFP4_MLP_ONLY_CFG and mtq.NVFP4_OMLP_ONLY_CFG (which compares against
mtq.NVFP4_DEFAULT_CFG) accordingly.

In `@modelopt/torch/quantization/config.py`:
- Around line 399-407: The _nvfp4_quantizer dict is being shared between
NVFP4_DEFAULT_CFG's "*weight_quantizer" and "*input_quantizer"; change the
assignment so each key gets its own copy (e.g., use
copy.deepcopy(_nvfp4_quantizer) or construct two separate dict literals) rather
than referencing the same _nvfp4_quantizer object, updating NVFP4_DEFAULT_CFG so
"*weight_quantizer" and "*input_quantizer" point to independent dicts to avoid
accidental cross-mutation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8ac15cd9-402b-4461-a64d-a394e6582ff7

📥 Commits

Reviewing files that changed from the base of the PR and between e8f9687 and 9aa775a.

📒 Files selected for processing (6)
  • CHANGELOG.rst
  • examples/llm_ptq/README.md
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/quantization/config.py
💤 Files with no reviewable changes (1)
  • examples/llm_ptq/example_utils.py

Comment thread examples/llm_ptq/hf_ptq.py
Comment thread modelopt/torch/quantization/config.py
Contributor

Copilot AI left a comment


Pull request overview

This PR extends the NVFP4 PTQ configuration options by introducing an “MLP-only + attention o_proj” variant and refactors parts of the quantization config definitions to reduce duplication, alongside example-script and documentation updates to expose and recommend these formats.

Changes:

  • Added NVFP4_OMLP_ONLY_CFG (and CLI/script plumbing for nvfp4_omlp_only) to quantize MLP/MoE plus attention o_proj.
  • Refactored NVFP4-related quantizer sub-configs to reuse shared dicts and removed redundant fields (enable, axis) where defaults apply.
  • Updated LLM PTQ examples/README and changelog to document/recommend nvfp4_mlp_only / nvfp4_omlp_only.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Summary per file:
  • modelopt/torch/quantization/config.py — Adds NVFP4_OMLP_ONLY_CFG, refactors NVFP4 config construction, and changes the pass_through_bwd default.
  • examples/llm_ptq/scripts/huggingface_example.sh — Accepts nvfp4_omlp_only in CLI validation.
  • examples/llm_ptq/hf_ptq.py — Maps nvfp4_omlp_only to the new mtq config constant.
  • examples/llm_ptq/example_utils.py — Removes model-specific nvfp4 exclusions that previously disabled certain attention/MLA quantization paths.
  • examples/llm_ptq/README.md — Documents accuracy guidance and adds the new quant option to usage snippets.
  • CHANGELOG.rst — Notes the new nvfp4_omlp_only quantization format.


Comment thread modelopt/torch/quantization/config.py
Comment thread modelopt/torch/quantization/config.py
Comment on lines 622 to 634
NVFP4_MLP_ONLY_CFG = {
    "quant_cfg": _nvfp4_mlp_only_quant_cfg,
    "algorithm": "max",
}

NVFP4_OMLP_ONLY_CFG = {
    "quant_cfg": {
        "*mlp*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "pass_through_bwd": True,
        },
        "*mlp*input_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "pass_through_bwd": True,
        },
        **_default_disabled_quantizer_cfg,
        "*o_proj*weight_quantizer": _nvfp4_quantizer,
        "*o_proj*input_quantizer": _nvfp4_quantizer,
        **_nvfp4_mlp_only_quant_cfg,
    },
    "algorithm": "max",
}

Copilot AI Mar 4, 2026


New quantization format NVFP4_OMLP_ONLY_CFG is introduced here, but existing test parametrizations already cover other NVFP4 configs (e.g., NVFP4_MLP_ONLY_CFG). It would be good to add this new config to those existing test matrices to ensure export/quantization paths work for the new format as well.

Comment thread examples/llm_ptq/example_utils.py
Comment thread modelopt/torch/quantization/config.py
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 970-981: The default for pass_through_bwd was changed to True
which silently alters behavior for all configs; revert the ModeloptField default
back to the previous value (False) to preserve backward compatibility, add an
inline comment above the pass_through_bwd declaration noting that any change to
this default is breaking and must be accompanied by a migration/changelog entry,
and if you intend to make True the new default in a future release, implement an
explicit migration path or feature flag so existing users aren’t impacted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fe1fc81b-7dba-4a78-b75c-27c8096d2ffe

📥 Commits

Reviewing files that changed from the base of the PR and between 9aa775a and 34952a7.

📒 Files selected for processing (2)
  • CHANGELOG.rst
  • modelopt/torch/quantization/config.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • CHANGELOG.rst

Comment thread modelopt/torch/quantization/config.py
cjluo-nv added 2 commits March 4, 2026 21:34
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested a review from Copilot March 5, 2026 17:52
Comment thread modelopt/torch/quantization/conversion.py Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/llm_ptq/hf_ptq.py (1)

243-260: ⚠️ Potential issue | 🟠 Major

Mutating args.qformat in the assertion generator breaks downstream logic.

Line 259 uses for args.qformat in qformat_list, which overwrites the argparse namespace attribute during validation. After this loop completes, args.qformat becomes the last element from qformat_list (a single format string), not the original comma-separated value. This breaks lines 876 and 888, which call args.qformat.split(",") to check the format count—these checks will now always return length 1, regardless of whether multiple formats were originally provided.

Suggested fix
     assert all(
-        args.qformat
+        qformat
         in [
             "fp8",
             "int8_sq",
             "int8_wo",
             "int4_awq",
             "nvfp4",
             "nvfp4_awq",
             "w4a8_awq",
             "fp8_pb_wo",
             "w4a8_mxfp4_fp8",
             "nvfp4_mlp_only",
             "nvfp4_omlp_only",
             "mxfp8",
         ]
-        for args.qformat in qformat_list
+        for qformat in qformat_list
     ), "One or more quantization formats provided are not supported for unified checkpoint export"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 243 - 260, The assertion currently
mutates args.qformat by using "for args.qformat in qformat_list"; change the
generator to iterate with a new loop variable (e.g., "fmt" or "q") and validate
each fmt in qformat_list against the allowed formats so args.qformat is not
overwritten; update the assert in the block that references args.qformat and
qformat_list (the validation near the top of hf_ptq.py) to use the non-mutating
loop variable and keep the same error message.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 243-260: The assertion currently mutates args.qformat by using
"for args.qformat in qformat_list"; change the generator to iterate with a new
loop variable (e.g., "fmt" or "q") and validate each fmt in qformat_list against
the allowed formats so args.qformat is not overwritten; update the assert in the
block that references args.qformat and qformat_list (the validation near the top
of hf_ptq.py) to use the non-mutating loop variable and keep the same error
message.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b0ee13b-f0b7-4308-8cfa-6b3cda2459e5

📥 Commits

Reviewing files that changed from the base of the PR and between 98fdd3b and 5b943f9.

📒 Files selected for processing (3)
  • examples/llm_ptq/README.md
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/conversion.py

Contributor

@realAsma realAsma left a comment


Can we revert removing "enable": True, please? This behavior adds unwarranted complications.

If the objective of this PR is to reduce the length of config.py, that’s not the right goal.

We’re moving toward YAML-based recipes, which will address the recipe expansion issue.

Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@codecov

codecov Bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.13%. Comparing base (a4fde49) to head (db185e3).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #973   +/-   ##
=======================================
  Coverage   72.12%   72.13%           
=======================================
  Files         209      209           
  Lines       23628    23631    +3     
=======================================
+ Hits        17042    17046    +4     
+ Misses       6586     6585    -1     

☔ View full report in Codecov by Sentry.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv
Collaborator Author

cjluo-nv commented Mar 5, 2026

Can we revert removing "enable": True, please? This behavior adds unwarranted complications.

If the objective of this PR is to reduce the length of config.py, that’s not the right goal.

We’re moving toward YAML-based recipes, which will address the recipe expansion issue.

Done

Contributor

@realAsma realAsma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv enabled auto-merge (squash) March 5, 2026 20:34
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv merged commit 42482b1 into main Mar 6, 2026
52 of 54 checks passed
@cjluo-nv cjluo-nv deleted the chenjiel/fix_config branch March 6, 2026 01:03