Add nvfp4_omlp_only config and simplify config.py (#973)
Conversation
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
> Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid overwhelming you with comments as new commits arrive.
📝 Walkthrough

This pull request introduces a new NVFP4 quantization variant (nvfp4_omlp_only) that quantizes output projections alongside MLP layers, refactors the quantization config to reduce duplication via shared quantizer specifications, changes the default pass_through_bwd behavior to True, and updates documentation and examples to reflect these additions.
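The quant_cfg mappings this PR touches key quantizer settings by wildcard patterns such as `*mlp*weight_quantizer` and `*o_proj*weight_quantizer`. A sketch of how glob-style patterns can select quantizers by name (illustrative, using Python's stdlib `fnmatch`; the module names below are hypothetical and ModelOpt's actual matching logic may differ):

```python
from fnmatch import fnmatch

# Hypothetical quantizer names, modeled on the module paths seen in this PR.
quantizer_names = [
    "model.layers.0.mlp.up_proj.weight_quantizer",
    "model.layers.0.mlp.up_proj.input_quantizer",
    "model.layers.0.self_attn.q_proj.weight_quantizer",
    "model.layers.0.self_attn.o_proj.weight_quantizer",
]

def matches(pattern: str, name: str) -> bool:
    """Glob-style match, as used by the wildcard keys of a quant_cfg dict."""
    return fnmatch(name, pattern)

# "*mlp*weight_quantizer" selects only MLP weight quantizers;
# "*o_proj*" additionally picks up the attention output projection.
mlp_weight = [n for n in quantizer_names if matches("*mlp*weight_quantizer", n)]
o_proj = [n for n in quantizer_names if matches("*o_proj*", n)]
```

With this selection scheme, an MLP-only config leaves `q_proj` untouched, and the OMLP variant extends coverage only to `o_proj`.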
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 2
🧹 Nitpick comments (2)
examples/llm_ptq/README.md (1)
Line 72: Tighten phrasing in the NVFP4 accuracy note. On line 72, "only" is repeated in a way that reads awkwardly. Consider simplifying the sentence for clarity.
✍️ Suggested wording tweak

```diff
-> *For higher NVFP4 PTQ accuracy, we recommend using `mtq.NVFP4_MLP_ONLY_CFG` or `mtq.NVFP4_OMLP_ONLY_CFG` instead of `mtq.NVFP4_DEFAULT_CFG`. `NVFP4_MLP_ONLY_CFG` applies NVFP4 quantization only to MLP (and MoE) layers, leaving attention layers unquantized. `NVFP4_OMLP_ONLY_CFG` additionally quantizes the `o_proj` layer. Both preserve accuracy in the sensitive attention QKV projections while still providing significant compression.*
+> *For higher NVFP4 PTQ accuracy, we recommend using `mtq.NVFP4_MLP_ONLY_CFG` or `mtq.NVFP4_OMLP_ONLY_CFG` instead of `mtq.NVFP4_DEFAULT_CFG`. `NVFP4_MLP_ONLY_CFG` applies NVFP4 quantization to MLP (and MoE) layers, leaving attention layers unquantized. `NVFP4_OMLP_ONLY_CFG` additionally quantizes the `o_proj` layer. Both preserve accuracy in the sensitive attention QKV projections while still providing significant compression.*
```
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `examples/llm_ptq/README.md` at line 72: the sentence in the NVFP4 accuracy note repeats "only" awkwardly; reword the sentence that describes `mtq.NVFP4_MLP_ONLY_CFG` and `mtq.NVFP4_OMLP_ONLY_CFG` so it reads smoothly, for example: "`mtq.NVFP4_MLP_ONLY_CFG` applies NVFP4 quantization to MLP (and MoE) layers while leaving attention layers unquantized; `mtq.NVFP4_OMLP_ONLY_CFG` also quantizes the `o_proj` layer."

modelopt/torch/quantization/config.py (1)
Lines 399-407: Avoid sharing one quantizer dict across multiple keys; use separate copies instead.

The `_nvfp4_quantizer` dict is currently referenced by both `*weight_quantizer` and `*input_quantizer` keys. While the current codebase doesn't perform in-place mutations on these configs, this pattern is inconsistent with other quantizer configs in the file (`INT8_DEFAULT_CFG`, `FP8_DEFAULT_CFG`, etc., which use inline separate dicts for each key). Using `copy.deepcopy()` prevents potential issues from future code modifications and aligns with the defensive copying pattern already used elsewhere in the codebase (e.g., in docstring examples with `copy.deepcopy(mtq.INT4_AWQ_CFG)`).

🔧 Suggested fix

```diff
+import copy
 @@
 NVFP4_DEFAULT_CFG = {
     "quant_cfg": {
-        "*weight_quantizer": _nvfp4_quantizer,
-        "*input_quantizer": _nvfp4_quantizer,
+        "*weight_quantizer": copy.deepcopy(_nvfp4_quantizer),
+        "*input_quantizer": copy.deepcopy(_nvfp4_quantizer),
         **_default_disabled_quantizer_cfg,
     },
     "algorithm": "max",
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `modelopt/torch/quantization/config.py` around lines 399-407: the `_nvfp4_quantizer` dict is shared between `NVFP4_DEFAULT_CFG`'s `"*weight_quantizer"` and `"*input_quantizer"`; change the assignment so each key gets its own copy (e.g., use `copy.deepcopy(_nvfp4_quantizer)` or construct two separate dict literals) rather than referencing the same `_nvfp4_quantizer` object, so the two keys point to independent dicts and accidental cross-mutation is avoided.
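The hazard this nitpick describes is ordinary dict aliasing; a minimal sketch with a stand-in spec (the `shared` values are illustrative, not the full `_nvfp4_quantizer` contents):

```python
import copy

# Stand-in for the shared _nvfp4_quantizer spec (values illustrative).
shared = {"num_bits": (2, 1), "enable": True}

# Aliased: both keys reference the same dict object.
aliased_cfg = {"*weight_quantizer": shared, "*input_quantizer": shared}
aliased_cfg["*weight_quantizer"]["enable"] = False  # mutates both entries

# Defensive: each key gets an independent copy.
safe_cfg = {
    "*weight_quantizer": copy.deepcopy(shared),
    "*input_quantizer": copy.deepcopy(shared),
}
safe_cfg["*weight_quantizer"]["num_bits"] = (4, 3)  # only one entry changes
```

In the aliased version the `"enable"` flip is visible through both keys, while the deep-copied version isolates each quantizer spec.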
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Line 90: The new qformat key "nvfp4_omlp_only" (mapped to
mtq.NVFP4_OMLP_ONLY_CFG) is exposed but not included in the auto-quantize
allowlist, causing assertions when using --qformat with it; update the allowlist
used by the auto-quantize validation (the list/variable that enumerates
supported qformats) to include the string "nvfp4_omlp_only" so validation
recognizes mtq.NVFP4_OMLP_ONLY_CFG as allowed.
In `@modelopt/torch/quantization/config.py`:
- Around line 627-634: The new public config NVFP4_OMLP_ONLY_CFG is not included
in the choices collection, breaking name-based validation/discovery; update the
choices object (the variable named choices where other configs are listed) to
include an entry for "NVFP4_OMLP_ONLY_CFG" (or the appropriate key name used for
configs in that list) mapping to the NVFP4_OMLP_ONLY_CFG symbol so it is
discoverable/validatable alongside the other quant config entries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8ac15cd9-402b-4461-a64d-a394e6582ff7
📒 Files selected for processing (6)
- CHANGELOG.rst
- examples/llm_ptq/README.md
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/quantization/config.py
💤 Files with no reviewable changes (1)
- examples/llm_ptq/example_utils.py
Pull request overview
This PR extends the NVFP4 PTQ configuration options by introducing an “MLP-only + attention o_proj” variant and refactors parts of the quantization config definitions to reduce duplication, alongside example-script and documentation updates to expose and recommend these formats.
Changes:
- Added `NVFP4_OMLP_ONLY_CFG` (and CLI/script plumbing for `nvfp4_omlp_only`) to quantize MLP/MoE plus attention `o_proj`.
- Refactored NVFP4-related quantizer sub-configs to reuse shared dicts and removed redundant fields (`enable`, `axis`) where defaults apply.
- Updated LLM PTQ examples/README and changelog to document/recommend `nvfp4_mlp_only`/`nvfp4_omlp_only`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Summary per file:

| File | Description |
|---|---|
| `modelopt/torch/quantization/config.py` | Adds `NVFP4_OMLP_ONLY_CFG`, refactors NVFP4 config construction, and changes the `pass_through_bwd` default. |
| `examples/llm_ptq/scripts/huggingface_example.sh` | Accepts `nvfp4_omlp_only` in CLI validation. |
| `examples/llm_ptq/hf_ptq.py` | Maps `nvfp4_omlp_only` to the new `mtq` config constant. |
| `examples/llm_ptq/example_utils.py` | Removes model-specific nvfp4 exclusions that previously disabled certain attention/MLA quantization paths. |
| `examples/llm_ptq/README.md` | Documents accuracy guidance and adds the new quant option to usage snippets. |
| `CHANGELOG.rst` | Notes the new `nvfp4_omlp_only` quantization format. |
```diff
 NVFP4_MLP_ONLY_CFG = {
     "quant_cfg": _nvfp4_mlp_only_quant_cfg,
     "algorithm": "max",
 }


 NVFP4_OMLP_ONLY_CFG = {
     "quant_cfg": {
-        "*mlp*weight_quantizer": {
-            "num_bits": (2, 1),
-            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
-            "enable": True,
-            "pass_through_bwd": True,
-        },
-        "*mlp*input_quantizer": {
-            "num_bits": (2, 1),
-            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
-            "enable": True,
-            "pass_through_bwd": True,
-        },
-        **_default_disabled_quantizer_cfg,
         "*o_proj*weight_quantizer": _nvfp4_quantizer,
         "*o_proj*input_quantizer": _nvfp4_quantizer,
+        **_nvfp4_mlp_only_quant_cfg,
     },
     "algorithm": "max",
 }
```
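The refactor above builds one config from another via dict unpacking. A minimal sketch of that composition pattern with placeholder specs (names mirror the diff, but these are illustrative stand-ins, not ModelOpt's real objects):

```python
# Placeholder quantizer spec and base mapping, mirroring the structure above.
nvfp4_quantizer = {
    "num_bits": (2, 1),
    "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
}
mlp_only_quant_cfg = {
    "*mlp*weight_quantizer": nvfp4_quantizer,
    "*mlp*input_quantizer": nvfp4_quantizer,
    "default": {"enable": False},  # stand-in for the disabled-by-default entries
}

# OMLP-only = MLP-only plus o_proj entries, composed by unpacking.
omlp_only_cfg = {
    "quant_cfg": {
        "*o_proj*weight_quantizer": nvfp4_quantizer,
        "*o_proj*input_quantizer": nvfp4_quantizer,
        **mlp_only_quant_cfg,
    },
    "algorithm": "max",
}
```

One design caveat: keys in the unpacked mapping override earlier duplicates, so this composition is only safe while the key sets stay disjoint, as they are here.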
New quantization format NVFP4_OMLP_ONLY_CFG is introduced here, but existing test parametrizations already cover other NVFP4 configs (e.g., NVFP4_MLP_ONLY_CFG). It would be good to add this new config to those existing test matrices to ensure export/quantization paths work for the new format as well.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 970-981: The default for pass_through_bwd was changed to True
which silently alters behavior for all configs; revert the ModeloptField default
back to the previous value (False) to preserve backward compatibility, add an
inline comment above the pass_through_bwd declaration noting that any change to
this default is breaking and must be accompanied by a migration/changelog entry,
and if you intend to make True the new default in a future release, implement an
explicit migration path or feature flag so existing users aren’t impacted.
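The compatibility concern in this comment can be illustrated outside ModelOpt; a minimal sketch (a toy dataclass, not the real `ModeloptField` machinery; all names hypothetical) of keeping the old default while signaling a planned change:

```python
import warnings
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuantizerAttributeConfig:
    """Toy stand-in for the real config class; field names illustrative."""

    # Keep the established default (False) rather than silently flipping to True;
    # changing this default is breaking and needs a migration/changelog entry.
    pass_through_bwd: Optional[bool] = None

    def resolved_pass_through_bwd(self) -> bool:
        if self.pass_through_bwd is None:
            warnings.warn(
                "pass_through_bwd not set; defaulting to False. "
                "The default may change to True in a future release.",
                FutureWarning,
            )
            return False
        return self.pass_through_bwd
```

Users who set the field explicitly see no warning, while those relying on the default get advance notice before the behavior flips.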
📒 Files selected for processing (2)
- CHANGELOG.rst
- modelopt/torch/quantization/config.py
🚧 Files skipped from review as they are similar to previous changes (1)
- CHANGELOG.rst
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/llm_ptq/hf_ptq.py (1)

Lines 243-260: ⚠️ Potential issue | 🟠 Major. Mutating `args.qformat` in the assertion generator breaks downstream logic.

Line 259 uses `for args.qformat in qformat_list`, which overwrites the argparse namespace attribute during validation. After this loop completes, `args.qformat` becomes the last element from `qformat_list` (a single format string), not the original comma-separated value. This breaks lines 876 and 888, which call `args.qformat.split(",")` to check the format count: these checks will now always return length 1, regardless of whether multiple formats were originally provided.

Suggested fix
```diff
 assert all(
-    args.qformat
+    qformat
     in [
         "fp8",
         "int8_sq",
         "int8_wo",
         "int4_awq",
         "nvfp4",
         "nvfp4_awq",
         "w4a8_awq",
         "fp8_pb_wo",
         "w4a8_mxfp4_fp8",
         "nvfp4_mlp_only",
         "nvfp4_omlp_only",
         "mxfp8",
     ]
-    for args.qformat in qformat_list
+    for qformat in qformat_list
), "One or more quantization formats provided are not supported for unified checkpoint export"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 243 - 260, The assertion currently mutates args.qformat by using "for args.qformat in qformat_list"; change the generator to iterate with a new loop variable (e.g., "fmt" or "q") and validate each fmt in qformat_list against the allowed formats so args.qformat is not overwritten; update the assert in the block that references args.qformat and qformat_list (the validation near the top of hf_ptq.py) to use the non-mutating loop variable and keep the same error message.
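The bug and its fix can be reproduced in isolation; a minimal sketch using an `argparse.Namespace` stand-in (the format names here are illustrative):

```python
import argparse

# Reproduction of the review finding: using the argparse attribute itself as the
# generator's loop variable mutates the namespace object from the outer scope.
args = argparse.Namespace(qformat="nvfp4,fp8")
qformat_list = args.qformat.split(",")

# Buggy pattern: after all() runs, args.qformat holds the last list element.
assert all(args.qformat in {"nvfp4", "fp8", "int8_sq"} for args.qformat in qformat_list)
buggy_value = args.qformat  # "fp8"; the original "nvfp4,fp8" is lost

# Fixed pattern: iterate with a fresh loop variable so args stays untouched.
args = argparse.Namespace(qformat="nvfp4,fp8")
qformat_list = args.qformat.split(",")
assert all(fmt in {"nvfp4", "fp8", "int8_sq"} for fmt in qformat_list)
fixed_value = args.qformat  # still "nvfp4,fp8"
```

The mutation leaks despite the generator's own scope because attribute assignment modifies the shared `args` object, not a local name.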
📒 Files selected for processing (3)
- examples/llm_ptq/README.md
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/quantization/conversion.py
realAsma
left a comment
Can we revert removing "enable": True, please? This behavior adds unwarranted complications.
If the objective of this PR is to reduce the length of config.py, that’s not the right goal.
We’re moving toward YAML-based recipes, which will address the recipe expansion issue.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main     #973   +/-   ##
=======================================
  Coverage   72.12%   72.13%
=======================================
  Files         209      209
  Lines       23628    23631     +3
=======================================
+ Hits        17042    17046     +4
+ Misses       6586     6585     -1
```

☔ View full report in Codecov by Sentry.
Done
What does this PR do?
Type of change: New feature
Usage
```shell
huggingface_example.sh ... --quant nvfp4_omlp_only
```
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

Additional Information
Summary by CodeRabbit

New Features
- Added the `nvfp4_omlp_only` quantization format, which quantizes attention output projections alongside MLP layers.

Changed
- Refactored the quantization config to reduce duplication via shared quantizer specifications; the default `pass_through_bwd` behavior is now True.

Documentation
- Updated the LLM PTQ README, examples, and changelog to reflect these additions.