Adds AutoQuant support for VLM / Qwen3.5-Qwen3.6 style models#1381
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR enables AutoQuantize to handle fused MoE modules and per-expert quantizers, adds disabled_layers and excluded-module-name-pattern cost plumbing, updates example CLI wiring (VLM gating and outer model usage), extends format allowlist, and adds unit tests for these behaviors. ChangesAutoQuantize: Fused MoE Support and Disabled Layers Configuration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1381 +/- ##
==========================================
+ Coverage 76.27% 76.91% +0.64%
==========================================
Files 489 489
Lines 54415 54495 +80
==========================================
+ Hits 41504 41916 +412
+ Misses 12911 12579 -332
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
a6aa497 to
5de4432
Compare
2a21f66 to
2828faa
Compare
| # TODO: To be refacored into config system. | ||
| _QWEN36_AUTOQ_DISABLED_LAYERS = ( | ||
| "*shared_expert_gate*", | ||
| "*linear_attn.in_proj_a*", | ||
| "*linear_attn.in_proj_b*", | ||
| ) | ||
| _VLM_AUTOQ_DISABLED_LAYERS = ("*visual*", "*mtp*", "*vision_tower*") |
There was a problem hiding this comment.
@juhi10071998 let's move this to the config systems once ready
There was a problem hiding this comment.
thanks @meenchen , yes let me do that, unfortunately I got pulled into a PO day0 model and won't get time until next week
There was a problem hiding this comment.
no problem, just want to make sure we are aligned with the future refactor.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
modelopt/torch/quantization/algorithms.py (1)
110-126: ⚡ Quick winDuplicate helper with subtly different semantics.
This
_get_module_weight_numelduplicates the same-named function in_auto_quantize_cost.py(lines 93-106) but uses different dispatch logic:
- Here: checks
_is_fused_experts_module()first, then falls back toweight_auto_quantize_cost.py: checksweightfirst, then falls back to projectionsIf a fused-experts module ever has both a
weightattr and projection attrs, the two files will count different elements. Consider consolidating into a single canonical helper (e.g., in_auto_quantize_cost.py) and importing it here to guarantee consistent accounting.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/algorithms.py` around lines 110 - 126, The duplicate helper _get_module_weight_numel in this file has different dispatch order than the one in _auto_quantize_cost.py causing inconsistent counts for modules that expose both weight and fused projections; remove the local copy and import the canonical helper from _auto_quantize_cost (or move a single implementation to a shared location), ensure the canonical implementation handles _is_fused_experts_module, weight, gate_up_proj and down_proj consistently, and update any references here to call the imported/shared _get_module_weight_numel so weight accounting is identical across modules.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@modelopt/torch/quantization/algorithms.py`:
- Around line 110-126: The duplicate helper _get_module_weight_numel in this
file has different dispatch order than the one in _auto_quantize_cost.py causing
inconsistent counts for modules that expose both weight and fused projections;
remove the local copy and import the canonical helper from _auto_quantize_cost
(or move a single implementation to a shared location), ensure the canonical
implementation handles _is_fused_experts_module, weight, gate_up_proj and
down_proj consistently, and update any references here to call the
imported/shared _get_module_weight_numel so weight accounting is identical
across modules.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e0bd6ffb-0ff2-4ac7-a994-86a6a9c888fb
📒 Files selected for processing (4)
examples/llm_ptq/hf_ptq.pymodelopt/torch/quantization/_auto_quantize_cost.pymodelopt/torch/quantization/algorithms.pytests/unit/torch/quantization/test_autoquant.py
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
The fused-MoE / w4a16_nvfp4 / linear_attn grouping changes look reasonable and include unit tests for the fused-MoE cost path and the lm_head override. A few things to address before this lands:
-
Likely regression in
--recipepath for Qwen3.5/3.6-MoE VLMs. The gate inload_modelflipped fromif args.recipe is None:toif args.auto_quantize_bits is None:. Per the deleted comment, recipe-mode was exactly the case that neededlanguage_model = full_modelso a recipe targeting*lm_head*could see the outer CausalLM'slm_head. Under the new condition, recipe + VLM now extracts the inner LM and passes it tomono_quantize, which is the opposite of what's intended — recipe-drivenlm_headquantization of Qwen3.5/3.6-MoE VLMs will silently misslm_head. The autoquant path can be opted into without breaking the recipe path, e.g.if args.recipe is None and args.auto_quantize_bits is None:. -
Duplicated
_get_module_weight_numel. The same helper is defined in bothmodelopt/torch/quantization/algorithms.pyandmodelopt/torch/quantization/_auto_quantize_cost.py. Pick one canonical location (the cost module is the more natural home;algorithms.pyalready imports from it) and re-export. -
_match_quantizer_cfgnow skipsparent_classentries. This looks like a real bug fix (a{parent_class: nn.BatchNorm1d, quantizer_name: '*', enable: False}entry would previously last-match-win against anyweight_quantizer/input_quantizerlookup and disable everything), but it changes behavior for any recipe that relies onparent_class-scoped global entries reachingget_auto_quantize_configoutput. Worth either calling out explicitly in the PR body or adding a regression test that pins the new behavior. -
Hardcoded VLM/Qwen patterns in
hf_ptq.py. Already acknowledged with a TODO; just noting that*shared_expert_gate*is broader than the existing*mlp.shared_expert_gate.*in_default_disabled_quantizer_cfg— fine in practice, but worth keeping in mind when refactoring into the config system. -
Test fragility.
test_get_auto_quantize_config_keeps_selected_lm_head_enabledrelies onFP8_DEFAULT_CFGcontaining*lm_head*(via_default_disabled_quantizer_cfg) for the assertion to succeed. If that default ever changes, the test will fail confusingly. Consider seeding the search_state's recipe with an explicit*lm_head*: enable=Falseentry so the test is self-contained.
| if args.recipe is None: | ||
| # AutoQuantize walks the outer CausalLM so lm_head is visible to the | ||
| # search. Visual/MTP siblings are excluded by disabled-layer patterns. | ||
| if args.auto_quantize_bits is None: |
There was a problem hiding this comment.
Bot comment.
This gate was previously if args.recipe is None: with a comment explaining that recipe-mode is the exception that needs language_model = full_model so a recipe targeting *lm_head* can see the outer CausalLM's lm_head on Qwen3.5/3.6-MoE VLMs.
Flipping it to if args.auto_quantize_bits is None: reverses that behavior: recipe + VLM now extracts the inner LM and passes it to mono_quantize, so a recipe that quantizes *lm_head* will silently miss it. AutoQuantize can be opted into without breaking the recipe path, e.g.:
if args.recipe is None and args.auto_quantize_bits is None:
extracted_lm, extracted_model_type = extract_and_prepare_language_model_from_vl(full_model)| total = 0 | ||
| for attr in ("gate_up_proj", "down_proj"): | ||
| param = getattr(module, attr, None) | ||
| if param is not None: |
There was a problem hiding this comment.
Bot comment.
This helper is duplicated in _auto_quantize_cost.py. Pick one canonical location (the cost module is the natural home — algorithms.py already imports from it) and re-export from the other to avoid drift.
| for entry in quant_cfg: | ||
| parent_class = entry.get("parent_class") if hasattr(entry, "get") else entry.parent_class | ||
| if parent_class is not None: | ||
| continue |
There was a problem hiding this comment.
Bot comment.
Skipping parent_class-scoped entries here looks like a genuine bug fix (e.g. {parent_class: nn.BatchNorm1d, quantizer_name: '*', enable: False} would previously last-match-win against any weight_quantizer lookup and disable everything), but it's a behavior change worth calling out in the PR body and pinning with a regression test.
| idx for idx, entry in enumerate(quant_cfg) if entry["quantizer_name"] == "*lm_head*" | ||
| ) | ||
| weight_idx = next( | ||
| idx |
There was a problem hiding this comment.
Bot comment.
This assertion implicitly depends on FP8_DEFAULT_CFG containing a *lm_head* disable via _default_disabled_quantizer_cfg. If that default ever changes, the test will fail confusingly. Consider constructing a small explicit recipe (with *lm_head*: enable=False in its quant_cfg) so the test pins the override behavior independent of the default config contents.
2828faa to
ec69be9
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/unit/torch/quantization/test_config_validation.py (1)
493-510: ⚡ Quick winTest logic looks good; consider moving imports to top of file.
The test correctly validates that
parent_class-scoped config entries are ignored during bare-name quantizer lookup, which aligns with the PR's goal to prevent class-scoped globals from overriding selected modules.However, the import at line 495 (and throughout the
TestMatchQuantizerCfgclass) is inside the test method without justification. As per coding guidelines, imports should be at the top of the file so import errors surface at collection time rather than mid-test. Consider refactoring all_match_quantizer_cfgimports in this test class to the module level.📦 Suggested refactor: move imports to module level
Add the import near line 21 with the other imports from
modelopt.torch.quantization:from modelopt.torch.quantization.config import ( FP8_2D_BLOCKWISE_WEIGHT_ONLY_CFG, FP8_DEFAULT_CFG, FP8_PER_CHANNEL_PER_TOKEN_CFG, INT4_AWQ_CFG, NVFP4_DEFAULT_CFG, W4A8_AWQ_BETA_CFG, MaxCalibConfig, QuantizeConfig, find_quant_cfg_entry_by_path, need_calibration, normalize_quant_cfg_list, ) +from modelopt.torch.quantization.algorithms import _match_quantizer_cfgThen remove the in-function imports from all test methods in
TestMatchQuantizerCfg(lines 441, 452, 461, 471, 482, 495, 513, 524, 542).As per coding guidelines: "Imports inside functions or test methods without explicit justification" should be flagged; acceptable in-function imports are only for circular imports or optional dependencies, with a brief comment explaining why.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/test_config_validation.py` around lines 493 - 510, Move the repeated in-test imports of _match_quantizer_cfg out of the individual test methods and into the module-level imports: add "from modelopt.torch.quantization.algorithms import _match_quantizer_cfg" alongside the other top-level imports, then remove the in-function import statements inside TestMatchQuantizerCfg methods (including the test_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup test). If a circular or optional-import reason prevents moving it to module scope, keep the in-function import but add a brief comment explaining the justification; otherwise ensure all tests use the module-level _match_quantizer_cfg import so import errors surface at collection time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/torch/quantization/test_config_validation.py`:
- Around line 493-510: Move the repeated in-test imports of _match_quantizer_cfg
out of the individual test methods and into the module-level imports: add "from
modelopt.torch.quantization.algorithms import _match_quantizer_cfg" alongside
the other top-level imports, then remove the in-function import statements
inside TestMatchQuantizerCfg methods (including the
test_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup test). If
a circular or optional-import reason prevents moving it to module scope, keep
the in-function import but add a brief comment explaining the justification;
otherwise ensure all tests use the module-level _match_quantizer_cfg import so
import errors surface at collection time.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fdfb047e-ead7-43a7-a68e-c0c1c1eb67fe
📒 Files selected for processing (5)
examples/llm_ptq/hf_ptq.pymodelopt/torch/quantization/_auto_quantize_cost.pymodelopt/torch/quantization/algorithms.pytests/unit/torch/quantization/test_autoquant.pytests/unit/torch/quantization/test_config_validation.py
🚧 Files skipped from review as they are similar to previous changes (4)
- modelopt/torch/quantization/_auto_quantize_cost.py
- tests/unit/torch/quantization/test_autoquant.py
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/quantization/algorithms.py
ec69be9 to
7f2dca2
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/unit/torch/quantization/test_autoquant.py (1)
693-721: 💤 Low valueThe test addresses the prior feedback but still has one implicit coupling.
The test now explicitly adds
*lm_head*disable to the recipe (line 695), which addresses the prior concern about implicit dependency on default config contents. However, line 721 assertsnum_bits == (4, 3), which still depends onFP8_DEFAULT_CFG's internal quantizer configuration. Consider extracting the expectednum_bitsfrom the recipe config to make the test self-documenting:♻️ Suggested improvement
+ # Extract expected num_bits from the recipe's weight quantizer config + expected_num_bits = None + for entry in recipe_config["quant_cfg"]: + if entry.get("quantizer_name") == "*weight_quantizer": + expected_num_bits = entry["cfg"]["num_bits"] + break + assert weight_entry["enable"] is True - assert weight_entry["cfg"]["num_bits"] == (4, 3) + assert weight_entry["cfg"]["num_bits"] == expected_num_bits🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/quantization/test_autoquant.py` around lines 693 - 721, The assertion in test_get_auto_quantize_config_keeps_selected_lm_head_enabled hardcodes num_bits == (4, 3) and thus implicitly depends on mtq.FP8_DEFAULT_CFG; instead, derive the expected bits from the recipe_config used to build the QuantRecipe: locate the entry in recipe_config["quant_cfg"] whose "quantizer_name" == "lm_head.weight_quantizer" and read its ["cfg"]["num_bits"], then assert weight_entry["cfg"]["num_bits"] equals that extracted value so the test documents and relies only on the recipe it constructed (refer to symbols test_get_auto_quantize_config_keeps_selected_lm_head_enabled, recipe_config, FP8_DEFAULT_CFG, and QuantRecipe).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/torch/quantization/test_autoquant.py`:
- Around line 693-721: The assertion in
test_get_auto_quantize_config_keeps_selected_lm_head_enabled hardcodes num_bits
== (4, 3) and thus implicitly depends on mtq.FP8_DEFAULT_CFG; instead, derive
the expected bits from the recipe_config used to build the QuantRecipe: locate
the entry in recipe_config["quant_cfg"] whose "quantizer_name" ==
"lm_head.weight_quantizer" and read its ["cfg"]["num_bits"], then assert
weight_entry["cfg"]["num_bits"] equals that extracted value so the test
documents and relies only on the recipe it constructed (refer to symbols
test_get_auto_quantize_config_keeps_selected_lm_head_enabled, recipe_config,
FP8_DEFAULT_CFG, and QuantRecipe).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4458fd4a-7fe9-4464-aeb1-54661dd6fb00
📒 Files selected for processing (6)
examples/llm_ptq/hf_ptq.pymodelopt/torch/quantization/_auto_quantize_cost.pymodelopt/torch/quantization/algorithms.pymodelopt/torch/quantization/model_quant.pytests/unit/torch/quantization/test_autoquant.pytests/unit/torch/quantization/test_config_validation.py
✅ Files skipped from review due to trivial changes (2)
- modelopt/torch/quantization/model_quant.py
- tests/unit/torch/quantization/test_config_validation.py
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/quantization/algorithms.py
7f2dca2 to
80f86d1
Compare
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 153-171: The two helpers get_auto_quantize_disabled_layers and
get_auto_quantize_cost_excluded_patterns are currently public; make them private
or explicitly export them: either rename both to
_get_auto_quantize_disabled_layers and _get_auto_quantize_cost_excluded_patterns
and update all local callers, or add a module-level __all__ list at the top of
the file that explicitly exports the public API (and omits these two names).
Ensure references to is_multimodal_model, _QWEN36_AUTOQ_DISABLED_LAYERS and
_VLM_AUTOQ_DISABLED_LAYERS inside those functions remain valid after renaming or
when adjusting imports.
- Around line 1462-1472: Reject --auto_quantize_cost_exclude_patterns when
AutoQuantize is off by validating args after parsing: check if the parsed flag
auto_quantize_cost_exclude_patterns is set while auto_quantize_bits (or whatever
flag controls enabling AutoQuantize) is unset/None, and raise a parser error (or
call parser.error) reporting that --auto_quantize_cost_exclude_patterns requires
--auto_quantize_bits. Place this validation near the argument parsing / main
setup (after parser.parse_args()) so auto_quantize() logic remains unchanged and
misconfigured runs fail fast.
- Around line 144-163: The function get_auto_quantize_disabled_layers currently
always appends _QWEN36_AUTOQ_DISABLED_LAYERS which incorrectly affects non-Qwen
models; update it to only extend with _QWEN36_AUTOQ_DISABLED_LAYERS when the
supplied model is a Qwen family model (e.g., add a predicate
is_qwen_model(model) or check model.config.model_type/ name for "qwen" and use
that to gate the extension), keep the existing multimodal check for
_VLM_AUTOQ_DISABLED_LAYERS and continue to build disabled_layers from
_default_disabled_quantizer_cfg unchanged.
- Around line 584-587: The current guard that keeps the outer CausalLM when
args.recipe and args.auto_quantize_bits are None prevents Nemotron VL from
entering the AutoQuantize path and also leaves args.calib_with_images enabled
which causes auto_quantize() to hard-fail; change the logic so that when
args.auto_quantize_bits is set the code does not keep the outer CausalLM (so
AutoQuantize path is used) and ensure image calibration is disabled for Nemotron
VL by either clearing args.calib_with_images before calling auto_quantize() or
updating auto_quantize() to ignore image calibration for Nemotron VL models
(refer to args.auto_quantize_bits, auto_quantize(), args.calib_with_images, and
the CausalLM wrapping logic).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 30f10133-5668-40ca-86bb-633b5c905fd7
📒 Files selected for processing (6)
examples/llm_ptq/hf_ptq.pymodelopt/torch/quantization/_auto_quantize_cost.pymodelopt/torch/quantization/algorithms.pymodelopt/torch/quantization/model_quant.pytests/unit/torch/quantization/test_autoquant.pytests/unit/torch/quantization/test_config_validation.py
✅ Files skipped from review due to trivial changes (1)
- modelopt/torch/quantization/model_quant.py
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/unit/torch/quantization/test_config_validation.py
- modelopt/torch/quantization/_auto_quantize_cost.py
- tests/unit/torch/quantization/test_autoquant.py
- modelopt/torch/quantization/algorithms.py
| parser.add_argument( | ||
| "--auto_quantize_cost_exclude_patterns", | ||
| nargs="+", | ||
| default=None, | ||
| help=( | ||
| "Wildcard module-name patterns to exclude from AutoQuantize effective-bits cost " | ||
| "accounting. The matched modules can still be disabled from quantization separately; " | ||
| "this flag only changes the budget denominator and selected-cost calculation. " | ||
| "For multimodal models, VLM/MTP sibling modules are excluded by default." | ||
| ), | ||
| ) |
There was a problem hiding this comment.
Reject this flag when AutoQuantize is off.
--auto_quantize_cost_exclude_patterns is only consumed inside auto_quantize(), so plain PTQ and recipe runs silently ignore it. Adding a parser error when --auto_quantize_bits is unset would catch misconfigured jobs early.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/llm_ptq/hf_ptq.py` around lines 1462 - 1472, Reject
--auto_quantize_cost_exclude_patterns when AutoQuantize is off by validating
args after parsing: check if the parsed flag auto_quantize_cost_exclude_patterns
is set while auto_quantize_bits (or whatever flag controls enabling
AutoQuantize) is unset/None, and raise a parser error (or call parser.error)
reporting that --auto_quantize_cost_exclude_patterns requires
--auto_quantize_bits. Place this validation near the argument parsing / main
setup (after parser.parse_args()) so auto_quantize() logic remains unchanged and
misconfigured runs fail fast.
80f86d1 to
ec996da
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
examples/llm_ptq/hf_ptq.py (2)
153-163:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGate Qwen-disabled patterns to Qwen models only.
At Line 160,
_QWEN36_AUTOQ_DISABLED_LAYERSis always appended, so non-Qwen models can be silently over-excluded during AutoQuant search.Suggested minimal fix
def get_auto_quantize_disabled_layers(model) -> list[str]: @@ - disabled_layers.extend(p for p in _QWEN36_AUTOQ_DISABLED_LAYERS if p not in disabled_layers) + model_type = getattr(getattr(model, "config", None), "model_type", "") or "" + if "qwen" in model_type.lower(): + disabled_layers.extend(p for p in _QWEN36_AUTOQ_DISABLED_LAYERS if p not in disabled_layers)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llm_ptq/hf_ptq.py` around lines 153 - 163, get_auto_quantize_disabled_layers currently always appends _QWEN36_AUTOQ_DISABLED_LAYERS which can over-exclude non-Qwen models; change it to only extend with _QWEN36_AUTOQ_DISABLED_LAYERS when the input model is actually a Qwen model (e.g., guard with an existing is_qwen36_model(model) or add a small check on model.config or model.name to detect Qwen), leaving the multimodal branch for _VLM_AUTOQ_DISABLED_LAYERS unchanged; update the logic in get_auto_quantize_disabled_layers and use the symbol _QWEN36_AUTOQ_DISABLED_LAYERS in the guarded branch.
583-587:⚠️ Potential issue | 🟠 Major | ⚡ Quick winNemotron VL + AutoQuant still hits an immediate hard-fail path.
This new gate keeps the outer model for AutoQuant, but Nemotron VL still defaults
args.calib_with_images=Trueearlier, andauto_quantize()rejects image calibration at Line 332-336. The result is AutoQuant aborting before search starts.Suggested minimal fix
- if is_nemotron_vl_model and not args.calib_with_images: + if ( + is_nemotron_vl_model + and not args.calib_with_images + and args.auto_quantize_bits is None + ): print("Nemotron VL model detected. Enabling image-text calibration by default.") args.calib_with_images = True🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llm_ptq/hf_ptq.py` around lines 583 - 587, AutoQuant aborts because Nemotron VL defaults args.calib_with_images=True but auto_quantize() rejects image calibration; before invoking auto_quantize (or before the code path that keeps the outer model), detect when args.auto_quantize_bits is set and disable image calibration by setting args.calib_with_images = False (or pass an allow_image_calib=False flag into auto_quantize) so auto_quantize() will not hit the image-calibration rejection; update the code around extract_and_prepare_language_model_from_vl and the auto_quantize() call to enforce this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 153-163: get_auto_quantize_disabled_layers currently always
appends _QWEN36_AUTOQ_DISABLED_LAYERS which can over-exclude non-Qwen models;
change it to only extend with _QWEN36_AUTOQ_DISABLED_LAYERS when the input model
is actually a Qwen model (e.g., guard with an existing is_qwen36_model(model) or
add a small check on model.config or model.name to detect Qwen), leaving the
multimodal branch for _VLM_AUTOQ_DISABLED_LAYERS unchanged; update the logic in
get_auto_quantize_disabled_layers and use the symbol
_QWEN36_AUTOQ_DISABLED_LAYERS in the guarded branch.
- Around line 583-587: AutoQuant aborts because Nemotron VL defaults
args.calib_with_images=True but auto_quantize() rejects image calibration;
before invoking auto_quantize (or before the code path that keeps the outer
model), detect when args.auto_quantize_bits is set and disable image calibration
by setting args.calib_with_images = False (or pass an allow_image_calib=False
flag into auto_quantize) so auto_quantize() will not hit the image-calibration
rejection; update the code around extract_and_prepare_language_model_from_vl and
the auto_quantize() call to enforce this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7226482a-2f36-4775-a3d4-87e15780d5b1
📒 Files selected for processing (6)
examples/llm_ptq/hf_ptq.pymodelopt/torch/quantization/_auto_quantize_cost.pymodelopt/torch/quantization/algorithms.pymodelopt/torch/quantization/model_quant.pytests/unit/torch/quantization/test_autoquant.pytests/unit/torch/quantization/test_config_validation.py
✅ Files skipped from review due to trivial changes (1)
- modelopt/torch/quantization/model_quant.py
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/unit/torch/quantization/test_config_validation.py
- modelopt/torch/quantization/_auto_quantize_cost.py
- tests/unit/torch/quantization/test_autoquant.py
- modelopt/torch/quantization/algorithms.py
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review of PR #1381 (AutoQuant for Qwen3.5/3.6 + VLM).
Previous critical comments — addressed:
- 💬 cjluo-nv's
--reciperegression: gate is nowif args.recipe is None and args.auto_quantize_bits is None:at hf_ptq.py:583 — recipe-mode VLMs again retain the outer CausalLM. ✅ - 💬 Duplicated
_get_module_weight_numel: deduped — single canonical helper now lives in_auto_quantize_cost.pyand is imported intoalgorithms.py. ✅ - 💬
_match_quantizer_cfgparent_class-skip behavior change: pinned withtest_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookupregression test. ✅ - 💬 lm_head test fragility: test now appends
{quantizer_name: "*lm_head*", enable: False}to its own recipe (test_autoquant.py:695), so it no longer relies on_default_disabled_quantizer_cfgfor the enable-check. Thenum_bits == (4, 3)assertion still implicitly trustsFP8_DEFAULT_CFG's weight-quantizer shape, but that's a minor coupling.
Outstanding for human sign-off:
- Nemotron VL +
--auto_quantize_bitsstill hard-fails before search.load_modelunconditionally setsargs.calib_with_images = Truefor Nemotron VL (hf_ptq.py ~L471-473), andauto_quantize()then raisesNotImplementedError("AutoQuantize with image-text calibration is not supported yet.")at L332-336. CodeRabbit marked this "✅ Addressed in commit ec996da" but the suggested guard (and args.auto_quantize_bits is None) is not in the current file. The PR's target architecture is Qwen3.5/3.6 (validated end-to-end per the PR body), so this likely doesn't break the shipped use case, but it should either be fixed or explicitly scoped out. - Hardcoded VLM/Qwen disable patterns in
hf_ptq.py. Acknowledged via TODO and inline reply chain (juhi10071998 to follow up next week). The Qwen-only patterns (*shared_expert_gate*,*linear_attn.in_proj_*) are appended for every model, but in practice these names don't appear on non-Qwen architectures, so the over-broadening is benign. Worth confirming that's acceptable until the config-system refactor lands. - The new
get_auto_quantize_disabled_layers/get_auto_quantize_cost_excluded_patternshelpers inexamples/llm_ptq/hf_ptq.pyare public-named but only used locally; minor.
Tests for the new behavior (cost-model exclusion, fused MoE weight accounting, lm_head override, parent_class skip) are present and meaningful. Recommend cjluo-nv take a final look at the Nemotron VL gating before approval.
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review: previous critical comments are addressed. Recommending nudge for human sign-off due to remaining surface-area items, not unresolved bugs.
Previously-flagged criticals — all resolved:
- 💬
--reciperegression: gate athf_ptq.pyL607 is nowif args.recipe is None and args.auto_quantize_bits is None:. ✅ - 💬 Duplicated
_get_module_weight_numel: helper now lives canonically in_auto_quantize_cost.pyand is imported byalgorithms.py. ✅ - 💬
_match_quantizer_cfgparent_classskip: pinned bytest_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup. ✅ - 💬 lm_head test fragility: test now appends an explicit
*lm_head*: enable=Falseentry to its own recipe rather than relying onFP8_DEFAULT_CFGdefaults. ✅ - 💬 Nemotron VL + AutoQuantize hard-fail:
parse_args()now errors on--auto_quantize_bits+--calib_with_images, andload_modelno longer auto-enables image calibration whenargs.auto_quantize_bitsis set. Newtests/examples/llm_ptq/test_hf_ptq_args.pyregression-tests both. ✅ - 💬 Qwen-only patterns over-broadening:
_get_auto_quantize_disabled_layersnow gates_QWEN36_AUTOQ_DISABLED_LAYERSbehind_is_qwen_model(model), with a unit test that asserts they are absent for Llama. ✅
Outstanding for human sign-off:
- Public-API surface continues to widen: new
EXCLUDED_MODULE_NAME_PATTERNS_KEY,w4a16_nvfp4qformat choice in bothmodel_quant.auto_quantizedocstring andhf_ptq.pychoice list, newdisabled_layersindefault_state_dict(loaded from search checkpoints). - Hardcoded
_QWEN36_AUTOQ_DISABLED_LAYERS/_VLM_AUTOQ_DISABLED_LAYERSinhf_ptq.pystill have aTODO: Refactor into the config system(juhi10071998 to follow up). Worth confirming the hardcoded list is acceptable for the v1 ship. - End-to-end Qwen3.6 35B +
fp8,w4a16_nvfp4and vLLM-loadable export were validated locally per the PR body, but there is no CI coverage for that path; recommend cjluo-nv take a final look before merge.
| mto.enable_huggingface_checkpointing() | ||
|
|
||
|
|
||
| # TODO: Refactor into the config system. |
There was a problem hiding this comment.
can we move model specific code to example_utils.py?
There was a problem hiding this comment.
Addressed in 0cc2bef: moved the Qwen/VLM AutoQuant model-specific helper logic out of hf_ptq.py and into example_utils.py. hf_ptq.py now imports those helpers from example_utils.
| from .utils import is_quantized_linear | ||
|
|
||
|
|
||
| def _is_fused_experts_module(module: nn.Module) -> bool: |
There was a problem hiding this comment.
why not move it to huggingface.py?
There was a problem hiding this comment.
Addressed in 0cc2bef: moved the HF quantized fused-experts wrapper check into plugins/huggingface.py. algorithms.py now calls it through a late import to avoid the existing registration import cycle.
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
0cc2bef to
61e506a
Compare
realAsma
left a comment
There was a problem hiding this comment.
Why do we need the cost_excluded_patterns?
It seems to add complexity without real benefits.
Without this, AutoQuant takes MTP and Vision tower into the cost function. It will be hard for users to define the effective bits ratio for the LLM part. |
| for pattern in _as_list(search_state.get("disabled_layers")) | ||
| ) | ||
| per_module_entries: list[dict] = [] | ||
| _per_module_attrs = ("input_quantizer", "weight_quantizer", "output_quantizer") |
There was a problem hiding this comment.
get_auto_quantize_config() still emits only standard quantizer attrs, so fused-expert AutoQuant choices may not eplay correctly from generated configs. Can we fix it?
There was a problem hiding this comment.
🤖 Bot comment.
Yes, I think we should fix it. The issue is specifically in the replay config, not in the live searched model.
During search this PR now snapshots/restores fused experts through the real fused attrs:
gate_up_proj_input_quantizergate_up_proj_weight_quantizersdown_proj_input_quantizerdown_proj_weight_quantizers
But get_auto_quantize_config() still uses only:
("input_quantizer", "weight_quantizer", "output_quantizer")So if a fused expert wins a recipe, the generated config emits entries like:
{"quantizer_name": "layers.N.mlp.experts.weight_quantizer", ...}That will not match the actual fused-expert quantizers on replay. Their real names are shaped like layers.N.mlp.experts.gate_up_proj_weight_quantizers.0 / down_proj_weight_quantizers.0. conversion.py normalizes those to ...gate_up_proj_weight_quantizer / ...down_proj_weight_quantizer, not to plain ...weight_quantizer. Because the generated config starts with {"quantizer_name": "*", "enable": False}, the selected fused-expert quantizers can remain disabled/default when someone calls mtq.get_auto_quantize_config(search_state) and replays it through mtq.quantize() or an export flow.
Suggested fix: persist config-replay quantizer attrs into candidate_stats while the model objects are still available, then have get_auto_quantize_config() use those attrs, defaulting old search states to the standard trio for backward compatibility. For fused experts, the replay attrs should be something like:
(
"gate_up_proj_input_quantizer",
"gate_up_proj_weight_quantizer", # singular form matches all ModuleList children via conversion normalization
"down_proj_input_quantizer",
"down_proj_weight_quantizer",
)A regression test should exercise get_auto_quantize_config() for a fused expert hparam and assert the generated quant_cfg contains the gate/up and down fused-expert quantizer names, not only module.weight_quantizer.
There was a problem hiding this comment.
Fixed in 613c82f. This was a real replay-path bug: live AutoQuant search/export used the fused-expert quantizer attrs correctly, but get_auto_quantize_config() could emit plain *.weight_quantizer entries for fused experts. The fix persists replayable quantizer attrs in candidate_stats and emits gate_up_proj/down_proj quantizer names, with a fallback for older search checkpoints. Added regression coverage in test_get_auto_quantize_config_emits_fused_expert_quantizer_names; focused AutoQuant/fused-expert tests and pre-commit passed.
realAsma
left a comment
There was a problem hiding this comment.
Look great!
I am not convinced of the benefits of EXCLUDED_MODULE_NAME_PATTERNS_KEY compared to the complexity it adds. - However I dont have a strong preference.
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
What does this PR do?
Type of change: new feature, bug fix, new tests
Details
w4a16_nvfp4as an AutoQuant search format.lm_headto override default disables.lm_headremains visible.parent_class-scoped quant config entries during AutoQuant bare quantizer matching, preventing class-scoped global entries from last-match overriding every selected module.hf_ptq.pywith a TODO to refactor into the config system.Usage
Testing
/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest tests/unit/torch/quantization/test_autoquant.py::test_get_auto_quantize_config_keeps_selected_lm_head_enabled tests/unit/torch/quantization/test_config_validation.py::TestMatchQuantizerCfg::test_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest tests/unit/torch/quantization/test_autoquant.py tests/unit/torch/quantization/test_config_validation.py -k "not data_parallel"(120 passed, 1 deselected)/Users/weimingc/miniconda3/envs/modelopt/bin/python -m py_compile examples/llm_ptq/hf_ptq.py modelopt/torch/quantization/algorithms.py modelopt/torch/quantization/_auto_quantize_cost.py tests/unit/torch/quantization/test_autoquant.py tests/unit/torch/quantization/test_config_validation.py-k "not data_parallel"only failedtest_data_parallel_auto_quantizebecause this local sandbox cannot bind a free socket (PermissionError: Operation not permitted).fp8,w4a16_nvfp4and exported a checkpoint.Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: N/AAdditional Information
Summary by CodeRabbit
New Features
Improvements
Tests
Documentation