Adds AutoQuant support for VLM / Qwen3.5-Qwen3.6 style models by meenchen · Pull Request #1381 · NVIDIA/Model-Optimizer

meenchen · 2026-05-01T22:20:38Z

What does this PR do?

Type of change: new feature, bug fix, new tests

Details

Enables AutoQuant search over fused MoE expert containers by snapshotting/restoring their per-expert quantizers.
Adds Qwen3.5/3.6 linear-attention grouping rules so fused deployment layers keep compatible quant formats.
Supports w4a16_nvfp4 as an AutoQuant search format.
Preserves disabled AutoQuant layer patterns in generated configs while allowing selected modules like lm_head to override default disables.
Keeps recipe-mode and AutoQuantize VLM paths on the outer CausalLM so Qwen3.5/3.6-MoE lm_head remains visible.
Skips parent_class-scoped quant config entries during AutoQuant bare quantizer matching, preventing class-scoped global entries from last-match overriding every selected module.
Adds temporary hardcoded Qwen/VLM AutoQuant disabled-layer patterns in hf_ptq.py with a TODO to refactor into the config system.
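
The snapshot/restore idea in the first bullet can be illustrated with a minimal sketch. It is not the PR's implementation: `ToyQuantizer`, `ToyFusedExperts`, the attribute name `gate_up_proj_weight_quantizers`, and both helper functions are hypothetical stand-ins, and the real code operates on ModelOpt quantizer modules rather than plain Python objects.

```python
import copy


class ToyQuantizer:
    """Hypothetical stand-in for a per-expert quantizer."""

    def __init__(self, num_bits=8, enabled=True):
        self.num_bits = num_bits
        self.enabled = enabled


class ToyFusedExperts:
    """Hypothetical fused-experts container holding one quantizer per expert."""

    def __init__(self, num_experts):
        self.gate_up_proj_weight_quantizers = [ToyQuantizer() for _ in range(num_experts)]


def snapshot_quantizers(module, attr):
    # Deep-copy each quantizer's state so later mutation cannot leak into the snapshot.
    return [copy.deepcopy(vars(q)) for q in getattr(module, attr)]


def restore_quantizers(module, attr, snapshot):
    quantizers = getattr(module, attr)
    # The list length must be preserved: one quantizer per expert.
    assert len(quantizers) == len(snapshot)
    for q, state in zip(quantizers, snapshot):
        vars(q).update(copy.deepcopy(state))


experts = ToyFusedExperts(num_experts=4)
attr = "gate_up_proj_weight_quantizers"
baseline = snapshot_quantizers(experts, attr)

# Try a candidate format on every expert, then roll back to the baseline.
for q in experts.gate_up_proj_weight_quantizers:
    q.num_bits = 4
restore_quantizers(experts, attr, baseline)
```

Snapshotting state rather than swapping whole quantizer objects keeps the container's list length and object identities stable, which is one plausible way to let a search try several formats on the same fused module.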

Usage

python examples/llm_ptq/hf_ptq.py \
  --pyt_ckpt_path <model_path> \
  --qformat fp8,w4a16_nvfp4 \
  --auto_quantize_bits 5.0 \
  --auto_quantize_cost_model active_moe \
  --auto_quantize_checkpoint <autoquant_state.pt> \
  --export_path <output_dir>

Testing

/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest tests/unit/torch/quantization/test_autoquant.py::test_get_auto_quantize_config_keeps_selected_lm_head_enabled tests/unit/torch/quantization/test_config_validation.py::TestMatchQuantizerCfg::test_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup
/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest tests/unit/torch/quantization/test_autoquant.py tests/unit/torch/quantization/test_config_validation.py -k "not data_parallel" (120 passed, 1 deselected)
/Users/weimingc/miniconda3/envs/modelopt/bin/python -m py_compile examples/llm_ptq/hf_ptq.py modelopt/torch/quantization/algorithms.py modelopt/torch/quantization/_auto_quantize_cost.py tests/unit/torch/quantization/test_autoquant.py tests/unit/torch/quantization/test_config_validation.py
Full local affected-file pytest without -k "not data_parallel" only failed test_data_parallel_auto_quantize because this local sandbox cannot bind a free socket (PermissionError: Operation not permitted).
Ran Qwen3.6 35B AutoQuant e2e with fp8,w4a16_nvfp4 and exported a checkpoint.
Verified exported checkpoint loads in vLLM nightly without local patches.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: ✅
Did you update Changelog?: N/A

Additional Information

Summary by CodeRabbit

New Features
- Added w4a16_nvfp4 quantization format and optional cost-exclusion patterns for AutoQuantize.
Improvements
- Safer multimodal/VLM handling and AutoQuantize now runs on the full outer model when applicable.
- Better fused-MoE support, more accurate weight accounting, and refined attention-grouping for improved quantization choices.
- Dynamic layer-disabling support for targeted disables.
Tests
- New unit tests covering cost-model exclusions, fused-MoE accounting, and config selection.
Documentation
- Updated cost-constraint example to show exclusion-pattern usage.

copy-pr-bot · 2026-05-01T22:20:41Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-01T22:20:44Z

Walkthrough

This PR enables AutoQuantize to handle fused MoE modules and per-expert quantizers, adds disabled_layers and excluded-module-name-pattern cost plumbing, updates example CLI wiring (VLM gating and outer model usage), extends format allowlist, and adds unit tests for these behaviors.

Changes

AutoQuantize: Fused MoE Support and Disabled Layers Configuration

| Layer / File(s) | Summary |
| --- | --- |
| Examples: dynamic disabled-layers & CLI wiring `examples/llm_ptq/hf_ptq.py` | Adds `get_auto_quantize_disabled_layers` and `get_auto_quantize_cost_excluded_patterns`, extends allowlist with `w4a16_nvfp4`, replaces hard-coded disabled list with the dynamic helper, gates VLM extraction for auto-quantize, and passes `full_model` to `auto_quantize`. |
| Cost model: weight-count helper & exclusions `modelopt/torch/quantization/_auto_quantize_cost.py`, `tests/unit/torch/quantization/test_autoquant.py`, `modelopt/torch/quantization/model_quant.py` | Introduces `EXCLUDED_MODULE_NAME_PATTERNS_KEY`, `fnmatch` usage, `_get_module_weight_numel()` for fused-expert-aware element counts, normalizes excluded-module-name-patterns in cost constraints, makes excluded modules yield zero per-module cost weight, updates ActiveMoE support, and adds tests and doc example updates. |
| Fused-experts detection & per-quantizer attributes `modelopt/torch/quantization/algorithms.py`, `tests/unit/torch/quantization/test_autoquant.py` | Detects fused-experts containers, enumerates per-quantizer attributes for fused and standard layouts, snapshots/restores per-expert quantizers, allocates fresh quantizers preserving ModuleList lengths, and adds related unit tests. |
| Linear-attn grouping rules `modelopt/torch/quantization/algorithms.py` | Adds regex-based group-key helpers for fused linear-attn (`qkvz` vs `ba`) and integrates them into `quant_grouping_rules`. |
| disabled_layers default and module recognition `modelopt/torch/quantization/algorithms.py` | Adds `disabled_layers: None` to default state, stores `config['disabled_layers']` in `before_search`, and extends `_is_auto_quantize_module()` to include fused-experts QuantModule containers. |
| Algorithm weight accounting integration `modelopt/torch/quantization/algorithms.py`, `modelopt/torch/quantization/_auto_quantize_cost.py` | Replaces direct `weight.numel()` calls with `_get_module_weight_numel(module)` in weight-size calculations and cost accounting to correctly count fused-expert parameters. |
| _as_list and disabled_layers normalization `modelopt/torch/quantization/algorithms.py` | Adds `_as_list()` utility and normalizes `search_state.disabled_layers` into global `quant_cfg` disable entries with `enable=False`. |
| Per-module config assembly and ordering `modelopt/torch/quantization/algorithms.py` | Collects per-module quantizer cfg entries separately and assembles final `quant_cfg` with global entries first, then per-module entries to preserve path-scoped override semantics. |
| Tighten quantizer matching to ignore parent_class-scoped entries `modelopt/torch/quantization/algorithms.py`, `tests/unit/torch/quantization/test_config_validation.py` | Updates `_match_quantizer_cfg()` to skip entries that specify `parent_class` for bare-name lookups and adds a unit test verifying this behavior. |
| Doc update and unit tests `modelopt/torch/quantization/model_quant.py`, `tests/unit/torch/quantization/test_autoquant.py` | Adjusts `active_moe` cost example to include `excluded_module_name_patterns`, expands imports, and adds tests for zero `cost_weight`, cost-model exclusion patterns, fused-expert weight counting, and `get_auto_quantize_config` ordering and LM-head preservation. |
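
To make the cost-exclusion row concrete, here is a small sketch of how wildcard exclusions could change an effective-bits figure. It assumes effective bits is a parameter-weighted average of per-module bit widths, which this thread does not spell out; `effective_bits`, the module names, and the numbers are all illustrative and not taken from ModelOpt.

```python
from fnmatch import fnmatch


def effective_bits(module_numel, module_bits, excluded_patterns=()):
    """Weighted-average bits over modules that are not excluded from the cost."""
    total_bits = 0.0
    total_numel = 0
    for name, numel in module_numel.items():
        if any(fnmatch(name, pat) for pat in excluded_patterns):
            continue  # excluded modules contribute zero cost weight
        total_bits += numel * module_bits[name]
        total_numel += numel
    return total_bits / total_numel if total_numel else 0.0


# Illustrative module names and element counts (not from a real model).
module_numel = {
    "model.layers.0.mlp.experts": 1000,
    "model.layers.0.self_attn.q_proj": 200,
    "model.visual.blocks.0.attn.qkv": 800,
}
module_bits = {
    "model.layers.0.mlp.experts": 4,
    "model.layers.0.self_attn.q_proj": 8,
    "model.visual.blocks.0.attn.qkv": 16,
}

with_visual = effective_bits(module_numel, module_bits)
without_visual = effective_bits(module_numel, module_bits, excluded_patterns=("*visual*",))
```

In this toy case an unquantized visual tower dominates the average unless it is excluded, which is the kind of distortion an exclusion pattern is meant to avoid when the budget is intended to describe the language model only.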

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1497: Both PRs modify AutoQuantize's MoE-related cost/weight accounting; this PR adds exclusion-pattern support and fused-expert-aware counting.

Suggested reviewers

juhi10071998
realAsma

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.66% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding AutoQuant support for VLM and Qwen-style models, which aligns with the core objectives of the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | No critical security anti-patterns found in modified files: no unsafe torch.load, numpy.load, hardcoded trust_remote_code, eval/exec, nosec bypasses, or unsafe dependencies. |

github-actions · 2026-05-01T22:25:02Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-08 17:06 UTC

codecov · 2026-05-01T22:33:38Z

Codecov Report

❌ Patch coverage is 92.30769% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.91%. Comparing base (54ce4e0) to head (613c82f).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| modelopt/torch/quantization/_auto_quantize_cost.py | 82.60% | 4 Missing ⚠️ |
| modelopt/torch/quantization/algorithms.py | 95.45% | 3 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1381      +/-   ##
==========================================
+ Coverage   76.27%   76.91%   +0.64%     
==========================================
  Files         489      489              
  Lines       54415    54495      +80     
==========================================
+ Hits        41504    41916     +412     
+ Misses      12911    12579     -332

| Flag | Coverage Δ |
| --- | --- |
| examples | `42.79% <57.14%> (+0.79%)` ⬆️ |
| gpu | `58.41% <57.14%> (-1.47%)` ⬇️ |
| regression | `14.89% <20.87%> (-0.24%)` ⬇️ |
| unit | `54.06% <92.30%> (+0.06%)` ⬆️ |

meenchen · 2026-06-03T17:32:07Z

+# TODO: To be refactored into config system.
+_QWEN36_AUTOQ_DISABLED_LAYERS = (
+    "*shared_expert_gate*",
+    "*linear_attn.in_proj_a*",
+    "*linear_attn.in_proj_b*",
+)
+_VLM_AUTOQ_DISABLED_LAYERS = ("*visual*", "*mtp*", "*vision_tower*")


@juhi10071998 let's move this to the config systems once ready

Thanks @meenchen, yes, let me do that. Unfortunately I got pulled into a PO day0 model and won't have time until next week.

no problem, just want to make sure we are aligned with the future refactor.

coderabbitai

🧹 Nitpick comments (1)

modelopt/torch/quantization/algorithms.py (1)
110-126: ⚡ Quick win

Duplicate helper with subtly different semantics.

This _get_module_weight_numel duplicates the same-named function in _auto_quantize_cost.py (lines 93-106) but uses different dispatch logic:

Here: checks _is_fused_experts_module() first, then falls back to weight

_auto_quantize_cost.py: checks weight first, then falls back to projections

If a fused-experts module ever has both a weight attr and projection attrs, the two files will count different elements. Consider consolidating into a single canonical helper (e.g., in _auto_quantize_cost.py) and importing it here to guarantee consistent accounting.
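
The divergence described above can be reproduced with a toy module. This is a sketch of the two dispatch orders as the comment describes them, not a copy of either helper; `FakeParam`, `FakeFusedExperts`, and the function names are hypothetical, and the `is_fused` flag stands in for `_is_fused_experts_module()`.

```python
class FakeParam:
    """Minimal tensor stand-in exposing only numel()."""

    def __init__(self, n):
        self._n = n

    def numel(self):
        return self._n


class FakeFusedExperts:
    """Hypothetical module exposing BOTH a weight attr and fused projections."""

    def __init__(self):
        self.weight = FakeParam(10)
        self.gate_up_proj = FakeParam(600)
        self.down_proj = FakeParam(300)


def _projection_numel(module):
    total = 0
    for attr in ("gate_up_proj", "down_proj"):
        param = getattr(module, attr, None)
        if param is not None:
            total += param.numel()
    return total


def numel_fused_first(module, is_fused):
    # Order described for algorithms.py: fused check first, then weight.
    if is_fused:
        return _projection_numel(module)
    return module.weight.numel()


def numel_weight_first(module):
    # Order described for _auto_quantize_cost.py: weight first, then projections.
    weight = getattr(module, "weight", None)
    if weight is not None:
        return weight.numel()
    return _projection_numel(module)


m = FakeFusedExperts()
count_fused_first = numel_fused_first(m, is_fused=True)
count_weight_first = numel_weight_first(m)
```

With both attribute kinds present, the two orders report different element counts for the same module, which is the inconsistency a single shared helper would rule out.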
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/algorithms.py` around lines 110 - 126, The
duplicate helper _get_module_weight_numel in this file has different dispatch
order than the one in _auto_quantize_cost.py causing inconsistent counts for
modules that expose both weight and fused projections; remove the local copy and
import the canonical helper from _auto_quantize_cost (or move a single
implementation to a shared location), ensure the canonical implementation
handles _is_fused_experts_module, weight, gate_up_proj and down_proj
consistently, and update any references here to call the imported/shared
_get_module_weight_numel so weight accounting is identical across modules.

📥 Commits

Reviewing files that changed from the base of the PR and between 862ed5e and 2828faa.

📒 Files selected for processing (4)

examples/llm_ptq/hf_ptq.py
modelopt/torch/quantization/_auto_quantize_cost.py
modelopt/torch/quantization/algorithms.py
tests/unit/torch/quantization/test_autoquant.py

cjluo-nv

Bot review — DM the bot to share feedback.

The fused-MoE / w4a16_nvfp4 / linear_attn grouping changes look reasonable and include unit tests for the fused-MoE cost path and the lm_head override. A few things to address before this lands:

Likely regression in --recipe path for Qwen3.5/3.6-MoE VLMs. The gate in load_model flipped from if args.recipe is None: to if args.auto_quantize_bits is None:. Per the deleted comment, recipe-mode was exactly the case that needed language_model = full_model so a recipe targeting *lm_head* could see the outer CausalLM's lm_head. Under the new condition, recipe + VLM now extracts the inner LM and passes it to mono_quantize, which is the opposite of what's intended — recipe-driven lm_head quantization of Qwen3.5/3.6-MoE VLMs will silently miss lm_head. The autoquant path can be opted into without breaking the recipe path, e.g. if args.recipe is None and args.auto_quantize_bits is None:.
Duplicated _get_module_weight_numel. The same helper is defined in both modelopt/torch/quantization/algorithms.py and modelopt/torch/quantization/_auto_quantize_cost.py. Pick one canonical location (the cost module is the more natural home; algorithms.py already imports from it) and re-export.
_match_quantizer_cfg now skips parent_class entries. This looks like a real bug fix (a {parent_class: nn.BatchNorm1d, quantizer_name: '*', enable: False} entry would previously last-match-win against any weight_quantizer/input_quantizer lookup and disable everything), but it changes behavior for any recipe that relies on parent_class-scoped global entries reaching get_auto_quantize_config output. Worth either calling out explicitly in the PR body or adding a regression test that pins the new behavior.
Hardcoded VLM/Qwen patterns in hf_ptq.py. Already acknowledged with a TODO; just noting that *shared_expert_gate* is broader than the existing *mlp.shared_expert_gate.* in _default_disabled_quantizer_cfg — fine in practice, but worth keeping in mind when refactoring into the config system.
Test fragility. test_get_auto_quantize_config_keeps_selected_lm_head_enabled relies on FP8_DEFAULT_CFG containing *lm_head* (via _default_disabled_quantizer_cfg) for the assertion to succeed. If that default ever changes, the test will fail confusingly. Consider seeding the search_state's recipe with an explicit *lm_head*: enable=False entry so the test is self-contained.

cjluo-nv · 2026-06-03T19:13:17Z

-            if args.recipe is None:
+            # AutoQuantize walks the outer CausalLM so lm_head is visible to the
+            # search. Visual/MTP siblings are excluded by disabled-layer patterns.
+            if args.auto_quantize_bits is None:


Bot comment.

This gate was previously if args.recipe is None: with a comment explaining that recipe-mode is the exception that needs language_model = full_model so a recipe targeting *lm_head* can see the outer CausalLM's lm_head on Qwen3.5/3.6-MoE VLMs.

Flipping it to if args.auto_quantize_bits is None: reverses that behavior: recipe + VLM now extracts the inner LM and passes it to mono_quantize, so a recipe that quantizes *lm_head* will silently miss it. AutoQuantize can be opted into without breaking the recipe path, e.g.:

if args.recipe is None and args.auto_quantize_bits is None:
    extracted_lm, extracted_model_type = extract_and_prepare_language_model_from_vl(full_model)
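
The three gate conditions discussed in this comment can be compared side by side. The sketch below only evaluates the boolean conditions quoted in the thread; `extracts_inner_lm`, the gate labels, and the placeholder values `"recipe.yaml"` and `5.0` are illustrative and not part of `hf_ptq.py`.

```python
from itertools import product


def extracts_inner_lm(gate, recipe, auto_quantize_bits):
    """Return True when the given gate would extract the inner language model."""
    if gate == "original":
        return recipe is None
    if gate == "pr":
        return auto_quantize_bits is None
    if gate == "suggested":
        return recipe is None and auto_quantize_bits is None
    raise ValueError(gate)


# Each row: (recipe, auto_quantize_bits, original, pr, suggested)
rows = []
for recipe, bits in product((None, "recipe.yaml"), (None, 5.0)):
    rows.append(
        (recipe, bits, *(extracts_inner_lm(g, recipe, bits) for g in ("original", "pr", "suggested")))
    )

for row in rows:
    print(row)
```

The row for a recipe run without AutoQuantize is the one the comment flags: the original gate keeps the outer model, the PR's gate extracts the inner LM, and the suggested gate keeps the outer model again.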

cjluo-nv · 2026-06-03T19:13:17Z

+        total = 0
+        for attr in ("gate_up_proj", "down_proj"):
+            param = getattr(module, attr, None)
+            if param is not None:


Bot comment.

This helper is duplicated in _auto_quantize_cost.py. Pick one canonical location (the cost module is the natural home — algorithms.py already imports from it) and re-export from the other to avoid drift.

cjluo-nv · 2026-06-03T19:13:17Z

    for entry in quant_cfg:
+        parent_class = entry.get("parent_class") if hasattr(entry, "get") else entry.parent_class
+        if parent_class is not None:
+            continue


Bot comment.

Skipping parent_class-scoped entries here looks like a genuine bug fix (e.g. {parent_class: nn.BatchNorm1d, quantizer_name: '*', enable: False} would previously last-match-win against any weight_quantizer lookup and disable everything), but it's a behavior change worth calling out in the PR body and pinning with a regression test.
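
A toy last-match-wins lookup shows why the skip matters. This is a simplified sketch, not `_match_quantizer_cfg()` itself: `match_last` is hypothetical, entries are plain dicts, and `parent_class` is a string here purely for illustration.

```python
from fnmatch import fnmatch


def match_last(quant_cfg, quantizer_name, skip_parent_class_scoped):
    """Return the last entry whose pattern matches a bare quantizer name."""
    matched = None
    for entry in quant_cfg:
        if skip_parent_class_scoped and entry.get("parent_class") is not None:
            continue  # class-scoped entries say nothing about a bare name
        if fnmatch(quantizer_name, entry["quantizer_name"]):
            matched = entry  # later entries override earlier ones
    return matched


quant_cfg = [
    {"quantizer_name": "*weight_quantizer", "enable": True},
    {"parent_class": "nn.BatchNorm1d", "quantizer_name": "*", "enable": False},
]

before = match_last(quant_cfg, "weight_quantizer", skip_parent_class_scoped=False)
after = match_last(quant_cfg, "weight_quantizer", skip_parent_class_scoped=True)
```

Without the skip, the class-scoped `*` entry is the last match and disables the lookup result; with the skip, the unscoped `*weight_quantizer` entry is returned instead.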

cjluo-nv · 2026-06-03T19:13:17Z

+        idx for idx, entry in enumerate(quant_cfg) if entry["quantizer_name"] == "*lm_head*"
+    )
+    weight_idx = next(
+        idx


Bot comment.

This assertion implicitly depends on FP8_DEFAULT_CFG containing a *lm_head* disable via _default_disabled_quantizer_cfg. If that default ever changes, the test will fail confusingly. Consider constructing a small explicit recipe (with *lm_head*: enable=False in its quant_cfg) so the test pins the override behavior independent of the default config contents.
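
The override behavior the test is meant to pin can be sketched without any default config. The example below assumes last-match-wins resolution over a flat entry list, as described elsewhere in this PR; `resolve_enable` and the entry dicts are hypothetical simplifications of the real `quant_cfg` format.

```python
from fnmatch import fnmatch


def resolve_enable(quant_cfg, full_quantizer_name, default=True):
    """Last-match-wins lookup of the enable flag for a fully qualified name."""
    enabled = default
    for entry in quant_cfg:
        if fnmatch(full_quantizer_name, entry["quantizer_name"]):
            enabled = entry["enable"]
    return enabled


# Explicit entries, so nothing depends on the contents of a default config.
global_entries = [{"quantizer_name": "*lm_head*", "enable": False}]
per_module_entries = [{"quantizer_name": "lm_head.weight_quantizer", "enable": True}]

globals_first = global_entries + per_module_entries
globals_last = per_module_entries + global_entries

lm_head_selected = resolve_enable(globals_first, "lm_head.weight_quantizer")
lm_head_clobbered = resolve_enable(globals_last, "lm_head.weight_quantizer")
lm_head_input = resolve_enable(globals_first, "lm_head.input_quantizer")
```

Placing global entries first lets the path-scoped entry re-enable the selected `lm_head` weight quantizer, while quantizers that were not selected stay disabled by the global pattern.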

coderabbitai

🧹 Nitpick comments (1)

tests/unit/torch/quantization/test_config_validation.py (1)
493-510: ⚡ Quick win

Test logic looks good; consider moving imports to top of file.

The test correctly validates that parent_class-scoped config entries are ignored during bare-name quantizer lookup, which aligns with the PR's goal to prevent class-scoped globals from overriding selected modules.

However, the import at line 495 (and throughout the TestMatchQuantizerCfg class) is inside the test method without justification. As per coding guidelines, imports should be at the top of the file so import errors surface at collection time rather than mid-test. Consider refactoring all _match_quantizer_cfg imports in this test class to the module level.
📦 Suggested refactor: move imports to module level

Add the import near line 21 with the other imports from modelopt.torch.quantization:
 from modelopt.torch.quantization.config import (
     FP8_2D_BLOCKWISE_WEIGHT_ONLY_CFG,
     FP8_DEFAULT_CFG,
     FP8_PER_CHANNEL_PER_TOKEN_CFG,
     INT4_AWQ_CFG,
     NVFP4_DEFAULT_CFG,
     W4A8_AWQ_BETA_CFG,
     MaxCalibConfig,
     QuantizeConfig,
     find_quant_cfg_entry_by_path,
     need_calibration,
     normalize_quant_cfg_list,
 )
+from modelopt.torch.quantization.algorithms import _match_quantizer_cfg
Then remove the in-function imports from all test methods in TestMatchQuantizerCfg (lines 441, 452, 461, 471, 482, 495, 513, 524, 542).
As per coding guidelines: "Imports inside functions or test methods without explicit justification" should be flagged; acceptable in-function imports are only for circular imports or optional dependencies, with a brief comment explaining why.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_config_validation.py` around lines 493 -
510, Move the repeated in-test imports of _match_quantizer_cfg out of the
individual test methods and into the module-level imports: add "from
modelopt.torch.quantization.algorithms import _match_quantizer_cfg" alongside
the other top-level imports, then remove the in-function import statements
inside TestMatchQuantizerCfg methods (including the
test_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup test). If
a circular or optional-import reason prevents moving it to module scope, keep
the in-function import but add a brief comment explaining the justification;
otherwise ensure all tests use the module-level _match_quantizer_cfg import so
import errors surface at collection time.

📥 Commits

Reviewing files that changed from the base of the PR and between 2828faa and ec69be9.

📒 Files selected for processing (5)

examples/llm_ptq/hf_ptq.py
modelopt/torch/quantization/_auto_quantize_cost.py
modelopt/torch/quantization/algorithms.py
tests/unit/torch/quantization/test_autoquant.py
tests/unit/torch/quantization/test_config_validation.py

🚧 Files skipped from review as they are similar to previous changes (4)

modelopt/torch/quantization/_auto_quantize_cost.py
tests/unit/torch/quantization/test_autoquant.py
examples/llm_ptq/hf_ptq.py
modelopt/torch/quantization/algorithms.py

coderabbitai

🧹 Nitpick comments (1)

tests/unit/torch/quantization/test_autoquant.py (1)

693-721: 💤 Low value

The test addresses the prior feedback but still has one implicit coupling.

The test now explicitly adds *lm_head* disable to the recipe (line 695), which addresses the prior concern about implicit dependency on default config contents. However, line 721 asserts num_bits == (4, 3), which still depends on FP8_DEFAULT_CFG's internal quantizer configuration. Consider extracting the expected num_bits from the recipe config to make the test self-documenting:

♻️ Suggested improvement

+    # Extract expected num_bits from the recipe's weight quantizer config
+    expected_num_bits = None
+    for entry in recipe_config["quant_cfg"]:
+        if entry.get("quantizer_name") == "*weight_quantizer":
+            expected_num_bits = entry["cfg"]["num_bits"]
+            break
+
     assert weight_entry["enable"] is True
-    assert weight_entry["cfg"]["num_bits"] == (4, 3)
+    assert weight_entry["cfg"]["num_bits"] == expected_num_bits

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/quantization/test_autoquant.py` around lines 693 - 721, The
assertion in test_get_auto_quantize_config_keeps_selected_lm_head_enabled
hardcodes num_bits == (4, 3) and thus implicitly depends on mtq.FP8_DEFAULT_CFG;
instead, derive the expected bits from the recipe_config used to build the
QuantRecipe: locate the entry in recipe_config["quant_cfg"] whose
"quantizer_name" == "lm_head.weight_quantizer" and read its ["cfg"]["num_bits"],
then assert weight_entry["cfg"]["num_bits"] equals that extracted value so the
test documents and relies only on the recipe it constructed (refer to symbols
test_get_auto_quantize_config_keeps_selected_lm_head_enabled, recipe_config,
FP8_DEFAULT_CFG, and QuantRecipe).

📥 Commits

Reviewing files that changed from the base of the PR and between ec69be9 and 7f2dca2.

📒 Files selected for processing (6)

examples/llm_ptq/hf_ptq.py
modelopt/torch/quantization/_auto_quantize_cost.py
modelopt/torch/quantization/algorithms.py
modelopt/torch/quantization/model_quant.py
tests/unit/torch/quantization/test_autoquant.py
tests/unit/torch/quantization/test_config_validation.py

✅ Files skipped from review due to trivial changes (2)

modelopt/torch/quantization/model_quant.py
tests/unit/torch/quantization/test_config_validation.py

🚧 Files skipped from review as they are similar to previous changes (1)

modelopt/torch/quantization/algorithms.py

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 153-171: The two helpers get_auto_quantize_disabled_layers and
get_auto_quantize_cost_excluded_patterns are currently public; make them private
or explicitly export them: either rename both to
_get_auto_quantize_disabled_layers and _get_auto_quantize_cost_excluded_patterns
and update all local callers, or add a module-level __all__ list at the top of
the file that explicitly exports the public API (and omits these two names).
Ensure references to is_multimodal_model, _QWEN36_AUTOQ_DISABLED_LAYERS and
_VLM_AUTOQ_DISABLED_LAYERS inside those functions remain valid after renaming or
when adjusting imports.
- Around line 1462-1472: Reject --auto_quantize_cost_exclude_patterns when
AutoQuantize is off by validating args after parsing: check if the parsed flag
auto_quantize_cost_exclude_patterns is set while auto_quantize_bits (or whatever
flag controls enabling AutoQuantize) is unset/None, and raise a parser error (or
call parser.error) reporting that --auto_quantize_cost_exclude_patterns requires
--auto_quantize_bits. Place this validation near the argument parsing / main
setup (after parser.parse_args()) so auto_quantize() logic remains unchanged and
misconfigured runs fail fast.
- Around line 144-163: The function get_auto_quantize_disabled_layers currently
always appends _QWEN36_AUTOQ_DISABLED_LAYERS which incorrectly affects non-Qwen
models; update it to only extend with _QWEN36_AUTOQ_DISABLED_LAYERS when the
supplied model is a Qwen family model (e.g., add a predicate
is_qwen_model(model) or check model.config.model_type/ name for "qwen" and use
that to gate the extension), keep the existing multimodal check for
_VLM_AUTOQ_DISABLED_LAYERS and continue to build disabled_layers from
_default_disabled_quantizer_cfg unchanged.
- Around line 584-587: The current guard that keeps the outer CausalLM when
args.recipe and args.auto_quantize_bits are None prevents Nemotron VL from
entering the AutoQuantize path and also leaves args.calib_with_images enabled
which causes auto_quantize() to hard-fail; change the logic so that when
args.auto_quantize_bits is set the code does not keep the outer CausalLM (so
AutoQuantize path is used) and ensure image calibration is disabled for Nemotron
VL by either clearing args.calib_with_images before calling auto_quantize() or
updating auto_quantize() to ignore image calibration for Nemotron VL models
(refer to args.auto_quantize_bits, auto_quantize(), args.calib_with_images, and
the CausalLM wrapping logic).
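
The Qwen-gating suggestion in the third item above could look roughly like the sketch below. It is not the helper from `hf_ptq.py`: `disabled_layers_sketch` is hypothetical, the "qwen" substring check on a model-type string is an assumption taken from the comment's own suggestion, and `"qwen3_5_moe"` is a placeholder value. Only the pattern tuples are copied from the diff quoted earlier in this thread.

```python
QWEN36_AUTOQ_DISABLED_LAYERS = (
    "*shared_expert_gate*",
    "*linear_attn.in_proj_a*",
    "*linear_attn.in_proj_b*",
)
VLM_AUTOQ_DISABLED_LAYERS = ("*visual*", "*mtp*", "*vision_tower*")


def disabled_layers_sketch(default_patterns, model_type, is_multimodal):
    """Build the disabled-layer list, adding Qwen patterns only for Qwen models."""
    disabled = list(default_patterns)

    def extend_unique(patterns):
        for pattern in patterns:
            if pattern not in disabled:
                disabled.append(pattern)

    if is_multimodal:
        extend_unique(VLM_AUTOQ_DISABLED_LAYERS)
    # Assumption: a "qwen" substring in model_type identifies the Qwen family.
    if "qwen" in (model_type or "").lower():
        extend_unique(QWEN36_AUTOQ_DISABLED_LAYERS)
    return disabled


llama = disabled_layers_sketch(["*lm_head*"], "llama", is_multimodal=False)
qwen_vlm = disabled_layers_sketch(["*lm_head*"], "qwen3_5_moe", is_multimodal=True)
```

A substring check is crude and may over- or under-match; a refactor into the config system, as the TODO proposes, would likely replace it with per-model configuration.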

📥 Commits

Reviewing files that changed from the base of the PR and between 7f2dca2 and 80f86d1.

📒 Files selected for processing (6)

examples/llm_ptq/hf_ptq.py
modelopt/torch/quantization/_auto_quantize_cost.py
modelopt/torch/quantization/algorithms.py
modelopt/torch/quantization/model_quant.py
tests/unit/torch/quantization/test_autoquant.py
tests/unit/torch/quantization/test_config_validation.py

✅ Files skipped from review due to trivial changes (1)

modelopt/torch/quantization/model_quant.py

🚧 Files skipped from review as they are similar to previous changes (4)

tests/unit/torch/quantization/test_config_validation.py
modelopt/torch/quantization/_auto_quantize_cost.py
tests/unit/torch/quantization/test_autoquant.py
modelopt/torch/quantization/algorithms.py

coderabbitai · 2026-06-03T22:16:31Z

+    parser.add_argument(
+        "--auto_quantize_cost_exclude_patterns",
+        nargs="+",
+        default=None,
+        help=(
+            "Wildcard module-name patterns to exclude from AutoQuantize effective-bits cost "
+            "accounting. The matched modules can still be disabled from quantization separately; "
+            "this flag only changes the budget denominator and selected-cost calculation. "
+            "For multimodal models, VLM/MTP sibling modules are excluded by default."
+        ),
+    )


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject this flag when AutoQuantize is off.

--auto_quantize_cost_exclude_patterns is only consumed inside auto_quantize(), so plain PTQ and recipe runs silently ignore it. Adding a parser error when --auto_quantize_bits is unset would catch misconfigured jobs early.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llm_ptq/hf_ptq.py` around lines 1462 - 1472, Reject --auto_quantize_cost_exclude_patterns when AutoQuantize is off by validating args after parsing: check if the parsed flag auto_quantize_cost_exclude_patterns is set while auto_quantize_bits (or whatever flag controls enabling AutoQuantize) is unset/None, and raise a parser error (or call parser.error) reporting that --auto_quantize_cost_exclude_patterns requires --auto_quantize_bits. Place this validation near the argument parsing / main setup (after parser.parse_args()) so auto_quantize() logic remains unchanged and misconfigured runs fail fast.
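
A minimal sketch of the fail-fast check described above, assuming only the two flags named in this comment. The `build_parser` and `parse_args` helpers here are illustrative stand-ins, not the actual functions in hf_ptq.py, and the real parser defines many more arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser with only the two flags discussed in this comment.
    parser = argparse.ArgumentParser()
    parser.add_argument("--auto_quantize_bits", type=float, default=None)
    parser.add_argument("--auto_quantize_cost_exclude_patterns", nargs="+", default=None)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # The exclude patterns are only consumed by the AutoQuantize path, so
    # reject them up front when AutoQuantize is not enabled.
    if args.auto_quantize_cost_exclude_patterns and args.auto_quantize_bits is None:
        parser.error("--auto_quantize_cost_exclude_patterns requires --auto_quantize_bits")
    return args
```

`parser.error` prints the usage message to stderr and exits with status 2, so a misconfigured job stops before any model is loaded.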

coderabbitai

♻️ Duplicate comments (2)

examples/llm_ptq/hf_ptq.py (2)

153-163: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate Qwen-disabled patterns to Qwen models only.

At Line 160, _QWEN36_AUTOQ_DISABLED_LAYERS is always appended, so non-Qwen models can be silently over-excluded during AutoQuant search.

Suggested minimal fix

 def get_auto_quantize_disabled_layers(model) -> list[str]:
@@
-    disabled_layers.extend(p for p in _QWEN36_AUTOQ_DISABLED_LAYERS if p not in disabled_layers)
+    model_type = getattr(getattr(model, "config", None), "model_type", "") or ""
+    if "qwen" in model_type.lower():
+        disabled_layers.extend(p for p in _QWEN36_AUTOQ_DISABLED_LAYERS if p not in disabled_layers)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/hf_ptq.py` around lines 153 - 163,
get_auto_quantize_disabled_layers currently always appends
_QWEN36_AUTOQ_DISABLED_LAYERS which can over-exclude non-Qwen models; change it
to only extend with _QWEN36_AUTOQ_DISABLED_LAYERS when the input model is
actually a Qwen model (e.g., guard with an existing is_qwen36_model(model) or
add a small check on model.config or model.name to detect Qwen), leaving the
multimodal branch for _VLM_AUTOQ_DISABLED_LAYERS unchanged; update the logic in
get_auto_quantize_disabled_layers and use the symbol
_QWEN36_AUTOQ_DISABLED_LAYERS in the guarded branch.
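
The gating can be tried in isolation with stand-in model objects. This is a sketch under stated assumptions: the two patterns are the Qwen-only ones named in this review thread and may not be the full list in the PR, the `base_patterns` parameter and the `_is_qwen_model` helper are hypothetical, and the `model_type` strings are only examples.

```python
from types import SimpleNamespace

# Qwen-only patterns as named in the review thread (possibly incomplete).
_QWEN36_AUTOQ_DISABLED_LAYERS = ["*shared_expert_gate*", "*linear_attn.in_proj_*"]


def _is_qwen_model(model) -> bool:
    # Hypothetical detector: looks at config.model_type, tolerating a missing config.
    model_type = getattr(getattr(model, "config", None), "model_type", "") or ""
    return "qwen" in model_type.lower()


def get_auto_quantize_disabled_layers(model, base_patterns) -> list[str]:
    # base_patterns stands in for the defaults built elsewhere in the script.
    disabled_layers = list(base_patterns)
    if _is_qwen_model(model):
        disabled_layers.extend(
            p for p in _QWEN36_AUTOQ_DISABLED_LAYERS if p not in disabled_layers
        )
    return disabled_layers


llama = SimpleNamespace(config=SimpleNamespace(model_type="llama"))
qwen = SimpleNamespace(config=SimpleNamespace(model_type="qwen3_moe"))
print(get_auto_quantize_disabled_layers(llama, ["*lm_head*"]))
print(get_auto_quantize_disabled_layers(qwen, ["*lm_head*"]))
```

With this guard a non-Qwen model keeps only the base patterns, while a Qwen model additionally receives the Qwen-specific ones.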

583-587: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Nemotron VL + AutoQuant still hits an immediate hard-fail path.

This new gate keeps the outer model for AutoQuant, but Nemotron VL still defaults args.calib_with_images=True earlier, and auto_quantize() rejects image calibration at Line 332-336. The result is AutoQuant aborting before search starts.

Suggested minimal fix

-    if is_nemotron_vl_model and not args.calib_with_images:
+    if (
+        is_nemotron_vl_model
+        and not args.calib_with_images
+        and args.auto_quantize_bits is None
+    ):
         print("Nemotron VL model detected. Enabling image-text calibration by default.")
         args.calib_with_images = True

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/hf_ptq.py` around lines 583 - 587, AutoQuant aborts because
Nemotron VL defaults args.calib_with_images=True but auto_quantize() rejects
image calibration; before invoking auto_quantize (or before the code path that
keeps the outer model), detect when args.auto_quantize_bits is set and disable
image calibration by setting args.calib_with_images = False (or pass an
allow_image_calib=False flag into auto_quantize) so auto_quantize() will not hit
the image-calibration rejection; update the code around
extract_and_prepare_language_model_from_vl and the auto_quantize() call to
enforce this change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7226482a-2f36-4775-a3d4-87e15780d5b1

📥 Commits

Reviewing files that changed from the base of the PR and between 80f86d1 and ec996da.

📒 Files selected for processing (6)

examples/llm_ptq/hf_ptq.py
modelopt/torch/quantization/_auto_quantize_cost.py
modelopt/torch/quantization/algorithms.py
modelopt/torch/quantization/model_quant.py
tests/unit/torch/quantization/test_autoquant.py
tests/unit/torch/quantization/test_config_validation.py

✅ Files skipped from review due to trivial changes (1)

modelopt/torch/quantization/model_quant.py

🚧 Files skipped from review as they are similar to previous changes (4)

tests/unit/torch/quantization/test_config_validation.py
modelopt/torch/quantization/_auto_quantize_cost.py
tests/unit/torch/quantization/test_autoquant.py
modelopt/torch/quantization/algorithms.py

cjluo-nv

Bot review — DM the bot to share feedback.

Re-review of PR #1381 (AutoQuant for Qwen3.5/3.6 + VLM).

Previous critical comments — addressed:

💬 cjluo-nv's --recipe regression: gate is now if args.recipe is None and args.auto_quantize_bits is None: at hf_ptq.py:583 — recipe-mode VLMs again retain the outer CausalLM. ✅
💬 Duplicated _get_module_weight_numel: deduped — single canonical helper now lives in _auto_quantize_cost.py and is imported into algorithms.py. ✅
💬 _match_quantizer_cfg parent_class-skip behavior change: pinned with test_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup regression test. ✅
💬 lm_head test fragility: test now appends {quantizer_name: "*lm_head*", enable: False} to its own recipe (test_autoquant.py:695), so it no longer relies on _default_disabled_quantizer_cfg for the enable-check. The num_bits == (4, 3) assertion still implicitly trusts FP8_DEFAULT_CFG's weight-quantizer shape, but that's a minor coupling.

Outstanding for human sign-off:

Nemotron VL + --auto_quantize_bits still hard-fails before search. load_model unconditionally sets args.calib_with_images = True for Nemotron VL (hf_ptq.py ~L471-473), and auto_quantize() then raises NotImplementedError("AutoQuantize with image-text calibration is not supported yet.") at L332-336. CodeRabbit marked this "✅ Addressed in commit ec996da" but the suggested guard (and args.auto_quantize_bits is None) is not in the current file. The PR's target architecture is Qwen3.5/3.6 (validated end-to-end per the PR body), so this likely doesn't break the shipped use case, but it should either be fixed or explicitly scoped out.
Hardcoded VLM/Qwen disable patterns in hf_ptq.py. Acknowledged via TODO and inline reply chain (juhi10071998 to follow up next week). The Qwen-only patterns (*shared_expert_gate*, *linear_attn.in_proj_*) are appended for every model, but in practice these names don't appear on non-Qwen architectures, so the over-broadening is benign. Worth confirming that's acceptable until the config-system refactor lands.
The new get_auto_quantize_disabled_layers / get_auto_quantize_cost_excluded_patterns helpers in examples/llm_ptq/hf_ptq.py are public-named but only used locally; minor.

Tests for the new behavior (cost-model exclusion, fused MoE weight accounting, lm_head override, parent_class skip) are present and meaningful. Recommend cjluo-nv take a final look at the Nemotron VL gating before approval.

cjluo-nv

Bot review — DM the bot to share feedback.

Re-review: previous critical comments are addressed. Recommending nudge for human sign-off due to remaining surface-area items, not unresolved bugs.

Previously-flagged criticals — all resolved:

💬 --recipe regression: gate at hf_ptq.py L607 is now if args.recipe is None and args.auto_quantize_bits is None:. ✅
💬 Duplicated _get_module_weight_numel: helper now lives canonically in _auto_quantize_cost.py and is imported by algorithms.py. ✅
💬 _match_quantizer_cfg parent_class skip: pinned by test_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup. ✅
💬 lm_head test fragility: test now appends an explicit *lm_head*: enable=False entry to its own recipe rather than relying on FP8_DEFAULT_CFG defaults. ✅
💬 Nemotron VL + AutoQuantize hard-fail: parse_args() now errors on --auto_quantize_bits + --calib_with_images, and load_model no longer auto-enables image calibration when args.auto_quantize_bits is set. New tests/examples/llm_ptq/test_hf_ptq_args.py regression-tests both. ✅
💬 Qwen-only patterns over-broadening: _get_auto_quantize_disabled_layers now gates _QWEN36_AUTOQ_DISABLED_LAYERS behind _is_qwen_model(model), with a unit test that asserts they are absent for Llama. ✅

Outstanding for human sign-off:

Public-API surface continues to widen: new EXCLUDED_MODULE_NAME_PATTERNS_KEY, w4a16_nvfp4 qformat choice in both model_quant.auto_quantize docstring and hf_ptq.py choice list, new disabled_layers in default_state_dict (loaded from search checkpoints).
Hardcoded _QWEN36_AUTOQ_DISABLED_LAYERS / _VLM_AUTOQ_DISABLED_LAYERS in hf_ptq.py still have a TODO: Refactor into the config system (juhi10071998 to follow up). Worth confirming the hardcoded list is acceptable for the v1 ship.
End-to-end Qwen3.6 35B + fp8,w4a16_nvfp4 and vLLM-loadable export were validated locally per the PR body, but there is no CI coverage for that path; recommend cjluo-nv take a final look before merge.

cjluo-nv · 2026-06-05T21:44:23Z

 mto.enable_huggingface_checkpointing()


+# TODO: Refactor into the config system.


can we move model specific code to example_utils.py?

Addressed in 0cc2bef: moved the Qwen/VLM AutoQuant model-specific helper logic out of hf_ptq.py and into example_utils.py. hf_ptq.py now imports those helpers from example_utils.

cjluo-nv · 2026-06-05T21:46:05Z

 from .utils import is_quantized_linear


+def _is_fused_experts_module(module: nn.Module) -> bool:


why not move it to huggingface.py?

Addressed in 0cc2bef: moved the HF quantized fused-experts wrapper check into plugins/huggingface.py. algorithms.py now calls it through a late import to avoid the existing registration import cycle.

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>

realAsma

Why do we need the cost_excluded_patterns?
It seems to add complexity without real benefits.

meenchen · 2026-06-05T23:40:40Z

Why do we need the cost_excluded_patterns? It seems to add complexity without real benefits.

Without this, AutoQuant counts the MTP and vision-tower modules in the cost function, which makes it hard for users to set an effective-bits target for the LLM part alone.
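
A toy illustration of that effect, not the modelopt cost model: the module names, weight counts, and per-module bit widths below are made up, and `effective_bits` is a hypothetical helper that computes a weight-count-weighted average over modules not matched by an exclude pattern.

```python
from fnmatch import fnmatchcase


def effective_bits(module_numel, chosen_bits, exclude_patterns=()):
    """Average bits per weight over modules not matched by exclude_patterns."""
    total_bits = 0.0
    total_numel = 0
    for name, numel in module_numel.items():
        if any(fnmatchcase(name, pattern) for pattern in exclude_patterns):
            continue  # excluded modules leave both numerator and denominator
        total_bits += numel * chosen_bits[name]
        total_numel += numel
    return total_bits / total_numel


# Hypothetical model: two LLM blocks plus an unquantized vision block.
module_numel = {
    "model.language_model.layers.0.mlp": 600,
    "model.language_model.layers.1.mlp": 200,
    "model.visual.blocks.0.mlp": 200,
}
chosen_bits = {
    "model.language_model.layers.0.mlp": 4,
    "model.language_model.layers.1.mlp": 8,
    "model.visual.blocks.0.mlp": 16,
}

print(effective_bits(module_numel, chosen_bits))                 # whole model
print(effective_bits(module_numel, chosen_bits, ["*visual*"]))   # LLM part only
```

In this toy setup the LLM blocks average 5.0 bits, but the whole-model figure is 7.2 because the 16-bit vision block is included, so a 5.0-bit budget would mean different things depending on whether the sibling modules are counted.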

realAsma · 2026-06-05T23:46:47Z

+        for pattern in _as_list(search_state.get("disabled_layers"))
+    )
+    per_module_entries: list[dict] = []
    _per_module_attrs = ("input_quantizer", "weight_quantizer", "output_quantizer")


get_auto_quantize_config() still emits only standard quantizer attrs, so fused-expert AutoQuant choices may not replay correctly from generated configs. Can we fix it?

🤖 Bot comment.

Yes, I think we should fix it. The issue is specifically in the replay config, not in the live searched model.

During search this PR now snapshots/restores fused experts through the real fused attrs:

gate_up_proj_input_quantizer

gate_up_proj_weight_quantizers

down_proj_input_quantizer

down_proj_weight_quantizers

But get_auto_quantize_config() still uses only:

("input_quantizer", "weight_quantizer", "output_quantizer")

So if a fused expert wins a recipe, the generated config emits entries like:

{"quantizer_name": "layers.N.mlp.experts.weight_quantizer", ...}

That will not match the actual fused-expert quantizers on replay. Their real names are shaped like layers.N.mlp.experts.gate_up_proj_weight_quantizers.0 / down_proj_weight_quantizers.0. conversion.py normalizes those to ...gate_up_proj_weight_quantizer / ...down_proj_weight_quantizer, not to plain ...weight_quantizer. Because the generated config starts with {"quantizer_name": "*", "enable": False}, the selected fused-expert quantizers can remain disabled/default when someone calls mtq.get_auto_quantize_config(search_state) and replays it through mtq.quantize() or an export flow.

Suggested fix: persist config-replay quantizer attrs into candidate_stats while the model objects are still available, then have get_auto_quantize_config() use those attrs, defaulting old search states to the standard trio for backward compatibility. For fused experts, the replay attrs should be something like:

(
    "gate_up_proj_input_quantizer",
    "gate_up_proj_weight_quantizer",  # singular form matches all ModuleList children via conversion normalization
    "down_proj_input_quantizer",
    "down_proj_weight_quantizer",
)

A regression test should exercise get_auto_quantize_config() for a fused expert hparam and assert the generated quant_cfg contains the gate/up and down fused-expert quantizer names, not only module.weight_quantizer.

Fixed in 613c82f. This was a real replay-path bug: live AutoQuant search/export used the fused-expert quantizer attrs correctly, but get_auto_quantize_config() could emit plain *.weight_quantizer entries for fused experts. The fix persists replayable quantizer attrs in candidate_stats and emits gate_up_proj/down_proj quantizer names, with a fallback for older search checkpoints. Added regression coverage in test_get_auto_quantize_config_emits_fused_expert_quantizer_names; focused AutoQuant/fused-expert tests and pre-commit passed.
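
The name mismatch behind this bug can be reproduced with plain string matching. The sketch below is an assumption-laden stand-in: `normalize_quantizer_name` is a hypothetical helper that mimics the normalization described in this thread (a per-expert `..._quantizers.<idx>` name collapsing to the singular `..._quantizer`), and `fnmatchcase` stands in for the config matching; the actual logic in conversion.py may differ.

```python
import re
from fnmatch import fnmatchcase


def normalize_quantizer_name(name: str) -> str:
    # Hypothetical: "...weight_quantizers.3" -> "...weight_quantizer"
    return re.sub(r"_quantizers\.\d+$", "_quantizer", name)


live_name = "layers.0.mlp.experts.gate_up_proj_weight_quantizers.0"
normalized = normalize_quantizer_name(live_name)

# Entry shape emitted before the fix vs. the fused-expert shape emitted after it.
old_entry = "layers.0.mlp.experts.weight_quantizer"
new_entry = "layers.0.mlp.experts.gate_up_proj_weight_quantizer"

print(normalized)
print(fnmatchcase(normalized, old_entry))  # plain weight_quantizer entry
print(fnmatchcase(normalized, new_entry))  # fused-expert entry
```

Under these assumptions the plain `weight_quantizer` entry never matches the fused-expert quantizer, so only the catch-all `{"quantizer_name": "*", "enable": False}` entry would apply to it on replay.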

realAsma

Looks great!

I am not convinced that the benefits of EXCLUDED_MODULE_NAME_PATTERNS_KEY justify the complexity it adds. However, I don't have a strong preference.

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>

meenchen changed the title ~~audoquantize for VLM~~ AutoQuant for VLM May 1, 2026

meenchen force-pushed the weimingc/autoquat_qwen3p6 branch 2 times, most recently from a6aa497 to 5de4432 Compare May 4, 2026 21:16

meenchen force-pushed the weimingc/autoquat_qwen3p6 branch 5 times, most recently from 2a21f66 to 2828faa Compare June 3, 2026 16:59

meenchen changed the title ~~AutoQuant for VLM~~ Adds AutoQuant support for VLM / Qwen3.5-Qwen3.6 style models Jun 3, 2026

meenchen marked this pull request as ready for review June 3, 2026 17:30

meenchen requested review from a team as code owners June 3, 2026 17:30

meenchen requested review from cjluo-nv, juhi10071998, realAsma and sugunav14 June 3, 2026 17:30

meenchen self-assigned this Jun 3, 2026

meenchen commented Jun 3, 2026
