Commit 1f4a489

Adds AutoQuant support for VLM / Qwen3.5-Qwen3.6 style models (#1381)
### What does this PR do?

Type of change: new feature, bug fix, new tests

### Details

- Enables AutoQuant search over fused MoE expert containers by snapshotting/restoring their per-expert quantizers.
- Adds Qwen3.5/3.6 linear-attention grouping rules so fused deployment layers keep compatible quant formats.
- Supports `w4a16_nvfp4` as an AutoQuant search format.
- Preserves disabled AutoQuant layer patterns in generated configs while allowing selected modules like `lm_head` to override default disables.
- Keeps recipe-mode and AutoQuantize VLM paths on the outer CausalLM so Qwen3.5/3.6-MoE `lm_head` remains visible.
- Skips `parent_class`-scoped quant config entries during AutoQuant bare quantizer matching, preventing class-scoped global entries from last-match overriding every selected module.
- Adds temporary hardcoded Qwen/VLM AutoQuant disabled-layer patterns in `hf_ptq.py` with a TODO to refactor into the config system.

### Usage

```bash
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model_path> \
    --qformat fp8,w4a16_nvfp4 \
    --auto_quantize_bits 5.0 \
    --auto_quantize_cost_model active_moe \
    --auto_quantize_checkpoint <autoquant_state.pt> \
    --export_path <output_dir>
```

### Testing

- `/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest tests/unit/torch/quantization/test_autoquant.py::test_get_auto_quantize_config_keeps_selected_lm_head_enabled tests/unit/torch/quantization/test_config_validation.py::TestMatchQuantizerCfg::test_parent_class_scoped_entries_are_ignored_for_bare_autoquant_lookup`
- `/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest tests/unit/torch/quantization/test_autoquant.py tests/unit/torch/quantization/test_config_validation.py -k "not data_parallel"` (`120 passed, 1 deselected`)
- `/Users/weimingc/miniconda3/envs/modelopt/bin/python -m py_compile examples/llm_ptq/hf_ptq.py modelopt/torch/quantization/algorithms.py modelopt/torch/quantization/_auto_quantize_cost.py tests/unit/torch/quantization/test_autoquant.py tests/unit/torch/quantization/test_config_validation.py`
- Full local affected-file pytest without `-k "not data_parallel"` only failed `test_data_parallel_auto_quantize` because this local sandbox cannot bind a free socket (`PermissionError: Operation not permitted`).
- Ran Qwen3.6 35B AutoQuant e2e with `fp8,w4a16_nvfp4` and exported a checkpoint.
- Verified exported checkpoint loads in vLLM nightly without local patches.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added w4a16_nvfp4 quantization format and optional cost-exclusion patterns for AutoQuantize.
* **Improvements**
  * Safer multimodal/VLM handling and AutoQuantize now runs on the full outer model when applicable.
  * Better fused-MoE support, more accurate weight accounting, and refined attention-grouping for improved quantization choices.
  * Dynamic layer-disabling support for targeted disables.
* **Tests**
  * New unit tests covering cost-model exclusions, fused-MoE accounting, and config selection.
* **Documentation**
  * Updated cost-constraint example to show exclusion-pattern usage.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
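The `parent_class` item under Details is easier to follow with a toy lookup. The sketch below is not ModelOpt's implementation: `match_bare_quantizer` and the sample entries are hypothetical, and only the entry keys `quantizer_name` and `parent_class` appear in this commit's diff. It assumes `fnmatch`-style wildcards and last-match-wins semantics, as the description implies.

```python
import fnmatch

# Hypothetical config entries; only the key names come from the diff.
entries = [
    {"quantizer_name": "*weight_quantizer", "enable": True},
    {"quantizer_name": "*lm_head*", "enable": False},
    # Class-scoped entry: intended for one module class, not for every module.
    {"quantizer_name": "*", "parent_class": "nn.BatchNorm2d", "enable": False},
]


def match_bare_quantizer(name, entries, skip_class_scoped=True):
    """Return the last entry whose pattern matches ``name`` (last match wins)."""
    matched = None
    for entry in entries:
        if skip_class_scoped and "parent_class" in entry:
            continue  # a bare name carries no parent class to compare against
        if fnmatch.fnmatch(name, entry["quantizer_name"]):
            matched = entry
    return matched


name = "model.layers.0.mlp.down_proj.weight_quantizer"
with_skip = match_bare_quantizer(name, entries)
without_skip = match_bare_quantizer(name, entries, skip_class_scoped=False)
print(with_skip["enable"], without_skip["enable"])
```

Without the skip, the trailing class-scoped `"*"` entry matches every bare name and overrides the earlier, more specific entry, which is the failure mode the commit message describes.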
1 parent 1555e6d commit 1f4a489

9 files changed

Lines changed: 556 additions & 50 deletions

examples/llm_ptq/example_utils.py

Lines changed: 55 additions & 1 deletion
@@ -42,6 +42,9 @@
     ProcessorMixin,
 )

+from modelopt.torch.export.model_utils import is_multimodal_model
+from modelopt.torch.quantization.config import _default_disabled_quantizer_cfg
+
 try:
     from huggingface_hub import snapshot_download
 except ImportError:
@@ -51,6 +54,58 @@

 SPECULATIVE_MODEL_LIST = ["Eagle", "Medusa"]

+# TODO: Refactor into the config system.
+_QWEN36_AUTOQ_DISABLED_LAYERS = (
+    "*shared_expert_gate*",
+    "*linear_attn.in_proj_a*",
+    "*linear_attn.in_proj_b*",
+)
+_VLM_AUTOQ_DISABLED_LAYERS = ("*visual*", "*mtp*", "*vision_tower*")
+
+
+def _is_qwen_model(model) -> bool:
+    """Return True when model/config identifiers indicate a Qwen-family model."""
+    candidates = [type(model).__name__]
+    config = getattr(model, "config", None)
+    configs = [
+        config,
+        getattr(config, "text_config", None),
+        getattr(config, "language_config", None),
+    ]
+    for cfg in configs:
+        if cfg is None:
+            continue
+        candidates.append(type(cfg).__name__)
+        model_type = getattr(cfg, "model_type", None)
+        if model_type is not None:
+            candidates.append(str(model_type))
+        architectures = getattr(cfg, "architectures", ()) or ()
+        if isinstance(architectures, str):
+            architectures = (architectures,)
+        candidates.extend(str(architecture) for architecture in architectures)
+    return any("qwen" in candidate.lower() for candidate in candidates)
+
+
+def _get_auto_quantize_disabled_layers(model) -> list[str]:
+    """Return layer patterns that should be excluded from AutoQuantize search."""
+    disabled_layers = [
+        entry["quantizer_name"]
+        for entry in _default_disabled_quantizer_cfg
+        if "parent_class" not in entry and entry["quantizer_name"] != "*lm_head*"
+    ]
+    if _is_qwen_model(model):
+        disabled_layers.extend(p for p in _QWEN36_AUTOQ_DISABLED_LAYERS if p not in disabled_layers)
+    if is_multimodal_model(model):
+        disabled_layers.extend(p for p in _VLM_AUTOQ_DISABLED_LAYERS if p not in disabled_layers)
+    return disabled_layers
+
+
+def _get_auto_quantize_cost_excluded_patterns(model) -> list[str]:
+    """Return layer patterns excluded only from AutoQuantize cost accounting."""
+    if is_multimodal_model(model):
+        return list(_VLM_AUTOQ_DISABLED_LAYERS)
+    return []
+

 def run_nemotron_vl_preview(
     full_model,
@@ -133,7 +188,6 @@ def is_nemotron_vl(model_or_config):
     # Try to get config from model, or use directly if it's a config
     if hasattr(model_or_config, "config"):
         config = model_or_config.config
-        from modelopt.torch.export.model_utils import is_multimodal_model

         if not is_multimodal_model(model_or_config):
             return False
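To see what the new helpers in `example_utils.py` produce, here is a minimal stand-alone sketch. It is an approximation under stated assumptions: `DEFAULT_DISABLED` is an illustrative stand-in for ModelOpt's `_default_disabled_quantizer_cfg`, `looks_like_qwen` is a simplified version of `_is_qwen_model`, the `model_type` value is made up, and multimodality is passed in as a flag rather than detected with `is_multimodal_model`.

```python
from types import SimpleNamespace

# Illustrative stand-in for ModelOpt's `_default_disabled_quantizer_cfg`.
DEFAULT_DISABLED = [
    {"quantizer_name": "*lm_head*", "enable": False},
    {"quantizer_name": "*router*", "enable": False},
    {"quantizer_name": "*", "parent_class": "nn.BatchNorm1d", "enable": False},
]
QWEN_PATTERNS = ("*shared_expert_gate*", "*linear_attn.in_proj_a*", "*linear_attn.in_proj_b*")
VLM_PATTERNS = ("*visual*", "*mtp*", "*vision_tower*")


def looks_like_qwen(model) -> bool:
    """Simplified check: class name or config.model_type contains 'qwen'."""
    config = getattr(model, "config", None)
    names = [type(model).__name__, str(getattr(config, "model_type", ""))]
    return any("qwen" in n.lower() for n in names)


def disabled_layers(model, is_multimodal: bool) -> list[str]:
    """Drop class-scoped entries and *lm_head*, then append model-specific patterns."""
    patterns = [
        e["quantizer_name"]
        for e in DEFAULT_DISABLED
        if "parent_class" not in e and e["quantizer_name"] != "*lm_head*"
    ]
    if looks_like_qwen(model):
        patterns.extend(p for p in QWEN_PATTERNS if p not in patterns)
    if is_multimodal:
        patterns.extend(p for p in VLM_PATTERNS if p not in patterns)
    return patterns


# Hypothetical model_type; any identifier containing "qwen" triggers the Qwen patterns.
toy_model = SimpleNamespace(config=SimpleNamespace(model_type="qwen3_5_moe"))
result = disabled_layers(toy_model, is_multimodal=True)
plain = disabled_layers(SimpleNamespace(config=None), is_multimodal=False)
print(result)
print(plain)
```

Leaving `*lm_head*` out of the disabled list is what lets the search consider `lm_head`, while the class-scoped entry is dropped because the list holds bare name patterns only.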

examples/llm_ptq/hf_ptq.py

Lines changed: 28 additions & 18 deletions
@@ -27,6 +27,8 @@
 from cast_mxfp4_to_nvfp4 import apply_to_model as apply_cast_mxfp4_to_nvfp4
 from cast_mxfp4_to_nvfp4 import force_weight_quantizers_static
 from example_utils import (
+    _get_auto_quantize_cost_excluded_patterns,
+    _get_auto_quantize_disabled_layers,
     build_quant_cfg,
     copy_custom_model_files,
     create_vlm_calibration_loop,
@@ -72,7 +74,8 @@
     save_expert_token_count_table,
 )
 from modelopt.torch.export.model_utils import get_language_model_from_vl, is_multimodal_model
-from modelopt.torch.quantization.config import _default_disabled_quantizer_cfg, need_calibration
+from modelopt.torch.quantization._auto_quantize_cost import EXCLUDED_MODULE_NAME_PATTERNS_KEY
+from modelopt.torch.quantization.config import need_calibration
 from modelopt.torch.quantization.plugins.accelerate import init_quantized_weights
 from modelopt.torch.quantization.utils import is_quantized
 from modelopt.torch.speculative.eagle.utils import (
@@ -132,6 +135,7 @@ def _kv_cfg_uses_constant_amax(kv_quant_cfg: list[dict[str, Any]]) -> bool:
     "nvfp4_awq_lite",
     "nvfp4_w4a4_weight_mse_fp8_sweep",
     "w4a8_awq_beta",
+    "w4a16_nvfp4",
     "fp8_2d_blockwise_weight_only",
     "w4a8_mxfp4_fp8",
     "nvfp4_mlp_only",
@@ -387,10 +391,14 @@ def forward_step(model, batch):
         "effective_bits": args.auto_quantize_bits,
         "cost_model": args.auto_quantize_cost_model,
     }
+    auto_quantize_cost = {}
     if args.auto_quantize_active_moe_expert_ratio is not None:
-        auto_quantize_constraints["cost"] = {
-            "active_moe_expert_ratio": args.auto_quantize_active_moe_expert_ratio
-        }
+        auto_quantize_cost["active_moe_expert_ratio"] = args.auto_quantize_active_moe_expert_ratio
+    cost_excluded_patterns = _get_auto_quantize_cost_excluded_patterns(language_model)
+    if cost_excluded_patterns:
+        auto_quantize_cost[EXCLUDED_MODULE_NAME_PATTERNS_KEY] = cost_excluded_patterns
+    if auto_quantize_cost:
+        auto_quantize_constraints["cost"] = auto_quantize_cost

     language_model, _ = mtq.auto_quantize(
         language_model,
@@ -406,12 +414,7 @@ def forward_step(model, batch):
             len(calib_dataloader), max(auto_quantize_score_size // args.batch_size, 1)
         ),
         verbose=True,
-        # Disable all default disabled layers such as lm_head, mlp.gate, router etc.
-        disabled_layers=[
-            entry["quantizer_name"]
-            for entry in _default_disabled_quantizer_cfg
-            if "parent_class" not in entry
-        ],
+        disabled_layers=_get_auto_quantize_disabled_layers(language_model),
         method=auto_quantize_method,
         checkpoint=auto_quantize_checkpoint,
     )
@@ -487,7 +490,7 @@ def load_model(args: argparse.Namespace):
     is_nemotron_vl_model = is_nemotron_vl(full_model)

     # Default to image-text calibration for VLM models
-    if is_nemotron_vl_model and not args.calib_with_images:
+    if is_nemotron_vl_model and not args.calib_with_images and args.auto_quantize_bits is None:
         print("Nemotron VL model detected. Enabling image-text calibration by default.")
         args.calib_with_images = True

@@ -539,12 +542,10 @@ def load_model(args: argparse.Namespace):
             : len(args.dataset)
         ]

-    # We only quantize the language model for VLMs other than the type supported above.
-    # Recipe mode is the exception: in Qwen3.5/3.6-MoE VLMs, lm_head sits
-    # on the outer CausalLM, not the inner language backbone. A recipe that targets
-    # lm_head must therefore quantize against the full model and explicitly keep visual
-    # and MTP siblings disabled.
-    if args.recipe is None:
+    # Plain PTQ quantizes only the extracted language model. Recipe and
+    # AutoQuantize paths keep the outer CausalLM so recipes/search can see
+    # Qwen3.5/3.6-MoE VLM lm_head.
+    if args.recipe is None and args.auto_quantize_bits is None:
         extracted_lm, extracted_model_type = extract_and_prepare_language_model_from_vl(
             full_model
         )
@@ -1070,9 +1071,16 @@ def _is_layerwise(obj):
             "Auto quantization needs multiple quantization format."
         )

+        # For VL models, autoquant must walk submodules of the OUTER CausalLM
+        # (which carries lm_head and the LM-head forward path) — otherwise
+        # lm_head and any sibling-of-language_model modules are silently
+        # invisible to the search. ``forward_step`` also needs the outer model
+        # to produce ``CausalLMOutputWithPast`` (for ``.loss`` / ``.logits``).
+        # Visual tower and MTP siblings are auto-excluded inside
+        # ``auto_quantize()`` via *visual* / *mtp* / *vision_tower* patterns.
         auto_quantize(
             args,
-            language_model,
+            full_model,
             calib_dataloader,
             auto_quantize_method=args.auto_quantize_method,
             auto_quantize_score_size=args.auto_quantize_score_size,
@@ -1437,6 +1445,8 @@ def parse_args() -> argparse.Namespace:
     args = parser.parse_args()
     if args.moe_calib_experts_ratio is not None and not (0.0 < args.moe_calib_experts_ratio <= 1.0):
         parser.error("--moe_calib_experts_ratio must be in the range (0.0, 1.0].")
+    if args.auto_quantize_bits is not None and args.calib_with_images:
+        parser.error("--calib_with_images is not supported with --auto_quantize_bits.")
     if args.auto_quantize_active_moe_expert_ratio is not None and not (
         0.0 < args.auto_quantize_active_moe_expert_ratio <= 1.0
     ):
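The constraint-building change in `hf_ptq.py` can be read as a small pure function. The sketch below is a paraphrase, not the script's code: `build_constraints` is a hypothetical name, the key strings are the values of the constants shown in this commit, and the `0.25` ratio is purely illustrative.

```python
def build_constraints(effective_bits, cost_model, active_ratio=None, excluded_patterns=()):
    """Assemble an AutoQuantize constraints dict the way the updated script does."""
    constraints = {"effective_bits": effective_bits, "cost_model": cost_model}
    cost = {}
    if active_ratio is not None:
        cost["active_moe_expert_ratio"] = active_ratio
    if excluded_patterns:
        cost["excluded_module_name_patterns"] = list(excluded_patterns)
    if cost:  # omit the "cost" key entirely when there is nothing to pass
        constraints["cost"] = cost
    return constraints


text_only = build_constraints(5.0, "weight")
vlm_moe = build_constraints(
    5.0,
    "active_moe",
    active_ratio=0.25,
    excluded_patterns=("*visual*", "*mtp*", "*vision_tower*"),
)
print(text_only)
print(vlm_moe)
```

Collecting both options into one dict before assigning `constraints["cost"]` means the expert ratio and the exclusion patterns can be set independently, which the previous single-purpose assignment did not allow.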

modelopt/torch/quantization/_auto_quantize_cost.py

Lines changed: 54 additions & 4 deletions
@@ -15,6 +15,7 @@

 """Cost models for AutoQuantize effective-bits accounting."""

+import fnmatch
 from collections.abc import Callable, Iterable, Sequence
 from typing import Any, Final

@@ -27,6 +28,7 @@

 AUTO_QUANTIZE_CONSTRAINT_KEYS: Final = frozenset({"effective_bits", "cost_model", "cost"})
 ACTIVE_MOE_EXPERT_RATIO_KEY: Final = "active_moe_expert_ratio"
+EXCLUDED_MODULE_NAME_PATTERNS_KEY: Final = "excluded_module_name_patterns"
 COST_MODEL_WEIGHT: Final = "weight"
 COST_MODEL_ACTIVE_MOE: Final = "active_moe"

@@ -90,11 +92,31 @@ def is_routed_moe_module_name(name: str) -> bool:
     return "shared_expert" not in name and _ROUTED_MOE_EXPERT_NAME_RE.search(name) is not None


+def _get_module_weight_numel(module: nn.Module) -> int:
+    """Return the parameter count for a module's quantizable weights.
+
+    Standard quantized linear modules have a single ``weight`` parameter. Fused
+    MoE expert containers expose projection tensors directly instead, so both
+    fused projections contribute to AutoQuantize cost accounting.
+    """
+    weight = getattr(module, "weight", None)
+    if weight is not None:
+        return weight.numel()
+
+    # Fused MoE expert containers expose projection tensors directly instead of
+    # a single ``weight`` parameter.
+    return sum(
+        param.numel()
+        for attr in ("gate_up_proj", "down_proj")
+        if (param := getattr(module, attr, None)) is not None
+    )
+
+
 class AutoQuantizeCostModel:
     """Base class for AutoQuantize effective-bits cost accounting."""

     name: str
-    supported_cost_keys: frozenset[str] = frozenset()
+    supported_cost_keys: frozenset[str] = frozenset({EXCLUDED_MODULE_NAME_PATTERNS_KEY})

     def normalize_cost_constraints(
         self, model: nn.Module, cost_constraints: dict[str, Any]
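The fallback in `_get_module_weight_numel` can be exercised without PyTorch. The following sketch copies the helper's logic onto plain objects; `tensor` is a hypothetical stand-in that only provides `numel()`, and the shapes are arbitrary numbers chosen for illustration.

```python
from types import SimpleNamespace


def tensor(numel):
    """Tiny stand-in for a parameter: only ``numel()`` is needed here."""
    return SimpleNamespace(numel=lambda: numel)


def weight_numel(module):
    """Count a single ``weight`` if present, else sum the fused projections."""
    weight = getattr(module, "weight", None)
    if weight is not None:
        return weight.numel()
    return sum(
        param.numel()
        for attr in ("gate_up_proj", "down_proj")
        if (param := getattr(module, attr, None)) is not None
    )


linear = SimpleNamespace(weight=tensor(4096 * 4096))
fused_experts = SimpleNamespace(
    gate_up_proj=tensor(8 * 2 * 64 * 32),  # 8 experts, illustrative dimensions
    down_proj=tensor(8 * 32 * 64),
)
empty = SimpleNamespace()
print(weight_numel(linear), weight_numel(fused_experts), weight_numel(empty))
```

A module with neither a `weight` nor the fused projection attributes contributes zero, so it does not raise the way the earlier `module.weight.numel()` call would have.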
@@ -103,12 +125,35 @@ def normalize_cost_constraints(
         unknown_cost_keys = set(cost_constraints) - self.supported_cost_keys
         if unknown_cost_keys:
             raise ValueError(f"Unsupported auto_quantize cost constraints: {unknown_cost_keys}.")
+        excluded_patterns = cost_constraints.get(EXCLUDED_MODULE_NAME_PATTERNS_KEY)
+        if excluded_patterns is None:
+            return cost_constraints
+        if isinstance(excluded_patterns, str):
+            excluded_patterns = [excluded_patterns]
+        if not isinstance(excluded_patterns, Sequence) or not all(
+            isinstance(pattern, str) for pattern in excluded_patterns
+        ):
+            raise ValueError(
+                f"constraints['cost']['{EXCLUDED_MODULE_NAME_PATTERNS_KEY}'] must be a string "
+                "or a sequence of strings."
+            )
+        cost_constraints[EXCLUDED_MODULE_NAME_PATTERNS_KEY] = list(excluded_patterns)
         return cost_constraints

     def module_cost_weight(
         self, module_names: Sequence[str], cost_constraints: dict[str, Any]
     ) -> float:
         """Return the cost multiplier for a group of modules."""
+        excluded_patterns = cost_constraints.get(EXCLUDED_MODULE_NAME_PATTERNS_KEY, [])
+        if (
+            module_names
+            and excluded_patterns
+            and all(
+                any(fnmatch.fnmatch(name, pattern) for pattern in excluded_patterns)
+                for name in module_names
+            )
+        ):
+            return 0.0
         return 1.0

     def total_weight_size(
@@ -119,7 +164,7 @@ def total_weight_size(
     ) -> float:
         """Return the cost denominator for the effective-bits constraint."""
         return sum(
-            module.weight.numel() * self.module_cost_weight([name], cost_constraints)
+            _get_module_weight_numel(module) * self.module_cost_weight([name], cost_constraints)
             for name, module in named_modules
             if is_auto_quantize_module(module)
         )
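A short worked example may help show how the exclusion patterns change the effective-bits denominator. This is a sketch under stated assumptions: `exclusion_weight` mirrors the condition added to `module_cost_weight`, while the module names and element counts are invented.

```python
import fnmatch


def exclusion_weight(module_names, excluded_patterns):
    """Return 0.0 only if every name in the group matches some excluded pattern."""
    if (
        module_names
        and excluded_patterns
        and all(
            any(fnmatch.fnmatch(name, pattern) for pattern in excluded_patterns)
            for name in module_names
        )
    ):
        return 0.0
    return 1.0


# Toy (name, weight element count) pairs; names and sizes are made up.
modules = [
    ("model.language_model.layers.0.mlp.down_proj", 1000),
    ("model.visual.blocks.0.attn.qkv", 400),
    ("lm_head", 600),
]
patterns = ["*visual*", "*mtp*", "*vision_tower*"]

# Denominator: excluded modules contribute zero elements.
total = sum(numel * exclusion_weight([name], patterns) for name, numel in modules)
# A group is excluded only when all of its members match.
mixed_group = exclusion_weight(["model.visual.blocks.0.attn.qkv", "lm_head"], patterns)
print(total, mixed_group)
```

Because the check uses `all(...)` over the group, a group that mixes an excluded module with a regular one keeps its full cost weight.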
@@ -135,7 +180,9 @@ class ActiveMoECostModel(AutoQuantizeCostModel):
     """Scale routed MoE expert weights by the active experts per-token ratio."""

     name = COST_MODEL_ACTIVE_MOE
-    supported_cost_keys = frozenset({ACTIVE_MOE_EXPERT_RATIO_KEY})
+    supported_cost_keys = frozenset(
+        {ACTIVE_MOE_EXPERT_RATIO_KEY, EXCLUDED_MODULE_NAME_PATTERNS_KEY}
+    )

     def normalize_cost_constraints(
         self, model: nn.Module, cost_constraints: dict[str, Any]
@@ -164,9 +211,12 @@ def normalize_cost_constraints(
     def module_cost_weight(
         self, module_names: Sequence[str], cost_constraints: dict[str, Any]
     ) -> float:
+        base_weight = super().module_cost_weight(module_names, cost_constraints)
+        if base_weight == 0.0:
+            return 0.0
         if any(is_routed_moe_module_name(n) for n in module_names):
             return cost_constraints[ACTIVE_MOE_EXPERT_RATIO_KEY]
-        return 1.0
+        return base_weight


 _COST_MODELS: Final = {
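The ordering in `ActiveMoECostModel.module_cost_weight` (exclusion first, then the routed-expert ratio) can be sketched in isolation. Note the assumptions: `is_routed_expert` is a simplified substring check standing in for the regex-based `is_routed_moe_module_name`, whose pattern is not shown in this diff, and the module names and the `0.125` ratio are illustrative.

```python
import fnmatch


def is_excluded(module_names, patterns):
    """True when the group is non-empty and every name matches an excluded pattern."""
    return bool(module_names and patterns) and all(
        any(fnmatch.fnmatch(n, p) for p in patterns) for n in module_names
    )


def is_routed_expert(name):
    # Simplified stand-in for the real regex-based check, which the diff does not show.
    return "shared_expert" not in name and ".experts" in name


def active_moe_weight(module_names, cost):
    """Exclusion wins first; routed experts are then scaled by the active ratio."""
    if is_excluded(module_names, cost.get("excluded_module_name_patterns", [])):
        return 0.0
    if any(is_routed_expert(n) for n in module_names):
        return cost["active_moe_expert_ratio"]
    return 1.0


cost = {"active_moe_expert_ratio": 0.125, "excluded_module_name_patterns": ["*mtp*"]}
routed = active_moe_weight(["model.layers.0.mlp.experts"], cost)
dense = active_moe_weight(["model.layers.0.mlp.shared_expert.up_proj"], cost)
excluded = active_moe_weight(["mtp.layers.0.mlp.experts"], cost)
print(routed, dense, excluded)
```

Checking exclusion before the routed-expert branch matters for modules such as MTP experts, which would otherwise be charged at the active ratio even though they are meant to be left out of cost accounting.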
