Commit 077e29a

[NVBug 6108145] Fix PTQ calibration and export for fused-experts MoE (Qwen3.5-MoE VLM) (#1340)
### What does this PR do?

Type of change: Bug fix

Fixes a 4-bug cascade that caused silent PTQ failure on Qwen3.5-MoE VLMs (Qwen3.6-35B-A3B): calibration appeared to succeed but produced token-salad at inference.

Root cause: HF's `@use_experts_implementation` dispatches the expert forward to `torch._grouped_mm` / `torch.bmm`, bypassing the `F.linear` hook that captures activations. As a result, `gate_up_proj_input_quantizer` / `down_proj_input_quantizer` never calibrated and no `input_scale` tensors were emitted.

Changes:

- `examples/llm_ptq/hf_ptq.py`: force `config._experts_implementation = "eager"` (recursing into `text_config` / `vision_config` / …) so per-expert `F.linear` calls are visible to the calibration hook.
- `modelopt/torch/quantization/conversion.py`: normalize plural ModuleList quantizer names (`weight_quantizers.N` → `weight_quantizer`) before `fnmatch`, so wildcards like `*mlp.experts*weight_quantizer` match fused-expert quantizers.
- `modelopt/torch/export/unified_export_hf.py`: hoist the `_QuantFusedExperts` export branch above the `get_quantization_format()` gate so `_export_fused_experts()` runs even when the top-level format query returns `QUANTIZATION_NONE` (happens for experts-only recipes).
- `modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml`: `layerwise: false` (VLM nested layer structure breaks the layerwise walker).

### Usage

```bash
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3.6-35B-A3B \
    --qformat nvfp4 \
    --kv_cache_qformat fp8 \
    --calib_size 512 \
    --export_path Qwen3.6-35B-A3B-NVFP4
```

### Testing

End-to-end PTQ → vLLM deploy → NEL eval on Qwen3.6-35B-A3B (256 experts × 40 layers, 35B params):

- Hook-call diagnostic: 0 → 6720 per-expert `F.linear` calls during calibration after the fix; 0 → 30720 `input_scale` tensors emitted in the exported checkpoint.
- FP8 fused-MoE path still produces gibberish; this is a separate follow-up (vLLM per-expert `weight_scale` handling).
  * vLLM full-FP8: the FlashInfer TRTLLM Fp8MoE loader doesn't stack the 256 per-expert scalar `weight_scale` tensors into a `[num_experts]` per-expert vector. It ends up applying one expert's scale across all 256, so every routed expert dequantizes with the wrong amplitude and the coherent token stream collapses into multilingual gibberish.
  * SGLang full-FP8: `qwen3_5.py::_make_packed_weight_loader` rejects with `AssertionError: Unexpected scalar for tuple shard load: loaded_shard_id=(0,1,2), split_sizes=[1,1,1]`. Its packed loader has no path for "N independent per-tensor source scalars combining into one fused-shard parameter," so the fused QKV (or `in_proj_qkvz`) load is structurally refused and the model never finishes loading.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

## Summary by CodeRabbit

* **New Features**
  * Better fused-expert export flow, a plugin to force eager expert execution during calibration/export, and a representative quantizer discovery utility.
* **Bug Fixes**
  * Reliable matching/discovery of per-expert indexed quantizers enabling correct calibration and mixed-precision export; fixes for calibration in nested decoder layouts.
* **Documentation**
  * Clarified PTQ config guidance on layerwise calibration.
* **Tests**
  * Added fused-experts calibration, export, and name-normalization tests.

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
1 parent e5ce0ae commit 077e29a

9 files changed

Lines changed: 533 additions & 30 deletions

modelopt/torch/export/plugins/vllm_fakequant_hf.py

Lines changed: 59 additions & 4 deletions
@@ -47,14 +47,18 @@
     "merge_amax_tensors_for_group",
 ]
 
-# Matches ``…weight_quantizer``, ``…weight_quantizer.0``, ``…w13_weight_quantizer.0``, etc.
-_WEIGHT_QUANTIZER_STATE_KEY = re.compile(r"(?:^|\.)(?:\w+_)?weight_quantizer(?:\.\d+)*$")
+# Matches ``…weight_quantizer``, ``…weight_quantizer.0``, ``…w13_weight_quantizer.0``,
+# and the plural fused-experts form ``…weight_quantizers.0`` (per-expert ModuleList).
+_WEIGHT_QUANTIZER_STATE_KEY = re.compile(r"(?:^|\.)(?:\w+_)?weight_quantizers?(?:\.\d+)*$")
 
 
 def is_weight_quantizer_state_key(key: str) -> bool:
-    """Return True for weight-quantizer state keys, including SequentialQuantizer entries.
+    """Return True for weight-quantizer state keys.
 
-    Matches ``weight_quantizer``, ``w13_weight_quantizer``, ``weight_quantizer.0``, etc.
+    Includes ``SequentialQuantizer`` entries and fused-experts ``ModuleList``
+    entries (``*_weight_quantizers.<idx>``). Matches ``weight_quantizer``,
+    ``w13_weight_quantizer``, ``weight_quantizer.0``,
+    ``gate_up_proj_weight_quantizers.0``, etc.
     """
     return bool(_WEIGHT_QUANTIZER_STATE_KEY.search(key))
 
@@ -142,6 +146,56 @@ def disable_rotate(quantizer: TensorQuantizer):
     return False
 
 
+def _fakequant_fused_experts_weights(
+    module: nn.Module,
+    module_name: str,
+    state_dict: dict | None,
+    fakequant_weights: set,
+    inplace: bool,
+):
+    """Apply per-expert fake-quant to a ``_QuantFusedExperts`` module's 3-D weights.
+
+    The base loop in :func:`_fakequant_module_weights` only handles singular
+    ``*_weight_quantizer`` attrs (one TensorQuantizer per weight). Fused-experts
+    modules expose ``*_weight_quantizers`` (``nn.ModuleList`` with one entry per
+    expert) that the base loop skips, leaving the fused 3-D weight unquantized
+    in the export and breaking weight-fold round-trips.
+    """
+    for w_attr, q_attr in (
+        ("gate_up_proj", "gate_up_proj_weight_quantizers"),
+        ("down_proj", "down_proj_weight_quantizers"),
+    ):
+        quantizers = getattr(module, q_attr, None)
+        if not isinstance(quantizers, nn.ModuleList):
+            continue
+        if not any(
+            isinstance(q, TensorQuantizer) and q.fake_quant and q.is_enabled for q in quantizers
+        ):
+            continue
+        sd_key = f"{module_name}.{w_attr}" if module_name else w_attr
+        if sd_key in fakequant_weights:
+            raise RuntimeError(f"Weight {sd_key} has already been fakequantized")
+
+        if inplace:
+            w = getattr(module, w_attr)
+            for idx, q in enumerate(quantizers):
+                if not (isinstance(q, TensorQuantizer) and q.fake_quant and q.is_enabled):
+                    continue
+                slice_ = w.data[idx]
+                slice_.copy_(q(slice_.float()).to(w.dtype))
+        else:
+            if state_dict is None or sd_key not in state_dict:
+                continue
+            w_3d = state_dict[sd_key].clone()
+            for idx, q in enumerate(quantizers):
+                if not (isinstance(q, TensorQuantizer) and q.fake_quant and q.is_enabled):
+                    continue
+                slice_ = w_3d[idx]
+                w_3d[idx] = q(slice_.float()).to(slice_.dtype)
+            state_dict[sd_key] = w_3d.cpu()
+        fakequant_weights.add(sd_key)
+
+
 def _fakequant_module_weights(
     module: nn.Module,
     module_name: str,
@@ -159,6 +213,7 @@ def _fakequant_module_weights(
     """
     if not isinstance(module, QuantModule):
         return
+    _fakequant_fused_experts_weights(module, module_name, state_dict, fakequant_weights, inplace)
     for attr_name, quantizer in module.named_children():
         if not (
             attr_name.endswith("weight_quantizer")
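The effect of the one-character regex change (`weight_quantizer` to `weight_quantizers?`) can be checked in isolation. The sketch below copies both patterns from the hunk above; the sample keys are hypothetical and chosen only to exercise each naming form.

```python
import re

# Patterns copied from the diff: before and after the change.
OLD_KEY = re.compile(r"(?:^|\.)(?:\w+_)?weight_quantizer(?:\.\d+)*$")
NEW_KEY = re.compile(r"(?:^|\.)(?:\w+_)?weight_quantizers?(?:\.\d+)*$")

# Hypothetical state keys (illustrative names, not taken from a real checkpoint).
singular_key = "layers.0.self_attn.q_proj.weight_quantizer"
sequential_key = "layers.0.mlp.experts.w13_weight_quantizer.0"
plural_key = "layers.0.mlp.experts.gate_up_proj_weight_quantizers.0"
input_key = "layers.0.mlp.experts.down_proj_input_quantizer"

old_hits = {k: bool(OLD_KEY.search(k)) for k in (singular_key, sequential_key, plural_key, input_key)}
new_hits = {k: bool(NEW_KEY.search(k)) for k in (singular_key, sequential_key, plural_key, input_key)}

for key in old_hits:
    print(f"{key}: old={old_hits[key]} new={new_hits[key]}")
```

Only the plural per-expert key changes from unmatched to matched; the singular and indexed singular forms match under both patterns, and input-quantizer keys match under neither.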

modelopt/torch/export/quant_utils.py

Lines changed: 7 additions & 2 deletions
@@ -42,6 +42,7 @@
     QuantizerAttrNames,
     quantizer_attr_names,
     reduce_block_amax,
+    representative_weight_quantizer,
     weight_attr_names,
 )
 from modelopt.torch.utils import clear_cuda_cache
@@ -546,7 +547,7 @@ def _compute_kv_cache_dtype(
 
 def get_weight_block_size(module: nn.Module, weight_name: str = "weight") -> int:
     """Returns the weight block size."""
-    weight_quantizer = getattr(module, quantizer_attr_names(weight_name).weight_quantizer, None)
+    weight_quantizer = representative_weight_quantizer(module, weight_name)
 
     if weight_quantizer is None:
         return 0
@@ -572,7 +573,11 @@ def get_quantization_format(module) -> str | None:
     """
 
     def _get_quantization_from_layer(layer, quantizer_attr_names: QuantizerAttrNames):
-        weight_quantizer = getattr(layer, quantizer_attr_names.weight_quantizer, None)
+        # Singular form first, plural ModuleList fallback (fused-experts).
+        # Strip the "_weight_quantizer" suffix to recover the weight attr name.
+        weight_attr = quantizer_attr_names.weight_quantizer
+        weight_name = weight_attr[: -len("_weight_quantizer")].rstrip("_") or "weight"
+        weight_quantizer = representative_weight_quantizer(layer, weight_name)
         input_quantizer = getattr(layer, quantizer_attr_names.input_quantizer, None)
 
         if weight_quantizer is None or not weight_quantizer.is_enabled:
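The slice-and-strip expression that recovers `weight_name` is compact enough to be easy to misread, so here is a small worked example. The helper name `weight_name_from_quantizer_attr` is hypothetical; the expression inside it is copied from the hunk above, and the inputs assume the `<name>_weight_quantizer` naming described elsewhere in this commit.

```python
SUFFIX = "_weight_quantizer"  # 17 characters, including the leading underscore


def weight_name_from_quantizer_attr(weight_attr: str) -> str:
    """Recover the weight attribute name from a quantizer attribute name."""
    # Drop the last 17 characters, strip any trailing underscore, and fall
    # back to "weight" when nothing is left (the plain "weight_quantizer" case).
    return weight_attr[: -len(SUFFIX)].rstrip("_") or "weight"


# "weight_quantizer" is only 16 characters long, so the slice is empty
# and the `or "weight"` fallback applies.
default_name = weight_name_from_quantizer_attr("weight_quantizer")
gate_up_name = weight_name_from_quantizer_attr("gate_up_proj_weight_quantizer")
down_name = weight_name_from_quantizer_attr("down_proj_weight_quantizer")

print(default_name, gate_up_name, down_name)
```

Note that the expression never checks that the input actually ends in `_weight_quantizer`; it relies on the caller passing a name in that form.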

modelopt/torch/export/unified_export_hf.py

Lines changed: 11 additions & 8 deletions
@@ -88,6 +88,7 @@
     QUANTIZATION_W4A8_NVFP4_FP8,
 )
 from .model_utils import get_language_model_from_vl, is_multimodal_model
+from .moe_utils import _export_fused_experts
 from .plugins import SpeculativeDecodingExporter, has_spec_opt
 from .quant_utils import (
     fuse_prequant_layernorm,
@@ -642,11 +643,20 @@ def _process_quantized_modules(
         if is_modelopt_qlora and (hasattr(sub_module, "base_layer")):
             continue
 
+        # Preprocessing: restore unpacked weight so the export path can read
+        # the live quantizer state. Falls through to the export branches below.
         if hasattr(sub_module, "weight_packed") or (
             "QuantFP8Linear" in type(sub_module).__name__ and sub_module.weight.element_size() <= 1
         ):
             sub_module.unpack_weight()
-        if get_quantization_format(sub_module) != QUANTIZATION_NONE:
+
+        if hasattr(sub_module, "gate_up_proj_weight_quantizers"):
+            # _QuantFusedExperts uses plural `gate_up_proj_weight_quantizers` (ModuleList),
+            # which get_quantization_format's singular-weight_quantizer check misses. Handle
+            # it explicitly before the format gate so fused-experts get split + quantized.
+            with fsdp2_aware_weight_update(model, sub_module, reshard=False):
+                _export_fused_experts(sub_module, dtype)
+        elif get_quantization_format(sub_module) != QUANTIZATION_NONE:
             # Skip QuantMoELinear - it's handled separately in _reconstruct_fused_moe_linear
             if type(sub_module).__name__ == "QuantMoELinear":
                 continue
@@ -677,13 +687,6 @@ def _process_quantized_modules(
             with fsdp2_aware_weight_update(model, sub_module, reshard=False):
                 for weight_name in ["gate_up_proj", "down_proj"]:
                     _export_quantized_weight(sub_module, dtype, weight_name)
-        elif hasattr(sub_module, "gate_up_proj_weight_quantizers"):
-            # Generic fused MoE experts (_QuantFusedExperts) with per-expert
-            # quantizer ModuleLists. Split into per-expert modules and export.
-            from modelopt.torch.export.moe_utils import _export_fused_experts
-
-            with fsdp2_aware_weight_update(model, sub_module, reshard=False):
-                _export_fused_experts(sub_module, dtype)
 
 
 def _export_transformers_checkpoint(

modelopt/torch/quantization/conversion.py

Lines changed: 33 additions & 1 deletion
@@ -16,6 +16,7 @@
 """Quantization conversion/restore utilities."""
 
 import fnmatch
+import re
 import warnings
 from collections.abc import Callable
 from contextlib import contextmanager
@@ -286,6 +287,33 @@ def set_quantizer_by_cfg(quant_model: nn.Module, quant_cfg: QuantizeQuantCfgType
             set_quantizer_attributes_full(quant_model, quantizer_name, attributes, parent_class)
 
 
+_FUSED_EXPERTS_QUANTIZER_LIST_RE = re.compile(
+    r"(weight_quantizers?|input_quantizers?)\.\d+(?=$|\.)"
+)
+
+
+def _normalize_fused_experts_quantizer_name(name: str) -> str:
+    """Strip the per-expert index from per-expert quantizer ModuleList names.
+
+    Fused-experts modules register per-expert weight/input quantizers in a
+    ``nn.ModuleList``; its children surface as dotted names like
+    ``...gate_up_proj_weight_quantizers.0`` (plural) or — if a variant uses
+    singular naming — ``...gate_up_proj_weight_quantizer.0``. Neither matches
+    the singular-suffix wildcards (``*weight_quantizer``) used in the stock
+    configs, so the experts stay at their defaults.
+
+    Return a normalized name where either ``weight_quantizer[s]?.N`` or
+    ``input_quantizer[s]?.N`` collapses to the singular form without the index
+    so the standard wildcards match.
+    """
+
+    def _repl(m: re.Match) -> str:
+        base = m.group(1)
+        return base.removesuffix("s")
+
+    return _FUSED_EXPERTS_QUANTIZER_LIST_RE.sub(_repl, name)
+
+
 def _match_quantizer(
     wildcard_or_filter_func: str | Callable,
     name: str,
@@ -296,7 +324,11 @@ def _match_quantizer(
     if not isinstance(module, (TensorQuantizer, SequentialQuantizer)):
         return False
     if isinstance(wildcard_or_filter_func, str):
-        if not fnmatch.fnmatch(name, wildcard_or_filter_func):
+        normalized = _normalize_fused_experts_quantizer_name(name)
+        if not (
+            fnmatch.fnmatch(name, wildcard_or_filter_func)
+            or (normalized != name and fnmatch.fnmatch(normalized, wildcard_or_filter_func))
+        ):
             return False
     elif callable(wildcard_or_filter_func):
         if not wildcard_or_filter_func(name):
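To see why the normalization step is needed, the name-matching logic can be reproduced with the standard library alone. This is a sketch, not the library code: the regex and the two-step match are copied from the hunks above, while the helper names `normalize` and `wildcard_matches` and the sample module path are hypothetical. It needs Python 3.9+ for `str.removesuffix`.

```python
import fnmatch
import re

# Regex copied from the diff: a (possibly plural) quantizer name followed by
# a numeric ModuleList index, at the end of the name or before another dot.
LIST_RE = re.compile(r"(weight_quantizers?|input_quantizers?)\.\d+(?=$|\.)")


def normalize(name: str) -> str:
    """Collapse ``*_quantizers.N`` (or ``*_quantizer.N``) to the singular, index-free form."""
    return LIST_RE.sub(lambda m: m.group(1).removesuffix("s"), name)


def wildcard_matches(name: str, wildcard: str) -> bool:
    """Try the raw name first, then the normalized name, as in the patched matcher."""
    normalized = normalize(name)
    return fnmatch.fnmatch(name, wildcard) or (
        normalized != name and fnmatch.fnmatch(normalized, wildcard)
    )


# Hypothetical per-expert quantizer name in a nested VLM layout.
expert_name = "model.language_model.layers.3.mlp.experts.gate_up_proj_weight_quantizers.7"
wildcard = "*mlp.experts*weight_quantizer"

raw_match = fnmatch.fnmatch(expert_name, wildcard)
patched_match = wildcard_matches(expert_name, wildcard)
print(normalize(expert_name), raw_match, patched_match)
```

The layer index in `layers.3` is left alone because the regex only fires when the index directly follows a quantizer name. The raw `fnmatch` fails because the name ends in `s.7`, which is the case the patched matcher recovers.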

modelopt/torch/quantization/plugins/huggingface.py

Lines changed: 60 additions & 0 deletions
@@ -900,6 +900,33 @@ def forward(self, *args, **kwargs):
         self._down_proj_linear = False
         return super().forward(*args, **kwargs)
 
+    def fold_weight(self, keep_attrs: bool = False):
+        """Fold per-expert weight quantizers into the fused 3-D weights.
+
+        The base ``fold_weight`` only handles singular ``*_weight_quantizer``
+        attributes. Fused experts use ``nn.ModuleList`` of per-expert quantizers
+        (``gate_up_proj_weight_quantizers``, ``down_proj_weight_quantizers``),
+        which would otherwise be skipped, leaving ``_amax`` on every quantizer.
+        """
+        for weight_name, quantizers_name in (
+            ("gate_up_proj", "gate_up_proj_weight_quantizers"),
+            ("down_proj", "down_proj_weight_quantizers"),
+        ):
+            weight = getattr(self, weight_name, None)
+            quantizers = getattr(self, quantizers_name, None)
+            if weight is None or quantizers is None:
+                continue
+            for idx, q in enumerate(quantizers):
+                if not (isinstance(q, TensorQuantizer) and q.fake_quant):
+                    continue
+                slice_ = weight.data[idx]
+                slice_.copy_(q(slice_.float()).to(weight.dtype))
+                q.disable()
+                if not keep_attrs:
+                    for attr_name in ("_pre_quant_scale", "_amax"):
+                        if hasattr(q, attr_name):
+                            delattr(q, attr_name)
+
 
 class _QuantDbrxFFN(_QuantSparseSequentialMoe):
     @property
@@ -1438,6 +1465,38 @@ def register_fused_experts_on_the_fly(model):
             QuantModuleRegistry.register({mod_type: f"hf.{mod_type.__name__}"})(_QuantFusedExperts)
 
 
+def force_eager_experts_impl_on_the_fly(model):
+    """Force HF fused-experts modules onto the eager ``F.linear``-based forward.
+
+    HF transformers 5.0+ decorates fused-experts forwards with
+    ``@use_experts_implementation``, which may dispatch to ``torch._grouped_mm``
+    or ``torch.bmm`` backends. Those backends bypass ``F.linear`` and so bypass
+    ``_QuantFusedExperts``'s input/weight quantizer hooks — calibration silently
+    does nothing, no ``input_scale`` / ``amax`` is collected, and the exported
+    checkpoint produces garbage at inference.
+
+    Sets ``config._experts_implementation = "eager"`` on the model config (and
+    recursively on ``text_config`` / ``vision_config`` / ``audio_config`` /
+    ``speech_config``) whenever a fused-experts module is present.
+    """
+    if not any(_is_fused_experts_module(m) for m in model.modules()):
+        return
+
+    nested_cfg_attrs = ("text_config", "vision_config", "audio_config", "speech_config")
+
+    def _force(cfg):
+        if cfg is None:
+            return
+        if hasattr(cfg, "_experts_implementation"):
+            cfg._experts_implementation = "eager"
+        for sub in nested_cfg_attrs:
+            if hasattr(cfg, sub):
+                _force(getattr(cfg, sub))
+
+    if hasattr(model, "config"):
+        _force(model.config)
+
+
 def _is_supported_hf_model(model):
     """Check if the model a valid model for transformers quantization specific support."""
     supported_models = [transformers.PreTrainedModel]
@@ -1665,6 +1724,7 @@ def _reconstruct_fused_moe_linear(model: nn.Module) -> None:
         register_dbrx_moe_on_the_fly,
         register_step3p5_moe_on_the_fly,
         register_fused_experts_on_the_fly,
+        force_eager_experts_impl_on_the_fly,
         register_sparse_moe_on_the_fly,
         register_hf_attentions_on_the_fly,
         convert_hf_parallel_linears_on_the_fly,
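The recursive config walk in `force_eager_experts_impl_on_the_fly` can be exercised without loading a model. The sketch below mirrors the `_force` helper from the hunk above on a stand-in config built from `types.SimpleNamespace`. The config layout and the starting value `"grouped_mm"` are assumptions made purely for illustration, not values confirmed by this commit.

```python
from types import SimpleNamespace

NESTED_CFG_ATTRS = ("text_config", "vision_config", "audio_config", "speech_config")


def force_eager(cfg) -> None:
    """Set ``_experts_implementation`` to "eager" on cfg and its nested sub-configs."""
    if cfg is None:
        return
    # Only overwrite the attribute where it already exists.
    if hasattr(cfg, "_experts_implementation"):
        cfg._experts_implementation = "eager"
    for sub in NESTED_CFG_ATTRS:
        if hasattr(cfg, sub):
            force_eager(getattr(cfg, sub))


# Stand-in for a VLM config: the placeholder value "grouped_mm" is hypothetical.
config = SimpleNamespace(
    _experts_implementation="grouped_mm",
    text_config=SimpleNamespace(_experts_implementation="grouped_mm"),
    vision_config=SimpleNamespace(),  # no experts setting on the vision tower
    audio_config=None,  # absent sub-config, handled by the None guard
)

force_eager(config)
print(config._experts_implementation, config.text_config._experts_implementation)
```

Because the helper only assigns where the attribute already exists, sub-configs without an experts setting (the vision tower in this toy) are left untouched.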

modelopt/torch/quantization/utils/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@
     "reduce_amax",
     "reduce_sum",
     "replace_function",
+    "representative_weight_quantizer",
     "update_quant_cfg_with_kv_cache_quant",
     "weight_attr_names",
 ]

modelopt/torch/quantization/utils/core_utils.py

Lines changed: 44 additions & 14 deletions
@@ -202,27 +202,57 @@ def reduce_sum(input, axis=None, keepdims=True):
     return output
 
 
-def weight_attr_names(module: nn.Module) -> "Generator[str, None, None]":
-    """Get the weight param attribute names in a converted module, non-recursive.
+def representative_weight_quantizer(module: nn.Module, weight_name: str = "weight"):
+    """Return the representative weight quantizer for ``weight_name`` on ``module``.
+
+    Handles two layouts:
+
+    - singular ``<name>_weight_quantizer`` — standard ``nn.Linear`` / ``_QuantLinear``.
+    - plural ``<name>_weight_quantizers`` (``nn.ModuleList``) — fused-experts modules
+      (``_QuantFusedExperts``) hold one ``TensorQuantizer`` per expert. Per-expert
+      formats are identical, so the first element is representative.
 
-    We consider the following two cases for each weight param attribute:
-    - The standard weight attribute (e.g. nn.Linear).
-    - The custom `weight_attr_name`. (e.g. Llama4TextExperts has weight attributes `gate_up_proj` and `down_proj`)
+    Returns ``None`` if no matching quantizer is found.
     """
     from ..nn import SequentialQuantizer, TensorQuantizer
 
-    # the standard weight and quantizer case
-    weight = getattr(module, "weight", None)
-    weight_quantizer = getattr(module, "weight_quantizer", None)
-    if weight is not None and isinstance(weight_quantizer, (TensorQuantizer, SequentialQuantizer)):
-        yield "weight"
+    singular = quantizer_attr_names(weight_name).weight_quantizer
+    q = getattr(module, singular, None)
+    if isinstance(q, (TensorQuantizer, SequentialQuantizer)):
+        return q
 
-    # other weight and quantizer case
+    plural = getattr(module, singular + "s", None)
+    if isinstance(plural, nn.ModuleList) and len(plural) > 0:
+        first = plural[0]
+        if isinstance(first, (TensorQuantizer, SequentialQuantizer)):
+            return first
+    return None
+
+
+def weight_attr_names(module: nn.Module) -> "Generator[str, None, None]":
+    """Get the weight param attribute names in a converted module, non-recursive.
+
+    Covers three layouts:
+
+    - standard ``nn.Linear``: ``weight`` + ``weight_quantizer``.
+    - custom per-weight quantizer (e.g. ``Llama4TextExperts`` with ``gate_up_proj`` +
+      ``gate_up_proj_weight_quantizer``).
+    - fused-experts ``nn.ModuleList`` quantizers (``_QuantFusedExperts`` with
+      ``gate_up_proj`` + ``gate_up_proj_weight_quantizers`` plural list).
+    """
+    # standard: "weight" + "weight_quantizer" (singular) or "weight_quantizers" (plural)
+    if getattr(module, "weight", None) is not None:
+        if representative_weight_quantizer(module, "weight") is not None:
+            yield "weight"
+
+    # per-parameter custom attr names
     for name, _ in module.named_parameters(recurse=False):
+        if name == "weight":
+            continue
         weight = getattr(module, name, None)
-        weight_quantizer = getattr(module, f"{name}_weight_quantizer", None)
-        if isinstance(weight, nn.Parameter) and isinstance(
-            weight_quantizer, (TensorQuantizer, SequentialQuantizer)
+        if (
+            isinstance(weight, nn.Parameter)
+            and representative_weight_quantizer(module, name) is not None
         ):
             yield name
 
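The singular-then-plural lookup in `representative_weight_quantizer` can be illustrated with a dependency-free toy. Everything here is a stand-in: `ToyQuantizer` replaces `TensorQuantizer`, a plain `list` replaces `nn.ModuleList`, and `toy_attr_name` only assumes the `<name>_weight_quantizer` convention described in the docstring above. It is a simplified sketch of the lookup order, not the library function.

```python
from types import SimpleNamespace


class ToyQuantizer:
    """Stand-in for a per-weight quantizer object."""

    def __init__(self, label: str):
        self.label = label


def toy_attr_name(weight_name: str) -> str:
    # Assumed convention: "weight" -> "weight_quantizer", else "<name>_weight_quantizer".
    return "weight_quantizer" if weight_name == "weight" else f"{weight_name}_weight_quantizer"


def representative(module, weight_name: str = "weight"):
    """Return the singular quantizer if present, else the first of the plural list."""
    singular = toy_attr_name(weight_name)
    q = getattr(module, singular, None)
    if isinstance(q, ToyQuantizer):
        return q
    plural = getattr(module, singular + "s", None)
    if isinstance(plural, list) and len(plural) > 0 and isinstance(plural[0], ToyQuantizer):
        return plural[0]
    return None


# A linear-like module (singular attr) and a fused-experts-like module (plural list).
linear = SimpleNamespace(weight_quantizer=ToyQuantizer("linear"))
experts = SimpleNamespace(
    gate_up_proj_weight_quantizers=[ToyQuantizer(f"expert{i}") for i in range(4)]
)

linear_q = representative(linear)
experts_q = representative(experts, "gate_up_proj")
missing_q = representative(experts, "down_proj")
print(linear_q.label, experts_q.label, missing_q)
```

Returning the first list element is only sound under the assumption stated in the docstring, namely that all per-expert quantizers share the same format.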

modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml

Lines changed: 3 additions & 1 deletion
@@ -20,7 +20,9 @@ quantize:
   algorithm:
     method: max
     # Max calibration is fast and does not typically need checkpointing.
-    layerwise: true
+    # layerwise=false required for VLMs where the decoder layers are nested under
+    # `model.language_model.layers` (layerwise_calibrate can't find them otherwise).
+    layerwise: false
   quant_cfg:
     - quantizer_name: '*'
       enable: false
