Commit e77ad02

cjluo-nv authored and kevalmorabia97 committed
Remove _moe_count_expert_calib_tokens flag; tie token counting to moe_calib_experts_ratio (#1062)
Cherry-pick for 0.43.0

## Summary

- **Remove `moe_count_expert_calib_tokens`** config field and the `_moe_count_expert_calib_tokens` internal flag. Token counting is now implicitly enabled when `moe_calib_experts_ratio` is set, removing a redundant knob.
- **Change `--moe_calib_experts_ratio` default to `None`** in `hf_ptq.py` (was `1.0`). Previously all experts were force-calibrated by default; now the feature is opt-in and non-MoE models are unaffected without any flag.
- **Disable `layer_sync_moe_local_experts_amax`** when `moe_calib_experts_ratio` is set, since each expert is calibrated independently with sufficient token coverage in that mode.
- **Simplify `_QuantSparseMoe.forward`**: remove redundant truthy checks on `_moe_calib_experts_ratio` inside the branch that already assumes it is set.

## Changed files

| File | Change |
|------|--------|
| `modelopt/torch/quantization/config.py` | Remove `moe_count_expert_calib_tokens` field; update `moe_calib_experts_ratio` description to document amax sync behavior |
| `modelopt/torch/quantization/mode.py` | Remove `moe_count_expert_calib_tokens` propagation in `wrapped_calib_func` |
| `modelopt/torch/quantization/plugins/huggingface.py` | Remove `_moe_count_expert_calib_tokens` from `_QuantSparseMoe`; simplify `forward`; skip `layer_sync_moe_local_experts_amax` when ratio is set |
| `examples/llm_ptq/hf_ptq.py` | Default `--moe_calib_experts_ratio` to `None`; guard validation |
| `tests/unit/.../test_sparse_moe.py` | Update tests to use `_moe_calib_experts_ratio` instead of removed flag |

## Test plan

- [x] Verify `hf_ptq.py` works without `--moe_calib_experts_ratio` (non-MoE model, default `None`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Configuration Changes**
  * `moe_calib_experts_ratio` now defaults to `None` (disabled) instead of `1.0`; validation only occurs when a value is provided.
* **Refactor**
  * Simplified MoE calibration flow and token-counting behavior; removed a deprecated expert-calibration configuration field.
* **Documentation**
  * Changelog and docstrings updated to reflect the new default and calibration behavior.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
1 parent 8254d93 · commit e77ad02

File tree: 6 files changed (+35 −46 lines)

CHANGELOG.rst — 1 addition, 1 deletion

```diff
@@ -18,7 +18,7 @@ NVIDIA Model Optimizer Changelog
 - Add ``fp8_cast`` and ``nvfp4_cast`` modes for ``--kv_cache_qformat`` in ``hf_ptq.py``. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new ``use_constant_amax`` field in :class:`QuantizerAttributeConfig <modelopt.torch.quantization.config.QuantizerAttributeConfig>` controls this behavior.
 - User does not need to manually register MOE modules to cover experts calibration coverage in PTQ workflow.
 - ``hf_ptq.py`` now saves the quantization summary and moe expert token count table to the export directory.
-- Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during forward pass to improve expert coverage during calibration. Default to all the experts.
+- Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during forward pass to improve expert coverage during calibration. Default to None (not enabled).
 - Add sparse attention optimization for transformer models (``modelopt.torch.sparsity.attention_sparsity``). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add support for rotating the input before quantization for RHT.
 - Add support for advanced weight scale search for NVFP4 quantization and its export path.
```

examples/llm_ptq/hf_ptq.py — 3 additions, 3 deletions

```diff
@@ -1207,16 +1207,16 @@ def parse_args() -> argparse.Namespace:
     parser.add_argument(
         "--moe_calib_experts_ratio",
         type=float,
-        default=1.0,
+        default=None,
         help=(
             "Fraction of experts to calibrate during forward pass (ratio in (0.0, 1.0]). "
-            "Only used for MOE models; used to reduce the number of experts calibrated during the forward pass."
+            "Only used for MOE models; used to reduce the number of experts calibrated during the forward pass. "
             "Does not impact non-MOE models."
         ),
     )

     args = parser.parse_args()
-    if not (0.0 < args.moe_calib_experts_ratio <= 1.0):
+    if args.moe_calib_experts_ratio is not None and not (0.0 < args.moe_calib_experts_ratio <= 1.0):
         parser.error("--moe_calib_experts_ratio must be in the range (0.0, 1.0].")

     return args
```
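The guarded validation above can be sketched as a standalone snippet (argparse only; the real script wires this into a much larger parser):

```python
import argparse

# Minimal sketch of the opt-in flag pattern from hf_ptq.py: a default of None
# means "feature disabled", and range validation runs only when a value is given.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--moe_calib_experts_ratio",
    type=float,
    default=None,
    help="Fraction of experts to calibrate during forward pass (ratio in (0.0, 1.0]).",
)

def parse_ratio(argv):
    args = parser.parse_args(argv)
    if args.moe_calib_experts_ratio is not None and not (
        0.0 < args.moe_calib_experts_ratio <= 1.0
    ):
        parser.error("--moe_calib_experts_ratio must be in the range (0.0, 1.0].")
    return args

print(parse_ratio([]).moe_calib_experts_ratio)  # None: feature off, non-MoE models unaffected
print(parse_ratio(["--moe_calib_experts_ratio", "0.5"]).moe_calib_experts_ratio)  # 0.5
```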

modelopt/torch/quantization/config.py — 5 additions, 11 deletions

```diff
@@ -1066,21 +1066,15 @@ class QuantizeAlgorithmConfig(ModeloptBaseConfig):
     moe_calib_experts_ratio: float | None = ModeloptField(
         default=None,
+        gt=0.0,
+        le=1.0,
         title="% of experts to calibrate during forward pass.",
         description=(
             "If specified, we force forward tokens to % of experts during the calibration"
             " pass. This forward is for calibration purpose only and will not affect the"
-            " actual inference. Not supported for all MoE architectures; currently works"
-            " with a few HuggingFace models such as Mixtral, Qwen3Moe, MiniMax."
-        ),
-    )
-
-    moe_count_expert_calib_tokens: bool = ModeloptField(
-        default=False,
-        title="Enable expert token counting during MoE calibration.",
-        description=(
-            "If True, counts how many tokens are routed to each expert during calibration."
-            " Not supported for all MoE architectures; currently works with a few HuggingFace"
+            " actual inference. NOTE: when set, ``layer_sync_moe_local_experts_amax`` is"
+            " disabled so each expert maintains its own calibration statistics. Not"
+            " supported for all MoE architectures; currently works with a few HuggingFace"
             " models such as Mixtral, Qwen3Moe, MiniMax."
         ),
     )
```

modelopt/torch/quantization/mode.py — 0 additions, 6 deletions

```diff
@@ -236,12 +236,6 @@ def wrapped_calib_func(
         if hasattr(module, "_moe_calib_experts_ratio"):
             module._moe_calib_experts_ratio = moe_calib_experts_ratio

-    moe_count_expert_calib_tokens = kwargs.pop("moe_count_expert_calib_tokens", False)
-    if moe_count_expert_calib_tokens:
-        for module in model.modules():
-            if hasattr(module, "_moe_count_expert_calib_tokens"):
-                module._moe_count_expert_calib_tokens = True
-
     if func is not None:
         if sequential:
             if forward_loop is None:
```
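The surviving propagation path can be illustrated with a toy version of the `hasattr`-guarded module walk (the module classes below are illustrative stand-ins, not the real ones):

```python
# Toy sketch of the remaining knob propagation in wrapped_calib_func: walk all
# modules and set the ratio only on those that declare the attribute, so
# non-MoE modules are silently skipped.
class SparseMoeBlock:
    def __init__(self):
        self._moe_calib_experts_ratio = None

class Linear:
    pass

def propagate_ratio(modules, ratio):
    for module in modules:
        if hasattr(module, "_moe_calib_experts_ratio"):
            module._moe_calib_experts_ratio = ratio

moe, linear = SparseMoeBlock(), Linear()
propagate_ratio([moe, linear], 0.5)
print(moe._moe_calib_experts_ratio)  # 0.5; Linear is untouched
```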

modelopt/torch/quantization/plugins/huggingface.py — 18 additions, 20 deletions

```diff
@@ -446,16 +446,15 @@ class _QuantSparseMoe(QuantModule):

     Supports ``layer_sync_moe_local_experts_amax`` to sync input quantizer amax across experts.

-    Optionally supports two config-driven features (disabled by default):
+    Optionally supports config-driven features (disabled by default):
     - ``_moe_calib_experts_ratio``: force-forward tokens to more experts during calibration.
-    - ``_moe_count_expert_calib_tokens``: count tokens routed to each expert during calibration.
+      When set to a value > 0, also enables token counting per expert.

-    When both are disabled, forward is a direct pass-through with zero overhead.
+    When disabled, forward is a direct pass-through with zero overhead.
     """

     def _setup(self):
         self._moe_calib_experts_ratio = None
-        self._moe_count_expert_calib_tokens = False
         self._token_counting_initialized = False

     def _init_token_counting(self):
@@ -503,24 +502,18 @@ def _gate_forward_hook(self, module, input, output):
         self.expert_token_count += counts.to(self.expert_token_count.device)

     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        if not self._moe_calib_experts_ratio and not self._moe_count_expert_calib_tokens:
+        if self._moe_calib_experts_ratio is None:
             return super().forward(hidden_states)

-        if self._moe_count_expert_calib_tokens and not self._token_counting_initialized:
-            self._init_token_counting()
-
         is_calib = any(getattr(m, "_if_calib", False) for m in self.experts.modules())
-        self._count_expert_tokens = is_calib and self._moe_count_expert_calib_tokens
-
-        # If any of the experts are in calibration mode, we will forward all tokens to
-        # self._moe_calib_experts_ratio % of the experts to improve the calibration coverage.
-        # This is used only for calibration, we need to re-calculate the actual outputs again using
-        # the original top_k
-        if is_calib and self._moe_calib_experts_ratio:
-            self._count_expert_tokens = True
-            assert 0 < self._moe_calib_experts_ratio <= 1, (
-                "moe_calib_experts_ratio must be between 0 and 1"
-            )
+
+        # During calibration, forward all tokens to a larger fraction of experts to improve
+        # calibration coverage, then re-run with the original top_k for actual outputs.
+        if is_calib:
+            # Skip counting when all experts are calibrated (ratio == 1.0).
+            self._count_expert_tokens = self._moe_calib_experts_ratio < 1.0
+            if self._count_expert_tokens and not self._token_counting_initialized:
+                self._init_token_counting()
             if TRANSFORMERS_VERSION_GE_5_0:
                 assert hasattr(self, "gate") and hasattr(self.gate, "top_k")
                 original_top_k = self.gate.top_k
@@ -561,7 +554,12 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         return output

     def layer_sync_moe_local_experts_amax(self):
-        """Sync input_quantizer amax across experts so all share the same amax per quantizer."""
+        """Sync input_quantizer amax across experts so all share the same amax per quantizer.
+
+        Skipped when _moe_calib_experts_ratio is set, as each expert is calibrated independently.
+        """
+        if self._moe_calib_experts_ratio is not None:
+            return
         sync_moe_expert_amax(self.experts)
```
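The expert-forcing arithmetic during calibration can be sketched as follows (a simplified assumption about how the ratio maps to an expert count; the actual routing lives inside `_QuantSparseMoe.forward`):

```python
import math

# Hedged sketch: during calibration, tokens are routed to roughly
# ceil(ratio * num_experts) experts instead of the model's top_k, so every
# expert accumulates enough amax statistics; the forward is then re-run with
# the original top_k for the actual output. The exact rounding is an assumption.
def calib_expert_count(num_experts: int, ratio: float, original_top_k: int) -> int:
    forced = math.ceil(num_experts * ratio)
    # Never route to fewer experts than the model already uses.
    return max(forced, original_top_k)

print(calib_expert_count(num_experts=8, ratio=0.5, original_top_k=2))  # 4
print(calib_expert_count(num_experts=8, ratio=1.0, original_top_k=2))  # 8
```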

tests/unit/torch/quantization/plugins/test_sparse_moe.py — 8 additions, 5 deletions

```diff
@@ -202,7 +202,6 @@ def test_setup_config_knobs_default(self):

         converted = QuantModuleRegistry.convert(moe_block)
         assert converted._moe_calib_experts_ratio is None
-        assert converted._moe_count_expert_calib_tokens is False
         assert not hasattr(converted, "expert_token_count")

     def test_forward_default_config_passthrough(self):
@@ -259,17 +258,22 @@ def test_forward_calib_restores_top_k(self):
         assert converted.top_k == original_top_k

     def test_token_counting_lazy_init(self):
-        """When moe_count_expert_calib_tokens is enabled, token counting infra is lazy-inited."""
+        """When moe_calib_experts_ratio > 0, token counting infra is lazy-inited."""
         model = get_tiny_qwen3_moe()
         moe_block = self._get_moe_block(model)
         if QuantModuleRegistry.get(type(moe_block)) is None:
             register_sparse_moe_on_the_fly(model)

         converted = QuantModuleRegistry.convert(moe_block)
-        converted._moe_count_expert_calib_tokens = True
+        converted._moe_calib_experts_ratio = 0.5

         assert not hasattr(converted, "expert_token_count")

+        # Simulate calibration mode so lazy-init triggers during forward
+        # Set _if_calib on an expert sub-module (not set by default since only the MoE
+        # block was converted, not the full model).
+        next(converted.experts.modules())._if_calib = True
+
         x = torch.randn(1, 4, 32)
         with torch.no_grad():
             converted(x)
@@ -305,8 +309,7 @@ def test_qwen3_moe_quantize_with_token_forcing_and_counting():
     quant_cfg = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
     quant_cfg["algorithm"] = {
         "method": "max",
-        "moe_calib_experts_ratio": 1.0,
-        "moe_count_expert_calib_tokens": True,
+        "moe_calib_experts_ratio": 0.5,
     }

     def calib_fn(model):
```
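The per-expert token counting these tests exercise can be illustrated with a small routing sketch (shapes and the bincount accumulation are assumptions for illustration, not the plugin's actual gate-hook code):

```python
import torch

# Each token's router logits pick top_k experts; a bincount over the selected
# expert ids yields the per-expert token count accumulated during calibration.
torch.manual_seed(0)
num_experts, top_k, num_tokens = 8, 2, 16
router_logits = torch.randn(num_tokens, num_experts)
selected = router_logits.topk(top_k, dim=-1).indices           # (16, 2) expert ids
expert_token_count = torch.bincount(selected.flatten(), minlength=num_experts)
print(expert_token_count.sum().item())  # 32 = num_tokens * top_k
```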
