Remove _moe_count_expert_calib_tokens flag; tie token counting to moe_calib_experts_ratio (#1062)
Cherry-pick for 0.43.0
## Summary
- **Remove `moe_count_expert_calib_tokens`** config field and the
`_moe_count_expert_calib_tokens` internal flag. Token counting is now
implicitly enabled when `moe_calib_experts_ratio` is set, removing a
redundant knob.
- **Change `--moe_calib_experts_ratio` default to `None`** in
`hf_ptq.py` (was `1.0`). Previously, all experts were force-calibrated by
default; the feature is now opt-in, so non-MoE models are unaffected
when the flag is omitted.
- **Disable `layer_sync_moe_local_experts_amax`** when
`moe_calib_experts_ratio` is set, since each expert is calibrated
independently with sufficient token coverage in that mode.
- **Simplify `_QuantSparseMoe.forward`**: remove redundant truthy checks
on `_moe_calib_experts_ratio` inside the branch that already assumes it
is set.
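The gating described above can be illustrated with a minimal sketch. These helper functions are hypothetical (the real logic lives in `_QuantSparseMoe` in `modelopt/torch/quantization/plugins/huggingface.py`); they only capture the stated invariant that one knob now drives both behaviors:

```python
# Hypothetical sketch of the post-change gating, not the actual
# modelopt implementation.

def should_count_expert_calib_tokens(moe_calib_experts_ratio):
    """Token counting is implicitly enabled iff the ratio is set.

    Before this change, a separate moe_count_expert_calib_tokens flag
    had to be toggled independently.
    """
    return moe_calib_experts_ratio is not None


def should_sync_local_experts_amax(moe_calib_experts_ratio):
    """layer_sync_moe_local_experts_amax is skipped when the ratio is
    set, since each expert is then calibrated independently with
    sufficient token coverage.
    """
    return moe_calib_experts_ratio is None
```

With no ratio set, both counting and forced expert calibration stay off and amax syncing behaves as before; setting the ratio flips both at once.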
## Changed files
| File | Change |
|------|--------|
| `modelopt/torch/quantization/config.py` | Remove
`moe_count_expert_calib_tokens` field; update `moe_calib_experts_ratio`
description to document amax sync behavior |
| `modelopt/torch/quantization/mode.py` | Remove
`moe_count_expert_calib_tokens` propagation in `wrapped_calib_func` |
| `modelopt/torch/quantization/plugins/huggingface.py` | Remove
`_moe_count_expert_calib_tokens` from `_QuantSparseMoe`; simplify
`forward`; skip `layer_sync_moe_local_experts_amax` when ratio is set |
| `examples/llm_ptq/hf_ptq.py` | Default `--moe_calib_experts_ratio` to
`None`; guard validation |
| `tests/unit/.../test_sparse_moe.py` | Update tests to use
`_moe_calib_experts_ratio` instead of removed flag |
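The `hf_ptq.py` default change and guarded validation can be sketched with a small argparse fragment (hypothetical; the real script defines many more options, and the exact validation bounds are an assumption here):

```python
import argparse

parser = argparse.ArgumentParser()
# Previously default=1.0, which force-calibrated all experts even for
# runs where the user never asked for it; None makes the feature opt-in.
parser.add_argument("--moe_calib_experts_ratio", type=float, default=None)

args = parser.parse_args([])  # no flag passed: feature stays disabled

# Validation is guarded: it only runs when a value was actually provided,
# so non-MoE invocations are unaffected. The (0, 1] range is assumed.
if args.moe_calib_experts_ratio is not None:
    assert 0.0 < args.moe_calib_experts_ratio <= 1.0
```

Passing `--moe_calib_experts_ratio 0.5` would then opt in and trigger the range check; omitting the flag skips it entirely.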
## Test plan
- [x] Verify `hf_ptq.py` works without `--moe_calib_experts_ratio`
(non-MoE model, default `None`)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
## Diff: `CHANGELOG.rst` (+1 −1)
```diff
@@ -18,7 +18,7 @@ NVIDIA Model Optimizer Changelog
 - Add ``fp8_cast`` and ``nvfp4_cast`` modes for ``--kv_cache_qformat`` in ``hf_ptq.py``. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new ``use_constant_amax`` field in :class:`QuantizerAttributeConfig <modelopt.torch.quantization.config.QuantizerAttributeConfig>` controls this behavior.
 - User does not need to manually register MOE modules to cover experts calibration coverage in PTQ workflow.
 - ``hf_ptq.py`` now saves the quantization summary and moe expert token count table to the export directory.
-- Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during forward pass to improve expert coverage during calibration. Default to all the experts.
+- Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during forward pass to improve expert coverage during calibration. Default to None (not enabled).
 - Add sparse attention optimization for transformer models (``modelopt.torch.sparsity.attention_sparsity``). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add support for rotating the input before quantization for RHT.
 - Add support for advanced weight scale search for NVFP4 quantization and its export path.
```