Support force tokens to % of total experts during calibration (#910)
## What does this PR do?
**Type of change:** New feature
**Overview:** Adds a configurable `moe_calib_experts_ratio` parameter
that controls the fraction of experts that tokens are routed to during
the calibration forward pass in MoE (Mixture of Experts) models.
Previously, the calibration forward always routed tokens to **all**
experts, which is expensive. This PR lets the user specify a ratio (the
config default remains all experts, so there is no behavior change) to
retain broad expert calibration coverage without the cost of a
full-expert forward. The token counting for the expert coverage table
now tracks the calibration routing and runs on CUDA for efficiency.
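The selection logic itself is not shown in this description; as a rough illustrative sketch only (the helper name, the round-robin padding, and the union-with-routed-experts behavior are assumptions, not the PR's actual code), forcing tokens to a ratio of experts might look like keeping the router's picks and padding up to the requested fraction:

```python
import math


def experts_to_calibrate(num_experts: int, routed: set[int], ratio: float) -> set[int]:
    """Hypothetical sketch: select at least ceil(ratio * num_experts) experts,
    always keeping the experts the router actually chose for the token."""
    target = math.ceil(ratio * num_experts)
    selected = set(routed)
    # Pad with additional experts until the target count is reached
    # (round-robin here; a real implementation could randomize instead).
    for expert_id in range(num_experts):
        if len(selected) >= target:
            break
        selected.add(expert_id)
    return selected


# With 8 experts, a router choice of {3, 5}, and ratio 0.5,
# calibration covers 4 experts in total.
print(sorted(experts_to_calibrate(8, {3, 5}, 0.5)))  # → [0, 1, 3, 5]
```

With `ratio=1.0` this degenerates to the previous all-experts behavior, which is why the config default introduces no change.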
**Changes include:**
- New `moe_calib_experts_ratio` field in `QuantizeAlgorithmConfig`
(`config.py`)
- Propagation of the ratio from the algorithm config to MoE modules
during calibration (`mode.py`)
- Updated `_QuantSparseMoe.forward` to use the configurable ratio
instead of hard-coding all experts (`huggingface.py`)
- New `--moe_calib_experts_ratio` CLI flag in `hf_ptq.py` (default
`0.25`)
- Moved `expert_token_count` tensor to CUDA and updated the HTML table
title in `moe_utils.py`
## Usage

Via the `hf_ptq.py` CLI, calibrating 50% of experts during MoE calibration:

```shell
python hf_ptq.py --model <model> --qformat int4_awq \
    --moe_calib_experts_ratio 0.5
```

Via the Python API, passing the ratio through the algorithm config:

```python
import modelopt.torch.quantization as mtq

quant_cfg = {
    "quant_cfg": { ... },
    "algorithm": {
        "method": "awq_lite",
        "moe_calib_experts_ratio": 0.25,  # calibrate 1/4 of experts
    },
}
mtq.quantize(model, quant_cfg, forward_loop=calib_loop)
```
## Testing
Test with Qwen3 30B A3B calibration and check the tokens per expert.
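One way to sanity-check coverage is to tally how many tokens each expert processed during calibration. Below is a minimal stand-in for that bookkeeping (the actual table in `moe_utils.py` is a CUDA tensor rendered as HTML; this pure-Python version is only illustrative):

```python
from collections import Counter


def count_tokens_per_expert(routing: list[list[int]]) -> Counter:
    """Tally tokens per expert, given the expert ids each token was sent to."""
    counts: Counter = Counter()
    for expert_ids in routing:
        counts.update(expert_ids)
    return counts


# Three tokens routed during calibration (expert ids are illustrative).
routing = [[0, 3], [1, 3], [0, 1]]
counts = count_tokens_per_expert(routing)
print(dict(counts))  # experts 0, 1, and 3 each saw 2 tokens
```

A zero (or near-zero) count for an expert signals that the calibration ratio left it under-covered.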
---------
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: realAsma <86726418+realAsma@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
**CHANGELOG.rst** (1 addition, 0 deletions):

```diff
@@ -8,6 +8,7 @@ NVIDIA Model Optimizer Changelog (Linux)
 - User does not need to manually register MOE modules to cover experts calibration coverage in PTQ workflow.
 - ``hf_ptq.py`` now saves the quantization summary and moe expert token count table to the export directory.
+- Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during forward pass to improve expert coverage during calibration. Default to all the experts.
 - Add sparse attention optimization for transformer models (``modelopt.torch.sparsity.attention_sparsity``). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add support for rotating the input before quantization for RHT.
```