Commit a5d46ff
Auto Quantize improvements and bug fixes for large sparse MoEs (#953)
## What does this PR do?
**Type of change:** New feature + Bug fixes
**Overview:**
Enable AutoQuantize for NemotronH and large SparseMoE models, and split
the FP8 workflow between `mtq.auto_quantize` and `mtq.quantize`.
`mtq.auto_quantize` is now positioned as the lightweight search phase
(lite calibration + scoring), while `mtq.quantize` handles the
heavier/final calibration workflows (longer calibration passes,
force-all-token style MoE calibration, and advanced recipes such as
GPTQ and MSE).
### Algorithm & feature changes
- **NemotronH / SparseMoE support**: Updated the `quant_module` and
`score_module` rules (these should eventually move to the proposed
modeling lib). In the future, this should be the only change needed to
support new models; the bug fixes below were unearthed while enabling
NemotronH
- **Config generation**: Added
`mtq.get_auto_quantize_config(search_state, constraints=None,
verbose=False)` to re-solve from `search_state` and produce plain-dict
configs (no redundant `output_quantizer`), with optional verbose summary
- **FP8 workflow split**: Use lite calibration in `mtq.auto_quantize`,
then run longer/final calibration with `mtq.quantize` using the
generated config
- **Performance**: Pass `name_to_module` to
`enable_weight_access_and_writeback` to avoid O(N^2) overhead on large
MoE models
- **Calibration caching in checkpoint**: Save/restore quantizer
calibration states (metadata + state_dict) per recipe in the
AutoQuantize checkpoint, so resuming a search skips redundant
calibration
- **Per-rank distributed checkpointing**: When `torch.distributed` is
initialized, each rank saves/loads its own checkpoint file
(`search_state{rank}.pt`), with backward-compatible fallback to the
single-file path
### API updates
- **Config API naming**: Use `mtq.get_auto_quantize_config(...)` for
exporting the searched recipe into a quantize-ready config
- **Recommended usage pattern**:
```python
# 1) Lightweight search + lite calibration
model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 6.0},
    quantization_formats=[mtq.NVFP4_DEFAULT_CFG, mtq.FP8_DEFAULT_CFG],
    data_loader=data_loader,
    forward_step=forward_step,
    loss_func=loss_func,
    num_calib_steps=64,  # lite calibration during search
    num_score_steps=128,
)

# 2) Export the searched config (optionally re-solve constraints)
auto_quantize_config = mtq.get_auto_quantize_config(
    search_state,
    constraints={"effective_bits": 6.0},
    verbose=True,
)

# 3) Final / longer calibration pass with quantize
model = mtq.quantize(
    model,
    config=auto_quantize_config,
    forward_loop=long_calibration_loop,  # e.g. force-all-token style MoE calibration
)
```
### Bug fixes
- Fixed `disabled_layers` handling so fused kernels (e.g. Mamba blocks)
are properly skipped
- Fixed gradient checkpointing so that all modules other than the
checkpointed ones remain in eval mode
- Fixed FP8 fake quant NaN/inf when `amax ≈ 0`
- Fixed `SequentialQuantizer.convert_to_single_quantizer` to operate on
`module` instead of `model`, avoiding O(N^2) CPU iteration on SparseMoE
models with 1000s of submodules
- Switched to proper `F.kl_div` for KL divergence scoring
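On the last point: `F.kl_div` is easy to misuse because its `input` must be log-probabilities while `target` is probabilities, and it computes KL(target || input). A minimal sketch of correct usage with illustrative tensors; this is not the actual scoring code:

```python
import torch
import torch.nn.functional as F

# Illustrative logits from a quantized model and an unquantized reference.
quant_logits = torch.tensor([[2.0, 0.5, -1.0]])
ref_logits = torch.tensor([[1.8, 0.7, -0.9]])

# F.kl_div(input, target) computes KL(target || input):
#   input  -> log-probabilities of the approximating distribution
#   target -> probabilities of the reference distribution
kl = F.kl_div(
    F.log_softmax(quant_logits, dim=-1),
    F.softmax(ref_logits, dim=-1),
    reduction="batchmean",
)
```

Passing raw logits (or probabilities instead of log-probabilities) as `input` silently produces wrong scores, which is why the switch to a proper `F.kl_div` call matters.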
### Not yet exposed to `llm_ptq`
`mtq.get_auto_quantize_config` is not yet wired into `llm_ptq`. The
plain config records per-expert quantization settings for all MoE
experts, resulting in large JSON files. For my experiments I used a
quick workaround. A follow-up PR will add a better config representation
and expose it to `llm_ptq`.
## Testing
- Tested on NemotronH-tiny and Nemotron-Super-RL models
- Verified auto_quantize scoring + config generation end-to-end
- Unit test for checkpoint resume verifies calibration cache correctness
(metadata + tensor values)
- Existing unit tests pass
## Before your PR is "*Ready for review*"
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: AutoQuantize end-to-end
requires GPU + large MoE models; verified manually on NemotronH-tiny and
Nemotron-Super-RL. Unit test coverage to follow.
- **Did you add or update any necessary documentation?**: No
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes
## Additional Information
Follow-up planned: expose `mtq.get_auto_quantize_config` to `llm_ptq`
with a compact config format for MoE models. AWQ support in AutoQuantize
can also be removed in a future PR to keep it lightweight.
---------
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
12 files changed
Lines changed: 450 additions & 97 deletions