Commit 62bde15
Fix weight-only quantization for TEGroupedMLP (MoE models) (#971)
### 1. What does this PR do?
This PR fixes a critical issue where weight-only quantization fails for
MoE models utilizing `TEGroupedMLP` (e.g., Qwen3-30B-A3B).
#### The Problem:
In `TEGroupedMLP`, weights are stored per-expert as `weight0`,
`weight1`, ..., `weightN`. During `_QuantTEGroupedLinear._setup`, the
standard `self.weight` attribute is deleted.
The existing `weight_only_quantize` logic expects to find a
`self.weight` attribute associated with the quantizer. Because it cannot
find the per-expert weights, the `weight_quantizer` is never calibrated
and is left without an `_amax` attribute, which causes the following
crash during export/inference:
<img width="2792" height="1034" alt="image"
src="https://github.com/user-attachments/assets/9e2b1abd-80f4-4b8b-bb95-f8ee7a8f686a"
/>
```python
File ".../modelopt/torch/quantization/qtensor/nvfp4_tensor.py", line 59, in get_weights_scaling_factor_2_from_quantizer
assert hasattr(weight_quantizer, "_amax"), "Weight quantizer does not have attribute amax"
```
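The failure mode can be reduced to a few lines. The sketch below is illustrative, not the modelopt implementation: `GroupedLinear` and `calibrate_amax` are hypothetical names standing in for `TEGroupedLinear` and the weight-only calibration path.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a grouped-linear module that stores one tensor
# per expert (weight0..weightN) and deliberately has no `self.weight`,
# mirroring `_QuantTEGroupedLinear` after `_setup` deletes `weight`.
class GroupedLinear(nn.Module):
    def __init__(self, num_experts, in_features, out_features):
        super().__init__()
        for i in range(num_experts):
            self.register_parameter(
                f"weight{i}", nn.Parameter(torch.randn(out_features, in_features))
            )

def calibrate_amax(module):
    # A calibrator that only looks at `module.weight` finds nothing here,
    # so no amax is ever recorded -- mirroring the missing `_amax` above.
    weight = getattr(module, "weight", None)
    return None if weight is None else weight.abs().amax()

m = GroupedLinear(num_experts=4, in_features=8, out_features=8)
print(calibrate_amax(m))  # -> None: the per-expert weights are invisible
```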
#### The Solution:
1. **Calibration Interface:** Introduced `iter_weights_for_calibration`
in the `QuantModule` base class.
2. **MoE Support:** Overrode this method in `_QuantTEGroupedLinear` to
yield all per-expert weights (`weight0`...`weightN`) that share the same
quantizer. This ensures the calibrator "sees" all expert weights and
calculates a valid `_amax`.
---
### 2. Type of change
* [x] Bug fix
---
### 3. Usage / Reproduction
This issue is reproducible when running weight-only quantization on MoE
models like Qwen3-30B-A3B:
```bash
# Step 1: Quantization
torchrun --nproc_per_node 8 examples/quantization/quantize.py \
--hf-model-id Qwen/Qwen3-30B-A3B \
--export-quant-cfg nvfp4 \
--tp 2 \
--ep 8 \
--weight-only \
--megatron-save-path ./qwen3_30b_nvfp4
```
---
### 4. Testing & Verification
* **Models Tested:** Qwen3-8B (Dense), Qwen3-30B-A3B (MoE).
* **Quantization:** NVFP4/FP8 weight-only quantization.
* **Verification:**
  * Confirmed that `QuantTEGroupedMLP` now shows calculated `_amax`
    values in the quantization statistics table instead of remaining
    `dynamic`.
  * Validated that the change does not regress the dense-model
    (Qwen3-8B) quantization flow.
  * After the fix, the expert `_amax` values are calculated correctly:
```
Quantization Statistics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ Parameter Name ┃ Shape ┃ Max Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│ decoder.layers.0.self_attention.linear_proj.weight_quantizer._amax │ () │ 7.5781e-01 │
│ decoder.layers.0.self_attention.linear_qkv.weight_quantizer._amax │ () │ 2.8711e-01 │
│ decoder.layers.0.mlp.experts.linear_fc1.weight_quantizer._amax │ () │ 7.1094e-01 │
│ decoder.layers.0.mlp.experts.linear_fc2.weight_quantizer._amax │ () │ 8.6719e-01 │
│ decoder.layers.1.self_attention.linear_proj.weight_quantizer._amax │ () │ 5.8594e-01 │
│ decoder.layers.1.self_attention.linear_qkv.weight_quantizer._amax │ () │ 7.4219e-01 │
│ decoder.layers.1.mlp.experts.linear_fc1.weight_quantizer._amax │ () │ 7.2266e-01 │
│ decoder.layers.1.mlp.experts.linear_fc2.weight_quantizer._amax │ () │ 1.9922e+00 │
│ decoder.layers.2.self_attention.linear_proj.weight_quantizer._amax │ () │ 1.0859e+00 │
│ decoder.layers.2.self_attention.linear_qkv.weight_quantizer._amax │ () │ 1.7812e+00 │
│ decoder.layers.2.mlp.experts.linear_fc1.weight_quantizer._amax │ () │ 7.3047e-01 │
│ decoder.layers.2.mlp.experts.linear_fc2.weight_quantizer._amax │ () │ 1.9219e+00 │
```
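A table like the one above can be spot-checked programmatically. The helper below is a hypothetical verification aid (not part of the PR); it only assumes that each weight quantizer reports either a finite `_amax` value or `None` when calibration was skipped.

```python
import math

# Hypothetical spot-check: after calibration, every weight quantizer
# should carry a finite `_amax`; expert quantizers in particular must
# no longer be missing it.
def check_amax(stats):
    # `stats` maps quantizer names to their recorded amax (None if missing).
    missing = [name for name, amax in stats.items() if amax is None]
    bad = [
        name for name, amax in stats.items()
        if amax is not None and not math.isfinite(amax)
    ]
    return missing, bad

# Two expert entries taken from the statistics table above.
stats = {
    "decoder.layers.0.mlp.experts.linear_fc1.weight_quantizer": 7.1094e-01,
    "decoder.layers.0.mlp.experts.linear_fc2.weight_quantizer": 8.6719e-01,
}
missing, bad = check_amax(stats)
print(missing, bad)  # -> [] []: all expert quantizers have a valid amax
```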
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Enhanced weight-only quantization calibration with improved support
for specialized quantization modules and grouped-linear quantization
paths.
* **Bug Fixes**
* Fixed handling of missing weight attributes during quantization
calibration to prevent incorrect processing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: larkzhang-nv <larkz@nvidia.com>
Signed-off-by: larkz <larkz@nvidia.com>
File tree (4 files changed, +20 −6 lines):
- modelopt/torch/quantization
  - nn/modules
  - plugins
  - utils