
Commit 62bde15

jQizhang authored and kevalmorabia97 committed
Fix weight-only quantization for TEGroupedMLP (MoE models) (#971)
### What does this PR do?

This PR fixes a critical issue where weight-only quantization fails for MoE models utilizing `TEGroupedMLP` (e.g., Qwen3-30B-A3B).

#### The Problem

In `TEGroupedMLP`, weights are stored per-expert as `weight0`, `weight1`, ..., `weightN`. During `_QuantTEGroupedLinear._setup`, the standard `self.weight` attribute is deleted. The existing `weight_only_quantize` logic expects to find a `self.weight` associated with the quantizer. Because it could not find these "hidden" expert weights, the `weight_quantizer` failed to calibrate, leaving the `_amax` attribute missing. This leads to the following crash during export/inference:

<img width="2792" height="1034" alt="image" src="https://github.com/user-attachments/assets/9e2b1abd-80f4-4b8b-bb95-f8ee7a8f686a" />

```python
File ".../modelopt/torch/quantization/qtensor/nvfp4_tensor.py", line 59, in get_weights_scaling_factor_2_from_quantizer
    assert hasattr(weight_quantizer, "_amax"), "Weight quantizer does not have attribute amax"
```

#### The Solution

1. **Calibration interface:** Introduced `iter_weights_for_calibration` in the `QuantModule` base class.
2. **MoE support:** Overrode this method in `_QuantTEGroupedLinear` to yield all per-expert weights (`weight0`...`weightN`) that share the same quantizer. This ensures the calibrator "sees" all expert weights and calculates a valid `_amax`.

---

### 2. Type of change

* [x] Bug fix

---

### 3. Usage / Reproduction

This issue is reproducible when running weight-only quantization on MoE models like Qwen3-30B-A3B:

```bash
# Step 1: Quantization
torchrun --nproc_per_node 8 examples/quantization/quantize.py \
    --hf-model-id Qwen/Qwen3-30B-A3B \
    --export-quant-cfg nvfp4 \
    --tp 2 \
    --ep 8 \
    --weight-only \
    --megatron-save-path ./qwen3_30b_nvfp4
```

---

### 4. Testing & Verification

* **Models tested:** Qwen3-8B (dense), Qwen3-30B-A3B (MoE).
* **Quantization:** NVFP4/FP8 weight-only quantization.
* **Verification:**
  * Confirmed that `QuantTEGroupedMLP` now correctly shows calculated `_amax` values in the quantization statistics table instead of remaining `dynamic`.
  * Validated that the change does not regress the dense-model (Qwen3-8B) quantization flow.
  * After the fix, the amax of experts is calculated correctly:

```
Quantization Statistics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┓
┃ Parameter Name                                                      ┃ Shape ┃ Max Value  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━┩
│ decoder.layers.0.self_attention.linear_proj.weight_quantizer._amax  │ ()    │ 7.5781e-01 │
│ decoder.layers.0.self_attention.linear_qkv.weight_quantizer._amax   │ ()    │ 2.8711e-01 │
│ decoder.layers.0.mlp.experts.linear_fc1.weight_quantizer._amax      │ ()    │ 7.1094e-01 │
│ decoder.layers.0.mlp.experts.linear_fc2.weight_quantizer._amax      │ ()    │ 8.6719e-01 │
│ decoder.layers.1.self_attention.linear_proj.weight_quantizer._amax  │ ()    │ 5.8594e-01 │
│ decoder.layers.1.self_attention.linear_qkv.weight_quantizer._amax   │ ()    │ 7.4219e-01 │
│ decoder.layers.1.mlp.experts.linear_fc1.weight_quantizer._amax      │ ()    │ 7.2266e-01 │
│ decoder.layers.1.mlp.experts.linear_fc2.weight_quantizer._amax      │ ()    │ 1.9922e+00 │
│ decoder.layers.2.self_attention.linear_proj.weight_quantizer._amax  │ ()    │ 1.0859e+00 │
│ decoder.layers.2.self_attention.linear_qkv.weight_quantizer._amax   │ ()    │ 1.7812e+00 │
│ decoder.layers.2.mlp.experts.linear_fc1.weight_quantizer._amax      │ ()    │ 7.3047e-01 │
│ decoder.layers.2.mlp.experts.linear_fc2.weight_quantizer._amax      │ ()    │ 1.9219e+00 │
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Enhanced weight-only quantization calibration with improved support for specialized quantization modules and grouped-linear quantization paths.
* **Bug Fixes**
  * Fixed handling of missing weight attributes during quantization calibration to prevent incorrect processing.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: larkzhang-nv <larkz@nvidia.com>
Signed-off-by: larkz <larkz@nvidia.com>
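The failure mode and the fix can be sketched with toy stand-ins (hypothetical classes, not the ModelOpt API): a grouped module stores per-expert weights as `weight0`...`weightN` with no `self.weight`, so a generic lookup finds nothing, while an explicit generator lets the shared quantizer see every expert.

```python
class ToyQuantizer:
    """Stands in for the weight_quantizer: tracks the max |w| seen so far."""

    def __init__(self):
        self._amax = None

    def __call__(self, w):
        amax = max(abs(x) for x in w)
        self._amax = amax if self._amax is None else max(self._amax, amax)


class ToyGroupedLinear:
    """Stores one weight per expert as weight0..weightN; `weight` itself does not exist."""

    def __init__(self, expert_weights):
        self.num_gemms = len(expert_weights)
        self.weight_quantizer = ToyQuantizer()
        for i, w in enumerate(expert_weights):
            setattr(self, f"weight{i}", w)

    def iter_weights_for_calibration(self):
        # Mirrors the fix: yield every expert weight paired with the shared quantizer.
        for i in range(self.num_gemms):
            w = getattr(self, f"weight{i}", None)
            if w is not None:
                yield w, self.weight_quantizer


m = ToyGroupedLinear([[0.1, -0.3], [0.7, 0.2], [-1.5, 0.4]])
assert getattr(m, "weight", None) is None  # the generic `self.weight` lookup finds nothing

for w, q in m.iter_weights_for_calibration():
    q(w)
assert m.weight_quantizer._amax == 1.5  # amax now covers all experts
```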
1 parent 2252074 · commit 62bde15

File tree

4 files changed: +20 -6 lines changed


modelopt/torch/quantization/model_calib.py

Lines changed: 4 additions & 5 deletions

```diff
@@ -67,12 +67,11 @@ def weight_only_quantize(model: nn.Module):
     for module in name_to_module.values():
         if module in seen_modules:
             continue
-        for weight_name in weight_attr_names(module):
+
+        if isinstance(module, QuantModule):
             with enable_weight_access_and_writeback(module, model, name_to_module):
-                weight_quantizer = getattr(
-                    module, quantizer_attr_names(weight_name).weight_quantizer
-                )
-                weight_quantizer(getattr(module, weight_name))
+                for weight, weight_quantizer in module.iter_weights_for_calibration():
+                    weight_quantizer(weight)
         seen_modules.add(module)
```
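The reworked loop can be sketched in isolation: instead of guessing weight attribute names, it asks each module for its own `(weight, quantizer)` pairs. The stub classes below are illustrative only; the real function also wraps access in `enable_weight_access_and_writeback` and checks `isinstance(module, QuantModule)`.

```python
class _Quantizer:
    """Records which weights it was calibrated on."""

    def __init__(self):
        self.calibrated_on = []

    def __call__(self, w):
        self.calibrated_on.append(w)


class _Module:
    """Duck-typed stand-in exposing the new calibration generator."""

    def __init__(self, weights):
        self._weights = weights
        self.weight_quantizer = _Quantizer()

    def iter_weights_for_calibration(self):
        for w in self._weights:
            yield w, self.weight_quantizer


def calibrate(modules):
    # Simplified shape of the patched weight_only_quantize loop.
    seen = set()
    for module in modules:
        if id(module) in seen:
            continue
        for weight, weight_quantizer in module.iter_weights_for_calibration():
            weight_quantizer(weight)
        seen.add(id(module))


shared = _Module(["w0", "w1", "w2"])
calibrate([shared, shared])  # the second occurrence is deduplicated via `seen`
assert shared.weight_quantizer.calibrated_on == ["w0", "w1", "w2"]
```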

modelopt/torch/quantization/nn/modules/quant_module.py

Lines changed: 8 additions & 0 deletions

```diff
@@ -119,6 +119,14 @@ def modelopt_post_restore(self, prefix: str = ""):
             if isinstance(module, TensorQuantizer):
                 module.to(non_tq_param_or_buffer.device)
 
+    def iter_weights_for_calibration(self):
+        """Yield ``(weight, weight_quantizer)`` pairs for weight-only calibration."""
+        from modelopt.torch.quantization.utils import quantizer_attr_names, weight_attr_names
+
+        for weight_name in weight_attr_names(self):
+            weight_quantizer = getattr(self, quantizer_attr_names(weight_name).weight_quantizer)
+            yield getattr(self, weight_name), weight_quantizer
+
     def fold_weight(self, keep_attrs: bool = False):
         """Fold the weight for faster eval."""
         # Handle all attributes that end with _weight_quantizer
```
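For ordinary dense modules the base-class generator yields a single pair, so existing calibration behavior is unchanged. A minimal sketch with stub classes (hypothetical, not the real `QuantModule`/`TensorQuantizer`):

```python
class _StubQuantizer:
    """Stand-in quantizer that records the max |w| it calibrates on."""

    def __init__(self):
        self._amax = None

    def __call__(self, w):
        self._amax = max(abs(x) for x in w)


class _DenseStub:
    """Standard case: one `weight` with one `weight_quantizer`."""

    def __init__(self, w):
        self.weight = w
        self.weight_quantizer = _StubQuantizer()

    def iter_weights_for_calibration(self):
        # Default behavior: yield the single standard weight/quantizer pair.
        yield self.weight, self.weight_quantizer


m = _DenseStub([0.5, -2.0, 1.0])
pairs = list(m.iter_weights_for_calibration())
assert len(pairs) == 1  # dense path is unaffected by the MoE fix

for w, q in pairs:
    q(w)
assert m.weight_quantizer._amax == 2.0
```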

modelopt/torch/quantization/plugins/transformer_engine.py

Lines changed: 7 additions & 0 deletions

```diff
@@ -151,6 +151,13 @@ def modelopt_post_restore(self, prefix: str = ""):
         # Remove self.weight after post_restore.
         delattr(self, "weight")
 
+    def iter_weights_for_calibration(self):
+        """Yield ``(weight_i, weight_quantizer)`` for each of the ``num_gemms`` grouped weights."""
+        for i in range(self.num_gemms):
+            weight_i = getattr(self, f"weight{i}", None)
+            if weight_i is not None:
+                yield weight_i, self.weight_quantizer
+
     @staticmethod
     def te_grouped_quantized_linear_fn(package, func_name, self, *args):
         _assert_te_fp8_enabled()
```

modelopt/torch/quantization/utils/core_utils.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -213,7 +213,7 @@ def weight_attr_names(module: nn.Module) -> "Generator[str, None, None]":
     # the standard weight and quantizer case
     weight = getattr(module, "weight", None)
     weight_quantizer = getattr(module, "weight_quantizer", None)
-    if isinstance(weight_quantizer, (TensorQuantizer, SequentialQuantizer)):
+    if weight is not None and isinstance(weight_quantizer, (TensorQuantizer, SequentialQuantizer)):
         yield "weight"
 
     # other weight and quantizer case
```
