
Commit 3fcc5a7

Fix non-scalar input amax in preprocess_linear_fusion for MoE export
preprocess_linear_fusion unconditionally asserts `modules[0].input_quantizer.amax.numel() == 1`, which breaks NVFP4 quantization when the model has per-expert-decomposed MoE linears (gate_proj/up_proj pairs per expert). NVFP4's per-channel input quantizer produces a vector amax, not a scalar, so the assertion trips immediately on the first expert during `export_hf_checkpoint()`.

Root cause: the function was written assuming fused linears have a per-tensor scalar input amax. That holds for dense FP8/INT8 paths but not for NVFP4's per-channel activation statistics, which modelopt's own NVFP4_AWQ_FULL_CFG produces.

This change:

- Keeps the existing scalar-amax path (dense + FP8/INT8 unchanged)
- Adds a non-scalar path using elementwise max (`.amax(dim=0)`) across the stacked per-channel amax tensors of the modules being fused

Numerical correctness for the MoE case: the modules being fused here (e.g. gate_proj and up_proj of one expert) consume the *same* input tensor by construction, so their per-channel input amax tensors are identical. Elementwise max is therefore a no-op, and it is also the correct unification rule should the tensors ever differ due to floating-point accumulation.

Validated end-to-end on SuperGemma4 26B (128-expert MoE) with NVFP4_AWQ_FULL_CFG; export now completes, and the serialized checkpoint loads and generates correctly. Before this fix, export failed with `AssertionError: Only support scalar input quant amax` after 2h 24min of successful calibration.

Signed-off-by: AEON-7 <m2vgz48wpp@privaterelay.appleid.com>
1 parent c9b1155 commit 3fcc5a7

File tree

1 file changed: +15 -5 lines changed


modelopt/torch/export/quant_utils.py

Lines changed: 15 additions & 5 deletions
@@ -1375,11 +1375,21 @@ def preprocess_linear_fusion(modules: list[torch.nn.Module], resmooth_only=False
         return
 
     if modules[0].input_quantizer.is_enabled and modules[0].input_quantizer.amax is not None:
-        assert modules[0].input_quantizer.amax.numel() == 1, (
-            "Only support scalar input quant amax"
-        )
-
-        input_amax = torch.max(torch.stack([module.input_quantizer.amax for module in modules]))
+        if modules[0].input_quantizer.amax.numel() == 1:
+            # Scalar amax (e.g. dense layers with per-tensor activation quant):
+            # unify via scalar max across the modules being fused.
+            input_amax = torch.max(
+                torch.stack([module.input_quantizer.amax for module in modules])
+            )
+        else:
+            # Non-scalar amax (e.g. NVFP4 per-channel input quantizer on
+            # per-expert-decomposed MoE). Modules being fused here share the
+            # same input tensor, so their per-channel amax vectors are
+            # identical by construction. Elementwise max is a no-op in that
+            # case and is the correct unification rule if they ever differ.
+            input_amax = torch.stack(
+                [module.input_quantizer.amax for module in modules]
+            ).amax(dim=0)
         for module in modules:
             module.input_quantizer.amax = input_amax
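The unification rule in the patch can be sketched standalone. This is an illustrative sketch only: `unify_input_amax` and `amaxes` are hypothetical names standing in for the modules' `input_quantizer.amax` tensors, not modelopt's actual API.

```python
import torch


def unify_input_amax(amaxes: list[torch.Tensor]) -> torch.Tensor:
    """Unify input-quantizer amax tensors across modules being fused."""
    if amaxes[0].numel() == 1:
        # Per-tensor (scalar) amax: reduce to a single scalar max.
        return torch.max(torch.stack(amaxes))
    # Per-channel (vector) amax: elementwise max across the modules.
    return torch.stack(amaxes).amax(dim=0)


# Scalar case: two fused modules with per-tensor amax.
scalar = unify_input_amax([torch.tensor([2.0]), torch.tensor([3.0])])

# Per-channel case: the fused modules see the same input tensor, so
# their amax vectors match and the elementwise max is a no-op.
chan = unify_input_amax([torch.tensor([1.0, 4.0]), torch.tensor([1.0, 4.0])])
```

Note that `torch.stack(...).amax(dim=0)` reduces over the module dimension, so the result keeps the per-channel shape of the individual amax tensors.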
