Commit fcb09bf

[NVBug: 6038899] Fix MoE export crash on meta tensors with CPU offload (#1155)
## Summary

Fixes `NotImplementedError` in `sync_moe_gate_up_amax` when quantizing MoE models (e.g. Qwen3-30B-A3B) on a single GPU with insufficient VRAM.

When GPU memory is insufficient, ModelOpt enables CPU offload via accelerate, leaving uncalibrated expert parameters on the `meta` device. During export, `sync_moe_gate_up_amax` calls `torch.equal()` on these meta tensors, which raises `NotImplementedError` because `aten::equal` does not support meta tensors, even though calibration itself completed successfully.

## Changes

- Add a guard in `sync_moe_gate_up_amax` to skip amax sync for meta tensors (which have no real data to sync) and emit a warning explaining the root cause.

Bug: https://nvbugspro.nvidia.com/bug/6038899

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Bug Fixes**
  * Added warning messages for unsupported tensor configurations in quantization workflows.
  * Improved edge case detection to gracefully skip processing in incompatible scenarios.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
1 parent 7cd7d18 commit fcb09bf
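
The failure mode described above is easy to reproduce outside ModelOpt. A minimal standalone sketch (not from the ModelOpt codebase): meta tensors carry only shape and dtype metadata, so data-dependent ops such as `torch.equal` have no real elements to compare, while metadata-only queries like `.is_meta` remain safe.

```python
import torch

# Meta tensors have shape and dtype but no storage, so any op that must
# read element data cannot run on them.
a = torch.empty(4, device="meta")
b = torch.empty(4, device="meta")

print(a.is_meta, a.shape, a.dtype)  # metadata-only queries are fine

try:
    torch.equal(a, b)  # data-dependent: needs to compare actual elements
except NotImplementedError as e:
    print(f"torch.equal failed on meta tensors: {e}")
```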

1 file changed

Lines changed: 12 additions & 0 deletions

File tree

modelopt/torch/export/layer_utils.py

```diff
@@ -1184,6 +1184,18 @@ def sync_moe_gate_up_amax(model: nn.Module) -> int:
         up_amax = getattr(up_wq, "amax", None)
         if gate_amax is None or up_amax is None:
             break
+        # Meta tensors have no storage (e.g. CPU-offloaded experts that
+        # were never activated during calibration). Skip — there is no
+        # real amax data to sync.
+        if gate_amax.is_meta or up_amax.is_meta:
+            warn(
+                f"Skipping gate/up amax sync for expert with meta tensors "
+                f"(gate_amax.is_meta={gate_amax.is_meta}, "
+                f"up_amax.is_meta={up_amax.is_meta}). "
+                f"This typically means the expert was CPU-offloaded and "
+                f"not activated during calibration."
+            )
+            break
         if not torch.equal(gate_amax, up_amax):
             shared_amax = torch.max(gate_amax, up_amax)
             gate_wq.amax = shared_amax
```
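
For reference, the guard can be read as a standalone helper. This is an illustrative sketch only (`sync_expert_amax` and its signature are hypothetical names, not ModelOpt API): check `.is_meta`, which inspects pure metadata and is always safe, before any data-dependent comparison.

```python
from warnings import warn

import torch


def sync_expert_amax(
    gate_amax: torch.Tensor, up_amax: torch.Tensor
) -> torch.Tensor | None:
    """Return the shared amax for a gate/up pair, or None if it cannot be synced."""
    # .is_meta only reads tensor metadata, so it is safe even when the tensor
    # has no storage (e.g. a CPU-offloaded expert never seen during calibration).
    if gate_amax.is_meta or up_amax.is_meta:
        warn("Skipping amax sync: tensor(s) on meta device, no data to compare.")
        return None
    # Only now is a data-dependent comparison safe to run.
    if not torch.equal(gate_amax, up_amax):
        return torch.max(gate_amax, up_amax)
    return gate_amax
```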
