
Commit 35d0f52

ChenhanYu authored and danielkorzekwa committed
Fix Sequential MLP amax sync deadlock (#862)
## What does this PR do?

**Type of change:** Bug fix

**Overview:**

After `QuantMoELayer`, we rely on `layer_sync_moe_local_experts_amax` to first perform a local sync. This is supposed to create `input_quantizer.amax` for all experts, but the current logic only updates experts that already have `amax`, so some experts are still missing `amax`.

As a result, `sync_quantizer_amax_across_dp_ep` can deadlock, since the collective is invoked based on whether `quantizer._amax is None`. Any expert with `None` amax never calls the collective, so it never arrives at it and the other ranks wait forever.

We fix `layer_sync_moe_local_experts_amax` so that even if an expert does not have `amax`, we overwrite it with a clone of the global amax. The postcondition is that all experts have `amax`, which matches the precondition of `sync_quantizer_amax_across_dp_ep`.

**Note:** we found that `_check_moe_calibration_complete` did not raise an error even when some experts had no amax. We did not look into this problem.

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved synchronization of quantization parameters for Mixture of Experts (MoE) models with more flexible configuration support.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
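The local-sync invariant the fix establishes can be sketched in plain Python. This is a hypothetical stand-in (dicts of floats instead of `TensorQuantizer` modules and tensors): collect the per-name maximum over experts that already have an amax, then write it back to *all* experts, including those whose amax was `None`.

```python
def layer_sync_local_experts_amax(experts):
    """Sketch: sync amax across local experts, filling in missing values.

    `experts` is a list of dicts mapping quantizer name -> amax (or None).
    Hypothetical stand-in for the real TensorQuantizer-based logic.
    """
    # Collect the per-name maximum over experts that already have an amax.
    amax_dict = {}
    for expert in experts:
        for name, amax in expert.items():
            if amax is not None:
                amax_dict[name] = max(amax_dict.get(name, amax), amax)

    # Apply back to ALL experts, including those whose amax was None.
    # Postcondition: every expert has an amax for every name seen on some
    # expert, so every rank later enters the distributed collective.
    for expert in experts:
        for name in expert:
            if name in amax_dict:
                expert[name] = amax_dict[name]
    return experts
```

With this postcondition, the later `None`-based gating in the distributed sync sees the same answer on every rank.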
1 parent b11d49b commit 35d0f52

File tree: 2 files changed, +10 −4 lines


modelopt/torch/quantization/model_calib.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -115,8 +115,8 @@ def max_calibrate(model: nn.Module, forward_loop: ForwardLoop | None = None, dis
 
     # Sync amax across local experts within each rank (for SequentialMLP)
     for name, module in model.named_modules():
-        if hasattr(module, "sync_moe_local_experts_amax"):
-            module.sync_moe_local_experts_amax()
+        if hasattr(module, "layer_sync_moe_local_experts_amax"):
+            module.layer_sync_moe_local_experts_amax()
 
     if not distributed_sync:
         return
```
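The renamed hook is discovered by duck typing, so only modules that define it are called. A minimal sketch of that dispatch pattern (the classes here are hypothetical, not the real Megatron modules):

```python
class _PlainLayer:
    """A module without the MoE sync hook; the loop skips it."""


class _MoELayer:
    """A module exposing the hook, like the Megatron SequentialMLP wrapper."""

    def __init__(self):
        self.synced = False

    def layer_sync_moe_local_experts_amax(self):
        self.synced = True


def sync_local_experts(modules):
    # Mirrors the max_calibrate loop: call the hook wherever it exists.
    for module in modules:
        if hasattr(module, "layer_sync_moe_local_experts_amax"):
            module.layer_sync_moe_local_experts_amax()


moe = _MoELayer()
sync_local_experts([_PlainLayer(), moe])
```

Because the pre-rename code probed for `sync_moe_local_experts_amax`, the check silently matched nothing and the local sync never ran, which is the rename this hunk fixes.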

modelopt/torch/quantization/plugins/megatron.py

Lines changed: 8 additions & 2 deletions

```diff
@@ -583,6 +583,12 @@ def layer_sync_moe_local_experts_amax(self):
         Distributed amax sync across EP and ETP (for RowParallel) happens in model_calib.max_calibrate().
         This function should be called before the distributed sync to ensure the amax values
         are synchronized across the layer first.
+
+        Note:
+            Because there is logic that calls collective communication based on whether amax is not None,
+            we need to guarantee that all experts have amax. Otherwise, there will be a deadlock when
+            synchronizing over EP, since some ranks may have amax None and never call the collective
+            communication.
         """
         # Collect amax from all local experts
         amax_dict = {}
@@ -600,8 +606,8 @@ def layer_sync_moe_local_experts_amax(self):
         # Apply synchronized amax values back to all local experts
         for expert in self.local_experts:
             for name, module in expert.named_modules():
-                if isinstance(module, TensorQuantizer) and module.amax is not None:
-                    module.amax = amax_dict[name].detach().clone().to(module.amax.device)
+                if isinstance(module, TensorQuantizer) and name in amax_dict:
+                    module.amax = amax_dict[name].detach().clone()
```
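The deadlock described in the docstring note can be demonstrated without `torch.distributed`. The sketch below models a collective as a `threading.Barrier`: it only completes if *every* rank calls it, so a rank that skips the call because its amax is `None` strands the others. All names here are hypothetical illustrations, not the real API.

```python
import threading


def demo(amax_per_rank, timeout=0.5):
    """Sketch: a collective (modeled as a Barrier) completes only if all ranks call it."""
    barrier = threading.Barrier(len(amax_per_rank))
    results = {}

    def rank_fn(rank, amax):
        # Old behavior: a rank whose amax is None skips the collective entirely.
        if amax is None:
            results[rank] = "skipped"
            return
        try:
            barrier.wait(timeout=timeout)  # blocks until all parties arrive
            results[rank] = "synced"
        except threading.BrokenBarrierError:
            # Timeout breaks the barrier: this is the simulated deadlock.
            results[rank] = "deadlock"

    threads = [
        threading.Thread(target=rank_fn, args=(rank, amax))
        for rank, amax in enumerate(amax_per_rank)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With the fix, the local sync guarantees no rank has `amax is None` before the distributed sync, so the `"skipped"` branch is never taken and every rank reaches the collective.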
