13 changes: 4 additions & 9 deletions modelopt/torch/quantization/algorithms.py
@@ -291,13 +291,10 @@ def get_score(self, recipe: QuantRecipe) -> float:
total_score += importance.cpu().item()
continue

-if parallel_state.expert_model_parallel_group.is_initialized():
-# TODO: Support expert model parallelism for score estimation
-warnings.warn("AutoQuantize does not support expert model parallelism yet.")
importance = importance.cpu()
importance = DistributedProcessGroup.get_dist_syncd_obj(
importance,
-[parallel_state.tensor_parallel_group, parallel_state.data_parallel_group],
+[parallel_state.tensor_parallel_group, parallel_state.data_parallel_group, parallel_state.expert_model_parallel_group],
sum,
)
Comment on lines 295 to 299

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Canonicalize the winning recipe across expert-parallel ranks too.

These changes make candidate stats identical across EP shards, but run_search() still normalizes the solver output only over DP/TP. The comment at Lines 750-752 already calls out tie-driven divergence, so the same grouped MoE hparam can still end up with different recipes on different expert-parallel ranks.

Suggested fix
             best_format = DistributedProcessGroup.get_dist_syncd_obj(
                 best_hparam_recipe_info["format"],
-                [_ps.data_parallel_group, _ps.tensor_parallel_group],
+                [
+                    _ps.data_parallel_group,
+                    _ps.tensor_parallel_group,
+                    _ps.expert_model_parallel_group,
+                ],
                 lambda a: a[0],
             )

Also applies to: 318-320
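
The tie-driven divergence described above can be illustrated with a minimal, self-contained sketch in plain Python. It does not call `DistributedProcessGroup`; the `pick_format` solver, the `sync_first` helper, and the format names are all hypothetical stand-ins. `sync_first` only mimics what a gather followed by `lambda a: a[0]` would presumably achieve, namely that every rank adopts the first rank's choice.

```python
def pick_format(scores, tie_break):
    """Return the lowest-score format; `tie_break` models rank-dependent tie resolution."""
    best = min(scores.values())
    tied = sorted(name for name, score in scores.items() if score == best)
    return tied[tie_break % len(tied)]


def sync_first(per_rank_values):
    """Hypothetical stand-in for gather + `lambda a: a[0]`: all ranks adopt rank 0's value."""
    return [per_rank_values[0]] * len(per_rank_values)


# Candidate stats are already identical on every expert-parallel rank (assumed values).
scores = {"format_a": 1.0, "format_b": 1.0, "format_c": 3.0}

# Each simulated rank breaks the tie between format_a and format_b differently.
local_choice = [pick_format(scores, tie_break=rank) for rank in range(4)]

# Canonicalizing over the EP group removes the divergence.
canonical = sync_first(local_choice)

print(local_choice)
print(canonical)
```

Identical inputs are not enough when the selection step itself can resolve ties differently per rank, which is why the winning recipe needs its own sync over every group that shares the hparam.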

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/algorithms.py` around lines 295 - 299, The solver
output normalization and recipe selection need to be synchronized across
expert-parallel ranks: extend the DistributedProcessGroup.get_dist_syncd_obj
calls and any normalization/argmax steps inside run_search() to include
parallel_state.expert_model_parallel_group (i.e., add expert_model_parallel_group to
the group list used when reducing/normalizing candidate stats and winner
selection), and apply the same change to the other occurrence around the
importance aggregation (the block at the other get_dist_syncd_obj call
referenced). This ensures candidate stats and the canonical winning recipe are
identical across EP shards so tie-driven divergence cannot produce different
recipes per EP rank.

total_score += importance.item()
@@ -318,13 +315,9 @@ def get_cost(self, recipe: QuantRecipe) -> float:
cost += weight_size * recipe.compression
continue

-if parallel_state.expert_model_parallel_group.is_initialized():
-# TODO: Support expert model parallelism
-warnings.warn("AutoQuantize does not support expert model parallelism yet.")
-
weight_size = DistributedProcessGroup.get_dist_syncd_obj(
weight_size,
-[parallel_state.tensor_parallel_group],
+[parallel_state.tensor_parallel_group, parallel_state.expert_model_parallel_group],
Comment on lines 318 to +320

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the search budget on the same scale as the new EP-synced costs.

Line 320 makes expert-module costs global across expert_model_parallel_group, but run_search() still derives max_weight_size from the local shard via _get_total_weight_size(self.model.modules()). On EP runs, this mixes global candidate costs with a local budget, so the solver can over-compress and report the wrong effective bits.

Suggested fix
-        total_weight_size = self._get_total_weight_size(self.model.modules())
+        total_weight_size = sum(
+            candidate_stat["costs"][-1] for candidate_stat in self.candidate_stats.values()
+        )
         max_weight_size = total_weight_size * compression
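
A small arithmetic sketch of the scale mismatch, under simplifying assumptions: the model consists only of expert weights, every EP rank holds the same amount, and all numbers are made up for illustration. None of the names below come from the library.

```python
EP_SIZE = 4                 # assumed expert-parallel world size
LOCAL_EXPERT_WEIGHTS = 100  # assumed weight size of the experts held by one rank
COMPRESSION = 0.5           # assumed target: keep half of the original size

# After the EP sync, the unquantized candidate cost is summed over all EP ranks.
global_cost_unquantized = LOCAL_EXPERT_WEIGHTS * EP_SIZE

# Budget derived from the local shard vs. from the synced costs.
local_budget = LOCAL_EXPERT_WEIGHTS * COMPRESSION
global_budget = global_cost_unquantized * COMPRESSION

# Compression the solver must actually reach to fit each budget.
effective_with_local_budget = local_budget / global_cost_unquantized
effective_with_global_budget = global_budget / global_cost_unquantized

print(effective_with_local_budget, effective_with_global_budget)
```

Under these assumptions the local budget forces the solver toward one eighth of the original size rather than the requested half, which is the over-compression the comment warns about.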
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/algorithms.py` around lines 318 - 320,
run_search() currently computes max_weight_size from the local shard via
_get_total_weight_size(self.model.modules()), while costs were made global with
DistributedProcessGroup.get_dist_syncd_obj (expert_model_parallel_group),
causing an inconsistent budget; fix by deriving the search budget from the same
globally-synced weight_size used for costs — call
DistributedProcessGroup.get_dist_syncd_obj on the value returned by
_get_total_weight_size(self.model.modules()) (or otherwise scale it by the
expert_model_parallel_group world size) so max_weight_size is on the same
EP-synced scale as the candidate costs.

sum,
)

@@ -362,6 +355,8 @@ class _AutoQuantizeBaseSearcher(BaseSearcher, ABC):
# gate_proj, up_proj, down_proj for Qwen3 like MoE models
r"^(.*?\.mlp\.experts)\.\d+\.(gate_proj|up_proj|down_proj)$",
r"^(.*?\.mixer\.experts)\.\d+\.(up_proj|down_proj)$", # NemotronH MoE experts
+# NemotronH MoE experts in MCore naming (linear_fc1=gate+up fused, linear_fc2=down)
+r"^(.*?\.mlp\.experts\.local_experts)\.\d+\.(linear_fc1|linear_fc2)$",
r"^(.*?)\.(gate_proj|up_proj)$", # gate_proj, up_proj for llama like models
r"^(.*?)\.(\d+\.(w1|w2|w3))$", # mixtral experts
r"^(.*?)\.((w1_linear|w2_linear|w3_linear)\.\d+)$", # dbrx experts
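
A quick way to check what the added MCore pattern captures is to run it against a few module names. The sketch below assumes that modules sharing the first capture group end up in one group, which the pattern's shape suggests but this hunk does not confirm. The module names are hypothetical examples.

```python
import re

# Pattern copied from the hunk above.
MCORE_EXPERT_RULE = r"^(.*?\.mlp\.experts\.local_experts)\.\d+\.(linear_fc1|linear_fc2)$"

# Hypothetical module names in MCore style.
names = [
    "decoder.layers.0.mlp.experts.local_experts.0.linear_fc1",
    "decoder.layers.0.mlp.experts.local_experts.0.linear_fc2",
    "decoder.layers.0.mlp.experts.local_experts.1.linear_fc1",
    "decoder.layers.1.mlp.experts.local_experts.0.linear_fc1",
    "decoder.layers.0.self_attention.linear_qkv",
]

groups = {}
for name in names:
    match = re.match(MCORE_EXPERT_RULE, name)
    # Assumption: the first capture group is the grouping key; unmatched names stay alone.
    key = match.group(1) if match else name
    groups.setdefault(key, []).append(name)

for key, members in groups.items():
    print(key, len(members))
```

With these names, all local experts of one layer collapse onto a single key, while the attention projection stays on its own.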