Support AutoQuant in Megatron-Core #1512
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
📝 Walkthrough
This PR extends quantization support for Mixture of Experts (MoE) models by updating distributed synchronization across expert parallel groups in recipe scoring and costing, and adding pattern matching to group NemotronH expert modules for consistent quantization recipe sharing.
Changes
MoE Quantization Expert Parallelism
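As a rough illustration of what grouping expert modules by pattern could look like, here is a minimal sketch. The module names, the regular expression, and the helpers `group_key` and `group_modules` are all hypothetical and are not taken from this PR. The sketch only shows the general idea of collapsing the per-expert index so that matching layers of all experts land in one group and can share a recipe.

```python
import re
from collections import defaultdict

# Hypothetical module names in a Megatron-Core-like layout (assumption, not from the PR).
module_names = [
    "decoder.layers.1.mlp.experts.local_experts.0.linear_fc1",
    "decoder.layers.1.mlp.experts.local_experts.1.linear_fc1",
    "decoder.layers.1.mlp.experts.local_experts.0.linear_fc2",
    "decoder.layers.1.mlp.experts.local_experts.1.linear_fc2",
    "decoder.layers.2.self_attention.linear_qkv",
]

# Illustrative pattern: capture everything except the per-expert index.
EXPERT_PATTERN = re.compile(r"^(.*\.local_experts)\.\d+\.(linear_fc[12])$")


def group_key(name: str) -> str:
    """Return a shared key for expert modules; non-expert modules keep their own name."""
    match = EXPERT_PATTERN.match(name)
    if match is None:
        return name
    return f"{match.group(1)}.*.{match.group(2)}"


def group_modules(names):
    """Bucket module names by group key so each bucket can share one recipe."""
    groups = defaultdict(list)
    for name in names:
        groups[group_key(name)].append(name)
    return dict(groups)


groups = group_modules(module_names)
```

In this sketch each expert projection (`linear_fc1`, `linear_fc2`) forms its own group per layer; whether the PR groups at that granularity is not stated in this thread.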
🎯 2 (Simple) | ⏱️ ~10 minutes
Important: Pre-merge checks failed
Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (5 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/quantization/algorithms.py`:
- Around line 295-299: The solver output normalization and recipe selection need
to be synchronized across expert-parallel ranks: extend the
DistributedProcessGroup.get_dist_syncd_obj calls and any normalization/argmax
steps inside run_search() to include parallel_state.expert_model_parallel_group
(i.e., add expert_model_parallel_group to the group list used when
reducing/normalizing candidate stats and winner selection), and apply the same
change to the other occurrence around the importance aggregation (the block at
the other get_dist_syncd_obj call referenced). This ensures candidate stats and
the canonical winning recipe are identical across EP shards so tie-driven
divergence cannot produce different recipes per EP rank.
- Around line 318-320: run_search() currently computes max_weight_size from the
local shard via _get_total_weight_size(self.model.modules()), while costs were
made global with DistributedProcessGroup.get_dist_syncd_obj
(expert_model_parallel_group), causing an inconsistent budget; fix by deriving
the search budget from the same globally-synced weight_size used for costs —
call DistributedProcessGroup.get_dist_syncd_obj on the value returned by
_get_total_weight_size(self.model.modules()) (or otherwise scale it by the
expert_model_parallel_group world size) so max_weight_size is on the same
EP-synced scale as the candidate costs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: eb567d1f-0a05-4136-a751-1c8bdfdc1f46
📒 Files selected for processing (1)
modelopt/torch/quantization/algorithms.py
  importance = DistributedProcessGroup.get_dist_syncd_obj(
      importance,
-     [parallel_state.tensor_parallel_group, parallel_state.data_parallel_group],
+     [parallel_state.tensor_parallel_group, parallel_state.data_parallel_group, parallel_state.expert_model_parallel_group],
      sum,
  )
Canonicalize the winning recipe across expert-parallel ranks too.
These changes make candidate stats identical across EP shards, but run_search() still normalizes the solver output only over DP/TP. The comment at Lines 750-752 already calls out tie-driven divergence, so the same grouped MoE hparam can still end up with different recipes on different expert-parallel ranks.
Suggested fix
  best_format = DistributedProcessGroup.get_dist_syncd_obj(
      best_hparam_recipe_info["format"],
-     [_ps.data_parallel_group, _ps.tensor_parallel_group],
+     [
+         _ps.data_parallel_group,
+         _ps.tensor_parallel_group,
+         _ps.expert_model_parallel_group,
+     ],
      lambda a: a[0],
  )
Also applies to: 318-320
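To make the tie-driven divergence concrete, the following is a minimal plain-Python sketch, not the ModelOpt implementation. It assumes two expert-parallel ranks see equal scores for two formats but break the tie differently, and it models the `lambda a: a[0]` reduction as "every rank adopts the first gathered entry". The format labels, the scores, and the helpers `pick_best` and `canonicalize` are hypothetical.

```python
def pick_best(scores: dict) -> str:
    """Pick the lowest-score format; on ties, min() returns the first key in iteration order."""
    return min(scores, key=scores.get)


# Two hypothetical EP ranks with identical scores but different iteration order,
# standing in for any nondeterministic tie-breaking in a solver.
rank_scores = [
    {"FP8": 1.0, "NVFP4": 1.0, "BF16": 3.0},
    {"NVFP4": 1.0, "FP8": 1.0, "BF16": 3.0},
]

# Without synchronization, each rank keeps its own tie-break result.
local_choice = [pick_best(scores) for scores in rank_scores]


def canonicalize(choices, op=lambda gathered: gathered[0]):
    """Stand-in for an all-gather followed by a reduction applied on every rank."""
    gathered = list(choices)
    return [op(gathered) for _ in choices]


# With synchronization, every rank adopts the first rank's choice.
synced_choice = canonicalize(local_choice)
```

Here `local_choice` differs between the two ranks while `synced_choice` is identical on both, which is the property the suggested fix aims to extend to the expert-parallel group.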
  weight_size = DistributedProcessGroup.get_dist_syncd_obj(
      weight_size,
-     [parallel_state.tensor_parallel_group],
+     [parallel_state.tensor_parallel_group, parallel_state.expert_model_parallel_group],
Keep the search budget on the same scale as the new EP-synced costs.
Line 320 makes expert-module costs global across expert_model_parallel_group, but run_search() still derives max_weight_size from the local shard via _get_total_weight_size(self.model.modules()). On EP runs that mixes global candidate costs with a local budget, so the solver can over-compress and report the wrong effective bits.
Suggested fix
- total_weight_size = self._get_total_weight_size(self.model.modules())
+ total_weight_size = sum(
+ candidate_stat["costs"][-1] for candidate_stat in self.candidate_stats.values()
+ )
  max_weight_size = total_weight_size * compression
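The scale mismatch described in this comment can be checked with simple arithmetic. The sketch below uses made-up numbers and assumes that only expert weights are involved and that they are evenly sharded across expert-parallel ranks. `EP_SIZE`, `LOCAL_EXPERT_WEIGHT`, `COMPRESSION`, and `effective_ratio` are hypothetical names for illustration only.

```python
def effective_ratio(budget: float, full_cost: float) -> float:
    """Fraction of the full (uncompressed) cost that the budget actually allows."""
    return budget / full_cost


EP_SIZE = 4                   # hypothetical expert-parallel world size
LOCAL_EXPERT_WEIGHT = 100.0   # hypothetical weight size held by one EP rank
COMPRESSION = 0.5             # requested compression target

# After EP sync, candidate costs cover all expert shards.
global_full_cost = LOCAL_EXPERT_WEIGHT * EP_SIZE

# Budget derived from the local shard only (the inconsistency flagged in the review).
local_budget = LOCAL_EXPERT_WEIGHT * COMPRESSION

# Budget derived on the same EP-synced scale as the costs.
synced_budget = global_full_cost * COMPRESSION

mismatched = effective_ratio(local_budget, global_full_cost)
consistent = effective_ratio(synced_budget, global_full_cost)
```

With these numbers the local budget permits only 12.5% of the global cost rather than the requested 50%, which matches the over-compression concern raised above.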
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main    #1512      +/-   ##
==========================================
- Coverage   76.93%   76.90%   -0.03%
==========================================
  Files         474      474
  Lines       51506    51502       -4
==========================================
- Hits        39625    39608      -17
- Misses      11881    11894      +13
What does this PR do?
Type of change: New feature
Add AutoQuant support for Megatron-Core (MCore) by supporting Expert Parallelism and NemotronH MoE models.
Usage
# Add a code snippet demonstrating how to use this
Testing
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
CONTRIBUTING.md: ✅ / ❌ / N/A
Additional Information
Summary by CodeRabbit