Support AutoQuant in Megatron-Core #1512

Open
jenchen13 wants to merge 1 commit into main from jennifchen/mcore_autoquant

Contributor

@jenchen13 jenchen13 commented May 18, 2026

What does this PR do?

Type of change: New feature

Add autoquant support for MCore by supporting Expert Parallelism and NemotronH MoE models

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Bug Fixes
    • Fixed quantization synchronization for expert model parallelism in distributed training.
    • Extended quantization support for NemotronH mixture-of-experts models.

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested a review from a team as a code owner May 18, 2026 15:09
@jenchen13 jenchen13 requested a review from realAsma May 18, 2026 15:09
Contributor

coderabbitai Bot commented May 18, 2026

📝 Walkthrough

This PR extends quantization support for Mixture of Experts (MoE) models by updating distributed synchronization across expert parallel groups in recipe scoring and costing, and adding pattern matching to group NemotronH expert modules for consistent quantization recipe sharing.

Changes

MoE Quantization Expert Parallelism

Expert parallel group synchronization in recipe metrics
  File: modelopt/torch/quantization/algorithms.py
  Summary: QuantRecipeHparam.get_score() and QuantRecipeHparam.get_cost() now include expert_model_parallel_group when synchronizing distributed values, removing prior conditional expert parallelism handling.

NemotronH MoE expert module grouping pattern
  File: modelopt/torch/quantization/algorithms.py
  Summary: _AutoQuantizeBaseSearcher.quant_grouping_rules gains a regex pattern to match and group NemotronH MoE expert modules in MCore naming style (mlp.experts.local_experts.<idx>.(linear_fc1|linear_fc2)).
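
To make the grouping rule in the walkthrough concrete, here is a minimal sketch of how a regex can map MCore-style expert module names to a shared key. The pattern, the `group_key` and `group_modules` helpers, and the choice to place `linear_fc1` and `linear_fc2` of every local expert in one group are assumptions made for illustration; the actual rule added to `quant_grouping_rules` in this PR may differ.

```python
import re
from collections import defaultdict

# Hypothetical pattern for illustration only; not copied from the PR.
EXPERT_PATTERN = re.compile(
    r"^(.*?\.mlp\.experts)\.local_experts\.\d+\.(linear_fc1|linear_fc2)$"
)


def group_key(module_name):
    """Return a shared key for expert linears, or the name itself otherwise."""
    match = EXPERT_PATTERN.match(module_name)
    if match is None:
        return module_name
    # Dropping the expert index lets all experts of one layer share a key.
    return match.group(1)


def group_modules(names):
    """Bucket module names by their grouping key."""
    groups = defaultdict(list)
    for name in names:
        groups[group_key(name)].append(name)
    return dict(groups)


names = [
    "decoder.layers.3.mlp.experts.local_experts.0.linear_fc1",
    "decoder.layers.3.mlp.experts.local_experts.0.linear_fc2",
    "decoder.layers.3.mlp.experts.local_experts.1.linear_fc1",
    "decoder.layers.4.mlp.experts.local_experts.0.linear_fc1",
    "decoder.layers.3.self_attention.linear_qkv",
]
groups = group_modules(names)
for key, members in groups.items():
    print(key, len(members))
```

Modules that share a key would receive the same quantization recipe, which is the "consistent quantization recipe sharing" the walkthrough refers to.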

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Security Anti-Patterns | ❌ Error | # nosec comments used to bypass Bandit checks are not allowed per SECURITY.md. Also, trust_remote_code=True is hardcoded without caller-configurable parameters. | Remove all # nosec comments and require explicit codeowner approval. Make trust_remote_code=True configurable with False as default, or get approval from @NVIDIA/modelopt-setup-codeowners. |
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Support AutoQuant in Megatron-Core' directly aligns with the PR's main objective of adding AutoQuant support for Megatron-Core with expert parallelism capabilities. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
@github-actions
Contributor

PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1512/

Built to branch gh-pages at 2026-05-18 15:13 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/algorithms.py`:
- Around line 295-299: The solver output normalization and recipe selection need
to be synchronized across expert-parallel ranks: extend the
DistributedProcessGroup.get_dist_syncd_obj calls and any normalization/argmax
steps inside run_search() to include parallel_state.expert_model_parallel_group
(i.e., add expert_model_parallel_group to the group list used when
reducing/normalizing candidate stats and winner selection), and apply the same
change to the other occurrence around the importance aggregation (the block at
the other get_dist_syncd_obj call referenced). This ensures candidate stats and
the canonical winning recipe are identical across EP shards so tie-driven
divergence cannot produce different recipes per EP rank.
- Around line 318-320: run_search() currently computes max_weight_size from the
local shard via _get_total_weight_size(self.model.modules()), while costs were
made global with DistributedProcessGroup.get_dist_syncd_obj
(expert_model_parallel_group), causing an inconsistent budget; fix by deriving
the search budget from the same globally-synced weight_size used for costs —
call DistributedProcessGroup.get_dist_syncd_obj on the value returned by
_get_total_weight_size(self.model.modules()) (or otherwise scale it by the
expert_model_parallel_group world size) so max_weight_size is on the same
EP-synced scale as the candidate costs.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eb567d1f-0a05-4136-a751-1c8bdfdc1f46

📥 Commits

Reviewing files that changed from the base of the PR and between 7038dec and 0b6a7c9.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/algorithms.py

Comment on lines 295 to 299
         importance = DistributedProcessGroup.get_dist_syncd_obj(
             importance,
-            [parallel_state.tensor_parallel_group, parallel_state.data_parallel_group],
+            [parallel_state.tensor_parallel_group, parallel_state.data_parallel_group, parallel_state.expert_model_parallel_group],
             sum,
         )
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Canonicalize the winning recipe across expert-parallel ranks too.

These changes make candidate stats identical across EP shards, but run_search() still normalizes the solver output only over DP/TP. The comment at Lines 750-752 already calls out tie-driven divergence, so the same grouped MoE hparam can still end up with different recipes on different expert-parallel ranks.

Suggested fix
             best_format = DistributedProcessGroup.get_dist_syncd_obj(
                 best_hparam_recipe_info["format"],
-                [_ps.data_parallel_group, _ps.tensor_parallel_group],
+                [
+                    _ps.data_parallel_group,
+                    _ps.tensor_parallel_group,
+                    _ps.expert_model_parallel_group,
+                ],
                 lambda a: a[0],
             )

Also applies to: 318-320
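
The divergence described in this comment can be reproduced without any distributed runtime. The sketch below is a toy simulation: `sync_over_groups` is a hypothetical stand-in that assumes a multi-group sync gathers values within each group in turn and applies the reduction, and the rank layout and format names are invented for illustration. It is not the `DistributedProcessGroup.get_dist_syncd_obj` implementation.

```python
def sync_over_groups(values_by_rank, partitions, op):
    """Toy multi-group sync (assumed semantics, not the real API).

    partitions is a list of rank partitions, for example the TP groups and
    then the EP groups. Within each group, every rank replaces its value
    with op(list of the values gathered from that group).
    """
    values = dict(values_by_rank)
    for partition in partitions:
        for ranks in partition:
            reduced = op([values[r] for r in ranks])
            for r in ranks:
                values[r] = reduced
    return values


# Four ranks: TP pairs are (0, 1) and (2, 3); EP pairs are (0, 2) and (1, 3).
tp_groups = [[0, 1], [2, 3]]
ep_groups = [[0, 2], [1, 3]]

# Suppose a solver tie was broken differently on ranks 2 and 3.
local_choice = {0: "FORMAT_A", 1: "FORMAT_A", 2: "FORMAT_B", 3: "FORMAT_B"}


def take_first(gathered):
    """Canonicalize by taking the first rank's choice in each group."""
    return gathered[0]


tp_only = sync_over_groups(local_choice, [tp_groups], take_first)
tp_and_ep = sync_over_groups(local_choice, [tp_groups, ep_groups], take_first)

print(tp_only)    # ranks 2 and 3 still disagree with ranks 0 and 1
print(tp_and_ep)  # every rank holds the same format
```

Under these assumptions, syncing over TP alone leaves the two EP shards with different recipes, while adding the EP groups to the list makes the choice identical everywhere.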

Comment on lines 318 to +320
         weight_size = DistributedProcessGroup.get_dist_syncd_obj(
             weight_size,
-            [parallel_state.tensor_parallel_group],
+            [parallel_state.tensor_parallel_group, parallel_state.expert_model_parallel_group],
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the search budget on the same scale as the new EP-synced costs.

Line 320 makes expert-module costs global across expert_model_parallel_group, but run_search() still derives max_weight_size from the local shard via _get_total_weight_size(self.model.modules()). On EP runs, that mixes global candidate costs with a local budget, so the solver can over-compress and report the wrong effective bits.

Suggested fix
-        total_weight_size = self._get_total_weight_size(self.model.modules())
+        total_weight_size = sum(
+            candidate_stat["costs"][-1] for candidate_stat in self.candidate_stats.values()
+        )
         max_weight_size = total_weight_size * compression
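
A small worked example shows why the scale mismatch matters. The numbers, the 16-bit baseline, and the `highest_precision_fitting` helper are all hypothetical; the real search uses a solver over many modules, so this is only a sketch of the arithmetic for a single grouped expert hparam.

```python
def highest_precision_fitting(costs, budget):
    """Return the largest bit width whose cost fits the budget.

    Falls back to the smallest bit width if nothing fits.
    """
    fitting = [bits for bits, cost in costs.items() if cost <= budget]
    return max(fitting) if fitting else min(costs)


ep_size = 2                 # assumed expert-parallel world size
local_weight_size = 100.0   # weight size on one EP shard (arbitrary units)
global_weight_size = local_weight_size * ep_size  # summed over EP ranks

# EP-synced candidate costs, relative to a 16-bit baseline.
costs = {bits: global_weight_size * bits / 16 for bits in (4, 8, 16)}

compression = 0.5  # target: half of the 16-bit size

local_budget = local_weight_size * compression    # budget from the local shard
global_budget = global_weight_size * compression  # budget on the EP-synced scale

mixed_choice = highest_precision_fitting(costs, local_budget)
consistent_choice = highest_precision_fitting(costs, global_budget)

print(mixed_choice, consistent_choice)
```

With global costs and a local budget, only the 4-bit candidate fits, even though the requested compression corresponds to 8 bits. Putting the budget on the same EP-synced scale as the costs restores the intended choice.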
@codecov
codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.90%. Comparing base (7038dec) to head (0b6a7c9).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1512      +/-   ##
==========================================
- Coverage   76.93%   76.90%   -0.03%     
==========================================
  Files         474      474              
  Lines       51506    51502       -4     
==========================================
- Hits        39625    39608      -17     
- Misses      11881    11894      +13     
| Flag | Coverage Δ | |
| --- | --- | --- |
| examples | 41.73% <ø> | (+0.91%) ⬆️ |
| gpu | 59.74% <ø> | (-0.59%) ⬇️ |
| regression | 15.22% <ø> | (+0.07%) ⬆️ |
| unit | 52.65% <ø> | (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.
