Support AutoQuant in Megatron-Core by jenchen13 · Pull Request #1512 · NVIDIA/Model-Optimizer

jenchen13 · 2026-05-18T15:09:31Z

What does this PR do?

Type of change: New feature

Add autoquant support for MCore by supporting Expert Parallelism and NemotronH MoE models

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Bug Fixes
- Fixed quantization synchronization for expert model parallelism in distributed training.
- Extended quantization support for NemotronH mixture-of-experts models.

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

coderabbitai · 2026-05-18T15:09:45Z

📝 Walkthrough

This PR extends quantization support for Mixture of Experts (MoE) models by updating distributed synchronization across expert parallel groups in recipe scoring and costing, and adding pattern matching to group NemotronH expert modules for consistent quantization recipe sharing.

Changes

MoE Quantization Expert Parallelism

| Layer / File(s) | Summary |
| --- | --- |
| Expert parallel group synchronization in recipe metrics `modelopt/torch/quantization/algorithms.py` | `QuantRecipeHparam.get_score()` and `QuantRecipeHparam.get_cost()` now include `expert_model_parallel_group` when synchronizing distributed values, removing prior conditional expert parallelism handling. |
| NemotronH MoE expert module grouping pattern `modelopt/torch/quantization/algorithms.py` | `_AutoQuantizeBaseSearcher.quant_grouping_rules` gains a regex pattern to match and group NemotronH MoE expert modules in MCore naming style (`mlp.experts.local_experts.<idx>.(linear_fc1\|linear_fc2)`). |
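
To illustrate the grouping idea in the second row, here is a minimal sketch of how a regex could map expert module names to a shared group key. The pattern `EXPERT_PATTERN`, the helpers `group_key` and `group_modules`, and the sample module names are all hypothetical, written only from the naming style quoted above; the actual rule added to `quant_grouping_rules` may differ, including in whether `linear_fc1` and `linear_fc2` share one group.

```python
import re
from collections import defaultdict

# Hypothetical pattern based on the MCore naming style quoted in the summary;
# the regex actually added by the PR is not shown on this page.
EXPERT_PATTERN = re.compile(
    r"^(.*?)\.mlp\.experts\.local_experts\.\d+\.(linear_fc1|linear_fc2)$"
)


def group_key(module_name: str) -> str:
    """Return a shared key for expert linears of one MoE layer, else the name itself."""
    match = EXPERT_PATTERN.match(module_name)
    if match is None:
        return module_name
    # Drop the expert index and the fc1/fc2 suffix (an assumption of this sketch)
    # so every expert linear in the layer maps to the same key.
    return f"{match.group(1)}.mlp.experts"


def group_modules(names):
    """Bucket module names by their group key."""
    groups = defaultdict(list)
    for name in names:
        groups[group_key(name)].append(name)
    return dict(groups)


# Illustrative module names, not taken from a real checkpoint.
names = [
    "decoder.layers.3.mlp.experts.local_experts.0.linear_fc1",
    "decoder.layers.3.mlp.experts.local_experts.0.linear_fc2",
    "decoder.layers.3.mlp.experts.local_experts.1.linear_fc1",
    "decoder.layers.4.mlp.experts.local_experts.0.linear_fc1",
    "decoder.layers.4.mixer.in_proj",
]
groups = group_modules(names)
```

Modules that land in the same bucket would be assigned one quantization recipe together, which is what makes the choice consistent across the experts of a layer.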

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Security Anti-Patterns | ❌ Error | `# nosec` comments used to bypass Bandit checks are not allowed per SECURITY.md. Also, `trust_remote_code=True` is hardcoded without caller-configurable parameters. | Remove all `# nosec` comments and require explicit codeowner approval. Make `trust_remote_code=True` configurable with `False` as default, or get approval from `@NVIDIA/modelopt-setup-codeowners`. |
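
As a rough illustration of the "configurable with False as default" resolution, the sketch below exposes the setting as an opt-in command-line flag. The parser, its description, and the flag name are hypothetical and are not taken from this repository; the check output does not say which file triggered the error.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical model-loading script."""
    parser = argparse.ArgumentParser(description="Hypothetical model-loading script")
    parser.add_argument(
        "--trust_remote_code",
        action="store_true",  # defaults to False unless the caller passes the flag
        help="Opt in to executing custom code shipped with the checkpoint.",
    )
    return parser


# Without the flag the value stays False; the caller must opt in explicitly.
default_args = build_parser().parse_args([])
opt_in_args = build_parser().parse_args(["--trust_remote_code"])
```

The parsed value would then be forwarded to the loader instead of a hardcoded `True`.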

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Support AutoQuant in Megatron-Core' directly aligns with the PR's main objective of adding AutoQuant support for Megatron-Core with expert parallelism capabilities. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

github-actions · 2026-05-18T15:13:48Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1512/
Built to branch `gh-pages` at 2026-05-18 15:13 UTC. Preview will be ready when the GitHub Pages deployment is complete.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/algorithms.py`:
- Around line 295-299: The solver output normalization and recipe selection need
to be synchronized across expert-parallel ranks: extend the
DistributedProcessGroup.get_dist_syncd_obj calls and any normalization/argmax
steps inside run_search() to include parallel_state.expert_model_parallel_group
(i.e., add expert_model_parallel_group to the group list used when
reducing/normalizing candidate stats and winner selection), and apply the same
change to the other occurrence around the importance aggregation (the block at
the other get_dist_syncd_obj call referenced). This ensures candidate stats and
the canonical winning recipe are identical across EP shards so tie-driven
divergence cannot produce different recipes per EP rank.
- Around line 318-320: run_search() currently computes max_weight_size from the
local shard via _get_total_weight_size(self.model.modules()), while costs were
made global with DistributedProcessGroup.get_dist_syncd_obj
(expert_model_parallel_group), causing an inconsistent budget; fix by deriving
the search budget from the same globally-synced weight_size used for costs —
call DistributedProcessGroup.get_dist_syncd_obj on the value returned by
_get_total_weight_size(self.model.modules()) (or otherwise scale it by the
expert_model_parallel_group world size) so max_weight_size is on the same
EP-synced scale as the candidate costs.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eb567d1f-0a05-4136-a751-1c8bdfdc1f46

📥 Commits

Reviewing files that changed from the base of the PR and between 7038dec and 0b6a7c9.

📒 Files selected for processing (1)

modelopt/torch/quantization/algorithms.py

coderabbitai · 2026-05-18T15:15:02Z

            importance = DistributedProcessGroup.get_dist_syncd_obj(
                importance,
-                [parallel_state.tensor_parallel_group, parallel_state.data_parallel_group],
+                [parallel_state.tensor_parallel_group, parallel_state.data_parallel_group, parallel_state.expert_model_parallel_group],
                sum,
            )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Canonicalize the winning recipe across expert-parallel ranks too.

These changes make candidate stats identical across EP shards, but run_search() still normalizes the solver output only over DP/TP. The comment at Lines 750-752 already calls out tie-driven divergence, so the same grouped MoE hparam can still end up with different recipes on different expert-parallel ranks.

Suggested fix

 best_format = DistributedProcessGroup.get_dist_syncd_obj(
     best_hparam_recipe_info["format"],
-    [_ps.data_parallel_group, _ps.tensor_parallel_group],
+    [
+        _ps.data_parallel_group,
+        _ps.tensor_parallel_group,
+        _ps.expert_model_parallel_group,
+    ],
     lambda a: a[0],
 )

Also applies to: 318-320
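
The divergence described in this comment can be sketched with a small pure-Python simulation. It assumes, without confirmation from the source, that synchronizing over a group behaves like gathering one value per rank and applying the reduce function to the gathered list; `sync_across_group`, the four-rank layout, and the format names are all hypothetical.

```python
def sync_across_group(values_by_rank, group, reduce_fn):
    """Simulate gathering one value per rank in `group` and reducing the gathered list.

    This is an assumed model of the sync helper, not its real implementation.
    """
    gathered = [values_by_rank[rank] for rank in group]
    reduced = reduce_fn(gathered)
    return {rank: reduced for rank in group}


# Hypothetical layout: TP groups (0, 1) and (2, 3); EP groups (0, 2) and (1, 3).
# A tie in the solver is broken differently on the two expert-parallel shards.
local_choice = {0: "FP8", 1: "FP8", 2: "NVFP4", 3: "NVFP4"}

# Syncing only within the TP groups leaves the EP shards disagreeing.
after_tp = {}
for group in [(0, 1), (2, 3)]:
    after_tp.update(sync_across_group(local_choice, group, lambda a: a[0]))

# Adding a sync over the EP groups makes every rank adopt one canonical choice.
after_ep = {}
for group in [(0, 2), (1, 3)]:
    after_ep.update(sync_across_group(after_tp, group, lambda a: a[0]))
```

Here `after_tp` still holds two distinct formats, while `after_ep` holds a single one on all four ranks.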

coderabbitai · 2026-05-18T15:15:02Z

            weight_size = DistributedProcessGroup.get_dist_syncd_obj(
                weight_size,
-                [parallel_state.tensor_parallel_group],
+                [parallel_state.tensor_parallel_group, parallel_state.expert_model_parallel_group],


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the search budget on the same scale as the new EP-synced costs.

Line 320 makes expert-module costs global across expert_model_parallel_group, but run_search() still derives max_weight_size from the local shard via _get_total_weight_size(self.model.modules()). On EP runs, this mixes global candidate costs with a local budget, so the solver can over-compress and report the wrong effective bits.

Suggested fix

-        total_weight_size = self._get_total_weight_size(self.model.modules())
+        total_weight_size = sum(
+            candidate_stat["costs"][-1] for candidate_stat in self.candidate_stats.values()
+        )
         max_weight_size = total_weight_size * compression
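
A small worked example may make the scale mismatch concrete. It assumes the EP sync sums per-rank weight sizes (the reduce function is not visible in the diff above), considers expert weights only, and uses made-up numbers; `max_weight_size` here is a hypothetical helper mirroring the `total_weight_size * compression` line.

```python
def max_weight_size(total_weight_size: float, compression: float) -> float:
    """Budget as in the quoted line: total weight size times the compression ratio."""
    return total_weight_size * compression


ep_size = 4                  # hypothetical expert-parallel world size
local_expert_weight = 100.0  # expert weight size held by one rank, arbitrary units
compression = 0.5            # requested compression ratio

# Candidate costs after the EP sync, assuming a sum over the EP group.
global_expert_cost = local_expert_weight * ep_size

# Budget derived from the local shard versus from the synced total.
local_budget = max_weight_size(local_expert_weight, compression)
global_budget = max_weight_size(global_expert_cost, compression)

# Comparing global costs against the local budget shrinks the effective ratio.
effective_ratio = local_budget / global_expert_cost
```

With these numbers the local budget is 50 against a global cost of 400, so the solver would be held to an effective ratio of 0.125 instead of the requested 0.5.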

codecov · 2026-05-18T15:23:51Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.90%. Comparing base (7038dec) to head (0b6a7c9).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1512      +/-   ##
==========================================
- Coverage   76.93%   76.90%   -0.03%     
==========================================
  Files         474      474              
  Lines       51506    51502       -4     
==========================================
- Hits        39625    39608      -17     
- Misses      11881    11894      +13

| Flag | Coverage Δ | |
| --- | --- | --- |
| examples | `41.73% <ø> (+0.91%)` | ⬆️ |
| gpu | `59.74% <ø> (-0.59%)` | ⬇️ |
| regression | `15.22% <ø> (+0.07%)` | ⬆️ |
| unit | `52.65% <ø> (+<0.01%)` | ⬆️ |

support mcore autoquant

0b6a7c9

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>

jenchen13 requested a review from a team as a code owner May 18, 2026 15:09

jenchen13 requested a review from realAsma May 18, 2026 15:09

jenchen13 wants to merge 1 commit into main from jennifchen/mcore_autoquant