Conversation
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
📝 **Walkthrough**

The change adds per-expert weight initialization in a test file to diversify amax values across local experts by assigning unique weight constants to each expert's linear layers prior to quantization.
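The reason unique constants matter can be sketched outside of Megatron: if every expert's weights are filled with the same value, every expert calibrates to the same amax, while distinct per-expert constants give each expert a distinct amax. The snippet below is an illustrative stand-in using plain tensors and a hand-rolled amax, not the actual Megatron/ModelOpt calibration code.

```python
# Hedged sketch: why distinct per-expert constants diversify amax values.
# "amax" here is the per-tensor max-absolute value used by quantization
# calibration; the real Megatron/ModelOpt plumbing is assumed, not shown.
import torch

identical = [torch.full((8, 4), 0.1) for _ in range(3)]
distinct = [torch.full((8, 4), 0.1 + i * 0.05) for i in range(3)]

amax = lambda w: w.abs().max().item()
print([round(amax(w), 2) for w in identical])  # all experts share one amax
print([round(amax(w), 2) for w in distinct])   # each expert gets a unique amax
```

With identical fills all three experts report the same amax (0.1), while the staggered fills yield 0.1, 0.15, and 0.2, which is what the test relies on to detect sync issues.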
**Estimated code review effort:** 🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 **Pre-merge checks:** ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
🧹 Nitpick comments (1)
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py (1)
**771-772: Prefer `torch.no_grad()` over `.data` for test initialization.**

Lines 771–772 use `.data.fill_()` for weight initialization. The idiomatic PyTorch pattern is to wrap parameter mutations in a `torch.no_grad()` context and call `fill_()` directly on the parameter.

♻️ Proposed fix

```diff
-    for layer in model.decoder.layers:
-        for i, expert in enumerate(layer.mlp.experts.local_experts):
-            expert.linear_fc1.weight.data.fill_(0.1 + i * 0.05)
-            expert.linear_fc2.weight.data.fill_(0.2 + i * 0.05)
+    with torch.no_grad():
+        for layer in model.decoder.layers:
+            for i, expert in enumerate(layer.mlp.experts.local_experts):
+                expert.linear_fc1.weight.fill_(0.1 + i * 0.05)
+                expert.linear_fc2.weight.fill_(0.2 + i * 0.05)
```

🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed. In
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py around lines
771-772, the test initializes expert weights using the deprecated .data API;
wrap the two fill_() calls in a torch.no_grad() context and call fill_()
directly on the parameter tensors (expert.linear_fc1.weight and
expert.linear_fc2.weight) so the assignments occur without tracking gradients.
```
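As a self-contained illustration of the suggested pattern, the sketch below uses plain `nn.ModuleDict`/`nn.Linear` modules as stand-ins for Megatron's expert layers; the `linear_fc1`/`linear_fc2` names mirror the test, but the module structure here is invented for the example.

```python
# Minimal sketch of the reviewed pattern, using plain nn.Linear modules as
# stand-ins for Megatron's expert layers (structure is illustrative only).
import torch
import torch.nn as nn

experts = [
    nn.ModuleDict({
        "linear_fc1": nn.Linear(4, 8),
        "linear_fc2": nn.Linear(8, 4),
    })
    for _ in range(3)
]

# Mutate parameters inside no_grad() instead of touching .data, so the
# in-place fills are not tracked by autograd.
with torch.no_grad():
    for i, expert in enumerate(experts):
        expert["linear_fc1"].weight.fill_(0.1 + i * 0.05)
        expert["linear_fc2"].weight.fill_(0.2 + i * 0.05)

# Each expert now carries a distinct constant weight.
print([round(e["linear_fc1"].weight[0, 0].item(), 2) for e in experts])
# → [0.1, 0.15, 0.2]
```

The behavior is the same as the `.data.fill_()` version; the `no_grad()` form is preferred because `.data` bypasses autograd bookkeeping in ways that can hide bugs outside simple tests.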
**Codecov Report**

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #936      +/-   ##
==========================================
- Coverage   73.09%   72.17%   -0.92%
==========================================
  Files         205      207       +2
  Lines       22301    22656     +355
==========================================
+ Hits        16300    16352      +52
- Misses       6001     6304     +303
```

☔ View full report in Codecov by Sentry.
**What does this PR do?**

**Type of change:** Bug Fix

**Overview:** Fix the MOE layer sync test by initializing the weights in the MOE layer differently, so that amax values differ across local experts.
Link to bug
**Usage**

```python
# Add a code snippet demonstrating how to use this
```

**Testing**
https://github.com/NVIDIA/Model-Optimizer/actions/runs/22414958311
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Release Notes