Skip to content

Megatron-Bridge import example in launcher for Nemotron Super V3#1516

Open
jenchen13 wants to merge 1 commit into
mainfrom
jennifchen/mbridge_import
Open

Megatron-Bridge import example in launcher for Nemotron Super V3#1516
jenchen13 wants to merge 1 commit into
mainfrom
jennifchen/mbridge_import

Conversation

@jenchen13
Copy link
Copy Markdown
Contributor

@jenchen13 jenchen13 commented May 18, 2026

What does this PR do?

Type of change: New feature

Add Megatron-Bridge import example in launcher for Nemotron Super V3

Usage

# Usage:
#   export SLURM_HOST=<slurm-host>
#   export SLURM_ACCOUNT=<your-team>
#   export SLURM_PARTITION=<gpu-partition>   # default: batch
#   export SLURM_JOB_DIR=/home/scratch.<user>/experiments
#   export HF_TOKEN=<your-hf-token>          # gated model
#   cd tools/launcher
#   uv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_import.yaml --yes

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features
    • Added Megatron-Bridge checkpoint import capability from Hugging Face models on CPU-capable setups.
    • Added example pipeline configuration for importing NVIDIA Nemotron-3 model checkpoints via Slurm.
    • Updated launcher to support the Megatron-Bridge module with extended packaging patterns.

Review Change Stack

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested a review from ChenhanYu May 18, 2026 20:26
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

📝 Walkthrough

Walkthrough

This PR integrates Megatron-Bridge into the NVIDIA Model-Optimizer launcher infrastructure to enable checkpoint import from HuggingFace models. It adds Megatron-Bridge as a git submodule, provides a shell script to orchestrate the import process with environment setup, updates the launcher's packaging configuration to distribute Megatron-Bridge artifacts, and includes a YAML example for importing NVIDIA-Nemotron-3-Super-120B weights.

Changes

Megatron-Bridge Integration for Launcher

Layer / File(s) Summary
Submodule dependency setup
.gitmodules, tools/launcher/modules/Megatron-Bridge, tools/launcher/modules/Megatron-LM
Registers Megatron-Bridge as a new git submodule under tools/launcher/modules/ and updates both Megatron-Bridge and Megatron-LM submodule commit pointers.
Checkpoint import script
tools/launcher/common/megatron_bridge/import/import.sh
Implements a Bash script that validates required HF_MODEL_ID environment variable, derives output checkpoint directory name from the model ID, sets TORCH_DTYPE (default bfloat16), installs local megatron-bridge in editable mode if missing, optionally installs extra pip dependencies, prepends local Megatron-LM to PYTHONPATH, and executes the Megatron-Bridge conversion utility with assembled arguments.
Launcher packaging and example configuration
tools/launcher/launch.py, tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_import.yaml
Extends the launcher's pattern packager to include Megatron-Bridge source files, examples, pyproject.toml, and README. Provides a Slurm-based example job that invokes the import script for NVIDIA-Nemotron-3-Super-120B, passing model ID and output directory via environment variables, and allocates a single 8-GPU node with 4-hour time limit.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error PR contains 2 instances of # nosec comments in tools/launcher/launch.py (lines 33, 100) which violate security policy. No explicit @NVIDIA/modelopt-setup-codeowners approval found in PR description. Remove # nosec comments or add explicit justification and @NVIDIA/modelopt-setup-codeowners approval in PR description per SECURITY.md requirements.
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding a Megatron-Bridge import example to the launcher for Nemotron Super V3, which aligns with all file modifications.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jennifchen/mbridge_import

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
tools/launcher/launch.py (1)

74-74: ⚡ Quick win

Avoid hardcoding relative_path length.

This is easy to desync from include_pattern and can silently break packager mapping when entries change.

Proposed refactor
-packager = run.PatternPackager(
-    include_pattern=[
+_include_patterns = [
         "modules/Megatron-LM/megatron/*",
         "modules/Megatron-LM/examples/*",
         "modules/Megatron-LM/*.py",
         "modules/Megatron-Bridge/src/*",
         "modules/Megatron-Bridge/examples/*",
         "modules/Megatron-Bridge/pyproject.toml",
         "modules/Megatron-Bridge/README.md",
         "modules/Model-Optimizer/modelopt/*",
         "modules/Model-Optimizer/modelopt_recipes/*",
         "modules/Model-Optimizer/examples/*",
         "examples/*",
         "common/*",
-    ],
-    relative_path=[LAUNCHER_DIR] * 12,
+]
+
+packager = run.PatternPackager(
+    include_pattern=_include_patterns,
+    relative_path=[LAUNCHER_DIR] * len(_include_patterns),
 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/launch.py` at line 74, The hardcoded repeat count in the
relative_path assignment (relative_path=[LAUNCHER_DIR] * 12) can get out of sync
with include_pattern; change it to derive the repeat length from the
include_pattern (e.g. relative_path=[LAUNCHER_DIR] * len(include_pattern) or
equivalent) so that relative_path size always matches include_pattern length
(reference variables: relative_path, LAUNCHER_DIR, include_pattern).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/common/megatron_bridge/import/import.sh`:
- Around line 45-49: The current import check only verifies importability and
can pick up an older installed megatron.bridge; change the logic so we force
local resolution by verifying the imported module's file path is inside
BRIDGE_DIR (or else install). Replace the simple import test with a Python check
that imports megatron.bridge, inspects megatron.bridge.__file__ (or uses
importlib to get the module path), resolves it to an absolute path and exits
nonzero if that path does not start with the absolute BRIDGE_DIR; keep the pip
install -e "${BRIDGE_DIR}" step to run when the path check fails. This ensures
the script uses the submodule under BRIDGE_DIR rather than an unrelated
site-packages version.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_import.yaml`:
- Line 26: Replace the hardcoded OUTPUT_DIR value with a launcher interpolation
variable: remove "/scratchspace/megatron-bridge" and set OUTPUT_DIR to use the
shared launcher global var (e.g., <<global_vars.OUTPUT_DIR>>); update any
accompanying docs/examples to reference the global var if needed and ensure the
global_vars entry is defined in the launcher YAML set so the
megatron_bridge_import.yaml's OUTPUT_DIR uses <<global_vars.OUTPUT_DIR>> instead
of the literal path.

---

Nitpick comments:
In `@tools/launcher/launch.py`:
- Line 74: The hardcoded repeat count in the relative_path assignment
(relative_path=[LAUNCHER_DIR] * 12) can get out of sync with include_pattern;
change it to derive the repeat length from the include_pattern (e.g.
relative_path=[LAUNCHER_DIR] * len(include_pattern) or equivalent) so that
relative_path size always matches include_pattern length (reference variables:
relative_path, LAUNCHER_DIR, include_pattern).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c1a8d8d-8cbf-4959-8dc3-ee644e8af2e1

📥 Commits

Reviewing files that changed from the base of the PR and between 3d4f4cc and e5c6b67.

📒 Files selected for processing (6)
  • .gitmodules
  • tools/launcher/common/megatron_bridge/import/import.sh
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_import.yaml
  • tools/launcher/launch.py
  • tools/launcher/modules/Megatron-Bridge
  • tools/launcher/modules/Megatron-LM

Comment on lines +45 to +49
if ! python -c "import megatron.bridge" 2>/dev/null; then
echo "[INFO] Installing megatron-bridge from ${BRIDGE_DIR}"
unset PIP_CONSTRAINT
pip install -e "${BRIDGE_DIR}"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Force local Megatron-Bridge resolution to avoid version drift.

Line 45 only checks importability, so an older site-packages megatron.bridge can be used instead of this submodule, which can break the local conversion script/API contract.

Proposed fix
-if ! python -c "import megatron.bridge" 2>/dev/null; then
-    echo "[INFO] Installing megatron-bridge from ${BRIDGE_DIR}"
-    unset PIP_CONSTRAINT
-    pip install -e "${BRIDGE_DIR}"
-fi
+echo "[INFO] Installing megatron-bridge from ${BRIDGE_DIR}"
+unset PIP_CONSTRAINT
+pip install -e "${BRIDGE_DIR}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if ! python -c "import megatron.bridge" 2>/dev/null; then
echo "[INFO] Installing megatron-bridge from ${BRIDGE_DIR}"
unset PIP_CONSTRAINT
pip install -e "${BRIDGE_DIR}"
fi
echo "[INFO] Installing megatron-bridge from ${BRIDGE_DIR}"
unset PIP_CONSTRAINT
pip install -e "${BRIDGE_DIR}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/common/megatron_bridge/import/import.sh` around lines 45 - 49,
The current import check only verifies importability and can pick up an older
installed megatron.bridge; change the logic so we force local resolution by
verifying the imported module's file path is inside BRIDGE_DIR (or else
install). Replace the simple import test with a Python check that imports
megatron.bridge, inspects megatron.bridge.__file__ (or uses importlib to get the
module path), resolves it to an absolute path and exits nonzero if that path
does not start with the absolute BRIDGE_DIR; keep the pip install -e
"${BRIDGE_DIR}" step to run when the path check fails. This ensures the script
uses the submodule under BRIDGE_DIR rather than an unrelated site-packages
version.

script: common/megatron_bridge/import/import.sh
environment:
- HF_MODEL_ID: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- OUTPUT_DIR: /scratchspace/megatron-bridge
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace hardcoded OUTPUT_DIR with launcher interpolation.

Hardcoding /scratchspace/megatron-bridge makes the example non-portable across clusters/users; prefer a shared variable via launcher interpolation.

As per coding guidelines, tools/launcher/**/*.yaml should use <<global_vars.X>> interpolation for shared values.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_import.yaml`
at line 26, Replace the hardcoded OUTPUT_DIR value with a launcher interpolation
variable: remove "/scratchspace/megatron-bridge" and set OUTPUT_DIR to use the
shared launcher global var (e.g., <<global_vars.OUTPUT_DIR>>); update any
accompanying docs/examples to reference the global var if needed and ensure the
global_vars entry is defined in the launcher YAML set so the
megatron_bridge_import.yaml's OUTPUT_DIR uses <<global_vars.OUTPUT_DIR>> instead
of the literal path.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.98%. Comparing base (f5650bd) to head (e5c6b67).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1516      +/-   ##
==========================================
+ Coverage   76.95%   76.98%   +0.03%     
==========================================
  Files         474      474              
  Lines       51503    51503              
==========================================
+ Hits        39632    39649      +17     
+ Misses      11871    11854      -17     
Flag Coverage Δ
regression 15.21% <ø> (+0.18%) ⬆️
unit 52.63% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator

@kevalmorabia97 kevalmorabia97 May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this script need to be so complicated and why can we just run /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py directly? Can we assume this script is run in nemo:26.02 or later container hence all required dependencies already present?

environment:
- HF_MODEL_ID: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- OUTPUT_DIR: /scratchspace/megatron-bridge
- EXTRA_PIP_DEPS: "mamba-ssm causal-conv1d"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we use nemo container, then we wont need these extra deps

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants