3 changes: 3 additions & 0 deletions .gitmodules
@@ -1,3 +1,6 @@
[submodule "tools/launcher/modules/Megatron-LM"]
path = tools/launcher/modules/Megatron-LM
url = https://github.com/NVIDIA/Megatron-LM.git
[submodule "tools/launcher/modules/Megatron-Bridge"]
path = tools/launcher/modules/Megatron-Bridge
url = https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
74 changes: 74 additions & 0 deletions tools/launcher/common/megatron_bridge/import/import.sh
Collaborator

@kevalmorabia97 kevalmorabia97 May 20, 2026

Why does this script need to be so complicated, and why can't we just run /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py directly? Can we assume this script runs in the nemo:26.02 or later container, so all required dependencies are already present?
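
For reference, the direct invocation this comment asks about could look roughly like the sketch below. It assumes the container ships Megatron-Bridge at /opt/Megatron-Bridge (the path quoted in the comment) with all dependencies preinstalled, and it reuses the HF_MODEL_ID, OUTPUT_DIR, and TORCH_DTYPE variables and the three flags from import.sh. `BRIDGE_ROOT`, `IMPORT_CMD`, and `build_import_cmd` are hypothetical names, not part of the PR.

```shell
#!/bin/bash
# Sketch only: assumes Megatron-Bridge and its dependencies are preinstalled
# in the container under /opt/Megatron-Bridge (path taken from the comment).
BRIDGE_ROOT="${BRIDGE_ROOT:-/opt/Megatron-Bridge}"  # hypothetical override

# Hypothetical helper: fills the global array IMPORT_CMD with the conversion
# command, applying the same defaults as import.sh (cwd and bfloat16).
build_import_cmd() {
    local hf_model_id="$1"
    local output_dir="${2:-$(pwd)}"
    local torch_dtype="${3:-bfloat16}"
    IMPORT_CMD=(
        python "${BRIDGE_ROOT}/examples/conversion/convert_checkpoints.py" import
        --hf-model "${hf_model_id}"
        --megatron-path "${output_dir}/$(basename "${hf_model_id}")-MCore"
        --torch-dtype "${torch_dtype}"
    )
}

# Run the conversion only when a model ID is provided.
if [[ -n "${HF_MODEL_ID:-}" ]]; then
    build_import_cmd "${HF_MODEL_ID}" "${OUTPUT_DIR:-}" "${TORCH_DTYPE:-}"
    mkdir -p "${OUTPUT_DIR:-$(pwd)}"
    exec "${IMPORT_CMD[@]}"
fi
```

Building the command as an array keeps the quoting intact for model IDs or paths that contain spaces. Whether the conversion script can be run from outside its repository directory is not confirmed by the PR; import.sh changes into the Megatron-Bridge directory first.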

@@ -0,0 +1,74 @@
#!/bin/bash

# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Megatron-Bridge HF -> Megatron checkpoint import (CPU-capable).
#
# Required env: HF_MODEL_ID (e.g. nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16)
# Optional env:
#   OUTPUT_DIR   Parent dir for the MCore checkpoint (default: cwd).
#   TORCH_DTYPE  Model dtype for HF load (default: bfloat16).
#
# Writes MCore checkpoint to ${OUTPUT_DIR}/<basename(HF_MODEL_ID)>-MCore
#
# Runs:
#   python examples/conversion/convert_checkpoints.py import \
#       --hf-model $HF_MODEL_ID \
#       --megatron-path $OUTPUT_DIR/<model>-MCore \
#       --torch-dtype $TORCH_DTYPE

set -e

if [[ -z "${HF_MODEL_ID}" ]]; then
    echo "[ERROR] HF_MODEL_ID is required" >&2
    exit 1
fi

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
LAUNCHER_DIR="${SCRIPT_DIR}/../../.."
BRIDGE_DIR="${LAUNCHER_DIR}/modules/Megatron-Bridge"
MLM_DIR="${LAUNCHER_DIR}/modules/Megatron-LM"

if ! python -c "import megatron.bridge" 2>/dev/null; then
    echo "[INFO] Installing megatron-bridge from ${BRIDGE_DIR}"
    unset PIP_CONSTRAINT
    pip install -e "${BRIDGE_DIR}"
fi
Comment on lines +45 to +49
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Force local Megatron-Bridge resolution to avoid version drift.

Line 45 only checks importability, so an older site-packages megatron.bridge can be used instead of this submodule, which can break the local conversion script/API contract.

Proposed fix
-if ! python -c "import megatron.bridge" 2>/dev/null; then
-    echo "[INFO] Installing megatron-bridge from ${BRIDGE_DIR}"
-    unset PIP_CONSTRAINT
-    pip install -e "${BRIDGE_DIR}"
-fi
+echo "[INFO] Installing megatron-bridge from ${BRIDGE_DIR}"
+unset PIP_CONSTRAINT
+pip install -e "${BRIDGE_DIR}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/common/megatron_bridge/import/import.sh` around lines 45 - 49,
The current import check only verifies importability and can pick up an older
installed megatron.bridge; change the logic so we force local resolution by
verifying the imported module's file path is inside BRIDGE_DIR (or else
install). Replace the simple import test with a Python check that imports
megatron.bridge, inspects megatron.bridge.__file__ (or uses importlib to get the
module path), resolves it to an absolute path and exits nonzero if that path
does not start with the absolute BRIDGE_DIR; keep the pip install -e
"${BRIDGE_DIR}" step to run when the path check fails. This ensures the script
uses the submodule under BRIDGE_DIR rather than an unrelated site-packages
version.
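
The path check described above can be sketched as follows. This is a minimal sketch, not the PR's code: `path_is_under` and `bridge_origin` are hypothetical helpers, `readlink -m` assumes GNU coreutils, and the install step runs only when `BRIDGE_DIR` is set.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if path $1 is located inside directory $2.
# Both paths are canonicalized first, so ".." segments and symlinks do not
# cause false mismatches.
path_is_under() {
    local child parent
    child="$(readlink -m -- "$1")"
    parent="$(readlink -m -- "$2")"
    [[ "${child}" == "${parent}/"* ]]
}

# Hypothetical helper: prints the file that `import megatron.bridge` would
# load, or an empty line if the module cannot be resolved.
bridge_origin() {
    python - <<'PY' 2>/dev/null || true
import importlib.util

try:
    spec = importlib.util.find_spec("megatron.bridge")
except ImportError:
    spec = None
# Namespace packages have no origin; treat them as "not resolved".
print(spec.origin if spec and spec.origin else "")
PY
}

# Install from the submodule unless megatron.bridge already resolves to it.
if [[ -n "${BRIDGE_DIR:-}" ]]; then
    origin="$(bridge_origin)"
    if [[ -z "${origin}" ]] || ! path_is_under "${origin}" "${BRIDGE_DIR}"; then
        echo "[INFO] Installing megatron-bridge from ${BRIDGE_DIR}"
        unset PIP_CONSTRAINT
        pip install -e "${BRIDGE_DIR}"
    fi
fi
```

Compared with the unconditional install in the proposed fix, this variant skips the `pip install -e` step when the submodule is already the resolved copy, at the cost of one extra Python start-up.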


if [[ -n "${EXTRA_PIP_DEPS}" ]]; then
    echo "[INFO] Installing extra deps: ${EXTRA_PIP_DEPS}"
    unset PIP_CONSTRAINT
    read -r -a _deps <<< "${EXTRA_PIP_DEPS}"
    # --no-build-isolation: mamba-ssm/causal-conv1d need torch visible at build time.
    pip install --no-build-isolation "${_deps[@]}"
fi

# Megatron-Bridge needs newer megatron.core (incl. megatron.core.distributed.fsdp).
# Prepend local Megatron-LM to PYTHONPATH so its sources shadow installed megatron-core.
export PYTHONPATH="${MLM_DIR}:${PYTHONPATH}"

OUTPUT_DIR="${OUTPUT_DIR:-$(pwd)}"
MODEL_NAME="$(basename "${HF_MODEL_ID}")"
MEGATRON_PATH="${OUTPUT_DIR}/${MODEL_NAME}-MCore"
TORCH_DTYPE="${TORCH_DTYPE:-bfloat16}"

mkdir -p "${OUTPUT_DIR}"

cd "${BRIDGE_DIR}"
exec python examples/conversion/convert_checkpoints.py import \
    --hf-model "${HF_MODEL_ID}" \
    --megatron-path "${MEGATRON_PATH}" \
    --torch-dtype "${TORCH_DTYPE}"
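
The `PYTHONPATH` prepend in the script above produces a trailing `:` when `PYTHONPATH` starts out unset or empty. A minimal sketch of a prepend that avoids the empty entry, using the `${VAR:+...}` expansion, is shown below; `prepend_pythonpath` is a hypothetical helper, not part of the PR.

```shell
#!/bin/bash
# Hypothetical helper: put directory $1 at the front of PYTHONPATH.
prepend_pythonpath() {
    local dir="$1"
    # ${PYTHONPATH:+:${PYTHONPATH}} expands to ":<old value>" only when
    # PYTHONPATH is set and non-empty, so no dangling ":" is produced.
    export PYTHONPATH="${dir}${PYTHONPATH:+:${PYTHONPATH}}"
}
```

In import.sh this would be called as `prepend_pythonpath "${MLM_DIR}"`. The form also stays valid if the script later enables `set -u`, since the expansion does not fail on an unset variable.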
34 changes: 34 additions & 0 deletions tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_import.yaml
@@ -0,0 +1,34 @@
# Megatron-Bridge import for nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16.
#
# Imports HF weights to a Megatron-LM checkpoint via AutoBridge.import_ckpt
# (use_cpu_initialization=True). Uses a single 8xH100 Slurm node — Megatron-Bridge
# requires at least 1 GPU for nccl init even with CPU-resident weights.
#
# Usage:
# export SLURM_HOST=<slurm-host>
# export SLURM_ACCOUNT=<your-team>
# export SLURM_PARTITION=<gpu-partition> # default: batch
# export SLURM_JOB_DIR=/home/scratch.<user>/experiments
# export HF_TOKEN=<your-hf-token> # gated model
# cd tools/launcher
# uv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_import.yaml --yes

job_name: Nemotron-3-Super-120B_bridge_import
pipeline:
skip: false
allow_to_fail: false
note: "HF -> MCore import via Megatron-Bridge (8xH100)"

task_0:
script: common/megatron_bridge/import/import.sh
environment:
- HF_MODEL_ID: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- OUTPUT_DIR: /scratchspace/megatron-bridge
Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace hardcoded OUTPUT_DIR with launcher interpolation.

Hardcoding /scratchspace/megatron-bridge makes the example non-portable across clusters/users; prefer a shared variable via launcher interpolation.

As per coding guidelines, tools/launcher/**/*.yaml should use <<global_vars.X>> interpolation for shared values.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_import.yaml`
at line 26, Replace the hardcoded OUTPUT_DIR value with a launcher interpolation
variable: remove "/scratchspace/megatron-bridge" and set OUTPUT_DIR to use the
shared launcher global var (e.g., <<global_vars.OUTPUT_DIR>>); update any
accompanying docs/examples to reference the global var if needed and ensure the
global_vars entry is defined in the launcher YAML set so the
megatron_bridge_import.yaml's OUTPUT_DIR uses <<global_vars.OUTPUT_DIR>> instead
of the literal path.

- EXTRA_PIP_DEPS: "mamba-ssm causal-conv1d"
Collaborator


If we use the NeMo container, we won't need these extra deps.

slurm_config:
_factory_: "slurm_factory"
partition: batch
nodes: 1
ntasks_per_node: 1
gpus_per_node: 8
time: "04:00:00"
6 changes: 5 additions & 1 deletion tools/launcher/launch.py
@@ -61,13 +61,17 @@
 "modules/Megatron-LM/megatron/*",
 "modules/Megatron-LM/examples/*",
 "modules/Megatron-LM/*.py",
+"modules/Megatron-Bridge/src/*",
+"modules/Megatron-Bridge/examples/*",
+"modules/Megatron-Bridge/pyproject.toml",
+"modules/Megatron-Bridge/README.md",
 "modules/Model-Optimizer/modelopt/*",
 "modules/Model-Optimizer/modelopt_recipes/*",
 "modules/Model-Optimizer/examples/*",
 "examples/*",
 "common/*",
 ],
-relative_path=[LAUNCHER_DIR] * 8,
+relative_path=[LAUNCHER_DIR] * 12,
 )

MODELOPT_SRC_PATH = os.path.join(LAUNCHER_DIR, "modules/Model-Optimizer/modelopt")
1 change: 1 addition & 0 deletions tools/launcher/modules/Megatron-Bridge
Submodule Megatron-Bridge added at 6f24c7
2 changes: 1 addition & 1 deletion tools/launcher/modules/Megatron-LM
Submodule Megatron-LM updated 1311 files