Adding a megatron-bridge sample by allela-roy · Pull Request #1065 · awslabs/awsome-distributed-ai

allela-roy · 2026-04-27T12:29:47Z

Purpose

Adding a new Megatron-Bridge sample for distributed pretraining of Qwen 3 models on SMHP EKS

Changes

Added corresponding sample files

Test Plan

Ran the full workflow: built and pushed the container image to ECR, downloaded Qwen3-0.6B weights to FSx, and launched a 2-node distributed pretraining job on a SMHP EKS cluster.

Environment:

AWS Service: SageMaker HyperPod EKS
EKS version: 1.33
Instance type: ml.p5.48xlarge
Number of nodes: 2
AMI: Amazon Linux 2023 (ami-0399b1f4b455cb701)

Test commands:

# Set environment variables
export AWS_REGION=us-east-1
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export REPO_URI=${REGISTRY}megatron-bridge-qwen3:latest

# Build and push image
docker build -f aws-megatron-bridge.Dockerfile -t ${REPO_URI} .
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${REGISTRY}
docker image push ${REPO_URI}

# Deploy training job (Qwen3-0.6B, 2 nodes, TP=1, PP=1, 10 iterations)
export NUM_NODES=2 GPU_PER_NODE=8 EFA_PER_NODE=32 FI_PROVIDER=efa
export MODEL_SIZE=0.6b TENSOR_PARALLEL=1 PIPELINE_PARALLEL=1
export SEQ_LENGTH=2048 GLOBAL_BATCH_SIZE=16 MICRO_BATCH_SIZE=1 TRAIN_ITERS=10

envsubst '$REPO_URI $NUM_NODES $GPU_PER_NODE $EFA_PER_NODE $FI_PROVIDER $MODEL_SIZE $TRAIN_ITERS $SEQ_LENGTH $GLOBAL_BATCH_SIZE $MICRO_BATCH_SIZE $TENSOR_PARALLEL $PIPELINE_PARALLEL' \
  < kubernetes/qwen3/manifests/pytorchjob.yaml-template | kubectl apply -f -

# Monitor
kubectl logs -f megatron-bridge-qwen3-worker-0
kubectl get pytorchjob megatron-bridge-qwen3


## Test Results

**Job status:** Succeeded (2/2 workers completed)

**EFA verification**

NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.18.0
NCCL INFO NET/OFI Using Libfabric version 2.4
NCCL INFO NET/OFI Using transport protocol RDMA (platform set)
NCCL INFO NET/OFI Selected provider is efa, fabric is efa-direct (found 32 nics)
NCCL INFO NET/OFI Configuring AWS-specific options
NCCL INFO NET/OFI Internode latency set at 75.0 us
NCCL INFO TUNER/Plugin: Using nccl_ofi_tuner (v3)
NCCL INFO NET/OFI Region base Tuner is chosen for platform: p5.48xlarge


**Training output** (Qwen3-0.6B, 16x H100, TP=1 PP=1, mock data):

[Megatron-Bridge] Qwen3-0.6B pretraining
[Megatron-Bridge] World size: 16, TP=1, PP=1
[Megatron-Bridge] Model built: 0.60B params, 0.60B trainable

[Megatron-Bridge] Training complete!


## Directory Structure

<!-- If adding or updating a test case, ensure it follows the expected layout below. -->

3.test_cases/
└── / # e.g. pytorch, megatron, jax
└── / # e.g. picotron, FSDP, megatron-lm
└── / # e.g. SmolLM-1.7B (may be omitted for single-model cases)
├── Dockerfile # Container / environment setup
├── README.md # Overview, prerequisites, usage
├── slurm/ # Slurm-specific launch scripts
├── kubernetes/ # Kubernetes manifests
└── hyperpod-eks/ # HyperPod EKS instructions


- Top-level files (`Dockerfile`, `README.md`, training scripts, configs) cover general setup.
- Subdirectories (`slurm/`, `kubernetes/`, `hyperpod-eks/`) contain service-specific launch instructions.
- Not all service subdirectories are required — include only the ones relevant to your test case.

## Checklist

- [X] I have read the [contributing guidelines](https://github.com/awslabs/awsome-distributed-training/blob/main/CONTRIBUTING.md).
- [X ] I am working against the latest `main` branch.
- [X ] I have searched existing open and recently merged PRs to confirm this is not a duplicate.
- [X ] The contribution is self-contained with documentation and scripts.
- [X ] External dependencies are pinned to a specific version or tag (no `latest`).
- [X] A README is included or updated with prerequisites, instructions, and known issues.
- [X] New test cases follow the [expected directory structure](#directory-structure).

paragao

Review: PR #1065 — Adding a megatron-bridge sample

Thanks for this contribution. A few observations:

What's done well

Security: HF token is correctly sourced from a Kubernetes Secret (hf-token) — no hardcoded credentials anywhere.
HPC/EFA config: The NCCL environment variables are comprehensive and correct for p5.48xlarge (32 EFA adapters, FI_EFA_USE_DEVICE_RDMA=1, NCCL_SOCKET_IFNAME=^lo, trace buffer/dump for debugging hangs).
Documentation: The README is thorough — prerequisites, parallelism recommendation table, validated output, and cleanup instructions.
Megatron-Core patterns: The main_grad allocation and gradient copy logic is correct.
Fault tolerance: Elastic policy with maxRestarts: 100 + restartPolicy: OnFailure is well configured.

Blocking — please address before merge

1. Megatron-LM dependency is not pinned (aws-megatron-bridge.Dockerfile, line 113)

pip install --no-cache-dir --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@main

This installs from @main, which can change at any time. The repo's contribution checklist requires: "External dependencies are pinned to a specific version or tag (no latest)".

Fix: Pin to the current commit hash:

pip install --no-cache-dir --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@<commit_hash>

2. Training loop only supports PP=1, but README recommends PP=2 for Qwen3-32B (pretrain_qwen3.py, line 176)

output = model[0](input_ids, position_ids, attention_mask)

With pipeline parallelism > 1, Megatron-Core returns multiple model chunks (one per stage). Only calling model[0] means stage-1 workers never receive activations — the job will hang or produce incorrect results. The README's Qwen3-32B row recommends PP=2, so users following that guidance would hit a broken run.

Fix (option A): Add a guard and update the README:

assert len(model) == 1, "This sample only supports pipeline_model_parallel_size=1"

Then change the Qwen3-32B recommendation to PP=1 (adjusting TP/CP to compensate).

Fix (option B): Implement pipeline-parallel scheduling using Megatron-Core's forward_backward_no_pipelining / forward_backward_pipelining_with_interleaving from megatron.core.pipeline_parallel.schedules.

Non-blocking suggestions

Error handling: Wrap the training entrypoint in try/finally to ensure dist.destroy_process_group() is called on failure.
etcd resource limits: The etcd container in pytorchjob.yaml-template has no CPU/memory requests or limits. Add them for scheduling predictability.
etcd version: 3.4.13-0 is from 2020. Consider upgrading to a 3.5.x release for security patches.
Unused global_batch_size argument: The arg is accepted but the training loop doesn't implement gradient accumulation. A comment clarifying this is a simplified mock loop would prevent confusion.
SSH setup may be unnecessary: Since rendezvous uses torchrun/etcd (not MPI), the SSH key generation and config weakening in the Dockerfile add attack surface without benefit. Consider removing lines 60-71.
README validated output: Consider replacing specific node names (hyperpod-i-008789534bbb4c33f) and internal IPs with placeholders for a cleaner public sample.

KeitaW

Review Batch 1/5 — Structure & Repository Hygiene

Thanks for the contribution — overall this is a thorough sample with a great
README. The findings below are split across five themed batches; this one
covers structure and repo conventions. Subsequent batches will cover
deployment pipeline, training code, infrastructure/NCCL, and documentation.

Directory layout puts the model below `kubernetes/` instead of above it

The repo convention (and the PR template you ticked) is
<framework>/<library>/<model>/{Dockerfile, README, slurm/, kubernetes/, hyperpod-eks/},
i.e. model directory above platform directories. The sibling megatron-lm and
nemo test cases follow that layout. Here the structure is inverted:
<library>/kubernetes/<model>/, which means the training script
pretrain_qwen3.py lives at kubernetes/qwen3/pretrain_qwen3.py even though
it's platform-agnostic Python that torchrun can invoke from Slurm too.

I'd suggest restructuring to
3.test_cases/megatron/megatron-bridge/qwen3/{Dockerfile, README.md, kubernetes/manifests/...}
so that adding a Slurm path later (or another model) doesn't require
re-locating the script. If you'd rather keep the current layout because the
sample is EKS-only today, it might still be worth pulling pretrain_qwen3.py
up to the library level and leaving only manifests under kubernetes/.

KeitaW · 2026-04-29T09:49:16Z

+export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
+export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
+export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
+export IMAGE_TAG=latest


:latest shown in user-facing build commands

The README sets IMAGE_TAG=latest here and prints ${REGISTRY}megatron-bridge-qwen3:latest as the image URI users should plug into REPO_URI (lines 58, 71). The repo policy is to avoid latest for reproducibility. For a sample whose validated output is captured against a specific container build, I'd suggest a date-stamped or semver tag and noting that users should bump it on each rebuild.

Suggested change

export IMAGE_TAG=latest

export IMAGE_TAG=2026-04-27 # bump on each rebuild; avoid `latest` for reproducibility

KeitaW

Review Batch 2/5 — Deployment Pipeline (K8s / Slurm)

Two findings on the K8s manifests below. Both are about reducing
runtime variability on the hot path.

KeitaW · 2026-04-29T09:49:21Z

+        args:
+          - |
+            set -ex
+            pip install huggingface_hub


huggingface_hub installed at job runtime without a version pin

The download Job runs pip install huggingface_hub at launch time. That pulls whatever is current on PyPI — both unpinned and network-fragile — and re-runs on every retry. Since the Dockerfile already has internet access at build time, I'd suggest moving the install into aws-megatron-bridge.Dockerfile with an upper bound (e.g. huggingface_hub>=0.25,<1.0) and dropping the runtime install. That also lets the download Job start instantly from FSx's perspective.

KeitaW · 2026-04-29T09:49:21Z

+          containers:
+            - name: pytorch
+              image: ${REPO_URI}
+              imagePullPolicy: Always


imagePullPolicy: Always re-pulls the image on every pod start

Combined with :latest in the README this is consistent — but if you switch to a fixed tag (per the Structure batch suggestion) you can drop this to IfNotPresent and shave significant time off rendezvous on a multi-restart elastic job. Worth noting in the README either way.

Suggested change

imagePullPolicy: Always

imagePullPolicy: IfNotPresent

KeitaW

Review Batch 3/5 — Training Code Quality

This is the largest batch. The training script has several distributed-training
correctness issues that, taken together, mean the loop doesn't actually train
anything in the way the README claims:

No data-parallel gradient reduction (so DP=4 in the documented 8B run trains 4 independent models)
No pipeline-parallel schedule (so the 32B row in §6 won't run as documented)
The loss expression isn't a real LM loss
The attention_mask shape doesn't match what MCore GPT expects
Manual main_grad → grad copy can silently overwrite real gradients with zeros for embeddings/norms

The clean path forward, in my view, is to use Megatron-Bridge's official
training entrypoint (megatron.bridge.training.pretrain or the recipe API).
That owns the full lifecycle — DDP wrap, distributed optimizer, FP32 master,
gradient clipping, LR schedule, microbatching, and the right attention_mask
plumbing. If the goal is "smallest viable benchmark loop", at minimum (a) flip
wrap_with_ddp=True, (b) drop the manual main_grad allocation, and (c) use a
real cross-entropy loss against shifted labels.

Inline findings follow.

KeitaW · 2026-04-29T09:49:26Z

+
+        # Forward pass (model is a list of pipeline-parallel chunks)
+        output = model[0](input_ids, position_ids, attention_mask)
+        loss = output.float().mean()


loss = output.float().mean() is not a language-modeling loss

The loop calls .backward() on the arithmetic mean of every logit in the
batch. That's not next-token cross-entropy and not even a noisy approximation
of one — it's a different objective with no token-identity signal. Five
mechanical points:

1. No labels are used. A real LM loss is
F.cross_entropy(logits.view(-1, V), labels.view(-1)), where labels is
input_ids shifted by one. The script never constructs labels; the loss is
purely a function of logit magnitudes, independent of the input tokens.

2. The gradient is uniform across the vocab axis.
∂mean(Z) / ∂Z[i,s,v] = 1 / (B · S · V) — the same constant for every logit.
A real CE gradient is sparse: softmax(Z) − one_hot(label). The uniform
gradient learns nothing about token preferences.

3. softmax(logits) is invariant to adding a constant to every logit.
The optimizer is shifting all logits by roughly the same amount per step, so
the predicted distribution is unchanged from initialization. The loss number
decreases but the model doesn't move on the task.

4. The loss is unbounded below. Cross-entropy is ≥ 0 (Gibbs); mean(logits)
has no floor. The validated 0.69 → −4.37 → … → −32.59 curve is the optimizer
shrinking output magnitudes without limit.

5. Step-1 = 0.69 is the wrong order of magnitude. A randomly-initialized
152K-vocab model under real CE starts near log(151936) ≈ 11.93. 0.69 is
just mean(logit_init) near zero.

Why this matters. Anyone using this script as a starting point for real
pretraining will copy the loss expression and silently get zero LM progress
with a beautifully decreasing curve. As a pure throughput benchmark it's also
slightly compromised — uniform vocab gradients skip the LM-head all-gather +
softmax-CE fused kernel that real workloads exercise.

Two options:

Compute a real LM loss — shift input_ids by one to form labels, and use
torch.nn.functional.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
(or, better, MCore's vocab_parallel_cross_entropy since logits are
vocab-sharded under TP).

Reframe as benchmark-only — rename file/sections to "benchmark" /
"throughput", drop the loss field from the per-step print, and clearly
say in the README that this exercises forward+backward but doesn't optimize
an LM objective. output.sum().backward() is fine and you can drop the
optimizer step entirely; freezing parameters and timing forward only is
cleaner still.

For reference, 3.test_cases/megatron/megatron-lm/ uses Megatron-LM's official
pretrain_gpt.py with proper loss + LR schedule.

KeitaW · 2026-04-29T09:49:26Z

+    if has_weights:
+        if rank == 0:
+            print(f"[Megatron-Bridge] Loading weights from {hf_model_path}")
+        model = provider.provide_distributed_model(wrap_with_ddp=False)


wrap_with_ddp=False skips DP all-reduce — gradients aren't synchronized across data-parallel ranks

Same flag at line 133 (the else branch). provide_distributed_model(wrap_with_ddp=False)
deliberately skips MCore's DDP wrapper. That wrapper is the component that
registers backward hooks (distributed_data_parallel.py#L354),
moves param.grad into param.main_grad inside the post-backward hook
(distributed_data_parallel.py#L449-L453),
and starts/finishes the DP all-reduce (or reduce-scatter when the distributed
optimizer is enabled) in param_and_grad_buffer.py#L517.
Without the wrapper, none of those hooks are registered: loss.backward() on
line 187 only runs local autograd plus the TP collectives MCore inlines into
its column/row-parallel kernels — there's no cross-rank gradient reduction at
all when world_size > tp * pp.

The flag's behaviour is documented in the Megatron-Bridge source:
provide_distributed_model signature and wrap_with_ddp docstring
(default is True, and ddp_config is required when it's True); the
_ddp_wrap helper
is what actually instantiates megatron.core.distributed.DistributedDataParallel
and wires the hooks. Setting wrap_with_ddp=False is supported (the
bridge-guide examples
use it for inference / weight-conversion paths) but it's an inference-mode
shortcut, not a training configuration.

For the README's documented configurations:

Qwen3-8B (TP=4, PP=1, 16 GPUs) → DP=4, gradients diverge silently across
the 4 DP groups.

Qwen3-0.6B / 1.7B (TP=1, PP=1, 16 GPUs) → DP=16, every rank trains
independently.

Only Qwen3-14B and Qwen3-32B happen to have DP=1, so DP reduction is moot
— but those rows have other issues (see the pipeline-schedule and table notes).

Either flip the flag (wrap_with_ddp=True, then let MCore's DDP wire the
hooks) or call out in the README that this is a single-rank toy script and the
"DP" portion of any multi-rank run is not actually trained.

KeitaW · 2026-04-29T09:49:26Z

+        # Copy main_grad -> grad for the standard PyTorch optimizer
+        for p in params:
+            if p.main_grad is not None:
+                if p.grad is None:
+                    p.grad = p.main_grad.to(p.data.dtype)
+                else:
+                    p.grad.copy_(p.main_grad)


Manual main_grad → grad copy can silently zero gradients for embeddings / norms

Lines 138-144 zero-allocate main_grad on every parameter; MCore's fused
column/row-parallel linear backward kernels capture weight.main_grad during
forward and accumulate weight gradients into it, which is how the loop
"appears" to work for the matmul weights. But parameters whose autograd path
writes the normal p.grad instead — embeddings, layer norms, biases on layers
not using the fused kernels — get their real gradient overwritten by the
zero-initialized p.main_grad here:

if p.grad is None: p.grad = p.main_grad.to(p.data.dtype) # main_grad is still zeros for non-fused params else: p.grad.copy_(p.main_grad) # overwrites the real grad with zeros

The optimizer then "steps" with a zero gradient on those parameters, so
embeddings and norms don't actually train. This is in addition to the
no-DP-reduce issue above, and together they explain why the validated loss
curve doesn't look like real pretraining.

The clean fix is to use the official Megatron-Bridge training entrypoint
(megatron.bridge.training.pretrain, or the recipe API) which owns the full
lifecycle. If the goal is "smallest viable benchmark loop", at minimum (a)
flip wrap_with_ddp=True, (b) drop the manual main_grad allocation
(lines 138-144), and (c) let MCore's DDP populate main_grad for all
parameters before the optimizer step.

KeitaW · 2026-04-29T09:49:27Z

+        attention_mask = torch.ones_like(input_ids)
+
+        # Forward pass (model is a list of pipeline-parallel chunks)
+        output = model[0](input_ids, position_ids, attention_mask)


No pipeline-parallel schedule — pp >= 2 configurations won't actually run

Every rank just calls its local model[0] directly. There's no pipeline
schedule, no forward_backward_func, no point-to-point send/recv between PP
stages, and no decoder_input plumbing from the previous stage. That's fine
for pp == 1, but the README's parallelism table (§6) documents Qwen3-32B as
TP=8, PP=2, and MODEL_CONFIGS["32b"] (line 56) hardcodes the same defaults.
The 32B path will either deadlock (later stages waiting for input that never
arrives) or train garbage hidden states.

Either drop the 32B row from the table (and the MODEL_CONFIGS["32b"] entry)
until pipeline support is wired in, or use Megatron-Bridge's training
entrypoint which calls megatron.core.pipeline_parallel.get_forward_backward_func
for you.

As a separate point, when pp >= 2 the return value of model[0](...) is
also no longer logits — non-last stages return hidden states — which would
break the next line's output.float().mean() even before the pipeline
plumbing is fixed.

KeitaW · 2026-04-29T09:49:27Z

+            .unsqueeze(0)
+            .expand(args.micro_batch_size, -1)
+        )
+        attention_mask = torch.ones_like(input_ids)


attention_mask = torch.ones_like(input_ids) is the wrong shape and dtype for MCore GPT

MCore's transformer block expects attention_mask shaped [1, 1, S, S] and
dtype bool (a boolean self-attention mask, where True typically means
"masked out" — the inverse of HuggingFace's "1 = keep" convention). This
script passes [B, S] int64 all-ones, which is the HF padding-mask
convention. For MCore GPT the [B, S] shape is the separate padding_mask
input, not attention_mask.

Depending on the backend (TE vs vanilla MCore attention) the script will
either silently use a default causal mask and ignore the input, or — worse —
mis-interpret the all-ones tensor and mask out everything. Either way the
forward isn't doing what the loop appears to think it's doing.

Two fixes:

Pass attention_mask=None and let MCore's attention build the causal mask
internally (this is what pretrain_gpt.py does for synthetic data).

Or build the mask explicitly:
attention_mask = torch.triu( torch.ones((seq_len, seq_len), dtype=torch.bool, device=f"cuda:{local_rank}"), diagonal=1, ).view(1, 1, seq_len, seq_len)

Suggested change

attention_mask = torch.ones_like(input_ids)

attention_mask = None

KeitaW · 2026-04-29T09:49:27Z

+        print(f"[Megatron-Bridge] World size: {world_size}, TP={tp}, PP={pp}")
+        print(f"[Megatron-Bridge] Model path: {hf_model_path}")
+        print(
+            f"[Megatron-Bridge] Seq={args.seq_length}, GBS={args.global_batch_size}, Iters={args.train_iters}"


--global-batch-size is parsed and printed but never enforced

--global-batch-size is logged here and the README's §5.1 walks users through
setting it, but the loop only iterates one micro-batch per step with no
gradient accumulation and no DP × micro-batch math. The effective batch is
micro_batch_size × (world_size / (tp × pp)), which for the validated 0.6B
run is 1 × 16 = 16 — coincidentally matching the requested GBS=16, but the
script would not honor any other GBS the user sets.

Either implement microbatching/accumulation, or remove --global-batch-size
from the CLI and the README entirely so users aren't misled into thinking the
flag is honored.

KeitaW · 2026-04-29T09:49:27Z

+    tp = args.tp or model_cfg["tp"]
+    pp = args.pp or model_cfg["pp"]


Truthiness fallback on --tp / --pp

0 would silently fall through to the config default. It's not a realistic
input today (TP=0 is meaningless), but is not None is the explicit form the
checklist calls for and it costs nothing.

Suggested change

tp = args.tp or model_cfg["tp"]

pp = args.pp or model_cfg["pp"]

tp = args.tp if args.tp is not None else model_cfg["tp"]

pp = args.pp if args.pp is not None else model_cfg["pp"]

KeitaW

Review Batch 4/5 — Infrastructure & NCCL Configuration

Two findings on the Dockerfile's pip pins. The Dockerfile is otherwise solid
— the EFA stack pattern is canonical and the base image is correctly pinned.

KeitaW · 2026-04-29T09:49:33Z

+######################
+RUN pip uninstall -y megatron-bridge megatron-core \
+    && rm -rf /opt/Megatron-Bridge /opt/megatron-lm \
+    && pip install --no-cache-dir --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@main \


Megatron-LM pinned to @main is a moving target

The contribution checklist requires external dependencies be pinned to a
specific version, tag, or commit. @main will silently change between builds
and can break the validated training output (which is the main artifact of
this PR). The comment above this block already calls out the version
relationship — "megatron-bridge 0.4.0 requires MCore >=0.18" — so could you
pin to the specific commit or tag of Megatron-LM that matches the 0.4.0
release window (e.g. @core_r0.18.0 or a SHA you've validated against)?

KeitaW · 2026-04-29T09:49:33Z

+    && pip install --no-cache-dir --no-deps "nvidia-modelopt>=0.33" \
+    && pip install --no-cache-dir --no-deps "megatron-bridge>=0.4.0"


nvidia-modelopt and megatron-bridge lack upper bounds

Other pip installs in the same Dockerfile use bounded ranges
("awscli>=1.44,<2.0", "transformers>=4.57,<5.0"). For consistency and to
match the checklist, please add upper bounds. Without these, a future
docker build of the same Dockerfile may pick up an incompatible major
version and fail to import.

Suggested change

&& pip install --no-cache-dir --no-deps "nvidia-modelopt>=0.33" \

&& pip install --no-cache-dir --no-deps "megatron-bridge>=0.4.0"

&& pip install --no-cache-dir --no-deps "nvidia-modelopt>=0.33,<1.0" \

&& pip install --no-cache-dir --no-deps "megatron-bridge>=0.4.0,<0.5.0"

KeitaW

Review Batch 5/5 — Documentation Consistency

Final batch. Three inline doc findings below, plus one cross-file framing
issue and the "things that look great" callout.

"Mock data" framing in §8.1 contradicts the actual behaviour

§8.1 says:

For pure throughput benchmarking, the training script uses mock data by
default (no --hf-model-path required for mock).

Reading pretrain_qwen3.py (lines 168-180), mock data is always used —
input_ids is unconditionally torch.randint(...) regardless of whether real
weights were loaded. So "benchmark mode" is the only mode. §5 ("Distributed
Training") implies real training is happening when --hf-model-path is set,
which isn't the case. This ties back to the loss issue in Batch 3; whichever
way you resolve that, the README narrative needs to match. Could you collapse
§5 and §8.1 into one section that's explicit about benchmark-only behaviour,
or add a real-data training path?

Things That Look Great

The README is unusually thorough for a new test case — eight sections,
validated cluster topology, EFA verification logs, NCCL tuner output, and
per-step throughput. That's exactly what other contributors should be
modeling off of.
The Dockerfile follows the repo's canonical EFA-stack pattern: remove
IB/HPC-X libs, install EFA 1.47.0 (matches the CI minimum) with
--skip-kmod --no-verify --skip-limit-conf, set the right LD_LIBRARY_PATH
ordering, then install GDRCopy. Easy to review.
envsubst is called with explicit variable whitelists in both README
examples ('$REPO_URI $HF_MODEL $MODEL_SIZE' and the longer training one),
which is the convention.
NCCL_SOCKET_IFNAME=^lo (manifest) and ^docker,lo,veth_def_agent
(Dockerfile) use exclusion patterns rather than positive selection — the
repo's preferred form, and the one that survives EFA interfaces.
The pytorchjob.yaml-template ships a strong NCCL/Torch debug environment
(TORCH_NCCL_TRACE_BUFFER_SIZE, TORCH_NCCL_DUMP_ON_TIMEOUT,
TORCH_DISTRIBUTED_DEBUG=DETAIL) which is genuinely useful for first-time
HyperPod EKS users debugging hangs.
Using a Kubernetes Secret (hf-token) for the HuggingFace token rather
than baking it into the manifest is the right choice and is propagated
cleanly to both the download Job and the training Worker.
PyTorchJob (Kubeflow) is used for distributed training rather than a
plain Job, matching the repo guidance.
The base image nvcr.io/nvidia/nemo:25.07 is pinned to an explicit tag.
The validated test plan in the PR description (8m wall-clock, 2/2 workers
succeeded, 32 NICs detected per node, ~330K tokens/s steady-state) gives a
reviewer confidence the artifact actually runs end-to-end on the target
hardware.

Thanks again for the contribution!

KeitaW · 2026-04-29T09:49:38Z

+## 6. Model Sizes and Recommended Parallelism
+
+The table below provides recommended parallelism settings for each Qwen 3 model size
+on 2x p5.48xlarge (16 GPUs total):
+
+| Model | Parameters | TP | PP | Nodes (p5.48xlarge) | Notes |
+|-------|-----------|----|----|---------------------|-------|
+| Qwen3-0.6B | 0.6B | 1 | 1 | 1 | Fits on single GPU |
+| Qwen3-1.7B | 1.7B | 1 | 1 | 1 | Fits on single GPU |
+| Qwen3-4B | 4B | 2 | 1 | 1 | 2-way tensor parallel |
+| Qwen3-8B | 8B | 4 | 1 | 1 | 4-way tensor parallel |
+| Qwen3-14B | 14B | 8 | 1 | 1 | Full node tensor parallel |
+| Qwen3-32B | 32B | 8 | 2 | 2 | TP + PP with activation recompute |
+


Section 6 parallelism table contradicts the section heading

The header reads "recommended parallelism settings for each Qwen 3 model size
on 2x p5.48xlarge (16 GPUs total)" but the table's Nodes (p5.48xlarge)
column mixes 1 and 2. Models marked Nodes=1 (Qwen3-0.6B through
Qwen3-14B) only make sense on a single-node cluster, contradicting the
section premise. I think what you meant is "minimum nodes required" — could
you either retitle the section ("Minimum cluster sizing per Qwen 3 size") or
expand the table to show both single-node and 2-node configurations
explicitly? The latter is more useful since the entire validated workflow
assumes 2 nodes.

(See also the pipeline-schedule finding in Batch 3 — the 32B row needs to be
revisited regardless: with no PP schedule wired in, that config won't run as
documented.)

KeitaW · 2026-04-29T09:49:38Z

+export MODEL_SIZE=8b                       # Match the model: 0.6b, 1.7b, 4b, 8b, 14b, 32b
+
+# Create a Kubernetes Secret for HuggingFace token (one-time setup)
+kubectl create secret generic hf-token --from-literal=token=<your-huggingface-token>


<your-huggingface-token> placeholder lacks an explicit "replace this" cue

Most placeholder readers know to replace it, but a one-line note above the
command ("Replace <your-huggingface-token> with a token from
huggingface.co/settings/tokens that has access to the Qwen3 repos.") would
save a step for first-time HyperPod users. Minor.

KeitaW · 2026-04-29T09:49:38Z

+!*.yaml-template
+*.yaml


manifests/.gitignore rule ordering is unnecessary

*.yaml doesn't match *.yaml-template (the glob is anchored to the file
suffix), so the !*.yaml-template re-include rule is a no-op. You can drop
it and just keep *.yaml. Not a bug, just dead config.

Suggested change

!*.yaml-template

*.yaml

*.yaml

KeitaW

Few comments

Adding megatron-bridge sample

af689d3

allela-roy changed the title ~~Adding a new megatron-bridge sample~~ Adding a megatron-bridge sample Apr 27, 2026

paragao requested changes Apr 27, 2026

View reviewed changes

KeitaW reviewed Apr 29, 2026

View reviewed changes

KeitaW requested changes Apr 29, 2026

View reviewed changes

KeitaW mentioned this pull request May 16, 2026

Add Qwen3.6-35B-A3B MoE LoRA function-calling test case (Megatron-Bridge + Kubeflow PyTorchJob) #1091

Open

7 tasks

	export IMAGE_TAG=latest
	export IMAGE_TAG=2026-04-27 # bump on each rebuild; avoid `latest` for reproducibility

	attention_mask = torch.ones_like(input_ids)
	attention_mask = None

		tp = args.tp or model_cfg["tp"]
		pp = args.pp or model_cfg["pp"]

		&& pip install --no-cache-dir --no-deps "nvidia-modelopt>=0.33" \
		&& pip install --no-cache-dir --no-deps "megatron-bridge>=0.4.0"

Conversation

allela-roy commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Changes

Test Plan

Uh oh!

paragao left a comment

Choose a reason for hiding this comment

Review: PR #1065 — Adding a megatron-bridge sample

What's done well

Blocking — please address before merge

Non-blocking suggestions

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 1/5 — Structure & Repository Hygiene

Directory layout puts the model below kubernetes/ instead of above it

Uh oh!

Choose a reason for hiding this comment

:latest shown in user-facing build commands

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 2/5 — Deployment Pipeline (K8s / Slurm)

Uh oh!

Choose a reason for hiding this comment

huggingface_hub installed at job runtime without a version pin

Uh oh!

Choose a reason for hiding this comment

imagePullPolicy: Always re-pulls the image on every pod start

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 3/5 — Training Code Quality

Uh oh!

Choose a reason for hiding this comment

loss = output.float().mean() is not a language-modeling loss

Uh oh!

Choose a reason for hiding this comment

wrap_with_ddp=False skips DP all-reduce — gradients aren't synchronized across data-parallel ranks

Uh oh!

Choose a reason for hiding this comment

Manual main_grad → grad copy can silently zero gradients for embeddings / norms

Uh oh!

Choose a reason for hiding this comment

No pipeline-parallel schedule — pp >= 2 configurations won't actually run

Uh oh!

Choose a reason for hiding this comment

attention_mask = torch.ones_like(input_ids) is the wrong shape and dtype for MCore GPT

Uh oh!

Choose a reason for hiding this comment

--global-batch-size is parsed and printed but never enforced

Uh oh!

Choose a reason for hiding this comment

Truthiness fallback on --tp / --pp

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 4/5 — Infrastructure & NCCL Configuration

Uh oh!

Choose a reason for hiding this comment

Megatron-LM pinned to @main is a moving target

Uh oh!

Choose a reason for hiding this comment

nvidia-modelopt and megatron-bridge lack upper bounds

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 5/5 — Documentation Consistency

"Mock data" framing in §8.1 contradicts the actual behaviour

Things That Look Great

Uh oh!

Choose a reason for hiding this comment

Section 6 parallelism table contradicts the section heading

Uh oh!

Choose a reason for hiding this comment

<your-huggingface-token> placeholder lacks an explicit "replace this" cue

Uh oh!

Choose a reason for hiding this comment

allela-roy commented Apr 27, 2026 •

edited

Loading

Directory layout puts the model below `kubernetes/` instead of above it

`:latest` shown in user-facing build commands

`huggingface_hub` installed at job runtime without a version pin

`imagePullPolicy: Always` re-pulls the image on every pod start

`loss = output.float().mean()` is not a language-modeling loss

`wrap_with_ddp=False` skips DP all-reduce — gradients aren't synchronized across data-parallel ranks

Manual `main_grad → grad` copy can silently zero gradients for embeddings / norms

No pipeline-parallel schedule — `pp >= 2` configurations won't actually run

`attention_mask = torch.ones_like(input_ids)` is the wrong shape and dtype for MCore GPT

`--global-batch-size` is parsed and printed but never enforced

Truthiness fallback on `--tp` / `--pp`

`Megatron-LM` pinned to `@main` is a moving target

`nvidia-modelopt` and `megatron-bridge` lack upper bounds

`<your-huggingface-token>` placeholder lacks an explicit "replace this" cue

`manifests/.gitignore` rule ordering is unnecessary