Skip to content

Adding a megatron-bridge sample#1065

Open
allela-roy wants to merge 1 commit into
mainfrom
megatron-bridge
Open

Adding a megatron-bridge sample#1065
allela-roy wants to merge 1 commit into
mainfrom
megatron-bridge

Conversation

@allela-roy
Copy link
Copy Markdown
Contributor

@allela-roy allela-roy commented Apr 27, 2026

Purpose

Adding a new Megatron-Bridge sample for distributed pretraining of Qwen 3 models on SMHP EKS

Changes

Added corresponding sample files

Test Plan

Ran the full workflow: built and pushed the container image to ECR, downloaded Qwen3-0.6B weights to FSx, and launched a 2-node distributed pretraining job on a SMHP EKS cluster.

Environment:

  • AWS Service: SageMaker HyperPod EKS
  • EKS version: 1.33
  • Instance type: ml.p5.48xlarge
  • Number of nodes: 2
  • AMI: Amazon Linux 2023 (ami-0399b1f4b455cb701)

Test commands:

# Set environment variables
export AWS_REGION=us-east-1
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export REPO_URI=${REGISTRY}megatron-bridge-qwen3:latest

# Build and push image
docker build -f aws-megatron-bridge.Dockerfile -t ${REPO_URI} .
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${REGISTRY}
docker image push ${REPO_URI}

# Deploy training job (Qwen3-0.6B, 2 nodes, TP=1, PP=1, 10 iterations)
export NUM_NODES=2 GPU_PER_NODE=8 EFA_PER_NODE=32 FI_PROVIDER=efa
export MODEL_SIZE=0.6b TENSOR_PARALLEL=1 PIPELINE_PARALLEL=1
export SEQ_LENGTH=2048 GLOBAL_BATCH_SIZE=16 MICRO_BATCH_SIZE=1 TRAIN_ITERS=10

envsubst '$REPO_URI $NUM_NODES $GPU_PER_NODE $EFA_PER_NODE $FI_PROVIDER $MODEL_SIZE $TRAIN_ITERS $SEQ_LENGTH $GLOBAL_BATCH_SIZE $MICRO_BATCH_SIZE $TENSOR_PARALLEL $PIPELINE_PARALLEL' \
  < kubernetes/qwen3/manifests/pytorchjob.yaml-template | kubectl apply -f -

# Monitor
kubectl logs -f megatron-bridge-qwen3-worker-0
kubectl get pytorchjob megatron-bridge-qwen3

## Test Results

**Job status:** Succeeded (2/2 workers completed)

**EFA verification** 

NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.18.0
NCCL INFO NET/OFI Using Libfabric version 2.4
NCCL INFO NET/OFI Using transport protocol RDMA (platform set)
NCCL INFO NET/OFI Selected provider is efa, fabric is efa-direct (found 32 nics)
NCCL INFO NET/OFI Configuring AWS-specific options
NCCL INFO NET/OFI Internode latency set at 75.0 us
NCCL INFO TUNER/Plugin: Using nccl_ofi_tuner (v3)
NCCL INFO NET/OFI Region base Tuner is chosen for platform: p5.48xlarge


**Training output** (Qwen3-0.6B, 16x H100, TP=1 PP=1, mock data):

[Megatron-Bridge] Qwen3-0.6B pretraining
[Megatron-Bridge] World size: 16, TP=1, PP=1
[Megatron-Bridge] Model built: 0.60B params, 0.60B trainable

step 1/10 | loss: 0.6869 | time: 1.89s | tokens/s: 17,351
step 2/10 | loss: -4.3676 | time: 0.14s | tokens/s: 241,682
step 3/10 | loss: -8.8785 | time: 0.11s | tokens/s: 306,938
step 4/10 | loss: -14.0235 | time: 0.10s | tokens/s: 328,907
step 5/10 | loss: -18.2283 | time: 0.10s | tokens/s: 329,265
step 6/10 | loss: -19.9766 | time: 0.10s | tokens/s: 330,300
step 7/10 | loss: -23.3931 | time: 0.10s | tokens/s: 330,264
step 8/10 | loss: -26.0851 | time: 0.10s | tokens/s: 329,648
step 9/10 | loss: -29.8682 | time: 0.11s | tokens/s: 306,454
step 10/10 | loss: -32.5855 | time: 0.10s | tokens/s: 326,035

[Megatron-Bridge] Training complete!


## Directory Structure

<!-- If adding or updating a test case, ensure it follows the expected layout below. -->

3.test_cases/
└── / # e.g. pytorch, megatron, jax
└── / # e.g. picotron, FSDP, megatron-lm
└── / # e.g. SmolLM-1.7B (may be omitted for single-model cases)
├── Dockerfile # Container / environment setup
├── README.md # Overview, prerequisites, usage
├── slurm/ # Slurm-specific launch scripts
├── kubernetes/ # Kubernetes manifests
└── hyperpod-eks/ # HyperPod EKS instructions


- Top-level files (`Dockerfile`, `README.md`, training scripts, configs) cover general setup.
- Subdirectories (`slurm/`, `kubernetes/`, `hyperpod-eks/`) contain service-specific launch instructions.
- Not all service subdirectories are required — include only the ones relevant to your test case.

## Checklist

- [X] I have read the [contributing guidelines](https://github.com/awslabs/awsome-distributed-training/blob/main/CONTRIBUTING.md).
- [X ] I am working against the latest `main` branch.
- [X ] I have searched existing open and recently merged PRs to confirm this is not a duplicate.
- [X ] The contribution is self-contained with documentation and scripts.
- [X ] External dependencies are pinned to a specific version or tag (no `latest`).
- [X] A README is included or updated with prerequisites, instructions, and known issues.
- [X] New test cases follow the [expected directory structure](#directory-structure).

@allela-roy allela-roy changed the title Adding a new megatron-bridge sample Adding a megatron-bridge sample Apr 27, 2026
Copy link
Copy Markdown
Contributor

@paragao paragao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #1065 — Adding a megatron-bridge sample

Thanks for this contribution. A few observations:

What's done well

  • Security: HF token is correctly sourced from a Kubernetes Secret (hf-token) — no hardcoded credentials anywhere.
  • HPC/EFA config: The NCCL environment variables are comprehensive and correct for p5.48xlarge (32 EFA adapters, FI_EFA_USE_DEVICE_RDMA=1, NCCL_SOCKET_IFNAME=^lo, trace buffer/dump for debugging hangs).
  • Documentation: The README is thorough — prerequisites, parallelism recommendation table, validated output, and cleanup instructions.
  • Megatron-Core patterns: The main_grad allocation and gradient copy logic is correct.
  • Fault tolerance: Elastic policy with maxRestarts: 100 + restartPolicy: OnFailure is well configured.

Blocking — please address before merge

1. Megatron-LM dependency is not pinned (aws-megatron-bridge.Dockerfile, line 113)

pip install --no-cache-dir --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@main

This installs from @main, which can change at any time. The repo's contribution checklist requires: "External dependencies are pinned to a specific version or tag (no latest)".

Fix: Pin to the current commit hash:

pip install --no-cache-dir --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@<commit_hash>

2. Training loop only supports PP=1, but README recommends PP=2 for Qwen3-32B (pretrain_qwen3.py, line 176)

output = model[0](input_ids, position_ids, attention_mask)

With pipeline parallelism > 1, Megatron-Core returns multiple model chunks (one per stage). Only calling model[0] means stage-1 workers never receive activations — the job will hang or produce incorrect results. The README's Qwen3-32B row recommends PP=2, so users following that guidance would hit a broken run.

Fix (option A): Add a guard and update the README:

assert len(model) == 1, "This sample only supports pipeline_model_parallel_size=1"

Then change the Qwen3-32B recommendation to PP=1 (adjusting TP/CP to compensate).

Fix (option B): Implement pipeline-parallel scheduling using Megatron-Core's forward_backward_no_pipelining / forward_backward_pipelining_with_interleaving from megatron.core.pipeline_parallel.schedules.


Non-blocking suggestions

  • Error handling: Wrap the training entrypoint in try/finally to ensure dist.destroy_process_group() is called on failure.
  • etcd resource limits: The etcd container in pytorchjob.yaml-template has no CPU/memory requests or limits. Add them for scheduling predictability.
  • etcd version: 3.4.13-0 is from 2020. Consider upgrading to a 3.5.x release for security patches.
  • Unused global_batch_size argument: The arg is accepted but the training loop doesn't implement gradient accumulation. A comment clarifying this is a simplified mock loop would prevent confusion.
  • SSH setup may be unnecessary: Since rendezvous uses torchrun/etcd (not MPI), the SSH key generation and config weakening in the Dockerfile add attack surface without benefit. Consider removing lines 60-71.
  • README validated output: Consider replacing specific node names (hyperpod-i-008789534bbb4c33f) and internal IPs with placeholders for a cleaner public sample.

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 1/5 — Structure & Repository Hygiene

Thanks for the contribution — overall this is a thorough sample with a great
README. The findings below are split across five themed batches; this one
covers structure and repo conventions. Subsequent batches will cover
deployment pipeline, training code, infrastructure/NCCL, and documentation.

Directory layout puts the model below kubernetes/ instead of above it

The repo convention (and the PR template you ticked) is
<framework>/<library>/<model>/{Dockerfile, README, slurm/, kubernetes/, hyperpod-eks/},
i.e. model directory above platform directories. The sibling megatron-lm and
nemo test cases follow that layout. Here the structure is inverted:
<library>/kubernetes/<model>/, which means the training script
pretrain_qwen3.py lives at kubernetes/qwen3/pretrain_qwen3.py even though
it's platform-agnostic Python that torchrun can invoke from Slurm too.

I'd suggest restructuring to
3.test_cases/megatron/megatron-bridge/qwen3/{Dockerfile, README.md, kubernetes/manifests/...}
so that adding a Slurm path later (or another model) doesn't require
re-locating the script. If you'd rather keep the current layout because the
sample is EKS-only today, it might still be worth pulling pretrain_qwen3.py
up to the library level and leaving only manifests under kubernetes/.

export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE_TAG=latest
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:latest shown in user-facing build commands

The README sets IMAGE_TAG=latest here and prints ${REGISTRY}megatron-bridge-qwen3:latest as the image URI users should plug into REPO_URI (lines 58, 71). The repo policy is to avoid latest for reproducibility. For a sample whose validated output is captured against a specific container build, I'd suggest a date-stamped or semver tag and noting that users should bump it on each rebuild.

Suggested change
export IMAGE_TAG=latest
export IMAGE_TAG=2026-04-27 # bump on each rebuild; avoid `latest` for reproducibility

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 2/5 — Deployment Pipeline (K8s / Slurm)

Two findings on the K8s manifests below. Both are about reducing
runtime variability on the hot path.

args:
- |
set -ex
pip install huggingface_hub
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huggingface_hub installed at job runtime without a version pin

The download Job runs pip install huggingface_hub at launch time. That pulls whatever is current on PyPI — both unpinned and network-fragile — and re-runs on every retry. Since the Dockerfile already has internet access at build time, I'd suggest moving the install into aws-megatron-bridge.Dockerfile with an upper bound (e.g. huggingface_hub>=0.25,<1.0) and dropping the runtime install. That also lets the download Job start instantly from FSx's perspective.

containers:
- name: pytorch
image: ${REPO_URI}
imagePullPolicy: Always
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imagePullPolicy: Always re-pulls the image on every pod start

Combined with :latest in the README this is consistent — but if you switch to a fixed tag (per the Structure batch suggestion) you can drop this to IfNotPresent and shave significant time off rendezvous on a multi-restart elastic job. Worth noting in the README either way.

Suggested change
imagePullPolicy: Always
imagePullPolicy: IfNotPresent

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 3/5 — Training Code Quality

This is the largest batch. The training script has several distributed-training
correctness issues that, taken together, mean the loop doesn't actually train
anything in the way the README claims:

  • No data-parallel gradient reduction (so DP=4 in the documented 8B run trains 4 independent models)
  • No pipeline-parallel schedule (so the 32B row in §6 won't run as documented)
  • The loss expression isn't a real LM loss
  • The attention_mask shape doesn't match what MCore GPT expects
  • Manual main_grad → grad copy can silently overwrite real gradients with zeros for embeddings/norms

The clean path forward, in my view, is to use Megatron-Bridge's official
training entrypoint (megatron.bridge.training.pretrain or the recipe API).
That owns the full lifecycle — DDP wrap, distributed optimizer, FP32 master,
gradient clipping, LR schedule, microbatching, and the right attention_mask
plumbing. If the goal is "smallest viable benchmark loop", at minimum (a) flip
wrap_with_ddp=True, (b) drop the manual main_grad allocation, and (c) use a
real cross-entropy loss against shifted labels.

Inline findings follow.


# Forward pass (model is a list of pipeline-parallel chunks)
output = model[0](input_ids, position_ids, attention_mask)
loss = output.float().mean()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loss = output.float().mean() is not a language-modeling loss

The loop calls .backward() on the arithmetic mean of every logit in the
batch. That's not next-token cross-entropy and not even a noisy approximation
of one — it's a different objective with no token-identity signal. Five
mechanical points:

1. No labels are used. A real LM loss is
F.cross_entropy(logits.view(-1, V), labels.view(-1)), where labels is
input_ids shifted by one. The script never constructs labels; the loss is
purely a function of logit magnitudes, independent of the input tokens.

2. The gradient is uniform across the vocab axis.
∂mean(Z) / ∂Z[i,s,v] = 1 / (B · S · V) — the same constant for every logit.
A real CE gradient is sparse: softmax(Z) − one_hot(label). The uniform
gradient learns nothing about token preferences.

3. softmax(logits) is invariant to adding a constant to every logit.
The optimizer is shifting all logits by roughly the same amount per step, so
the predicted distribution is unchanged from initialization. The loss number
decreases but the model doesn't move on the task.

4. The loss is unbounded below. Cross-entropy is ≥ 0 (Gibbs); mean(logits)
has no floor. The validated 0.69 → −4.37 → … → −32.59 curve is the optimizer
shrinking output magnitudes without limit.

5. Step-1 = 0.69 is the wrong order of magnitude. A randomly-initialized
152K-vocab model under real CE starts near log(151936) ≈ 11.93. 0.69 is
just mean(logit_init) near zero.

Why this matters. Anyone using this script as a starting point for real
pretraining will copy the loss expression and silently get zero LM progress
with a beautifully decreasing curve. As a pure throughput benchmark it's also
slightly compromised — uniform vocab gradients skip the LM-head all-gather +
softmax-CE fused kernel that real workloads exercise.

Two options:

  1. Compute a real LM loss — shift input_ids by one to form labels, and use
    torch.nn.functional.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
    (or, better, MCore's vocab_parallel_cross_entropy since logits are
    vocab-sharded under TP).
  2. Reframe as benchmark-only — rename file/sections to "benchmark" /
    "throughput", drop the loss field from the per-step print, and clearly
    say in the README that this exercises forward+backward but doesn't optimize
    an LM objective. output.sum().backward() is fine and you can drop the
    optimizer step entirely; freezing parameters and timing forward only is
    cleaner still.

For reference, 3.test_cases/megatron/megatron-lm/ uses Megatron-LM's official
pretrain_gpt.py with proper loss + LR schedule.

if has_weights:
if rank == 0:
print(f"[Megatron-Bridge] Loading weights from {hf_model_path}")
model = provider.provide_distributed_model(wrap_with_ddp=False)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrap_with_ddp=False skips DP all-reduce — gradients aren't synchronized across data-parallel ranks

Same flag at line 133 (the else branch). provide_distributed_model(wrap_with_ddp=False)
deliberately skips MCore's DDP wrapper. That wrapper is the component that
registers backward hooks (distributed_data_parallel.py#L354),
moves param.grad into param.main_grad inside the post-backward hook
(distributed_data_parallel.py#L449-L453),
and starts/finishes the DP all-reduce (or reduce-scatter when the distributed
optimizer is enabled) in param_and_grad_buffer.py#L517.
Without the wrapper, none of those hooks are registered: loss.backward() on
line 187 only runs local autograd plus the TP collectives MCore inlines into
its column/row-parallel kernels — there's no cross-rank gradient reduction at
all when world_size > tp * pp.

The flag's behaviour is documented in the Megatron-Bridge source:
provide_distributed_model signature and wrap_with_ddp docstring
(default is True, and ddp_config is required when it's True); the
_ddp_wrap helper
is what actually instantiates megatron.core.distributed.DistributedDataParallel
and wires the hooks. Setting wrap_with_ddp=False is supported (the
bridge-guide examples
use it for inference / weight-conversion paths) but it's an inference-mode
shortcut, not a training configuration.

For the README's documented configurations:

  • Qwen3-8B (TP=4, PP=1, 16 GPUs) → DP=4, gradients diverge silently across
    the 4 DP groups.
  • Qwen3-0.6B / 1.7B (TP=1, PP=1, 16 GPUs) → DP=16, every rank trains
    independently.
  • Only Qwen3-14B and Qwen3-32B happen to have DP=1, so DP reduction is moot
    — but those rows have other issues (see the pipeline-schedule and table notes).

Either flip the flag (wrap_with_ddp=True, then let MCore's DDP wire the
hooks) or call out in the README that this is a single-rank toy script and the
"DP" portion of any multi-rank run is not actually trained.

Comment on lines +189 to +195
# Copy main_grad -> grad for the standard PyTorch optimizer
for p in params:
if p.main_grad is not None:
if p.grad is None:
p.grad = p.main_grad.to(p.data.dtype)
else:
p.grad.copy_(p.main_grad)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manual main_grad → grad copy can silently zero gradients for embeddings / norms

Lines 138-144 zero-allocate main_grad on every parameter; MCore's fused
column/row-parallel linear backward kernels capture weight.main_grad during
forward and accumulate weight gradients into it, which is how the loop
"appears" to work for the matmul weights. But parameters whose autograd path
writes the normal p.grad instead — embeddings, layer norms, biases on layers
not using the fused kernels — get their real gradient overwritten by the
zero-initialized p.main_grad here:

if p.grad is None:
    p.grad = p.main_grad.to(p.data.dtype)   # main_grad is still zeros for non-fused params
else:
    p.grad.copy_(p.main_grad)               # overwrites the real grad with zeros

The optimizer then "steps" with a zero gradient on those parameters, so
embeddings and norms don't actually train. This is in addition to the
no-DP-reduce issue above, and together they explain why the validated loss
curve doesn't look like real pretraining.

The clean fix is to use the official Megatron-Bridge training entrypoint
(megatron.bridge.training.pretrain, or the recipe API) which owns the full
lifecycle. If the goal is "smallest viable benchmark loop", at minimum (a)
flip wrap_with_ddp=True, (b) drop the manual main_grad allocation
(lines 138-144), and (c) let MCore's DDP populate main_grad for all
parameters before the optimizer step.

attention_mask = torch.ones_like(input_ids)

# Forward pass (model is a list of pipeline-parallel chunks)
output = model[0](input_ids, position_ids, attention_mask)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No pipeline-parallel schedule — pp >= 2 configurations won't actually run

Every rank just calls its local model[0] directly. There's no pipeline
schedule, no forward_backward_func, no point-to-point send/recv between PP
stages, and no decoder_input plumbing from the previous stage. That's fine
for pp == 1, but the README's parallelism table (§6) documents Qwen3-32B as
TP=8, PP=2, and MODEL_CONFIGS["32b"] (line 56) hardcodes the same defaults.
The 32B path will either deadlock (later stages waiting for input that never
arrives) or train garbage hidden states.

Either drop the 32B row from the table (and the MODEL_CONFIGS["32b"] entry)
until pipeline support is wired in, or use Megatron-Bridge's training
entrypoint which calls megatron.core.pipeline_parallel.get_forward_backward_func
for you.

As a separate point, when pp >= 2 the return value of model[0](...) is
also no longer logits — non-last stages return hidden states — which would
break the next line's output.float().mean() even before the pipeline
plumbing is fixed.

.unsqueeze(0)
.expand(args.micro_batch_size, -1)
)
attention_mask = torch.ones_like(input_ids)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attention_mask = torch.ones_like(input_ids) is the wrong shape and dtype for MCore GPT

MCore's transformer block expects attention_mask shaped [1, 1, S, S] and
dtype bool (a boolean self-attention mask, where True typically means
"masked out" — the inverse of HuggingFace's "1 = keep" convention). This
script passes [B, S] int64 all-ones, which is the HF padding-mask
convention. For MCore GPT the [B, S] shape is the separate padding_mask
input, not attention_mask.

Depending on the backend (TE vs vanilla MCore attention) the script will
either silently use a default causal mask and ignore the input, or — worse —
mis-interpret the all-ones tensor and mask out everything. Either way the
forward isn't doing what the loop appears to think it's doing.

Two fixes:

  1. Pass attention_mask=None and let MCore's attention build the causal mask
    internally (this is what pretrain_gpt.py does for synthetic data).
  2. Or build the mask explicitly:
    attention_mask = torch.triu(
        torch.ones((seq_len, seq_len), dtype=torch.bool, device=f"cuda:{local_rank}"),
        diagonal=1,
    ).view(1, 1, seq_len, seq_len)
Suggested change
attention_mask = torch.ones_like(input_ids)
attention_mask = None

print(f"[Megatron-Bridge] World size: {world_size}, TP={tp}, PP={pp}")
print(f"[Megatron-Bridge] Model path: {hf_model_path}")
print(
f"[Megatron-Bridge] Seq={args.seq_length}, GBS={args.global_batch_size}, Iters={args.train_iters}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--global-batch-size is parsed and printed but never enforced

--global-batch-size is logged here and the README's §5.1 walks users through
setting it, but the loop only iterates one micro-batch per step with no
gradient accumulation and no DP × micro-batch math. The effective batch is
micro_batch_size × (world_size / (tp × pp)), which for the validated 0.6B
run is 1 × 16 = 16 — coincidentally matching the requested GBS=16, but the
script would not honor any other GBS the user sets.

Either implement microbatching/accumulation, or remove --global-batch-size
from the CLI and the README entirely so users aren't misled into thinking the
flag is honored.

Comment on lines +86 to +87
tp = args.tp or model_cfg["tp"]
pp = args.pp or model_cfg["pp"]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truthiness fallback on --tp / --pp

0 would silently fall through to the config default. It's not a realistic
input today (TP=0 is meaningless), but is not None is the explicit form the
checklist calls for and it costs nothing.

Suggested change
tp = args.tp or model_cfg["tp"]
pp = args.pp or model_cfg["pp"]
tp = args.tp if args.tp is not None else model_cfg["tp"]
pp = args.pp if args.pp is not None else model_cfg["pp"]

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 4/5 — Infrastructure & NCCL Configuration

Two findings on the Dockerfile's pip pins. The Dockerfile is otherwise solid
— the EFA stack pattern is canonical and the base image is correctly pinned.

######################
RUN pip uninstall -y megatron-bridge megatron-core \
&& rm -rf /opt/Megatron-Bridge /opt/megatron-lm \
&& pip install --no-cache-dir --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@main \
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Megatron-LM pinned to @main is a moving target

The contribution checklist requires external dependencies be pinned to a
specific version, tag, or commit. @main will silently change between builds
and can break the validated training output (which is the main artifact of
this PR). The comment above this block already calls out the version
relationship — "megatron-bridge 0.4.0 requires MCore >=0.18" — so could you
pin to the specific commit or tag of Megatron-LM that matches the 0.4.0
release window (e.g. @core_r0.18.0 or a SHA you've validated against)?

Comment on lines +133 to +134
&& pip install --no-cache-dir --no-deps "nvidia-modelopt>=0.33" \
&& pip install --no-cache-dir --no-deps "megatron-bridge>=0.4.0"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvidia-modelopt and megatron-bridge lack upper bounds

Other pip installs in the same Dockerfile use bounded ranges
("awscli>=1.44,<2.0", "transformers>=4.57,<5.0"). For consistency and to
match the checklist, please add upper bounds. Without these, a future
docker build of the same Dockerfile may pick up an incompatible major
version and fail to import.

Suggested change
&& pip install --no-cache-dir --no-deps "nvidia-modelopt>=0.33" \
&& pip install --no-cache-dir --no-deps "megatron-bridge>=0.4.0"
&& pip install --no-cache-dir --no-deps "nvidia-modelopt>=0.33,<1.0" \
&& pip install --no-cache-dir --no-deps "megatron-bridge>=0.4.0,<0.5.0"

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Batch 5/5 — Documentation Consistency

Final batch. Three inline doc findings below, plus one cross-file framing
issue and the "things that look great" callout.

"Mock data" framing in §8.1 contradicts the actual behaviour

§8.1 says:

For pure throughput benchmarking, the training script uses mock data by
default (no --hf-model-path required for mock).

Reading pretrain_qwen3.py (lines 168-180), mock data is always used —
input_ids is unconditionally torch.randint(...) regardless of whether real
weights were loaded. So "benchmark mode" is the only mode. §5 ("Distributed
Training") implies real training is happening when --hf-model-path is set,
which isn't the case. This ties back to the loss issue in Batch 3; whichever
way you resolve that, the README narrative needs to match. Could you collapse
§5 and §8.1 into one section that's explicit about benchmark-only behaviour,
or add a real-data training path?


Things That Look Great

  • The README is unusually thorough for a new test case — eight sections,
    validated cluster topology, EFA verification logs, NCCL tuner output, and
    per-step throughput. That's exactly what other contributors should be
    modeling off of.
  • The Dockerfile follows the repo's canonical EFA-stack pattern: remove
    IB/HPC-X libs, install EFA 1.47.0 (matches the CI minimum) with
    --skip-kmod --no-verify --skip-limit-conf, set the right LD_LIBRARY_PATH
    ordering, then install GDRCopy. Easy to review.
  • envsubst is called with explicit variable whitelists in both README
    examples ('$REPO_URI $HF_MODEL $MODEL_SIZE' and the longer training one),
    which is the convention.
  • NCCL_SOCKET_IFNAME=^lo (manifest) and ^docker,lo,veth_def_agent
    (Dockerfile) use exclusion patterns rather than positive selection — the
    repo's preferred form, and the one that survives EFA interfaces.
  • The pytorchjob.yaml-template ships a strong NCCL/Torch debug environment
    (TORCH_NCCL_TRACE_BUFFER_SIZE, TORCH_NCCL_DUMP_ON_TIMEOUT,
    TORCH_DISTRIBUTED_DEBUG=DETAIL) which is genuinely useful for first-time
    HyperPod EKS users debugging hangs.
  • Using a Kubernetes Secret (hf-token) for the HuggingFace token rather
    than baking it into the manifest is the right choice and is propagated
    cleanly to both the download Job and the training Worker.
  • PyTorchJob (Kubeflow) is used for distributed training rather than a
    plain Job, matching the repo guidance.
  • The base image nvcr.io/nvidia/nemo:25.07 is pinned to an explicit tag.
  • The validated test plan in the PR description (8m wall-clock, 2/2 workers
    succeeded, 32 NICs detected per node, ~330K tokens/s steady-state) gives a
    reviewer confidence the artifact actually runs end-to-end on the target
    hardware.

Thanks again for the contribution!

Comment on lines +137 to +150
## 6. Model Sizes and Recommended Parallelism

The table below provides recommended parallelism settings for each Qwen 3 model size
on 2x p5.48xlarge (16 GPUs total):

| Model | Parameters | TP | PP | Nodes (p5.48xlarge) | Notes |
|-------|-----------|----|----|---------------------|-------|
| Qwen3-0.6B | 0.6B | 1 | 1 | 1 | Fits on single GPU |
| Qwen3-1.7B | 1.7B | 1 | 1 | 1 | Fits on single GPU |
| Qwen3-4B | 4B | 2 | 1 | 1 | 2-way tensor parallel |
| Qwen3-8B | 8B | 4 | 1 | 1 | 4-way tensor parallel |
| Qwen3-14B | 14B | 8 | 1 | 1 | Full node tensor parallel |
| Qwen3-32B | 32B | 8 | 2 | 2 | TP + PP with activation recompute |

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Section 6 parallelism table contradicts the section heading

The header reads "recommended parallelism settings for each Qwen 3 model size
on 2x p5.48xlarge (16 GPUs total)" but the table's Nodes (p5.48xlarge)
column mixes 1 and 2. Models marked Nodes=1 (Qwen3-0.6B through
Qwen3-14B) only make sense on a single-node cluster, contradicting the
section premise. I think what you meant is "minimum nodes required" — could
you either retitle the section ("Minimum cluster sizing per Qwen 3 size") or
expand the table to show both single-node and 2-node configurations
explicitly? The latter is more useful since the entire validated workflow
assumes 2 nodes.

(See also the pipeline-schedule finding in Batch 3 — the 32B row needs to be
revisited regardless: with no PP schedule wired in, that config won't run as
documented.)

export MODEL_SIZE=8b # Match the model: 0.6b, 1.7b, 4b, 8b, 14b, 32b

# Create a Kubernetes Secret for HuggingFace token (one-time setup)
kubectl create secret generic hf-token --from-literal=token=<your-huggingface-token>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<your-huggingface-token> placeholder lacks an explicit "replace this" cue

Most placeholder readers know to replace it, but a one-line note above the
command ("Replace <your-huggingface-token> with a token from
huggingface.co/settings/tokens that has access to the Qwen3 repos.") would
save a step for first-time HyperPod users. Minor.

Comment on lines +5 to +6
!*.yaml-template
*.yaml
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manifests/.gitignore rule ordering is unnecessary

*.yaml doesn't match *.yaml-template (the glob is anchored to the file
suffix), so the !*.yaml-template re-include rule is a no-op. You can drop
it and just keep *.yaml. Not a bug, just dead config.

Suggested change
!*.yaml-template
*.yaml
*.yaml

Copy link
Copy Markdown
Collaborator

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants