Adding a megatron-bridge sample#1065
Conversation
paragao
left a comment
There was a problem hiding this comment.
Review: PR #1065 — Adding a megatron-bridge sample
Thanks for this contribution. A few observations:
What's done well
- Security: HF token is correctly sourced from a Kubernetes Secret (
hf-token) — no hardcoded credentials anywhere. - HPC/EFA config: The NCCL environment variables are comprehensive and correct for p5.48xlarge (32 EFA adapters,
FI_EFA_USE_DEVICE_RDMA=1,NCCL_SOCKET_IFNAME=^lo, trace buffer/dump for debugging hangs). - Documentation: The README is thorough — prerequisites, parallelism recommendation table, validated output, and cleanup instructions.
- Megatron-Core patterns: The
main_gradallocation and gradient copy logic is correct. - Fault tolerance: Elastic policy with
maxRestarts: 100+restartPolicy: OnFailureis well configured.
Blocking — please address before merge
1. Megatron-LM dependency is not pinned (aws-megatron-bridge.Dockerfile, line 113)
pip install --no-cache-dir --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@mainThis installs from @main, which can change at any time. The repo's contribution checklist requires: "External dependencies are pinned to a specific version or tag (no latest)".
Fix: Pin to the current commit hash:
pip install --no-cache-dir --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@<commit_hash>2. Training loop only supports PP=1, but README recommends PP=2 for Qwen3-32B (pretrain_qwen3.py, line 176)
output = model[0](input_ids, position_ids, attention_mask)With pipeline parallelism > 1, Megatron-Core returns multiple model chunks (one per stage). Only calling model[0] means stage-1 workers never receive activations — the job will hang or produce incorrect results. The README's Qwen3-32B row recommends PP=2, so users following that guidance would hit a broken run.
Fix (option A): Add a guard and update the README:
assert len(model) == 1, "This sample only supports pipeline_model_parallel_size=1"Then change the Qwen3-32B recommendation to PP=1 (adjusting TP/CP to compensate).
Fix (option B): Implement pipeline-parallel scheduling using Megatron-Core's forward_backward_no_pipelining / forward_backward_pipelining_with_interleaving from megatron.core.pipeline_parallel.schedules.
Non-blocking suggestions
- Error handling: Wrap the training entrypoint in
try/finallyto ensuredist.destroy_process_group()is called on failure. - etcd resource limits: The etcd container in
pytorchjob.yaml-templatehas no CPU/memory requests or limits. Add them for scheduling predictability. - etcd version:
3.4.13-0is from 2020. Consider upgrading to a 3.5.x release for security patches. - Unused
global_batch_sizeargument: The arg is accepted but the training loop doesn't implement gradient accumulation. A comment clarifying this is a simplified mock loop would prevent confusion. - SSH setup may be unnecessary: Since rendezvous uses torchrun/etcd (not MPI), the SSH key generation and config weakening in the Dockerfile add attack surface without benefit. Consider removing lines 60-71.
- README validated output: Consider replacing specific node names (
hyperpod-i-008789534bbb4c33f) and internal IPs with placeholders for a cleaner public sample.
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 1/5 — Structure & Repository Hygiene
Thanks for the contribution — overall this is a thorough sample with a great
README. The findings below are split across five themed batches; this one
covers structure and repo conventions. Subsequent batches will cover
deployment pipeline, training code, infrastructure/NCCL, and documentation.
Directory layout puts the model below kubernetes/ instead of above it
The repo convention (and the PR template you ticked) is
<framework>/<library>/<model>/{Dockerfile, README, slurm/, kubernetes/, hyperpod-eks/},
i.e. model directory above platform directories. The sibling megatron-lm and
nemo test cases follow that layout. Here the structure is inverted:
<library>/kubernetes/<model>/, which means the training script
pretrain_qwen3.py lives at kubernetes/qwen3/pretrain_qwen3.py even though
it's platform-agnostic Python that torchrun can invoke from Slurm too.
I'd suggest restructuring to
3.test_cases/megatron/megatron-bridge/qwen3/{Dockerfile, README.md, kubernetes/manifests/...}
so that adding a Slurm path later (or another model) doesn't require
re-locating the script. If you'd rather keep the current layout because the
sample is EKS-only today, it might still be worth pulling pretrain_qwen3.py
up to the library level and leaving only manifests under kubernetes/.
| export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') | ||
| export ACCOUNT=$(aws sts get-caller-identity --query Account --output text) | ||
| export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/ | ||
| export IMAGE_TAG=latest |
There was a problem hiding this comment.
:latest shown in user-facing build commands
The README sets IMAGE_TAG=latest here and prints ${REGISTRY}megatron-bridge-qwen3:latest as the image URI users should plug into REPO_URI (lines 58, 71). The repo policy is to avoid latest for reproducibility. For a sample whose validated output is captured against a specific container build, I'd suggest a date-stamped or semver tag and noting that users should bump it on each rebuild.
| export IMAGE_TAG=latest | |
| export IMAGE_TAG=2026-04-27 # bump on each rebuild; avoid `latest` for reproducibility |
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 2/5 — Deployment Pipeline (K8s / Slurm)
Two findings on the K8s manifests below. Both are about reducing
runtime variability on the hot path.
| args: | ||
| - | | ||
| set -ex | ||
| pip install huggingface_hub |
There was a problem hiding this comment.
huggingface_hub installed at job runtime without a version pin
The download Job runs pip install huggingface_hub at launch time. That pulls whatever is current on PyPI — both unpinned and network-fragile — and re-runs on every retry. Since the Dockerfile already has internet access at build time, I'd suggest moving the install into aws-megatron-bridge.Dockerfile with an upper bound (e.g. huggingface_hub>=0.25,<1.0) and dropping the runtime install. That also lets the download Job start instantly from FSx's perspective.
| containers: | ||
| - name: pytorch | ||
| image: ${REPO_URI} | ||
| imagePullPolicy: Always |
There was a problem hiding this comment.
imagePullPolicy: Always re-pulls the image on every pod start
Combined with :latest in the README this is consistent — but if you switch to a fixed tag (per the Structure batch suggestion) you can drop this to IfNotPresent and shave significant time off rendezvous on a multi-restart elastic job. Worth noting in the README either way.
| imagePullPolicy: Always | |
| imagePullPolicy: IfNotPresent |
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 3/5 — Training Code Quality
This is the largest batch. The training script has several distributed-training
correctness issues that, taken together, mean the loop doesn't actually train
anything in the way the README claims:
- No data-parallel gradient reduction (so DP=4 in the documented 8B run trains 4 independent models)
- No pipeline-parallel schedule (so the 32B row in §6 won't run as documented)
- The loss expression isn't a real LM loss
- The
attention_maskshape doesn't match what MCore GPT expects - Manual
main_grad → gradcopy can silently overwrite real gradients with zeros for embeddings/norms
The clean path forward, in my view, is to use Megatron-Bridge's official
training entrypoint (megatron.bridge.training.pretrain or the recipe API).
That owns the full lifecycle — DDP wrap, distributed optimizer, FP32 master,
gradient clipping, LR schedule, microbatching, and the right attention_mask
plumbing. If the goal is "smallest viable benchmark loop", at minimum (a) flip
wrap_with_ddp=True, (b) drop the manual main_grad allocation, and (c) use a
real cross-entropy loss against shifted labels.
Inline findings follow.
|
|
||
| # Forward pass (model is a list of pipeline-parallel chunks) | ||
| output = model[0](input_ids, position_ids, attention_mask) | ||
| loss = output.float().mean() |
There was a problem hiding this comment.
loss = output.float().mean() is not a language-modeling loss
The loop calls .backward() on the arithmetic mean of every logit in the
batch. That's not next-token cross-entropy and not even a noisy approximation
of one — it's a different objective with no token-identity signal. Five
mechanical points:
1. No labels are used. A real LM loss is
F.cross_entropy(logits.view(-1, V), labels.view(-1)), where labels is
input_ids shifted by one. The script never constructs labels; the loss is
purely a function of logit magnitudes, independent of the input tokens.
2. The gradient is uniform across the vocab axis.
∂mean(Z) / ∂Z[i,s,v] = 1 / (B · S · V) — the same constant for every logit.
A real CE gradient is sparse: softmax(Z) − one_hot(label). The uniform
gradient learns nothing about token preferences.
3. softmax(logits) is invariant to adding a constant to every logit.
The optimizer is shifting all logits by roughly the same amount per step, so
the predicted distribution is unchanged from initialization. The loss number
decreases but the model doesn't move on the task.
4. The loss is unbounded below. Cross-entropy is ≥ 0 (Gibbs); mean(logits)
has no floor. The validated 0.69 → −4.37 → … → −32.59 curve is the optimizer
shrinking output magnitudes without limit.
5. Step-1 = 0.69 is the wrong order of magnitude. A randomly-initialized
152K-vocab model under real CE starts near log(151936) ≈ 11.93. 0.69 is
just mean(logit_init) near zero.
Why this matters. Anyone using this script as a starting point for real
pretraining will copy the loss expression and silently get zero LM progress
with a beautifully decreasing curve. As a pure throughput benchmark it's also
slightly compromised — uniform vocab gradients skip the LM-head all-gather +
softmax-CE fused kernel that real workloads exercise.
Two options:
- Compute a real LM loss — shift
input_idsby one to form labels, and use
torch.nn.functional.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
(or, better, MCore'svocab_parallel_cross_entropysince logits are
vocab-sharded under TP). - Reframe as benchmark-only — rename file/sections to "benchmark" /
"throughput", drop thelossfield from the per-step print, and clearly
say in the README that this exercises forward+backward but doesn't optimize
an LM objective.output.sum().backward()is fine and you can drop the
optimizer step entirely; freezing parameters and timing forward only is
cleaner still.
For reference, 3.test_cases/megatron/megatron-lm/ uses Megatron-LM's official
pretrain_gpt.py with proper loss + LR schedule.
| if has_weights: | ||
| if rank == 0: | ||
| print(f"[Megatron-Bridge] Loading weights from {hf_model_path}") | ||
| model = provider.provide_distributed_model(wrap_with_ddp=False) |
There was a problem hiding this comment.
wrap_with_ddp=False skips DP all-reduce — gradients aren't synchronized across data-parallel ranks
Same flag at line 133 (the else branch). provide_distributed_model(wrap_with_ddp=False)
deliberately skips MCore's DDP wrapper. That wrapper is the component that
registers backward hooks (distributed_data_parallel.py#L354),
moves param.grad into param.main_grad inside the post-backward hook
(distributed_data_parallel.py#L449-L453),
and starts/finishes the DP all-reduce (or reduce-scatter when the distributed
optimizer is enabled) in param_and_grad_buffer.py#L517.
Without the wrapper, none of those hooks are registered: loss.backward() on
line 187 only runs local autograd plus the TP collectives MCore inlines into
its column/row-parallel kernels — there's no cross-rank gradient reduction at
all when world_size > tp * pp.
The flag's behaviour is documented in the Megatron-Bridge source:
provide_distributed_model signature and wrap_with_ddp docstring
(default is True, and ddp_config is required when it's True); the
_ddp_wrap helper
is what actually instantiates megatron.core.distributed.DistributedDataParallel
and wires the hooks. Setting wrap_with_ddp=False is supported (the
bridge-guide examples
use it for inference / weight-conversion paths) but it's an inference-mode
shortcut, not a training configuration.
For the README's documented configurations:
- Qwen3-8B (TP=4, PP=1, 16 GPUs) → DP=4, gradients diverge silently across
the 4 DP groups. - Qwen3-0.6B / 1.7B (TP=1, PP=1, 16 GPUs) → DP=16, every rank trains
independently. - Only Qwen3-14B and Qwen3-32B happen to have DP=1, so DP reduction is moot
— but those rows have other issues (see the pipeline-schedule and table notes).
Either flip the flag (wrap_with_ddp=True, then let MCore's DDP wire the
hooks) or call out in the README that this is a single-rank toy script and the
"DP" portion of any multi-rank run is not actually trained.
| # Copy main_grad -> grad for the standard PyTorch optimizer | ||
| for p in params: | ||
| if p.main_grad is not None: | ||
| if p.grad is None: | ||
| p.grad = p.main_grad.to(p.data.dtype) | ||
| else: | ||
| p.grad.copy_(p.main_grad) |
There was a problem hiding this comment.
Manual main_grad → grad copy can silently zero gradients for embeddings / norms
Lines 138-144 zero-allocate main_grad on every parameter; MCore's fused
column/row-parallel linear backward kernels capture weight.main_grad during
forward and accumulate weight gradients into it, which is how the loop
"appears" to work for the matmul weights. But parameters whose autograd path
writes the normal p.grad instead — embeddings, layer norms, biases on layers
not using the fused kernels — get their real gradient overwritten by the
zero-initialized p.main_grad here:
if p.grad is None:
p.grad = p.main_grad.to(p.data.dtype) # main_grad is still zeros for non-fused params
else:
p.grad.copy_(p.main_grad) # overwrites the real grad with zerosThe optimizer then "steps" with a zero gradient on those parameters, so
embeddings and norms don't actually train. This is in addition to the
no-DP-reduce issue above, and together they explain why the validated loss
curve doesn't look like real pretraining.
The clean fix is to use the official Megatron-Bridge training entrypoint
(megatron.bridge.training.pretrain, or the recipe API) which owns the full
lifecycle. If the goal is "smallest viable benchmark loop", at minimum (a)
flip wrap_with_ddp=True, (b) drop the manual main_grad allocation
(lines 138-144), and (c) let MCore's DDP populate main_grad for all
parameters before the optimizer step.
| attention_mask = torch.ones_like(input_ids) | ||
|
|
||
| # Forward pass (model is a list of pipeline-parallel chunks) | ||
| output = model[0](input_ids, position_ids, attention_mask) |
There was a problem hiding this comment.
No pipeline-parallel schedule — pp >= 2 configurations won't actually run
Every rank just calls its local model[0] directly. There's no pipeline
schedule, no forward_backward_func, no point-to-point send/recv between PP
stages, and no decoder_input plumbing from the previous stage. That's fine
for pp == 1, but the README's parallelism table (§6) documents Qwen3-32B as
TP=8, PP=2, and MODEL_CONFIGS["32b"] (line 56) hardcodes the same defaults.
The 32B path will either deadlock (later stages waiting for input that never
arrives) or train garbage hidden states.
Either drop the 32B row from the table (and the MODEL_CONFIGS["32b"] entry)
until pipeline support is wired in, or use Megatron-Bridge's training
entrypoint which calls megatron.core.pipeline_parallel.get_forward_backward_func
for you.
As a separate point, when pp >= 2 the return value of model[0](...) is
also no longer logits — non-last stages return hidden states — which would
break the next line's output.float().mean() even before the pipeline
plumbing is fixed.
| .unsqueeze(0) | ||
| .expand(args.micro_batch_size, -1) | ||
| ) | ||
| attention_mask = torch.ones_like(input_ids) |
There was a problem hiding this comment.
attention_mask = torch.ones_like(input_ids) is the wrong shape and dtype for MCore GPT
MCore's transformer block expects attention_mask shaped [1, 1, S, S] and
dtype bool (a boolean self-attention mask, where True typically means
"masked out" — the inverse of HuggingFace's "1 = keep" convention). This
script passes [B, S] int64 all-ones, which is the HF padding-mask
convention. For MCore GPT the [B, S] shape is the separate padding_mask
input, not attention_mask.
Depending on the backend (TE vs vanilla MCore attention) the script will
either silently use a default causal mask and ignore the input, or — worse —
mis-interpret the all-ones tensor and mask out everything. Either way the
forward isn't doing what the loop appears to think it's doing.
Two fixes:
- Pass
attention_mask=Noneand let MCore's attention build the causal mask
internally (this is whatpretrain_gpt.pydoes for synthetic data). - Or build the mask explicitly:
attention_mask = torch.triu( torch.ones((seq_len, seq_len), dtype=torch.bool, device=f"cuda:{local_rank}"), diagonal=1, ).view(1, 1, seq_len, seq_len)
| attention_mask = torch.ones_like(input_ids) | |
| attention_mask = None |
| print(f"[Megatron-Bridge] World size: {world_size}, TP={tp}, PP={pp}") | ||
| print(f"[Megatron-Bridge] Model path: {hf_model_path}") | ||
| print( | ||
| f"[Megatron-Bridge] Seq={args.seq_length}, GBS={args.global_batch_size}, Iters={args.train_iters}" |
There was a problem hiding this comment.
--global-batch-size is parsed and printed but never enforced
--global-batch-size is logged here and the README's §5.1 walks users through
setting it, but the loop only iterates one micro-batch per step with no
gradient accumulation and no DP × micro-batch math. The effective batch is
micro_batch_size × (world_size / (tp × pp)), which for the validated 0.6B
run is 1 × 16 = 16 — coincidentally matching the requested GBS=16, but the
script would not honor any other GBS the user sets.
Either implement microbatching/accumulation, or remove --global-batch-size
from the CLI and the README entirely so users aren't misled into thinking the
flag is honored.
| tp = args.tp or model_cfg["tp"] | ||
| pp = args.pp or model_cfg["pp"] |
There was a problem hiding this comment.
Truthiness fallback on --tp / --pp
0 would silently fall through to the config default. It's not a realistic
input today (TP=0 is meaningless), but is not None is the explicit form the
checklist calls for and it costs nothing.
| tp = args.tp or model_cfg["tp"] | |
| pp = args.pp or model_cfg["pp"] | |
| tp = args.tp if args.tp is not None else model_cfg["tp"] | |
| pp = args.pp if args.pp is not None else model_cfg["pp"] |
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 4/5 — Infrastructure & NCCL Configuration
Two findings on the Dockerfile's pip pins. The Dockerfile is otherwise solid
— the EFA stack pattern is canonical and the base image is correctly pinned.
| ###################### | ||
| RUN pip uninstall -y megatron-bridge megatron-core \ | ||
| && rm -rf /opt/Megatron-Bridge /opt/megatron-lm \ | ||
| && pip install --no-cache-dir --no-deps git+https://github.com/NVIDIA/Megatron-LM.git@main \ |
There was a problem hiding this comment.
Megatron-LM pinned to @main is a moving target
The contribution checklist requires external dependencies be pinned to a
specific version, tag, or commit. @main will silently change between builds
and can break the validated training output (which is the main artifact of
this PR). The comment above this block already calls out the version
relationship — "megatron-bridge 0.4.0 requires MCore >=0.18" — so could you
pin to the specific commit or tag of Megatron-LM that matches the 0.4.0
release window (e.g. @core_r0.18.0 or a SHA you've validated against)?
| && pip install --no-cache-dir --no-deps "nvidia-modelopt>=0.33" \ | ||
| && pip install --no-cache-dir --no-deps "megatron-bridge>=0.4.0" |
There was a problem hiding this comment.
nvidia-modelopt and megatron-bridge lack upper bounds
Other pip installs in the same Dockerfile use bounded ranges
("awscli>=1.44,<2.0", "transformers>=4.57,<5.0"). For consistency and to
match the checklist, please add upper bounds. Without these, a future
docker build of the same Dockerfile may pick up an incompatible major
version and fail to import.
| && pip install --no-cache-dir --no-deps "nvidia-modelopt>=0.33" \ | |
| && pip install --no-cache-dir --no-deps "megatron-bridge>=0.4.0" | |
| && pip install --no-cache-dir --no-deps "nvidia-modelopt>=0.33,<1.0" \ | |
| && pip install --no-cache-dir --no-deps "megatron-bridge>=0.4.0,<0.5.0" |
KeitaW
left a comment
There was a problem hiding this comment.
Review Batch 5/5 — Documentation Consistency
Final batch. Three inline doc findings below, plus one cross-file framing
issue and the "things that look great" callout.
"Mock data" framing in §8.1 contradicts the actual behaviour
§8.1 says:
For pure throughput benchmarking, the training script uses mock data by
default (no--hf-model-pathrequired for mock).
Reading pretrain_qwen3.py (lines 168-180), mock data is always used —
input_ids is unconditionally torch.randint(...) regardless of whether real
weights were loaded. So "benchmark mode" is the only mode. §5 ("Distributed
Training") implies real training is happening when --hf-model-path is set,
which isn't the case. This ties back to the loss issue in Batch 3; whichever
way you resolve that, the README narrative needs to match. Could you collapse
§5 and §8.1 into one section that's explicit about benchmark-only behaviour,
or add a real-data training path?
Things That Look Great
- The README is unusually thorough for a new test case — eight sections,
validated cluster topology, EFA verification logs, NCCL tuner output, and
per-step throughput. That's exactly what other contributors should be
modeling off of. - The Dockerfile follows the repo's canonical EFA-stack pattern: remove
IB/HPC-X libs, install EFA1.47.0(matches the CI minimum) with
--skip-kmod --no-verify --skip-limit-conf, set the rightLD_LIBRARY_PATH
ordering, then install GDRCopy. Easy to review. envsubstis called with explicit variable whitelists in both README
examples ('$REPO_URI $HF_MODEL $MODEL_SIZE'and the longer training one),
which is the convention.NCCL_SOCKET_IFNAME=^lo(manifest) and^docker,lo,veth_def_agent
(Dockerfile) use exclusion patterns rather than positive selection — the
repo's preferred form, and the one that survives EFA interfaces.- The
pytorchjob.yaml-templateships a strong NCCL/Torch debug environment
(TORCH_NCCL_TRACE_BUFFER_SIZE,TORCH_NCCL_DUMP_ON_TIMEOUT,
TORCH_DISTRIBUTED_DEBUG=DETAIL) which is genuinely useful for first-time
HyperPod EKS users debugging hangs. - Using a Kubernetes
Secret(hf-token) for the HuggingFace token rather
than baking it into the manifest is the right choice and is propagated
cleanly to both the download Job and the training Worker. PyTorchJob(Kubeflow) is used for distributed training rather than a
plainJob, matching the repo guidance.- The base image
nvcr.io/nvidia/nemo:25.07is pinned to an explicit tag. - The validated test plan in the PR description (8m wall-clock, 2/2 workers
succeeded, 32 NICs detected per node, ~330K tokens/s steady-state) gives a
reviewer confidence the artifact actually runs end-to-end on the target
hardware.
Thanks again for the contribution!
| ## 6. Model Sizes and Recommended Parallelism | ||
|
|
||
| The table below provides recommended parallelism settings for each Qwen 3 model size | ||
| on 2x p5.48xlarge (16 GPUs total): | ||
|
|
||
| | Model | Parameters | TP | PP | Nodes (p5.48xlarge) | Notes | | ||
| |-------|-----------|----|----|---------------------|-------| | ||
| | Qwen3-0.6B | 0.6B | 1 | 1 | 1 | Fits on single GPU | | ||
| | Qwen3-1.7B | 1.7B | 1 | 1 | 1 | Fits on single GPU | | ||
| | Qwen3-4B | 4B | 2 | 1 | 1 | 2-way tensor parallel | | ||
| | Qwen3-8B | 8B | 4 | 1 | 1 | 4-way tensor parallel | | ||
| | Qwen3-14B | 14B | 8 | 1 | 1 | Full node tensor parallel | | ||
| | Qwen3-32B | 32B | 8 | 2 | 2 | TP + PP with activation recompute | | ||
|
|
There was a problem hiding this comment.
Section 6 parallelism table contradicts the section heading
The header reads "recommended parallelism settings for each Qwen 3 model size
on 2x p5.48xlarge (16 GPUs total)" but the table's Nodes (p5.48xlarge)
column mixes 1 and 2. Models marked Nodes=1 (Qwen3-0.6B through
Qwen3-14B) only make sense on a single-node cluster, contradicting the
section premise. I think what you meant is "minimum nodes required" — could
you either retitle the section ("Minimum cluster sizing per Qwen 3 size") or
expand the table to show both single-node and 2-node configurations
explicitly? The latter is more useful since the entire validated workflow
assumes 2 nodes.
(See also the pipeline-schedule finding in Batch 3 — the 32B row needs to be
revisited regardless: with no PP schedule wired in, that config won't run as
documented.)
| export MODEL_SIZE=8b # Match the model: 0.6b, 1.7b, 4b, 8b, 14b, 32b | ||
|
|
||
| # Create a Kubernetes Secret for HuggingFace token (one-time setup) | ||
| kubectl create secret generic hf-token --from-literal=token=<your-huggingface-token> |
There was a problem hiding this comment.
<your-huggingface-token> placeholder lacks an explicit "replace this" cue
Most placeholder readers know to replace it, but a one-line note above the
command ("Replace <your-huggingface-token> with a token from
huggingface.co/settings/tokens that has access to the Qwen3 repos.") would
save a step for first-time HyperPod users. Minor.
| !*.yaml-template | ||
| *.yaml |
There was a problem hiding this comment.
manifests/.gitignore rule ordering is unnecessary
*.yaml doesn't match *.yaml-template (the glob is anchored to the file
suffix), so the !*.yaml-template re-include rule is a no-op. You can drop
it and just keep *.yaml. Not a bug, just dead config.
| !*.yaml-template | |
| *.yaml | |
| *.yaml |
Purpose
Adding a new Megatron-Bridge sample for distributed pretraining of Qwen 3 models on SMHP EKS
Changes
Added corresponding sample files
Test Plan
Ran the full workflow: built and pushed the container image to ECR, downloaded Qwen3-0.6B weights to FSx, and launched a 2-node distributed pretraining job on a SMHP EKS cluster.
Environment:
ami-0399b1f4b455cb701)Test commands:
NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.18.0
NCCL INFO NET/OFI Using Libfabric version 2.4
NCCL INFO NET/OFI Using transport protocol RDMA (platform set)
NCCL INFO NET/OFI Selected provider is efa, fabric is efa-direct (found 32 nics)
NCCL INFO NET/OFI Configuring AWS-specific options
NCCL INFO NET/OFI Internode latency set at 75.0 us
NCCL INFO TUNER/Plugin: Using nccl_ofi_tuner (v3)
NCCL INFO NET/OFI Region base Tuner is chosen for platform: p5.48xlarge
[Megatron-Bridge] Qwen3-0.6B pretraining
[Megatron-Bridge] World size: 16, TP=1, PP=1
[Megatron-Bridge] Model built: 0.60B params, 0.60B trainable
step 1/10 | loss: 0.6869 | time: 1.89s | tokens/s: 17,351
step 2/10 | loss: -4.3676 | time: 0.14s | tokens/s: 241,682
step 3/10 | loss: -8.8785 | time: 0.11s | tokens/s: 306,938
step 4/10 | loss: -14.0235 | time: 0.10s | tokens/s: 328,907
step 5/10 | loss: -18.2283 | time: 0.10s | tokens/s: 329,265
step 6/10 | loss: -19.9766 | time: 0.10s | tokens/s: 330,300
step 7/10 | loss: -23.3931 | time: 0.10s | tokens/s: 330,264
step 8/10 | loss: -26.0851 | time: 0.10s | tokens/s: 329,648
step 9/10 | loss: -29.8682 | time: 0.11s | tokens/s: 306,454
step 10/10 | loss: -32.5855 | time: 0.10s | tokens/s: 326,035
[Megatron-Bridge] Training complete!
3.test_cases/
└── / # e.g. pytorch, megatron, jax
└── / # e.g. picotron, FSDP, megatron-lm
└── / # e.g. SmolLM-1.7B (may be omitted for single-model cases)
├── Dockerfile # Container / environment setup
├── README.md # Overview, prerequisites, usage
├── slurm/ # Slurm-specific launch scripts
├── kubernetes/ # Kubernetes manifests
└── hyperpod-eks/ # HyperPod EKS instructions