EDIT: based on new findings
Describe the bug
Tagging @mcore-oncall.
Megatron-Core training reliably crashes a few seconds into the first training iteration on GCP A3-mega (TCPxO / FasTrak network) with:
NCCL WARN NET/FasTrak ...: INVALID_ARGUMENT: Buffer registration request does not have a valid fd set.
NCCL WARN NET/FasTrak ...: FAILED_PRECONDITION: receive() error: 32: Broken pipe
The FasTrak plugin's per-comm buffer-registration call returns "no valid fd set" the first time any collective runs on each megatron-core subgroup process group (dp_cp_group, dp_group, model_parallel_group, …). This is independent of the tensor's size or memory backing; see "Localization" below for the falsified hypothesis.
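To check that every rank reports the same signature, the two warnings above can be tallied from the collected logs. This is a minimal sketch: the function name tally_fastrak_warnings is hypothetical, and because the middle of each log line is elided in this report, the pattern deliberately makes no assumption about what sits between "NET/FasTrak" and the status code.

```python
import re
from collections import Counter

# Matches "NCCL WARN NET/FasTrak <anything>: STATUS_CODE: message".
# The part between the plugin name and the status code is elided in the
# report, so it is matched lazily rather than parsed.
_WARN = re.compile(
    r"NCCL WARN NET/FasTrak\b.*?:\s*(?P<status>[A-Z_]+):\s*(?P<message>.*)"
)


def tally_fastrak_warnings(lines):
    """Count FasTrak plugin warnings by status code."""
    counts = Counter()
    for line in lines:
        match = _WARN.search(line)
        if match:
            counts[match.group("status")] += 1
    return counts


sample = [
    "NCCL WARN NET/FasTrak ...: INVALID_ARGUMENT: Buffer registration request does not have a valid fd set.",
    "NCCL WARN NET/FasTrak ...: FAILED_PRECONDITION: receive() error: 32: Broken pipe",
    "=== Starting training ===",
]
counts = tally_fastrak_warnings(sample)
for status, n in sorted(counts.items()):
    print(f"{status}: {n}")
```

Running this over each rank's log separately shows whether INVALID_ARGUMENT appears everywhere or only on some ranks, with the Broken pipe lines following as fallout.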
Environment:
10× A3-mega (8× H100/node), GCP TCPxO networking (host shim under /var/lib/tcpxo/lib64, qualified for NCCL 2.28.7-1)
Container: nvcr.io/nvidia/pytorch:24.12-py3 (PyTorch 2.6, CUDA 12.6, TE 1.13)
megatron-core==0.15.0, mcore-bridge>=1.2.0, transformers==4.46.3
Bundled NCCL 2.27.5 — also reproduced after swapping PyTorch's libnccl.so.2 for the host TCPxO-bundled NCCL 2.28.7-1 to exactly match the qualified shim
Driven by ms-swift's megatron sft (Megatron-SWIFT)
Model: Qwen3-VL-30B-A3B-Instruct (MoE)
Parallelism: TP=4, PP=2, EP=8, ETP=1, SP=on, moe_grouped_gemm=true, moe_expert_capacity_factor=1.0, recompute_granularity=full, attention_backend=flash, optimizer_cpu_offload=true, packing=true, micro-bs=1, GBS=40, max_length=5500
Symptom timing: === Starting training === is reached, megatron ModelConfig initialization completes, and within seconds of the first all-reduce / broadcast we hit the FasTrak buffer-registration error simultaneously on all 80 ranks. There is no assertion or Python traceback originating from megatron, just NCCL warnings followed by communicator teardown and PyTorch surfacing ncclSystemError.
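For orientation, the subgroup sizes implied by the parallelism settings above can be worked out as follows. This is a sketch under stated assumptions: context parallelism is not listed in this report, so CP=1 is assumed, and the Layout class and num_microbatches helper are hypothetical names used only for the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    world_size: int
    tp: int
    pp: int
    ep: int
    etp: int
    cp: int = 1  # assumption: CP is not stated in the report

    @property
    def dp(self) -> int:
        """Data-parallel size for the dense layers."""
        dense = self.tp * self.pp * self.cp
        if self.world_size % dense:
            raise ValueError("world_size must be divisible by TP*PP*CP")
        return self.world_size // dense

    @property
    def model_parallel(self) -> int:
        """Ranks in one model-parallel group (TP x PP)."""
        return self.tp * self.pp

    @property
    def expert_dp(self) -> int:
        """Data-parallel size for the expert layers."""
        expert = self.etp * self.ep * self.pp
        if self.world_size % expert:
            raise ValueError("world_size must be divisible by ETP*EP*PP")
        return self.world_size // expert


def num_microbatches(global_batch: int, micro_batch: int, dp: int) -> int:
    """Gradient-accumulation steps per optimizer step."""
    per_step = micro_batch * dp
    if global_batch % per_step:
        raise ValueError("global batch must be divisible by micro_batch*DP")
    return global_batch // per_step


layout = Layout(world_size=80, tp=4, pp=2, ep=8, etp=1)
steps = num_microbatches(global_batch=40, micro_batch=1, dp=layout.dp)
print(layout.dp, layout.model_parallel, layout.expert_dp, steps)  # 10 8 5 4
```

With 80 ranks this gives DP=10, model-parallel groups of 8 ranks, expert DP=5, and 4 micro-batches per step, so the data-parallel collectives span all 10 nodes.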
Localization (three specific call sites)
We ran three iterations of the experiment, each one removing the previous failing site. Each iteration uncovered a new crash site on a different megatron-core process group, with the same INVALID_ARGUMENT: ... no valid fd set:
1. Unpatched run: crashed at megatron/core/distributed/finalize_model_grads.py:484 on all_reduce(num_tokens, group=dp_cp_group). Process group: dp_cp_group.
2. Run-time patch of site #1 (pad num_tokens to a 2 MB VMM-backed scratch): crash moved to megatron/core/optimizer/clip_grads.py:130, the all_reduce(...) inside get_grad_norm_fp32. Process group: dp_group.
3. Universal monkey-patch of torch.distributed.all_reduce / broadcast to route every small CUDA tensor through a 2 MB VMM-backed scratch (installed via .pth so all 80 worker processes see it): crash moved to swift/megatron/utils/parallel_utils.py:25 (logical_and_across_model_parallel_group), all_reduce(input, op=MIN, group=mpu.get_model_parallel_group()). Process group: model_parallel_group.
Falsified: tensor size / VMM-backing is not the trigger. In the third run the wrapper was in the call stack and was substituting a freshly-allocated 2 MB scratch tensor for the original scalar:
swift/megatron/utils/parallel_utils.py:25 logical_and_across_model_parallel_group
torch.distributed.all_reduce(input, op=MIN, group=mpu.get_model_parallel_group())
/usr/local/lib/python3.11/site-packages/_patch_dist.py:50 in all_reduce
r = _orig_all_reduce(scratch, *args, **kwargs) # scratch is 2 MB, freshly allocated
torch/distributed/c10d_logger.py:83 wrapper
torch/distributed/distributed_c10d.py:3007 all_reduce
→ NET/FasTrak: INVALID_ARGUMENT: Buffer registration request does not have a valid fd set
A 2 MB freshly-allocated CUDA tensor failed FasTrak buffer registration in the same way as the original int64 scalar.
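For reference, the universal wrapper from the third run has roughly the following shape. This is a minimal reconstruction, not the actual _patch_dist.py: the names SCRATCH_BYTES, scratch_numel and install_padded_all_reduce are hypothetical, only all_reduce is shown (broadcast would be wrapped the same way), and async calls are passed through unchanged. It also assumes that a fresh allocation under expandable_segments:True is VMM-backed.

```python
import functools

SCRATCH_BYTES = 2 * 1024 * 1024  # 2 MB, the scratch size named in the report


def scratch_numel(itemsize: int) -> int:
    """Number of elements of the given byte width that fill the scratch."""
    return SCRATCH_BYTES // itemsize


def install_padded_all_reduce() -> None:
    """Replace torch.distributed.all_reduce with a padding wrapper (sketch)."""
    import torch  # imported lazily so the helper above is usable without torch
    import torch.distributed as dist

    orig_all_reduce = dist.all_reduce

    @functools.wraps(orig_all_reduce)
    def all_reduce(tensor, *args, **kwargs):
        # Signature is all_reduce(tensor, op, group, async_op).
        async_op = kwargs.get("async_op", args[2] if len(args) > 2 else False)
        nbytes = tensor.numel() * tensor.element_size()
        if not tensor.is_cuda or nbytes >= SCRATCH_BYTES or async_op:
            return orig_all_reduce(tensor, *args, **kwargs)

        # Copy the small tensor into the head of a fresh 2 MB scratch buffer,
        # reduce the whole buffer, then copy the head back in place.
        n = tensor.numel()
        scratch = torch.zeros(
            scratch_numel(tensor.element_size()),
            dtype=tensor.dtype,
            device=tensor.device,
        )
        scratch[:n].copy_(tensor.reshape(-1))
        result = orig_all_reduce(scratch, *args, **kwargs)
        tensor.copy_(scratch[:n].view(tensor.shape))
        return result

    dist.all_reduce = all_reduce
```

Calling install_padded_all_reduce() from a module imported through a .pth file applies it in every worker process. It only affects call sites that look up torch.distributed.all_reduce at call time, which was enough here for the wrapper to appear in the failing stack.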
The pattern across the three runs is: the first collective on each megatron-core subgroup PG fails FasTrak buffer registration. Eliminate one site, and the next subgroup's first collective fails the same way. Subgroups are created via dist.new_group(rank_list), which on NCCL may map to ncclCommSplit of the world communicator (as far as we know, PyTorch only splits when the default process group was initialized eagerly with device_id; otherwise each subgroup gets a freshly initialized communicator). FasTrak's per-comm buffer-registration step looks like it doesn't get a valid FD on these subgroup comms.
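If the pattern above is right, it should be reproducible without megatron-core at all. The following is a sketch of such a standalone probe, not something we have run on the A3-mega cluster: the helper contiguous_groups and the group size of 8 are hypothetical choices, and megatron-core's real rank-to-group mapping differs from this contiguous layout.

```python
import os


def contiguous_groups(world_size: int, group_size: int) -> list[list[int]]:
    """Partition ranks 0..world_size-1 into consecutive blocks."""
    if world_size % group_size:
        raise ValueError("world_size must be divisible by group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]


def main() -> None:
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()

    # Step 1: first collective on the world communicator.
    probe = torch.ones(1, dtype=torch.int64, device="cuda")
    dist.all_reduce(probe)
    torch.cuda.synchronize()
    print(f"[rank {rank}] world all_reduce ok: {probe.item()}", flush=True)

    # Step 2: create subgroups. Every rank must call new_group for every
    # group, in the same order, even for groups it does not belong to.
    my_group = None
    for ranks in contiguous_groups(world, group_size=8):
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group = group

    # Step 3: first collective on the subgroup, the step that fails in
    # the runs described above.
    probe.fill_(1)
    dist.all_reduce(probe, group=my_group)
    torch.cuda.synchronize()
    print(f"[rank {rank}] subgroup all_reduce ok: {probe.item()}", flush=True)

    dist.destroy_process_group()


# torchrun sets RANK; without it there is nothing to initialize.
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

Launched with torchrun across the same nodes and the same NCCL environment, a failure at step 3 but not step 1 would point at subgroup communicator setup rather than at anything megatron-core does with the tensors.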
Expected behavior
Megatron-core subgroup process groups should be set up in a way that's compatible with FasTrak/dma-buf-style NCCL plugins, or there should be a documented config option to fall back to a registration path that doesn't rely on a per-comm FD.
Additional context
Things ruled out (all reproduce the crash):
NCCL_DMABUF_ENABLE=0
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (already on)
Padding the num_tokens all_reduce/broadcast in finalize_model_grads.py to 2 MB — fixes site (1), exposes site (2)
Universal monkey-patch wrapping every torch.distributed.all_reduce / broadcast on small CUDA tensors via a 2 MB VMM-backed scratch (installed via .pth so all 80 worker processes see it) — fixes site (2), exposes site (3); this is what falsified the tensor-size hypothesis
Container base bumped through nvcr.io/nvidia/pytorch:24.04 → 24.07 → 24.10 → 24.12 (TE 1.5 → 1.8 → 1.11 → 1.13); 24.04 hits the same crash, newer bases needed for TEGroupedLinear (mcore-bridge ≥ 1.2 requirement)
Swapping PyTorch's bundled NCCL 2.27.5 for the host TCPxO's NCCL 2.28.7-1 so runtime NCCL matches the FasTrak shim's qualified version exactly: same crash signature
Re-exporting the mcore checkpoint under the matching TE version (rules out TE extra_state pickle mismatch)
NCCL env (sourced from /var/lib/tcpxo/lib64/nccl-env-profile.sh) is the standard FasTrak/TCPxO config: NCCL_FASTRAK_USE_LLCM=1, NCCL_FASTRAK_USE_SNAP=1, NCCL_FASTRAK_NUM_FLOWS=2, NCCL_PROTO=Simple, NCCL_NET_GDR_LEVEL=PIX, etc.
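Since the list above ends in "etc.", the full set of NCCL variables seen by a worker can be dumped from inside the process. A minimal sketch, with nccl_env as a hypothetical helper name:

```python
import os


def nccl_env(environ=None):
    """Return all NCCL_* variables, sorted by name."""
    env = os.environ if environ is None else environ
    return {key: env[key] for key in sorted(env) if key.startswith("NCCL_")}


if __name__ == "__main__":
    # RANK is set by torchrun; fall back to "?" when run by hand.
    rank = os.environ.get("RANK", "?")
    for key, value in nccl_env().items():
        print(f"[rank {rank}] {key}={value}")
```

Running this at the top of the training entry point confirms that the values from nccl-env-profile.sh actually reach all worker processes.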
Question for mcore-oncall: how are megatron-core's subgroup PGs (dp_cp_group, dp_group, model_parallel_group, etc.) constructed in the current 0.15 line? Specifically, do they go through ncclCommSplit of the world comm, and is there any FasTrak/dma-buf-aware initialization step they rely on (e.g., a warm-up collective or buffer-registration handshake) that might be missing?
Thanks