mcore scalar all_reduce fails NCCL FastTrak buffer registration #4660

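EDIT: updated based on new findings (the tensor-size hypothesis implied by the title has been falsified; see Localization below).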

**Describe the bug**

Tagging @mcore-oncall.

Megatron-Core training reliably crashes a few seconds into the first training iteration on GCP A3-mega (TCPxO / FastTrak network) with:

NCCL WARN NET/FasTrak ...: INVALID_ARGUMENT: Buffer registration request does not have a valid fd set.
NCCL WARN NET/FasTrak ...: FAILED_PRECONDITION: receive() error: 32: Broken pipe

The FastTrak plugin's per-comm buffer-registration call returns "no valid fd set" the first time any collective runs on each megatron-core subgroup process group (dp_cp_group, dp_group, model_parallel_group, …). This is independent of the tensor's size or memory backing — see Localization / falsified hypothesis below. 

**Environment:**

- 10× A3-mega (8× H100/node), GCP TCPxO networking (host shim under /var/lib/tcpxo/lib64, qualified for NCCL 2.28.7-1)
- Container: nvcr.io/nvidia/pytorch:24.12-py3 (PyTorch 2.6, CUDA 12.6, TE 1.13)
- megatron-core==0.15.0, mcore-bridge>=1.2.0, transformers==4.46.3
- Bundled NCCL 2.27.5; also reproduced after swapping PyTorch's libnccl.so.2 for the host TCPxO-bundled NCCL 2.28.7-1 to exactly match the qualified shim
- Driven by ms-swift's megatron sft (Megatron-SWIFT)
- Model: Qwen3-VL-30B-A3B-Instruct (MoE)
- Parallelism: TP=4, PP=2, EP=8, ETP=1, SP=on, moe_grouped_gemm=true, moe_expert_capacity_factor=1.0, recompute_granularity=full, attention_backend=flash, optimizer_cpu_offload=true, packing=true, micro-bs=1, GBS=40, max_length=5500
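
For readers mapping the crash sites to process groups, the group sizes implied by this configuration can be worked out as follows. This is a back-of-the-envelope sketch, not output from the failing job: `CP = 1` is an assumption (no context-parallel size is listed above), and the formulas reflect a general understanding of how Megatron-Core derives its data-parallel sizes.

```python
# Sizes taken from the Environment list; CP is an assumption (it is not listed there).
NODES, GPUS_PER_NODE = 10, 8
TP, PP, EP, ETP = 4, 2, 8, 1
CP = 1  # assumed: context parallelism is not mentioned in the report
MICRO_BS, GBS = 1, 40

world_size = NODES * GPUS_PER_NODE
dp = world_size // (TP * PP * CP)           # dense data-parallel size
expert_dp = world_size // (ETP * EP * PP)   # data-parallel size for expert layers
num_microbatches = GBS // (MICRO_BS * dp)   # micro-batches per optimizer step

group_sizes = {
    "world": world_size,
    "dp_group": dp,
    "dp_cp_group": dp * CP,
    "model_parallel_group": TP * PP,
    "expert_dp": expert_dp,
}
print(group_sizes, "microbatches per step:", num_microbatches)
```

Under these assumptions each data-parallel group has 10 ranks. With only 8 GPUs per node, such a group cannot fit on a single node, so its collectives have to cross the network plugin.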

Symptom timing: === Starting training === is reached, megatron ModelConfig initialization completes, and within seconds of the first all-reduce / broadcast we hit the FastTrak buffer-registration error simultaneously on all 80 ranks. No assertion or Python traceback originating from megatron — just NCCL warnings followed by communicator teardown and PyTorch surfacing ncclSystemError.
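
To check the "simultaneously on all 80 ranks" observation against per-rank logs, a small triage helper along these lines may be useful. It is a sketch that only assumes the two warning shapes quoted above; the function name `count_fastrak_errors` is made up for illustration, and the sample lines reuse the elided `...` from the report rather than real file or line information.

```python
import re
from collections import Counter

# Matches the warning shapes quoted in this report. The text between
# "NET/FasTrak" and the status code is elided there, so it is skipped lazily.
FASTRAK_WARN = re.compile(
    r"NCCL WARN NET/FasTrak.*?: (?P<code>[A-Z_]+): (?P<msg>.*)"
)

def count_fastrak_errors(lines):
    """Count NET/FasTrak warnings by status code (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FASTRAK_WARN.search(line)
        if match:
            counts[match.group("code")] += 1
    return counts

sample = [
    "NCCL WARN NET/FasTrak ...: INVALID_ARGUMENT: Buffer registration request does not have a valid fd set.",
    "NCCL WARN NET/FasTrak ...: FAILED_PRECONDITION: receive() error: 32: Broken pipe",
    "=== Starting training ===",
]
print(count_fastrak_errors(sample))
```

Running it once per rank log shows whether every rank reports INVALID_ARGUMENT, or whether some ranks only see the Broken pipe that would follow a peer tearing down its communicator.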

**Localization (three specific call sites)**

We ran three iterations of the experiment, each one removing the previous failing site. Each iteration uncovered a new crash site on a different megatron-core process group, with the same INVALID_ARGUMENT: ... no valid fd set:

1. Unpatched run: crashed at megatron/core/distributed/finalize_model_grads.py:484 on all_reduce(num_tokens, group=dp_cp_group). Process group: dp_cp_group.

2. Run-time patch of site #1 (pad num_tokens to a 2 MB VMM-backed scratch): crash moved to megatron/core/optimizer/clip_grads.py:130, the all_reduce(...) inside get_grad_norm_fp32. Process group: dp_group.

3. Universal monkey-patch of torch.distributed.all_reduce / broadcast to route every small CUDA tensor through a 2 MB VMM-backed scratch (installed via .pth so all 80 worker processes see it): crash moved to swift/megatron/utils/parallel_utils.py:25 (logical_and_across_model_parallel_group), all_reduce(input, op=MIN, group=mpu.get_model_parallel_group()). Process group: model_parallel_group.

Falsified: tensor size / VMM-backing is not the trigger. In the third run the wrapper was in the call stack and was substituting a freshly-allocated 2 MB scratch tensor for the original scalar:

swift/megatron/utils/parallel_utils.py:25 logical_and_across_model_parallel_group
    torch.distributed.all_reduce(input, op=MIN, group=mpu.get_model_parallel_group())
/usr/local/lib/python3.11/site-packages/_patch_dist.py:50 in all_reduce
    r = _orig_all_reduce(scratch, *args, **kwargs)     # scratch is 2 MB, freshly allocated
torch/distributed/c10d_logger.py:83 wrapper
torch/distributed/distributed_c10d.py:3007 all_reduce
→ NET/FasTrak: INVALID_ARGUMENT: Buffer registration request does not have a valid fd set

So a freshly allocated 2 MB CUDA tensor failed FastTrak buffer registration in the same way as the original int64 scalar.

The pattern across the three runs is that the first collective on each megatron-core subgroup PG fails FastTrak buffer registration: eliminate one site, and the next subgroup's first collective fails the same way. Subgroups are created via dist.new_group(rank_list), which on NCCL we believe maps to ncclCommSplit of the world communicator (not yet confirmed; PyTorch may instead initialize a separate communicator depending on how init_process_group was called). Our working hypothesis is that FastTrak's per-comm buffer-registration step does not get a valid FD on these subgroup comms.
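
One way to confirm the "first collective per subgroup" pattern without hunting call sites one crash at a time is to log the first use of each process group before the collective runs. The wrapper below is a minimal sketch and not the `_patch_dist.py` used in the runs above; `trace_first_use` is a hypothetical name, and it assumes the group is passed either as `group=` or as the third positional argument, as in `all_reduce(tensor, op, group)` and `broadcast(tensor, src, group)`.

```python
import functools
import traceback

def trace_first_use(collective, log=print, seen=None):
    """Wrap a collective so its first call on each process group is logged.

    The entry is logged *before* the call, so it survives a crash inside it.
    Pass the same `seen` dict to several wrappers to track "first collective
    of any kind" per group instead of per collective.
    """
    # id(group) -> group; holding the reference keeps ids from being reused.
    seen = {} if seen is None else seen

    @functools.wraps(collective)
    def wrapper(tensor, *args, **kwargs):
        # args excludes the tensor, so a positional group sits at args[1].
        group = kwargs.get("group", args[1] if len(args) > 1 else None)
        if id(group) not in seen:
            seen[id(group)] = group
            caller = traceback.extract_stack(limit=2)[0]  # frame that called us
            log(f"first {collective.__name__} on group {group!r} "
                f"from {caller.filename}:{caller.lineno}")
        return collective(tensor, *args, **kwargs)

    return wrapper
```

It could be installed from the same .pth-loaded module with `torch.distributed.all_reduce = trace_first_use(torch.distributed.all_reduce)`. Code that imported `all_reduce` by name before the patch was applied will not see the wrapper. If the hypothesis holds, each rank's last logged line before the NCCL warning should be a "first ... on group" entry.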

**Expected behavior**

Megatron-core subgroup process groups should be set up in a way that's compatible with FastTrak/dma-buf-style NCCL plugins, or there should be a documented config option to fall back to a registration path that doesn't rely on a per-comm FD.

**Additional context**

Things ruled out (all reproduce the crash):

- NCCL_DMABUF_ENABLE=0
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (already on)
- Padding the num_tokens all_reduce/broadcast in finalize_model_grads.py to 2 MB: fixes site (1), exposes site (2)
- Universal monkey-patch wrapping every torch.distributed.all_reduce / broadcast on small CUDA tensors via a 2 MB VMM-backed scratch (installed via .pth so all 80 worker processes see it): fixes site (2), exposes site (3); this is what falsified the tensor-size hypothesis
- Container base bumped through nvcr.io/nvidia/pytorch:24.04 → 24.07 → 24.10 → 24.12 (TE 1.5 → 1.8 → 1.11 → 1.13); 24.04 hits the same crash, newer bases needed for TEGroupedLinear (mcore-bridge ≥ 1.2 requirement)
- Swapping PyTorch's bundled NCCL 2.27.5 for the host TCPxO's NCCL 2.28.7-1 so runtime NCCL matches the FastTrak shim's qualified version exactly: same crash signature
- Re-exporting the mcore checkpoint under the matching TE version (rules out TE extra_state pickle mismatch)

NCCL env (sourced from /var/lib/tcpxo/lib64/nccl-env-profile.sh): standard FastTrak/TCPxO config, i.e. NCCL_FASTRAK_USE_LLCM=1, NCCL_FASTRAK_USE_SNAP=1, NCCL_FASTRAK_NUM_FLOWS=2, NCCL_PROTO=Simple, NCCL_NET_GDR_LEVEL=PIX, etc.

Question for mcore-oncall: how are megatron-core's subgroup PGs (dp_cp_group, dp_group, model_parallel_group, etc.) constructed in the current 0.15 line? Specifically, do they go through ncclCommSplit of the world comm, and is there any FastTrak/dma-buf-aware initialization step they rely on (e.g., a warm-up collective or buffer-registration handshake) that might be missing?
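
To help separate a Megatron-Core issue from a PyTorch/NCCL/plugin one, a standalone reproducer along the following lines might be worth running on the same nodes. It is an untested sketch under stated assumptions: the environment variables `REPRO_EAGER_INIT` and `REPRO_STRIDE` are made up for this example, the strided grouping only approximates how data-parallel groups span nodes, and the `device_id` switch relies on PyTorch's documented behavior that passing it to init_process_group forms the communicator eagerly and lets subgroups use ncclCommSplit when possible.

```python
import os

def strided_rank_lists(world_size, stride):
    """Group ranks as [r, r+stride, r+2*stride, ...] for r in range(stride).

    With stride equal to the GPUs per node, each group holds one rank per
    node, so its collectives must go over the network.
    """
    if world_size % stride != 0:
        raise ValueError("world_size must be a multiple of stride")
    return [list(range(start, world_size, stride)) for start in range(stride)]

def main():
    # Imported here so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    init_kwargs = {}
    if os.environ.get("REPRO_EAGER_INIT") == "1":  # hypothetical switch
        init_kwargs["device_id"] = torch.device("cuda", local_rank)
    dist.init_process_group(backend="nccl", **init_kwargs)
    rank, world = dist.get_rank(), dist.get_world_size()

    # Step 1: a collective on the world group, as a baseline.
    t = torch.ones(1, dtype=torch.int64, device="cuda")
    dist.all_reduce(t)
    torch.cuda.synchronize()
    print(f"[rank {rank}] world all_reduce ok: {t.item()}", flush=True)

    # Step 2: create subgroups; every rank must call new_group for every group.
    stride = int(os.environ.get("REPRO_STRIDE", "8"))  # hypothetical knob
    my_group, my_ranks = None, None
    for ranks in strided_rank_lists(world, stride):
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group, my_ranks = group, ranks

    # Step 3: first collective on the subgroup, where the report sees the crash.
    t = torch.ones(1, dtype=torch.int64, device="cuda")
    dist.all_reduce(t, group=my_group)
    torch.cuda.synchronize()
    assert t.item() == len(my_ranks)
    print(f"[rank {rank}] subgroup all_reduce ok: {t.item()}", flush=True)

    dist.destroy_process_group()

# torchrun sets RANK and LOCAL_RANK; do nothing when imported or run bare.
if "RANK" in os.environ and "LOCAL_RANK" in os.environ:
    main()
```

Launched under torchrun with the same sourced NCCL environment, plus NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=INIT,NET, this would show whether step 3 fails outside Megatron-Core, and whether toggling the eager-init path changes the outcome.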

Thanks
