Commit 4f8240e

Han Wang committed
fix(op): drain pending MPI eager-send ACKs in border_op via Barrier
The empty-subdomain spin LAMMPS test (``processors 2 1 1`` with all atoms on rank 0, rank 1 nloc=0) failed at MPI_Finalize with "Communicator (handle=0x44000000) being freed has 2 unmatched message(s)". Test outputs were correct; the failure was purely in the MPI cleanup path.

Root cause is the asymmetric ghost-exchange pattern that arises when one rank only Sends and the other only Irecvs at a given swap (no local atoms means nothing to send back). Under MPICH eager protocol:

* The sender's MPI_Send returns once the message is queued in the eager buffer; the receiver's ACK round-trip is processed asynchronously by MPI's progress engine.
* In symmetric swaps the sender also calls MPI_Wait on its own Irecv, which advances the progress engine and drains pending ACKs.
* In asymmetric swaps the sender makes no further MPI call inside border_op, so the ACK stays unprocessed. The "in-flight" counter remains nonzero, and MPI_Finalize reports it as unmatched.

Fix: add a single ``MPI_Barrier(world)`` at the end of ``Border::forward_t`` and ``Border::backward_t``. The Barrier forces a round-trip on every rank, which advances every rank's progress engine and drains pending ACKs.

Cost is one collective per ghost-exchange call; on a 2-rank, 6-swap, 4-atom case this is in the noise vs the surrounding model forward.

Verified on remote (CUDA + MPICH):

test_lammps_spin_dpa3_pt2.py ... [3 passed]
test_lammps_dpa3_pt2.py ............... [15 passed]

Restores the multi-rank LAMMPS spin GNN with empty-subdomain support (PR #5430 CI's last failing case).
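The accounting described in this message can be illustrated with a toy model. The sketch below does not use MPI and does not reflect MPICH internals; ``ToyRank``, ``toy_send``, ``toy_progress`` and ``unmatched_at_finalize`` are hypothetical names introduced only for illustration. It assumes, as the message states, that an eager send leaves one "in-flight" entry on the sender until the sender's own progress engine runs, and that only a later blocking call (Wait or Barrier) runs it.

```cpp
#include <vector>

// Toy model only: hypothetical names, not MPICH data structures.
struct ToyRank {
  int in_flight = 0;     // eager sends whose ACK has not been processed yet
  int pending_acks = 0;  // ACKs waiting for this rank's next progress call
};

// Eager send: returns immediately and leaves one ACK to be processed later.
void toy_send(ToyRank& sender) {
  ++sender.in_flight;
  ++sender.pending_acks;
}

// Stand-in for any blocking MPI call (Wait, Barrier) that advances progress.
void toy_progress(ToyRank& rank) {
  rank.in_flight -= rank.pending_acks;
  rank.pending_acks = 0;
}

// One asymmetric exchange: rank 0 only sends, rank 1 only receives and waits.
// Returns rank 0's in-flight count at the point where finalize would run.
int unmatched_at_finalize(int n_sends, bool barrier_at_end) {
  std::vector<ToyRank> ranks(2);
  for (int i = 0; i < n_sends; ++i) {
    toy_send(ranks[0]);
  }
  toy_progress(ranks[1]);  // rank 1's Wait only advances rank 1
  if (barrier_at_end) {
    for (ToyRank& rank : ranks) {
      toy_progress(rank);  // a Barrier advances every rank
    }
  }
  return ranks[0].in_flight;
}
```

In this model, ``unmatched_at_finalize(2, false)`` leaves two entries on rank 0, mirroring the "2 unmatched message(s)" report, while ``unmatched_at_finalize(2, true)`` leaves none.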
1 parent e19108d commit 4f8240e

1 file changed

Lines changed: 23 additions & 0 deletions

source/op/pt/comm.cc

@@ -165,6 +165,22 @@ class Border : public torch::autograd::Function<Border> {
       recv_g1 += nrecv * tensor_size;
     }
 #ifdef USE_MPI
+    // Drain pending eager-send ACKs before returning. In the
+    // asymmetric ghost-exchange pattern (one rank only Sends, the
+    // other only Irecvs at a given swap — e.g. an empty subdomain
+    // under ``processors 2 1 1``) the sender's MPI_Send returns once
+    // the eager-buffered message is queued, but MPICH's internal
+    // accounting marks the message as "in flight" until the sender's
+    // progress engine processes the receiver's ACK. In the symmetric
+    // case the sender's own MPI_Wait on its Irecv drains those ACKs.
+    // In the asymmetric case there is no such Wait, and the message
+    // stays "in flight" all the way to MPI_Finalize, which then
+    // reports ``Communicator (...) being freed has N unmatched
+    // message(s)``. An MPI_Barrier on the same communicator forces a
+    // round-trip on every rank, drains ACKs, and clears the counter.
+    if (mpi_init && world_size >= 1) {
+      MPI_Barrier(world);
+    }
 #if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
     // Only copy back when ``recv_g1_tensor`` was actually moved to a
     // different device above (the cuda_aware==0 CPU fallback). When
@@ -339,6 +355,13 @@ class Border : public torch::autograd::Function<Border> {
       }
     }
 #ifdef USE_MPI
+    // Drain pending eager-send ACKs before returning — see forward_t
+    // for the full rationale. Backward has the same asymmetric
+    // Send/Irecv pattern (now in the reverse direction) and the same
+    // unmatched-message trap when one rank only Sends.
+    if (mpi_init && world_size >= 1) {
+      MPI_Barrier(world);
+    }
 #if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
     // Move result back to the device of the input grad only when
     // ``d_local_g1_tensor`` was actually moved to a different device
