fix(op): drain pending MPI eager-send ACKs in border_op via Barrier

Han Wang · Han Wang · commit 4f8240ea66e9 · 2026-05-04T21:56:59.000+08:00
The empty-subdomain spin LAMMPS test (``processors 2 1 1`` with all atoms on rank 0, rank 1 nloc=0) failed at MPI_Finalize with

    Communicator (handle=0x44000000) being freed has 2 unmatched message(s)

Test outputs were correct; the failure was purely in the MPI cleanup path.

Root cause is the asymmetric ghost-exchange pattern that arises when one rank only Sends and the other only Irecvs at a given swap (no local atoms means nothing to send back). Under MPICH eager protocol:

* The sender's MPI_Send returns once the message is queued in the eager buffer; the receiver's ACK round-trip is processed asynchronously by MPI's progress engine.
* In symmetric swaps the sender also calls MPI_Wait on its own Irecv, which advances the progress engine and drains pending ACKs.
* In asymmetric swaps the sender makes no further MPI call inside border_op, so the ACK stays unprocessed. The "in-flight" counter remains nonzero, and MPI_Finalize reports it as unmatched.

Fix: add a single ``MPI_Barrier(world)`` at the end of ``Border::forward_t`` and ``Border::backward_t``. The Barrier forces a round-trip on every rank, which advances every rank's progress engine and drains pending ACKs.

Cost is one collective per ghost-exchange call; on a 2-rank, 6-swap, 4-atom case this is in the noise vs the surrounding model forward.

Verified on remote (CUDA + MPICH):

    test_lammps_spin_dpa3_pt2.py ...               [3 passed]
    test_lammps_dpa3_pt2.py ...............        [15 passed]

Restores the multi-rank LAMMPS spin GNN with empty-subdomain support (PR #5430 CI's last failing case).
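
The accounting argument above can be illustrated with a small toy model. This is only a sketch of the reasoning in this message, not MPICH's actual implementation: ``ToyRank``, ``eager_send``, ``progress`` and ``unacked_after_swaps`` are hypothetical names invented for illustration, and no real MPI calls are made. It assumes an eager send bumps an in-flight counter that only drops when the sender later enters a call that advances the progress engine.

```cpp
#include <cassert>
#include <queue>

// Toy model (hypothetical, NOT MPICH internals): one sending rank.
struct ToyRank {
  int in_flight = 0;          // eager sends whose ACK is not yet processed
  std::queue<int> ack_inbox;  // ACKs that have arrived but are unprocessed

  // Stand-in for any call that advances the progress engine
  // (MPI_Wait, MPI_Barrier, ...): process every pending ACK.
  void progress() {
    while (!ack_inbox.empty()) {
      ack_inbox.pop();
      assert(in_flight > 0);
      --in_flight;
    }
  }
};

// Eager send: returns as soon as the message is queued. The ACK
// arrives later and sits in the inbox until progress() runs.
void eager_send(ToyRank& sender) {
  ++sender.in_flight;
  sender.ack_inbox.push(1);
}

// Run nswap swaps on the sending rank and return the counter that a
// finalize step would observe.
//   symmetric      : sender also waits on its own receive in each swap
//   barrier_at_end : a barrier is issued after the last swap
int unacked_after_swaps(int nswap, bool symmetric, bool barrier_at_end) {
  ToyRank sender;
  for (int i = 0; i < nswap; ++i) {
    eager_send(sender);
    if (symmetric) {
      sender.progress();  // models MPI_Wait on the sender's own Irecv
    }
  }
  if (barrier_at_end) {
    sender.progress();  // models MPI_Barrier(world)
  }
  return sender.in_flight;
}
```

In this model a symmetric exchange leaves the counter at zero, a send-only exchange leaves one pending entry per swap, and a trailing barrier clears them, which is the behavior the fix relies on.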
diff --git a/source/op/pt/comm.cc b/source/op/pt/comm.cc
@@ -165,6 +165,22 @@ class Border : public torch::autograd::Function<Border> {
       recv_g1 += nrecv * tensor_size;
     }
 #ifdef USE_MPI
+    // Drain pending eager-send ACKs before returning.  In the
+    // asymmetric ghost-exchange pattern (one rank only Sends, the
+    // other only Irecvs at a given swap — e.g. an empty subdomain
+    // under ``processors 2 1 1``) the sender's MPI_Send returns once
+    // the eager-buffered message is queued, but MPICH's internal
+    // accounting marks the message as "in flight" until the sender's
+    // progress engine processes the receiver's ACK.  In the symmetric
+    // case the sender's own MPI_Wait on its Irecv drains those ACKs.
+    // In the asymmetric case there is no such Wait, and the message
+    // stays "in flight" all the way to MPI_Finalize, which then
+    // reports ``Communicator (...) being freed has N unmatched
+    // message(s)``.  An MPI_Barrier on the same communicator forces a
+    // round-trip on every rank, drains ACKs, and clears the counter.
+    if (mpi_init && world_size >= 1) {
+      MPI_Barrier(world);
+    }
 #if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
     // Only copy back when ``recv_g1_tensor`` was actually moved to a
     // different device above (the cuda_aware==0 CPU fallback). When
@@ -339,6 +355,13 @@ class Border : public torch::autograd::Function<Border> {
       }
     }
 #ifdef USE_MPI
+    // Drain pending eager-send ACKs before returning — see forward_t
+    // for the full rationale.  Backward has the same asymmetric
+    // Send/Irecv pattern (now in the reverse direction) and the same
+    // unmatched-message trap when one rank only Sends.
+    if (mpi_init && world_size >= 1) {
+      MPI_Barrier(world);
+    }
 #if defined(GOOGLE_CUDA) || defined(TENSORFLOW_USE_ROCM)
     // Move result back to the device of the input grad only when
     // ``d_local_g1_tensor`` was actually moved to a different device