Commit 4f8240e
Han Wang
fix(op): drain pending MPI eager-send ACKs in border_op via Barrier
The empty-subdomain spin LAMMPS test (``processors 2 1 1`` with all
atoms on rank 0, rank 1 nloc=0) failed at MPI_Finalize with
"Communicator (handle=0x44000000) being freed has 2 unmatched
message(s)". Test outputs were correct; the failure was purely in
the MPI cleanup path.
Root cause is the asymmetric ghost-exchange pattern that arises when
one rank only Sends and the other only Irecvs at a given swap (a rank
with no local atoms has nothing to send back). Under the MPICH eager
protocol:
* The sender's MPI_Send returns once the message is queued in the
  eager buffer; the receiver's ACK round-trip is processed
  asynchronously by MPI's progress engine.
* In symmetric swaps the sender also calls MPI_Wait on its own
  Irecv, which advances the progress engine and drains pending ACKs.
* In asymmetric swaps the sender makes no further MPI call inside
  border_op, so the ACK stays unprocessed. The "in-flight" counter
  remains nonzero, and MPI_Finalize reports it as unmatched.
Fix: add a single ``MPI_Barrier(world)`` at the end of
``Border::forward_t`` and ``Border::backward_t``. The Barrier
forces a round-trip on every rank, which advances every rank's
progress engine and drains pending ACKs. Cost is one collective
per ghost-exchange call; on a 2-rank, 6-swap, 4-atom case this is
negligible next to the surrounding model forward pass.
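The mechanism and the fix described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ``ToyRank`` and ``run_swaps`` are hypothetical names, the counters are a simplification of the behaviour described in this message rather than MPICH's actual internals, and no real MPI calls are made. It shows why a sender that never re-enters the library keeps a nonzero in-flight count, and why one trailing Barrier clears it.

```cpp
#include <cassert>

// Toy model of the sender-side state described in this commit message.
// NOT MPICH's real implementation; all names here are hypothetical.
struct ToyRank {
  int in_flight = 0;     // eager sends whose ACK has not been processed
  int acks_arrived = 0;  // ACKs delivered but not yet seen by the progress engine

  // Eager send: returns at once, the message is counted as in flight.
  void send() { ++in_flight; }

  // The peer's ACK arrives asynchronously; nothing is processed yet.
  void ack_arrives() { ++acks_arrived; }

  // Any call that enters the library drives the progress engine.
  void progress() {
    in_flight -= acks_arrived;
    acks_arrived = 0;
    assert(in_flight >= 0);
  }

  void wait() { progress(); }     // stands in for MPI_Wait on the rank's own Irecv
  void barrier() { progress(); }  // stands in for MPI_Barrier(world)
};

// Returns the sender's in-flight count after n_swaps ghost-exchange swaps.
//   symmetric      : the sender also waits on its own Irecv in each swap
//   barrier_at_end : one Barrier after the last swap (the fix)
int run_swaps(int n_swaps, bool symmetric, bool barrier_at_end) {
  ToyRank sender;
  for (int i = 0; i < n_swaps; ++i) {
    sender.send();
    sender.ack_arrives();
    if (symmetric) {
      sender.wait();  // drains the ACK inside the swap
    }
  }
  if (barrier_at_end) {
    sender.barrier();  // drains whatever is still pending
  }
  return sender.in_flight;
}
```

With an arbitrary count of three swaps, ``run_swaps(3, false, false)`` leaves three sends in flight, while both ``run_swaps(3, true, false)`` and ``run_swaps(3, false, true)`` return zero. The counts are properties of the toy model only and are not meant to reproduce the "2 unmatched message(s)" figure from the real run.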
Verified on remote (CUDA + MPICH):
test_lammps_spin_dpa3_pt2.py ... [3 passed]
test_lammps_dpa3_pt2.py ............... [15 passed]
Restores the multi-rank LAMMPS spin GNN with empty-subdomain
support (PR #5430 CI's last failing case).
1 parent e19108d
1 file changed
Lines changed: 23 additions & 0 deletions
Diff (line content not preserved):
* Hunk 1: 16 lines added as new lines 168-183, after original line 167.
* Hunk 2: 7 lines added as new lines 358-364, after original line 341.