Commit f1e144e

Author: Han Wang
fix(test): explicit del lammps before MPI.Finalize() in mpirun runners
The mpirun-driven LAMMPS test runners called ``MPI.Finalize()`` at the end of the script with the ``lammps`` Python object still alive. When the interpreter then shut down, the LAMMPS C++ destructor ran after MPI was already finalized. LAMMPS' ``Finish::end``, fix/compute teardown, and the deep[m|spin] pair-style destructor chain all issue MPI collectives (``MPI_Gather`` / ``MPI_Reduce``) during cleanup.

On the empty-subdomain rank (no local atoms but live ghost atoms), the asymmetric MPI traffic during destruction occasionally hit an MPI-after-Finalize error path and crashed the rank with SIGFPE. In CUDA CI this showed up as ``exit status 136`` of the subprocess for ``test_pair_deepmd_mpi_dpa3_spin_empty_subdomain``.

The crash was intermittent (1 failure in ~5 runs) on the GitHub Actions CUDA runner and not reproducible on a V100 dev box. PR #5446 (unrelated to MPI / spin / CUDA code) hit the same flake, confirming that it is a pre-existing teardown race in the test runners, not a regression in either PR.

The fix is mechanical and identical in all four runners: ``del lammps`` before ``MPI.Finalize()``, so the LAMMPS instance is torn down while the communicator is still valid.
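As a quick check on the ``exit status 136`` mentioned above: assuming the status follows the common POSIX-shell convention of 128 plus the signal number (an assumption; the commit message does not say how the status is produced), 136 corresponds to signal 8, which is SIGFPE. The helper name ``describe_exit_status`` is hypothetical and used only for illustration.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Decode an exit status, assuming the 128 + signal-number convention."""
    if status > 128:
        signum = status - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return f"exit code {status}"


print(describe_exit_status(136))
```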
1 parent 4604131 commit f1e144e

4 files changed

Lines changed: 19 additions & 0 deletions

source/lmp/tests/run_mpi_pair_deepmd.py

Lines changed: 3 additions & 0 deletions
@@ -62,4 +62,7 @@
 pe = lammps.eval("pe")
 arr = [pe]
 np.savetxt(output, np.array(arr))
+# Tear down LAMMPS before MPI.Finalize() to avoid MPI-after-Finalize
+# in the LAMMPS destructor. See run_mpi_pair_deepmd_spin_dpa3_pt2.py.
+del lammps
 MPI.Finalize()
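The ordering issue this hunk addresses can be sketched without MPI or LAMMPS installed. ``FakeComm`` and ``FakeLammps`` below are hypothetical stand-ins (they are not mpi4py or LAMMPS classes): the fake communicator rejects collectives after ``finalize()``, and the fake LAMMPS object issues one collective in its destructor. This is a minimal illustration of the teardown order under CPython reference counting, not a reproduction of the actual crash.

```python
class FakeComm:
    """Hypothetical stand-in for an MPI communicator (not mpi4py)."""

    def __init__(self):
        self.finalized = False

    def reduce(self, value):
        # A real MPI library has undefined or erroring behavior here;
        # the stand-in simply raises.
        if self.finalized:
            raise RuntimeError("collective called after finalize")
        return value

    def finalize(self):
        self.finalized = True


class FakeLammps:
    """Hypothetical stand-in for an object whose teardown needs the communicator."""

    def __init__(self, comm, log):
        self.comm = comm
        self.log = log

    def __del__(self):
        # Exceptions escaping __del__ are ignored by Python, so record the outcome.
        try:
            self.comm.reduce(0)
            self.log.append("clean teardown")
        except RuntimeError as exc:
            self.log.append(f"teardown failed: {exc}")


def run(fixed: bool) -> list:
    log = []
    comm = FakeComm()
    lmp = FakeLammps(comm, log)
    if fixed:
        del lmp            # destructor runs while the communicator is valid
        comm.finalize()
    else:
        comm.finalize()
        del lmp            # stands in for release at interpreter shutdown
    return log


print(run(fixed=True))
print(run(fixed=False))
```

With ``fixed=True`` the destructor's collective succeeds; with ``fixed=False`` it runs against an already finalized communicator, which is the order the runners had before this commit.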

source/lmp/tests/run_mpi_pair_deepmd_dpa3_pt2.py

Lines changed: 4 additions & 0 deletions
@@ -225,4 +225,8 @@
 row = np.concatenate([fi, vi])
 f.write(" ".join(f"{v:.16e}" for v in row) + "\n")

+# Tear down LAMMPS before MPI.Finalize() — see the matching comment in
+# ``run_mpi_pair_deepmd_spin_dpa3_pt2.py``. Same teardown-order race
+# class; spin happens to hit it more often on CUDA CI.
+del lammps
 MPI.Finalize()

source/lmp/tests/run_mpi_pair_deepmd_spin.py

Lines changed: 3 additions & 0 deletions
@@ -62,4 +62,7 @@
 pe = lammps.eval("pe")
 arr = [pe]
 np.savetxt(output, np.array(arr))
+# Tear down LAMMPS before MPI.Finalize() to avoid MPI-after-Finalize
+# in the LAMMPS destructor. See run_mpi_pair_deepmd_spin_dpa3_pt2.py.
+del lammps
 MPI.Finalize()

source/lmp/tests/run_mpi_pair_deepmd_spin_dpa3_pt2.py

Lines changed: 9 additions & 0 deletions
@@ -144,4 +144,13 @@
 row = np.concatenate([fi, fmi, vi])
 f.write(" ".join(f"{v:.16e}" for v in row) + "\n")

+# Tear down the LAMMPS instance *before* ``MPI.Finalize()`` so its
+# destructor's MPI calls (fix/compute cleanup, timing reductions inside
+# ``Finish::end``, the deep-spin pair-style destructor chain, etc.) run
+# while the communicator is still valid. Without this, Python keeps
+# ``lammps`` alive past ``MPI.Finalize()`` and only releases it during
+# interpreter shutdown — and the empty-subdomain rank then hits an
+# MPI-after-Finalize call which crashes with SIGFPE on some CUDA CI
+# runners (intermittent; not reproducible on V100).
+del lammps
 MPI.Finalize()
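One caveat on relying on ``del``: in CPython, ``del name`` only removes that name, and the destructor runs when the last reference goes away. The sketch below uses a hypothetical ``Resource`` class (not LAMMPS) to show that a second reference postpones the destructor. Whether the runners hold any other reference to ``lammps`` is not stated in the diff, so this is background on the mechanism rather than a claim about the tests.

```python
class Resource:
    """Hypothetical object that records when its destructor runs."""

    def __init__(self, events):
        self.events = events

    def __del__(self):
        self.events.append("destroyed")


def demo():
    events = []
    res = Resource(events)
    alias = res                      # second reference to the same object

    del res                          # removes one name; object is still alive
    after_first_del = list(events)

    del alias                        # last reference dropped; destructor runs
    after_second_del = list(events)
    return after_first_del, after_second_del


first, second = demo()
print(first, second)
```

In the runners the same reasoning implies that ``del lammps`` tears the instance down before ``MPI.Finalize()`` only if ``lammps`` is the sole remaining reference at that point.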
