
Commit dbf3379

ci: Make OpenMPI yield when idle for the MPI example notebooks
The data-initialization notebook's point-to-point routing (NBX exchange) hung the 4-engine ipyparallel cluster under OpenMPI, while IntelMPI ran fine. OpenMPI busy-waits aggressively by default: with the engines contending for cores, the probe/barrier loop spins at 100% CPU and starves the very peers it is waiting on, so MPI never makes progress. IntelMPI yields by default.

Set OMPI_MCA_mpi_yield_when_idle=1 in the examples-mpi job so OpenMPI yields the CPU while waiting.

Also revert the earlier Python-level sleep(0) in nbx_exchange: it yielded at the wrong layer (the spin is in OpenMPI's C progress engine) and did not help.
1 parent 6d1a540 commit dbf3379

2 files changed

Lines changed: 7 additions & 11 deletions

.github/workflows/examples-mpi.yaml

Lines changed: 6 additions & 1 deletion
@@ -44,6 +44,11 @@ jobs:
       DEVITO_ARCH: "gcc"
       CC: "gcc"
       CXX: "g++"
+      # Make OpenMPI yield the CPU while waiting instead of busy-spinning. With
+      # 4 ipyparallel engines contending for cores, the point-to-point routing
+      # in the data notebooks otherwise livelocks under OpenMPI (IntelMPI yields
+      # by default, so it is unaffected and ignores this OMPI_* setting).
+      OMPI_MCA_mpi_yield_when_idle: "1"

     steps:
     - name: Checkout devito
@@ -71,7 +76,7 @@ jobs:
         ipcluster start --profile=mpi --engines=mpi -n 4 --daemonize
         # A few seconds to ensure workers are ready
         sleep 10
-        pytest -v --nbval examples/mpi
+        pytest -vvv --nbval examples/mpi
         ipcluster stop --profile=mpi

     - name: Test seismic examples
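
The workflow change applies the setting to the whole CI job. Open MPI reads MCA parameters from environment variables named `OMPI_MCA_<param>` at startup, so the same setting could be applied when launching MPI processes from Python. The following is a minimal sketch under that assumption; the helper name `openmpi_yield_env` and the `script.py` in the usage comment are hypothetical and not part of this commit.

```python
import os


def openmpi_yield_env(base=None):
    """Return a copy of an environment with Open MPI's yield-when-idle enabled.

    Hypothetical helper, not part of Devito. The variable has to be in the
    environment before MPI is initialized; other MPI implementations ignore it.
    """
    env = dict(os.environ if base is None else base)
    # setdefault keeps an explicit user choice (e.g. "0") instead of overriding it
    env.setdefault("OMPI_MCA_mpi_yield_when_idle", "1")
    return env


# Possible usage (script.py is a placeholder):
#   import subprocess
#   subprocess.run(["mpirun", "-n", "4", "python", "script.py"],
#                  env=openmpi_yield_env())
```

Using `setdefault` rather than plain assignment is a design choice of this sketch: it lets a caller who deliberately wants busy-waiting keep it.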

devito/data/distributed/transport.py

Lines changed: 1 addition & 10 deletions
@@ -8,8 +8,6 @@
 graph communicator without affecting the layers above.
 """

-import time
-
 import numpy as np

 from devito.mpi import MPI
@@ -85,19 +83,12 @@ def nbx_exchange(comm, sendbufs, dtype, tag=0):
             buf = np.empty(count, dtype=dtype)
             comm.Recv([buf, mpitype], source=src, tag=tag)
             recvd[src] = buf
-            # Drain any further ready messages before yielding
-            continue
-        if barrier is None:
+        elif barrier is None:
             if MPI.Request.Testall(sends):
                 # All my sends were matched -> announce I am done sending
                 barrier = comm.Ibarrier()
         elif barrier.Test():
             # Everyone is done sending and nothing is in flight
             break
-        # Nothing was ready this pass: yield the CPU so co-scheduled ranks can
-        # make progress. Without this the probe loop busy-waits at 100%, which
-        # deadlocks the exchange when ranks are oversubscribed (more ranks than
-        # cores, e.g. the 4-engine ipyparallel cluster on a 2-core CI runner).
-        time.sleep(0)

     return recvd
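
The revert replaces the `continue` with an `elif` and drops the trailing `time.sleep(0)`. The sketch below models one pass of the loop body in plain Python, with no MPI involved, to check that the two forms differ only in the removed sleep. It assumes the sleep sat at loop-body level, as the `continue` suggests; the function names and the boolean flags standing in for the MPI calls (probe result, `barrier is None`, `Testall`, `barrier.Test()`) are illustrative, not Devito code.

```python
from itertools import product


def step_before(msg_ready, barrier_started, sends_done, barrier_done):
    """Actions taken in one pass of the loop body before this commit."""
    if msg_ready:
        return ["recv"]                # `continue`: skip the rest of the pass
    actions = []
    if not barrier_started:
        if sends_done:
            actions.append("ibarrier")
    elif barrier_done:
        return ["break"]               # leaves the loop before the sleep
    actions.append("sleep0")           # the removed time.sleep(0)
    return actions


def step_after(msg_ready, barrier_started, sends_done, barrier_done):
    """Actions taken in one pass of the loop body after this commit."""
    actions = []
    if msg_ready:
        actions.append("recv")
    elif not barrier_started:
        if sends_done:
            actions.append("ibarrier")
    elif barrier_done:
        actions.append("break")
    return actions


# Compare both versions over every combination of the four flags.
differences = [
    flags
    for flags in product([False, True], repeat=4)
    if [a for a in step_before(*flags) if a != "sleep0"] != step_after(*flags)
]
print("passes that differ beyond the sleep:", differences)
```

In this model the receive, barrier and exit decisions are identical in both versions, which matches the commit message: the revert changes only where the loop yields, and that job is now left to OpenMPI itself.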
