Commit 6d1a540
data: Yield the CPU in the NBX exchange loop under oversubscription
nbx_exchange polled with a tight 'while True: Iprobe' busy-wait, spinning
at 100% CPU while waiting for peer messages and the termination barrier.
When ranks are oversubscribed (more ranks than cores), a spinning rank
starves the very peer it is waiting on, so MPI makes no progress and the
exchange deadlocks. This is exactly the MPI-notebook CI configuration: a
4-engine ipyparallel cluster on a 2-core runner (OpenMPI --oversubscribe),
where it manifested as a multi-minute cell timeout.
Yield (time.sleep(0)) whenever a poll pass finds nothing ready, so
co-scheduled ranks can run; drain ready messages first via .
Verified: correctness unchanged (44 routed/gather mode-4 tests pass) and
16 ranks on 8 cores complete in 0.35s instead of hanging.
1 parent 82081ba commit 6d1a540
1 file changed
Lines changed: 10 additions & 1 deletion
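
The polling pattern described in the commit message can be sketched roughly as follows. This is an illustration only, not the committed code: `poll_with_yield`, `try_receive`, `is_done` and `demo` are hypothetical names, the two callables stand in for the Iprobe-and-receive step and the termination-barrier test, and a producer thread stands in for a peer rank so the sketch runs without MPI.

```python
import queue
import threading
import time


def poll_with_yield(try_receive, is_done):
    """Poll until is_done() is true, yielding the CPU on idle passes.

    try_receive() returns the next ready message, or None if nothing is ready.
    is_done() stands in for testing the termination barrier.
    """
    received = []
    while True:
        # Sample completion before draining, so anything that was sent
        # before completion was signalled is still picked up in this pass.
        done = is_done()
        progressed = False
        # Drain every ready message first.
        while (msg := try_receive()) is not None:
            received.append(msg)
            progressed = True
        if done:
            return received
        if not progressed:
            # Nothing was ready: give up the time slice instead of spinning,
            # so a co-scheduled peer gets a chance to run.
            time.sleep(0)


def demo(n_messages=5):
    """Simulate one peer with a thread (an assumption, not real MPI)."""
    inbox = queue.Queue()
    finished = threading.Event()

    def producer():
        for i in range(n_messages):
            inbox.put(i)
            time.sleep(0.001)
        finished.set()  # set only after the last message is queued

    def try_receive():
        try:
            return inbox.get_nowait()
        except queue.Empty:
            return None

    peer = threading.Thread(target=producer)
    peer.start()
    result = poll_with_yield(try_receive, finished.is_set)
    peer.join()
    return result
```

The yield happens only on a pass that found nothing ready, so a rank with pending messages keeps draining at full speed and the sleep adds no latency to the busy case.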