Commit 8e00b8b

cailmdaley and claude committed
felt: record Layer 4 hardening (singleton guard + exit-code fix)
The 'warning sign' pass: added check_mpi_world preflight (OMPI_COMM_WORLD_SIZE vs COMM_WORLD size; SLURM_NTASKS proven unreliable) and, found while testing it e2e, fixed main() swallowing run()'s exit code (every caught error had exited 0). Both tested on a real allocation. Distinct remaining gap: mid-setup rank-0 failure still deadlocks the other ranks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 2289e6a commit 8e00b8b

1 file changed

Lines changed: 39 additions & 9 deletions

.felt/shapepipe/mpi-hybrid/mpi-hybrid.md

@@ -20,9 +20,12 @@ outcome: |-
 names without the _runner suffix → "No module named python_example" + a 5-min deadlock.
 Fixed in 7e7b7448. All three drifted undetected because the repo's exercised path is SMP,
 not MPI ([[shapepipe/exec-modes-schedulers]]); actual MPI run history (esp. canfar) is
-unknown from here. Noted: MPI deadlocks on rank-0 failure instead of
-failing fast (follow-up). REMAINING: Martin review + merge of #737; open question whether
-MPI should be retired rather than maintained.
+unknown from here. HARDENING PASS added a preflight guard (check_mpi_world, 2289e6a7: aborts
+when OMPI_COMM_WORLD_SIZE != COMM_WORLD size — the singleton signature; SLURM_NTASKS is NOT
+reliable for this) and, found while testing it, fixed a swallowed exit code (33494d74: main()
+now returns run()'s value — every caught error had been exiting 0). Both tested + verified on
+a real allocation. STILL OPEN: deadlock when rank 0 fails mid-setup for non-singleton reasons.
+REMAINING: Martin review + merge of #737; open question whether MPI should be retired.
 ---
 
 ## The problem
@@ -137,6 +140,10 @@ scripts work #737 started.
 launcher worked. Shipped in the published image (CI rebuild).
 6. **Stale example config fix** (`7e7b7448`) — `config_mpi.ini` module names
 `*_runner`-suffixed to match the loader; surfaced running the real script.
+7. **MPI singleton preflight guard** (`2289e6a7`) — `check_mpi_world()` aborts on
+`OMPI_COMM_WORLD_SIZE` != `COMM_WORLD` size; unit + real-allocation tested.
+8. **Exit-code propagation fix** (`33494d74`) — `main()` returns `run()`'s value;
+every caught error had been exiting 0. + regression test.
 
 ## Empirical close (2026-05-31) — two layers
 
@@ -191,12 +198,35 @@ the loader needs the full runner names (`python_example_runner`,
 example config drifted out of sync with the rest of the repo, undetected,
 because the repo's exercised path is SMP, not MPI.
 
-**Note — MPI deadlocks on rank-0 setup failure** instead of failing fast: when
-rank 0 errored on the bad module name, the other ranks blocked in a collective
-until SLURM killed the job at the wall clock. This is exactly the failure mode
-the "preflight self-check / fail loudly" item (option 5 in the spectrum above)
-guards against — worth a follow-up so a stale config or desync surfaces as an
-immediate error, not a silent 5-minute hang. Out of scope for #737.
+## Layer 4 — silent-failure hardening (the "warning sign")
+
+A deeper pass on the singleton failure (option 5 in the spectrum above) turned
+up two more silent-failure paths and fixed both:
+
+**(a) No preflight guard against the singleton signature.** In the singleton
+case every process is master, `split_mpi_jobs(list, 1)` hands each the *full*
+job list, and they all run the whole pipeline into the same output dir — N
+uncoordinated copies, exit 0, plausible-but-wrong. Added `check_mpi_world()`
+(`mpi_run.py`, called at the top of `run_mpi`): compares the size that wired up
+(`COMM_WORLD`) against the size the launcher intended (`OMPI_COMM_WORLD_SIZE`)
+and aborts on a mismatch. Empirically: **`SLURM_NTASKS` is NOT usable** for this
+— it reads `1` on remote-node ranks even in a healthy run — `OMPI_COMM_WORLD_SIZE`
+is the reliable signal (it's `4` in both healthy and singleton; only `COMM_WORLD`
+differs). Commit `2289e6a7`, unit-tested + verified on a real allocation (healthy
+passes; OMPI-4-image-under-OMPI-5-host fires the abort).
+
+**(b) Swallowed exit code (the bigger one).** Testing (a) end-to-end exposed that
+the guard fired and logged loudly but the job *still exited 0*: `main()` in
+`shapepipe_run.py` called `run(args)` without returning it, so `exit(main())` was
+always `exit(None)` → 0. **Every caught error in ShapePipe — not just MPI — has
+been exiting 0**, invisible to `exit $?` and CI. Fixed to `return run(args)`
+(`33494d74`) + regression test. With both, the broken case now exits 1.
+
+**Still open (distinct gap):** when rank 0 fails *mid-setup* for a non-singleton
+reason (e.g. the stale-config module error in Layer 3), ranks 1..N block in the
+following `bcast`/`scatter` until the wall clock — the guard runs *before* module
+loading, so it doesn't cover this. Fixing it needs collective error propagation
+(rank 0 signalling failure before the barrier). Left as a follow-up.
 
 **Genuinely verified end to end** (job 780660): the unmodified `candide_mpi.sh`
 against the freshly-published `:cleanup-candide-scripts-container-runtime` image
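The preflight comparison described in (a) might look roughly like the sketch below. It is not the code from `2289e6a7`. Only the name `check_mpi_world`, the `OMPI_COMM_WORLD_SIZE` variable, and the comparison against the `COMM_WORLD` size come from the notes. The signature, the `MPIWorldMismatch` exception, the injectable `environ` argument, and the choice to skip the check when the variable is absent are assumptions made so the example runs without an MPI installation.

```python
import os
from typing import Mapping, Optional


class MPIWorldMismatch(RuntimeError):
    """Hypothetical error: the launcher and the communicator disagree on size."""


def check_mpi_world(
    comm_size: int, environ: Optional[Mapping[str, str]] = None
) -> None:
    """Raise if the wired-up world size differs from what the launcher intended.

    comm_size: the size COMM_WORLD actually reports.
    environ:   environment to read (defaults to os.environ); injectable for tests.
    """
    env = os.environ if environ is None else environ
    intended = env.get("OMPI_COMM_WORLD_SIZE")
    if intended is None:
        # Assumption: without the Open MPI variable there is nothing to compare,
        # so the check is skipped rather than failing.
        return
    if int(intended) != comm_size:
        raise MPIWorldMismatch(
            f"launcher intended {intended} ranks but COMM_WORLD has {comm_size}; "
            "each process may be running as an independent singleton"
        )
```

With mpi4py, a call site would pass `MPI.COMM_WORLD.Get_size()` as `comm_size`. The notes say the real guard aborts the job; raising an exception here only approximates that.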

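The swallowed exit code in (b) follows from how Python treats `sys.exit(None)`: a `None` argument means exit status 0. The sketch below reproduces the pattern in isolation. `run` here is a stand-in that always reports failure, and `main_buggy`, `main_fixed`, and `exit_code_of` are hypothetical names for illustration, not ShapePipe functions.

```python
import sys


def run(args):
    """Stand-in for run(): returns 1 to signal a caught error."""
    return 1


def main_buggy(args=None):
    # The return value of run() is dropped, so this function returns None.
    run(args)


def main_fixed(args=None):
    # Propagate run()'s status to the caller.
    return run(args)


def exit_code_of(entry_point):
    """Return the exit status that sys.exit(entry_point()) would produce."""
    try:
        sys.exit(entry_point())
    except SystemExit as exc:
        # A SystemExit code of None corresponds to exit status 0.
        return 0 if exc.code is None else exc.code
```

Here `exit_code_of(main_buggy)` gives 0 even though `run` failed, while `exit_code_of(main_fixed)` gives 1. A regression test along these lines is one way to keep the status from being dropped again.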
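For the gap left open above, one possible shape for collective error propagation is to have rank 0 broadcast a status before any data, so every rank learns of a setup failure in the same collective. This is a sketch of that idea only, not a fix that exists in the repository. `setup_and_broadcast` and its parameters are hypothetical, and `comm` is assumed to behave like an mpi4py communicator (`Get_rank()` and `bcast(obj, root=...)`).

```python
from typing import Any, Callable


def setup_and_broadcast(comm: Any, setup: Callable[[], Any], root: int = 0) -> Any:
    """Run `setup` on the root rank and share its result, or fail on every rank."""
    status = ("ok", None)
    if comm.Get_rank() == root:
        try:
            status = ("ok", setup())
        except Exception as exc:  # broad on purpose: any setup failure is reported
            status = ("error", f"{type(exc).__name__}: {exc}")
    # Every rank joins this single collective whether or not setup succeeded,
    # so no rank is left waiting in a later bcast/scatter.
    kind, payload = comm.bcast(status, root=root)
    if kind == "error":
        raise RuntimeError(f"setup failed on rank {root}: {payload}")
    return payload
```

This only helps when rank 0 can catch the failure as a Python exception. If rank 0 crashes hard, the other ranks would still block.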