@@ -20,9 +20,12 @@ outcome: |-
2020 names without the _runner suffix → "No module named python_example" + a 5-min deadlock.
2121 Fixed in 7e7b7448. All three drifted undetected because the repo's exercised path is SMP,
2222 not MPI ([[shapepipe/exec-modes-schedulers]]); actual MPI run history (esp. canfar) is
23- unknown from here. Noted: MPI deadlocks on rank-0 failure instead of
24- failing fast (follow-up). REMAINING: Martin review + merge of #737; open question whether
25- MPI should be retired rather than maintained.
23+ unknown from here. HARDENING PASS added a preflight guard (check_mpi_world, 2289e6a7: aborts
24+ when OMPI_COMM_WORLD_SIZE != COMM_WORLD size — the singleton signature; SLURM_NTASKS is NOT
25+ reliable for this) and, found while testing it, fixed a swallowed exit code (33494d74: main()
26+ now returns run()'s value — every caught error had been exiting 0). Both tested + verified on
27+ a real allocation. STILL OPEN: deadlock when rank 0 fails mid-setup for non-singleton reasons.
28+ REMAINING: Martin review + merge of #737; open question whether MPI should be retired.
2629---
2730
2831## The problem
@@ -137,6 +140,10 @@ scripts work #737 started.
137140 launcher worked. Shipped in the published image (CI rebuild).
138141 6. **Stale example config fix** (`7e7b7448`): `config_mpi.ini` module names
139142 `*_runner`-suffixed to match the loader; surfaced running the real script.
143+ 7. **MPI singleton preflight guard** (`2289e6a7`): `check_mpi_world()` aborts on
144+ `OMPI_COMM_WORLD_SIZE` ≠ `COMM_WORLD` size; unit + real-allocation tested.
145+ 8. **Exit-code propagation fix** (`33494d74`): `main()` returns `run()`'s value;
146+ every caught error had been exiting 0. + regression test.
140147
141148## Empirical close (2026-05-31) — two layers
142149
@@ -191,12 +198,35 @@ the loader needs the full runner names (`python_example_runner`,
191198example config drifted out of sync with the rest of the repo, undetected,
192199because the repo's exercised path is SMP, not MPI.
193200
194- **Note: MPI deadlocks on rank-0 setup failure** instead of failing fast: when
195- rank 0 errored on the bad module name, the other ranks blocked in a collective
196- until SLURM killed the job at the wall clock. This is exactly the failure mode
197- the "preflight self-check / fail loudly" item (option 5 in the spectrum above)
198- guards against; worth a follow-up so a stale config or desync surfaces as an
199- immediate error, not a silent 5-minute hang. Out of scope for #737.
201+ ## Layer 4 — silent-failure hardening (the "warning sign")
202+
203+ A deeper pass on the singleton failure (option 5 in the spectrum above) turned
204+ up two more silent-failure paths and fixed both:
205+
206+ **(a) No preflight guard against the singleton signature.** In the singleton
207+ case every process is master, `split_mpi_jobs(list, 1)` hands each the *full*
208+ job list, and they all run the whole pipeline into the same output dir: N
209+ uncoordinated copies, exit 0, plausible-but-wrong. Added `check_mpi_world()`
210+ (`mpi_run.py`, called at the top of `run_mpi`): compares the size that wired up
211+ (`COMM_WORLD`) against the size the launcher intended (`OMPI_COMM_WORLD_SIZE`)
212+ and aborts on a mismatch. Empirically: **`SLURM_NTASKS` is NOT usable** for this
213+ (it reads `1` on remote-node ranks even in a healthy run); `OMPI_COMM_WORLD_SIZE`
214+ is the reliable signal (it's `4` in both healthy and singleton; only `COMM_WORLD`
215+ differs). Commit `2289e6a7`, unit-tested + verified on a real allocation (healthy
216+ passes; OMPI-4-image-under-OMPI-5-host fires the abort).
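
The comparison described in (a) can be sketched as follows. This is a minimal illustration, not the code in `mpi_run.py`: `check_world`, `intended_world_size` and `MPIWorldMismatch` are hypothetical names, the world size is passed in as a plain integer so the sketch runs without MPI, and skipping the check when `OMPI_COMM_WORLD_SIZE` is absent is an assumption rather than documented behaviour of `check_mpi_world()`.

```python
import os
from typing import Mapping, Optional


class MPIWorldMismatch(RuntimeError):
    """The launcher's intended world size differs from the one that wired up."""


def intended_world_size(env: Mapping[str, str]) -> Optional[int]:
    """Return the size exported by the launcher, or None if absent or unparseable."""
    raw = env.get("OMPI_COMM_WORLD_SIZE")
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        return None


def check_world(actual_size: int, env: Mapping[str, str] = os.environ) -> None:
    """Raise if the wired-up size disagrees with the launcher's intent.

    Hypothetical stand-in for check_mpi_world(); `actual_size` would be the
    COMM_WORLD size reported by the MPI library.
    """
    intended = intended_world_size(env)
    if intended is None:
        # Assumption: no Open MPI launcher variable means nothing to compare
        # (for example a plain SMP run), so the check is skipped.
        return
    if intended != actual_size:
        raise MPIWorldMismatch(
            f"launcher intended {intended} ranks but COMM_WORLD has {actual_size}"
        )
```

With mpi4py, a caller would presumably pass `MPI.COMM_WORLD.Get_size()` as `actual_size` and turn the exception into an abort; the singleton signature from the text corresponds to `check_world(1, {"OMPI_COMM_WORLD_SIZE": "4"})`, which raises.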
217+
218+ **(b) Swallowed exit code (the bigger one).** Testing (a) end-to-end exposed that
219+ the guard fired and logged loudly but the job *still exited 0*: `main()` in
220+ `shapepipe_run.py` called `run(args)` without returning it, so `exit(main())` was
221+ always `exit(None)` → 0. **Every caught error in ShapePipe, not just MPI, has
222+ been exiting 0**, invisible to `exit $?` and CI. Fixed to `return run(args)`
223+ (`33494d74`) + regression test. With both fixes, the broken case now exits 1.
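
The mechanism in (b) is easy to reproduce in isolation. The sketch below is illustrative only: `run`, `main_buggy`, `main_fixed` and `exit_status` are stand-ins written for this example, not the functions in `shapepipe_run.py`, and `sys.exit` is used in place of the `exit` builtin (both raise `SystemExit` with the given value).

```python
import sys


def run(args=None):
    """Stand-in for the pipeline: catches its own error and reports status 1."""
    try:
        raise RuntimeError("simulated pipeline failure")
    except RuntimeError:
        return 1


def main_buggy(args=None):
    """Calls run() but drops its result, so the function returns None."""
    run(args)


def main_fixed(args=None):
    """Propagates run()'s status to the caller."""
    return run(args)


def exit_status(value):
    """Return the process exit status that sys.exit(value) would produce."""
    try:
        sys.exit(value)
    except SystemExit as exc:
        # A SystemExit code of None is reported to the shell as status 0.
        return 0 if exc.code is None else exc.code


print(exit_status(main_buggy()))  # 0: the failure is invisible to the shell
print(exit_status(main_fixed()))  # 1: the failure reaches the shell
```

A regression test in this spirit only needs to assert that the entry point's return value is the status `run()` produced.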
224+
225+ **Still open (distinct gap):** when rank 0 fails *mid-setup* for a non-singleton
226+ reason (e.g. the stale-config module error in Layer 3), ranks 1..N block in the
227+ following `bcast`/`scatter` until the wall clock. The guard runs *before* module
228+ loading, so it doesn't cover this. Fixing it needs collective error propagation
229+ (rank 0 signalling failure before the barrier). Left as a follow-up.
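
One possible shape for that follow-up, offered as a sketch rather than a design decision: rank 0 wraps its setup in a `try`, and every rank learns the outcome through a single status broadcast before any data collective. `guarded_setup` and `SingleRankComm` are hypothetical names; the communicator only needs the `Get_rank()` and `bcast(obj, root=0)` methods that mpi4py communicators provide, and the one-rank stand-in exists only so the example runs without MPI.

```python
class SingleRankComm:
    """Stand-in communicator so the sketch runs without MPI (one rank only)."""

    def Get_rank(self):
        return 0

    def bcast(self, obj, root=0):
        # With a single rank, the broadcast value is just the root's object.
        return obj


def guarded_setup(comm, setup):
    """Run setup() on rank 0 and share success or failure with every rank.

    All ranks reach the same bcast whether or not setup() raised, so a
    rank-0 failure becomes an error everywhere instead of a hang.
    """
    status = None
    if comm.Get_rank() == 0:
        try:
            status = ("ok", setup())
        except Exception as exc:
            # Send a description rather than the exception object itself.
            status = ("error", repr(exc))
    kind, value = comm.bcast(status, root=0)
    if kind == "error":
        raise RuntimeError(f"rank 0 setup failed: {value}")
    return value
```

Under this pattern the stale-config case would surface as a `RuntimeError` on all ranks at the status broadcast; whether ShapePipe's setup can be restructured around a single such point is not established here.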
200230
201231**Genuinely verified end to end** (job 780660): the unmodified `candide_mpi.sh`
202232against the freshly-published `:cleanup-candide-scripts-container-runtime` image