Commit 6cfdc9c

Merge develop into docs/rework (resolve felt-store generations, keep newer)
# Conflicts:
#   .felt/docker-uv-revert/docker-uv-revert.md
#   .felt/fabian-coord-bug/fabian-coord-bug.md
#   .felt/ngmix-update/ngmix-update.md
#   .felt/prs-in-flight/prs-in-flight.md
#   .felt/shapepipe.md
#   .felt/shapepipe/cleanup-rhostats-jobscripts/cleanup-rhostats-jobscripts.md
2 parents ebaeae7 + a4825a7 commit 6cfdc9c

20 files changed

Lines changed: 620 additions & 94 deletions

.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ just CI. Deserves its own issue; #732 doesn't touch it.
 
 ## Knock-on
 
-[[shapepipe/prs-in-flight]]: **#729** (actions group, bumps `setup-miniconda`
+**#729** (actions group, bumps `setup-miniconda`
 v3→v4) hit the layer-1 failure too, confirming the action bump alone
 doesn't fix the path. #729 must rebase on top of #732 once it merges before
 it can go green. The smoke-test work in [[shapepipe/smoke-test-read-only]]
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
---
name: 'ShapePipe execution modes (smp/mpi) and schedulers (PBS/SLURM): what the repo''s tooling shows'
tags:
- shapepipe
- mpi
- reference
created-at: 2026-05-31T16:51:46.221097637+02:00
outcome: 'By the repo''s lights SMP is the exercised path (55/56 example configs; every canfar/candide job script is SMP-only via N_SMP, SLURM+conda); MPI is the 2019 mode, set in 1 config, and its code/config drifted out of sync (module_config_sec bug dates to #415 by git history). PBS is dead (2019 example scripts only); SLURM is current everywhere. CAVEAT: this is what the repo shows, not how ShapePipe was actually run; canfar carried most processing and is invisible from here, so MPI usage history is unknown.'
---

Two orthogonal axes are easy to conflate when reasoning about how ShapePipe
runs on a cluster. This fiber pins down what each is, when it entered, and what's
actually used today vs. legacy, as context for [[shapepipe/mpi-hybrid]].

## Axis 1: execution mode (`[EXECUTION] MODE`, inside ShapePipe)

Dispatched in `src/shapepipe/run.py`: `mode = config["EXECUTION"]["MODE"].lower()`,
then `run_mpi(pipe, comm)` if `mode == "mpi"` else `run_smp(pipe)`. If mpi4py isn't
importable, mode is forced to `smp`.
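
A minimal sketch of the dispatch described above, not the actual contents of `run.py`. `run_smp` and `run_mpi` are stand-in stubs, `select_mode` and `mpi_available` are hypothetical helpers, and probing for `mpi4py` with `importlib` is an assumption about how the fallback could be detected.

```python
import importlib.util


def mpi_available():
    """Return True if mpi4py can be imported in this environment."""
    return importlib.util.find_spec("mpi4py") is not None


def select_mode(config, have_mpi):
    """Lower-case the configured mode; force smp when mpi4py is missing."""
    mode = config["EXECUTION"]["MODE"].lower()
    if mode == "mpi" and not have_mpi:
        return "smp"
    return mode


def run_smp(pipe):
    """Stub standing in for the single-node path."""
    return f"smp:{pipe}"


def run_mpi(pipe, comm):
    """Stub standing in for the multi-node path."""
    return f"mpi:{pipe}"


def run(pipe, config, comm=None, have_mpi=None):
    """Dispatch to one of the two runners based on [EXECUTION] MODE."""
    if have_mpi is None:
        have_mpi = mpi_available()
    mode = select_mode(config, have_mpi)
    return run_mpi(pipe, comm) if mode == "mpi" else run_smp(pipe)
```

Passing `have_mpi` explicitly keeps the fallback testable on a machine without an MPI stack.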

- **`smp`**: joblib `Parallel(n_jobs=batch_size)` across cores on **one node**
  (`job_handler._distribute_smp_jobs`). **The living path.** 55 of 56 example
  configs set `MODE = SMP`; every canfar/candide production script drives it by
  injecting `N_SMP` into the config (`SMP_BATCH_SIZE`).
- **`mpi`**: mpi4py scatter/gather across **multiple nodes** (`pipeline/mpi_run.py`,
  `submit_mpi_jobs`). 2019-era (`c6554983` "initial mpi framework"). Exactly **1**
  example config uses it. The `worker()` call in `mpi_run.py` has been out of sync
  since PR #415 (Jan 2025): `worker()` gained a `module_config_sec` param and
  `mpi_run.py` wasn't updated, so it passes 7 args where 8 are required. On candide
  it couldn't even wire up (PMIx mismatch, see [[shapepipe/mpi-hybrid]]), so the
  code bug couldn't surface here. Whether MPI was run elsewhere (canfar especially,
  which we can't see) is unknown; what's clear is that the repo's tooling is all SMP.
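
The arity drift can be illustrated with a toy function. This is a sketch only: apart from `module_config_sec`, the parameter names are placeholders, not the real `worker()` signature. It also shows why the bug stays latent: Python raises the `TypeError` only when the call actually executes, which never happened if the MPI launch failed earlier.

```python
def worker(a1, a2, a3, a4, a5, a6, a7, module_config_sec):
    """Toy stand-in with 8 required positional parameters."""
    return module_config_sec


def call_with_stale_args():
    """Call worker() as an un-updated call site would: 7 args instead of 8."""
    stale_args = (1, 2, 3, 4, 5, 6, 7)
    try:
        worker(*stale_args)
    except TypeError as err:
        # The mismatch is only detected at call time, not at import time.
        return str(err)
    return None


message = call_with_stale_args()
```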

**SMP and MPI are the same computation behind two dispatchers.** Both call the
identical `WorkerHandler.worker()` with the identical 8 args (`job_handler._distribute_smp_jobs`
vs `mpi_run.submit_mpi_jobs`). The MPI path's only inter-rank traffic is `bcast`
of setup objects, one `scatter` of the independent job-list, and one `gather` of
result dicts; `worker_handler.py` (the actual work) has zero MPI in it. No
`Send`/`Recv`/`Allreduce`/`Barrier` during compute. That's the signature of an
**embarrassingly parallel** workload: MPI provides no computational capability
that SMP-on-a-node-plus-a-scheduler lacks; it's a job-distribution convenience
(one `mpirun` spanning nodes vs. the submission layer fanning out per-node jobs).
This is what grounds the "is MPI worth keeping?" question to Martin: it is observed
from the comm pattern, not inferred from usage.
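
The equivalence can be sketched without any MPI at all. The code below simulates the scatter, independent compute, and gather steps in plain Python and compares the result with a single loop over the same jobs. The toy `worker`, the round-robin split, and all helper names are assumptions for illustration; how `submit_mpi_jobs` actually partitions the job list is not shown in this note.

```python
def worker(job):
    """Placeholder for per-job work; it depends only on its own input."""
    return {"job": job, "result": job * job}


def split_jobs(jobs, n_ranks):
    """Deal jobs round-robin into one list per rank (what a scatter would send)."""
    return [jobs[rank::n_ranks] for rank in range(n_ranks)]


def run_scattered(jobs, n_ranks):
    """Simulate scatter, independent per-rank compute, then gather."""
    chunks = split_jobs(jobs, n_ranks)
    gathered = [[worker(job) for job in chunk] for chunk in chunks]
    # Flatten the per-rank result lists, as the root rank would after a gather.
    return [res for rank_results in gathered for res in rank_results]


def run_single_node(jobs):
    """The same work done in one process, with no distribution step."""
    return [worker(job) for job in jobs]


def by_job(results):
    """Sort results by job id so the two orderings can be compared."""
    return sorted(results, key=lambda r: r["job"])
```

Because no job reads another job's output, the two paths differ only in ordering, which is why the choice between them is about distribution ergonomics rather than capability.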

Note that `MODE` is overloaded across config sections: `CLASSIC`, `MULTI-EPOCH`,
`FIT_VALIDATION`, `VALIDATION` are *module* modes (PSF / ngmix), not `[EXECUTION]`
modes. Only `smp`/`mpi` live under `[EXECUTION]`.
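
To make the overloading concrete, here is a small sketch that reads both kinds of `MODE` from one config. It uses the standard-library `configparser` as a stand-in (ShapePipe's own config reader may behave differently), and the section name `EXAMPLE_MODULE_RUNNER` is hypothetical.

```python
import configparser

# Illustrative config text; EXAMPLE_MODULE_RUNNER is a made-up section name.
CONFIG_TEXT = """
[EXECUTION]
MODE = SMP

[EXAMPLE_MODULE_RUNNER]
MODE = MULTI-EPOCH
"""


def read_modes(text):
    """Return the execution mode and a dict of per-module MODE values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    execution_mode = parser["EXECUTION"]["MODE"].lower()
    module_modes = {
        section: parser[section]["MODE"]
        for section in parser.sections()
        if section != "EXECUTION" and parser.has_option(section, "MODE")
    }
    return execution_mode, module_modes
```

Keeping the lookup scoped to the section name is what prevents a module-level `MODE` from being mistaken for the execution mode.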

## Axis 2: scheduler (the batch wrapper, outside ShapePipe)

- **PBS** (`#PBS` / `qsub`): the 2019 `example/pbs/` scripts. **Dead** on candide
  (migrated to SLURM). All `#PBS` directives removed on the #737 branch.
- **SLURM** (`#SBATCH` / `sbatch`): **current everywhere**. canfar since ~2020,
  candide since 2024.

## What the dates and tooling show

The maintained submission tooling is SMP-only and SLURM-based: `scripts/sh/run_scratch_local.sh`
(2024-11, *"submit jobs on candide"*) → `init_run_exclusive_canfar.sh` → `job_sp_canfar.bash`,
all `sbatch`, all **SMP** via `N_SMP` ("SMP mode only" in their help), and still **conda**
(`CONDA_PREFIX=$HOME/.conda/envs/shapepipe`), *not* the container. The `example/pbs/candide_{smp,mpi}.sh`
scripts are 2019 **teaching examples** (untouched until the #737 branch).

This is evidence about the tooling, not a claim about run history. It's suggestive (the
SMP tooling is what's been maintained, while the MPI mode and its example config drifted untouched),
but most processing ran on canfar, which isn't visible from this repo, so how much MPI was
actually used is a question for the people who ran it, not something the repo can answer.

## Implications

- The MPI fix is worth landing: `mpi` is a supported mode, and getting it working through
  the container on candide was the point. It is framed as enablement/verification, not as
  unblocking some known-active workload.
- Production scripts (SMP + SLURM + conda) are untouched by #737 and out of scope; they're
  also **not yet containerized**, a future gap to name.
- **Decision deferred to Martin (asked in #737):** is MPI worth getting working /
  maintaining on candide at all, or should candide just use SMP (which works through
  the container via `candide_smp.sh`)? Given SMP and MPI are the same computation, MPI
  earns its keep only as an ergonomic convenience. We do *not* retire it unilaterally:
  it's a documented public mode; #737 leaves it in working order and Martin makes the
  call. If kept, add a CI smoke test so it can't silently rot again; if dropped, removal is
  clean and contained (`mpi_run.py`, `run_mpi`, the `import_mpi` branches, `mpi4py`,
  `candide_mpi.sh`).
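
One cheap form such a CI smoke check could take, sketched under assumptions: it compares a call site's positional arguments against a function signature with `inspect.signature(...).bind`, so it needs neither an MPI launch nor a cluster. The toy `worker` and the helper name `call_matches_signature` are illustrative and do not exist in the repo; a real check would import the actual `worker()` and mirror the arguments built in `mpi_run.py`.

```python
import inspect


def call_matches_signature(func, args):
    """Return True if func could be called with these positional args."""
    try:
        inspect.signature(func).bind(*args)
    except TypeError:
        # bind() raises TypeError for missing or surplus arguments.
        return False
    return True


def worker(a1, a2, a3, a4, a5, a6, a7, module_config_sec):
    """Toy stand-in for an 8-parameter worker."""
    return None
```

A signature check only catches arity and naming drift; it does not exercise the scatter/gather path, so an end-to-end run under `mpirun` would still be the stronger test where the environment allows it.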
