Commit a4825a7

Merge pull request #737 from CosmoStat/cleanup/candide-scripts-container
Fix MPI on candide (OpenMPI 5 image + latent code bug); containerize & SLURM-ify candide scripts
2 parents 469c6dc + 1300295 commit a4825a7

21 files changed

Lines changed: 578 additions & 263 deletions

.felt/docker-uv-revert/docker-uv-revert.md

Lines changed: 7 additions & 8 deletions
@@ -6,11 +6,11 @@ tags:
 - docker
 - infra
 created-at: 2026-04-27T11:26:45.677512058+02:00
-outcome: 'PR #719 (chore: switch Dockerfile to slim Python + uv lockfile) opened and CI-green on first try (3m31s); ready for Martin''s review. Drops conda double-install, makes pyproject SSOT + uv.lock the pinned manifest, switches WeightWatcher from sed-patched source build to Debian''s pre-patched 1.12+dfsg-3 package, adds binary smoke tests to deploy-image.yml.'
+outcome: 'PR #719 (chore: switch Dockerfile to slim Python + uv lockfile) opened and CI-green on first try (3m31s); ready for review. Drops conda double-install, makes pyproject SSOT + uv.lock the pinned manifest, switches WeightWatcher from sed-patched source build to Debian''s pre-patched 1.12+dfsg-3 package, adds binary smoke tests to deploy-image.yml.'
 decisions:
 base:
 label: Base image
-rationale: Conda double-install was the actual problem; cleanest resolution is to drop conda entirely. Martin's canfar concern is satisfied as long as the slim image works on canfar.
+rationale: Conda double-install was the actual problem; cleanest resolution is to drop conda entirely. The canfar deployment concern is satisfied as long as the slim image works on canfar.
 default: python-slim
 options:
 python-slim:
@@ -50,15 +50,15 @@ decisions:
 label: uv + pyproject + uv.lock; uv sync --frozen in Dockerfile
 modernize:
 label: Modernize package versions
-rationale: 'We determined which versions MUST stay pinned: only ngmix (Axel''s stable_version branch — replacement is tracked separately). Everything else can move to current latest because uv resolved cleanly and CI smoke test still passes (3m42s). If a real pipeline run on canfar surfaces a numpy-2 / pandas-3 break, the fix is a targeted constraint + uv lock, not a wholesale revert.'
+rationale: 'We determined which versions MUST stay pinned: only ngmix (pinned to a stable_version fork branch — replacement is tracked separately). Everything else can move to current latest because uv resolved cleanly and CI smoke test still passes (3m42s). If a real pipeline run on canfar surfaces a numpy-2 / pandas-3 break, the fix is a targeted constraint + uv lock, not a wholesale revert.'
 default: stay-current
 options:
 stay-conservative:
 label: Keep pre-v2 minimums (numpy 1.26, astropy 6.1, pandas 2.2); only bump when forced
 excluded: true
 excluded_reason: Drift between pyproject signal and lockfile reality; loses the chance to surface numpy-2/pandas-3 incompatibilities at PR time when CI is fast
 stay-current:
-label: Bump pyproject minimums to current major versions (numpy 2, astropy 7, pandas 3, galsim 2.8, mpi4py 4.1, etc.); pin ngmix to Axel's stable_version branch
+label: Bump pyproject minimums to current major versions (numpy 2, astropy 7, pandas 3, galsim 2.8, mpi4py 4.1, etc.); pin ngmix to its stable_version fork branch
 insights:
 ci-fast:
 claim: 'First CI run on PR #719 went green in 3m31s. uv installed 238 packages in 322ms — everything resolved to prebuilt wheels, no source compilation of galsim/mpi4py/python-pysap/etc. Massive speedup vs. previous build.'
@@ -97,11 +97,10 @@ The `--frozen` flag is the discipline mechanism: a stale lockfile cannot ship.
 ## Followups

 - Watch CI on #719. The slim-base apt list is conjectural — galsim/mpi4py/python-pysap pull a lot of system deps and we may need to add more (`libatlas-base-dev`, `libblas-dev`, etc).
-- If CI needs anything beyond what's in the apt block, that's the surface that benefits from a [[shapepipe/prs-in-flight]] note for next time.
-- After this lands, [[shapepipe/prs-in-flight]] PRs #708 and #714 may need a small rebase.
-- Optional: separate `Dockerfile.canfar` building on skaha if there's a concrete deployment reason. Currently conjectural — Martin floated it but we agreed slim should work on canfar.
+- If CI needs anything beyond what's in the apt block, that's worth noting for next time.
+- After this lands, PRs #708 and #714 may need a small rebase.
+- Optional: separate `Dockerfile.canfar` building on skaha if there's a concrete deployment reason. Currently conjectural — floated as a possibility, but slim should work on canfar.

 ## Connections

 - [[shapepipe]] — root
-- [[shapepipe/prs-in-flight]] — touches the testing-scaffold xfail set and the develop-bugs PR

.felt/fabian-coord-bug/fabian-coord-bug.md

Lines changed: 0 additions & 10 deletions
This file was deleted.

.felt/ngmix-update/ngmix-update.md

Lines changed: 2 additions & 2 deletions
@@ -1,9 +1,9 @@
 ---
-name: ngmix library upgrade + Lucy wrapper sync
+name: ngmix library upgrade + wrapper sync
 tags:
 - shapepipe
 - ngmix
 - future
 created-at: 2026-04-27T11:26:51.026191639+02:00
-outcome: 'Future: replace Axel''s stable_version fork with upstream ngmix; reconcile with Lucy''s cleaned-up wrapper from her visit'
+outcome: 'Replace the pinned ngmix fork (a stable_version branch carrying not-yet-upstreamed fixes) with upstream ngmix once those land; reconcile the wrapper afterward.'
 ---

.felt/prs-in-flight/prs-in-flight.md

Lines changed: 0 additions & 76 deletions
This file was deleted.

.felt/shapepipe.md

Lines changed: 28 additions & 38 deletions
@@ -1,50 +1,40 @@
 ---
-name: ShapePipe maintenance & PRs
+name: ShapePipe — project knowledge & active threads
 tags:
 - shapepipe
-- portolan
 created-at: 2026-04-27T11:26:38.71538657+02:00
-outcome: 'Root: collaboration with Martin on ShapePipe — PRs, infra, future ngmix and Fabian work'
+outcome: 'Root of ShapePipe''s felt store: the stack division, repo conventions, and the why behind in-flight infra/cleanup threads.'
 ---

-ShapePipe is the UNIONS shape-measurement pipeline. I'm not the primary
-maintainer (that's Martin Kilbinger); my role is collaborator helping
-clean up infra, surface bugs, and keep the merge queue moving while
-Martin focuses on science threads.
+This is the root of ShapePipe's felt store — shared notes on architecture
+decisions, conventions, and in-flight work, for the team and AI agents alike.
+ShapePipe is the UNIONS galaxy shape-measurement pipeline; `CLAUDE.md` covers the
+build / container / CI overview, and the fibers here carry the *why*. Start here,
+then follow the links.

-## Working agreement with Martin
+## Stack division

-Surfaced over a 2026-04-27 walking conversation. Captured in
-[[shapepipe/prs-in-flight]] and the per-thread fibers below.
+ShapePipe **produces** shear catalogues; `sp_validation` / `cosmo_val`
+**consume** and validate them; `cs_util` holds code shared across both. A concern
+about *validating* catalogues belongs downstream, not in ShapePipe.

-- I review and patch his PRs; he reviews mine. Bugs found during review
-go to a dedicated PR rather than getting bundled into his feature
-branch (per `feedback_separate_infra_prs`).
-- v2.0 was merged fast (it was ready). The skaha base it brought in is
-the active source of pain → see [[shapepipe/docker-uv-revert]].
-- I file the issues; Claude usually drafts the PRs in my voice.
-Disclosure on Claude-only review per
-`feedback_claude_only_review_disclosure`.
-
-## Active threads
-
-- **[[shapepipe/docker-uv-revert]]** — slim Python + uv lockfile, drop conda. PR #719 (draft).
-- **[[shapepipe/prs-in-flight]]** — tracking #708 (testing scaffold), #714 (develop bugs), #719 (this one).
-
-## Future work
+## Conventions specific to this repo

-- **[[shapepipe/ngmix-update]]** — replace Axel's stable_version fork
-with upstream ngmix; reconcile with Lucy's wrapper.
-- **[[shapepipe/fabian-coord-bug]]** — port Fabian's 1-line coord
-propagation fix; first need his image-sim code on github.
+- **Rho-statistics are obsolete inside ShapePipe.** PSF-systematics validation
+moved downstream to `sp_validation` / `cosmo_val` (via `shear_psf_leakage`);
+the stile/treecorr rho code was removed in #715. But the **meanshapes /
+ellipticity focal-plane plots** (`mccd_plots_runner`) are *deliberately kept*
+they are a general PSF/star-catalogue diagnostic, not rho-stats, and feed
+catalogue-paper figures. Don't delete that path along with rho-stats; see
+[[shapepipe/cleanup-rhostats-jobscripts]] for where the boundary actually sits.
+- Run the pipeline through the container; use `python3.12` explicitly inside it.
+- **ngmix** is pinned to a fork branch until fixes land upstream — don't bump
+that dependency line. [[ngmix-update]] tracks the path back to upstream.

-## Conventions specific to this repo
+## Active threads

-- Container runs through `app` (apptainer wrapper); use `python3.12`
-inside the shapepipe container (see `reference_containers`).
-- ShapePipe produces; `sp_validation` consumes; `cs_util` is shared (see
-`project_stack_division`).
-- Rho stats are obsolete here — sp_validation/cosmo_val took over (see
-`project_rho_stats_obsolete`).
-- Royal "we" in PR/issue voice; specific findings attributed to Claude
-by name (see `feedback_writing_voice_on_cails_behalf`).
+- **[[shapepipe/ci-green-on-develop]]** / **[[shapepipe/test-suite]]** — a
+tiered, in-image test suite and trustworthy CI on `develop`.
+- **[[docker-uv-revert]]** — slim Python base + uv lockfile, dropping conda.
+- **[[shapepipe/mpi-hybrid]]** — running hybrid MPI through the container on candide.
+- **[[ngmix-update]]** — replacing the pinned ngmix fork with upstream.

.felt/shapepipe/ci-develop-trigger/ci-develop-trigger.md

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ just CI. Deserves its own issue; #732 doesn't touch it.

 ## Knock-on

-[[shapepipe/prs-in-flight]]: **#729** (actions group, bumps `setup-miniconda`
+**#729** (actions group, bumps `setup-miniconda`
 v3→v4) hit the layer-1 failure too — confirming the action bump alone
 doesn't fix the path. #729 must rebase on top of #732 once it merges before
 it can go green. The smoke-test work in [[shapepipe/smoke-test-read-only]]
