Commit 6fe67e8

coreyjadams authored and kashif committed
Centralize cache delete-and-push mechanism to one place (#1645)
* Centralize cache delete-and-push mechanism to one place
* Pin CI deps in a file, instead of the github action.
* Use a localized reinstall for pyg.
* increasing retry backoff

Signed-off-by: Kashif Rasul <kashif.rasul@gmail.com>
1 parent 2c5bb96 commit 6fe67e8

6 files changed

Lines changed: 335 additions & 169 deletions

.github/CACHE_CONTRACT.md

Lines changed: 75 additions & 5 deletions
@@ -37,7 +37,7 @@ from three independent sources: a pinned CUDA container image, a pinned
 | Invalidates when | container image or Python version changes (prefix change → new slot) |
 | Does **not** invalidate on | `uv.lock`, `pyproject.toml`, or kernel source changes (each compiler handles its own source-hash invalidation internally) |
 | Restore semantics | **fail-open**; missing cache only costs compilation time, never correctness |
-| Save semantics | nightly `testmon` job only, via delete-before-save; PR workflows restore but never save |
+| Save semantics | nightly `testmon` job only, via the `replace-cache` action; PR workflows restore but never save |
 
 The JIT compilation cache bundles all JIT compiler artifact directories
 under a single umbrella path. Each compiler writes to a subdirectory
@@ -57,6 +57,74 @@ old ones.
 To add a new JIT backend: create a subdirectory under `$JIT_CACHE_DIR`,
 set the backend's cache-path env var in the test step, done.
 
+### Testmon database cache (`.testmondata*`)
+
+| Property | Value |
+|---|---|
+| Key | `<TESTMON_CACHE_KEY_PREFIX>-latest` |
+| Prefix encodes | nightly identity (`testmon-nightly`) |
+| Suffix | literal `latest` (mutable slot, refreshed via delete-before-save) |
+| Contents | `.testmondata`, `.testmondata-shm`, `.testmondata-wal` -- testmon's per-test dependency graph and last-run signatures |
+| Invalidates when | prefix is bumped (essentially never, by design) |
+| Does **not** invalidate on | `uv.lock` or `pyproject.toml` changes -- testmon detects changed dependency hashes itself and re-runs only the affected tests |
+| Restore semantics | **fail-open**; a miss only costs full-suite runtime, never correctness, and testmon handles stale DBs gracefully |
+| Save semantics | nightly `testmon` job only, via the `replace-cache` action with `if: always()` so partial DBs from flaky runs still publish |
+
+Historical note: the key was previously suffixed with
+`hashFiles('uv.lock', 'pyproject.toml')`. Because GitHub Actions
+caches are immutable, two consecutive nightlies with an unchanged
+lockfile (the common case) collided on the same key, and the second
+save logged `Failed to save: Unable to reserve cache` only as a
+*warning*. The stale DB persisted for days, PRs restored it via the
+prefix fallback, and testmon then invalidated everything because the
+realized environment had drifted away from what the cached DB
+recorded. Switching to a `-latest` mutable slot via `replace-cache`
+fixes the save bug, and the embedded verify step turns any future
+silent save failure into a hard job failure.
+
+### Coverage baseline cache (`.coverage*`)
+
+| Property | Value |
+|---|---|
+| Key | `<COVERAGE_CACHE_KEY_PREFIX>-latest` |
+| Prefix encodes | nightly identity (`coverage-nightly`) |
+| Suffix | literal `latest` (mutable slot, refreshed via delete-before-save) |
+| Contents | parallel-mode coverage shards (`.coverage.*`) produced by the nightly's full-suite pytest run, before `coverage combine` |
+| Invalidates when | prefix is bumped |
+| Does **not** invalidate on | `uv.lock` or `pyproject.toml` changes |
+| Restore semantics | **fail-open**; PR coverage merges its own shards on top of the restored baseline |
+| Save semantics | nightly `coverage` job only, via the `replace-cache` action |
+
+Same immutable-key bug class as testmon; migrated to the `-latest`
+slot for parity.
+
+## Reusable building blocks
+
+### `replace-cache` action ([.github/actions/replace-cache/action.yml](actions/replace-cache/action.yml))
+
+All four mutable-slot caches above (uv, JIT, testmon, coverage) share
+the same delete-before-save recipe: GitHub Actions cache slots are
+immutable, so refreshing a `-latest` key requires deleting the
+existing entry, calling `actions/cache/save`, and (because the save
+silently no-ops on key collision) re-querying `gh cache list` to
+confirm the slot now exists. The `replace-cache` composite action
+encapsulates that recipe:
+
+```yaml
+- name: Replace <some> cache
+  if: <caller-supplied gate>
+  uses: ./.github/actions/replace-cache
+  with:
+    path: <one or more paths>
+    key: <foo>-latest
+    description: <human-readable label>
+    github-token: ${{ secrets.GITHUB_TOKEN }}
+```
+
+The verify step is on by default (`verify: "true"`). Disable only
+when verification is genuinely undesirable; the default exists
+because silent save failures are how stale slots persist for days.
+
 ## Why no `.venv` cache
 
 A previous iteration of this pipeline also cached the realized `.venv`
@@ -114,10 +182,12 @@ Guarantees:
 - **Concurrency**: the nightly workflow declares
   `concurrency: nightly-github-uv` with `cancel-in-progress: false` so
   two overlapping runs cannot race on the static `-latest` uv cache key.
-- **Save verification**: after `actions/cache/save@v4` writes the uv
-  download cache slot, the workflow re-queries `gh cache list` to
-  confirm the entry exists. `cache/save` silently no-ops on key
-  collision; without verification a corrupted slot can persist for days.
+- **Save verification**: every mutable-slot save (uv download, JIT,
+  testmon, coverage) goes through the `replace-cache` action, which
+  re-queries `gh cache list` after `actions/cache/save` and fails the
+  job if the slot is not visible. `cache/save` silently no-ops on
+  key collision and only logs a warning on reservation failure;
+  without verification a corrupted slot can persist for days.
 - **Lockfile-mutation guard**: [.github/actions/setup-uv-env/action.yml](actions/setup-uv-env/action.yml)
   snapshots `sha256(uv.lock)` and `sha256(pyproject.toml)` before any uv
   command runs and compares them again at the end. Any drift (caused by
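
The key collision described in the contract's historical note, and the delete-before-save recipe that replaces it, can be illustrated with a toy in-memory model. This is a minimal sketch and not GitHub's cache service: `ImmutableCacheStore`, `replace_slot`, and the literal key `testmon-nightly-latest` are hypothetical names chosen for illustration, and the no-op-on-collision behavior is modeled only as the contract describes it.

```python
class ImmutableCacheStore:
    """Toy stand-in (assumption) for a cache whose keys cannot be overwritten."""

    def __init__(self):
        self._entries = {}

    def save(self, key, payload):
        """Store payload under key; silently do nothing if the key exists."""
        if key in self._entries:
            return False  # collision: the old entry stays in place
        self._entries[key] = payload
        return True

    def delete(self, key):
        """Remove the key if present; a missing key is tolerated."""
        return self._entries.pop(key, None) is not None

    def exists(self, key):
        return key in self._entries

    def restore(self, key):
        return self._entries.get(key)


def replace_slot(store, key, payload):
    """Delete-before-save, then verify that the slot is visible."""
    store.delete(key)
    store.save(key, payload)
    if not store.exists(key):
        raise RuntimeError(f"save did not take effect for key {key}")


KEY = "testmon-nightly-latest"  # hypothetical key for illustration

store = ImmutableCacheStore()
store.save(KEY, "db from night 1")

# Second nightly saving under the same key: the save is a no-op.
collided = store.save(KEY, "db from night 2")
stale = store.restore(KEY)

# With delete-before-save the slot is actually refreshed.
replace_slot(store, KEY, "db from night 2")
fresh = store.restore(KEY)
```

In this model the plain second save leaves night 1's payload in the slot, which mirrors the stale-DB symptom in the historical note, while `replace_slot` leaves night 2's payload there.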

.github/actions/replace-cache/action.yml

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
+name: Replace mutable-slot cache
+description: |
+  Save a path to a GitHub Actions cache slot under a mutable key, replacing
+  any existing entry under that key. The actions/cache/save action is a
+  silent no-op when its key already exists (caches are immutable), so the
+  canonical recipe to "refresh" a `-latest`-style slot is delete-before-save:
+
+  1. Look up the key with `gh cache list` and `gh cache delete` it if
+  present (tolerate missing key for first runs and prefix bumps).
+  2. Save with actions/cache/save@v5.
+  3. Optionally re-query `gh cache list` to confirm the save took effect.
+  GitHub's cache index is eventually consistent, so the verify step
+  retries a few times before failing. Without this verify step, a
+  collision that prevented the save (e.g. a concurrent writer or a
+  stale slot the delete somehow missed) would surface only as a
+  `Warning: Cache save failed.` line and a corrupted slot would
+  persist indefinitely.
+
+  This action centralises that recipe so callers can express
+  delete-before-save as a single step. Each invocation site supplies its
+  own `if:` gate (e.g. `if: always()` or `if: <cold-cache>`) at the
+  workflow level; this action does not encode any save policy of its own.
+
+  Failure semantics: the delete tolerates a missing slot, but the save
+  itself and (when enabled) the verify step will fail the job loudly. A
+  silent save failure on a mutable slot is exactly the bug class this
+  action exists to prevent.
+inputs:
+  path:
+    description: |
+      Cache path(s), forwarded to actions/cache/save. Multi-line values
+      are supported with the same semantics as actions/cache.
+    required: true
+  key:
+    description: |
+      Mutable-slot cache key (e.g. `foo-latest`). This is the slot that
+      will be deleted (if present) and then saved.
+    required: true
+  description:
+    description: |
+      Short human-readable label used in log messages, e.g. "uv download
+      cache" or "testmon database".
+    required: false
+    default: "cache"
+  verify:
+    description: |
+      When `"true"` (default), re-query `gh cache list` after the save and
+      fail the job if the slot is not visible within a few retries. Set
+      to `"false"` only when verification is genuinely undesirable; the
+      default exists because silent save failures are how stale slots
+      persist for days.
+    required: false
+    default: "true"
+  github-token:
+    description: |
+      Token for the `gh` CLI used by the delete and verify steps. Pass
+      the workflow's `secrets.GITHUB_TOKEN` from the calling workflow.
+    required: true
+runs:
+  using: composite
+  steps:
+    - name: Delete stale ${{ inputs.description }} entry
+      shell: bash
+      env:
+        GH_TOKEN: ${{ inputs.github-token }}
+        CACHE_KEY: ${{ inputs.key }}
+        CACHE_DESC: ${{ inputs.description }}
+        REPO: ${{ github.repository }}
+      run: |
+        set -euo pipefail
+        if ! command -v gh >/dev/null 2>&1; then
+          echo "::error::gh CLI not on PATH; cannot manage ${CACHE_DESC} slot."
+          exit 1
+        fi
+        # Use --json key + --jq for robust matching (no false positives
+        # on prefix overlap from sibling cache keys).
+        existing="$(gh cache list \
+          --repo "$REPO" \
+          --key "$CACHE_KEY" \
+          --json key \
+          --jq '.[].key' \
+          | grep -Fx "$CACHE_KEY" || true)"
+        if [ -n "$existing" ]; then
+          gh cache delete "$CACHE_KEY" --repo "$REPO"
+          echo "deleted stale ${CACHE_DESC}: $CACHE_KEY"
+        else
+          echo "no existing ${CACHE_DESC} to delete: $CACHE_KEY"
+        fi
+
+    - name: Save ${{ inputs.description }}
+      uses: actions/cache/save@v5
+      with:
+        path: ${{ inputs.path }}
+        key: ${{ inputs.key }}
+
+    # actions/cache/save@v5 silently no-ops on key collision; if the
+    # previous delete step somehow left the entry in place (or a
+    # concurrent run repopulated it), we want a hard failure now rather
+    # than a stale cache fed to tomorrow's consumer.
+    - name: Verify ${{ inputs.description }} was saved
+      if: inputs.verify == 'true'
+      shell: bash
+      env:
+        GH_TOKEN: ${{ inputs.github-token }}
+        CACHE_KEY: ${{ inputs.key }}
+        CACHE_DESC: ${{ inputs.description }}
+        REPO: ${{ github.repository }}
+      run: |
+        set -euo pipefail
+        # GitHub's cache index is eventually consistent; use linear back-off
+        # (10 s, 20 s, 30 s, 40 s, 50 s → 150 s total) before failing.
+        for attempt in 1 2 3 4 5; do
+          if gh cache list --repo "$REPO" --key "$CACHE_KEY" --json key --jq '.[].key' \
+            | grep -Fxq "$CACHE_KEY"; then
+            echo "${CACHE_DESC} present: $CACHE_KEY"
+            exit 0
+          fi
+          echo "attempt $attempt: ${CACHE_DESC} not yet visible, sleeping..."
+          sleep $((10 * attempt))
+        done
+        echo "::error::${CACHE_DESC} save did not take effect for key $CACHE_KEY"
+        exit 1
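
The verify step's retry schedule is easier to reason about when the check and the sleep are injectable. The following is a minimal sketch that mirrors the shell loop above as written: `verify_with_backoff`, `is_visible`, and `sleep` are hypothetical names, and the visibility check here is a plain callable standing in for the `gh cache list` query.

```python
def verify_with_backoff(is_visible, sleep, attempts=5, step_seconds=10):
    """Poll is_visible() up to `attempts` times with linear back-off.

    After each miss, sleep for step_seconds * attempt, exactly as the
    shell loop does. Returns True on the first hit, False otherwise.
    """
    for attempt in range(1, attempts + 1):
        if is_visible():
            return True
        sleep(step_seconds * attempt)
    return False


# Case 1: the slot becomes visible on the third query.
slept = []
answers = iter([False, False, True])
ok = verify_with_backoff(lambda: next(answers), slept.append)

# Case 2: the slot never becomes visible.
slept_on_failure = []
failed = verify_with_backoff(lambda: False, slept_on_failure.append)
total_wait = sum(slept_on_failure)
```

In the failing case the recorded sleeps are 10, 20, 30, 40 and 50 seconds, which matches the 150 s total quoted in the action's comment. Note that, as in the shell loop, the final 50 s sleep is not followed by another query, so the last check happens 100 s after the first.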

.github/actions/setup-uv-env/action.yml

Lines changed: 22 additions & 49 deletions
@@ -166,63 +166,36 @@ runs:
         fi
 
     # --- CI-only test dependencies ----------------------------------------
-    # These modules are referenced by tests in test/ via @requires_module
-    # or pytest.importorskip but have no home in pyproject extras (either
-    # because they are pure CI test tooling like moto, or because they
-    # require special install paths like the PyG wheel index). Installed
-    # via `uv pip install` AFTER the lockfile-mutation guard so uv.lock
-    # remains the single source of truth for the synced environment, and
-    # this step layers on top without invalidating that guarantee.
-    #
-    # Spec coupling notes:
-    # * Pure-PyPI specs mirror Dockerfile:240-243 to keep container and
-    # CI environments in lockstep.
-    # * The PyG --find-links URL encodes the locked torch version (see
-    # uv.lock entry for torch 2.11.0+cu128). Bump the
-    # torch-X.Y.Z+cu128 segment in lockstep when the torch pin moves
-    # -- same hand-maintained coupling already documented for the
-    # natten wheel index in pyproject.toml.
-    #
-    # earth2grid is intentionally NOT installed here: the source build
-    # (see Dockerfile:246 for the canonical pin) is fragile in the CI
-    # container. HEALPix tests will continue to skip via @requires_module
-    # until earth2grid ships a wheel or the build is fixed.
-    #
-    # torch_scatter / torch_sparse / torch_cluster are pulled in as a
-    # source build via the gnns extra. Without CUDA toolchain visible
-    # at build time uv lands the CPU-only wheel, which makes FigConvNet
-    # (and any model calling torch_scatter.segment_csr / similar on a
-    # CUDA tensor) fail with "Not compiled with CUDA support". Force a
-    # reinstall from the PyG wheel index, which ships pre-built CUDA
-    # wheels matching the locked torch version.
+    # Test deps with no home in pyproject extras (pure CI tooling, or
+    # special install paths like the PyG wheel index) layered on top of
+    # the synced env. Installed AFTER the lockfile-mutation guard so
+    # uv.lock remains the single source of truth for the synced
+    # environment; this step layers on top without invalidating that
+    # guarantee. Pin list and per-pin rationale live in
+    # .github/ci-requirements.txt.
     - name: Install CI-only test dependencies
       shell: bash
       env:
         UV_LINK_MODE: copy
-        PYG_WHL_INDEX: "https://data.pyg.org/whl/torch-2.11.0+cu128.html"
-        # cuml-cu12 -> cudf -> numba caps numpy at <=2.2; uv.lock pins
-        # numpy 2.2.6 under the cu12 extra precisely for this reason.
-        # Without an explicit constraint here, uv pip install resolves
-        # transitive deps (e.g. tensorstore, pyarrow) freely and bumps
-        # numpy to 2.4.x, which crashes every test that touches cuml.
-        # Re-applying the cap on each layered install keeps numba happy.
-        NUMPY_PIN: "numpy<2.3"
       run: |
         set -euo pipefail
         echo "::group::install CI-only test dependencies"
+        # `--reinstall-package` (NOT bare `--reinstall`) is scoped to the
+        # four PyG names so we swap their CPU-only wheels (installed by
+        # `uv sync` via the `gnns` extra) for the CUDA wheels from the
+        # --find-links index baked into the requirements file. Bare
+        # `--reinstall` would imply `--refresh` over the FULL resolved
+        # closure (~60 packages including transitive deps like pyarrow,
+        # cryptography, opentelemetry), bumping them to whatever's
+        # currently latest on PyPI and breaking ABI compatibility with
+        # the lockfile-pinned GPU stack (cudf / cuml / pylibcudf were
+        # built against the synced pyarrow ABI, etc.).
         uv pip install --python .venv/bin/python \
-          "$NUMPY_PIN" \
-          "moto[s3]>=5.0.28" \
-          "numpy-stl" \
-          "scikit-image>=0.24.0" \
-          "shapely" \
-          "multi-storage-client[boto3]>=0.33.0" \
-          "tensorstore" \
-          "pyarrow"
-        uv pip install --python .venv/bin/python --reinstall \
-          "$NUMPY_PIN" \
-          torch_scatter torch_sparse torch_cluster pyg_lib \
-          --find-links "$PYG_WHL_INDEX"
+          --reinstall-package torch_scatter \
+          --reinstall-package torch_sparse \
+          --reinstall-package torch_cluster \
+          --reinstall-package pyg_lib \
+          -r .github/ci-requirements.txt
         echo "::endgroup::"
 
     - name: Report cache sizes
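
The scoped reinstall in the new install step follows a simple pattern: one `--reinstall-package` flag per named package, with everything else left to the requirements file. A minimal sketch of assembling that argument list is shown here; `build_install_command` and `PYG_PACKAGES` are hypothetical helpers for illustration and are not part of the commit, and the sketch only builds the command without running uv.

```python
def build_install_command(requirements_file, reinstall_packages,
                          python=".venv/bin/python"):
    """Assemble a `uv pip install` argv that reinstalls only named packages."""
    cmd = ["uv", "pip", "install", "--python", python]
    for name in reinstall_packages:
        # Scoped flag: only this package is reinstalled; the rest of the
        # resolved closure is left as the lockfile sync installed it.
        cmd += ["--reinstall-package", name]
    cmd += ["-r", requirements_file]
    return cmd


# The four PyG names listed in the diff above.
PYG_PACKAGES = ["torch_scatter", "torch_sparse", "torch_cluster", "pyg_lib"]

command = build_install_command(".github/ci-requirements.txt", PYG_PACKAGES)
print(" ".join(command))
```

Keeping the package list in one place makes it harder for the flags and the pinned entries in the requirements file to drift apart when a fifth package is added.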

.github/ci-requirements.txt

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
+# CI-only test dependencies layered on top of `uv sync --frozen`.
+#
+# Why a separate file (not pyproject extras):
+# * Pure test tooling (moto) or no extras home (numpy-stl, scikit-image, ...).
+# * PyG CUDA wheels live on a private index, not PyPI.
+#
+# Why every spec is `==`-pinned:
+# testmon hashes site-packages content into its env fingerprint; any
+# upstream release between the nightly that builds the testmon DB and
+# the PR that consumes it would invalidate the entire cache. Bumping a
+# pin is a deliberate change: the next nightly repopulates the DB
+# against the new fingerprint and PRs start hitting again. Pinning
+# here only stabilises *direct* deps; transitive churn (e.g. boto3 ->
+# urllib3) can still trigger invalidations, in which case the next
+# escalation is a constraints.txt passed via `uv pip install -c ...`.
+#
+# Coupling notes (bump in lockstep):
+# * Dockerfile lines 240-243 still use `>=` lower bounds for the
+# pure-PyPI specs; tighten in a follow-up.
+# * The PyG --find-links URL encodes the locked torch version (see
+# uv.lock for torch 2.11.0+cu128). PyG wheels carry a local-version
+# label like `2.1.2+pt211cu128`; per PEP 440 that sorts above plain
+# `2.1.2`, so a `==2.1.2` pin selects the CUDA wheel without us
+# hard-coding the full local segment.
+#
+# earth2grid is intentionally NOT installed: the source build is fragile
+# in the CI container. HEALPix tests skip via @requires_module until
+# earth2grid ships a wheel or the build is fixed.
+
+# PyG CUDA wheel index. Bump the torch-X.Y.Z+cu128 segment in lockstep
+# with the locked torch version.
+--find-links https://data.pyg.org/whl/torch-2.11.0+cu128.html
+
+# cuml-cu12 -> cudf -> numba caps numpy at <=2.2; uv.lock pins numpy
+# 2.2.6 under the cu12 extra for this reason. Re-applied here so layered
+# installs (tensorstore, pyarrow, ...) cannot bump numpy to 2.4.x and
+# crash every test that touches cuml.
+numpy<2.3
+
+# Test tooling with no extras home.
+moto[s3]==5.2.1
+numpy-stl==3.2.0
+scikit-image==0.26.0
+shapely==2.1.2
+multi-storage-client[boto3]==0.48.0
+tensorstore==0.1.83
+pyarrow==24.0.0
+
+# PyG CUDA wheels. The `gnns` extra installs CPU-only wheels via
+# `uv sync` (no CUDA toolchain visible at build time), which makes
+# torch_scatter.segment_csr / similar fail on CUDA tensors with
+# "Not compiled with CUDA support". setup-uv-env passes
+# `--reinstall-package` for each of these four names so they're swapped
+# for the CUDA wheels from the --find-links index above; the rest of
+# the resolved closure is left alone (a bare `--reinstall` would refresh
+# every transitive dep too and break ABI compatibility with lockfile-
+# pinned packages like pyarrow / cudf / cuml).
+torch_scatter==2.1.2
+torch_sparse==0.6.18
+torch_cluster==1.6.3
+pyg_lib==0.6.0
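
The coupling note about local-version labels rests on two PEP 440 rules: a `==` pin without a local label ignores the candidate's local label when matching, and a version with a local label sorts above the same public version. A deliberately simplified sketch of both rules follows. The helpers `split_local`, `satisfies_exact_pin`, and `sort_key` are hypothetical, they handle only plain dotted release numbers, and they do not model how uv ranks indexes or wheels against source distributions.

```python
def split_local(version):
    """Split 'X.Y.Z+local' into ('X.Y.Z', 'local'); local may be empty."""
    public, _, local = version.partition("+")
    return public, local


def satisfies_exact_pin(pin, candidate):
    """Simplified `==` matching: a pin without a local label ignores
    the candidate's local label; a pin with one must match it exactly."""
    pin_public, pin_local = split_local(pin)
    cand_public, cand_local = split_local(candidate)
    if pin_local:
        return (pin_public, pin_local) == (cand_public, cand_local)
    return pin_public == cand_public


def sort_key(version):
    """Simplified ordering: compare the numeric release first, then rank
    a version with a local label above the same public version."""
    public, local = split_local(version)
    release = tuple(int(part) for part in public.split("."))
    return (release, local != "", local)


candidates = ["2.1.1+pt211cu128", "2.1.2", "2.1.2+pt211cu128"]

# Both the plain and the local-labelled 2.1.2 satisfy the pin ...
matching = [c for c in candidates if satisfies_exact_pin("2.1.2", c)]

# ... and the local-labelled one sorts highest among them.
best = max(matching, key=sort_key)
```

Under these simplified rules a `==2.1.2` pin accepts `2.1.2+pt211cu128` and prefers it over plain `2.1.2`, which is the behavior the coupling note relies on for `torch_scatter`.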
