Commit 363a55a

Merge branch 'main' into coreyjadams-python-3.14-support
2 parents 367d95f + 8e46db6 commit 363a55a

252 files changed

Lines changed: 22928 additions & 7337 deletions

.agents/skills

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
../skills

.github/CACHE_CONTRACT.md

Lines changed: 129 additions & 5 deletions
@@ -37,7 +37,7 @@ from three independent sources: a pinned CUDA container image, a pinned
3737
| Invalidates when | container image or Python version changes (prefix change → new slot) |
3838
| Does **not** invalidate on | `uv.lock`, `pyproject.toml`, or kernel source changes (each compiler handles its own source-hash invalidation internally) |
3939
| Restore semantics | **fail-open**; missing cache only costs compilation time, never correctness |
40-
| Save semantics | nightly `testmon` job only, via delete-before-save; PR workflows restore but never save |
40+
| Save semantics | nightly `testmon` job only, via the `replace-cache` action; PR workflows restore but never save |
4141

4242
The JIT compilation cache bundles all JIT compiler artifact directories
4343
under a single umbrella path. Each compiler writes to a subdirectory
@@ -57,6 +57,128 @@ old ones.
5757
To add a new JIT backend: create a subdirectory under `$JIT_CACHE_DIR`,
5858
set the backend's cache-path env var in the test step, done.
5959

60+
### Testmon database cache (`.testmondata*`)
61+
62+
| Property | Value |
63+
|---|---|
64+
| Key | `<TESTMON_CACHE_KEY_PREFIX>-latest` |
65+
| Prefix encodes | nightly identity (`testmon-nightly`) |
66+
| Suffix | literal `latest` (mutable slot, refreshed via delete-before-save) |
67+
| Contents | `.testmondata`, `.testmondata-shm`, `.testmondata-wal` -- testmon's per-test dependency graph and last-run signatures |
68+
| Invalidates when | prefix is bumped (essentially never, by design) |
69+
| Does **not** invalidate on | `uv.lock` or `pyproject.toml` changes -- testmon detects changed dependency hashes itself and re-runs only the affected tests |
70+
| Restore semantics | **fail-open**; a miss only costs full-suite runtime, never correctness, and testmon handles stale DBs gracefully |
71+
| Save semantics | nightly `testmon` job only, via the `replace-cache` action with `if: always()` so partial DBs from flaky runs still publish |
72+
73+
Historical note: the key was previously suffixed with
74+
`hashFiles('uv.lock', 'pyproject.toml')`. Because GitHub Actions
75+
caches are immutable, two consecutive nightlies with an unchanged
76+
lockfile (the common case) collided on the same key, and the second
77+
save logged `Failed to save: Unable to reserve cache` only as a
78+
*warning*. The stale DB persisted for days, PRs restored it via the
79+
prefix fallback, and testmon then invalidated everything because the
80+
realized environment had drifted away from what the cached DB
81+
recorded. Switching to a `-latest` mutable slot via `replace-cache`
82+
fixes the save bug, and the embedded verify step turns any future
83+
silent save failure into a hard job failure.
84+
85+
#### Why a separate `ci-requirements.lock`
86+
87+
The cache fix above only addresses *saving* the DB; the DB is still
88+
worthless to PRs if testmon's environment fingerprint at PR time
89+
differs from the fingerprint stored at nightly time. Testmon
90+
computes that fingerprint from `importlib.metadata.distributions()`
91+
over the active venv -- i.e. *everything* in
92+
`.venv/lib/python3.12/site-packages`, not just the lockfile-pinned
93+
closure.
94+
95+
`setup-uv-env` builds the venv in two layered steps:
96+
97+
1. `uv sync --frozen --group dev --extra <EXTRAS_TAG>` -- deterministic
98+
against `uv.lock`.
99+
2. `uv pip install -r .github/ci-requirements.txt` -- adds CI-only
100+
test deps that have no home in pyproject extras (moto,
101+
scikit-image, numpy-stl, shapely, multi-storage-client, tensorstore,
102+
plus the PyG CUDA wheel swap).
103+
104+
Step 2 is *not* covered by `uv.lock`. Several of the direct pins in
105+
`ci-requirements.txt` are absent from `uv.lock` entirely, so their
106+
transitive closure (`responses`, `xmltodict`, `jsonpath-ng`,
107+
`lazy-loader`, `tifffile`, `pywavelets`, `imageio`,
108+
`antlr4-python3-runtime`, ...) gets re-resolved fresh against PyPI on
109+
every job. A single transitive minor bump between the nightly that
110+
publishes the testmon DB and the PR that consumes it changes the
111+
sorted `name version` string testmon hashes, trips its
112+
"packages installed have been changed" guard, and re-runs the entire
113+
suite.
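
The sensitivity described above can be illustrated with a minimal Python sketch. It assumes the fingerprint is a hash over the sorted `name version` lines; the SHA-256 choice, the helper names `installed_packages` and `environment_fingerprint`, and the version numbers are illustrative assumptions, not testmon's actual implementation.

```python
import hashlib
from importlib import metadata


def installed_packages():
    """Collect (name, version) pairs for every distribution in the active environment."""
    pairs = []
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip distributions with broken or missing metadata
            pairs.append((name, dist.version))
    return pairs


def environment_fingerprint(packages):
    """Hash the sorted "name version" lines (the hash function is an assumption)."""
    lines = sorted(f"{name} {version}" for name, version in packages)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Hypothetical package sets: the nightly and the PR differ in one transitive pin.
nightly = [("moto", "5.0.0"), ("responses", "0.25.0"), ("xmltodict", "0.13.0")]
pr_job = [("moto", "5.0.0"), ("responses", "0.25.3"), ("xmltodict", "0.13.0")]

# Ordering does not matter because the lines are sorted before hashing.
same = environment_fingerprint(nightly) == environment_fingerprint(list(reversed(nightly)))
# A single transitive version bump is enough to change the fingerprint.
drifted = environment_fingerprint(nightly) != environment_fingerprint(pr_job)
print(same, drifted)
```

Calling `environment_fingerprint(installed_packages())` in two jobs and comparing the results is one way to check whether their venvs would look identical under this model.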
114+
115+
[`.github/ci-requirements.lock`](ci-requirements.lock) is a fully
116+
pinned closure of `ci-requirements.txt` (direct + transitive), passed
117+
to the layered install via `--constraint`. It is generated by
118+
[`.github/regen-ci-deps-lock.sh`](regen-ci-deps-lock.sh), and must be
119+
regenerated and committed whenever a `==` pin in
120+
`ci-requirements.txt` changes.
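
A stale lock can be detected mechanically. The sketch below is a hypothetical helper, not part of the repository: it assumes both files contain plain `name==version` lines (as pip-style requirement and constraint files usually do) and reports the direct pins that the lock does not carry at the same version. The package versions are made up for illustration.

```python
import re

# Matches `name==version`, tolerating optional extras and trailing markers.
PIN_PATTERN = re.compile(r"([A-Za-z0-9._-]+)(?:\[[^\]]*\])?==([^\s;]+).*")


def parse_pins(text):
    """Return {normalized_name: version} for every `name==version` line."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        match = PIN_PATTERN.fullmatch(line)
        if match:
            # Normalize names so `scikit_image` and `scikit-image` compare equal.
            name = re.sub(r"[-_.]+", "-", match.group(1)).lower()
            pins[name] = match.group(2)
    return pins


def stale_pins(requirements_text, lock_text):
    """List direct pins whose version is missing from, or different in, the lock."""
    direct = parse_pins(requirements_text)
    locked = parse_pins(lock_text)
    return sorted(name for name, version in direct.items() if locked.get(name) != version)


# Hypothetical file contents: `moto` was bumped without regenerating the lock.
requirements = "moto==5.0.1\nshapely==2.0.4\n"
lock = "moto==5.0.0\n    # via -r ci-requirements.txt\nshapely==2.0.4\nresponses==0.25.0\n"
outdated = stale_pins(requirements, lock)
print(outdated)
```

A non-empty result would indicate that the lock needs to be regenerated before the change is committed.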
121+
122+
Two ways to run the regen:
123+
124+
1. **Standalone [`Regen CI-deps Lock`](workflows/regen-ci-deps-lock.yml)
125+
workflow** (workflow_dispatch). Runs the regen on a CPU runner
126+
in ~5 min and uploads `.github/ci-requirements.lock` as an
127+
artifact for the maintainer to download and commit. Requires
128+
the workflow file to be on the default branch (GitHub refuses
129+
workflow_dispatch on files that exist only on feature branches --
130+
both the UI dropdown and `gh workflow run --ref` enforce this).
131+
132+
2. **Local docker.** See the header of
133+
[`.github/regen-ci-deps-lock.sh`](regen-ci-deps-lock.sh) for the
134+
`docker run …` invocation. Useful when iterating on the script
135+
itself or when the standalone workflow is unavailable (e.g. a
136+
feature branch where the workflow file has not yet landed on
137+
the default branch).
138+
139+
### Coverage baseline cache (`.coverage*`)
140+
141+
| Property | Value |
142+
|---|---|
143+
| Key | `<COVERAGE_CACHE_KEY_PREFIX>-latest` |
144+
| Prefix encodes | nightly identity (`coverage-nightly`) |
145+
| Suffix | literal `latest` (mutable slot, refreshed via delete-before-save) |
146+
| Contents | parallel-mode coverage shards (`.coverage.*`) produced by the nightly's full-suite pytest run, before `coverage combine` |
147+
| Invalidates when | prefix is bumped |
148+
| Does **not** invalidate on | `uv.lock` or `pyproject.toml` changes |
149+
| Restore semantics | **fail-open**; PR coverage merges its own shards on top of the restored baseline |
150+
| Save semantics | nightly `coverage` job only, via the `replace-cache` action |
151+
152+
Same immutable-key bug class as testmon; migrated to the `-latest`
153+
slot for parity.
154+
155+
## Reusable building blocks
156+
157+
### `replace-cache` action ([.github/actions/replace-cache/action.yml](actions/replace-cache/action.yml))
158+
159+
All four mutable-slot caches above (uv, JIT, testmon, coverage) share
160+
the same delete-before-save recipe: GitHub Actions cache slots are
161+
immutable, so refreshing a `-latest` key requires deleting the
162+
existing entry, calling `actions/cache/save`, and (because the save
163+
silently no-ops on key collision) re-querying `gh cache list` to
164+
confirm the slot now exists. The `replace-cache` composite action
165+
encapsulates that recipe:
166+
167+
```yaml
168+
- name: Replace <some> cache
169+
if: <caller-supplied gate>
170+
uses: ./.github/actions/replace-cache
171+
with:
172+
path: <one or more paths>
173+
key: <foo>-latest
174+
description: <human-readable label>
175+
github-token: ${{ secrets.GITHUB_TOKEN }}
176+
```
177+
178+
The verify step is on by default (`verify: "true"`). Disable only
179+
when verification is genuinely undesirable; the default exists
180+
because silent save failures are how stale slots persist for days.
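
The reason delete-before-save is needed can be seen in a toy Python model. This is an in-memory simplification under stated assumptions: `ImmutableCacheStore` and `replace_cache` are hypothetical names, and the real action talks to GitHub through `gh cache list`, `gh cache delete`, and `actions/cache/save` rather than a dictionary.

```python
class ImmutableCacheStore:
    """Toy model of a cache whose existing keys cannot be overwritten."""

    def __init__(self):
        self._entries = {}

    def save(self, key, payload):
        """Return False without writing when the key already exists (the silent no-op)."""
        if key in self._entries:
            return False
        self._entries[key] = payload
        return True

    def delete(self, key):
        """Remove the key if present; a missing key is tolerated."""
        self._entries.pop(key, None)

    def get(self, key):
        return self._entries.get(key)


def replace_cache(store, key, payload, verify=True):
    """Delete-before-save, then confirm that the slot exists."""
    store.delete(key)
    store.save(key, payload)
    # Like the verify step, this checks only that the key is visible.
    if verify and store.get(key) is None:
        raise RuntimeError(f"save did not take effect for key {key}")


store = ImmutableCacheStore()
store.save("testmon-nightly-latest", "monday-db")

# A plain second save collides with the existing key and changes nothing.
plain_save_wrote = store.save("testmon-nightly-latest", "tuesday-db")
after_plain_save = store.get("testmon-nightly-latest")

# Delete-before-save refreshes the slot.
replace_cache(store, "testmon-nightly-latest", "wednesday-db")
after_replace = store.get("testmon-nightly-latest")
print(plain_save_wrote, after_plain_save, after_replace)
```

In this model the plain save leaves Monday's payload in place, which mirrors how a stale slot can persist when nothing checks the result of the save.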
181+
60182
## Why no `.venv` cache
61183

62184
A previous iteration of this pipeline also cached the realized `.venv`
@@ -114,10 +236,12 @@ Guarantees:
114236
- **Concurrency**: the nightly workflow declares
115237
`concurrency: nightly-github-uv` with `cancel-in-progress: false` so
116238
two overlapping runs cannot race on the static `-latest` uv cache key.
117-
- **Save verification**: after `actions/cache/save@v4` writes the uv
118-
download cache slot, the workflow re-queries `gh cache list` to
119-
confirm the entry exists. `cache/save` silently no-ops on key
120-
collision; without verification a corrupted slot can persist for days.
239+
- **Save verification**: every mutable-slot save (uv download, JIT,
240+
testmon, coverage) goes through the `replace-cache` action, which
241+
re-queries `gh cache list` after `actions/cache/save` and fails the
242+
job if the slot is not visible. `cache/save` silently no-ops on
243+
key collision and only logs a warning on reservation failure;
244+
without verification a corrupted slot can persist for days.
121245
- **Lockfile-mutation guard**: [.github/actions/setup-uv-env/action.yml](actions/setup-uv-env/action.yml)
122246
snapshots `sha256(uv.lock)` and `sha256(pyproject.toml)` before any uv
123247
command runs and compares them again at the end. Any drift (caused by

.github/ISSUE_TEMPLATE/bug_report.yml

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ body:
3131
attributes:
3232
label: Version
3333
description: What version of PhysicsNeMo are you running?
34-
placeholder: "example: 2.0.0"
34+
placeholder: "example: 2.1.0"
3535
validations:
3636
required: true
3737

.github/actions/bootstrap-cudnn-ci/action.yml

Lines changed: 4 additions & 1 deletion
@@ -79,7 +79,10 @@ runs:
7979
PIP_BREAK_SYSTEM_PACKAGES: "1"
8080
run: |
8181
set -euo pipefail
82-
pip install --quiet huggingface_hub[hf_xet]
82+
# typer >= 0.26 dropped click as a runtime dep, but huggingface_hub's
83+
# `hf` CLI still does `import click`. Install click explicitly until
84+
# upstream re-adds it (or the CLI stops importing click directly).
85+
pip install --quiet huggingface_hub[hf_xet] click
8386
8487
- name: Install uv (pinned)
8588
shell: bash

.github/actions/download-ci-data/action.yml

Lines changed: 2 additions & 1 deletion
@@ -57,7 +57,8 @@ runs:
5757
fi
5858
5959
if ! command -v hf >/dev/null 2>&1; then
60-
uv pip install --system --quiet huggingface_hub[hf_xet]
60+
# See bootstrap-cudnn-ci: typer 0.26 dropped click; hf CLI still imports it.
61+
uv pip install --system --quiet huggingface_hub[hf_xet] click
6162
fi
6263
hf download "${HF_REPO}" \
6364
--repo-type dataset \
.github/actions/replace-cache/action.yml

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
1+
name: Replace mutable-slot cache
2+
description: |
3+
Save a path to a GitHub Actions cache slot under a mutable key, replacing
4+
any existing entry under that key. The actions/cache/save action is a
5+
silent no-op when its key already exists (caches are immutable), so the
6+
canonical recipe to "refresh" a `-latest`-style slot is delete-before-save:
7+
8+
1. Look up the key with `gh cache list` and `gh cache delete` it if
9+
present (tolerate missing key for first runs and prefix bumps).
10+
2. Save with actions/cache/save@v5.
11+
3. Optionally re-query `gh cache list` to confirm the save took effect.
12+
GitHub's cache index is eventually consistent, so the verify step
13+
retries a few times before failing. Without this verify step, a
14+
collision that prevented the save (e.g. a concurrent writer or a
15+
stale slot the delete somehow missed) would surface only as a
16+
`Warning: Cache save failed.` line and a corrupted slot would
17+
persist indefinitely.
18+
19+
This action centralises that recipe so callers can express
20+
delete-before-save as a single step. Each invocation site supplies its
21+
own `if:` gate (e.g. `if: always()` or `if: <cold-cache>`) at the
22+
workflow level; this action does not encode any save policy of its own.
23+
24+
Failure semantics: the delete tolerates a missing slot, but the save
25+
itself and (when enabled) the verify step will fail the job loudly. A
26+
silent save failure on a mutable slot is exactly the bug class this
27+
action exists to prevent.
28+
inputs:
29+
path:
30+
description: |
31+
Cache path(s), forwarded to actions/cache/save. Multi-line values
32+
are supported with the same semantics as actions/cache.
33+
required: true
34+
key:
35+
description: |
36+
Mutable-slot cache key (e.g. `foo-latest`). This is the slot that
37+
will be deleted (if present) and then saved.
38+
required: true
39+
description:
40+
description: |
41+
Short human-readable label used in log messages, e.g. "uv download
42+
cache" or "testmon database".
43+
required: false
44+
default: "cache"
45+
verify:
46+
description: |
47+
When `"true"` (default), re-query `gh cache list` after the save and
48+
fail the job if the slot is not visible within a few retries. Set
49+
to `"false"` only when verification is genuinely undesirable; the
50+
default exists because silent save failures are how stale slots
51+
persist for days.
52+
required: false
53+
default: "true"
54+
github-token:
55+
description: |
56+
Token for the `gh` CLI used by the delete and verify steps. Pass
57+
the workflow's `secrets.GITHUB_TOKEN` from the calling workflow.
58+
required: true
59+
runs:
60+
using: composite
61+
steps:
62+
- name: Delete stale ${{ inputs.description }} entry
63+
shell: bash
64+
env:
65+
GH_TOKEN: ${{ inputs.github-token }}
66+
CACHE_KEY: ${{ inputs.key }}
67+
CACHE_DESC: ${{ inputs.description }}
68+
REPO: ${{ github.repository }}
69+
run: |
70+
set -euo pipefail
71+
if ! command -v gh >/dev/null 2>&1; then
72+
echo "::error::gh CLI not on PATH; cannot manage ${CACHE_DESC} slot."
73+
exit 1
74+
fi
75+
# Use --json key + --jq for robust matching (no false positives
76+
# on prefix overlap from sibling cache keys).
77+
existing="$(gh cache list \
78+
--repo "$REPO" \
79+
--key "$CACHE_KEY" \
80+
--json key \
81+
--jq '.[].key' \
82+
| grep -Fx "$CACHE_KEY" || true)"
83+
if [ -n "$existing" ]; then
84+
gh cache delete "$CACHE_KEY" --repo "$REPO"
85+
echo "deleted stale ${CACHE_DESC}: $CACHE_KEY"
86+
else
87+
echo "no existing ${CACHE_DESC} to delete: $CACHE_KEY"
88+
fi
89+
90+
- name: Save ${{ inputs.description }}
91+
uses: actions/cache/save@v5
92+
with:
93+
path: ${{ inputs.path }}
94+
key: ${{ inputs.key }}
95+
96+
# actions/cache/save@v5 silently no-ops on key collision; if the
97+
# previous delete step somehow left the entry in place (or a
98+
# concurrent run repopulated it), we want a hard failure now rather
99+
# than a stale cache fed to tomorrow's consumer.
100+
- name: Verify ${{ inputs.description }} was saved
101+
if: inputs.verify == 'true'
102+
shell: bash
103+
env:
104+
GH_TOKEN: ${{ inputs.github-token }}
105+
CACHE_KEY: ${{ inputs.key }}
106+
CACHE_DESC: ${{ inputs.description }}
107+
REPO: ${{ github.repository }}
108+
run: |
109+
set -euo pipefail
110+
# GitHub's cache index is eventually consistent; use linear back-off
111+
# (10 s, 20 s, 30 s, 40 s, 50 s → 150 s total) before failing.
112+
for attempt in 1 2 3 4 5; do
113+
if gh cache list --repo "$REPO" --key "$CACHE_KEY" --json key --jq '.[].key' \
114+
| grep -Fxq "$CACHE_KEY"; then
115+
echo "${CACHE_DESC} present: $CACHE_KEY"
116+
exit 0
117+
fi
118+
echo "attempt $attempt: ${CACHE_DESC} not yet visible, sleeping..."
119+
sleep $((10 * attempt))
120+
done
121+
echo "::error::${CACHE_DESC} save did not take effect for key $CACHE_KEY"
122+
exit 1
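
The retry schedule of the verify step above can be sketched in Python for illustration. The function name `verify_with_backoff` and the injectable `sleep` parameter are assumptions made so the schedule can be inspected without waiting; the action itself implements the loop in bash.

```python
import time


def verify_with_backoff(is_visible, attempts=5, step_seconds=10, sleep=time.sleep):
    """Poll is_visible(); after each miss, sleep step_seconds * attempt number."""
    for attempt in range(1, attempts + 1):
        if is_visible():
            return True
        sleep(step_seconds * attempt)
    return False


# Record the delays instead of actually sleeping.
delays = []
never_visible = verify_with_backoff(lambda: False, sleep=delays.append)

# A slot that becomes visible on the third poll.
polls = iter([False, False, True])
short_delays = []
eventually_visible = verify_with_backoff(lambda: next(polls), sleep=short_delays.append)
print(never_visible, delays, eventually_visible, short_delays)
```

As in the shell loop, the sketch also sleeps after the final miss, so a failing run spends the full 150 s before reporting the error.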
