Skip to content

Commit bfcd0de

Browse files
committed
docs: design spec for coverage-based test selection (done right)
1 parent 574e53d commit bfcd0de

1 file changed

Lines changed: 147 additions & 0 deletions

File tree

Lines changed: 147 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,147 @@
1+
# Coverage-based test selection, done right
2+
3+
**Status:** Design / awaiting review
4+
**Date:** 2026-05-29
5+
**Supersedes:** the removed gcov coverage-cache (`--only-changes` / `--build-coverage-cache`), deleted in MFlowCode/MFC#1460.
6+
7+
## Problem
8+
9+
CI turnaround is dominated by running the full ~610-test suite on every PR, across a slow self-hosted compiler matrix. We want to run only the tests a change can actually affect — but the previous attempt was unreliable and was removed. The post-mortem (see #1460) found the *idea* (execution-based selection) was sound; the *implementation* rotted:
10+
11+
- keyed on `UUID = CRC32(trace)`, so any `cases.py` edit silently orphaned entries;
12+
- the committed cache shipped ~3 months stale and nothing ever auto-refreshed it;
13+
- every failure path silently fell back to the full suite, so breakage was invisible;
14+
- `git merge-base` failed on shallow self-hosted clones, defeating selection exactly on the slow jobs.
15+
16+
We also established that **only execution-based selection is sound**: static feature/label maps miss transitive side effects (a change to a shared module silently alters behavior of a load-bearing module like `m_bubbles.fpp`), and a static `use`/`#:include` dependency graph over-approximates to "all tests" because linking ≠ execution. Therefore selection must use real per-test execution coverage (gcov), and the engineering challenge is making the freshness and reliability *trustworthy*.
17+
18+
## Invariants (the north stars)
19+
20+
Every design choice serves these two rules:
21+
22+
1. **Soundness — the selector may run MORE tests than necessary, never FEWER.** Under-inclusion is structurally impossible; every uncertainty resolves to "run it." Only over-inclusion (wasted time) is permitted.
23+
2. **Anti-rot — staleness is always LOUD, never silent.** No path silently degrades to the full suite without surfacing it. A stale or under-covering map is impossible to *not* notice.
24+
25+
## Non-goals
26+
27+
- Changing the golden-file identity model (`UUID = CRC32(trace)`); coverage keying is fully decoupled from it.
28+
- Line-level coverage. File-level (`.fpp`) is sufficient and far cheaper to collect and store.
29+
- Selection on GPU/Cray-specific behavior — the map is CPU/gfortran-only; GPU paths are handled by the conservative ladder + the master backstop, not by selection.
30+
- Eliminating the cost of refreshing the map: refreshing inherently runs the full suite under instrumentation. We amortize it; we do not avoid it.
31+
32+
## Architecture
33+
34+
### Components
35+
36+
| Component | Responsibility |
37+
|---|---|
38+
| **Coverage map** (`tests/coverage_map.json.gz`, committed) | The single authoritative artifact: `param_hash → covered .fpp files`. Its git history *is* the freshness record. |
39+
| **Map builder** (`./mfc.sh test --build-coverage-map`) | Resurrected gcov collection from the deleted `coverage.py`, re-keyed by `param_hash`. |
40+
| **Selector** (`./mfc.sh test --only-changes`) | Applies the conservative ladder to choose which tests run; emits a visible summary. |
41+
| **Refresh workflow** (master) | Rebuilds the map on a cadence + relevant triggers and commits it back to master via a bot. |
42+
| **Map-health workflow** (master, scheduled) | Fails LOUD (red + notify) if the map is stale or under-covers. The anti-rot enforcement, deliberately kept OFF the PR path so it can never wedge a PR. |
43+
44+
### Data model
45+
46+
The map is gzipped JSON:
47+
48+
```json
49+
{
50+
"<param_hash>": ["src/simulation/m_rhs.fpp", "src/common/m_helper.fpp", "..."],
51+
"_meta": { "built_at": "<ISO-8601 UTC>", "git_sha": "<sha>", "gfortran_version": "<str>", "n_tests": 610 }
52+
}
53+
```
54+
55+
**`param_hash`** = SHA-256 over the test's *canonical* defining param dict — sorted keys, normalized float repr — as produced by `cases.py`, **excluding** coverage-run overrides (`t_step_stop=1`, etc.). Properties this gives us:
56+
57+
- Cosmetic trace edits (renaming/reordering a `CaseGeneratorStack` segment) leave params unchanged → **same key** (robust to the thing that rotted the old system).
58+
- A real param change (the thing that changes behavior) → **new key** → not in map → run it (correct *and* conservative).
59+
- Coverage keying is independent of the golden-file UUID.
60+
61+
### The conservative ladder (soundness, made concrete)
62+
63+
A test runs if **any** rung holds; rungs are checked in order, each resolving uncertainty toward "run":
64+
65+
1. **Changed-file list unavailable** → run **ALL** (loud annotation).
66+
2. A changed file is in **`ALWAYS_RUN_ALL`**: the GPU/Fypp macro & include files (`src/common/include/**`), `CMakeLists.txt`, `toolchain/cmake/**`, the parameter codegen (`toolchain/mfc/params/**`), and build/run scripts → run **ALL**. (Note: ordinary `src/common/**.fpp` *modules* are NOT here — they are coverage-tracked and selectable across all three targets, which is the whole point; only the macro/codegen/build inputs that the CPU map can't attribute line-for-line are forced to run all.)
67+
3. A changed **`.f90`/`.f`** under `src/` (the map tracks `.fpp` only) → run **ALL**.
68+
4. A changed **`.fpp` covered by *zero* tests** in the map → run **ALL** (loud). This single rung closes the GPU blind spot: a file compiled only under `#ifdef MFC_OpenACC` is never covered by the CPU map, so a change to it forces a full run instead of silently selecting nothing.
69+
5. A test's **`param_hash` is absent** from the map (new/changed test) → run **it**.
70+
6. A test's covered files **∩ changed `.fpp` ≠ ∅** → run **it**.
71+
7. Otherwise → skip.
72+
73+
The only residual gap is one shared by *all* coverage-based test-impact-analysis: a change that newly makes test *T* reach file *Y* when *T* did not previously execute the edited file. This is caught by the **master full-suite backstop** below.
74+
75+
### Gate model
76+
77+
- **PR required check = the selected subset** (fast).
78+
- **Master push runs the FULL suite** — the backstop for (a) the reachability-change gap above and (b) GPU/Cray paths the CPU map cannot see.
79+
- A rare escape is caught on master and forward-fixed. (Chosen explicitly over a full-suite merge gate, for maximum PR-time savings.)
80+
81+
### Changed-file detection
82+
83+
The selector consumes the changed-file list from CI's existing `dorny/paths-filter` step (already computed reliably) rather than `git merge-base` on a shallow runner — eliminating the old shallow-clone failure. For **local** runs it uses `git diff` against the merge-base with a self-healing `git fetch --deepen` retry; if that still fails, it runs ALL (rung 1), loudly.
84+
85+
## Data flows
86+
87+
**Refresh (on master):**
88+
```
89+
trigger (weekly floor | push touching cases.py or use/#:include graph)
90+
→ build --gcov → test --build-coverage-map (full suite under gcov, per-test GCOV_PREFIX isolation)
91+
→ write tests/coverage_map.json.gz → bot commits to master (guarded from re-triggering the test matrix)
92+
```
93+
94+
**PR selection:**
95+
```
96+
paths-filter → changed files
97+
→ load committed map (loud if missing/corrupt → run ALL, annotated)
98+
→ conservative ladder → selected subset
99+
→ emit summary (ran N/M; map age; rung reasons)
100+
```
101+
102+
## Observability
103+
104+
Every PR test job prints a one-line summary to the job log/summary, e.g.:
105+
```
106+
Coverage selection: ran 47/610 tests · map age 2d · always-run rungs: none · skipped 563 (no overlap)
107+
```
108+
This makes per-PR selection behavior visible at a glance. Breakage shows up as "ran 610/610 (rung 1: changed-file list unavailable)" rather than a silent full run.
109+
110+
## Anti-rot enforcement
111+
112+
The **map-health workflow** runs on a schedule on master and **fails red + notifies** if either:
113+
- `_meta.built_at` is older than the **staleness threshold**, or
114+
- the map covers fewer than the **minimum-coverage fraction** of current `param_hash`es (too many conservative-includes ⇒ selection is degrading).
115+
116+
It is intentionally NOT a PR-required check, so a refresh outage can never block unrelated PRs — PRs degrade to run-all *visibly* instead. Rot surfaces as a red master workflow + notification.
117+
118+
## CI integration summary
119+
120+
- `test.yml` PR jobs: pass the paths-filter changed-file list to `./mfc.sh test --only-changes`; selected subset is the required check.
121+
- `test.yml` master push: full suite (unchanged behavior = backstop).
122+
- New `coverage-refresh.yml` (master): rebuild + bot-commit the map. Guard the map-only commit from re-triggering the full matrix (`paths-ignore` on the map file and/or a `[skip ci]` convention).
123+
- New `coverage-health.yml` (scheduled): the loud anti-rot monitor.
124+
125+
## Build sequence
126+
127+
1. **Resurrect + re-key the collector** by `param_hash`. Unit test: cosmetic trace edits keep the key; a param change alters it; collection is deterministic.
128+
2. **Selector + observability**, landed **disabled (shadow mode)**: it computes and prints what it *would* select but the suite still runs in full. This yields real evidence that selection never under-selects, at zero risk.
129+
3. **Map builder + first committed map**, built on master.
130+
4. **Refresh workflow** (bot commits) + **map-health workflow** (loud monitor).
131+
5. **Flip the selector on** as the PR gate once shadow data shows it never under-selects.
132+
133+
Shadow mode (step 2) is the safety valve: selection proves itself against reality before it gates anything.
134+
135+
## Tunable parameters (defaults; confirm during review)
136+
137+
- **Refresh cadence:** weekly floor (Mon 06:00 UTC) **plus** master pushes touching `cases.py` or the `use`/`#:include` dependency graph.
138+
- **Staleness threshold:** 10 days (3 days' grace beyond the weekly refresh).
139+
- **Minimum-coverage fraction:** 80% of current `param_hash`es present in the map.
140+
- **Map storage:** committed at `tests/coverage_map.json.gz`, refreshed by bot push to master (matches the existing homebrew-release push pattern). Alternative if master-history churn is unwanted: a dedicated `coverage-map` branch or a release asset.
141+
142+
## Risks and mitigations
143+
144+
- **Refresh cost (full gcov run, ~hours on Phoenix).** Inherent to coverage; mitigated by amortizing (weekly + targeted triggers, never per-PR) and made visible by the health workflow.
145+
- **CPU-only map misses GPU-only files.** Mitigated by ladder rung 4 (zero-coverage changed file → run all) + the master backstop.
146+
- **Reachability-change gap.** Mitigated by the master full-suite backstop; inherent to all coverage TIA.
147+
- **Bot push to master.** Uses the existing CI-push pattern; guarded against CI re-trigger loops.

0 commit comments

Comments
 (0)