|
| 1 | +# Coverage-based test selection, done right |
| 2 | + |
| 3 | +**Status:** Design / awaiting review |
| 4 | +**Date:** 2026-05-29 |
| 5 | +**Supersedes:** the removed gcov coverage-cache (`--only-changes` / `--build-coverage-cache`), deleted in MFlowCode/MFC#1460. |
| 6 | + |
| 7 | +## Problem |
| 8 | + |
| 9 | +CI turnaround is dominated by running the full ~610-test suite on every PR, across a slow self-hosted compiler matrix. We want to run only the tests a change can actually affect — but the previous attempt was unreliable and was removed. The post-mortem (see #1460) found the *idea* (execution-based selection) was sound; the *implementation* rotted: |
| 10 | + |
| 11 | +- keyed on `UUID = CRC32(trace)`, so any `cases.py` edit silently orphaned entries; |
| 12 | +- the committed cache shipped ~3 months stale and nothing ever auto-refreshed it; |
| 13 | +- every failure path silently fell back to the full suite, so breakage was invisible; |
| 14 | +- `git merge-base` failed on shallow self-hosted clones, defeating selection exactly on the slow jobs. |
| 15 | + |
| 16 | +We also established that **only execution-based selection is sound**: static feature/label maps miss transitive side effects (a change to a shared module silently alters behavior of a load-bearing module like `m_bubbles.fpp`), and a static `use`/`#:include` dependency graph over-approximates to "all tests" because linking ≠ execution. Therefore selection must use real per-test execution coverage (gcov), and the engineering challenge is making the freshness and reliability *trustworthy*. |
| 17 | + |
| 18 | +## Invariants (the north stars) |
| 19 | + |
| 20 | +Every design choice serves these two rules: |
| 21 | + |
| 22 | +1. **Soundness — the selector may run MORE tests than necessary, never FEWER.** Under-inclusion is structurally impossible; every uncertainty resolves to "run it." Only over-inclusion (wasted time) is permitted. |
| 23 | +2. **Anti-rot — staleness is always LOUD, never silent.** No path silently degrades to the full suite without surfacing it. A stale or under-covering map is impossible to *not* notice. |
| 24 | + |
| 25 | +## Non-goals |
| 26 | + |
| 27 | +- Changing the golden-file identity model (`UUID = CRC32(trace)`); coverage keying is fully decoupled from it. |
| 28 | +- Line-level coverage. File-level (`.fpp`) is sufficient and far cheaper to collect and store. |
| 29 | +- Selection on GPU/Cray-specific behavior — the map is CPU/gfortran-only; GPU paths are handled by the conservative ladder + the master backstop, not by selection. |
| 30 | +- Eliminating the cost of refreshing the map: refreshing inherently runs the full suite under instrumentation. We amortize it; we do not avoid it. |
| 31 | + |
| 32 | +## Architecture |
| 33 | + |
| 34 | +### Components |
| 35 | + |
| 36 | +| Component | Responsibility | |
| 37 | +|---|---| |
| 38 | +| **Coverage map** (`tests/coverage_map.json.gz`, committed) | The single authoritative artifact: `param_hash → covered .fpp files`. Its git history *is* the freshness record. | |
| 39 | +| **Map builder** (`./mfc.sh test --build-coverage-map`) | Resurrected gcov collection from the deleted `coverage.py`, re-keyed by `param_hash`. | |
| 40 | +| **Selector** (`./mfc.sh test --only-changes`) | Applies the conservative ladder to choose which tests run; emits a visible summary. | |
| 41 | +| **Refresh workflow** (master) | Rebuilds the map on a cadence + relevant triggers and commits it back to master via a bot. | |
| 42 | +| **Map-health workflow** (master, scheduled) | Fails LOUD (red + notify) if the map is stale or under-covers. The anti-rot enforcement, deliberately kept OFF the PR path so it can never wedge a PR. | |
| 43 | + |
| 44 | +### Data model |
| 45 | + |
| 46 | +The map is gzipped JSON: |
| 47 | + |
| 48 | +```json |
| 49 | +{ |
| 50 | + "<param_hash>": ["src/simulation/m_rhs.fpp", "src/common/m_helper.fpp", "..."], |
| 51 | + "_meta": { "built_at": "<ISO-8601 UTC>", "git_sha": "<sha>", "gfortran_version": "<str>", "n_tests": 610 } |
| 52 | +} |
| 53 | +``` |
| 54 | + |
| 55 | +**`param_hash`** = SHA-256 over the test's *canonical* defining param dict — sorted keys, normalized float repr — as produced by `cases.py`, **excluding** coverage-run overrides (`t_step_stop=1`, etc.). Properties this gives us: |
| 56 | + |
| 57 | +- Cosmetic trace edits (renaming/reordering a `CaseGeneratorStack` segment) leave params unchanged → **same key** (robust to the thing that rotted the old system). |
| 58 | +- A real param change (the thing that changes behavior) → **new key** → not in map → run it (correct *and* conservative). |
| 59 | +- Coverage keying is independent of the golden-file UUID. |
| 60 | + |
| 61 | +### The conservative ladder (soundness, made concrete) |
| 62 | + |
| 63 | +A test runs if **any** rung holds; rungs are checked in order, each resolving uncertainty toward "run": |
| 64 | + |
| 65 | +1. **Changed-file list unavailable** → run **ALL** (loud annotation). |
| 66 | +2. A changed file is in **`ALWAYS_RUN_ALL`**: the GPU/Fypp macro & include files (`src/common/include/**`), `CMakeLists.txt`, `toolchain/cmake/**`, the parameter codegen (`toolchain/mfc/params/**`), and build/run scripts → run **ALL**. (Note: ordinary `src/common/**.fpp` *modules* are NOT here — they are coverage-tracked and selectable across all three targets, which is the whole point; only the macro/codegen/build inputs that the CPU map can't attribute line-for-line are forced to run all.) |
| 67 | +3. A changed **`.f90`/`.f`** under `src/` (the map tracks `.fpp` only) → run **ALL**. |
| 68 | +4. A changed **`.fpp` covered by *zero* tests** in the map → run **ALL** (loud). This single rung closes the GPU blind spot: a file compiled only under `#ifdef MFC_OpenACC` is never covered by the CPU map, so a change to it forces a full run instead of silently selecting nothing. |
| 69 | +5. A test's **`param_hash` is absent** from the map (new/changed test) → run **it**. |
| 70 | +6. A test's covered files **∩ changed `.fpp` ≠ ∅** → run **it**. |
| 71 | +7. Otherwise → skip. |
| 72 | + |
| 73 | +The only residual gap is one shared by *all* coverage-based test-impact-analysis: a change that newly makes test *T* reach file *Y* when *T* did not previously execute the edited file. This is caught by the **master full-suite backstop** below. |
| 74 | + |
| 75 | +### Gate model |
| 76 | + |
| 77 | +- **PR required check = the selected subset** (fast). |
| 78 | +- **Master push runs the FULL suite** — the backstop for (a) the reachability-change gap above and (b) GPU/Cray paths the CPU map cannot see. |
| 79 | +- A rare escape is caught on master and forward-fixed. (Chosen explicitly over a full-suite merge gate, for maximum PR-time savings.) |
| 80 | + |
| 81 | +### Changed-file detection |
| 82 | + |
| 83 | +The selector consumes the changed-file list from CI's existing `dorny/paths-filter` step (already computed reliably) rather than `git merge-base` on a shallow runner — eliminating the old shallow-clone failure. For **local** runs it uses `git diff` against the merge-base with a self-healing `git fetch --deepen` retry; if that still fails, it runs ALL (rung 1), loudly. |
| 84 | + |
| 85 | +## Data flows |
| 86 | + |
| 87 | +**Refresh (on master):** |
| 88 | +``` |
| 89 | +trigger (weekly floor | push touching cases.py or use/#:include graph) |
| 90 | + → build --gcov → test --build-coverage-map (full suite under gcov, per-test GCOV_PREFIX isolation) |
| 91 | + → write tests/coverage_map.json.gz → bot commits to master (guarded from re-triggering the test matrix) |
| 92 | +``` |
| 93 | + |
| 94 | +**PR selection:** |
| 95 | +``` |
| 96 | +paths-filter → changed files |
| 97 | + → load committed map (loud if missing/corrupt → run ALL, annotated) |
| 98 | + → conservative ladder → selected subset |
| 99 | + → emit summary (ran N/M; map age; rung reasons) |
| 100 | +``` |
| 101 | + |
| 102 | +## Observability |
| 103 | + |
| 104 | +Every PR test job prints a one-line summary to the job log/summary, e.g.: |
| 105 | +``` |
| 106 | +Coverage selection: ran 47/610 tests · map age 2d · always-run rungs: none · skipped 563 (no overlap) |
| 107 | +``` |
| 108 | +This makes per-PR selection behavior visible at a glance. Breakage shows up as "ran 610/610 (rung 1: changed-file list unavailable)" rather than a silent full run. |
| 109 | + |
| 110 | +## Anti-rot enforcement |
| 111 | + |
| 112 | +The **map-health workflow** runs on a schedule on master and **fails red + notifies** if either: |
| 113 | +- `_meta.built_at` is older than the **staleness threshold**, or |
| 114 | +- the map covers fewer than the **minimum-coverage fraction** of current `param_hash`es (too many conservative-includes ⇒ selection is degrading). |
| 115 | + |
| 116 | +It is intentionally NOT a PR-required check, so a refresh outage can never block unrelated PRs — PRs degrade to run-all *visibly* instead. Rot surfaces as a red master workflow + notification. |
| 117 | + |
| 118 | +## CI integration summary |
| 119 | + |
| 120 | +- `test.yml` PR jobs: pass the paths-filter changed-file list to `./mfc.sh test --only-changes`; selected subset is the required check. |
| 121 | +- `test.yml` master push: full suite (unchanged behavior = backstop). |
| 122 | +- New `coverage-refresh.yml` (master): rebuild + bot-commit the map. Guard the map-only commit from re-triggering the full matrix (`paths-ignore` on the map file and/or a `[skip ci]` convention). |
| 123 | +- New `coverage-health.yml` (scheduled): the loud anti-rot monitor. |
| 124 | + |
| 125 | +## Build sequence |
| 126 | + |
| 127 | +1. **Resurrect + re-key the collector** by `param_hash`. Unit test: cosmetic trace edits keep the key; a param change alters it; collection is deterministic. |
| 128 | +2. **Selector + observability**, landed **disabled (shadow mode)**: it computes and prints what it *would* select but the suite still runs in full. This yields real evidence that selection never under-selects, at zero risk. |
| 129 | +3. **Map builder + first committed map**, built on master. |
| 130 | +4. **Refresh workflow** (bot commits) + **map-health workflow** (loud monitor). |
| 131 | +5. **Flip the selector on** as the PR gate once shadow data shows it never under-selects. |
| 132 | + |
| 133 | +Shadow mode (step 2) is the safety valve: selection proves itself against reality before it gates anything. |
| 134 | + |
| 135 | +## Tunable parameters (defaults; confirm during review) |
| 136 | + |
| 137 | +- **Refresh cadence:** weekly floor (Mon 06:00 UTC) **plus** master pushes touching `cases.py` or the `use`/`#:include` dependency graph. |
| 138 | +- **Staleness threshold:** 10 days (3 days' grace beyond the weekly refresh). |
| 139 | +- **Minimum-coverage fraction:** 80% of current `param_hash`es present in the map. |
| 140 | +- **Map storage:** committed at `tests/coverage_map.json.gz`, refreshed by bot push to master (matches the existing homebrew-release push pattern). Alternative if master-history churn is unwanted: a dedicated `coverage-map` branch or a release asset. |
| 141 | + |
| 142 | +## Risks and mitigations |
| 143 | + |
| 144 | +- **Refresh cost (full gcov run, ~hours on Phoenix).** Inherent to coverage; mitigated by amortizing (weekly + targeted triggers, never per-PR) and made visible by the health workflow. |
| 145 | +- **CPU-only map misses GPU-only files.** Mitigated by ladder rung 4 (zero-coverage changed file → run all) + the master backstop. |
| 146 | +- **Reachability-change gap.** Mitigated by the master full-suite backstop; inherent to all coverage TIA. |
| 147 | +- **Bot push to master.** Uses the existing CI-push pattern; guarded against CI re-trigger loops. |
0 commit comments