| 1 | +--- |
| 2 | +title: "AI-Assisted Engineering Harness" |
| 3 | +description: "How Atmosphere keeps prose claims honest against running code — capability snapshot, drift log, validators, and Claude Code Stop hook as a reusable feedback-loop pattern for AI-assisted projects." |
| 4 | +--- |
| 5 | + |
| 6 | +The model is the engine. The harness is the rails. |
| 7 | + |
| 8 | +This chapter documents the small instrumentation layer Atmosphere uses to |
| 9 | +keep its own engineering loop honest under heavy AI-assisted contribution. |
| 10 | +It exists because we burned ourselves enough times shipping prose that |
| 11 | +disagreed with code (capability counts off by 3, runtime lists missing |
| 12 | +adopters, "PENDING" features that shipped weeks ago) that we eventually |
| 13 | +turned the catch-and-fix protocol into running code. The pattern is small |
| 14 | +and reusable; if you maintain a project that AI agents contribute to, the |
| 15 | +shape here transfers. |
| 16 | + |
| 17 | +The framing is Justin Reock's |
| 18 | +[*AI-Assisted Engineering*](https://www.infoq.com/presentations/ai-assisted-engineering/) |
| 19 | +(InfoQ, 2026-05): the orgs that get +20% from AI have an instrumented |
| 20 | +feedback loop on claim quality; the orgs that get −20% don't. Utilization |
| 21 | +metrics ("% of code AI-authored", "AI-assisted PR count") trigger |
| 22 | +Goodhart's Law and lose validity once they become targets. The right |
| 23 | +impact metric is **change failure rate by agent claim**. |
| 24 | + |
| 25 | +## The directory shape |
| 26 | + |
| 27 | +Everything lives under `.harness/` at the repo root, plus a few scripts |
| 28 | +and a Claude Code hook. Anyone seeing the directory knows it's |
| 29 | +project-engineering plumbing, not runtime code: |
| 30 | + |
| 31 | +``` |
| 32 | +.harness/ |
| 33 | +├── README.md Operator manual for this directory |
| 34 | +├── capabilities.snapshot.json Canonical capability matrix snapshot |
| 35 | +└── drift-log.md Append-only record of caught hallucinations |
| 36 | + |
| 37 | +scripts/ |
| 38 | +├── regen-capability-snapshot.sh Re-derive snapshot from source |
| 39 | +├── validate-capability-claims.sh Pre-push gate: prose ↔ snapshot agreement |
| 40 | +└── validate-drift-log.sh Pre-push gate: append-only structural hygiene |
| 41 | +
|
| 42 | +modules/ai-test/.../CapabilitySnapshotTest.java |
| 43 | + JUnit mirror of the bash validator |
| 44 | +
|
| 45 | +.claude/ |
| 46 | +├── hooks/check-drift-log.sh Stop hook: block session-end on undocumented drift |
| 47 | +└── settings.json Project-level Claude Code hook registration |
| 48 | +``` |
| 49 | + |
| 50 | +## Capability snapshot — pin prose against running code |
| 51 | + |
| 52 | +`AiCapability` is a 20-entry Java enum |
| 53 | +([source](https://github.com/Atmosphere/atmosphere/blob/main/modules/ai/src/main/java/org/atmosphere/ai/AiCapability.java)). |
| 54 | +Each of the 9 framework runtimes overrides |
| 55 | +`AbstractAgentRuntimeContractTest.expectedCapabilities()` to declare its |
| 56 | +exact subset, and the contract test asserts the runtime's live |
| 57 | +`capabilities()` method returns the same set. That's the existing per-runtime |
| 58 | +gate — it catches code drift but doesn't catch *prose drift* in the |
| 59 | +README's count claims. |
| 60 | + |
| 61 | +The snapshot closes that gap. `scripts/regen-capability-snapshot.sh` |
| 62 | +parses `AiCapability.java` and every `*RuntimeContractTest.{java,kt}` |
| 63 | +file, then writes a deterministic JSON aggregate to |
| 64 | +`.harness/capabilities.snapshot.json`: |
| 65 | + |
| 66 | +```json |
| 67 | +{ |
| 68 | + "schema_version": 1, |
| 69 | + "capabilities": { |
| 70 | + "count": 20, |
| 71 | + "names": ["AGENT_ORCHESTRATION", "AUDIO", "BUDGET_ENFORCEMENT", ...] |
| 72 | + }, |
| 73 | + "runtimes": { |
| 74 | + "count": 9, |
| 75 | + "items": [ |
| 76 | + { "name": "AdkAgentRuntime", "module": "modules/adk", |
| 77 | + "language": "java", |
| 78 | + "expected_capabilities": ["AGENT_ORCHESTRATION", ...] }, |
| 79 | + ... |
| 80 | + ] |
| 81 | + } |
| 82 | +} |
| 83 | +``` |
| 84 | + |
| 85 | +Two enforcement points consume it: |
| 86 | + |
| 87 | +1. **`scripts/validate-capability-claims.sh`** — wired into pre-push |
| 88 | + Tier 1. Greps `modules/ai/README.md` for tight count patterns |
| 89 | + (`\bAll \d+ runtimes?\b` and similar) and asserts each match equals |
| 90 | + the snapshot count. |
| 91 | +2. **`CapabilitySnapshotTest`** in `modules/ai-test` — same logic in pure |
| 92 | + Java, so `mvn test` catches the same drift. |
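The first enforcement point can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name `check_runtime_claims` is hypothetical, the expected count is passed in as an argument rather than read from the snapshot, and the pattern is a simplified POSIX ERE form of the `\bAll \d+ runtimes?\b` pattern quoted above.

```shell
# Hypothetical sketch: compare "All N runtimes" claims in a prose file
# against an expected count. Names and pattern are illustrative only.

# check_runtime_claims FILE EXPECTED
# Prints one line per mismatching claim; returns 1 if any claim disagrees.
check_runtime_claims() {
  local file="$1" expected="$2" status=0 claims claim n
  # Collect every match, one per line; no match at all is not an error.
  claims="$(grep -oE 'All [0-9]+ runtimes?' "$file" || true)"
  while IFS= read -r claim; do
    [ -n "$claim" ] || continue
    # Keep only the digits of a match such as "All 9 runtimes".
    n="$(printf '%s' "$claim" | tr -cd '0-9')"
    if [ "$n" != "$expected" ]; then
      echo "drift: '$claim' but snapshot says $expected"
      status=1
    fi
  done <<EOF
$claims
EOF
  return "$status"
}
```

A pre-push gate would presumably call something like `check_runtime_claims modules/ai/README.md "$count"`, with `$count` taken from `.harness/capabilities.snapshot.json`, and abort the push on a non-zero return.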
| 93 | + |
| 94 | +The snapshot itself is committed; PR reviewers see "9 → 10 runtimes" as a |
| 95 | +diff hunk without grepping. The regen script sets `LC_ALL=C` so that |
| 96 | +`sort` orders names byte by byte, which for ASCII enum names matches |
| 97 | +Java's `String.compareTo`; the JSON ordering is then identical to the JUnit test's `TreeSet<String>` view. |
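To see why the locale matters, here is a small sketch of the sorting step. The helper name `to_json_array` and the sample names `TOOLS` and `TOOL_CALLING` are assumptions made for illustration; they are not taken from the regen script or the enum.

```shell
# Hypothetical helper: read one name per line on stdin and print a
# deduplicated, bytewise-sorted JSON array on a single line.
to_json_array() {
  # LC_ALL=C applies to this sort invocation only. In the C locale "S"
  # (0x53) sorts before "_" (0x5F), the same order String.compareTo gives
  # for ASCII strings. Other locales may weigh "_" differently.
  LC_ALL=C sort -u | awk '
    BEGIN { printf "[" }
    { printf "%s\"%s\"", (NR > 1 ? ", " : ""), $0 }
    END { print "]" }'
}
```

For example, `printf 'TOOL_CALLING\nAUDIO\nTOOLS\n' | to_json_array` prints `["AUDIO", "TOOLS", "TOOL_CALLING"]` on any host, which is what keeps the committed JSON stable across contributors' machines.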
| 98 | + |
| 99 | +This is structurally the same pattern |
| 100 | +[caveman's `evals/snapshots/results.json`](https://github.com/JuliusBrussee/caveman) |
| 101 | +uses for token-compression numbers — commit the snapshot to git so CI is |
| 102 | +deterministic and free, and any change is reviewable as a diff. |
| 103 | + |
| 104 | +## Drift log — record the *rate*, not just incidents |
| 105 | + |
| 106 | +`.harness/drift-log.md` is append-only. Every time a Claude session |
| 107 | +catches itself (or gets caught) saying something that disagrees with the |
| 108 | +code, the agent adds a structured row: |
| 109 | + |
| 110 | +| # | Claim | Truth | Slip path | Gate added | |
| 111 | +|---|-------|-------|-----------|------------| |
| 112 | +| N | what was stated | what the code says | how it bypassed existing gates | the regression-class fix (validator, test, memory update, prose grep) — `none` is a legitimate value | |
| 113 | + |
| 114 | +Bundling log update + gate addition + prose fix in **one commit** makes |
| 115 | +each session's impact diff-reviewable. Per Reock, the signal is the |
| 116 | +*rate* of entries over time, not the cleanliness of any single one. |
| 117 | +Don't gatekeep; better to over-record minor drift than under-record it. |
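Since the signal is the rate, it helps to be able to read it off the log. The sketch below is one possible way to do that, assuming the log groups rows under `## YYYY-MM-DD` headings and that each entry row starts with a numeric first cell; the function name `drift_rate` is hypothetical and is not one of the harness scripts.

```shell
# Hypothetical helper: print "DATE COUNT" for each dated section of a
# drift log, counting table rows whose first cell is an entry number.
drift_rate() {
  awk '
    # A dated heading opens a new section; remember first-seen order.
    /^## [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ {
      date = substr($2, 1, 10); order[++n] = date; next
    }
    # Entry rows look like "| 7 | claim | truth | ...". Header and
    # separator rows have no leading number and are skipped.
    date != "" && /^[|] *[0-9]+ *[|]/ { count[date]++ }
    END { for (i = 1; i <= n; i++) printf "%s %d\n", order[i], count[order[i]] }
  ' "$1"
}
```

Running `drift_rate .harness/drift-log.md` then yields one line per day, which can be plotted or simply eyeballed for a trend.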
| 118 | + |
| 119 | +The first 10 entries (seeded the day the log was created) record actual |
| 120 | +session events: a memory file claimed "1 Quarkus build step" when the |
| 121 | +code had 14; "PENDING" features that had shipped weeks earlier; |
| 122 | +off-by-one runtime counts in narrative prose. The 11th entry recorded a |
| 123 | +CI-caught regression where a wall-clock test asserted |
| 124 | +`observed > limit` but our scheduled-task fix made `observed == limit` a |
| 125 | +legitimate trip outcome. That entry's gate column reads "JDK 21/26 CI |
| 126 | +matrix caught it within 12 min" — which is **the most honest gate value |
| 127 | +of all**: an existing gate worked. |
| 128 | + |
| 129 | +## Two enforcement points for the drift log |
| 130 | + |
| 131 | +The log is structurally append-only. Two layers keep it that way and |
| 132 | +keep it populated: |
| 133 | + |
| 134 | +**`scripts/validate-drift-log.sh`** — pre-push Tier 1. Asserts: |
| 135 | + |
| 136 | +1. File exists and parses. |
| 137 | +2. ≥1 `## YYYY-MM-DD` section. |
| 138 | +3. No future-dated sections. |
| 139 | +4. Sections in chronological order (oldest top, newest bottom). |
| 140 | +5. Pre-existing sections (older than today) match `origin/main` verbatim. |
| 141 | + |
| 142 | +It does **not** enforce that drift gets *added* — that's the next layer's |
| 143 | +job. |
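Checks 2 through 4 can be sketched in a few lines of awk. This is an illustration under stated assumptions, not the real `validate-drift-log.sh`: the function name `validate_sections` is hypothetical, today's date is passed in as an argument so the check is repeatable (a real script would more likely use `date +%F`), and check 5 against `origin/main` is left out.

```shell
# Hypothetical sketch of the structural checks on dated sections.
# validate_sections FILE TODAY   (TODAY as YYYY-MM-DD)
# Prints one line per problem; returns non-zero if any check fails.
validate_sections() {
  awk -v today="$2" '
    /^## [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ {
      d = substr($2, 1, 10); seen++
      # ISO dates order correctly under plain string comparison.
      if (d > today) { print "future-dated section: " d; bad = 1 }
      if (d < prev)  { print "out of order: " d " after " prev; bad = 1 }
      prev = d
    }
    END {
      if (!seen) { print "no dated sections"; bad = 1 }
      exit (bad ? 1 : 0)
    }
  ' "$1"
}
```

Keeping the date as an input makes the "no future-dated sections" rule easy to exercise without waiting for the calendar.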
| 144 | + |
| 145 | +**Claude Code `Stop` hook** at `.claude/hooks/check-drift-log.sh`, |
| 146 | +registered in `.claude/settings.json`. Fires at session end: |
| 147 | + |
| 148 | +1. Reads transcript path from hook input JSON. |
| 149 | +2. Greps for high-precision drift-correction patterns: |
| 150 | + `stale memory`, `\boff-by-one\b`, |
| 151 | + `I (was wrong|claimed)…(but|actual|truth)`, |
| 152 | + `memor… was/is wrong/stale/out of date`, |
| 153 | + `fabricated rule/stat/count/claim`, |
| 154 | + `verified by grep…disagree/contradict/wrong/stale`. |
| 155 | +3. If matched **and** `.harness/drift-log.md` was not modified this |
| 156 | + session (working tree, untracked, or last 3 commits), emits |
| 157 | + `{"decision": "block", "reason": "..."}` to force the agent to |
| 158 | + either append an entry or explicitly state the correction was |
| 159 | + trivial. |
| 160 | +4. `stop_hook_active=true` short-circuits to no-op so deliberate skips |
| 161 | + don't loop. |
| 162 | + |
| 163 | +Patterns are deliberately narrow to minimize false positives. If a |
| 164 | +recurring real correction shape isn't matching, add a new pattern with |
| 165 | +concrete real-session evidence — don't loosen existing ones. |
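The decision logic of the Stop hook can be sketched as below. Several things here are assumptions: the function names are hypothetical, only three simplified patterns are shown, the transcript path and the `stop_hook_active` flag are passed as arguments (the real hook reads them from the hook input JSON on stdin), and the "last 3 commits" check is omitted.

```shell
# Hypothetical sketch of the Stop hook's decision logic.

# has_drift_language TRANSCRIPT: succeed if any narrow pattern matches.
has_drift_language() {
  grep -qiE 'stale memory|off-by-one|fabricated (rule|stat|count|claim)' "$1"
}

# log_touched: succeed if the drift log has uncommitted or untracked changes.
log_touched() {
  [ -n "$(git status --porcelain -- .harness/drift-log.md 2>/dev/null)" ]
}

# stop_decision TRANSCRIPT LOG_TOUCHED(yes|no) STOP_HOOK_ACTIVE(true|false)
# Prints a block decision only when drift language is present, the log is
# untouched, and the hook has not already re-engaged the agent once.
stop_decision() {
  local transcript="$1" touched="$2" active="$3"
  if [ "$active" = "true" ]; then
    return 0   # deliberate skip: stay silent so the session can end
  fi
  if has_drift_language "$transcript" && [ "$touched" = "no" ]; then
    printf '{"decision": "block", "reason": "%s"}\n' \
      "Drift language found; append to .harness/drift-log.md or state the correction was trivial."
  fi
}
```

Separating the pure decision from the git and stdin plumbing keeps the pattern list easy to test against saved transcripts before a new pattern is added.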
| 166 | + |
| 167 | +## What this looks like in practice |
| 168 | + |
| 169 | +A typical session might go: |
| 170 | + |
| 171 | +1. Claude claims "X is shipped" based on a 30-day-old memory file. |
| 172 | +2. ChefFamille (or `git grep` self-catch) says "verified by grep — that |
| 173 | + class doesn't exist on `main`". |
| 174 | +3. Claude reads the actual source, confirms the drift. |
| 175 | +4. Claude appends an entry to `.harness/drift-log.md` documenting the |
| 176 | + claim, truth, slip path, and what gate was added. |
| 177 | +5. Claude bundles the log entry + any prose fix + the gate (e.g., a |
| 178 | + regex pattern in `validate-capability-claims.sh`) into one commit. |
| 179 | +6. Pre-push Tier 1 runs both validators in <1s; the push goes through. |
| 180 | +7. At session end the Stop hook checks the transcript: drift language |
| 181 | + present, log file modified, no block. |
| 182 | + |
| 183 | +Without the hook, session 2 of the same day forgets and makes the same |
| 184 | +class of claim again. With the hook, the agent is re-engaged before the |
| 185 | +session can end, and either logs or explicitly states "trivial — not |
| 186 | +worth logging" (the hook then no-ops via `stop_hook_active`). |
| 187 | + |
| 188 | +## What this is *not* |
| 189 | + |
| 190 | +- **Not a replacement for code review.** The validators only check |
| 191 | + prose-vs-snapshot agreement and structural hygiene. They don't catch |
| 192 | + semantic bugs, performance regressions, or architectural mistakes. |
| 193 | +- **Not a utilization metric.** We don't count "% of commits AI-authored" |
| 194 | + or "tokens spent per feature". Those measures invite Goodhart's Law. |
| 195 | +- **Not a substitute for verification at session start.** The |
| 196 | + `feedback_drift_log.md` memory rule says: re-verify against current |
| 197 | + code before quoting any memory file older than the most recent |
| 198 | + CHANGELOG bump. The drift log records what slipped past that rule; |
| 199 | + the rule itself is the primary defense. |
| 200 | + |
| 201 | +## Adopting the pattern in your project |
| 202 | + |
| 203 | +The shape is small enough to copy. Concretely, for a project that AI |
| 204 | +agents contribute to: |
| 205 | + |
| 206 | +1. **Pick one or two count claims you make in your README that have |
| 207 | + gone wrong before.** Runtime count, capability count, sample count, |
| 208 | + backend count — anything quantitative that you've shipped wrong. |
| 209 | +2. **Build a snapshot** parsed from canonical source. JSON, committed |
| 210 | + to git, regenerated by a single shell script. Add `LC_ALL=C` so |
| 211 | + sort is deterministic across hosts. |
| 212 | +3. **Add one validator** that greps your README for those count claims |
| 213 | + and asserts against the snapshot. Wire it into your pre-push hook. |
| 214 | +4. **Add an append-only drift log** with one row per caught |
| 215 | + hallucination. Don't stress about the schema — `claim`, `truth`, |
| 216 | + `slip path`, `gate` is enough. |
| 217 | +5. **Add a Claude Code Stop hook** (or your agent runtime's equivalent) |
| 218 | + that greps the transcript for drift-correction language and blocks |
| 219 | + session end if the log wasn't updated. Use narrow patterns; broad |
| 220 | + patterns cause false-positive loops. |
| 221 | + |
| 222 | +That's the whole pattern. Roughly 500 lines of bash + 250 lines of Java |
| 223 | +in our case. The lower bound for any project is the snapshot plus one |
| 224 | +validator, maybe 100 lines, which already makes every count change reviewable as a diff. |
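The pre-push wiring from step 3 can be as small as the sketch below. The helper name `run_gates` is hypothetical; the idea it illustrates is to run every gate and report all failures at once rather than stopping at the first one.

```shell
# Hypothetical pre-push helper: run each gate command in turn, report
# every failure on stderr, and return non-zero if any gate failed.
run_gates() {
  local failed=0 gate
  for gate in "$@"; do
    if ! "$gate"; then
      echo "pre-push gate failed: $gate" >&2
      failed=1
    fi
  done
  return "$failed"
}

# In .git/hooks/pre-push one might then call, using the script names
# from this chapter:
#   run_gates scripts/validate-capability-claims.sh \
#             scripts/validate-drift-log.sh || exit 1
```

Running all gates before failing means a single push attempt surfaces both a stale count claim and a malformed drift log.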
| 225 | + |
| 226 | +## Further reading |
| 227 | + |
| 228 | +- Justin Reock, *AI-Assisted Engineering* — |
| 229 | + [InfoQ talk](https://www.infoq.com/presentations/ai-assisted-engineering/), |
| 230 | + 2026-05. The DX measurement framework (utilization vs. impact vs. cost) |
| 231 | + and the Goodhart's Law warning. |
| 232 | +- [`walkinglabs/learn-harness-engineering`](https://github.com/walkinglabs/learn-harness-engineering) |
| 233 | + — the five-subsystem framework (Instructions, State, Verification, |
| 234 | + Scope, Lifecycle). Treats the harness as engineering work rather than |
| 235 | + configuration. |
| 236 | +- [`JuliusBrussee/caveman`](https://github.com/JuliusBrussee/caveman) — |
| 237 | + the snapshot-as-source-of-truth pattern with a three-arm |
| 238 | + baseline/control/treatment eval methodology. Inspired the |
| 239 | + diff-reviewable shape of `capabilities.snapshot.json`. |
| 240 | +- Atmosphere's |
| 241 | + [`.harness/README.md`](https://github.com/Atmosphere/atmosphere/blob/main/.harness/README.md) |
| 242 | + — operator manual for the directory. |