Commit 7e266b6

docs(tutorial): add chapter 34 — AI-Assisted Engineering Harness
Public-facing chapter on Atmosphere's instrumentation for keeping prose claims honest against running code: capability snapshot + drift log + pre-push validators + Claude Code Stop hook. Cites Reock's InfoQ talk, walkinglabs/learn-harness-engineering, and juliusbrussee/caveman as framing sources.
1 parent 13fe8c4 commit 7e266b6

2 files changed

Lines changed: 243 additions & 0 deletions

File tree

docs/astro.config.mjs

Lines changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ export default defineConfig({
      { label: '@AgentScope & Goal-Hijacking', slug: 'tutorial/31-agent-scope' },
      { label: 'OWASP Agentic Top-10 Matrix', slug: 'tutorial/32-owasp-agentic-matrix' },
      { label: 'Plan-and-Verify', slug: 'tutorial/33-plan-and-verify' },
+     { label: 'AI-Assisted Engineering Harness', slug: 'tutorial/34-harness-engineering' },
      { label: 'Migration 2.x → 4.0', slug: 'tutorial/22-migration' },
    ],
  },
Lines changed: 242 additions & 0 deletions
@@ -0,0 +1,242 @@
---
title: "AI-Assisted Engineering Harness"
description: "How Atmosphere keeps prose claims honest against running code — capability snapshot, drift log, validators, and Claude Code Stop hook as a reusable feedback-loop pattern for AI-assisted projects."
---

The model is the engine. The harness is the rails.

This chapter documents the small instrumentation layer Atmosphere uses to
keep its own engineering loop honest under heavy AI-assisted contribution.
It exists because we burned ourselves enough times shipping prose that
disagreed with code (capability counts off by 3, runtime lists missing
adopters, "PENDING" features that shipped weeks ago) that we eventually
turned the catch-and-fix protocol into running code. The pattern is small
and reusable; if you maintain a project that AI agents contribute to, the
shape here transfers.

The framing is Justin Reock's
[*AI-Assisted Engineering*](https://www.infoq.com/presentations/ai-assisted-engineering/)
(InfoQ, 2026-05): the orgs that get +20% from AI have an instrumented
feedback loop on claim quality; the orgs that get −20% don't. Utilization
metrics ("% of code AI-authored", "AI-assisted PR count") trigger
Goodhart's Law and lose validity once they become targets. The right
impact metric is **change failure rate by agent claim**.

## The directory shape

Everything lives under `.harness/` at the repo root, plus a few scripts
and a Claude Code hook. Anyone seeing the directory knows it's
project-engineering plumbing, not runtime code:

```
.harness/
├── README.md                      Operator manual for this directory
├── capabilities.snapshot.json     Canonical capability matrix snapshot
└── drift-log.md                   Append-only record of caught hallucinations

scripts/
├── regen-capability-snapshot.sh   Re-derive snapshot from source
├── validate-capability-claims.sh  Pre-push gate: prose ↔ snapshot agreement
└── validate-drift-log.sh          Pre-push gate: append-only structural hygiene

modules/ai-test/.../CapabilitySnapshotTest.java
                                   JUnit mirror of the bash validator

.claude/
├── hooks/check-drift-log.sh       Stop hook: block session-end on undocumented drift
└── settings.json                  Project-level Claude Code hook registration
```

## Capability snapshot — pin prose against running code

`AiCapability` is a 20-entry Java enum
([source](https://github.com/Atmosphere/atmosphere/blob/main/modules/ai/src/main/java/org/atmosphere/ai/AiCapability.java)).
Each of the 9 framework runtimes overrides
`AbstractAgentRuntimeContractTest.expectedCapabilities()` to declare its
exact subset, and the contract test asserts that the runtime's live
`capabilities()` method returns the same set. That's the existing per-runtime
gate — it catches code drift but doesn't catch *prose drift* in the
README's count claims.

The snapshot closes that gap. `scripts/regen-capability-snapshot.sh`
parses `AiCapability.java` and every `*RuntimeContractTest.{java,kt}`
file, then writes a deterministic JSON aggregate to
`.harness/capabilities.snapshot.json`:

```json
{
  "schema_version": 1,
  "capabilities": {
    "count": 20,
    "names": ["AGENT_ORCHESTRATION", "AUDIO", "BUDGET_ENFORCEMENT", ...]
  },
  "runtimes": {
    "count": 9,
    "items": [
      { "name": "AdkAgentRuntime", "module": "modules/adk",
        "language": "java",
        "expected_capabilities": ["AGENT_ORCHESTRATION", ...] },
      ...
    ]
  }
}
```

Two enforcement points consume it:

1. **`scripts/validate-capability-claims.sh`** — wired into pre-push
   Tier 1. Greps `modules/ai/README.md` for tight count patterns
   (`\bAll \d+ runtimes?\b` and similar) and asserts each match equals
   the snapshot count.
2. **`CapabilitySnapshotTest`** in `modules/ai-test` — same logic in pure
   Java, so `mvn test` catches the same drift.

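The core of such a prose-vs-snapshot check fits in a few lines of portable
shell. This is a minimal sketch, not Atmosphere's actual script: the
function name, scratch file paths, and the single `All N runtimes` pattern
are illustrative.

```shell
#!/usr/bin/env sh
# Sketch of a count-claim validator (names and the single pattern are
# illustrative; the real validate-capability-claims.sh checks more families).

# validate_counts EXPECTED README
# Succeeds iff every "All <N> runtime(s)" claim in README equals EXPECTED.
validate_counts() {
  expected="$1"; readme="$2"
  # Pull each numeric claim out of the prose; anything != expected is drift.
  bad=$(grep -oE 'All [0-9]+ runtimes?' "$readme" \
          | grep -oE '[0-9]+' \
          | grep -vx "$expected" || true)
  if [ -n "$bad" ]; then
    echo "DRIFT: prose claims [$bad] runtimes but snapshot says $expected" >&2
    return 1
  fi
}

# Demo against scratch files.
printf 'All 9 runtimes pass the contract test.\n' > /tmp/readme-ok.md
printf 'All 8 runtimes pass the contract test.\n' > /tmp/readme-stale.md
validate_counts 9 /tmp/readme-ok.md    && echo "ok: counts agree"
validate_counts 9 /tmp/readme-stale.md || echo "caught: stale count"
```

Keeping the check in a function like this also makes it trivial to exercise
from a test harness without touching the real README.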
The snapshot itself is committed; PR reviewers see "9 → 10 runtimes" as a
diff hunk without grepping. Forcing `LC_ALL=C` in the regen script ensures
bash's `sort` matches Java's `String.compareTo`, so the JSON ordering is
identical to the JUnit test's `TreeSet<String>` view.

This is structurally the same pattern
[caveman's `evals/snapshots/results.json`](https://github.com/JuliusBrussee/caveman)
uses for token-compression numbers — commit the snapshot to git so CI is
deterministic and free, and any change is reviewable as a diff.

## Drift log — record the *rate*, not just incidents

`.harness/drift-log.md` is append-only. Every time a Claude session
catches itself (or gets caught) saying something that disagrees with the
code, the agent adds a structured row:

| # | Claim | Truth | Slip path | Gate added |
|---|-------|-------|-----------|------------|
| N | what was stated | what the code says | how it bypassed existing gates | the regression-class fix (validator, test, memory update, prose grep) — `none` is a legitimate value |

Bundling log update + gate addition + prose fix in **one commit** makes
each session's impact diff-reviewable. Per Reock, the signal is the
*rate* of entries over time, not the cleanliness of any single one.
Don't gatekeep entries; better to over-record minor drift than
under-record it.

The first 10 entries (seeded the day the log was created) record actual
session events: a memory file claimed "1 Quarkus build step" when the
code had 14; "PENDING" features that had shipped weeks earlier;
off-by-one runtime counts in narrative prose. The 11th entry recorded a
CI-caught regression where a wall-clock test asserted
`observed > limit` but our scheduled-task fix made `observed == limit` a
legitimate trip outcome. That entry's gate column reads "JDK 21/26 CI
matrix caught it within 12 min" — which is **the most honest gate value
of all**: an existing gate worked.

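For concreteness, an invented entry in this shape (the Quarkus numbers echo
the seeded incident above; every other value is illustrative, not a real
log row):

```markdown
| # | Claim | Truth | Slip path | Gate added |
|---|-------|-------|-----------|------------|
| 12 | memory: "Quarkus has 1 build step" | 14 build steps on `main` | quoted a stale memory file without re-verifying against source | memory file corrected; `none` otherwise |
```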
## Two enforcement points for the drift log

The log is structurally append-only. Two layers keep it that way and
keep it populated:

**`scripts/validate-drift-log.sh`** — pre-push Tier 1. Asserts:

1. File exists and parses.
2. ≥1 `## YYYY-MM-DD` section.
3. No future-dated sections.
4. Sections in chronological order (oldest at top, newest at bottom).
5. Pre-existing sections (older than today) match `origin/main` verbatim.

It does **not** enforce that drift gets *added* — that's the next layer's
job.

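Checks 2 through 4 reduce to a few lines over the `## YYYY-MM-DD` headings.
A sketch, with an invented function name, and omitting check 5 (the
`origin/main` comparison), which needs git:

```shell
#!/usr/bin/env sh
# Sketch of the dated-section checks (function name invented; the real
# validate-drift-log.sh also diffs pre-existing sections against origin/main).

check_sections() {
  log="$1"
  today=$(date +%Y-%m-%d)
  dates=$(grep -oE '^## [0-9]{4}-[0-9]{2}-[0-9]{2}' "$log" | cut -d' ' -f2)
  # Check 2: at least one dated section.
  [ -n "$dates" ] || { echo "no '## YYYY-MM-DD' sections" >&2; return 1; }
  # Check 3: no future dates (lexicographic compare works for ISO dates).
  for d in $dates; do
    latest=$(printf '%s\n%s\n' "$d" "$today" | sort | tail -n 1)
    [ "$latest" = "$today" ] || { echo "future-dated section: $d" >&2; return 1; }
  done
  # Check 4: the on-disk order must already be chronological.
  [ "$dates" = "$(printf '%s\n' $dates | sort)" ] \
    || { echo "sections out of chronological order" >&2; return 1; }
}

# Demo: an oldest-first log passes.
printf '## 2024-01-05\n...\n## 2024-03-01\n...\n' > /tmp/log-ok.md
check_sections /tmp/log-ok.md && echo "ok: structure valid"
```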
**Claude Code `Stop` hook** at `.claude/hooks/check-drift-log.sh`,
registered in `.claude/settings.json`. Fires at session end:

1. Reads the transcript path from the hook input JSON.
2. Greps for high-precision drift-correction patterns:
   `stale memory`, `\boff-by-one\b`,
   `I (was wrong|claimed)…(but|actual|truth)`,
   `memor… was/is wrong/stale/out of date`,
   `fabricated rule/stat/count/claim`,
   `verified by grep…disagree/contradict/wrong/stale`.
3. If matched **and** `.harness/drift-log.md` was not modified this
   session (working tree, untracked, or last 3 commits), emits
   `{"decision": "block", "reason": "..."}` to force the agent to
   either append an entry or explicitly state that the correction was
   trivial.
4. `stop_hook_active=true` short-circuits to a no-op so deliberate skips
   don't loop.

Patterns are deliberately narrow to minimize false positives. If a
recurring real correction shape isn't matching, add a new pattern with
concrete real-session evidence — don't loosen existing ones.

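The hook's core decision (steps 2 and 3) reduces to a pure function over
two inputs, which also makes it testable without a live session. In this
sketch the pattern list is abridged, the function name is invented, and
the real hook derives "was the log modified" from git state rather than
taking it as an argument:

```shell
#!/usr/bin/env sh
# Sketch of the Stop hook's decision logic (abridged patterns; the real
# check-drift-log.sh reads hook-input JSON and inspects git for log changes).

# decide TRANSCRIPT_FILE LOG_MODIFIED(yes|no)
# Prints a Claude Code "block" decision when drift-correction language is
# present but the drift log was untouched; prints nothing otherwise.
decide() {
  transcript="$1"; log_modified="$2"
  if grep -qiE 'stale memory|off-by-one|fabricated (rule|stat|count|claim)' \
       "$transcript" && [ "$log_modified" = "no" ]; then
    printf '{"decision": "block", "reason": "Drift correction in transcript but .harness/drift-log.md not updated. Append an entry or state the correction was trivial."}\n'
  fi
}

# Demo: correction language plus an untouched log blocks session end.
printf 'stale memory claimed 1 Quarkus build step; code has 14.\n' \
  > /tmp/transcript.txt
decide /tmp/transcript.txt no
```

Emitting the `{"decision": "block"}` object is what re-engages the agent;
an empty output lets the session end normally.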
## What this looks like in practice

A typical session might go:

1. Claude claims "X is shipped" based on a 30-day-old memory file.
2. ChefFamille (or a `git grep` self-catch) says "verified by grep — that
   class doesn't exist on `main`".
3. Claude reads the actual source, confirms the drift.
4. Claude appends an entry to `.harness/drift-log.md` documenting the
   claim, truth, slip path, and what gate was added.
5. Claude bundles the log entry + any prose fix + the gate (e.g., a
   regex pattern in `validate-capability-claims.sh`) into one commit.
6. Pre-push Tier 1 runs both validators in <1s; the commit lands.
7. At session end the Stop hook checks the transcript: drift language
   present, log file modified, no block.

Without the hook, session 2 of the same day forgets and makes the same
class of claim again. With the hook, the agent is re-engaged before the
session can end, and either logs the drift or explicitly states "trivial
— not worth logging" (the hook then no-ops via `stop_hook_active`).

## What this is *not*

- **Not a replacement for code review.** The validators only check
  prose-vs-snapshot agreement and structural hygiene. They don't catch
  semantic bugs, performance regressions, or architectural mistakes.
- **Not a utilization metric.** We don't count "% of commits AI-authored"
  or "tokens spent per feature". Those measures invite Goodhart's Law.
- **Not a substitute for verification at session start.** The
  `feedback_drift_log.md` memory rule says: re-verify against current
  code before quoting any memory file older than the most recent
  CHANGELOG bump. The drift log records what slipped past that rule;
  the rule itself is the primary defense.

## Adopting the pattern in your project

The shape is small enough to copy. Concretely, for a project with an
LLM-facing agent integration:

1. **Pick one or two count claims you make in your README that have
   gone wrong before.** Runtime count, capability count, sample count,
   backend count — anything quantitative that you've shipped wrong.
2. **Build a snapshot** parsed from canonical source. JSON, committed
   to git, regenerated by a single shell script. Add `LC_ALL=C` so
   `sort` is deterministic across hosts.
3. **Add one validator** that greps your README for those count claims
   and asserts against the snapshot. Wire it into your pre-push hook.
4. **Add an append-only drift log** with one row per caught
   hallucination. Don't stress about the schema — `claim`, `truth`,
   `slip path`, `gate` is enough.
5. **Add a Claude Code Stop hook** (or your agent runtime's equivalent)
   that greps the transcript for drift-correction language and blocks
   session end if the log wasn't updated. Use narrow patterns; broad
   patterns cause false-positive loops.
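Step 2 is the only piece that needs parsing. Here is a sketch of a
regeneration script under two assumptions not guaranteed by the text:
your canonical source is a Java enum, and `jq` is available. The demo
enum and file paths are invented.

```shell
#!/usr/bin/env sh
# Sketch of a snapshot regenerator (step 2). Paths, the demo enum, and the
# exact JSON shape are illustrative; jq is assumed available.
set -eu
export LC_ALL=C   # byte-ordered `sort`, matching Java's String.compareTo

regen_snapshot() {
  enum_file="$1"; out_file="$2"
  # Enum constants are UPPER_SNAKE identifiers terminated by "," or ";".
  grep -oE '^[[:space:]]*[A-Z][A-Z0-9_]+[,;]' "$enum_file" \
    | tr -d ' \t,;' \
    | sort \
    | jq -R . \
    | jq -s '{schema_version: 1, capabilities: {count: length, names: .}}' \
    > "$out_file"
}

# Demo against a scratch enum.
cat > /tmp/AiCapability.java <<'EOF'
public enum AiCapability {
  TOOL_CALLING,
  AUDIO,
  AGENT_ORCHESTRATION;
}
EOF
regen_snapshot /tmp/AiCapability.java /tmp/snapshot.json
jq -r '.capabilities.count' /tmp/snapshot.json
```

Because the names are sorted under `LC_ALL=C` before serialization, every
regeneration is byte-identical, which is what makes the committed file a
clean diff target.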

That's the whole pattern. Roughly 500 lines of bash + 250 lines of Java
in our case. The lower bound for any project is the snapshot + one
validator — maybe 100 lines — which already gives you the
diff-reviewable core.

## Further reading

- Justin Reock, *AI-Assisted Engineering*
  ([InfoQ talk](https://www.infoq.com/presentations/ai-assisted-engineering/),
  2026-05) — the DX measurement framework (utilization vs. impact vs. cost)
  and the Goodhart's Law warning.
- [`walkinglabs/learn-harness-engineering`](https://github.com/walkinglabs/learn-harness-engineering)
  — the five-subsystem framework (Instructions, State, Verification,
  Scope, Lifecycle). Treats the harness as engineering work rather than
  configuration.
- [`juliusbrussee/caveman`](https://github.com/JuliusBrussee/caveman)
  — the snapshot-as-source-of-truth pattern with a three-arm
  baseline/control/treatment eval methodology. Inspired the
  diff-reviewable shape of `capabilities.snapshot.json`.
- Atmosphere's
  [`.harness/README.md`](https://github.com/Atmosphere/atmosphere/blob/main/.harness/README.md)
  — operator manual for the directory.
