Skip to content

Commit 4fd791c

Browse files
authored
Merge pull request #72 from mataeil/feat/v1.8.0-quality-loop
feat(v1.8.0): drive quality to good — fix thrashing guard, per-dimension capture, dimension-lock, remainder
2 parents 7176243 + 47c958a commit 4fd791c

7 files changed

Lines changed: 358 additions & 32 deletions

File tree

.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
{
22
"name": "ooda-loop",
33
"displayName": "OODA-loop",
4-
"version": "1.7.0",
4+
"version": "1.8.0",
55
"description": "An autonomous operations layer for your live side project. It watches, re-orients from which PRs you merge and reject, and opens small revertible PRs — bounded by a HALT file, protected paths, and a hard cost cap. Built on Boyd's OODA loop. You stay in command.",
66
"author": {
77
"name": "Taeil Ma",

.claude/ooda-evolution-v1.8.0.md

Lines changed: 120 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,120 @@
1+
# OODA-loop v1.8.0 — driving quality to "good", not "passable"
2+
3+
**Date:** 2026-06-19
4+
**Purpose reminder:** the f1-racing game is a **probe**, not the product. Its job
5+
is to expose what OODA-loop still gets wrong. This release is about the loop.
6+
7+
## The probe's verdict (why we believe the loop is still the bottleneck)
8+
9+
v1.7.0 fixed *measurement* (artifact axis) and *anti-incrementalism* (leap
10+
cycles). It worked — the loop went from a flat, lying A to a climbing, honest D.
11+
But after **three** real leap cycles the owner's verdict is unchanged: *still too
12+
crappy.* The data says why:
13+
14+
| | artifact_quality | note |
15+
|---|---|---|
16+
| 22 feature cycles (old loop) | **0.394** (flat) | Goodhart collapse |
17+
| + leap 1 (visual) | 0.447 | +0.053 |
18+
| + leap 2 (track) | 0.472 | +0.025 |
19+
| + leap 3 (car/cockpit) | 0.522 | +0.050 |
20+
| bar (shippable) | 0.65 | not reached |
21+
| "genuinely good" | ~0.80 | far off |
22+
23+
Average **~+0.04 per leap**. At that rate "good" is ~7 leaps away. The loop
24+
*climbs* but it climbs **too slowly and stops too low**. Three concrete failures
25+
observed during the campaign point at the loop, not the game:
26+
27+
1. **It settles for "+0.05 better."** A leap's success test is
28+
`min_dimension_delta` (0.05) — so the moment a dimension nudges up, the loop
29+
declares victory and moves the weighted-gap target elsewhere, abandoning a
30+
dimension while it is still *bad* (visual went 0.22→0.41→0.59 across two
31+
separate leaps with other work in between; it was never driven to the bar in
32+
one sustained push).
33+
2. **It accepts partial implementation of its own plan.** Leap 3's design panel
34+
specced car/cockpit **and** materials/lighting **and** environment; only
35+
car/cockpit shipped. The dropped materials/lighting (shadows, tone-mapping,
36+
textures) was the *single most-repeated* critic complaint at every leap — and
37+
it silently vanished. The act step under-delivers with no completion check.
38+
3. **It can't see motion, feel, or fun.** The critic scores one still
39+
screenshot, so `driving_feel` and `fun_challenge` are unmeasurable and
40+
unimprovable — they sit at 0.38–0.51 and cap the weighted mean.
41+
42+
## Root cause (from a 13-agent, adversarially-verified diagnosis)
43+
44+
> The loop **detects** a quality gap and takes *a* step at it, but is not built to
45+
> **close** it. It (1) can't see ~45% of its own rubric — `driving_feel` + `fun_challenge`
46+
> were scored from a still screenshot and sat frozen across all 25 cycles; (2)
47+
> abandons a dimension after a +0.05 nudge and rotates targets instead of driving
48+
> one to the bar; (3) had a **silently broken thrashing guard** (read a nonexistent
49+
> `leap_delta` field → never fired → could thrash forever); and (4) accepts partial
50+
> implementation of its own leap plans, orphaning the rest. It optimizes to
51+
> "better," never to "good."
52+
53+
The devil's-advocate agent confirmed the leap *routing* is sound — the binding
54+
constraint is **perception** (the critic can't measure feel/fun), not more leap
55+
machinery. That reframed the fix.
56+
57+
## The v1.8.0 upgrades (ranked by leverage)
58+
59+
1. **Fix the thrashing guard (prerequisite, real bug).** 2-G now counts
60+
`leap_attempts[].delta_score` on the `leap_target` (was: nonexistent `leap_delta`
61+
on `weakest_dimension`). The HALT safety valve actually fires now —
62+
`rubric_score.failed_leaps()` is the deterministic ground truth.
63+
2. **Per-dimension `capture_method` (5-G).** Each rubric axis declares how its
64+
evidence is captured; experiential axes use `gameplay_metrics` — a
65+
HUMAN-AUTHORED, hash-verified, protected harness (same independence invariant
66+
as the rubric hash). Missing/unverified harness → the axis scores `null`
67+
(capture_failure) + a skill_gap, never a faked or screenshot-fallback score.
68+
Unlocks the frozen 45% of rubric weight.
69+
3. **Dimension lock until bar (2-G).** After a successful leap whose target is
70+
still below `bar − eps`, keep the plateau active on the SAME target so 3-K
71+
leaps it again next cycle — drive-to-bar, not detect-and-nudge. A tolerance
72+
band + the (now-working) max-attempts HALT prevent infinite lock.
73+
`rubric_score.lock_target()` is the deterministic ground truth;
74+
`config.leap.lock_until_bar` (default true) toggles it.
75+
4. **Auto-queue the remainder (5-G).** If a leap passes its delta gate but the
76+
dimension is still below bar, the *independent critic's* score (not the
77+
maker's self-report) queues a high-RICE remainder so dropped scope can't
78+
vanish. Records `leap_dim_still_below_bar`.
79+
80+
All four are deterministic where possible (`scripts/rubric_score.py`), config-
81+
driven, and gaming-resistant. `tests/verify.py` 59 → **61**.
82+
83+
## What we deliberately did NOT change (rejected / devil's-advocate-validated)
84+
85+
- **Don't raise the bar to 0.80 yet.** It only re-points the weighted-gap target
86+
at an unmeasurable dimension and (pre-fix) would HALT the loop. Raise only after
87+
a `gameplay_metrics` leap proves feel/fun are responsive. Stage 0.65→0.75→0.80.
88+
- **No inner multi-pass refine loop (yet).** The 3 completed leaps each landed
89+
substantial single-pass deltas (+0.19/+0.27/+0.18); the cap was perception, not
90+
pass-count. Adding passes on unmeasurable dimensions = cost with no signal.
91+
- **No LLM-component-coverage gate.** Gameable (touch one file per "component" →
92+
100%). Change 4's critic-driven remainder is the robust substitute.
93+
- **No `multi_probe` still-sequence.** A burst of stills still can't tell
94+
responsive steering from sluggish; `gameplay_metrics` is the right instrument.
95+
96+
## Validation (leap 4 under the upgraded loop)
97+
98+
Ran a real leap under v1.8.0, targeting `visual_fidelity` (dimension lock kept it
99+
as the target). It shipped exactly the materials/lighting that leap 3 dropped —
100+
ACES filmic tone mapping, soft directional shadows (cars/buildings/kerbs cast,
101+
road/ground receive), and baked contact shadows under every car.
102+
103+
- Independent re-critique: **visual_fidelity 0.59 → 0.63** (Δ +0.04). Building
104+
shadows and tonal range are now visible; objects are grounded.
105+
- Artifact trajectory across all leaps: **0.394 → 0.447 → 0.472 → 0.522 → 0.533**.
106+
107+
**The decisive finding (the whole point — f1 is the probe):** +0.04 is *at the
108+
gate boundary*, and the critic still under-credits the change — because
109+
`visual_fidelity` has hit the **ceiling of what a still screenshot can measure
110+
(~0.63)**. The loop's next weighted-gap target is `fun_challenge` (0.38), which a
111+
screenshot *cannot* judge at all. So the upgraded loop's correct next move is
112+
exactly what v1.8.0 built: with the **fixed** thrashing guard, two failed leaps on
113+
`fun_challenge`**HALT requesting a human-authored `gameplay_metrics` harness**,
114+
instead of thrashing forever (the v1.7.x bug) or faking a score.
115+
116+
The bottleneck has *moved*: from "the loop can't close gaps" (v1.7.0) to "the loop
117+
can't **perceive** the experiential half of quality" (now). v1.8.0's per-dimension
118+
`capture_method` is the mechanism to close it — but it requires a human to author
119+
the measurement harness, because the loop is forbidden (by design) from grading
120+
its own standard. **That hand-off is the next experiment.**

CHANGELOG.md

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,42 @@ independently. Bump there signals migration work for downstream projects.
88

99
---
1010

11+
## [v1.8.0] — 2026-06-19
12+
13+
### Changed — drive quality to "good", not "passable" (config schema 1.4.0)
14+
15+
The F1 probe stayed crap after **three** v1.7.0 leaps (artifact 0.394 → 0.447 →
16+
0.472 → 0.522, never reaching the 0.65 bar). A 13-agent adversarially-verified
17+
diagnosis (`.claude/ooda-evolution-v1.8.0.md`) found the loop **detects** a
18+
quality gap but isn't built to **close** it. Four fixes, ranked by leverage:
19+
20+
- **Thrashing-guard bug fix (prerequisite).** evolve 2-G counted a nonexistent
21+
`leap_delta` field on `weakest_dimension`, so the guard's `fails` count was
22+
ALWAYS 0 and the HALT safety valve **never fired** — the loop could thrash a
23+
dimension forever. Now counts `leap_attempts[].delta_score` on the actual
24+
`leap_target` (`rubric_score.failed_leaps()`).
25+
- **Per-dimension `capture_method` (5-G).** The critic scored every axis from one
26+
screenshot, so `driving_feel` + `fun_challenge` (**45% of the rubric's weight**)
27+
were frozen across all 25 cycles. Each axis now declares its capture;
28+
experiential axes use `gameplay_metrics` — a human-authored, hash-verified,
29+
protected harness (same independence invariant as the rubric hash). Missing/
30+
unverified → score `null` + skill_gap, never a faked or silent-screenshot score.
31+
- **Dimension lock until bar (2-G).** A successful leap that left its target below
32+
`bar − eps` now keeps the plateau active on the SAME target (drive-to-bar, not
33+
detect-and-nudge + rotate). `rubric_score.lock_target()`; toggle with
34+
`config.leap.lock_until_bar`. Tolerance band + the now-working max-attempts HALT
35+
prevent infinite lock.
36+
- **Auto-queue remainder (5-G).** A leap that passes its delta gate but leaves the
37+
dimension below bar queues a high-RICE remainder, triggered by the *independent
38+
critic's* score (not the maker's self-report), so dropped/partial scope can't be
39+
silently orphaned (as leap 3's materials/lighting was).
40+
41+
**Rejected** (kept the loop honest): raising the bar to 0.80 before feel/fun are
42+
measurable; an inner multi-pass refine loop; an LLM-component-coverage gate
43+
(gameable); `multi_probe` still-sequences. `tests/verify.py` 59 → **61**.
44+
45+
---
46+
1147
## [v1.7.0] — 2026-06-17
1248

1349
### Added — artifact-grounded evaluation + quantum-leap cycles (config schema 1.3.0)

config.example.json

Lines changed: 15 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
{
2-
"schema_version": "1.3.0",
2+
"schema_version": "1.4.0",
33
"project": {
44
"name": "my-app",
55
"locale": "en",
@@ -270,16 +270,27 @@
270270
"plateau_window": 4,
271271
"plateau_eps": 0.05,
272272
"locked": true,
273-
"dimensions": []
273+
"__dimensions_doc__": "v1.8.0: each dimension may override capture_method so the critic gets the evidence it actually needs. 'screenshot' axes share one capture; EXPERIENTIAL axes (feel/fun/responsiveness) a screenshot cannot judge use 'gameplay_metrics' — a HUMAN-AUTHORED harness that exercises the artifact and emits metrics JSON. The harness MUST be in safety.protected_paths AND match gameplay_metrics_hash (independence gate, same invariant as the rubric hash); else the dimension scores null (capture_failure) rather than faking a score. Without per-dimension capture, experiential axes freeze at their initial score and silently cap artifact_quality.",
274+
"dimensions": [],
275+
"__example_dimension__": {
276+
"name": "driving_feel",
277+
"weight": 0.25,
278+
"capture_method": "gameplay_metrics",
279+
"gameplay_metrics_command": "node tools/feel_harness.mjs",
280+
"gameplay_metrics_hash": "<sha256 of the harness file>",
281+
"metrics_fields": ["input_lag_ms", "physics_response_ms", "completion_rate"],
282+
"description": "Responsive steering, weight transfer, distinct braking — judge against: input_lag_ms<40, physics_response_ms stable, completion_rate>0.7."
283+
}
274284
},
275285
"leap": {
276-
"__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend.",
286+
"__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend. v1.8.0: lock_until_bar keeps leaping the SAME dimension until it clears bar (drive-to-good, not detect-and-nudge).",
277287
"max_lines": 1500,
278288
"min_dimension_delta": 0.05,
279289
"max_attempts_per_dimension": 2,
280290
"max_per_day": 2,
281291
"gap_weight": 30.0,
282-
"cost_limit_usd": 0.5
292+
"cost_limit_usd": 0.5,
293+
"lock_until_bar": true
283294
},
284295
"goal_completion_idle": true
285296
}

scripts/rubric_score.py

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -214,6 +214,43 @@ def detect_plateau(outcomes: list, rubric: dict) -> dict:
214214
}
215215

216216

217+
def failed_leaps(outcomes: list, dimension: str, min_delta: float) -> int:
218+
"""v1.8.0 thrashing-guard count (deterministic ground truth for evolve 2-G).
219+
Counts leap cycles whose `leap_attempts` on `dimension` failed to clear
220+
`min_delta`. The v1.7.x engine read a nonexistent field `leap_delta` and
221+
matched the raw weakest_dimension, so this was ALWAYS 0 and the guard never
222+
fired — the loop could thrash forever. The real ledger field is
223+
`leap_attempts[].delta_score`, keyed on the dimension actually leapt."""
224+
n = 0
225+
for e in outcomes:
226+
if e.get("cycle_mode") != "leap":
227+
continue
228+
for a in (e.get("leap_attempts") or []):
229+
if a.get("dimension") == dimension and a.get("delta_score", 1.0) < min_delta:
230+
n += 1
231+
return n
232+
233+
234+
def lock_target(outcomes: list, rubric: dict, leap_target: str | None) -> str | None:
235+
"""v1.8.0 dimension-lock: after a SUCCESSFUL leap whose target is still below
236+
(bar − eps), return that target so evolve 2-G keeps the plateau active on it
237+
(drive-to-bar) instead of coasting through feature cycles and re-rotating.
238+
Returns None when nothing should be locked (no leap last, regressed, or the
239+
target is at/near bar — the tolerance band stops critic variance from locking
240+
the loop forever; the max_attempts HALT is the genuine-stuck backstop)."""
241+
if not leap_target or not outcomes:
242+
return None
243+
last = outcomes[-1]
244+
if last.get("cycle_mode") != "leap" or last.get("result_type") == "leap_regressed":
245+
return None
246+
bar = rubric.get("bar", DEFAULT_BAR)
247+
eps = float(rubric.get("plateau_eps", DEFAULT_PLATEAU_EPS))
248+
score = (last.get("dimension_scores") or {}).get(leap_target)
249+
if score is None:
250+
return None
251+
return leap_target if score < bar - eps else None
252+
253+
217254
def compute(project: Path) -> dict:
218255
ev = Path(project) / "agent" / "state" / "evolve"
219256
config = _load(Path(project) / "config.json", {}) or {}

0 commit comments

Comments
 (0)