Merge pull request #72 from mataeil/feat/v1.8.0-quality-loop

mataeil · web-flow · commit 4fd791c19dae · 2026-06-19T15:05:53.000+09:00
feat(v1.8.0): drive quality to good — fix thrashing guard, per-dimension capture, dimension-lock, remainder
diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
   "name": "ooda-loop",
   "displayName": "OODA-loop",
-  "version": "1.7.0",
+  "version": "1.8.0",
   "description": "An autonomous operations layer for your live side project. It watches, re-orients from which PRs you merge and reject, and opens small revertible PRs — bounded by a HALT file, protected paths, and a hard cost cap. Built on Boyd's OODA loop. You stay in command.",
   "author": {
     "name": "Taeil Ma",
diff --git a/.claude/ooda-evolution-v1.8.0.md b/.claude/ooda-evolution-v1.8.0.md
@@ -0,0 +1,120 @@
+# OODA-loop v1.8.0 — driving quality to "good", not "passable"
+
+**Date:** 2026-06-19
+**Purpose reminder:** the f1-racing game is a **probe**, not the product. Its job
+is to expose what OODA-loop still gets wrong. This release is about the loop.
+
+## The probe's verdict (why we believe the loop is still the bottleneck)
+
+v1.7.0 fixed *measurement* (artifact axis) and *anti-incrementalism* (leap
+cycles). It worked — the loop went from a flat, lying A to a climbing, honest D.
+But after **three** real leap cycles the owner's verdict is unchanged: *still too
+crappy.* The data says why:
+
+| | artifact_quality | note |
+|---|---|---|
+| 22 feature cycles (old loop) | **0.394** (flat) | Goodhart collapse |
+| + leap 1 (visual) | 0.447 | +0.053 |
+| + leap 2 (track) | 0.472 | +0.025 |
+| + leap 3 (car/cockpit) | 0.522 | +0.050 |
+| bar (shippable) | 0.65 | not reached |
+| "genuinely good" | ~0.80 | far off |
+
+Average **~+0.04 per leap**. At that rate "good" is ~7 leaps away. The loop
+*climbs* but it climbs **too slowly and stops too low**. Three concrete failures
+observed during the campaign point at the loop, not the game:
+
+1. **It settles for "+0.05 better."** A leap's success test is
+   `min_dimension_delta` (0.05) — so the moment a dimension nudges up, the loop
+   declares victory and moves the weighted-gap target elsewhere, abandoning a
+   dimension while it is still *bad* (visual went 0.22→0.41→0.59 across two
+   separate leaps with other work in between; it was never driven to the bar in
+   one sustained push).
+2. **It accepts partial implementation of its own plan.** Leap 3's design panel
+   specced car/cockpit **and** materials/lighting **and** environment; only
+   car/cockpit shipped. The dropped materials/lighting (shadows, tone-mapping,
+   textures) was the *single most-repeated* critic complaint at every leap — and
+   it silently vanished. The act step under-delivers with no completion check.
+3. **It can't see motion, feel, or fun.** The critic scores one still
+   screenshot, so `driving_feel` and `fun_challenge` are unmeasurable and
+   unimprovable — they sit at 0.38–0.51 and cap the weighted mean.
+
+## Root cause (from a 13-agent, adversarially-verified diagnosis)
+
+> The loop **detects** a quality gap and takes *a* step at it, but is not built to
+> **close** it. It (1) can't see ~45% of its own rubric — `driving_feel` + `fun_challenge`
+> were scored from a still screenshot and sat frozen across all 25 cycles; (2)
+> abandons a dimension after a +0.05 nudge and rotates targets instead of driving
+> one to the bar; (3) had a **silently broken thrashing guard** (read a nonexistent
+> `leap_delta` field → never fired → could thrash forever); and (4) accepts partial
+> implementation of its own leap plans, orphaning the rest. It optimizes to
+> "better," never to "good."
+
+The devil's-advocate agent confirmed the leap *routing* is sound — the binding
+constraint is **perception** (the critic can't measure feel/fun), not more leap
+machinery. That reframed the fix.
+
+## The v1.8.0 upgrades (ranked by leverage)
+
+1. **Fix the thrashing guard (prerequisite, real bug).** 2-G now counts
+   `leap_attempts[].delta_score` on the `leap_target` (was: nonexistent `leap_delta`
+   on `weakest_dimension`). The HALT safety valve actually fires now —
+   `rubric_score.failed_leaps()` is the deterministic ground truth.
+2. **Per-dimension `capture_method` (5-G).** Each rubric axis declares how its
+   evidence is captured; experiential axes use `gameplay_metrics` — a
+   HUMAN-AUTHORED, hash-verified, protected harness (same independence invariant
+   as the rubric hash). Missing/unverified harness → the axis scores `null`
+   (capture_failure) + a skill_gap, never a faked or screenshot-fallback score.
+   Unlocks the frozen 45% of rubric weight.
+3. **Dimension lock until bar (2-G).** After a successful leap whose target is
+   still below `bar − eps`, keep the plateau active on the SAME target so 3-K
+   leaps it again next cycle — drive-to-bar, not detect-and-nudge. A tolerance
+   band + the (now-working) max-attempts HALT prevent infinite lock.
+   `rubric_score.lock_target()` is the deterministic ground truth;
+   `config.leap.lock_until_bar` (default true) toggles it.
+4. **Auto-queue the remainder (5-G).** If a leap passes its delta gate but the
+   dimension is still below bar, the *independent critic's* score (not the
+   maker's self-report) queues a high-RICE remainder so dropped scope can't
+   vanish. Records `leap_dim_still_below_bar`.
+
+All four are deterministic where possible (`scripts/rubric_score.py`), config-
+driven, and gaming-resistant. `tests/verify.py` 59 → **61**.
+
+## What we deliberately did NOT change (rejected / devil's-advocate-validated)
+
+- **Don't raise the bar to 0.80 yet.** It only re-points the weighted-gap target
+  at an unmeasurable dimension and (pre-fix) would HALT the loop. Raise only after
+  a `gameplay_metrics` leap proves feel/fun are responsive. Stage 0.65→0.75→0.80.
+- **No inner multi-pass refine loop (yet).** The 3 completed leaps each landed
+  substantial single-pass deltas (+0.19/+0.27/+0.18); the cap was perception, not
+  pass-count. Adding passes on unmeasurable dimensions = cost with no signal.
+- **No LLM-component-coverage gate.** Gameable (touch one file per "component" →
+  100%). Change 4's critic-driven remainder is the robust substitute.
+- **No `multi_probe` still-sequence.** A burst of stills still can't tell
+  responsive steering from sluggish; `gameplay_metrics` is the right instrument.
+
+## Validation (leap 4 under the upgraded loop)
+
+Ran a real leap under v1.8.0, targeting `visual_fidelity` (dimension lock kept it
+as the target). It shipped exactly the materials/lighting that leap 3 dropped —
+ACES filmic tone mapping, soft directional shadows (cars/buildings/kerbs cast,
+road/ground receive), and baked contact shadows under every car.
+
+- Independent re-critique: **visual_fidelity 0.59 → 0.63** (Δ +0.04). Building
+  shadows and tonal range are now visible; objects are grounded.
+- Artifact trajectory across all leaps: **0.394 → 0.447 → 0.472 → 0.522 → 0.533**.
+
+**The decisive finding (the whole point — f1 is the probe):** +0.04 is *at the
+gate boundary*, and the critic still under-credits the change — because
+`visual_fidelity` has hit the **ceiling of what a still screenshot can measure
+(~0.63)**. The loop's next weighted-gap target is `fun_challenge` (0.38), which a
+screenshot *cannot* judge at all. So the upgraded loop's correct next move is
+exactly what v1.8.0 built: with the **fixed** thrashing guard, two failed leaps on
+`fun_challenge` → **HALT requesting a human-authored `gameplay_metrics` harness**,
+instead of thrashing forever (the v1.7.x bug) or faking a score.
+
+The bottleneck has *moved*: from "the loop can't close gaps" (v1.7.0) to "the loop
+can't **perceive** the experiential half of quality" (now). v1.8.0's per-dimension
+`capture_method` is the mechanism to close it — but it requires a human to author
+the measurement harness, because the loop is forbidden (by design) from grading
+its own standard. **That hand-off is the next experiment.**
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,42 @@ independently. Bump there signals migration work for downstream projects.
 
 ---
 
+## [v1.8.0] — 2026-06-19
+
+### Changed — drive quality to "good", not "passable" (config schema 1.4.0)
+
+The F1 probe stayed crap after **three** v1.7.0 leaps (artifact 0.394 → 0.447 →
+0.472 → 0.522, never reaching the 0.65 bar). A 13-agent adversarially-verified
+diagnosis (`.claude/ooda-evolution-v1.8.0.md`) found the loop **detects** a
+quality gap but isn't built to **close** it. Four fixes, ranked by leverage:
+
+- **Thrashing-guard bug fix (prerequisite).** evolve 2-G counted a nonexistent
+  `leap_delta` field on `weakest_dimension`, so the guard's `fails` count was
+  ALWAYS 0 and the HALT safety valve **never fired** — the loop could thrash a
+  dimension forever. Now counts `leap_attempts[].delta_score` on the actual
+  `leap_target` (`rubric_score.failed_leaps()`).
+- **Per-dimension `capture_method` (5-G).** The critic scored every axis from one
+  screenshot, so `driving_feel` + `fun_challenge` (**45% of the rubric's weight**)
+  were frozen across all 25 cycles. Each axis now declares its capture;
+  experiential axes use `gameplay_metrics` — a human-authored, hash-verified,
+  protected harness (same independence invariant as the rubric hash). Missing/
+  unverified → score `null` + skill_gap, never a faked or silent-screenshot score.
+- **Dimension lock until bar (2-G).** A successful leap that left its target below
+  `bar − eps` now keeps the plateau active on the SAME target (drive-to-bar, not
+  detect-and-nudge + rotate). `rubric_score.lock_target()`; toggle with
+  `config.leap.lock_until_bar`. Tolerance band + the now-working max-attempts HALT
+  prevent infinite lock.
+- **Auto-queue remainder (5-G).** A leap that passes its delta gate but leaves the
+  dimension below bar queues a high-RICE remainder, triggered by the *independent
+  critic's* score (not the maker's self-report), so dropped/partial scope can't be
+  silently orphaned (as leap 3's materials/lighting was).
+
+**Rejected** (kept the loop honest): raising the bar to 0.80 before feel/fun are
+measurable; an inner multi-pass refine loop; an LLM-component-coverage gate
+(gameable); `multi_probe` still-sequences. `tests/verify.py` 59 → **61**.
+
+---
+
 ## [v1.7.0] — 2026-06-17
 
 ### Added — artifact-grounded evaluation + quantum-leap cycles (config schema 1.3.0)
diff --git a/config.example.json b/config.example.json
@@ -1,5 +1,5 @@
 {
-  "schema_version": "1.3.0",
+  "schema_version": "1.4.0",
   "project": {
     "name": "my-app",
     "locale": "en",
@@ -270,16 +270,27 @@
     "plateau_window": 4,
     "plateau_eps": 0.05,
     "locked": true,
-    "dimensions": []
+    "__dimensions_doc__": "v1.8.0: each dimension may override capture_method so the critic gets the evidence it actually needs. 'screenshot' axes share one capture; EXPERIENTIAL axes (feel/fun/responsiveness) a screenshot cannot judge use 'gameplay_metrics' — a HUMAN-AUTHORED harness that exercises the artifact and emits metrics JSON. The harness MUST be in safety.protected_paths AND match gameplay_metrics_hash (independence gate, same invariant as the rubric hash); else the dimension scores null (capture_failure) rather than faking a score. Without per-dimension capture, experiential axes freeze at their initial score and silently cap artifact_quality.",
+    "dimensions": [],
+    "__example_dimension__": {
+      "name": "driving_feel",
+      "weight": 0.25,
+      "capture_method": "gameplay_metrics",
+      "gameplay_metrics_command": "node tools/feel_harness.mjs",
+      "gameplay_metrics_hash": "<sha256 of the harness file>",
+      "metrics_fields": ["input_lag_ms", "physics_response_ms", "completion_rate"],
+      "description": "Responsive steering, weight transfer, distinct braking — judge against: input_lag_ms<40, physics_response_ms stable, completion_rate>0.7."
+    }
   },
   "leap": {
-    "__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend.",
+    "__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend. v1.8.0: lock_until_bar keeps leaping the SAME dimension until it clears bar (drive-to-good, not detect-and-nudge).",
     "max_lines": 1500,
     "min_dimension_delta": 0.05,
     "max_attempts_per_dimension": 2,
     "max_per_day": 2,
     "gap_weight": 30.0,
-    "cost_limit_usd": 0.5
+    "cost_limit_usd": 0.5,
+    "lock_until_bar": true
   },
   "goal_completion_idle": true
 }
diff --git a/scripts/rubric_score.py b/scripts/rubric_score.py
@@ -214,6 +214,43 @@ def detect_plateau(outcomes: list, rubric: dict) -> dict:
     }
 
 
+def failed_leaps(outcomes: list, dimension: str, min_delta: float) -> int:
+    """v1.8.0 thrashing-guard count (deterministic ground truth for evolve 2-G).
+    Counts leap cycles whose `leap_attempts` on `dimension` failed to clear
+    `min_delta`. The v1.7.x engine read a nonexistent field `leap_delta` and
+    matched the raw weakest_dimension, so this was ALWAYS 0 and the guard never
+    fired — the loop could thrash forever. The real ledger field is
+    `leap_attempts[].delta_score`, keyed on the dimension actually leapt."""
+    n = 0
+    for e in outcomes:
+        if e.get("cycle_mode") != "leap":
+            continue
+        for a in (e.get("leap_attempts") or []):
+            if a.get("dimension") == dimension and a.get("delta_score", 1.0) < min_delta:
+                n += 1
+    return n
+
+
+def lock_target(outcomes: list, rubric: dict, leap_target: str | None) -> str | None:
+    """v1.8.0 dimension-lock: after a SUCCESSFUL leap whose target is still below
+    (bar − eps), return that target so evolve 2-G keeps the plateau active on it
+    (drive-to-bar) instead of coasting through feature cycles and re-rotating.
+    Returns None when nothing should be locked (no leap last, regressed, or the
+    target is at/near bar — the tolerance band stops critic variance from locking
+    the loop forever; the max_attempts HALT is the genuine-stuck backstop)."""
+    if not leap_target or not outcomes:
+        return None
+    last = outcomes[-1]
+    if last.get("cycle_mode") != "leap" or last.get("result_type") == "leap_regressed":
+        return None
+    bar = rubric.get("bar", DEFAULT_BAR)
+    eps = float(rubric.get("plateau_eps", DEFAULT_PLATEAU_EPS))
+    score = (last.get("dimension_scores") or {}).get(leap_target)
+    if score is None:
+        return None
+    return leap_target if score < bar - eps else None
+
+
 def compute(project: Path) -> dict:
     ev = Path(project) / "agent" / "state" / "evolve"
     config = _load(Path(project) / "config.json", {}) or {}
diff --git a/skills/evolve/SKILL.md b/skills/evolve/SKILL.md
diff --git a/tests/verify.py b/tests/verify.py

Original file line number	Diff line number	Diff line change
`@@ -1,7 +1,7 @@`
`1`	`1`	`{`
`2`	`2`	`"name": "ooda-loop",`
`3`	`3`	`"displayName": "OODA-loop",`
`4`		`- "version": "1.7.0",`
	`4`	`+ "version": "1.8.0",`
`5`	`5`	`"description": "An autonomous operations layer for your live side project. It watches, re-orients from which PRs you merge and reject, and opens small revertible PRs — bounded by a HALT file, protected paths, and a hard cost cap. Built on Boyd's OODA loop. You stay in command.",`
`6`	`6`	`"author": {`
`7`	`7`	`"name": "Taeil Ma",`