|
| 1 | +# OODA-loop v1.8.0 — driving quality to "good", not "passable" |
| 2 | + |
| 3 | +**Date:** 2026-06-19 |
| 4 | +**Purpose reminder:** the f1-racing game is a **probe**, not the product. Its job |
| 5 | +is to expose what OODA-loop still gets wrong. This release is about the loop. |
| 6 | + |
| 7 | +## The probe's verdict (why we believe the loop is still the bottleneck) |
| 8 | + |
| 9 | +v1.7.0 fixed *measurement* (artifact axis) and *anti-incrementalism* (leap |
| 10 | +cycles). It worked — the loop went from a flat, lying A to a climbing, honest D. |
| 11 | +But after **three** real leap cycles the owner's verdict is unchanged: *still too |
| 12 | +crappy.* The data says why: |
| 13 | + |
| 14 | +| | artifact_quality | note | |
| 15 | +|---|---|---| |
| 16 | +| 22 feature cycles (old loop) | **0.394** (flat) | Goodhart collapse | |
| 17 | +| + leap 1 (visual) | 0.447 | +0.053 | |
| 18 | +| + leap 2 (track) | 0.472 | +0.025 | |
| 19 | +| + leap 3 (car/cockpit) | 0.522 | +0.050 | |
| 20 | +| bar (shippable) | 0.65 | not reached | |
| 21 | +| "genuinely good" | ~0.80 | far off | |
| 22 | + |
| 23 | +Average **~+0.04 per leap**. At that rate "good" is ~7 leaps away. The loop |
| 24 | +*climbs* but it climbs **too slowly and stops too low**. Three concrete failures |
| 25 | +observed during the campaign point at the loop, not the game: |
| 26 | + |
| 27 | +1. **It settles for "+0.05 better."** A leap's success test is |
| 28 | + `min_dimension_delta` (0.05) — so the moment a dimension nudges up, the loop |
| 29 | + declares victory and moves the weighted-gap target elsewhere, abandoning a |
| 30 | + dimension while it is still *bad* (visual went 0.22→0.41→0.59 across two |
| 31 | + separate leaps with other work in between; it was never driven to the bar in |
| 32 | + one sustained push). |
| 33 | +2. **It accepts partial implementation of its own plan.** Leap 3's design panel |
| 34 | + specced car/cockpit **and** materials/lighting **and** environment; only |
| 35 | + car/cockpit shipped. The dropped materials/lighting (shadows, tone-mapping, |
| 36 | + textures) was the *single most-repeated* critic complaint at every leap — and |
| 37 | + it silently vanished. The act step under-delivers with no completion check. |
| 38 | +3. **It can't see motion, feel, or fun.** The critic scores one still |
| 39 | + screenshot, so `driving_feel` and `fun_challenge` are unmeasurable and |
| 40 | + unimprovable — they sit at 0.38–0.51 and cap the weighted mean. |
| 41 | + |
| 42 | +## Root cause (from a 13-agent, adversarially-verified diagnosis) |
| 43 | + |
| 44 | +> The loop **detects** a quality gap and takes *a* step at it, but is not built to |
| 45 | +> **close** it. It (1) can't see ~45% of its own rubric — `driving_feel` + `fun_challenge` |
| 46 | +> were scored from a still screenshot and sat frozen across all 25 cycles; (2) |
| 47 | +> abandons a dimension after a +0.05 nudge and rotates targets instead of driving |
| 48 | +> one to the bar; (3) had a **silently broken thrashing guard** (read a nonexistent |
| 49 | +> `leap_delta` field → never fired → could thrash forever); and (4) accepts partial |
| 50 | +> implementation of its own leap plans, orphaning the rest. It optimizes to |
| 51 | +> "better," never to "good." |
| 52 | +
|
| 53 | +The devil's-advocate agent confirmed the leap *routing* is sound — the binding |
| 54 | +constraint is **perception** (the critic can't measure feel/fun), not more leap |
| 55 | +machinery. That reframed the fix. |
| 56 | + |
| 57 | +## The v1.8.0 upgrades (ranked by leverage) |
| 58 | + |
| 59 | +1. **Fix the thrashing guard (prerequisite, real bug).** 2-G now counts |
| 60 | + `leap_attempts[].delta_score` on the `leap_target` (was: nonexistent `leap_delta` |
| 61 | + on `weakest_dimension`). The HALT safety valve actually fires now — |
| 62 | + `rubric_score.failed_leaps()` is the deterministic ground truth. |
| 63 | +2. **Per-dimension `capture_method` (5-G).** Each rubric axis declares how its |
| 64 | + evidence is captured; experiential axes use `gameplay_metrics` — a |
| 65 | + HUMAN-AUTHORED, hash-verified, protected harness (same independence invariant |
| 66 | + as the rubric hash). Missing/unverified harness → the axis scores `null` |
| 67 | + (capture_failure) + a skill_gap, never a faked or screenshot-fallback score. |
| 68 | + Unlocks the frozen 45% of rubric weight. |
| 69 | +3. **Dimension lock until bar (2-G).** After a successful leap whose target is |
| 70 | + still below `bar − eps`, keep the plateau active on the SAME target so 3-K |
| 71 | + leaps it again next cycle — drive-to-bar, not detect-and-nudge. A tolerance |
| 72 | + band + the (now-working) max-attempts HALT prevent infinite lock. |
| 73 | + `rubric_score.lock_target()` is the deterministic ground truth; |
| 74 | + `config.leap.lock_until_bar` (default true) toggles it. |
| 75 | +4. **Auto-queue the remainder (5-G).** If a leap passes its delta gate but the |
| 76 | + dimension is still below bar, the *independent critic's* score (not the |
| 77 | + maker's self-report) queues a high-RICE remainder so dropped scope can't |
| 78 | + vanish. Records `leap_dim_still_below_bar`. |
| 79 | + |
| 80 | +All four are deterministic where possible (`scripts/rubric_score.py`), config- |
| 81 | +driven, and gaming-resistant. `tests/verify.py` 59 → **61**. |
| 82 | + |
| 83 | +## What we deliberately did NOT change (rejected / devil's-advocate-validated) |
| 84 | + |
| 85 | +- **Don't raise the bar to 0.80 yet.** It only re-points the weighted-gap target |
| 86 | + at an unmeasurable dimension and (pre-fix) would HALT the loop. Raise only after |
| 87 | + a `gameplay_metrics` leap proves feel/fun are responsive. Stage 0.65→0.75→0.80. |
| 88 | +- **No inner multi-pass refine loop (yet).** The 3 completed leaps each landed |
| 89 | + substantial single-pass deltas (+0.19/+0.27/+0.18); the cap was perception, not |
| 90 | + pass-count. Adding passes on unmeasurable dimensions = cost with no signal. |
| 91 | +- **No LLM-component-coverage gate.** Gameable (touch one file per "component" → |
| 92 | + 100%). Change 4's critic-driven remainder is the robust substitute. |
| 93 | +- **No `multi_probe` still-sequence.** A burst of stills still can't tell |
| 94 | + responsive steering from sluggish; `gameplay_metrics` is the right instrument. |
| 95 | + |
| 96 | +## Validation (leap 4 under the upgraded loop) |
| 97 | + |
| 98 | +Ran a real leap under v1.8.0, targeting `visual_fidelity` (dimension lock kept it |
| 99 | +as the target). It shipped exactly the materials/lighting that leap 3 dropped — |
| 100 | +ACES filmic tone mapping, soft directional shadows (cars/buildings/kerbs cast, |
| 101 | +road/ground receive), and baked contact shadows under every car. |
| 102 | + |
| 103 | +- Independent re-critique: **visual_fidelity 0.59 → 0.63** (Δ +0.04). Building |
| 104 | + shadows and tonal range are now visible; objects are grounded. |
| 105 | +- Artifact trajectory across all leaps: **0.394 → 0.447 → 0.472 → 0.522 → 0.533**. |
| 106 | + |
| 107 | +**The decisive finding (the whole point — f1 is the probe):** +0.04 is *at the |
| 108 | +gate boundary*, and the critic still under-credits the change — because |
| 109 | +`visual_fidelity` has hit the **ceiling of what a still screenshot can measure |
| 110 | +(~0.63)**. The loop's next weighted-gap target is `fun_challenge` (0.38), which a |
| 111 | +screenshot *cannot* judge at all. So the upgraded loop's correct next move is |
| 112 | +exactly what v1.8.0 built: with the **fixed** thrashing guard, two failed leaps on |
| 113 | +`fun_challenge` → **HALT requesting a human-authored `gameplay_metrics` harness**, |
| 114 | +instead of thrashing forever (the v1.7.x bug) or faking a score. |
| 115 | + |
| 116 | +The bottleneck has *moved*: from "the loop can't close gaps" (v1.7.0) to "the loop |
| 117 | +can't **perceive** the experiential half of quality" (now). v1.8.0's per-dimension |
| 118 | +`capture_method` is the mechanism to close it — but it requires a human to author |
| 119 | +the measurement harness, because the loop is forbidden (by design) from grading |
| 120 | +its own standard. **That hand-off is the next experiment.** |
0 commit comments