Merged
2 changes: 1 addition & 1 deletion .claude-plugin/plugin.json
@@ -1,7 +1,7 @@
{
"name": "ooda-loop",
"displayName": "OODA-loop",
-"version": "1.6.1",
+"version": "1.7.0",
"description": "An autonomous operations layer for your live side project. It watches, re-orients from which PRs you merge and reject, and opens small revertible PRs — bounded by a HALT file, protected paths, and a hard cost cap. Built on Boyd's OODA loop. You stay in command.",
"author": {
"name": "Taeil Ma",
Binary file added .claude/evidence/f1-AFTER-leap.png
Binary file added .claude/evidence/f1-BEFORE.png
160 changes: 160 additions & 0 deletions .claude/ooda-evolution-v1.7.0.md
@@ -0,0 +1,160 @@
# OODA-loop v1.7.0 — Artifact-Grounded Evaluation + Leap Cycles

**Date:** 2026-06-17
**Trigger:** F1-racing dogfood. 22 cycles, every cycle graded **A** (Loop 0.995),
futile **0%**, mission-hit **100%**. Yet the actual game is *dismal*:
a jagged Z-fighting cyan blob for a track, no recognizable car, flat untextured
world, polished HUD bolted over a broken core.

> The metrics said the loop was perfect. The artifact proved the metrics were
> lying. This is a **Goodhart collapse** — the loop optimized its own scoreboard,
> not the game.

This document diagnoses *why* the engine produced that outcome and specifies the
v1.7.0 evolution that fixes it.

---

## Part 1 — Diagnosis (the game as a mirror)

### Evidence
- `outcomes.json`: all 22 entries **identical** — `quality_multiplier: 0.5`,
`verifier_verdict: null`, `on_mission: true`. The evaluation layer recorded
the same number 22 times.
- `loop_scorecard.py:123` grade = `0.5*goal + 0.3*(1-futile) + 0.2*min(lv/0.5,1)`.
With goal=0.99 (self-written feature checklist), futile=0 (any commit), lv=0.5
(constant) → **0.995 = A, arithmetically guaranteed.**
- Screenshot (`f1-load.png`): glassmorphism HUD panels (lap / pos / minimap /
speed / ERS / tyre) sit over a broken 3D scene.
- `wc -l src/*.js`: 22 feature modules bolted onto one 415-line `main.js`. The
longest feature module is `track.js` (209 lines), the visual core, and it was
never revisited after the early cycles; 16 of 22 cycles added *new* peripheral modules.
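
The arithmetic behind the pinned grade can be checked directly. The sketch below is a minimal reconstruction of the formula as quoted from `loop_scorecard.py:123`; the function name `legacy_grade` is an assumption for illustration, not the script's actual API.

```python
def legacy_grade(goal: float, futile: float, lv: float) -> float:
    """Pre-v1.7.0 composite, as quoted in the evidence list above."""
    return 0.5 * goal + 0.3 * (1 - futile) + 0.2 * min(lv / 0.5, 1)


# Inputs from the F1 dogfood run: self-written checklist, no futile cycles,
# constant loop value.
score = legacy_grade(goal=0.99, futile=0.0, lv=0.5)
print(round(score, 3))  # 0.995
```

None of the three inputs depends on what the game looks like, so the result cannot move when the artifact degrades.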

### The four structural defects

**D1 — Goodhart collapse: the grade is mathematically pinned to A.**
None of the three grade terms measures whether the output is *good*. `goal` is a
self-authored checklist, `futile` only asks "did you commit anything," `lv` was a
constant. Any loop that (1) writes itself a feature list, (2) commits each cycle,
(3) records `pr_created` scores an A forever. (`loop_scorecard.py:114-127`)

**D2 — `quality_multiplier` is blind to the artifact.**
It is a pure function of *process state* (`pr_merged_held`=1.0 … `pr_created`=0.5
… `futile`=0.0). A beautiful feature and a broken one score **identically** if
both committed. The Reflect step's self-critique (`5-F`) critiques the *decision*,
never the *output*. The opt-in 7-B eval only checks "did you ship the feature you
declared" — goal-conformance, not artifact quality — and is *forbidden from
changing the score* (`7-B` honesty rule). So no signal that looks at the artifact
can ever move the number. (`score_outcome.py:22-59`, `evolve 6-C9`, `7-B`)
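
A minimal sketch of what "pure function of process state" means in practice. Only the three states quoted above are listed, because the source elides the rest; the dictionary, the function name, and the `looks_good` field are hypothetical stand-ins rather than the contents of `score_outcome.py`.

```python
# Only the states quoted in the text; intermediate states are elided in the source.
PROCESS_FACTOR = {"pr_merged_held": 1.0, "pr_created": 0.5, "futile": 0.0}


def legacy_quality_multiplier(outcome: str) -> float:
    """The score depends on the process outcome and nothing else."""
    return PROCESS_FACTOR[outcome]


# Two hypothetical cycles: same process outcome, very different artifacts.
polished = {"outcome": "pr_created", "looks_good": True}
broken = {"outcome": "pr_created", "looks_good": False}

same_score = (legacy_quality_multiplier(polished["outcome"])
              == legacy_quality_multiplier(broken["outcome"]))
print(same_score)  # True
```

The `looks_good` field is never read, which is the defect: no input to the function describes the artifact.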

**D3 — Monotonic incrementalism: no quantum leaps are structurally possible.**
RICE = `Reach×Impact×Confidence / Effort`. An overhaul ("rebuild the track
rendering," "make the car look like an F1 car") is high-effort + uncertain
confidence → **low RICE → never selected.** The cycle template says *"implement
ONE focused feature"*; brainstorming generates *"NEW mission-aligned features."*
There is no plateau detector, no leap trigger, no reference standard ("what does
a real F1 game look like?"), no cohesion/debt accounting. 22 cycles = 22 isolated
features, never a consolidation or a step-change.
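
To see why an overhaul loses under RICE, consider the sketch below. All numbers and candidate names are invented for illustration; only the formula itself comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    reach: float
    impact: float
    confidence: float
    effort: float

    def rice(self) -> float:
        # RICE = Reach x Impact x Confidence / Effort
        return self.reach * self.impact * self.confidence / self.effort


# Hypothetical backlog: one small safe feature, one high-impact overhaul.
candidates = [
    Candidate("add tyre-wear HUD widget", reach=10, impact=1, confidence=0.9, effort=1),
    Candidate("rebuild track rendering", reach=10, impact=3, confidence=0.4, effort=8),
]

best = max(candidates, key=Candidate.rice)
print(best.name)  # add tyre-wear HUD widget
```

Even with three times the impact, the overhaul scores 1.5 against the widget's 9.0, because high effort and low confidence both push it down. Under these assumed numbers it would never be picked.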

**D4 (deepest, non-obvious) — the testability gate created a perverse incentive.**
The *only* quality gate is `config.test_command = node tests/smoke.mjs`, which can
only assert **pure modules**. So the loop systematically chose work that *could*
be unit-tested (HUD widgets, gap math, ghost replay) and systematically avoided
work that *couldn't* (track mesh, car model, lighting, feel, fun) — exactly the
things that make an F1 game good. **Measurement didn't just fail to catch low
quality; it actively repelled the loop from quality.** The screenshot is the
proof: the testable shell is polished, the untestable core is broken.

### Root cause (one sentence)
> The loop measures **process** (did the machinery advance?) and is blind to the
> **artifact** (is the thing good?), and its action selection (RICE + feature
> template) can only ever take small safe steps — so it optimized the scoreboard
> while the game rotted, and could never re-found the broken core.

---

## Part 2 — The v1.7.0 Evolution

Three coordinated mechanisms. Each maps to a user complaint.

> "Evaluation of the output is thin / self-feedback is weak" → **M1 + M2**
> "There is no quantum jump at inflection points" → **M3**

### M1 — Artifact Critique (a real, independent critic) → fixes D2, D4
New Reflect sub-step **5-G**. After a build cycle produces output, an
**independent** evaluator (reuses the 7-B separate-model infrastructure) *engages
the actual artifact* — for a web app it renders it (screenshot) and exercises it;
for a library it calls the API — and scores it against a **mission rubric**
(`config.quality_rubric`), producing:
- `artifact_score ∈ [0,1]` — grounded, adversarial, **allowed to be low**.
- `dimension_scores` per rubric axis, and the single `weakest_dimension`.
- a one-line critique.

The critic is *not the maker* (independent context) and *must cite evidence*
(what it saw), satisfying the "don't grade your own work" principle while finally
measuring quality. Rubric for the game: `visual_fidelity, driving_feel,
fun_challenge, cohesion, performance, robustness`, each weighted, with a target
`bar` (e.g. 0.7).
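
One way the rubric aggregation could work is a weighted mean over the dimensions, with the lowest-scoring dimension reported as `weakest_dimension`. This is a sketch under assumptions: the weights, the sample scores, and the function name `aggregate` are hypothetical, and the source does not say whether "weakest" means lowest raw score or largest weighted gap.

```python
def aggregate(dimension_scores: dict, weights: dict) -> tuple:
    """Return (artifact_score, weakest_dimension) for one critique."""
    total_weight = sum(weights[d] for d in dimension_scores)
    artifact_score = sum(
        dimension_scores[d] * weights[d] for d in dimension_scores
    ) / total_weight
    # Assumption: "weakest" means the lowest raw dimension score.
    weakest = min(dimension_scores, key=dimension_scores.get)
    return artifact_score, weakest


# Hypothetical critic output for the game; dimension names are from the text.
scores = {"visual_fidelity": 0.1, "driving_feel": 0.3, "fun_challenge": 0.2,
          "cohesion": 0.2, "performance": 0.6, "robustness": 0.7}
weights = {"visual_fidelity": 3, "driving_feel": 2, "fun_challenge": 2,
           "cohesion": 1, "performance": 1, "robustness": 1}

artifact_score, weakest = aggregate(scores, weights)
below_bar = artifact_score < 0.7  # bar from the example in the text
print(weakest, below_bar)
```

Keeping the aggregation deterministic means the only model-dependent step is the per-dimension scoring itself.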

### M2 — Honest scoring (kill Goodhart) → fixes D1
- `score_outcome.py`: `quality_multiplier = process_factor × artifact_factor`.
A `pr_created` (0.5 process) with `artifact_score 0.2` now scores **~0.1**, not
0.5. The artifact axis finally modulates the number. (Reframes the 7-B honesty
rule: artifact quality is *independent + grounded*, so it is allowed to move the
score; the maker's *self*-opinion still cannot.)
- `loop_scorecard.py grade()`: replace the gameable self-`goal` weight with
**artifact_quality**. New composite:
`0.45*artifact_quality + 0.25*goal_progress + 0.20*(1-futile) + 0.10*min(lv/0.5,1)`.
- **Goodhart Guard:** if process metrics are green (futile≈0, goal high) but
`artifact_quality < bar`, **cap the grade at C** and print
`⚠ MEASUREMENT WARNING: process green, artifact below bar — the scoreboard is lying.`
Surface `artifact_quality` as the headline KPI, above Loop Value.
- Goal progress requires **evidence** (an artifact_score crossing a bar), not
self-assertion.
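
The new composite and the Goodhart Guard can be sketched as follows. The weights are the ones listed above. The letter cutoffs and the thresholds for "process green" are not given in the source and are assumptions here, so the letters this sketch produces need not match the shipped scorecard.

```python
GRADES = "ABCDF"
# Hypothetical letter cutoffs; the source does not state them.
CUTOFFS = [(0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")]


def composite(artifact_quality, goal_progress, futile, lv):
    """New composite with the weights listed above."""
    return (0.45 * artifact_quality + 0.25 * goal_progress
            + 0.20 * (1 - futile) + 0.10 * min(lv / 0.5, 1))


def letter(score):
    for cutoff, grade in CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"


def guarded_grade(artifact_quality, goal_progress, futile, lv, bar):
    grade = letter(composite(artifact_quality, goal_progress, futile, lv))
    # Assumed thresholds for "process metrics are green".
    process_green = futile <= 0.05 and goal_progress >= 0.8
    if process_green and artifact_quality < bar:
        print("MEASUREMENT WARNING: process green, artifact below bar")
        # Cap at C: never better than C, while a lower grade stays lower.
        grade = GRADES[max(GRADES.index(grade), GRADES.index("C"))]
    return grade


print(guarded_grade(0.65, 0.99, 0.0, 0.5, bar=0.7))  # C (would be B uncapped)
```

The cap matters most just below the bar: an artifact at 0.65 would otherwise earn a B from strong process terms alone.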

### M3 — Leap Cycles (quantum jumps at inflection points) → fixes D3, D4
- **Plateau detection** (Orient, new **2-G**): track `artifact_score` across recent
build cycles. If it hasn't improved by ≥ `ε` over the last `N` build cycles
*despite shipping*, OR the same `weakest_dimension` has stayed weakest for `N`
cycles → declare a **plateau**.
- **Leap trigger** (Decide, new **3-K**): on plateau — or every `K` cycles, or
while `artifact_quality < bar` after warmup — the next cycle is forced into
**LEAP mode** instead of FEATURE mode. A leap cycle:
- does **not** add a feature; it makes a step-change on the `weakest_dimension`
(overhaul / rebuild / refactor-for-cohesion / raise the bar);
- is selected by **biggest gap-to-bar**, *bypassing pure RICE* (the mechanism
that structurally forbade overhauls);
- is allowed a larger diff (own size budget `config.leap.max_lines`);
- is verified by the **artifact critique** (must raise the targeted dimension),
not only the narrow unit-test gate — so untestable-but-vital work is first
class at last (directly undoing D4).
- Brainstorming must now also propose **quality/overhaul** items scored by
gap-to-bar, not only features.
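
The plateau rule and the gap-to-bar selection above can be sketched in a few lines. The function names, the argument shapes (oldest entry first), and the reading of "improved over the last `N` cycles" as last-minus-first within the window are all assumptions.

```python
def is_plateau(scores, weakest, n, eps):
    """scores / weakest: per-build-cycle history, oldest first."""
    if len(scores) < n:
        return False  # not enough history to judge
    # Condition 1: artifact_score gained less than eps across the window.
    stalled = scores[-1] - scores[-n] < eps
    # Condition 2: the same dimension was weakest for the whole window.
    stuck = len(weakest) >= n and len(set(weakest[-n:])) == 1
    return stalled or stuck


def leap_target(dimension_scores, bar):
    """Pick the dimension with the biggest gap to the bar."""
    return max(dimension_scores, key=lambda d: bar - dimension_scores[d])


history = [0.20, 0.21, 0.22, 0.22]  # hypothetical artifact scores
weakest_history = ["visual_fidelity"] * 4
if is_plateau(history, weakest_history, n=4, eps=0.05):
    target = leap_target({"visual_fidelity": 0.1, "performance": 0.6}, bar=0.7)
    print("LEAP on", target)  # LEAP on visual_fidelity
```

With a single shared bar, biggest gap-to-bar is the same as lowest score; the two would diverge only if each dimension carried its own bar.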

### Safety (autonomous operation)
Leap mode loosens size limits and can bypass the unit-test gate — that is risky
under unattended Level-3 autonomy. Guards:
- Leap cycles still honor HALT, cost cap, protected_paths, and **must** pass the
artifact critique with a *measured improvement* on the targeted dimension or the
change is reverted (rollback protocol 4-C2).
- `config.leap.max_per_day` caps leaps; a leap that fails to improve twice running
on the same dimension escalates to a `skill_gap`/HALT instead of looping.
- The artifact critic is independent + evidence-citing → it cannot rubber-stamp.
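
The revert-or-escalate rule could look like the sketch below. The function name, the return labels, and the `min_delta` parameter are hypothetical; the text only requires a measured improvement on the targeted dimension and an escalation after two consecutive failures on the same dimension.

```python
def judge_leap(before: float, after: float, min_delta: float,
               prior_failures: int, max_failures: int = 2) -> str:
    """Decide what happens to a leap after the artifact critique.

    before / after: score of the targeted dimension around the leap.
    prior_failures: consecutive failed leaps already recorded on it.
    """
    if after - before >= min_delta:
        return "keep"
    if prior_failures + 1 >= max_failures:
        # Second failure running: revert and stop looping.
        return "revert_and_escalate"
    return "revert"


print(judge_leap(before=0.2, after=0.4, min_delta=0.05, prior_failures=0))   # keep
print(judge_leap(before=0.2, after=0.21, min_delta=0.05, prior_failures=0))  # revert
print(judge_leap(before=0.2, after=0.21, min_delta=0.05, prior_failures=1))  # revert_and_escalate
```

HALT, the cost cap, and `protected_paths` are checked elsewhere and are deliberately left out of this sketch.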

---

## Part 3 — Implementation surface
- `scripts/rubric_score.py` (new) — deterministic rubric aggregation + bar/Goodhart logic + plateau detector.
- `scripts/score_outcome.py` — fold `artifact_score` into `quality_multiplier`.
- `scripts/loop_scorecard.py` — artifact_quality KPI, honest grade, Goodhart Guard.
- `skills/evolve/SKILL.md` — new 5-G, 2-G, 3-K; reframe 6-C9 + 7-B; evidence-based 2-C.
- `skills/dev-cycle/SKILL.md` — artifact gate + LEAP mode (size + RICE bypass).
- `config.example.json` — `quality_rubric`, `leap` blocks.
- `agent/state/evolve/CHANGELOG.md` + version → **v1.7.0**.

## Part 4 — Proof obligations (the fix must have teeth)
1. Re-grade the F1 game with the new scorecard → A must drop to a realistic
low grade (D/F) once `artifact_quality` (≈0.2 from the screenshot) is folded in.
2. Run ONE real **leap cycle** on the game that overhauls the broken visual core,
re-score the artifact, and show the dimension actually improved — a demonstrated
quantum jump, the thing the loop could never do before.
46 changes: 46 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,52 @@ independently. Bump there signals migration work for downstream projects.

---

## [v1.7.0] — 2026-06-17

### Added — artifact-grounded evaluation + quantum-leap cycles (config schema 1.3.0)

The F1-racing dogfood (22 cycles, every cycle graded **A** / futile 0% /
mission-hit 100%) produced a *dismal* game: a Z-fighting cyan blob for a
track, no recognizable car, polished HUD bolted over a broken core. The metrics
were perfect; the artifact proved them a lie — a **Goodhart collapse**. Root
cause: the loop measured **process** (did a PR/commit advance?) and was blind to
the **artifact** (is the thing good?), and its action selection (RICE + "one
focused feature") could only ever take small steps. Diagnosis + design:
`.claude/ooda-evolution-v1.7.0.md`.

- **Artifact axis in scoring (fixes D2).** `quality_multiplier = process_factor ×
artifact_factor` (`scripts/score_outcome.py`). A `pr_created` (0.5) whose
artifact scores 0.4 now records **0.2**, not 0.5. `artifact_score` is a
first-class field in `outcomes.json` (Step 6-C9). No rubric → factor 1.0
(process-only loops unchanged).
- **Step 5-G: Artifact Critique.** An *independent*, evidence-grounded critic
(separate model context) captures the real artifact (screenshot / API call /
benchmark per `quality_rubric.capture_method`) and scores it against a
HUMAN-AUTHORED, integrity-checked rubric — the loop may never author its own
grading standard.
- **Honest scorecard (fixes D1).** `scripts/loop_scorecard.py`: `★ Artifact
Quality` is the new headline KPI; the self-declared goal term is **evidence-
weighted** by artifact reality; an **artifact-only Goodhart Guard** caps the
grade (graduated C/D/F) and prints a measurement warning when artifact < bar.
The F1 run re-grades **A (0.995) → D (0.567)**.
- **Quantum-leap cycles (fixes D3).** Step 2-G plateau detector + Step 3-K
Leap-Mode Gate: when artifact quality plateaus *below bar*, the next cycle is
forced to **overhaul the weakest dimension** (step-change, RICE bypassed via a
gap-to-bar bonus, larger size budget) instead of adding another feature.
- **Leap safety.** Pre-PR artifact gate with a checkpoint baseline; revert on
`min_dimension_delta` miss (→ `leap_regressed`, quality 0.0); thrashing
escalates to HALT after `max_attempts_per_dimension`; per-leap cost cap +
`max_per_day`; protected-path diff-time check. Hardened by a 5-agent adversarial
red-team (gaming-resistance / autonomous-safety / implementability).
- **`on_mission` is now a real signal** for build cycles (`artifact_score >= bar`)
instead of a static config echo.
- **Config:** new `quality_rubric` (per-domain, canonical) + `leap` blocks in
`config.example.json`. **Tests:** new `scripts/rubric_score.py` (pure) + 8 new
`tests/verify.py` checks (now 58 passing; the previously-unregistered
`scorecard` suite is wired back in).
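
The multiplier arithmetic and the new `on_mission` rule from the list above can be sketched as follows. It assumes `artifact_factor` equals `artifact_score` when a rubric is present; the function names and signatures are illustrative, not those of `scripts/score_outcome.py`.

```python
from typing import Optional


def quality_multiplier(process_factor: float,
                       artifact_score: Optional[float]) -> float:
    """process_factor x artifact_factor; no rubric means factor 1.0."""
    artifact_factor = 1.0 if artifact_score is None else artifact_score
    return process_factor * artifact_factor


def on_mission(artifact_score: float, bar: float) -> bool:
    """Build cycles are on mission only when the artifact clears the bar."""
    return artifact_score >= bar


print(quality_multiplier(0.5, 0.4))   # 0.2
print(quality_multiplier(0.5, None))  # 0.5
print(on_mission(0.4, bar=0.65))      # False
```

Treating a missing rubric as factor 1.0 is what keeps process-only loops unchanged.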

---

## [v1.6.1] — 2026-06-14

### Added / clarified — plugin namespacing + cloud routine recipe
23 changes: 21 additions & 2 deletions config.example.json
@@ -1,5 +1,5 @@
{
-"schema_version": "1.2.0",
+"schema_version": "1.3.0",
"project": {
"name": "my-app",
"locale": "en",
Expand Down Expand Up @@ -260,7 +260,26 @@
"pr_merged",
"action_extracted"
],
-"__doc__": "Opt-in maker/checker layer (evolve Step 7-B). When enabled, a SEPARATE model independently judges whether each gradeable cycle achieved its declared goal and writes {achieved,reason,confidence} into outcomes.json verifier_verdict. The deterministic quality_multiplier (Step 6-C9) is always the scorecard's ground truth; this is a second opinion, never an override. Default off = zero extra cost."
+"__doc__": "Opt-in maker/checker layer (evolve Step 7-B). When enabled, a SEPARATE model independently judges whether each gradeable cycle achieved its declared GOAL (goal-conformance) and writes {achieved,reason,confidence} into outcomes.json verifier_verdict. Never overrides the score — a second opinion. Distinct from quality_rubric/Step 5-G, which judges whether the ARTIFACT is good and DOES move the score. Default off = zero extra cost."
},
"quality_rubric": {
"__doc__": "ARTIFACT-quality axis (evolve Step 5-G, v1.7.0) — the fix for the dogfood failure where every cycle scored 0.5 / graded A while the built thing was broken, because nothing measured the artifact. CANONICAL placement is PER-DOMAIN: config.domains[<build domain>].quality_rubric (evolve is domain-agnostic). This top-level block is the single-domain fallback. Each cycle that produces an artifact, an INDEPENDENT critic (separate model context) captures the real artifact via capture_method and scores each dimension 0..1; rubric_score.py aggregates to artifact_score, which MULTIPLIES the process score in 6-C9 and drives the Goodhart Guard + LEAP trigger. The rubric is HUMAN-AUTHORED and read-only to the loop — add 'quality_rubric' / the config path to safety.protected_paths so the loop can never write its own grading standard (gaming-resistance). Empty dimensions = artifact axis OFF (process-only scoring, back-compat).",
"bar": 0.65,
"capture_method": "screenshot",
"capture_command": "<serve + screenshot for web UIs | run + capture stdout for api_call | run benchmark>",
"plateau_window": 4,
"plateau_eps": 0.05,
"locked": true,
"dimensions": []
},
"leap": {
"__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend.",
"max_lines": 1500,
"min_dimension_delta": 0.05,
"max_attempts_per_dimension": 2,
"max_per_day": 2,
"gap_weight": 30.0,
"cost_limit_usd": 0.5
},
"goal_completion_idle": true
}
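
The `quality_rubric` doc string describes a per-domain rubric with a top-level fallback, and treats an empty `dimensions` list as "artifact axis off". A minimal sketch of that lookup follows. It assumes `config["domains"]` is a mapping keyed by domain name and that each dimension entry is a dict with `name` and `weight`; neither shape is confirmed by the diff.

```python
def resolve_rubric(config: dict, domain: str) -> tuple:
    """Return (rubric, enabled) for a build domain.

    Per-domain rubric wins; the top-level block is the fallback.
    """
    per_domain = config.get("domains", {}).get(domain, {}).get("quality_rubric")
    rubric = per_domain if per_domain is not None else config.get("quality_rubric", {})
    # Empty or missing dimensions means process-only scoring.
    enabled = bool(rubric.get("dimensions"))
    return rubric, enabled


# Hypothetical config: the top-level block mirrors the example defaults,
# and the "game" domain supplies its own dimensions.
config = {
    "quality_rubric": {"bar": 0.65, "dimensions": []},
    "domains": {
        "game": {"quality_rubric": {
            "bar": 0.7,
            "dimensions": [{"name": "visual_fidelity", "weight": 3}],
        }},
        "docs": {},
    },
}

print(resolve_rubric(config, "game")[1])  # True
print(resolve_rubric(config, "docs")[1])  # False
```

With the shipped default of `"dimensions": []`, the fallback resolves to disabled, which matches the back-compat note in the doc string.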