Commit 6377f41

Merge pull request #78 from mataeil/feat/v1.11.0-research-grounded
feat(v1.11.0): Research-Grounded OODA — the anti-maze methodology
2 parents 7bffee2 + f1481b7 commit 6377f41

7 files changed

Lines changed: 171 additions & 14 deletions

.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 {
   "name": "ooda-loop",
   "displayName": "OODA-loop",
-  "version": "1.10.1",
+  "version": "1.11.0",
   "description": "An autonomous operations layer for your live side project. It watches, re-orients from which PRs you merge and reject, and opens small revertible PRs — bounded by a HALT file, protected paths, and a hard cost cap. Built on Boyd's OODA loop. You stay in command.",
   "author": {
     "name": "Taeil Ma",

CHANGELOG.md

Lines changed: 32 additions & 0 deletions
@@ -8,6 +8,38 @@ independently. Bump there signals migration work for downstream projects.
 
 ---
 
+## [v1.11.0] — 2026-06-21
+
+### Added — Research-Grounded OODA (the anti-maze methodology)
+
+The f1 probe hit the classic failure: it **iterated without improving** (a
+local-optimum "maze") because generation was anchored to model priors, not
+external ground truth. A heavy external-research pass (graphics/physics/art +
+the methodology literature) produced a cited playbook AND this 5-part fix, each
+grounded in published work:
+
+- **Pre-generation research grounding (dev-cycle Step 3-PRE).** Before any
+  quality leap, resolve a reference (config.references / a researched playbook),
+  WebFetch the concrete block, derive acceptance criteria, THEN generate — and
+  record `grounded_in`. (AlphaCodium arXiv:2401.08500: a structured pre-stage
+  raised pass@5 19%→44%. AutoCodeRover; "concrete examples beat abstract specs".)
+- **Reference targets (config.references + config.research).** Named real-product
+  levels + reference-implementation URLs + a research playbook path; mirror into
+  principles.json as permanent memory.
+- **Reference comparison in the 5-G critic.** The critic names the ONE concrete
+  attribute of the reference the artifact lacks per axis (`per_axis_gap`) — an
+  actionable Observe signal, not "looks worse".
+- **Stall → REWRITE escalation (evolve 2-G + `recommend_rewrite()`).** When the
+  incremental LEAPS themselves stall (not just the artifact), escalate from patch
+  to a from-scratch REWRITE carrying a Reflexion (arXiv:2303.11366) negative-
+  example memo, BEFORE giving up to a HALT — the fix for the `sky.visible=false`
+  class (symptom patched on a wrong architecture).
+- **Diagnose/fix isolation on regression** (MASAI arXiv:2406.11638) noted for the
+  regression path.
+
+New deterministic helper `rubric_score.recommend_rewrite()` + test. verify.py
+63 → 64. plugin 1.10.1→1.11.0.
+
 ## [v1.10.1] — 2026-06-20
 
 ### Fixed/clarified — gate integrity (f1 probe, overnight)

config.example.json

Lines changed: 14 additions & 1 deletion
@@ -310,8 +310,21 @@
       }
     }
   },
+  "references": {
+    "__doc__": "v1.11.0 (anti-maze) — REFERENCE TARGETS that ground the loop in external ground truth instead of model priors. dev-cycle Step 3-PRE resolves the reference for a technique, WebFetches the concrete block, derives acceptance criteria, THEN generates (AlphaCodium pattern). The 5-G critic scores AGAINST these (not the artifact's own past). Permanent: also mirror into agent/state/evolve/principles.json so they survive episode rollover. Populate per project with named real-product levels and reference-implementation URLs.",
+    "visual": "https://threejs.org/examples/webgl_materials_car.html",
+    "physics": "https://github.com/spacejack/carphysics2d/blob/master/public/js/Car.js",
+    "camera": "https://github.com/mrdoob/Starter-Kit-Racing/blob/main/js/Camera.js",
+    "quality_floor": 0.7
+  },
+  "research": {
+    "__doc__": "v1.11.0 — a researched, CITED knowledge base the loop reads before leaping (the missing external-knowledge input that breaks the 'iterate without improving' maze). Built by a heavy web-research pass (graphics/physics/art/methodology) distilled into an actionable, sourced playbook. dev-cycle Step 3-PRE prefers a concrete playbook move (technique + parameters + source URL) over the model's first idea.",
+    "playbook_path": "agent/state/research/playbook.md",
+    "refresh_when_stalled": true
+  },
   "leap": {
-    "__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend. v1.8.0: lock_until_bar keeps leaping the SAME dimension until it clears bar (drive-to-good, not detect-and-nudge).",
+    "__doc__": "Quantum-leap cycles (evolve Steps 2-G/3-K, v1.7.0) — the fix for monotonic incrementalism (RICE structurally forbids overhauls). When the artifact plateaus BELOW bar, the next cycle is forced into LEAP mode: it overhauls the weakest dimension (step-change, not a new feature), bypassing pure RICE via a gap-to-bar bonus, with a larger size budget and an ARTIFACT-improvement gate instead of the unit-test gate. Safety: min_dimension_delta must be cleared or the leap is reverted; max_attempts_per_dimension failures escalate to HALT; cost/day caps bound spend. v1.8.0: lock_until_bar keeps leaping the SAME dimension until it clears bar (drive-to-good, not detect-and-nudge). v1.11.0: rewrite_on_stall — when recommend_rewrite() fires (incremental leaps THEMSELVES stalled), escalate from patch to a from-scratch REWRITE carrying a Reflexion negative-example memo, instead of thrashing to HALT.",
+    "rewrite_on_stall": true,
     "max_lines": 1500,
     "min_dimension_delta": 0.05,
     "max_attempts_per_dimension": 2,
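Note on the `references` / `research` keys above: a minimal Python sketch of how a caller might resolve the reference target for one domain. The helper name `resolve_reference` and the returned dict shape are assumptions made for illustration; the commit does not show the code that actually reads these keys. Only the key names and URLs come from the diff.

```python
import json

# Fragment mirroring the keys added to config.example.json in this commit
# (the long "__doc__" strings are shortened here).
CONFIG_FRAGMENT = """
{
  "references": {
    "__doc__": "reference targets (doc string shortened for this sketch)",
    "visual": "https://threejs.org/examples/webgl_materials_car.html",
    "physics": "https://github.com/spacejack/carphysics2d/blob/master/public/js/Car.js",
    "camera": "https://github.com/mrdoob/Starter-Kit-Racing/blob/main/js/Camera.js",
    "quality_floor": 0.7
  },
  "research": {
    "playbook_path": "agent/state/research/playbook.md",
    "refresh_when_stalled": true
  }
}
"""


def resolve_reference(config: dict, domain: str) -> dict:
    """Hypothetical helper: pick the reference target for one domain.

    Non-URL entries ("__doc__", "quality_floor") are never returned as a
    reference; a missing domain yields url=None so the caller can fall
    back to the research playbook.
    """
    refs = config.get("references", {})
    url = refs.get(domain)
    if domain.startswith("__") or not isinstance(url, str):
        url = None
    return {
        "domain": domain,
        "url": url,
        "playbook_path": config.get("research", {}).get("playbook_path"),
        "quality_floor": refs.get("quality_floor"),
    }


config = json.loads(CONFIG_FRAGMENT)
print(resolve_reference(config, "physics"))
print(resolve_reference(config, "audio"))  # no reference configured for this domain
```

Returning `url=None` rather than raising keeps the "prefer the playbook when no URL exists" fallback in the caller's hands, which seems closest to how Step 3-PRE is described.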

scripts/rubric_score.py

Lines changed: 29 additions & 0 deletions
@@ -287,6 +287,35 @@ def asset_ceiling_hit(dimension: dict, score) -> bool:
     return ceil is not None and score is not None and score >= ceil
 
 
+def recommend_rewrite(outcomes: list, dimension: str, rubric: dict, min_failed: int = 2) -> dict:
+    """v1.11.0 stall → REWRITE escalation (the anti-maze fix).
+
+    `detect_plateau` says "the artifact stalled → do a LEAP". This says something
+    sharper: "the incremental LEAPS THEMSELVES have stalled — stop patching the
+    same approach, start over." True when the dimension is plateaued AND
+    >= `min_failed` incremental leaps on it already failed to clear the plateau
+    epsilon. The Reflect step then queues a from-scratch REWRITE (not another
+    same-architecture patch) carrying a negative-example memo of the stalled
+    approach — Reflexion (arXiv:2303.11366) verbal episodic memory so the next
+    attempt explicitly avoids what stalled. This is the fix for the
+    `sky.visible=false` class: a symptom patched cycle after cycle while the root
+    cause (wrong IBL source) is preserved — exactly the f1 'iterate without
+    improving' maze."""
+    pl = detect_plateau(outcomes, rubric)
+    if not pl.get("plateau"):
+        return {"rewrite": False, "failed_leaps": 0, "reason": "not plateaued"}
+    eps = rubric.get("plateau_eps", DEFAULT_PLATEAU_EPS)
+    fails = failed_leaps(outcomes, dimension, eps)
+    do = fails >= min_failed
+    return {
+        "rewrite": do,
+        "failed_leaps": fails,
+        "reason": (f"{fails} incremental leaps stalled on '{dimension}' (>= {min_failed}) "
+                   f"→ rewrite, don't patch") if do
+        else f"incremental still viable ({fails} failed < {min_failed})",
+    }
+
+
 def lock_target(outcomes: list, rubric: dict, leap_target: str | None) -> str | None:
     """v1.8.0 dimension-lock: after a SUCCESSFUL leap whose target is still below
     (bar − eps), return that target so evolve 2-G keeps the plateau active on it
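Note on `recommend_rewrite()` above: the diff does not show `detect_plateau`, `failed_leaps`, or `DEFAULT_PLATEAU_EPS`, so the following runnable sketch substitutes simple stand-ins for all three. Their bodies and the `0.05` default are assumptions, chosen only to match the outcome shape used by the new check in tests/verify.py; the real helpers may differ.

```python
DEFAULT_PLATEAU_EPS = 0.05  # ASSUMED value; the real constant is not in this diff


def detect_plateau(outcomes: list, rubric: dict) -> dict:
    """Stand-in (assumption): plateau when the last `plateau_window` artifact
    scores all sit below the bar and vary by no more than `plateau_eps`."""
    window = rubric.get("plateau_window", 3)
    eps = rubric.get("plateau_eps", DEFAULT_PLATEAU_EPS)
    scores = [o["artifact_score"] for o in outcomes[-window:]]
    flat = len(scores) >= window and max(scores) - min(scores) <= eps
    return {"plateau": flat and max(scores) < rubric.get("bar", 1.0)}


def failed_leaps(outcomes: list, dimension: str, eps: float) -> int:
    """Stand-in (assumption): count leap attempts on `dimension` whose
    delta_score stayed under the plateau epsilon."""
    return sum(
        1
        for o in outcomes
        for a in o.get("leap_attempts", [])
        if a["dimension"] == dimension and a["delta_score"] < eps
    )


def recommend_rewrite(outcomes: list, dimension: str, rubric: dict, min_failed: int = 2) -> dict:
    """Same decision logic as the function added in this commit."""
    if not detect_plateau(outcomes, rubric).get("plateau"):
        return {"rewrite": False, "failed_leaps": 0, "reason": "not plateaued"}
    eps = rubric.get("plateau_eps", DEFAULT_PLATEAU_EPS)
    fails = failed_leaps(outcomes, dimension, eps)
    do = fails >= min_failed
    reason = (
        f"{fails} incremental leaps stalled on '{dimension}' (>= {min_failed}) → rewrite, don't patch"
        if do
        else f"incremental still viable ({fails} failed < {min_failed})"
    )
    return {"rewrite": do, "failed_leaps": fails, "reason": reason}


def outcome(delta: float, score: float = 0.50) -> dict:
    """Build one outcome entry in the shape used by tests/verify.py."""
    return {"artifact_score": score, "weakest_dimension": "v",
            "dimension_scores": {"v": score}, "cycle_mode": "leap",
            "leap_attempts": [{"dimension": "v", "delta_score": delta}]}


rubric = {"bar": 0.65, "plateau_window": 3, "plateau_eps": 0.05}
stalled = [outcome(0.01) for _ in range(3)]  # leaps keep missing the epsilon
viable = [outcome(0.10) for _ in range(3)]   # leaps still move the dimension
print(recommend_rewrite(stalled, "v", rubric))
print(recommend_rewrite(viable, "v", rubric))
```

Under these stand-ins the stalled series recommends a rewrite and the viable one does not, which is the same pair of outcomes the verify.py check asserts.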

skills/dev-cycle/SKILL.md

Lines changed: 34 additions & 2 deletions
@@ -196,8 +196,40 @@ Read context files before writing any code:
 3. Read `selected.source_domain` report (if referenced) to understand the
    motivation behind this action.
 
-Analyze what needs to change based on the action title and source report, then
-implement the changes (write and/or edit files).
+### Step 3-PRE: Research grounding (v1.11.0 — the anti-maze step)
+
+> **Why this exists.** The f1 dogfood proved the loop can "iterate without
+> improving" — a maze/local-optimum — when generation is anchored to the model's
+> own priors instead of to external ground truth. The fix (AlphaCodium
+> arXiv:2401.08500, which raised pass@5 19%→44% with a structured pre-generation
+> stage; AutoCodeRover; Simon Willison's "concrete examples beat abstract
+> requirements"): **ground every non-trivial change in an external reference
+> BEFORE writing code.** This is Boyd's Observe extended to the world's knowledge,
+> not just local state.
+
+For any leap / quality-improving / "make it better" action (skip for a trivial
+mechanical edit), BEFORE writing code:
+
+1. **Resolve a reference.** Read `config.references` (and `agent/state/research/*`
+   if present — a researched, cited playbook). Pick the reference target for this
+   technique/domain (e.g. a named real-product level, a reference implementation
+   URL, or a specific playbook move with its concrete API/parameters).
+2. **Fetch the concrete block.** WebFetch / curl the *specific* reference snippet
+   (the 30–50 lines that matter — the exact API calls, parameter values, order of
+   operations), not the whole repo. If a research playbook already contains the
+   cited concrete spec, use that.
+3. **Derive acceptance criteria** from the reference: "the implementation MUST
+   (a) call X with params Y, (b) produce effect Z visible at camera/probe C,
+   (c) not break the gate." Record them in the cycle's outcome as
+   `reference_block` + `acceptance_criteria`.
+4. **Only then generate** — implement the cited technique, adapting names to the
+   real code. The PR/outcome records WHICH reference grounded it (`grounded_in`).
+
+A leap with no `grounded_in` reference is a red flag for the maze: prefer
+researching a concrete approach over reaching for the model's first idea.
+
+Analyze what needs to change based on the action title, source report, **and the
+resolved reference block**, then implement the changes (write and/or edit files).
 
 **Protected paths enforcement:**
 
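Note on Step 3-PRE above: the four steps could be recorded roughly as follows. This is a sketch under stated assumptions, not the skill's implementation: `ground_action`, `is_maze_red_flag`, and the injectable `fetch` callable are hypothetical names, and the acceptance criteria are fixed placeholders (in the skill they are derived from the reference by the agent). Only the outcome field names `reference_block`, `acceptance_criteria`, and `grounded_in` come from the diff.

```python
from typing import Callable


def ground_action(domain: str, references: dict,
                  fetch: Callable[[str], str], max_lines: int = 50) -> dict:
    """Hypothetical sketch of Step 3-PRE: resolve, fetch, derive, record."""
    # 1. Resolve a reference for this domain (non-string entries are ignored).
    url = references.get(domain)
    if not isinstance(url, str):
        return {"grounded_in": None, "reference_block": None,
                "acceptance_criteria": []}

    # 2. Fetch only the concrete block: keep the first `max_lines` lines.
    #    (A real run would select the relevant 30-50 lines, not just the head.)
    block = "\n".join(fetch(url).splitlines()[:max_lines])

    # 3. Derive acceptance criteria. Placeholders only; the skill derives
    #    these from the fetched reference.
    criteria = [
        f"implements the technique shown in {url}",
        "produces the effect visible at the chosen camera/probe",
        "does not break the gate",
    ]

    # 4. Record what grounded the change so the outcome can be audited.
    return {"grounded_in": url, "reference_block": block,
            "acceptance_criteria": criteria}


def is_maze_red_flag(outcome: dict) -> bool:
    """A leap recorded without `grounded_in` is the red flag named above."""
    return outcome.get("cycle_mode") == "leap" and not outcome.get("grounded_in")


def fake_fetch(url: str) -> str:
    """Offline stand-in for WebFetch/curl so the sketch runs without network."""
    return "\n".join(f"// line {i} of {url}" for i in range(80))


refs = {"physics": "https://github.com/spacejack/carphysics2d/blob/master/public/js/Car.js",
        "quality_floor": 0.7}
grounded = ground_action("physics", refs, fake_fetch)
ungrounded = ground_action("audio", refs, fake_fetch)
print(grounded["grounded_in"], len(grounded["reference_block"].splitlines()))
print(is_maze_red_flag({"cycle_mode": "leap", **ungrounded}))
```

Passing `fetch` in as a parameter keeps the sketch testable offline; the skill itself names WebFetch or curl for that step.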

skills/evolve/SKILL.md

Lines changed: 43 additions & 10 deletions
@@ -742,13 +742,38 @@ for the implementation/build domain that declares config.domains[d].quality_rubr
         if a.dimension == p.leap_target
         AND a.delta_score < config.leap.min_dimension_delta)
   if fails >= config.leap.max_attempts_per_dimension:
-    record skill_gap { name: "leap_stuck_{p.leap_target}", type:"quality_gap",
-      detail:"Leap on {p.leap_target} failed {fails}× without clearing min delta —
-      likely UNMEASURABLE by the current capture_method (see 5-G)." }
-    Create HALT: "Leap on '{p.leap_target}' failed {fails}× — human review needed:
-      supply a richer capture_method/metrics harness for this
-      dimension, or reweight the rubric. The loop cannot self-fix it."
-    set plateau_leap_blocked = true
+    -- STALL → REWRITE escalation (v1.11.0, the anti-maze fix). Before giving up
+    -- to a HALT, try ONE from-scratch REWRITE: repeated incremental leaps that
+    -- stall are the signature of the maze (a symptom patched on a wrong
+    -- architecture — the f1 `sky.visible=false` class). A patch can't escape;
+    -- a rewrite can. rubric_score.recommend_rewrite() confirms the stall.
+    rw = rubric_score.recommend_rewrite(outcomes.entries, p.leap_target, rubric,
+           min_failed = config.leap.max_attempts_per_dimension)
+    already_rewrote = any(e.cycle_mode == "rewrite" AND e.leap_target == p.leap_target
+                          for e in last config.leap.max_attempts_per_dimension outcomes)
+    if config.leap.rewrite_on_stall AND rw.rewrite AND not already_rewrote:
+      -- Reflexion (arXiv:2303.11366): carry a NEGATIVE-EXAMPLE memo so the
+      -- rewrite explicitly avoids the stalled approach, and re-ground it
+      -- (dev-cycle Step 3-PRE) in config.references / the research playbook —
+      -- the rewrite must implement a CITED reference technique, not re-guess.
+      write memo { type:"stall_detected", dimension:p.leap_target,
+        stalled_approach: summary of the last {fails} leaps' diffs,
+        instruction:"Start from scratch on {p.leap_target}. The incremental
+          approach stalled. Do NOT reuse it. Ground the rewrite in
+          config.references[{domain}] / the research playbook (Step 3-PRE)." }
+      set orient.plateau = { active:true, leap_target:p.leap_target, mode:"rewrite",
+        weakest_dimension:p.weakest_dimension, artifact_score:p.latest,
+        reason:"incremental stalled → grounded REWRITE" }
+      Print "[Orient] ♻️ STALL→REWRITE: incremental leaps on '{p.leap_target}' stalled {fails}×. Next cycle REWRITES from a cited reference (not another patch)."
+    else:
+      record skill_gap { name: "leap_stuck_{p.leap_target}", type:"quality_gap",
+        detail:"Leap+rewrite on {p.leap_target} failed without clearing min delta —
+        likely UNMEASURABLE by the current capture_method (see 5-G)." }
+      Create HALT: "Leap+rewrite on '{p.leap_target}' stalled — human review needed:
+        supply a richer capture_method/metrics harness or authored
+        assets (asset_sources) for this dimension, or reweight the
+        rubric. The loop cannot self-fix it."
+      set plateau_leap_blocked = true
   else:
     set orient.plateau = { active:true, leap_target:p.leap_target,
       weakest_dimension:p.weakest_dimension,
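Note on the escalation pseudocode in the hunk above: its three-way outcome (keep leaping, one grounded rewrite, HALT) can be sketched as a small pure function. The name `next_step` and its string return values are assumptions made for illustration; the condition itself mirrors the pseudocode, including the "only ONE rewrite per stall" guard via `already_rewrote`.

```python
def next_step(fails: int, max_attempts: int, rewrite_on_stall: bool,
              rw: dict, recent_outcomes: list, leap_target: str) -> str:
    """Hypothetical sketch of the leap / rewrite / HALT decision."""
    if fails < max_attempts:
        return "leap"  # attempts remain on this dimension

    # Only the last `max_attempts` outcomes are inspected, as in the pseudocode.
    already_rewrote = any(
        e.get("cycle_mode") == "rewrite" and e.get("leap_target") == leap_target
        for e in recent_outcomes[-max_attempts:]
    )
    if rewrite_on_stall and rw.get("rewrite") and not already_rewrote:
        return "rewrite"  # one from-scratch attempt before giving up
    return "halt"         # the rewrite also stalled, or rewrites are disabled


stalled_rw = {"rewrite": True, "failed_leaps": 2}
history = [{"cycle_mode": "leap", "leap_target": "visual"}] * 2
print(next_step(2, 2, True, stalled_rw, history, "visual"))

after_rewrite = history + [{"cycle_mode": "rewrite", "leap_target": "visual"}]
print(next_step(2, 2, True, stalled_rw, after_rewrite, "visual"))
```

Because `already_rewrote` looks only at the most recent `max_attempts` outcomes, a rewrite that has scrolled out of that window no longer blocks a second one; whether that is intended is not stated in the diff.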
@@ -1907,9 +1932,17 @@ verdict = critic(
     'it exists and works' result is ~0.10 (score_0.10), NOT 0.5+. Worse than
     score_0.10 → below 0.10. Do not grade on a curve where 'a decent
     prototype' = good. ---
-    Score null only if the evidence is null. Output
-    {dimension_scores:{axis:score|null}, weakest_dimension, critique(<=30 words)}.",
-  input: { mission, rubric.dimensions (with reference anchors), evidence: dim_artifact }
+    --- REFERENCE COMPARISON (v1.11.0): you are also given `references` (the
+    reference target for this domain — a named real product and/or a
+    reference-implementation screenshot/spec). For each axis state the ONE
+    concrete attribute of the reference the artifact most lacks (name a
+    specific technique/parameter, e.g. 'no clearcoat on paint', 'camera has
+    no speed lead', 'tyres show no slip'). That gap-naming is the primary
+    Observe signal that drives the next leap — vague 'looks worse' is not
+    actionable; 'lacks X that reference has' is. ---
+    Score null only if the evidence is null. Output {dimension_scores:{axis:
+    score|null}, weakest_dimension, per_axis_gap:{axis:'lacks X'}, critique(<=30 words)}.",
+  input: { mission, rubric.dimensions (with reference anchors), config.references, evidence: dim_artifact }
 ) -> { dimension_scores, weakest_dimension, critique }
 
 -- ASSET CEILING + HAND-OFF (v1.9.0 → v1.10.0). For the targeted dimension, after
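Note on the critic output shape in the hunk above: a caller might check a verdict against the prompt's contract before trusting it. The sketch below is illustrative only. `validate_verdict` is a hypothetical name, and skipping the gap requirement for null-scored axes is an assumption (the prompt ties null scores to null evidence but does not say whether a gap is still expected). The field names and the 30-word limit come from the prompt text.

```python
def validate_verdict(verdict: dict) -> list:
    """Hypothetical check of a critic verdict; returns a list of problems."""
    problems = []
    scores = verdict.get("dimension_scores", {})
    gaps = verdict.get("per_axis_gap", {})

    for axis, score in scores.items():
        if score is None:
            continue  # ASSUMPTION: no gap is required where evidence was null
        if not str(gaps.get(axis, "")).strip():
            problems.append(f"{axis}: no per_axis_gap named")

    if verdict.get("weakest_dimension") not in scores:
        problems.append("weakest_dimension is not a scored axis")

    if len(str(verdict.get("critique", "")).split()) > 30:
        problems.append("critique exceeds 30 words")
    return problems


good = {
    "dimension_scores": {"visual": 0.35, "camera": 0.40, "physics": None},
    "weakest_dimension": "visual",
    "per_axis_gap": {"visual": "no clearcoat on paint",
                     "camera": "camera has no speed lead"},
    "critique": "Paint reads flat and the camera trails the car.",
}
vague = dict(good, per_axis_gap={"visual": "no clearcoat on paint"})
print(validate_verdict(good))
print(validate_verdict(vague))
```

The example gap strings are the ones quoted in the prompt itself; the scores and critique text are made up for the sketch.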

tests/verify.py

Lines changed: 18 additions & 0 deletions
@@ -752,6 +752,24 @@ def _mod(name, fn):
         f"with-assets ceiling={R.asset_ceiling(with_assets)} hit@0.36={R.asset_ceiling_hit(with_assets,0.36)}",
     )
 
+    # 13) v1.11.0 stall→REWRITE: when the artifact is plateaued AND the incremental
+    #     LEAPS themselves keep failing, escalate from patch to rewrite (the anti-maze
+    #     fix — the f1 'iterate without improving' loop). Plateaued-but-leaps-working
+    #     stays on incremental (leap, don't rewrite).
+    rubRW = R.rubric_of({"quality_rubric": {"bar": 0.65, "plateau_window": 3,
+                         "plateau_eps": 0.05, "dimensions": [{"name": "v", "weight": 1}]}})
+    stalled = [{"artifact_score": 0.50, "weakest_dimension": "v", "dimension_scores": {"v": 0.50},
+                "cycle_mode": "leap", "leap_attempts": [{"dimension": "v", "delta_score": 0.01}]}] * 3
+    viable = [{"artifact_score": 0.50, "weakest_dimension": "v", "dimension_scores": {"v": 0.50},
+               "cycle_mode": "leap", "leap_attempts": [{"dimension": "v", "delta_score": 0.10}]}] * 3
+    rw_s = R.recommend_rewrite(stalled, "v", rubRW)
+    rw_v = R.recommend_rewrite(viable, "v", rubRW)
+    r.check(
+        "artifact-axis: stalled incremental leaps escalate to REWRITE; working leaps stay incremental (v1.11.0)",
+        rw_s["rewrite"] is True and rw_s["failed_leaps"] >= 2 and rw_v["rewrite"] is False,
+        f"stalled→{rw_s}; viable→{rw_v}",
+    )
+
 
 def main() -> int:
     r = Runner()
