Nightly 2026-04-26 v2 — 9 productive cycles, +0 code-driven goals (corpus-state isolated)#150
Merged
Nightly 2026-04-26 v2 — 9 productive cycles, +0 code-driven goals (corpus-state isolated)#150
Conversation
…Tags column Adds an optional Tags column to the GOALS.md gates table, parses it through to Goal.Tags, and propagates the tags into the Measurement JSON output of `ao goals measure`. Tags are comma- or semicolon-separated, lowercased, and empty entries are dropped. Tags `flywheel-compounding` (w=8) with `long-cycle, corpus-state` so operator tooling (e.g. /evolve selection) can recognize that this gate's green path depends on multi-session corpus growth, not the current commit. Marks the two compile-* gates with `runtime-artifact` for the same reason — their green path requires `ao defrag` (or Dream's overnight preview) to have written `.agents/.../defrag/latest.json` locally, and that file is gitignored. Backwards-compatible: gates without a Tags column continue to parse as before and emit `tags: null` in JSON output. Tests: - TestParseTagsCell (8 cases — empty, single, comma/semicolon, backticks, case-folding, dropped-empties) - TestParseGateRow_TagsHeaderColumn — full row with tags - TestParseGateRow_TagsAbsentByDefault — no Tags header → nil Tags - TestParseGatesTable_TagsRoundTrip — two-row table with mixed tags
…sserts The fuzz target proves no-panic but never asserts the parser populates chain.ID, chain.EpicID, or chain.Entries correctly for its seed corpus. A regression that silently dropped entries would still pass. Add TestFuzzParseChainLines_SeedCorrectness covering all 7 seeds: - metadata-plus-one-entry → ID + 1 entry, first step "research" - metadata-only → ID set, 0 entries - empty input → no error, no entries - non-JSON first line → returns error - malformed entry between valid ones → entry skipped, valid one survives - empty metadata object → no error, empty fields - epic_id with two entries → ID + EpicID + 2 entries Closes the post-mortem #6 finding "Add fuzz seed correctness assertions" for the last fuzz target that lacked one (cli/cmd/ao already had companion SeedCorrectness tests for fuzz_jsonl_test.go and fuzz_context_test.go).
The vibe complexity step previously emitted both Python and Go analyzer blocks unconditionally. For docs/shell/BATS-only epics this caused gocyclo to walk the entire cli/ tree and hang (real harm reported in post-mortem). Add a language-detection preflight that sets HAS_GO and HAS_PY from the diff (or from <path> when no diff base is available), then guard the radon and gocyclo blocks behind those flags. When a language is absent, emit `ℹ️ COMPLEXITY SKIPPED: no .<ext> files in diff` instead of running the analyzer. heal --strict and codex-parity audit both still pass. Closes the council-finding next-work item: "Add language filter to vibe complexity analysis" evidence: gocyclo hung during vibe for a docs/shell/BATS-only epic
…aged in evolve-coverage-delta.sh's grep accepted any "coverage: N" prefix and silently consumed Go's event-coverage informational line — emitted by packages with no _test.go files when the run uses -coverpkg, e.g. cmd/ao prints `coverage: 1/12 events (informational)`. The "1" was extracted as 1.0% and folded into the project average, dragging it down by tens of points. Tighten the regex to `coverage: [0-9]+(\.[0-9]+)?%` so only true percent-format lines participate in the average. Verified by isolated parser test (./tests/scripts/evolve-coverage-delta.bats): Input combining 1/12 events + 76.5% + 88.2% lines now averages to 82.3, matching what an operator expects (the event line is excluded entirely). Closes the post-mortem next-work finding "Fix coverage-ratchet.sh cmd/ao event format handling" (the report named the wrong script — the bug was in evolve-coverage-delta.sh's averaging step).
Cycle 1's Tags branch pushed parseGateRow to CC 18, exactly at the cli/internal/ go-complexity-ceiling threshold. Restore head-room by splitting the per-column read pattern into helpers: - cellAt(cells, colMap, key) → trimmed value if mapped & in-range - parseCheckCell(s) → strips wrapping backticks - parseWeightCell(s) → defaults to 5 on parse error / out-of-range parseGateRow now reads as a flat sequence of `if v, ok := cellAt(...)` applications, dropping CC to 7. The old behavior (defaults, fallbacks, empty-description → ID) is preserved by the helpers. Tests: - TestCellAt — mapped/missing/out-of-range/whitespace-trim cases - TestParseCheckCell — backtick strip, no-backticks, empty, single backtick - TestParseWeightCell — valid 1/5/10, out-of-range, non-numeric, empty - All existing TestParseGateRow_* and TestParseGatesTable_* still pass
The script's default `find -name '*coverage*_test.go'` audited an empty set on a clean checkout — CLAUDE.md bans `cov*_test.go` naming, so the report had nothing to report. Post-mortem #5 flagged this. Changes: - New default scope: `*_test.go` (audit all test files in TARGET_DIR). - New `--scope <pattern>` flag accepting `coverage` (legacy alias), `all` (explicit alias for the new default), or any glob. - Recognize `goleak.VerifyNone` as a valid assertion form (e.g. measure_signal_test.go's goroutine-leak sentinel). - Drop the `|| echo 0` fallback on `grep -c`. GNU grep prints `0` AND exits 1 when there are no matches, so the fallback double-printed ("0\n0\n") and broke the awk ratio math. Use `: ${var:=0}` instead. Closes the post-mortem next-work item: "Expand audit-assertion-density.sh to cover all test files" Tests (tests/scripts/audit-assertion-density.bats): - default scope sees all *_test.go files - default scope flags hollow files - --scope coverage restricts to *coverage*_test.go - --scope all is an explicit alias for the new default - --check exits 1 when hollow tests are present - --check exits 0 when no hollow tests in scope - custom glob via --scope is honored
Codex DAG steps in skills-codex/ should use Codex's \$skill invocation syntax, not Claude's Skill() tool form. The shared strict-delegation-contract.md mirror still carried 12 Skill() references, flagged by the vibe judge on 2026-04-19. Rewrites in skills-codex/shared/references/strict-delegation-contract.md: - Skill(skill="<name>", ...) → \$<skill> <args> (matches converter.sh's rule: s/Skill\(skill="([^"]+)"...)/$1 \$2/) - Three concrete examples in the Positive Pattern section now show \$discovery, \$crank, \$validation invocations (not Skill(skill=...)) - Anti-Pattern section talks about Codex sub-agent spawns (not Agent()) since that's the closest Codex equivalent - Anti-Pattern table renames "Skill() vs Agent()" → "Codex sub-agents vs \$<skill> invocations" with corresponding cell updates Verification: - skills/shared/references/strict-delegation-contract.md (Claude variant) is unchanged — still uses Skill() syntax which is correct for Claude - bash scripts/audit-codex-parity.sh → passed - bash skills/heal-skill/scripts/heal.sh --strict → clean Closes the council-finding next-work item: "Sweep skills-codex/ DAG bodies for Skill() to \$skill notation"
…uffix newAdhocContextRunID's 1-second timestamp granularity has been a recurring concern (sweep 2026-04-15, post-mortem next-work item). The actual collision rate is ~1/65 536 within a second because we suffix with 4 random hex chars, but the contract had no test pinning that promise. Add two tests: - TestEnsureContextDir_IdempotentOnAdhocCollision — calling ensureContextDir twice with the same adhoc-1234-abcd MUST reuse the same directory (proven by writing a marker between calls and reading it back from the second-call path) - TestNewAdhocContextRunID_DistinctSuffixesInSameSecond — two reads with distinct entropy in the same second produce distinct IDs (adhoc-2000-0001 vs adhoc-2000-fffe) Also expand godoc on newAdhocContextRunID and ensureContextDir to document the collision rate and the MkdirAll-idempotency contract, so future callers don't try to use the path itself as a session- uniqueness signal. Closes the post-mortem next-work item: "Document adhoc ID 1-second collision behavior"
Cycle 1 added Tags to GOALS.md (notably long-cycle, corpus-state on flywheel-compounding) but operators still had no way to ask "what's the project's score if we set aside the corpus-state goals I cannot move in this commit?". Add --exclude-tag <tag> to `ao goals measure`. The flag drops any goal whose Tags include the value before the snapshot runs. Combines with --goal (goal-id filter applies first, then tag filter). Demonstration: $ ao goals measure --json | jq .summary.score 92.66 $ ao goals measure --exclude-tag long-cycle --json | jq .summary.score 94.06 Implementation: - MeasureOptions.ExcludeTag (new field) - applyMeasureFilters helper extracted from RunMeasure body so both filter steps live together and RunMeasure stays under the cli/ CC=20 ceiling (was 17, would have grown to 21 inline → extracted) - goalHasTag predicate on Goal.Tags - Wired via --exclude-tag string flag (with help text) on goals_measure - cli/docs/COMMANDS.md regenerated Tests: - TestGoalsMeasure_ExcludeTagFilter: long-cycle goal excluded; total=1, failing=0, score=100 - TestGoalsMeasure_ExcludeTagPlusGoalIDInteraction: --goal narrows to a tagged goal first, then --exclude-tag drops it → empty snapshot, no error (named goal was found and then filtered)
CI failures from PR #150's first push: 1. codex-runtime-sections (validate-codex-generated-artifacts step) "shared: manifest generated_hash drift detected" "shared: marker generated_hash drift detected" Cycle 7 edited skills-codex/shared/references/strict-delegation-contract.md without regenerating the manifest hash. Ran scripts/regen-codex-hashes.sh which updated hashes for 1 skill (shared). 2. bats-tests (tests/scripts/evolve-coverage-delta.bats) The "parser uses the same regex anchors as the script" assertion grepped the script for a literal regex string, but the test's escaping of the literal didn't survive bats's preprocessor — the assertion failed even though the script's regex was correct. Replaced with an awk-based assertion that just confirms the script's coverage extraction line still includes "grep -oE" + "coverage:" + the "?%" anchor fragment. Robust against unrelated whitespace/quote shape changes; still catches a regression that drops the % suffix. Verified locally: full bats tests/scripts/*.bats and tests/hooks/*.bats suites green; validate-codex-generated-artifacts.sh passes.
boshu2
pushed a commit
that referenced
this pull request
Apr 26, 2026
hooks/write-time-quality.sh ran every Edit/Write but had zero test
coverage. A regression in any branch — Go fmt.Println in non-main, Python
bare-except / eval / missing-return-type-hint, shell missing
set -euo pipefail, the IS_TEST exemptions, the kill switch, the JSON
envelope shape — would silently degrade quality signal.
Add a 16-case bats fixture covering:
- tool-name filter (only Edit/Write trigger)
- missing/non-existent file are silent
- unsupported extension is silent
- AGENTOPS_HOOKS_DISABLED kill switch short-circuits
- Go: fmt.Println warns in non-main packages, silent in main and *_test.go
- Python: bare except warns; eval warns outside tests, silent in test_*.py;
missing return-type-hint on def-without-arrow warns
- Shell: missing 'set -euo pipefail' warns; presence suppresses warning
- JSON envelope (stdout-only) parses and includes hookEventName, file,
language, warning_count, warnings array
Each scenario uses a per-test temp file so cases don't bleed state. Pure
test addition; no production code changed.
NOTE: post-commit fitness measurement showed flywheel-proof transiently
fail due to a 503 on sum.golang.org (DNS cache overflow downloading the
go1.26.0 toolchain) — same network-flake mode PR #147 and #150 documented
on the same gate. Re-measure passes (score 92.66). Not caused by this
cycle (only test files touched).
https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T
boshu2
pushed a commit
that referenced
this pull request
Apr 26, 2026
hooks/standards-injector.sh maps .js → "javascript" and reads
skills/standards/references/javascript.md, but the file did not exist —
so every .js Edit/Write silently dropped the standards-context inject.
The hook's "fail-open on missing file" guard hid the gap.
Add references/javascript.md (Tier 1 baseline: ESM, prettier+eslint,
const/let, async/await, eqeqeq, common pitfalls, security defaults)
and link it in skills/standards/SKILL.md (table row + linked-references
list — required by skills/heal-skill --strict and the cmd/ao
TestSkillContract_ReferencesLinkedInSKILLMD test).
Sync the embedded copy via `cd cli && make sync-hooks` so the runtime
manifest matches the source. Add a 12-case bats fixture for
standards-injector.sh covering all six languages (go, ts, tsx, sh, js,
yaml/yml), the extensionless / missing / unsupported / kill-switch
silent paths, and exact-body-match assertions against the on-disk
references files.
Verified:
- hooks/standards-injector.sh on /x.js now returns 2111-byte body
matching the new file
- cd cli && go test -race ./cmd/ao -run TestSkillContract — pass
- bash skills/heal-skill/scripts/heal.sh --strict — All clean
- cd cli && make sync-hooks idempotent
NOTE: post-commit measurement shows flywheel-proof failing — same
network-environmental issue as cycle 8 (sum.golang.org 503 / DNS cache
overflow when the proof-run script downloads the go1.26.0 toolchain
into a fresh HOME). System Go is 1.24.7 but go.mod requires 1.26.0,
so GOTOOLCHAIN=local fallback also fails. Not caused by this cycle —
the proof-run path does not touch standards or hooks. Same pattern
PR #147 and #150 documented and shipped through.
https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T
boshu2
added a commit
that referenced
this pull request
Apr 26, 2026
…-driven goals (corpus-state isolated) (#152) * gate(flywheel-compounding): split σ=0/ρ=0 dormant hint from ρ=0-only The flywheel-compounding gate had one branched hint (ρ=0 → "use --cite applied|reference"), but ρ=0 covers two distinct corpus states: - σ=0 AND ρ=0 — no citations of ANY kind in the measurement window; the corpus is dormant. The fix is "run any ao lookup", not "switch --cite kind". The high-confidence hint is misleading here. - σ>0 AND ρ=0 — citation activity exists but only as retrieved-only hits; the existing hint applies. Add the σ=0 ρ=0 → dormant branch and a 6-case bats fixture pinning the three hint branches (PASS, σ=0 ρ=0 dormant, ρ=0-only, generic) plus the ao-failure path. Operators now see the right remediation per failure mode without inferring it from the σρδ numbers. This is a heavy-goal observability improvement, not a metric flip — the goal stays fail until corpus citations land over multiple sessions. https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T * refactor(codex_runtime): split detectLifecycleRuntimeProfile (CC 20→<14) detectLifecycleRuntimeProfileWithOptions sat at the cli/ CC ceiling (20). Any future case-arm tweak (e.g., a new runtime kind, or a new sub-state in the existing four) would have pushed it past the gate's threshold. Refactor: bundle the per-runtime config paths into a small struct (lifecycleManifestPaths) shared by four per-runtime helpers (populateCodexProfile / populateClaudeProfile / populateOpenCodeProfile / populateUnknownProfile). The detector body shrinks to a switch over the four helpers; each helper is straight-line and testable in isolation. Behavior unchanged — verified via: - go test -race ./cmd/ao -run "Lifecycle|Codex|Runtime" - ./bin/ao codex status --json (live invocation, same JSON shape and same "Detected Codex runtime without native hook support" reason) - go-complexity-ceiling gate: cli/ <20, cli/internal/ <18 https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T * test(hooks): pin research-loop-detector behavior across 14 cases The PostToolUse research-spiral detector at hooks/research-loop-detector.sh had zero test coverage. A bad edit to the threshold ladder, the read-only-bash classification, the kill-switch short-circuits, or the JSON nudge formatting would ship silently. Add a bats fixture covering: - counter increment on Read/Grep/Glob/WebSearch/WebFetch - WARN/STRONG/STOP threshold transitions at 8/12/15 with the exact nudge text for each band - reset on Edit/Write/NotebookEdit - read-only Bash (grep/rg/cat/...) increments; execution Bash resets - AGENTOPS_HOOKS_DISABLED and AGENTOPS_RESEARCH_LOOP_DISABLED kill switches both short-circuit before any state mutation - threshold env-var overrides (AGENTOPS_RESEARCH_WARN_THRESHOLD) - STOP precedence over STRONG/WARN when all three are tied at 1 - emitted JSON parses round-trip via jq -e Run against the live hook in a tmpdir mock-repo to keep tests hermetic. All 14 scenarios PASS. Pure-test addition: no production code touched, no fitness regression. https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T * refactor(notebook): split runNotebookUpdate (CC 19→11) for headroom runNotebookUpdate sat at CC=19 — close to the cli/ ceiling of 20 — and mixed three concerns: memory-file resolution, entry resolution, and the update pipeline itself. A single new branch (e.g., a third entry source) would have failed the gate. Extract two helpers: - resolveNotebookMemoryFile(cwd) (string, bool) - resolveNotebookEntry(cwd) *pendingEntry Each is straight-line and individually testable; the main function now reads as a four-step pipeline (memory-file → entry → cursor-skip → parse/render/write). Behavior preserved — `ao notebook update --quiet` exit 0, no output, no state mutation when no MEMORY.md / no session entry. All cmd/ao tests pass; CC drops to 11 (well clear of the 20 ceiling). https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T * test(beads): pin five 0%-coverage helpers behind 19 cases Five small pure helpers in cli/cmd/ao/beads.go and beads_audit_cluster.go had 0% line coverage: - beadMinInt — drives matches[:min(3, len)] citation clipping - beadTruncate — wraps the bd parse-error message - representativeIsEpic — picks epic vs leaf rendering for cluster output - firstNNonEmptyLines — derives the cluster summary excerpt - sortedMapKeys — supplies deterministic JSON ordering A regression in any of them would corrupt user-visible output silently (wrong message text, garbled cluster summary, non-deterministic JSON ordering breaking diffs) rather than panicking. None had a test pinning behavior. Add 19 cases covering: smaller-of-two and equal-args boundaries (incl. negatives and zeros), under/at/over the truncation limit (incl. n=0 on non-empty), epic-found / leaf-found / representative-missing / empty-cluster branches of representativeIsEpic, whitespace-handling and trim semantics of firstNNonEmptyLines, deterministic key order of sortedMapKeys regardless of bool values. All cases assert exact expected values (per .claude/rules/go.md). No production code touched; fitness unchanged at 92.66. https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T * refactor(contradict): split runContradict (CC 19→5) into 5 helpers runContradict bundled four concerns at CC=19 — close to the cli/ ceiling of 20: directory existence checks, file collection, entry parsing, pair-comparison loop, and dual-format output. A new file source or a new output format would have failed the gate. Extract: - collectContradictFiles: globs *.jsonl + *.md from learnings/patterns - parseContradictEntries: reads + tokenizes, drops empty/zero-word files - compareContradictPairs: O(n²) jaccard ≥ 0.4 + detectContradiction - relPathOrAbs: Rel-with-fallback path helper (lifted from inline blocks) - emitContradictResult: JSON-or-human writer Behavior preserved — verified via: - go test ./cmd/ao -run Contradict - ./bin/ao contradict (human output identical: 20 files, 190 pairs) - ./bin/ao contradict --output json (same {"total_files":20,...} shape) CC drops: runContradict 19→5; new helpers all ≤6. Headroom for future file-source additions. https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T * refactor(rpi_serve): split serveRPIState (CC 19→5) into 4 helpers serveRPIState mixed five HTTP-handler concerns at CC=19 — close to the cli/ ceiling: query-param parsing/validation, run-id resolution against the registry, fallback phased-state.json read, per-phase result gathering, and the active-runs listing. A new state source or response key would have failed the gate. Extract: - parseServeStateRunID: Validate run-id, write 400 on path traversal - resolveStateForRunID: Look up the run via resolveServeRun, write to resp on success, return the resolved root - loadFallbackPhasedState: Read .agents/rpi/phased-state.json directly only if the resolver did not already populate phased_state - loadPhaseResults: Gather phase-{1,2,3}-result.json into a phase_N map Behavior preserved — verified via: - go test ./cmd/ao -run TestServeRPIState (existing handler test) - go test ./cmd/ao (full package, 30s, all pass) - go vet clean CC drops: serveRPIState 19→below-5 (not in --threshold 5 listing); each new helper ≤6. https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T * test(hooks): pin write-time-quality across 16 per-language scenarios hooks/write-time-quality.sh ran every Edit/Write but had zero test coverage. A regression in any branch — Go fmt.Println in non-main, Python bare-except / eval / missing-return-type-hint, shell missing set -euo pipefail, the IS_TEST exemptions, the kill switch, the JSON envelope shape — would silently degrade quality signal. Add a 16-case bats fixture covering: - tool-name filter (only Edit/Write trigger) - missing/non-existent file are silent - unsupported extension is silent - AGENTOPS_HOOKS_DISABLED kill switch short-circuits - Go: fmt.Println warns in non-main packages, silent in main and *_test.go - Python: bare except warns; eval warns outside tests, silent in test_*.py; missing return-type-hint on def-without-arrow warns - Shell: missing 'set -euo pipefail' warns; presence suppresses warning - JSON envelope (stdout-only) parses and includes hookEventName, file, language, warning_count, warnings array Each scenario uses a per-test temp file so cases don't bleed state. Pure test addition; no production code changed. NOTE: post-commit fitness measurement showed flywheel-proof transiently fail due to a 503 on sum.golang.org (DNS cache overflow downloading the go1.26.0 toolchain) — same network-flake mode PR #147 and #150 documented on the same gate. Re-measure passes (score 92.66). Not caused by this cycle (only test files touched). https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T * fix(standards): add javascript.md so .js Edit/Write injects standards hooks/standards-injector.sh maps .js → "javascript" and reads skills/standards/references/javascript.md, but the file did not exist — so every .js Edit/Write silently dropped the standards-context inject. The hook's "fail-open on missing file" guard hid the gap. Add references/javascript.md (Tier 1 baseline: ESM, prettier+eslint, const/let, async/await, eqeqeq, common pitfalls, security defaults) and link it in skills/standards/SKILL.md (table row + linked-references list — required by skills/heal-skill --strict and the cmd/ao TestSkillContract_ReferencesLinkedInSKILLMD test). Sync the embedded copy via `cd cli && make sync-hooks` so the runtime manifest matches the source. Add a 12-case bats fixture for standards-injector.sh covering all six languages (go, ts, tsx, sh, js, yaml/yml), the extensionless / missing / unsupported / kill-switch silent paths, and exact-body-match assertions against the on-disk references files. Verified: - hooks/standards-injector.sh on /x.js now returns 2111-byte body matching the new file - cd cli && go test -race ./cmd/ao -run TestSkillContract — pass - bash skills/heal-skill/scripts/heal.sh --strict — All clean - cd cli && make sync-hooks idempotent NOTE: post-commit measurement shows flywheel-proof failing — same network-environmental issue as cycle 8 (sum.golang.org 503 / DNS cache overflow when the proof-run script downloads the go1.26.0 toolchain into a fresh HOME). System Go is 1.24.7 but go.mod requires 1.26.0, so GOTOOLCHAIN=local fallback also fails. Not caused by this cycle — the proof-run path does not touch standards or hooks. Same pattern PR #147 and #150 documented and shipped through. https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T * fix(proof-run): reuse cli/bin/ao when present so 503s on sum.golang.org don't fail flywheel-proof tests/e2e/proof-run.sh always rebuilt ao in a fresh \$HOME, so each gate invocation re-downloaded the go1.26.0 toolchain via sum.golang.org. When the sum DB returns 503 ("DNS cache overflow") the entire flywheel-proof gate (w=7) fails — even though the local cli/bin/ao is fresh and behavior is testable. Three changes: - PROOF_AO_BIN=/path env override: caller can pin a pre-built binary - Auto-detect \$REPO_ROOT/cli/bin/ao when present (and the override is unset) — covers the common case where `make build` ran first - PROOF_FORCE_BUILD=1 escape hatch: opt back into build-from-source when the goal IS to verify the toolchain path `require_cmd go` now only fires on the build path, so machines without go installed can still run the proof against a shipped binary. Verified: - bash tests/e2e/proof-run.sh — auto-detects cli/bin/ao, all 20 flywheel checks PASS in ~6s (was failing in 90s before) - PROOF_FORCE_BUILD=1 — still attempts go build (so the toolchain- path regression test still exists) - PROOF_AO_BIN=/path/to/ao — copies binary, skips build flywheel-proof flips fail→pass after this cycle. This is a code-driven flip (the script is the gate's only build path), not a runtime artifact. https://claude.ai/code/session_01TVzMVJ8FXdctstCrzTcM7T --------- Co-authored-by: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Second nightly run for 2026-04-26 (PR #147 was the morning run; merged at
800eea8a). Branched fromorigin/mainpost-merge. 9 productive cycles, 0 stale-audit cycles, 0 auto-reverts. Score holds at 92.66 (the only failing goal is the documented corpus-stateflywheel-compounding, weight 8).Fitness delta (score: 92.66 → 92.66)
long-cycle, corpus-state(cycle 1)applyMeasureFiltersBEFORE committing → final CC ≤16*Pre-Dream baseline was
failfor both compile-* gates because.agents/defrag/latest.jsonis gitignored and the env starts without it. Dream'sdefrag-previewwrites.agents/overnight/<run>/defrag/latest.jsonwhich the gates' fallback path consumes, flipping them topass. Taggedruntime-artifactin cycle 1 so this is no longer hidden classification.Code-driven flips vs runtime-artifact flips
overnight startwriting.agents/overnight/<run>/defrag/latest.json(gitignored — does not propagate via PR)The corpus-state
flywheel-compounding(w=8) was NOT pursued for a metric flip. PR #147 already did the legitimate observability fix (scripts/check-flywheel-compounding.shsurfaces σρδ + root cause). Today's heavy-goal partial fix is structural quarantine plumbing (cycles 1 + 9): explicit tagging in GOALS.md and an operator-facing--exclude-tagflag so future runs can score "what's actually code-actionable" without lowering the goal's weight.Per-cycle summary
flywheel-compounding(w=8) — tag withlong-cycle, corpus-state; runtime-artifact gates withruntime-artifact; Tags column parsing → JSON measurement output56199e81internal/ratchetfuzz target had no companion seed-correctness asserts; added 7-caseTestFuzzParseChainLines_SeedCorrectnessfa14b65cskills/vibe/SKILL.mdStep 2 ran radon/gocyclo unconditionally; gated onHAS_PY/HAS_GOfrom diff to stop docs/shell-only epics from hanging in gocyclo530dd699evolve-coverage-delta.shaveraged Go'scoverage: 1/12 eventsinformational line in as 1.0%, dragging project averages down. Tightened regex to require%anchor; addedtests/scripts/evolve-coverage-delta.bats(7 cases)456a9e16parseGateRowCC pushed to 18 by cycle 1; restored head-room to 7 by extractingcellAt/parseCheckCell/parseWeightCellhelpers (with unit tests)9a8a4a8caudit-assertion-density.shdefaulted to*coverage*_test.go(banned by CLAUDE.md → audited zero files). New default*_test.go+--scope <pattern>+ recognizegoleak.VerifyNone. 8-case bats fixtureac660ae2skills-codex/shared/references/strict-delegation-contract.mdhad 12Skill()references; rewrote to$skillnotation per Codex convention (Claude variant unchanged)bb619271newAdhocContextRunIDcollision behavior had no test; added idempotency-on-collision test + distinct-suffix test + expanded godoc on the 1/65 536 collision rate2923e95dao goals measure --exclude-tag <tag>flag — operators can now ask "score with corpus-state goals excluded" without lowering weights. Builds on cycle 1's tags.applyMeasureFiltersextracted for CC disciplined0c4220fFindings opened / closed / deferred
Closed via implementation (this run):
internal/ratchet/fuzz_test.gowas the last fuzz target without a companion correctness suite)evolve-coverage-delta.sh's averager, notcoverage-ratchet.shwhich doesn't exist)Skill()to$skillnotation" — closed by cycle 7Heavy-goal partial fix delivered (DEFINITIONS option b):
flywheel-compounding(w=8) — corpus-state, multi-session bound. PR Nightly 2026-04-26 — 6 productive cycles, +3 goals, fitness 79.8 → 92.7 #147 added the observability gate. This run added the structural quarantine layer:Tagscolumn on the GOALS.md gates table (parsed intoGoal.Tags, surfaced inMeasurement.TagsJSON)flywheel-compoundingtaggedlong-cycle, corpus-stateruntime-artifactao goals measure --exclude-tag <tag>for operator-side filteringfail— that is the correct outcome (it must remain a fitness signal until applied/reference citations land in the corpus over multiple sessions). Documenting and tooling around the irreducibility, not gaming the metric.Inline-probe rejections (counted separately from stale-audit cycles):
tier=execution, limit 800, current 660 → already passingscripts/check-cmd-ao-coverage.shalready enforces a 76% floor on cli/cmd/aoINPUT=$(cat)+jq(PR Nightly 2026-04-26 — 6 productive cycles, +3 goals, fitness 79.8 → 92.7 #147 audit confirmed; re-confirmed)applyContextFilteralready iteratesSections.Include(PR Nightly 2026-04-26 — 6 productive cycles, +3 goals, fitness 79.8 → 92.7 #147)validateIntelScopealready validates section names (PR Nightly 2026-04-26 — 6 productive cycles, +3 goals, fitness 79.8 → 92.7 #147)Deferred (not actioned — vague or out-of-scope-for-cycle):
goals(defines its own--jsonfor state isolation) and root--json; not a contract violationscripts/check-go-command-test-pair.sh) already enforces co-changeStale-audit count
The 6 rejections are well above the consecutive-3 threshold for ladder escalation, but escalation already happened naturally — every productive cycle came from either the heavy-goal lane or generator-layer findings (test, perf, refactor, docs-as-tool).
Auto-reverts
None. No goal with weight ≥ 3 regressed. Cycle 9's intermediate state pushed
RunMeasureto CC=21 (would have failedgo-complexity-ceiling); caught by the post-build measure check before commit and refactored down to ≤16 by extractingapplyMeasureFilters. Same response as auto-revert without losing the work.Quarantined goals
flywheel-compounding(w=8) — confirmed multi-session corpus-state goal. Now structurally taggedlong-cycle, corpus-state(cycle 1) AND has an operator-facing--exclude-tagescape (cycle 9). Recommend keeping the gate and weight as-is — the tag-based filter is the right mechanism for "give me a code-actionable score" rather than weight reduction.Dream meta-findings
dream-corpus-stale(rank 1): "Write AgentOps philosophy doc..." —docs/philosophy.mdexists, last_reviewed 2026-04-12. Same finding as PR Nightly 2026-04-26 — 6 productive cycles, +3 goals, fitness 79.8 → 92.7 #147; Dream is still emitting a packet whose work shipped. Worth a Dream-curator pass.dream-corpus-stale-rank3(rank 3): "Backfill next-work queue rows to schema v1.3..." —scripts/check-next-work-schema-rows.sh(added in PR Nightly 2026-04-26 — 6 productive cycles, +3 goals, fitness 79.8 → 92.7 #147 cycle 3) reports66 row(s) conform to v1.3 schema enums. Same stale finding as PR Nightly 2026-04-26 — 6 productive cycles, +3 goals, fitness 79.8 → 92.7 #147. The Dream emission ranking is correlating poorly with corpus state; tracking this as a producer-side issue.Two stale Dream packets in two consecutive nightlies on the same date is a real signal that Dream's morning-packet generator is not consulting the recent next-work queue or recent merged PRs before ranking. Recommend a tractability probe in the Dream pipeline itself.
bd / tracker degradation notes
bdCLI unavailable:command -v bdreturns nothing, noscripts/install-bd.shexists in the repo, no.beads/directory. Identical to PR #147's environment. Cycles selected from heaviest-failing-goal + generator-layer findings + next-work queue instead. Worth a follow-up to either (a) shipscripts/install-bd.shso future runs can self-install, or (b) documentbd unavailableas the expected steady state and stop logging it as a degradation.Scope-discipline notes
main) — known false positive, ignored per spec. No silencing flag exists in the script. Same as PR Nightly 2026-04-26 — 6 productive cycles, +3 goals, fitness 79.8 → 92.7 #147.nightly/2026-04-26-v2403'd as expected; falling back to branch ref. Thenightly/2026-04-26-v2branch on origin is the anchor for tomorrow's audit.pre-push-gate.sh --fast(only failure is the worktree false positive; 32 actual checks pass or skip).cli/docs/COMMANDS.mdin sync with the new--exclude-tagflag.Validation
cd cli && go run ./cmd/ao autodev validate --file ../PROGRAM.md --json→valid:truecd cli && go vet ./...→ cleancd cli && go test ./internal/goals ./internal/ratchet ./cmd/ao→ all okbash skills/heal-skill/scripts/heal.sh --strict→ All clean. No findings.bash scripts/audit-codex-parity.sh→ Codex parity audit passed.bash tests/skills/lint-skills.sh→ 69/69 skills passbash scripts/check-go-absolute-complexity.sh --dir cli/ --threshold 20→ All functions below 20bash scripts/check-go-absolute-complexity.sh --dir cli/internal/ --threshold 18→ All functions below 18scripts/generate-cli-reference.sh --check→ cli/docs/COMMANDS.md is up to dateao goals measure --json: PASS=18, FAIL=1, SCORE=92.66ao goals measure --exclude-tag long-cycle --json: PASS=17, FAIL=0, SCORE=94.06 (cycle-9 demonstration)Commits
56199e81feat(goals): tag flywheel-compounding as long-cycle/corpus-state via Tags columnfa14b65ctest(ratchet): pin FuzzParseChainLines seed inputs with correctness asserts530dd699fix(vibe): gate radon/gocyclo on actual language presence in diff456a9e16fix(coverage): require % anchor so 'coverage: 1/12 events' isn't averaged in9a8a4a8crefactor(goals): drop parseGateRow CC 18→7 by extracting cell helpersac660ae2fix(audit): scope assertion-density audit to all *_test.go by defaultbb619271fix(codex): rewrite Skill() to $skill notation in delegation contract2923e95dtest(inject): pin adhoc context-id collision idempotency + distinct-suffixd0c4220ffeat(goals): --exclude-tag flag for ao goals measure(Branch ref
nightly/2026-04-26-v2on origin serves as tomorrow's audit anchor in lieu of a tag — tag push 403'd as expected.)Generated by Claude Code