coordinator run-log: observer-blank (blank-slate variant)#49893
coordinator run-log: observer-blank (blank-slate variant)#49893ellataira wants to merge 30 commits intoella/observer-blankfrom
Conversation
A long-running Python loop that iteratively proposes, implements,
evaluates, and ships changes to the observer anomaly-detection
pipeline via the Claude Agent SDK. Every commit on the scratch branch
`claude/observer-improvements` corresponds to one reviewed,
score-improving candidate.
Structure (tasks/coordinator/, 19 modules + 13 test files):
- driver.py iteration orchestrator (poll validations, sync
upstream, process inbox, pick candidate, impl,
eval, score, review, commit+push+validate)
- scheduler.py candidate picker with approach-family diversity
ban (K=3 consecutive non-improving → banned)
- proposer.py SDK subagent: brainstorms new candidates from
prior experiment outcomes when queue is dry
- sdk.py SDK wrappers (inbox interp / implement / review);
PreToolUse hook blocks any `git` bash command
from the implementation agent; transient errors
retried with exponential backoff
- reviewer.py Skeptic + Conservative persona prompts (Phase 1);
Duplicate Hunter, Algorithm Expert, Greybeard
staged for Phase 2+
- scoring.py report → baseline diff → strict regression +
recall floor gates (train-scoped)
- evaluator.py subprocess wrapper for q.eval-scenarios
- workspace_validate.py fire-and-forget post-ship eval-component on
workspace-evals-<detector>; polled at iteration
start; abandoned >48h
- git_ops.py scratch-branch-only plumbing: refuses off-branch
commit/push; sync_from_upstream fetches + merges
origin/q-branch-observer with conflict abort;
startup_cleanup reconciles post-crash state
(orphan working-tree diffs + unpushed commits)
- coord_out.py coordinator→user channel (file + Slack webhook)
- slack_out.py incoming webhook poster (fail-soft; no-op when
COORD_SLACK_WEBHOOK_URL unset)
- inbox.py user→coordinator; atomic rename beats truncation
races
- budget.py wall-hour tracking + 50%/80% milestone escalation
- config.py frozen constants (τ, plateau, retries, milestones)
- metrics.py markdown dashboard rendered from db.yaml
- db.py, schema.py atomic-write YAML persistence + typed dataclasses
- journal.py append-only JSONL event log
- {import_baseline,import_validations,seed_split,seed_candidates}.py
one-shot bootstrap scripts
Safety:
- Coordinator owns all git state. Commits and pushes are gated to the
scratch branch by WrongBranchError. No pushes to upstream.
- The implementation agent is sandboxed (Read/Edit/Write/Bash/Grep/Glob
only) and its Bash is filtered by is_git_command which catches
chained forms (`true && git push`, `cd x; git status`) and ignores
false positives (`ls -la git`, `gitk`, `git-foo`) — all covered by
tests/test_git_safety.py.
- db.yaml is the source of truth; save_db precedes every externally-
observable side effect (push, workspace dispatch) so a crash leaves
recoverable state.
- Restart runs git_ops.startup_cleanup: reverts mid-iteration orphan
diffs and pushes orphan commits from a crash between commit and push.
Tests: 90 passing (pyproject.toml scopes pytest to this subtree with
pythonpath=[..] to bypass the heavyweight tasks/__init__.py).
The scratch branch forks off origin/q-branch-observer on first run
regardless of operator HEAD; sync_from_upstream merges new upstream
commits at every iteration start.
Also gitignores .coordinator/ and eval-results/ so runtime state can
never leak into a commit.
See tasks/coordinator/README.md for the full flow diagrams, setup
commands, restartability matrix, and deferred items.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Model routing: Opus for deep-thinking SDK calls (implement, review, proposer); Sonnet for lightweight inbox interpretation. Exposed via CONFIG.model_deep / CONFIG.model_light; empty string falls back to SDK default. Threaded through sdk._run_query and proposer.propose. - De-hardcoded DETECTOR_WORKSPACE dict in workspace_validate. A new detector no longer requires a code change — workspace_for_detector derives the target ssh alias by convention (workspace-evals-<detector>). Unreachable workspaces fail soft via the existing _ssh journalling path. Test updated to verify the convention and the fail-soft behaviour. - QUICKSTART.md: zero-to-running walkthrough for the multi-day workspace-hosted driver setup. Creates coord-driver workspace, installs deps, seeds baseline/split/candidates, smoke-tests, then launches `--forever` in tmux. - README: added "Where things run" four-workspace diagram (driver does only short ops; three detector workspaces run post-ship eval-component), model-routing table, and QUICKSTART link. No functional change to the iteration loop itself. 91 tests passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DD admins block new Slack incoming webhooks and app installs, so the user-facing notification channel switches to GitHub PR comments on the long-lived run-log PR (#49678). This is strictly simpler: `gh` is pre-authenticated on DD workspaces, no new tokens, no admin approval. Bidirectional: - Outbound (coordinator → user): github_out.post calls `gh pr comment <N> --body <text>` at every coord_out.emit event (budget milestones, phase exits, strict-regression auto-rejects, validation completions/abandonments, upstream conflicts). - Inbound (user → coordinator): github_in.poll calls `gh api --paginate issues/<N>/comments` at every iteration start, filters out the coordinator's own comments (zero-width-space marker prepended by github_out.format_message), appends the remainder to .coordinator/inbox.md for the normal SDK-interpret → ACK flow. State in .coordinator/github_state.json tracks the highest comment id already ingested so poll is idempotent. Modules: + tasks/coordinator/github_out.py ( post via `gh pr comment` ) + tasks/coordinator/github_in.py ( poll via `gh api issues/comments` ) - tasks/coordinator/slack_out.py ( removed — no longer needed ) - tasks/coordinator/tests/test_slack_out.py ( removed ) Wired: - coord_out.emit: swap slack_out → github_out; journal event renamed slack_post_failed → github_post_failed. - driver._run_iteration_body step 0a: github_in.poll before the regular inbox drain so user replies get picked up in the same iteration. Env var: COORD_GITHUB_PR_NUMBER run-log PR number (e.g. 49678). Unset → no-op (fail-soft); coord-out.md is still the canonical channel. Security: subprocess.run is always called with an argv list (no shell=True), so arbitrary body content cannot be interpreted as shell. github_state.json stores only last_seen_id; no tokens or bodies. Docs: QUICKSTART.md rewrites the transport setup as a one-liner against PR #49678 (watch it in the GitHub mobile app for push notifications). README.md updates the architecture diagram, module list, and troubleshooting table. Tests: +14 new covering gating, formatting, subprocess happy-path, argv structure, gh errors (cli missing, timeout, non-zero rc), dedup on re-poll, own-comment filtering via marker, chronological sort, malformed JSON lines, empty payload. 101 passing in the coordinator venv. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…list
Four persona panels reviewed the harness; 10 cross-cutting findings
addressed here. Test count: 101 passing.
# Correctness (Scientist panel)
- Per-scenario F1 σ calibration. Added `ScenarioResult.f1_sigma` and
`measure_sigma.py` helper. `scoring.score_against_baseline` now uses
`3·σ_s` per scenario when measured, falls back to `CONFIG.tau_default`.
Fixes the "scalar τ=0.05 is smaller than observed per-scenario σ"
problem — we were gating on noise.
- Rolling "last shipped" reference for regression gates. `db.last_shipped_
per_scenario[detector]` updates after every ship; the strict-regression
and recall-floor gates compare against THIS (not the original baseline)
so a candidate that regresses from the immediately-prior commit is
blocked, even when accumulated prior gains would mask the regression
vs the original baseline. Cumulative deltas vs baseline stay visible.
Review prompt shows both ("vs baseline" + "vs last-ship") so the
reviewer can distinguish "added marginal signal" from "inherited".
- Overfit telltale. New `overfit_check.py`: every N ships, Spearman
rank-correlation between train-ranking and lockbox-ranking of all
shipped candidates. Drift below `CONFIG.overfit_spearman_threshold`
emits a `tripwire` coord-out. Lockbox scores never surface in any
agent prompt — Python consumes, agents don't.
# Brittleness (panel #2)
- Inbox orphan recovery. `inbox.recover_orphan_reading` called in
`driver.main` startup; archives any `inbox.md.reading` left behind
by a prior mid-drain crash with an `orphan-recovery` tag so the
original message isn't silently lost.
- Dirty-tree check moved BEFORE `sync_from_upstream`. Human-edited
working-tree changes under WATCH_PATHS can no longer be silently
auto-committed under a candidate id, or wiped by `merge --abort`
on upstream conflict.
- PendingValidation persisted BEFORE ssh dispatch (new `dispatching`
status). A crash between ssh return and db save can't lose track
of the in-flight remote tmux session. Ssh failure flips status to
`failed` with an audit trail. Startup reaps any orphaned
`dispatching` records as `failed`.
- Upstream-conflict halts cleanly. New `UpstreamConflictHalt` exception
propagates out of the --forever loop instead of re-trying the sync
every iteration (which would conflict and emit another coord-out
every ~10min, burning tokens + spamming the PR).
# Operability (SRE panel)
- Hard token-budget ceiling. `sdk.consume_token_count()` accumulates
input+output tokens from `ResultMessage.usage`; driver rolls into
`BudgetState.api_tokens_used`. When `CONFIG.api_token_ceiling` is
exceeded, `BudgetCeilingHalt` halts the loop with a `budget_halt`
coord-out. Default None (no ceiling); set it before multi-day runs
or an Opus loop edge case can burn $1-5k/day uncontrolled.
- Liveness heartbeat in `metrics.md`. Shows last journal event, ISO
timestamp, and a `⚠ LIVENESS` banner when stale > 30min. Median
iter wall-time over the last 10 iterations. Token % of ceiling
when set.
# Information flow (Maximalist panel)
- Personas collapsed. Replaced Skeptic + Conservative (both re-deriving
booleans from numbers scoring.py already computed) with a single
`hack_detector` that focuses on the judgment call rules CAN'T make:
"does this look like a real improvement or a metric-hack?" Prior-
experiment rationales are now in the prompt so the reviewer can see
if this approach was already rejected on a different iteration.
- Implementation agent gets prior-work context. `sdk.implement_candidate`
accepts `prior_experiments` (up to 5 most-recent same-family
experiments with rationales). Driver populates via
`_recent_same_family`. Agent can learn from past rejects instead
of re-exploring dead ends.
# Eval-component restructure
- Per-ship eval-component dispatch REMOVED. It was "validate every
config" dressed as "validate the component" — and it constantly
skipped when the workspace was busy.
- New policy: dispatch ONCE per new component, on plateau. When a
family iterating on a component hits `CONFIG.stuck_threshold`
consecutive non-improving experiments AND has ≥1 ship, dispatch
eval-component for each target component not yet in
`db.components_eval_dispatched`. Matches "certify this new
component" semantics; eval-component is a lagging audit, not a
per-config check.
- `components_eval_dispatched` starts EMPTY on `empty_db()`. Historical
reports (in eval-results/ + manually imported into db.validations)
validated BASELINE versions — after the coordinator modifies
scanmw/scanwelch/bocpd their historical data is stale. Re-run on
plateau.
# Tests
All 101 existing tests still pass. Updated
`test_dispatch_fails_soft_when_workspace_unreachable` and
`test_dispatch_ssh_failure_does_not_raise` to expect the
persist-before-dispatch audit trail (PendingValidation recorded with
status=failed on ssh failure, not silently dropped).
# What's still deferred
- Replicate-on-ship (3-5x rerun before committing; #5 in the panel
synthesis). Adds 18-30min per ship but gives seed-sign-consistency
before pushing. Straightforward add when desired.
- Pre-revert diff archive for rejected candidates (#6). Small, easy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Panel-review fixes landed in the previous commit; this one updates
README + QUICKSTART so the docs match the behavior, and removes the
hardcoded bocpd/scanmw/scanwelch list from the two bootstrap scripts
that baked it in.
# Doc updates (README + QUICKSTART)
- New section in README: "Attribution and rolling reference". Explains
why the branch accumulates commits monotonically and how
db.last_shipped_per_scenario replaces db.baseline as the regression-
gate reference, so candidates are blocked on marginal regressions
even when cumulative prior gains would mask them.
- Rewrote "Async validation (post-ship)" → "Async component validation
(plateau-gated)". Flow diagram matches the new behavior: dispatch
once per new component, when the family iterating on it plateaus.
components_eval_dispatched starts empty (no seed — coordinator
modifies known detectors, so their historical baseline validations
are stale).
- Updated the iteration flow diagram: step 5 shows rolling reference
+ per-scenario 3·σ gates, step 6 is the single hack_detector persona
(not Skeptic+Conservative), step 7 shows token accounting → budget
halt.
- Restartability matrix: added rows for inbox orphan recovery, dirty-
tree guard before sync, persist-before-dispatch, upstream-conflict
halt, budget-ceiling halt.
- QUICKSTART adds step 6b (measure_sigma.py) as strongly-recommended
before any real run. New troubleshooting rows for upstream halt,
budget halt, overfit tripwire, scalar-τ false-rejects, liveness
banner. "What to expect" clarifies the three halt events.
- Module layout table: added measure_sigma.py + overfit_check.py.
- reviewer.py description updated to the single hack_detector persona.
# De-hardcoded detector list
- import_baseline.py: replaced
--bocpd P --scanmw P --scanwelch P (3 mandatory flags)
with
--detector NAME=PATH (repeatable; any count, any names)
so adding a detector no longer requires editing the script.
- measure_sigma.py: default detector list was
["bocpd", "scanmw", "scanwelch"]
now pulls from db.baseline.detectors.keys() — measures whatever
was imported. --only still works as a narrowing filter.
- QUICKSTART "Prereqs" reframed: the three workspaces are the current
set, but new detectors work by convention — create
workspace-evals-<name>, done.
- Remaining concrete references to the three detectors in docs are
illustrative examples (smoke-test ssh, candidate YAML snippet, "only
scanmw runs"); kept intentionally since they show what to type given
the current state.
101/101 tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erfit - config: stuck_threshold 3→5, review_personas_phase1 2→1 (hack_detector only; Skeptic + Conservative collapsed). - git_ops.ensure_scratch_branch: prefer origin/<scratch> when it exists so subsequent pushes are fast-forward rather than non-FF rejected. - measure_sigma: prebuild testbench + scorer once before the N-run variance loop so per-run --no-build uses fresh binaries. - overfit_check: read each shipped candidate's F1s from its own exp.per_scenario (already contains all 10 scenarios from ship-time eval) instead of re-running lockbox evals on current HEAD — the re-run measured cumulative state, not candidate contribution. - tests: assert new config values. - docs: K=3 → K=5 in README and QUICKSTART. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erhaul)
Panel-reviewed pre-run fixes before kicking off the multi-day
unattended loop. Three groups of changes:
== Tier 1 — cost/safety/recovery ==
- config: non-None ceilings (5M tokens, 72h) + driver refuses --forever
with either unset. Default gun was loaded.
- db.empty_db seeds ceilings from CONFIG.
- workspace_validate: ssh/scp timeouts + connect/keepalive options so
a paused workspace cannot wedge the main loop.
- driver: fcntl.flock on .coordinator/driver.lock (single-instance).
- commit↔save_db reordered: save SHIPPED with sha="pending" BEFORE
git commit; startup reconciles pending-sha experiments.
- github_in: filter own comments by user.login (cached from `gh api
user`) instead of the fragile zero-width-space marker.
== Tier 1.5 — operational polish ==
- sdk: max_turns=60 (implement), =12 (review) — a runaway Grep loop
inside one _run_query can no longer spend a day of budget before
the per-iteration token check fires.
- git_ops.startup_cleanup: aborts stale MERGE_HEAD / CHERRY_PICK_HEAD
/ rebase-* residue before any iteration.
- inbox.inbox_lock (fcntl): serialises claim/ack with the gh-poll
append. Without it an append racing a rename writes into the
renamed inode. ack_and_archive also reorders: write ack BEFORE
renaming so a mid-op crash doesn't silently lose the trace.
- github_out/github_in: consecutive-error counters + driver-level
escalation (warn at 3, halt at 5). If gh auth expires on day 2,
we no longer burn tokens silently.
== Gate/review/plateau overhaul ==
Panel flagged the statistical gates as decorative or actively harmful:
- scoring: catastrophe filter (ΔF1 < -0.10 vs FROZEN baseline),
replacing both the per-scenario 3·σ gate (N=5 σ too noisy to
support) and the rolling-reference ratchet (noise-driven drift
let candidates strictly worse than baseline ship).
- reviewer: two personas in parallel — leakage_auditor (name leakage,
threshold-snapping, implicit identity, special cases) +
hack_detector (concentration, complexity, proxy-gaming, retread).
Both get `git diff HEAD` and scenario rosters in the prompt.
Forced structured output: per-check {status, evidence} fields;
stub evidence or missing checks block → auto-reject. Unanimity.
- plateau: effect-size aware. score > best + ε (ε = 0.01) required
to count as improvement. Noisy +0.001 bumps no longer keep dead-end
families alive indefinitely.
- schema.last_shipped_per_scenario: kept for backward-compat db load,
no longer written.
Docs: README iteration-flow diagram, gates section rewritten, deferred
plans updated. QUICKSTART step 6b, "what to expect", troubleshooting.
All 101 coordinator tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pre-run additions the panel review flagged as blockers for sound
post-run attribution:
(a) One-component-per-candidate invariant
- proposer.materialize_candidates: skip (with stderr) any proposal
that doesn't have exactly one target_component.
- seed_candidates._load_one: raise ValueError for 0 or 2+ components
— hard fail for hand-authored seeds.
- Rationale: each ship becomes a single git commit on one component.
Without this, later ships can modify code introduced by earlier
ships, making per-candidate marginal attribution unsound.
- Test: test_materialize_rejects_multi_component.
(b) reeval_ships.py — offline marginal re-evaluation
- For each shipped candidate: checkout ship sha + parent sha, run
q.eval-scenarios at N seeds on each, compute per-scenario marginal
ΔF1 with 95% CIs.
- Output: JSON report tagging each scenario as train/lockbox, with
summary counts of ships whose CI-low > 0.
- Success criterion: ≥3 shipped candidates with any lockbox
scenario's CI-low > 0 = real generalization. Below that, the
harness produced noise.
- Cost: n_ships × seeds × 2 shas × ~6min. Use --only to parallelise
across workspaces. --dry-run plans without spending cycles.
Docs:
- README: one-component callout on seed example, new "Post-run
audit: marginal re-evaluation" section.
- QUICKSTART: one-component invariant note, replaced the vague
"offline re-eval" placeholder with the actual reeval_ships.py
command and success criterion.
102/102 coordinator tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…list
4-persona panel reviewed the just-loosened exploration constraints and
returned unanimous NO-GO with concrete exploits. All 9 fixes below.
## Exploit closures (BLOCK tier)
1. **Ground-truth sealing** — `comp/observer/scenarios/` (labels,
disruption windows) is the cleanest reward-hack surface: the agent
legitimately operates there, commits it, and gates measure F1 from
those files. Closed three ways:
- `EXCLUDE_PATHS` applied via `git add -- ... :(exclude)...` in
commit_candidate, so scenarios edits never commit even if they
happen on disk.
- `PROTECTED_PATHS` content-hashed at iteration boundary via
`tree_hash`; mismatch raises `ProtectedPathTamperHalt`.
- `_make_write_guard` PreToolUse hook on Edit/Write blocks the
path before any edit happens.
2. **Edit/Write path whitelist** — previously only `git` was blocked
for Bash, and Edit/Write had NO filter. Agent could edit
`tasks/q.py`, `.git/hooks/`, `.coordinator/` — none reverted by
`revert_working_tree`. Now `_make_write_guard` enforces
`comp/observer/` (minus `scenarios/`) as the only writable tree;
`_make_bash_guard` blocks shell-write mechanisms (redirects, tee,
cp/mv, sed -i) targeting forbidden paths; git still blocked.
3. **Multi-detector evaluation** — `relevant_detectors(candidate)` =
intersection of `target_components` with `{bocpd, scanmw, scanwelch}`,
or all 3 if empty. Driver loops eval+score per detector and uses
`_merge_scorings` to aggregate gate-on-worst across them. Previously
a scanmw silent fallback meant candidates modifying bocpd internals
were scored against scanmw's unaffected output — ΔF1≈0 by
construction, every regression and "improvement" invisible.
4. **FP ceiling** — `total_fps > fp_ceiling_ratio × baseline_total_fps`
auto-rejects. Stops the "emit-everything" reward-hack where
rewriting a detector to fire aggressively boosts recall; per-scenario
F1 can look fine while total FPs triple.
5. **Relative catastrophe filter** — `obs.f1 < 0.5 × base.f1` where
`base.f1 ≥ 0.05`. The absolute `ΔF1 < -0.10` gate is blind on
low-baseline scenarios (a baseline of 0.08 can drop to zero and
never trip the absolute filter) — the relative cliff catches them.
6. **Stale interface reference in prompt** — agent prompt cited
`TimeSeriesAnalysis`, a type deleted in a prior consolidation.
Replaced with `SeriesDetector`/`Detector` from
`comp/observer/def/component.go` + a mandatory pre-edit discovery
checklist (AGENTS.md, interface, sibling detector, test file).
## Reviewer chain leakage closures (WARN tier)
7. **Lockbox names no longer shown to reviewers** — review prompts
now see the full scenario set with no train/lockbox partition label.
Previously the reviewer knew exactly which scenarios were held out
and that bias could leak into rationales.
8. **Scenario names redacted from review rationales before
persistence** — `_redact_scenario_names` replaces literal scenario
strings with `<scenario>`. Stops the chain where a rationale naming
"food_delivery_redis" gets rendered into future implementation
agent prompts via `_format_prior_work`, teaching the agent which
scenarios to "preserve."
9. **Algorithm Expert persona wired into Phase 1** — previously only
leakage_auditor + hack_detector; neither enforces house style
(interface compliance, filename/header, non-blocking Detect, state
key pattern, test coverage, helper reuse). Algorithm_expert existed
in reviewer.py as a Phase-2 persona but was unwired. Now required
at Phase 1 with structured evidence fields; unanimity across 3.
## Docs + spec
README iteration-flow diagram, "Gates and what they actually do"
section, QUICKSTART "What to expect" all updated to match. Allium spec
at `~/.claude/plans/ad-harness.allium` was mostly already in lockstep;
tend subagent extended `@guarantee ImplementationAgentWriteConfinement`
to cover the Bash guard.
## Test updates
- `test_config_values`: review_personas_phase1=3 (was 2).
- All 102 coordinator tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two unblockers caught during the first live smoke test:
- db.empty_db started at Phase.ZERO but every proposer-generated
candidate is phase="1", and the scheduler filters by current_phase.
Fresh runs couldn't pick anything ("no candidates in phase 0; idle")
even after the proposer successfully generated 3 candidates.
Default is now Phase.ONE. Existing dbs are patched in-place.
- push_scratch_branch now passes --no-verify. Repo-level pre-push
hooks (go-test, invoke-based linters) are designed for real PRs
and take tens of minutes; they also fail on environment drift
(missing `invoke` module, etc.) that's irrelevant to a draft
scratch branch. claude/observer-improvements is an audit PR, not
a mergeable branch; real merge-path review happens via offline
reeval_ships → cherry-pick onto a clean branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caught during the first live smoke test: the scratch branch claude/observer-improvements was rooted at q-branch-observer (which doesn't contain tasks/coordinator/). The driver checked itself out mid-iteration and its own proposer module vanished — lazy import crashed. The fix is structural: seed the scratch branch from ella/claude-coordinator-harness so the harness code is always present; upstream observer work arrives via sync_from_upstream merges every iteration. Doc now explains all four branches, who writes to each, the key invariant (scratch branch must contain tasks/coordinator/), the update-during-run workflow, and the end-of-run merge flow (reeval_ships → cherry-pick onto a clean branch → real PR against q-branch-observer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
proposer: widen research memory + additive-vs-replace hint First live iteration (bocpd-density-ratio-rulsif) was auto-rejected because the RuLSIF replacement catastrophically broke 703_shopify (F1 0.99→0.01) — catching exactly what the gate should catch. But the proposer's view of that outcome was nearly blind: it would see "density-ratio-detector, rejected, score_delta=+0.03" with no signal about which scenarios broke. Not useful for steering iter N+1. Fixes: 1. Experiment.impl_summary (new) — stores the implementation agent's DONE: line so the proposer sees WHAT was attempted, not just the family tag. 2. Experiment.auto_reject_reason (new) — stores the gate-reject string (e.g. "strict_regressions=['bocpd/703_shopify'] recall_violations= ['bocpd/059_fortnite', 'bocpd/703_shopify']") so the proposer can see which specific scenarios hit the gate. 3. proposer._top_scenario_deltas — surfaces the 5 biggest |ΔF1| swings per experiment in the proposer's prompt. Lets it see patterns like "RuLSIF helps low-baseline scenarios but kills 703_shopify" and steer accordingly. 4. Proposer guideline: "prefer additive over replace when the original has visible wins." Parallel/ensemble pattern (OR union, post-filter) instead of wholesale replacement when the recent experiments show the original is strong on any scenario. Only full-replace if the data shows the original is broadly weak. 5. Research-memory block now has a prose header telling the proposer WHAT to look for (per-scenario patterns, preserve-what-the-baseline- aced) rather than just dumping the YAML. Unrelated but caught: test_db + test_metrics asserts still expected Phase.ZERO as default (stale — we moved to Phase.ONE default two commits ago so the scheduler could pick proposer-generated candidates). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous additive-vs-replace hint was too loose — "run both algorithms
in parallel and OR the outputs" is a 2× CPU production non-starter for
a streaming detector. Tightened all three prompt surfaces:
- Proposer: explicitly forbids "run both algos every tick" framing;
enumerates non-doubling patterns (post-filter on silent/fired, cheap
pre-gate selector, shared rolling stats + lightweight decision heads).
Requires every candidate description to state per-tick CPU+memory
budget relative to baseline ("~1.2× CPU, same memory" or "adds O(k)
per tick, k=64").
- Implementation agent: stated the 1.5× per-tick budget as a HARD
constraint; enumerates the same allowed patterns; requires DONE:
summary to include concrete per-tick cost figures.
- Algorithm-expert reviewer: new `per_tick_perf_budget` structured
check. Red-flags enumerated (Detect() running two full algorithms in
parallel; sliding window scans over all history; unbounded per-
stream buffers). Missing perf estimate in DONE: = auto-fail the
check.
This is guidance, not measurement — we don't have a production perf
harness wired in yet. Enforcement is via (a) explicit budgets in every
prompt layer, (b) reviewer check requiring quantification, (c) eventual
future perf-harness gate if empirical drift shows this isn't enough.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every coord-out comment now ends with a subdued signature line: "— Claude (coordinator harness) · <short-sha> · <timestamp>" Added four new iteration-level event types so the PR reads as a running log instead of only surfacing exceptions: - iter_start: candidate id, family, detectors being evaluated, first line of the description. - iter_shipped: commit sha, push status, mean F1 delta, total FPs delta, top 5 scenario wins, per-persona reviewer rationale. - iter_rejected (review-level): mean F1 delta, per-persona verdicts. - iter_eval_failed / iter_impl_failed: compact failure summary. Enhanced the existing strict_regression emit with the top 5 |ΔF1| scenarios + total FP delta so a reader knows WHY the gate fired without digging into journal.jsonl. Cadence: ~2 comments per iteration (start + end). At 6-15min/iter over 72h that's 280-720 comments over a full run — visible on mobile, readable in the PR timeline, but not firehose-level. Important events (strict_regression, budget_milestone, tripwire) are still distinguished by emoji and the msg_type code fence header. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously only tracked aggregate input/output tokens with no model
split. Since Opus is ~5× Sonnet per token on list price, that's
meaningless for cost attribution on a multi-day run.
Changes:
1. sdk._TOKEN_COUNTER now tracks per-model-family (opus/sonnet/unknown)
× per-direction (in/out). `_collect_text` tags usage with the model
of the calling query; `_run_query` passes model down.
2. `estimate_cost(counter)` applies list prices ($15/$75 per M for
Opus, $3/$15 for Sonnet). Comment notes real bill runs higher
(cache writes, retries under-count in the SDK's ResultMessage.usage).
3. BudgetState gains opus_in/out, sonnet_in/out, unknown_in/out fields
(cumulative across the run). Serialized to db.yaml.
4. `peek_token_count()` — returns current counter without resetting, so
end-of-iter PR comments can include this-iteration usage without
stealing the count from the end-of-iteration journal+budget update.
5. Every end-of-iter PR comment (strict_regression, iter_shipped,
iter_rejected, iter_eval_failed, iter_impl_failed) now ends with:
**Budget**: This iter: X in / Y out (~$Z). Run total: T tokens
(~$T_cost) (P% of ceiling). Model mix: Opus O%, Sonnet S%.
6. `metrics.md` dashboard shows:
- estimated_cost_usd
- opus tokens in/out
- sonnet tokens in/out
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iter 3/4 and iter 6 both hit the same infinite-loop bug: SDK crashed
(env issue, then something else on iter 6), implementation_failed
didn't mark the candidate REJECTED, scheduler re-picked the same
candidate each iteration. Drains wall-hours for zero work.
Now implementation_failed marks the candidate REJECTED so the next
pick moves on. Also tidies budget footer for the 0-token case
("SDK call failed before API" instead of a divide-by-zero "0%/100%"
model mix), and adds a cumulative-only budget footer to iter_start
so PR observers see run-to-date spend without waiting for iter end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Token tracking rewrite
----------------------
Previous design kept a process-global dict accumulated across an
iteration and only persisted via save_db at end-of-iteration. Any
mid-iter crash/kill wiped the counter — every Opus token billed up
to that point vanished from the coordinator's view. Plus per-model
fields only existed after a late-hour addition, so iter 5 (the only
successful Opus call) tracked only legacy aggregate before being
wiped on restart.
New design: `.coordinator/tokens.jsonl`, append-only, one JSON line
per SDK call with {ts, iter, model, family, purpose, input, output,
success}. Source of truth, always durable, always accurate. Schema
fields removed: BudgetState.opus_in/out, sonnet_in/out, unknown_in/out
(replaced by token_log.sum_by_family). SDK counter state removed:
_TOKEN_COUNTER, consume_token_count, peek_token_count. One helper
module `token_log.py` (~180 lines) with read/filter/sum/cost helpers.
Cumulative tokens for the token-ceiling halt are now just the sum of
the log. db.budget.api_tokens_used is updated at iter end as a cached
total for display. `_budget_footer` reads the log directly — every PR
comment reflects true durable spend.
Full SDK error capture
----------------------
Every SDK failure now dumps complete exception context to
.coordinator/sdk-errors/<ts>-<purpose>.txt — the exception repr,
str, all subprocess-style attrs (cmd, args, returncode, stdout,
stderr), cause/context chain walked 5 levels, traceback. The
`iter_impl_failed` PR comment now includes the path so a human
can finally see WHY the claude-CLI subprocess is exiting with
code 1 instead of the cryptic "Check stderr output for details"
the SDK was passing through.
Callsite updates
----------------
_run_query now takes `purpose` + `root` + `iter_num` args so each
call's usage lands in the token log with full attribution. Callers
updated: implement_candidate (purpose=implement), review_experiment
(purpose=review:<persona>), proposer.propose (purpose=propose),
interpret_inbox_message (purpose=inbox). Driver passes iter_num
through so log records can be filtered per-iteration for the
"This iter:" footer line.
All 102 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iter 10 completed end-to-end on the new code but tokens.jsonl stayed empty — my token capture in _collect_text assumed every message exposed `.usage` as an object with `.input_tokens` / `.output_tokens`, but the claude-agent-sdk message shape varies: - ResultMessage: .usage (object with attrs) - AssistantMessage (some versions): .message.usage (nested) - Some versions: .usage is a dict, not an object - Cache-aware fields (cache_creation_input_tokens, cache_read_input_tokens) surface separately from input_tokens Now checks all three shapes, counts cache tokens too, and on the first message of each SDK call appends a line to .coordinator/sdk-msg-types.log showing type + attrs. One-shot per purpose so it doesn't spam. Lets us see what the SDK actually emits if token capture still comes up empty. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reframes budget control: hard limits are wall-clock time only; token spend is observed continuously and warned on, with auto-pause on a streak of expensive iterations. Tighter caps available per-run via CLI. Hard limits ----------- - wall_hours_ceiling (72h default) — only true halt - api_token_ceiling (500M default) — panic brake against runaway loops, not a budget. Tighten via --token-ceiling N for bounded smoke runs. Per-iter cost anomaly tripwire ------------------------------ After each iteration, sum the iter's tokens + cost from tokens.jsonl. Triggers on ANY of: - iter cost > cost_anomaly_vs_rolling_ratio × rolling-mean-of-last-N (default 2× last 5) - iter cost > cost_anomaly_absolute_usd (default $20) - iter tokens > cost_anomaly_absolute_tokens (default 5M) On any trigger: - Emits `cost_anomaly` PR comment with iter cost, rolling mean, all triggers, and current streak - Increments db.budget.consecutive_cost_anomalies (persisted) - Reset to 0 on a non-anomalous iter Auto-pause on streak -------------------- After cost_anomaly_pause_streak (3) consecutive anomalous iters, driver writes `.coordinator/pause`. Cost_anomaly comment becomes `requires_ack` so it lands on mobile push. Driver sleeps at next iteration boundary until user `rm`s the file. Nothing in flight gets wasted (vs Ctrl-C mid-iter which loses the impl work). Cooperative pause ----------------- - Manual: `touch ~/dd/datadog-agent/.coordinator/pause` - Resume: `rm ~/dd/datadog-agent/.coordinator/pause` - Driver checks at top of every --forever iteration; sleeps 30s when file present. Schema ------ - BudgetState gains `consecutive_cost_anomalies: int = 0`, persisted. - Existing dbs load cleanly (default 0). CLI flags (already landed) -------------------------- --token-ceiling N and --wall-hours-ceiling N override per-run AND persist to db.budget so subsequent restarts inherit (pass again to bump further). All 102 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-iter token accounting and cost-anomaly check both passed `len(db.iterations)` as the iteration number, but that's incremented BEFORE the check fires (inside _run_iteration_body). So every check queried an iteration that didn't exist yet, got zero tokens, and silently no-op'd. Iter 11 ($133, 8.6M tokens) and iter 12 ($108, 7M tokens) both should have fired cost_anomaly tripwires and didn't. Fix: compute `just_ran_iter = len(db.iterations) - 1` once and use it for both the tokens_used journal event and the anomaly check. Tokens were already being logged correctly (sdk.py uses the driver-passed iter_num from inside the iteration body, which is the right N). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ded) Previously: phase plateau → coord_out "write inbox.md to redirect" → break. That's wrong. Pivot should be AUTOPILOT — the coordinator is autonomous by design; inbox.md is optional steering, not a prerequisite. New flow on phase plateau (5 consecutive non-improving iters): 1. Collect approach_families from the last `plateau_pivot_lookback` (5) iterations. These just ran out of steam. 2. Add them to persistent `db.pivot_banned_families` (via schema). 3. Reset phase_state.plateau_counter (keep best_score so bar stays high). 4. Increment db.pivot_count. 5. Emit `phase_pivot` coord-out (informational, requires_ack=False). 6. Continue the --forever loop — next iter's scheduler sees the newly- banned families, next proposer call sees a PIVOT block in its prompt telling it "previous directions saturated, go structurally different." Scheduler `stuck_families()` now unions per-family-stuck detection with `db.pivot_banned_families` so banned families never get scheduled. Proposer prompt gains a `pivot_clause` rendered when pivot_banned is non-empty: explains that this call is a pivot, points at recent experiments for what the reviewers rejected and why, demands structurally different proposals. Hard runaway stop ----------------- If `pivot_count >= CONFIG.max_pivots_before_halt` (default 4) AND no candidate has shipped across the whole run, emit `phase_exit` with requires_ack=True and break. That's the signal "the problem is structurally un-improvable with this setup" — not a spurious plateau. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
$20 absolute trigger was noise on this workload — every Opus implementation iter on this codebase costs $80-150 because of claude-agent-sdk's per-turn full-context resend pattern. Every single iter was tripping the $20 trigger, diluting the real anomaly signal. Kept: - Relative: iter cost > 2× rolling mean (primary signal; what we actually want — "weirdly expensive vs peers") - Absolute tokens: iter tokens > 10M (hard ceiling for runaway iters; was 5M but nearly every iter exceeds that so it was also noisy) Removed: - CONFIG.cost_anomaly_absolute_usd - Corresponding trigger block in _check_cost_anomaly Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Monitoring during a multi-day run was a grep-your-way-through exercise. Codifies the commands I've been handing back one-off: - process + branch state - last N events (journal tail) - inbox_ack (did the coordinator read my PR comment yet?) - phase_pivot (has autopilot fired?) - pivot_banned_families + pivot_count (db.yaml grep) - cost_anomaly events - last candidate's terminal state - cumulative + per-iter tokens (python inline sum over tokens.jsonl) - pause state - sdk-errors captures (for debugging impl_failed) - shipped candidates (coord: commits on scratch branch) - gh auth health Plus an interpretation note: during Opus impl (~20 min), tokens.jsonl updates every few seconds but journal stays quiet — that's normal. Both quiet for >15 min = stuck. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hanically Architectural split based on iter-16's $291 matrix-profile crash. That crash happened because the implementer (Opus) received a vague candidate description, had to redesign in-flight, reached for the Task tool to delegate, and the sub-agent ran for 17 minutes before a silent exit-1. Root cause: implementer was doing design work that belongs to the proposer. Fix: move design into the proposer, make implementation mechanical. New split --------- - Proposer (Opus) now authors `candidate.implementation_plan` — a detailed 30-80 line plan covering: exact files to modify/create, interface to preserve, numbered algorithm steps, data structure shapes, integration points, test plan, perf budget. Proposer prompt rewritten to demand this with explicit sections. - Implementer now defaults to Sonnet (CONFIG.model_light) when a plan is present. max_turns=25 (vs 60). Prompt is prescriptive: "execute the plan mechanically, don't redesign; net-positive deviations allowed but must be documented in DONE." Falls back to Opus with the old design-and-implement prompt if no plan was authored (rare; proposer prompt demands one). - Task tool explicitly blocked via a new PreToolUse hook (`_block_task_tool_hook`). The SDK exposes Task implicitly; our allowed_tools whitelist doesn't stop it. The hook does. Prevents the sub-agent amplification pattern that crashed iter 16. Reviewer -------- - `review_experiment` now takes the candidate (for implementation_plan) and passes it into every persona's prompt. - hack_detector gains a `plan_fidelity` check: did the implementer execute the proposer's plan? If abandoned, is the deviation net- positive? Advisory, not auto-reject — a better-than-plan outcome still ships; plan abandonment that produces worse code doesn't. - leakage_auditor + algorithm_expert also see the plan (useful for comparing intent vs actual). Schema ------ - Candidate gains `implementation_plan: str = ""`. Loaded from db.yaml; proposer writes; implementer + reviewer consume. Cost expectation ---------------- Previously ~$100-150/iter on Opus implementation. Expected with Sonnet + detailed plan: $15-40/iter. Proposer grows maybe 1.3× (more detail in output). Net iter cost should halve. If Sonnet can't execute a plan faithfully, that's a proposer problem (vague plan) — feedback loop via plan_fidelity rationale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following CLAUDE.md #5 ("docs and specs stay in lockstep with code"), catching up QUICKSTART, README, and the allium spec for today's rapid-fire changes that drifted them. QUICKSTART: - 6f rewritten: api_token_ceiling is now a panic brake (500M default), not a budget. Real cost control is wall-hours + cost-anomaly tripwire + auto-pause on streak. Mentions --token-ceiling CLI flag. - Step 8: `-u` unbuffered flag added to launch command; CLI flag examples (--token-ceiling, --wall-hours-ceiling) shown. - New §8a "Autopilot behaviors" — summarizes phase pivot, cost anomaly tripwire, cooperative pause (.coordinator/pause), and sdk-errors capture. Things the user doesn't need to trigger but should know exist. README: - Model-routing table rewritten for the proposer-implementer split. Proposer (Opus) authors detailed implementation_plan; Implementer (Sonnet when plan present) executes mechanically with max_turns=25 and Task tool blocked. Reviewers see plan + diff, hack_detector has a plan_fidelity advisory check. - Added a prose block explaining WHY the split: iter 16 Opus-as- implementer reached for Task, spawned 17-min sub-agent, crashed with $291 burned. Moving design into proposer + Sonnet-executes makes it mechanical, bounded, ~5× cheaper. allium spec (via allium:tend subagent): - Config: plateau_pivot_lookback, max_pivots_before_halt, all cost_anomaly_* fields, api_token_ceiling 500M with panic-brake prose, CLI flag override semantics. - Candidate.implementation_plan; Budget.consecutive_cost_anomalies; Db.pivot_banned_families + pivot_count fields. - New rules: AutopilotPivotOnPhasePlateau, CheckCostAnomalyAt IterationEnd, WaitWhilePaused. - Modified rule guidance: UnanimousReviewRequired (plan passthrough), ProposeCandidatesWhenQueueDry (plan requirement + pivot_clause), DiversityBanStuckFamilies (persistent-ban union), PhaseExitOnPlateau (superseded note + hard-halt path retained). - New CoordinatorConsole guarantees: TaskToolBlockedForImplementer, ProposerImplementerSplit, DurableTokenLogging, SdkErrorsCaptured, LiveBudgetCeilingsFromDb, CooperativePause. - allium check: 0 errors, 28 pre-existing warnings. All 102 coordinator tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- git_ops.commit_candidate: --no-verify (was crashing iter 22 on a pre-commit hook import error, leaving SHIPPED+pending in db.yaml and a dirty tree that restart would revert) - sdk.py: treat "command failed with exit code" + "fatal error in message reader" as transient; 3x retry with exponential backoff - sdk.py: register ClaudeAgentOptions.stderr callback to buffer claude-CLI stderr into sdk-errors dumps for post-hoc diagnosis - metrics.py: build_f1_matrix renders per-detector x per-scenario F1 table from shipped experiments; written to .coordinator/ f1-matrix.md on every regenerate; compact block embedded in metrics.md and iter_shipped PR comments Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parallel variant of the live run. Points the coordinator at ella/observer-blank as upstream and uses claude/observer-blank-improvements as the scratch branch, so this run cannot collide with the live q-branch-observer → claude/observer-improvements pipeline driven from ella/claude-coordinator-harness. Everything else (workspace_for_detector convention, .coordinator state paths, gates, reviewers) is identical and derives from runtime state.
Gitlab CI Configuration Changes
|
| Removed | Modified | Added | Renamed |
|---|---|---|---|
| 0 | 361 | 0 | 0 |
Updated: .gitlab/distribution.yml
Changes Summary
| Removed | Modified | Added | Renamed |
|---|---|---|---|
| 0 | 0 | 2 | 0 |
ℹ️ Diff available in the job log.
Go Package Import DifferencesBaseline: e5b320d
|
|
✅ hello from blank-slate coordinator — Claude (coordinator harness) · |
…ompt for blank slate Blank-slate variant: the observer has no detectors/correlators at launch. The hardcoded KNOWN_DETECTORS tuple (bocpd/scanmw/scanwelch) made every candidate on observer-blank fall through relevant_detectors() to a list of names that don't exist in the catalog, producing empty eval reports and nonsensical gate outcomes. - driver.py: replace module-level KNOWN_DETECTORS with known_detectors(db) that reads baseline.detectors.keys(). When baseline is empty (true blank slate), relevant_detectors() returns the candidate's own target_components so the eval still runs against the detector the candidate just created. Gains as baseline re-imports happen. - proposer.py: rewrite the three prompt paragraphs that named bocpd/scanmw/scanwelch as existing; they don't exist on this branch. Tell the agent it's inventing detectors from scratch, must register them in component_catalog.go, and must keep the name stable across iterations so baseline/gate lines up. Live run (q-branch-observer) uses ella/claude-coordinator-harness and is untouched.
…sing-baseline scoring With blank-slate upstream (ella/observer-blank) the agent invents detector names freely. Old policy in relevant_detectors() — 'fall back to all known detectors when target doesn't intersect known' — silently excluded the new detector the candidate just created, evaluating only stale ones. Plus score_against_baseline KeyError'd on any detector not yet baselined. Behavior now (for every iteration): detectors_to_eval = baseline.detectors.keys() ∪ candidate.target_components - Every baselined detector is evaluated every iter → catches 'did this candidate break an existing ship' across the whole admitted set, not just the ones the candidate explicitly targets. - The candidate's own target components are ALWAYS evaluated even if not yet baselined → the new detector's progress is measured and recorded per-iter. - score_against_baseline returns a no-gate ScoringResult when the detector is missing from baseline: raw F1/FPs populated, strict_regressions=[], recall_floor_violations=[], baseline_mean_f1=0. Gates only fire for baselined detectors. FP-ceiling already guards baseline_total_fps > 0 so it auto-skips for unbaselined detectors. Promotion flow: when iter N ships a good novel-vX detector, operator runs import_baseline --detector novel-vX=<iter N report path> to admit it. From then on future candidates are gated against novel-vX too. No rolling auto-ratchet (anti-noise-promotion); promotion is always a human decision.
Static quality checks❌ Please find below the results from static quality gates Error
Gate failure full details
Static quality gates prevent the PR to merge! Successful checksInfo
On-wire sizes (compressed)
|
|
iter 0 · Evaluating against detectors:
Budget: Run total: 170,018 tokens (~$0.52) (0.0% of 500,000,000 ceiling). Model mix: Opus 0%, Sonnet 100%. — Claude (coordinator harness) · |
6b7e8df to
7ee60af
Compare
…/observer-blank-improvements
7ee60af to
d64758b
Compare
Summary
Parallel run-log draft PR for the blank-slate coordinator variant. Sibling of #49678 (the live q-branch-observer run).
ella/observer-blank(observer wiring only; all detectors/correlators removed — fresh canvas for novel ideas)claude/observer-blank-improvements(this PR's head)ella/claude-coordinator-harness-blank(fork ofella/claude-coordinator-harnesswithSCRATCH_BRANCH+UPSTREAM_BRANCHflipped)Do not merge
This PR is a long-lived audit log of coordinator activity. Every commit corresponds to one approved candidate. Final merge happens separately via
coordinator.reeval_ships+ cherry-pick onto a fresh branch (seetasks/coordinator/QUICKSTART.md§11).Use as interaction channel
Drop plain-English PR comments here to steer the coordinator (polled at iteration start, ACKed via outbound comment). Once the driver workspace is up and
COORD_GITHUB_PR_NUMBERpoints at this PR, GitHub mobile notifications work out of the box.