Commit 4863d2b

coordinator: derive KNOWN_DETECTORS from baseline; reword proposer prompt for blank slate
Blank-slate variant: the observer has no detectors/correlators at launch. The hardcoded KNOWN_DETECTORS tuple (bocpd/scanmw/scanwelch) made every candidate on observer-blank fall through relevant_detectors() to a list of names that don't exist in the catalog, producing empty eval reports and nonsensical gate outcomes.

- driver.py: replace module-level KNOWN_DETECTORS with known_detectors(db), which reads baseline.detectors.keys(). When the baseline is empty (true blank slate), relevant_detectors() returns the candidate's own target_components so the eval still runs against the detector the candidate just created. The known set grows as baseline re-imports happen.
- proposer.py: rewrite the three prompt paragraphs that named bocpd/scanmw/scanwelch as existing; they don't exist on this branch. Tell the agent it's inventing detectors from scratch, must register them in component_catalog.go, and must keep the name stable across iterations so baseline/gate lines up.

Live run (q-branch-observer) uses ella/claude-coordinator-harness and is untouched.
1 parent d2bb80a commit 4863d2b

2 files changed

Lines changed: 48 additions & 28 deletions

tasks/coordinator/driver.py

Lines changed: 28 additions & 13 deletions
@@ -555,39 +555,54 @@ def _recent_same_family(db: Db, candidate: Candidate, limit: int = 5) -> list[di
     return out


-KNOWN_DETECTORS = ("bocpd", "scanmw", "scanwelch")
+def known_detectors(db: Db) -> tuple[str, ...]:
+    """Detector names recognised by the coordinator for gating purposes.
+
+    Derived from baseline.detectors so the set grows as new detectors are
+    imported via re-baselining — no hardcoded list to keep in sync with
+    the catalog. A detector must have a baseline entry to be gate-eligible;
+    score_against_baseline does detectors[name] lookup which would KeyError
+    without one.
+    """
+    if db.baseline is None or not db.baseline.detectors:
+        return ()
+    return tuple(db.baseline.detectors.keys())


-def relevant_detectors(candidate: Candidate) -> list[str]:
+def relevant_detectors(candidate: Candidate, db: Db) -> list[str]:
     """Which detectors' F1 do we measure to decide if this candidate shipped?

     A candidate can modify any file under comp/observer/. But scoring runs
     per-detector, and the panel review caught this: silently defaulting to
-    ONE detector (previously scanmw) meant candidates modifying e.g. bocpd
-    internals got scored against scanmw's unaffected output — ΔF1≈0 by
+    ONE detector meant candidates modifying e.g. detector A's internals
+    got scored against detector B's unaffected output — ΔF1≈0 by
     construction, "improvement" or "regression" both invisible.

     Policy:
-    - Intersect target_components with the 3 known detectors. If the
+    - Intersect target_components with known (baselined) detectors. If the
       intersection is non-empty, eval each one in it and gate on the
       WORST ΔF1 across them.
     - If the intersection is empty (correlator changes, new features,
-      pipeline-level work), eval ALL 3 detectors — we can't tell in
+      pipeline-level work), eval ALL known detectors — we can't tell in
       advance which one the change affects, so measure them all.
-
-    Always returns a non-empty list.
+    - If no detectors are baselined yet (blank-slate bootstrap), return
+      target_components as-is. Scoring will treat empty baseline.scenarios
+      as "no gate," and the per-detector report is still generated.
     """
-    named = [c for c in candidate.target_components if c in KNOWN_DETECTORS]
+    known = known_detectors(db)
+    if not known:
+        return list(candidate.target_components) or ["unknown"]
+    named = [c for c in candidate.target_components if c in known]
     if named:
         return named
-    return list(KNOWN_DETECTORS)
+    return list(known)


-def primary_detector(candidate: Candidate) -> str:
+def primary_detector(candidate: Candidate, db: Db) -> str:
     """Deprecated single-detector view. Kept for callers that print a
     single string (metrics, log lines). Returns the first relevant detector.
     """
-    return relevant_detectors(candidate)[0]
+    return relevant_detectors(candidate, db)[0]


 def _merge_scorings(scorings: dict, detectors: list[str]):
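
The three branches of the new selection policy can be exercised in isolation. The sketch below copies the selection logic from the hunk above but swaps in minimal stand-in `Db`, `Baseline`, and `Candidate` dataclasses; those stand-ins and the detector names (`det_a`, `det_b`, `my_new_detector`, `correlator_x`) are assumptions for illustration, not the coordinator's real types or catalog entries.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Baseline:
    # Stand-in (assumed shape): detector name -> baseline record.
    detectors: dict = field(default_factory=dict)


@dataclass
class Db:
    # Stand-in: only the attribute the selection logic reads.
    baseline: Optional[Baseline] = None


@dataclass
class Candidate:
    # Stand-in: only the attribute the selection logic reads.
    target_components: list = field(default_factory=list)


def known_detectors(db: Db) -> tuple:
    """Names that have a baseline entry; empty on a blank slate."""
    if db.baseline is None or not db.baseline.detectors:
        return ()
    return tuple(db.baseline.detectors.keys())


def relevant_detectors(candidate: Candidate, db: Db) -> list:
    known = known_detectors(db)
    if not known:
        # Blank-slate bootstrap: score whatever the candidate says it targets.
        return list(candidate.target_components) or ["unknown"]
    named = [c for c in candidate.target_components if c in known]
    # Empty intersection: evaluate every baselined detector.
    return named or list(known)


blank = Db()
baselined = Db(Baseline({"det_a": {}, "det_b": {}}))

print(relevant_detectors(Candidate(["my_new_detector"]), blank))
print(relevant_detectors(Candidate([]), blank))
print(relevant_detectors(Candidate(["det_b", "correlator_x"]), baselined))
print(relevant_detectors(Candidate(["correlator_x"]), baselined))
```

On a blank slate the candidate's own target is passed through (or `"unknown"` if it names none), so the result is never empty. Once a baseline exists, only baselined names survive the intersection.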
@@ -847,7 +862,7 @@ def _run_iteration_body(
         return

     it.candidate_id = candidate.id
-    detectors = relevant_detectors(candidate)
+    detectors = relevant_detectors(candidate, db)
     # Why a list, not a single detector: a candidate modifying correlator
     # code or a shared feature affects MULTIPLE detectors. Gating on a
     # single "primary" detector was a panel-reviewed BLOCK — silent
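
The docstring's "gate on the WORST ΔF1" rule amounts to taking the minimum F1 change over the relevant detectors. A minimal sketch follows, assuming per-detector F1 scores arrive as plain dicts. The helper names `worst_delta_f1` and `passes_gate`, the `min_delta` threshold, and the choice to pass when nothing is comparable are all hypothetical; the coordinator's actual scoring code is not shown in this diff.

```python
def worst_delta_f1(candidate_f1: dict, baseline_f1: dict, detectors: list):
    """Return (detector, delta) for the smallest F1 change.

    Returns None when no detector has both a candidate and a baseline score.
    """
    deltas = {
        name: candidate_f1[name] - baseline_f1[name]
        for name in detectors
        if name in candidate_f1 and name in baseline_f1
    }
    if not deltas:
        return None
    worst = min(deltas, key=deltas.get)
    return worst, deltas[worst]


def passes_gate(candidate_f1: dict, baseline_f1: dict, detectors: list,
                min_delta: float = 0.0) -> bool:
    """Gate on the worst detector, so one regression cannot hide behind a win."""
    worst = worst_delta_f1(candidate_f1, baseline_f1, detectors)
    if worst is None:
        # Assumption: with nothing to compare against, treat it as "no gate".
        return True
    return worst[1] >= min_delta


candidate = {"det_a": 0.75, "det_b": 0.5}
baseline = {"det_a": 0.5, "det_b": 0.75}

print(worst_delta_f1(candidate, baseline, ["det_a", "det_b"]))
print(passes_gate(candidate, baseline, ["det_a", "det_b"]))
print(passes_gate(candidate, baseline, ["det_a"]))
```

Scoring only `det_a` would report a gain of 0.25 and pass. Scoring both detectors exposes the 0.25 regression on `det_b` and fails the gate, which is the failure mode the single-detector default hid.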

tasks/coordinator/proposer.py

Lines changed: 20 additions & 15 deletions
@@ -156,9 +156,11 @@ def build_proposer_prompt(
 novel** candidate changes that might improve anomaly detection on the
 observer pipeline.

-This harness is explicitly for exploration. Threshold-tuning on the three
-existing detectors (bocpd / scanmw / scanwelch) is the LEAST interesting
-thing you can do here — it finds small local wins and saturates fast.
+This harness is explicitly for exploration. The observer pipeline currently
+has ZERO detectors and ZERO correlators — only wiring, storage, extractors,
+and interfaces. Your job is to invent the detectors/correlators from
+scratch. Shallow threshold-tuning on whatever you first land is the LEAST
+interesting thing you can do here — saturate fast and pivot.

 What's actually interesting:

@@ -173,15 +175,17 @@ def build_proposer_prompt(
   outputs differently, new emitter logic, new feature-engineering stages,
   seasonality-aware baseline windows, per-signal-class routing.

-- **Replace an existing detector's internals.** bocpd/scanmw/scanwelch
-  are starting points, not sacred. Keep the detector's registration and
-  whichever interface from `comp/observer/def/component.go` it already
-  implements (`SeriesDetector` or `Detector`), but swap the guts for a
-  different algorithm entirely (e.g. replace BOCPD with a density-ratio
-  detector while keeping the `bocpd` name).
-
-- **Prefer non-doubling patterns over full replacement** when the
-  original detector has visible wins. Wholesale replacement can
+- **Register detectors in `comp/observer/impl/component_catalog.go`.**
+  Implement the `Detector` or `SeriesDetector` interface from
+  `comp/observer/def/component.go`, give the detector a stable name,
+  add an entry to `defaultCatalog()`. The name must match the
+  candidate's `target_components[0]`; the coordinator uses that name
+  for `q.eval-scenarios --only <name>` and for baseline gating. Pick a
+  name and keep it stable across iterations of the same family.
+
+- **Evolve or replace without doubling work.** Once a detector is
+  shipped, later candidates can refine it in place or swap the guts
+  while keeping the registered name. Wholesale replacement can
   catastrophically regress scenarios the original aced (see `recent
   experiments` — replacements tend to show big +ΔF1 on scenarios the
   original missed AND big -ΔF1 on scenarios it aced).
@@ -216,9 +220,10 @@ def build_proposer_prompt(

 The eval framework is OFF LIMITS. Do NOT modify `tasks/q.py`,
 `tasks/libs/q`, `q.eval-scenarios` orchestration, or the testbench
-registry. The three detector names and scenario list are fixed
-evaluation boundaries. All innovation happens INSIDE `comp/observer/`,
-behind the three existing detector names.
+registry. The scenario list is a fixed evaluation boundary. All
+innovation happens INSIDE `comp/observer/`. Detector names are
+INVENTED by you — pick one and keep it stable across iterations so
+the coordinator's baseline/gate machinery lines up.

 - **Adapted research from related systems.** Datadog's watchdog uses
   AnomalyRank. Netflix's SURUS does robust PCA on streams. NAB has a battery
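
The reworded prompt relies on the agent keeping `target_components[0]` stable within a family, because baseline and gate lookups are keyed by that name. The sketch below shows one way such drift could be caught, assuming candidate history is available as `(family, target_components)` pairs, oldest first. The `unstable_families` helper, the history shape, and the sample names are hypothetical and not part of this commit.

```python
from collections import defaultdict


def unstable_families(history):
    """Map each family to its distinct first target names, if more than one.

    `history` is an iterable of (family, target_components) pairs, oldest
    first. Candidates with no target components are skipped.
    """
    names = defaultdict(list)
    for family, components in history:
        if not components:
            continue
        first = components[0]
        if first not in names[family]:
            names[family].append(first)
    return {family: seen for family, seen in names.items() if len(seen) > 1}


history = [
    ("spectral", ["spec_det"]),
    ("spectral", ["spec_det", "correlator_x"]),
    ("robust", ["rpca"]),
    ("robust", ["rpca_v2"]),  # renamed mid-family: baseline entry no longer matches
]

print(unstable_families(history))
```

A rename like `rpca` to `rpca_v2` would leave the new name without a baseline entry, so the candidate would be scored as a fresh detector rather than against its own predecessor.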
