coordinator: derive KNOWN_DETECTORS from baseline; reword proposer prompt for blank slate

ellataira · ellataira · commit 4863d2b5bd59 · 2026-04-24T16:03:25.000-04:00
Blank-slate variant: the observer has no detectors/correlators at launch.
The hardcoded KNOWN_DETECTORS tuple (bocpd/scanmw/scanwelch) made every
candidate on observer-blank fall through relevant_detectors() to a list
of names that don't exist in the catalog, producing empty eval reports
and nonsensical gate outcomes.

- driver.py: replace module-level KNOWN_DETECTORS with known_detectors(db)
  that reads baseline.detectors.keys(). When baseline is empty (true
  blank slate), relevant_detectors() returns the candidate's own
  target_components so the eval still runs against the detector the
  candidate just created. The known set grows as baseline re-imports
  happen.
- proposer.py: rewrite the three prompt paragraphs that named
  bocpd/scanmw/scanwelch as existing; they don't exist on this branch.
  Tell the agent it's inventing detectors from scratch, must register
  them in component_catalog.go, and must keep the name stable across
  iterations so baseline/gate lines up.
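
A minimal sketch of the selection policy described above, for
illustration only. Db, Baseline, and Candidate are simplified stand-ins
(assumptions, not the real coordinator types); the two functions mirror
the logic this commit adds to driver.py.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-ins for the coordinator's real types.
@dataclass
class Baseline:
    detectors: dict = field(default_factory=dict)


@dataclass
class Db:
    baseline: Optional[Baseline] = None


@dataclass
class Candidate:
    target_components: list


def known_detectors(db):
    # Gate-eligible detectors are exactly those with a baseline entry.
    if db.baseline is None or not db.baseline.detectors:
        return ()
    return tuple(db.baseline.detectors.keys())


def relevant_detectors(candidate, db):
    known = known_detectors(db)
    if not known:
        # Blank-slate bootstrap: eval whatever the candidate targets.
        return list(candidate.target_components) or ["unknown"]
    named = [c for c in candidate.target_components if c in known]
    # No baselined detector targeted: measure all known detectors.
    return named or list(known)


blank = Db()
seeded = Db(Baseline({"det_a": {}, "det_b": {}}))
print(relevant_detectors(Candidate(["my_new_det"]), blank))
print(relevant_detectors(Candidate(["det_b", "other"]), seeded))
print(relevant_detectors(Candidate(["correlator_x"]), seeded))
```

The three calls cover the three branches: empty baseline, a targeted
baselined detector, and a change that names no baselined detector.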

Live run (q-branch-observer) uses ella/claude-coordinator-harness and
is untouched.
diff --git a/tasks/coordinator/driver.py b/tasks/coordinator/driver.py
@@ -555,39 +555,54 @@ def _recent_same_family(db: Db, candidate: Candidate, limit: int = 5) -> list[di
     return out
 
 
-KNOWN_DETECTORS = ("bocpd", "scanmw", "scanwelch")
+def known_detectors(db: Db) -> tuple[str, ...]:
+    """Detector names recognised by the coordinator for gating purposes.
+
+    Derived from baseline.detectors so the set grows as new detectors are
+    imported via re-baselining — no hardcoded list to keep in sync with
+    the catalog. A detector must have a baseline entry to be gate-eligible;
+    score_against_baseline does detectors[name] lookup which would KeyError
+    without one.
+    """
+    if db.baseline is None or not db.baseline.detectors:
+        return ()
+    return tuple(db.baseline.detectors.keys())
 
 
-def relevant_detectors(candidate: Candidate) -> list[str]:
+def relevant_detectors(candidate: Candidate, db: Db) -> list[str]:
     """Which detectors' F1 do we measure to decide if this candidate shipped?
 
     A candidate can modify any file under comp/observer/. But scoring runs
     per-detector, and the panel review caught this: silently defaulting to
-    ONE detector (previously scanmw) meant candidates modifying e.g. bocpd
-    internals got scored against scanmw's unaffected output — ΔF1≈0 by
+    ONE detector meant candidates modifying e.g. detector A's internals
+    got scored against detector B's unaffected output — ΔF1≈0 by
     construction, "improvement" or "regression" both invisible.
 
     Policy:
-      - Intersect target_components with the 3 known detectors. If the
+      - Intersect target_components with known (baselined) detectors. If the
         intersection is non-empty, eval each one in it and gate on the
         WORST ΔF1 across them.
       - If the intersection is empty (correlator changes, new features,
-        pipeline-level work), eval ALL 3 detectors — we can't tell in
+        pipeline-level work), eval ALL known detectors — we can't tell in
         advance which one the change affects, so measure them all.
-
-    Always returns a non-empty list.
+      - If no detectors are baselined yet (blank-slate bootstrap), return
+        target_components as-is. Scoring will treat empty baseline.scenarios
+        as "no gate," and the per-detector report is still generated.
     """
-    named = [c for c in candidate.target_components if c in KNOWN_DETECTORS]
+    known = known_detectors(db)
+    if not known:
+        return list(candidate.target_components) or ["unknown"]
+    named = [c for c in candidate.target_components if c in known]
     if named:
         return named
-    return list(KNOWN_DETECTORS)
+    return list(known)
 
 
-def primary_detector(candidate: Candidate) -> str:
+def primary_detector(candidate: Candidate, db: Db) -> str:
     """Deprecated single-detector view. Kept for callers that print a
     single string (metrics, log lines). Returns the first relevant detector.
     """
-    return relevant_detectors(candidate)[0]
+    return relevant_detectors(candidate, db)[0]
 
 
 def _merge_scorings(scorings: dict, detectors: list[str]):
@@ -847,7 +862,7 @@ def _run_iteration_body(
         return
 
     it.candidate_id = candidate.id
-    detectors = relevant_detectors(candidate)
+    detectors = relevant_detectors(candidate, db)
     # Why a list, not a single detector: a candidate modifying correlator
     # code or a shared feature affects MULTIPLE detectors. Gating on a
     # single "primary" detector was a panel-reviewed BLOCK — silent
diff --git a/tasks/coordinator/proposer.py b/tasks/coordinator/proposer.py
@@ -156,9 +156,11 @@ def build_proposer_prompt(
 novel** candidate changes that might improve anomaly detection on the
 observer pipeline.
 
-This harness is explicitly for exploration. Threshold-tuning on the three
-existing detectors (bocpd / scanmw / scanwelch) is the LEAST interesting
-thing you can do here — it finds small local wins and saturates fast.
+This harness is explicitly for exploration. The observer pipeline currently
+has ZERO detectors and ZERO correlators — only wiring, storage, extractors,
+and interfaces. Your job is to invent the detectors/correlators from
+scratch. Shallow threshold-tuning on whatever you first land is the LEAST
+interesting thing you can do here — saturate fast and pivot.
 
 What's actually interesting:
 
@@ -173,15 +175,17 @@ def build_proposer_prompt(
   outputs differently, new emitter logic, new feature-engineering stages,
   seasonality-aware baseline windows, per-signal-class routing.
 
-- **Replace an existing detector's internals.** bocpd/scanmw/scanwelch
-  are starting points, not sacred. Keep the detector's registration and
-  whichever interface from `comp/observer/def/component.go` it already
-  implements (`SeriesDetector` or `Detector`), but swap the guts for a
-  different algorithm entirely (e.g. replace BOCPD with a density-ratio
-  detector while keeping the `bocpd` name).
-
-- **Prefer non-doubling patterns over full replacement** when the
-  original detector has visible wins. Wholesale replacement can
+- **Register detectors in `comp/observer/impl/component_catalog.go`.**
+  Implement the `Detector` or `SeriesDetector` interface from
+  `comp/observer/def/component.go`, give the detector a stable name,
+  add an entry to `defaultCatalog()`. The name must match the
+  candidate's `target_components[0]`; the coordinator uses that name
+  for `q.eval-scenarios --only <name>` and for baseline gating. Pick a
+  name and keep it stable across iterations of the same family.
+
+- **Evolve or replace without doubling work.** Once a detector is
+  shipped, later candidates can refine it in place or swap the guts
+  while keeping the registered name. Wholesale replacement can
   catastrophically regress scenarios the original aced (see `recent
   experiments` — replacements tend to show big +ΔF1 on scenarios the
   original missed AND big -ΔF1 on scenarios it aced).
@@ -216,9 +220,10 @@ def build_proposer_prompt(
 
 The eval framework is OFF LIMITS. Do NOT modify `tasks/q.py`,
 `tasks/libs/q`, `q.eval-scenarios` orchestration, or the testbench
-registry. The three detector names and scenario list are fixed
-evaluation boundaries. All innovation happens INSIDE `comp/observer/`,
-behind the three existing detector names.
+registry. The scenario list is a fixed evaluation boundary. All
+innovation happens INSIDE `comp/observer/`. Detector names are
+INVENTED by you — pick one and keep it stable across iterations so
+the coordinator's baseline/gate machinery lines up.
 
 - **Adapted research from related systems.** Datadog's watchdog uses
   AnomalyRank. Netflix's SURUS does robust PCA on streams. NAB has a battery