coordinator: eval union of baselined + target detectors; graceful missing-baseline scoring

ellataira · ellataira · commit 6906779a5759 · 2026-04-24T16:08:54.000-04:00
With a blank-slate upstream (ella/observer-blank) the agent invents detector
names freely. The old policy in relevant_detectors() ('fall back to all known
detectors when target doesn't intersect known') silently excluded the new
detector the candidate had just created and evaluated only stale ones. In
addition, score_against_baseline raised KeyError for any detector not yet
baselined.

Behavior now (for every iteration):
  detectors_to_eval = baseline.detectors.keys() ∪ candidate.target_components

- Every baselined detector is evaluated every iter → catches 'did this
  candidate break an existing ship' across the whole admitted set, not
  just the ones the candidate explicitly targets.
- The candidate's own target components are ALWAYS evaluated even if
  not yet baselined → the new detector's progress is measured and
  recorded per-iter.
- score_against_baseline returns a no-gate ScoringResult when the
  detector is missing from baseline: raw F1/FPs populated,
  strict_regressions=[], recall_floor_violations=[], baseline_mean_f1=0.
  Gates only fire for baselined detectors. FP-ceiling already guards
  baseline_total_fps > 0 so it auto-skips for unbaselined detectors.
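
The evaluation set described above can be sketched as an order-preserving
union. This is a standalone illustration, not the driver code: the helper
name detectors_to_eval and the detector names (bocpd, cusum, novel-v1) are
hypothetical stand-ins.

```python
def detectors_to_eval(baselined: list[str], targets: list[str]) -> list[str]:
    """Order-preserving union: baselined detectors first, then new targets."""
    ordered = list(baselined)
    for name in targets:
        if name not in ordered:  # skip targets that are already baselined
            ordered.append(name)
    return ordered


# Candidate targets one baselined detector and one name it just invented.
mixed = detectors_to_eval(["bocpd", "cusum"], ["novel-v1", "cusum"])
print(mixed)  # ['bocpd', 'cusum', 'novel-v1']

# Blank-slate run: nothing is baselined, so only the targets are evaluated.
blank = detectors_to_eval([], ["novel-v1"])
print(blank)  # ['novel-v1']
```

A plain list is used rather than a set so the evaluation order stays stable
from one iteration to the next.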

Promotion flow: when iter N ships a good novel-vX detector, the operator
runs import_baseline --detector novel-vX=<iter N report path> to admit
it. From then on, future candidates are gated against novel-vX too.
No rolling auto-ratchet (anti-noise-promotion); promotion is always a
human decision.
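
A rough sketch of how gating changes once a detector is admitted. Everything
here is a hypothetical simplification: Result is far smaller than the real
ScoringResult, a single mean-F1 comparison stands in for the real gates, and
admission is simulated with a dict assignment where the real flow goes
through import_baseline.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Result:  # hypothetical; much smaller than the real ScoringResult
    mean_f1: float
    baseline_mean_f1: float
    gated: bool
    strict_regressions: list[str] = field(default_factory=list)


def score(detector: str, mean_f1: float, baseline: dict[str, float]) -> Result:
    floor: Optional[float] = baseline.get(detector)
    if floor is None:
        # Unbaselined detector: report the raw score, fire no gates.
        return Result(mean_f1=mean_f1, baseline_mean_f1=0.0, gated=False)
    # Baselined detector: a score below the admitted floor is a regression.
    regressions = [detector] if mean_f1 < floor else []
    return Result(mean_f1, floor, True, regressions)


baseline = {"cusum": 0.80}
before = score("novel-v1", 0.55, baseline)  # measured, but not gated
baseline["novel-v1"] = 0.55                 # stand-in for a human admitting iter N
after = score("novel-v1", 0.50, baseline)   # a later drop now counts as a regression

print(before.gated, before.strict_regressions)
print(after.gated, after.strict_regressions)
```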
diff --git a/tasks/coordinator/driver.py b/tasks/coordinator/driver.py
@@ -570,32 +570,32 @@ def known_detectors(db: Db) -> tuple[str, ...]:
 
 
 def relevant_detectors(candidate: Candidate, db: Db) -> list[str]:
-    """Which detectors' F1 do we measure to decide if this candidate shipped?
-
-    A candidate can modify any file under comp/observer/. But scoring runs
-    per-detector, and the panel review caught this: silently defaulting to
-    ONE detector meant candidates modifying e.g. detector A's internals
-    got scored against detector B's unaffected output — ΔF1≈0 by
-    construction, "improvement" or "regression" both invisible.
-
-    Policy:
-      - Intersect target_components with known (baselined) detectors. If the
-        intersection is non-empty, eval each one in it and gate on the
-        WORST ΔF1 across them.
-      - If the intersection is empty (correlator changes, new features,
-        pipeline-level work), eval ALL known detectors — we can't tell in
-        advance which one the change affects, so measure them all.
-      - If no detectors are baselined yet (blank-slate bootstrap), return
-        target_components as-is. Scoring will treat empty baseline.scenarios
-        as "no gate," and the per-detector report is still generated.
+    """Which detectors' F1 do we measure for this candidate?
+
+    Policy: eval the UNION of (a) all baselined detectors — so every
+    candidate is checked against the "don't break existing wins" contract
+    for every baselined detector — and (b) the candidate's own
+    target_components, so a brand-new detector name the agent just
+    invented is still measured (its score enters the report but passes
+    no gates until a human admits it into the baseline via
+    import_baseline).
+
+    Rationale: on a blank-slate run the agent invents detector names
+    freely. The previous "fall back to all known if target doesn't
+    intersect known" policy evaluated the wrong detector — the new one
+    the candidate created was silently excluded. Including target_components
+    unconditionally fixes that; gates stay safe because
+    score_against_baseline treats an unbaselined detector as "no gate."
+
+    Always returns a non-empty list. Order: baselined first (stable),
+    then new target_components (in given order).
     """
     known = known_detectors(db)
-    if not known:
-        return list(candidate.target_components) or ["unknown"]
-    named = [c for c in candidate.target_components if c in known]
-    if named:
-        return named
-    return list(known)
+    ordered: list[str] = list(known)
+    for c in candidate.target_components:
+        if c not in ordered:
+            ordered.append(c)
+    return ordered or ["unknown"]
 
 
 def primary_detector(candidate: Candidate, db: Db) -> str:
diff --git a/tasks/coordinator/scoring.py b/tasks/coordinator/scoring.py
@@ -89,7 +89,28 @@ def score_against_baseline(
     "no-worse-than-baseline" contract, period.
     """
     mean_f1, observed = load_report(report_path)
-    bd = baseline.detectors[detector]
+    bd = baseline.detectors.get(detector)
+    if bd is None:
+        # Detector not baselined yet (blank-slate bootstrap or a newly
+        # invented name). Report the raw scores but skip every gate —
+        # there is nothing to regress against. A human admits this
+        # detector into the baseline via import_baseline once they
+        # decide it is promising enough to lock in as the new floor.
+        total_observed_fps = sum(s.num_baseline_fps for s in observed.values())
+        return ScoringResult(
+            detector=detector,
+            mean_f1=mean_f1,
+            total_fps=total_observed_fps,
+            per_scenario=observed,
+            baseline_mean_f1=0.0,
+            baseline_total_fps=0,
+            mean_df1=mean_f1,
+            total_dfps=total_observed_fps,
+            per_scenario_delta={},
+            strict_regressions=[],
+            recall_floor_violations=[],
+            fp_reduction_pct=0.0,
+        )
 
     deltas: dict[str, ScenarioDelta] = {}
     strict_regressions = []