Commit 6906779

coordinator: eval union of baselined + target detectors; graceful missing-baseline scoring
With a blank-slate upstream (ella/observer-blank) the agent invents detector names freely. The old policy in relevant_detectors(), "fall back to all known detectors when target doesn't intersect known", silently excluded the new detector the candidate had just created and evaluated only stale ones. In addition, score_against_baseline raised KeyError on any detector not yet baselined.

Behavior now (for every iteration):

    detectors_to_eval = baseline.detectors.keys() ∪ candidate.target_components

- Every baselined detector is evaluated every iteration. This catches "did this candidate break an existing ship" across the whole admitted set, not just the detectors the candidate explicitly targets.
- The candidate's own target components are ALWAYS evaluated, even if not yet baselined, so the new detector's progress is measured and recorded per iteration.
- score_against_baseline returns a no-gate ScoringResult when the detector is missing from the baseline: raw F1/FPs populated, strict_regressions=[], recall_floor_violations=[], baseline_mean_f1=0. Gates only fire for baselined detectors. The FP-ceiling check already guards on baseline_total_fps > 0, so it auto-skips for unbaselined detectors.

Promotion flow: when iteration N ships a good novel-vX detector, the operator runs

    import_baseline --detector novel-vX=<iter N report path>

to admit it. From then on, future candidates are gated against novel-vX too. There is no rolling auto-ratchet (anti-noise-promotion); promotion is always a human decision.
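The evaluation and promotion rules described in the message can be sketched roughly as follows. This is a minimal illustration, not the coordinator's code: `plan_evaluation` is a hypothetical helper, and the detector names `cusum`, `novel-v1`, and `novel-v2` are invented for the example.

```python
def plan_evaluation(baselined, target_components):
    """Return (detector, is_gated) pairs in evaluation order.

    Hypothetical helper: baselined detectors come first and are gated;
    target components that are not baselined are evaluated report-only.
    """
    ordered = list(baselined)
    for name in target_components:
        if name not in ordered:
            ordered.append(name)
    baselined_set = set(baselined)
    return [(name, name in baselined_set) for name in ordered]


# Iteration N: only "cusum" is baselined and the candidate invents "novel-v1".
before = plan_evaluation(["cusum"], ["novel-v1"])

# After an operator admits novel-v1 into the baseline, a later candidate
# targeting "novel-v2" is gated against both admitted detectors.
after = plan_evaluation(["cusum", "novel-v1"], ["novel-v2"])

print(before)
print(after)
```

Under these assumptions, a newly invented detector is always measured, but it only starts gating future candidates once a human has admitted it into the baseline.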
1 parent 4863d2b commit 6906779

2 files changed

Lines changed: 46 additions & 25 deletions

tasks/coordinator/driver.py

Lines changed: 24 additions & 24 deletions
@@ -570,32 +570,32 @@ def known_detectors(db: Db) -> tuple[str, ...]:
 
 
 def relevant_detectors(candidate: Candidate, db: Db) -> list[str]:
-    """Which detectors' F1 do we measure to decide if this candidate shipped?
-
-    A candidate can modify any file under comp/observer/. But scoring runs
-    per-detector, and the panel review caught this: silently defaulting to
-    ONE detector meant candidates modifying e.g. detector A's internals
-    got scored against detector B's unaffected output — ΔF1≈0 by
-    construction, "improvement" or "regression" both invisible.
-
-    Policy:
-    - Intersect target_components with known (baselined) detectors. If the
-      intersection is non-empty, eval each one in it and gate on the
-      WORST ΔF1 across them.
-    - If the intersection is empty (correlator changes, new features,
-      pipeline-level work), eval ALL known detectors — we can't tell in
-      advance which one the change affects, so measure them all.
-    - If no detectors are baselined yet (blank-slate bootstrap), return
-      target_components as-is. Scoring will treat empty baseline.scenarios
-      as "no gate," and the per-detector report is still generated.
+    """Which detectors' F1 do we measure for this candidate?
+
+    Policy: eval the UNION of (a) all baselined detectors — so every
+    candidate is checked against the "don't break existing wins" contract
+    for every baselined detector — and (b) the candidate's own
+    target_components, so a brand-new detector name the agent just
+    invented is still measured (its score enters the report but passes
+    no gates until a human admits it into the baseline via
+    import_baseline).
+
+    Rationale: on a blank-slate run the agent invents detector names
+    freely. The previous "fall back to all known if target doesn't
+    intersect known" policy evaluated the wrong detector — the new one
+    the candidate created was silently excluded. Including target_components
+    unconditionally fixes that; gates stay safe because
+    score_against_baseline treats an unbaselined detector as "no gate."
+
+    Always returns a non-empty list. Order: baselined first (stable),
+    then new target_components (in given order).
     """
     known = known_detectors(db)
-    if not known:
-        return list(candidate.target_components) or ["unknown"]
-    named = [c for c in candidate.target_components if c in known]
-    if named:
-        return named
-    return list(known)
+    ordered: list[str] = list(known)
+    for c in candidate.target_components:
+        if c not in ordered:
+            ordered.append(c)
+    return ordered or ["unknown"]
 
 
 def primary_detector(candidate: Candidate, db: Db) -> str:
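
To see the behavioral difference in the driver.py hunk concretely, the two policies can be compared side by side. This is a sketch under simplifying assumptions: both functions take plain sequences instead of `Candidate` and `Db`, and the detector names `bocpd`, `cusum`, and `novel-v1` are hypothetical.

```python
def relevant_old(known, targets):
    # Pre-commit policy, transcribed from the removed lines of the hunk.
    if not known:
        return list(targets) or ["unknown"]
    named = [c for c in targets if c in known]
    if named:
        return named
    return list(known)


def relevant_new(known, targets):
    # Post-commit policy: all baselined detectors first, then new targets.
    ordered = list(known)
    for c in targets:
        if c not in ordered:
            ordered.append(c)
    return ordered or ["unknown"]


known = ("bocpd", "cusum")  # hypothetical baselined detectors

# Case 1: the candidate targets a name that is not baselined.
# The old policy drops it; the new policy appends it.
old_novel = relevant_old(known, ["novel-v1"])
new_novel = relevant_new(known, ["novel-v1"])

# Case 2: the candidate targets one baselined detector.
# The old policy evaluates only that one; the new policy evaluates all.
old_named = relevant_old(known, ["cusum"])
new_named = relevant_new(known, ["cusum"])

print(old_novel, new_novel)
print(old_named, new_named)
```

The second case shows the other half of the change: even when the target is already baselined, the remaining baselined detectors are now evaluated too, which is what lets the coordinator notice collateral regressions.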

tasks/coordinator/scoring.py

Lines changed: 22 additions & 1 deletion
@@ -89,7 +89,28 @@ def score_against_baseline(
     "no-worse-than-baseline" contract, period.
     """
     mean_f1, observed = load_report(report_path)
-    bd = baseline.detectors[detector]
+    bd = baseline.detectors.get(detector)
+    if bd is None:
+        # Detector not baselined yet (blank-slate bootstrap or a newly
+        # invented name). Report the raw scores but skip every gate —
+        # there is nothing to regress against. A human admits this
+        # detector into the baseline via import_baseline once they
+        # decide it is promising enough to lock in as the new floor.
+        total_observed_fps = sum(s.num_baseline_fps for s in observed.values())
+        return ScoringResult(
+            detector=detector,
+            mean_f1=mean_f1,
+            total_fps=total_observed_fps,
+            per_scenario=observed,
+            baseline_mean_f1=0.0,
+            baseline_total_fps=0,
+            mean_df1=mean_f1,
+            total_dfps=total_observed_fps,
+            per_scenario_delta={},
+            strict_regressions=[],
+            recall_floor_violations=[],
+            fp_reduction_pct=0.0,
+        )
 
     deltas: dict[str, ScenarioDelta] = {}
     strict_regressions = []
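
The missing-baseline path in the scoring.py hunk can be illustrated with a much smaller stand-in. This is only a sketch: `MiniResult` and `score` are hypothetical, the baseline is reduced to a dict of mean F1 per detector, and the regression check is applied per detector rather than per scenario as the real `ScoringResult` fields suggest.

```python
from dataclasses import dataclass, field


@dataclass
class MiniResult:
    # Hypothetical, trimmed-down stand-in for ScoringResult.
    detector: str
    mean_f1: float
    baseline_mean_f1: float
    mean_df1: float
    strict_regressions: list = field(default_factory=list)


def score(detector, mean_f1, baseline_f1_by_detector, tolerance=0.0):
    base = baseline_f1_by_detector.get(detector)
    if base is None:
        # Unbaselined detector: report the raw score and fire no gates.
        # The delta equals the raw F1 because the baseline is taken as 0.
        return MiniResult(detector, mean_f1, 0.0, mean_f1)
    delta = mean_f1 - base
    regressions = [detector] if delta < -tolerance else []
    return MiniResult(detector, mean_f1, base, delta, regressions)


baseline = {"cusum": 0.75}  # hypothetical baselined detector and F1

novel = score("novel-v1", 0.5, baseline)   # not baselined: no gate
worse = score("cusum", 0.5, baseline)      # baselined and regressed
same = score("cusum", 0.75, baseline)      # baselined, unchanged

print(novel)
print(worse)
print(same)
```

Using `.get()` and an explicit `None` branch, as the hunk does, keeps the "no baseline" case visibly separate from the gated path instead of relying on a caught `KeyError`.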
