Commit e2c3e03

Edwardf0t1 and claude authored
Add day0-release orchestration skill with enforced gates (#1596)
### What does this PR do?

Type of change: new feature (Claude agent skill)

Adds a `day0-release` orchestration skill that chains the existing domain skills (`ptq` → `deployment` → `evaluation` → `compare-results`) into the day-0 release happy path, with **code-enforced gates** between stages.

**The problem it solves.** Today the day-0 release sequence is *model-driven* — the agent re-decides the PTQ → deploy → eval → compare order each session, and can silently skip a gate (e.g. report scores from an incomplete eval, or hand off a checkpoint that missed quantization coverage). These are failure modes we hit repeatedly in live trials. `day0-release` makes the sequence and the gates deterministic so the common case is repeatable.

**What it is — and isn't.** It's a *conductor*, not a new instrument. It owns the **sequence, the gates, and the accept/retry decision**; the domain skills still own the actual work. It is **not** for single-stage asks ("just quantize X" → `ptq`; "run MMLU on this endpoint" → `evaluation`) — its negative trigger excludes those. It fires only for the full goal-driven release.

**Goal it drives toward** (the documented day-0 criterion): a quantized checkpoint smaller than the original, with <1% accuracy drop on the standard set vs the matching baseline, plus a publish recommendation.

**Contents:**

- `.claude/skills/day0-release/SKILL.md` — the chain + gate calls + decision logic.
- `.claude/skills/day0-release/scripts/gate_ptq.py`, `gate_run.py`, `gate_compare.py` — deterministic gate scripts. Each is a pure decision function (`evaluate_*`) plus a thin file-reading `main`, returning JSON `{pass, failure_class, detail, ...}`. The `failure_class` values match the `modelopt-agent-protocols` strawman (`MODEL_UNSUPPORTED`, `QUANT_COVERAGE_FAILURE`, `EVAL_JUDGE_FAILED`, `SAMPLE_ACCOUNTING_FAILED`, …).
- `.claude/skills/day0-release/tests/test_gates.py` — 21 unit tests for the gate decision functions (GPU-free, no cluster).
- `.claude/skills/day0-release/tests/evals.json` — 5 routing + behavior assertions.

**As-built note — `gate_ptq.py` input.** v1 takes a `--summary <validation.json>` (size scan + hf_ptq quant summary: `source_bytes`, `output_bytes`, `recipe`, `layer_precision_counts`, `metadata_diffs`) rather than reading the safetensors checkpoint directly. The `--checkpoint/--source/--recipe` args are reserved stubs; wiring the gate to build that summary from the exported checkpoint is a follow-up. `gate_run.py` and `gate_compare.py` likewise read small JSON summaries the agent produces from the run artifacts.

### Usage

Ask Claude Code:

```
Release `<org>/<model>` at day-0: quantize to NVFP4, validate accuracy is within 1% of the BF16 baseline on the AA suite, and tell me if it's publishable. Run on <cluster>.
```

The skill then runs the chain, enforcing a gate after each stage:

```text
setup ─▶ PTQ ─▶ deploy ─▶ baseline-eval ─▶ quantized-eval ─▶ compare ─▶ closeout
          │       │             │                │              │
      gate_ptq  health      gate_run         gate_run     gate_compare
```

| After stage | Gate | Pass condition | On fail |
|---|---|---|---|
| Setup | reachability | creds present, cluster SSH ok | SYSTEMIC → abort |
| PTQ | `gate_ptq.py` | size ratio <1, layer coverage matches recipe, metadata consistent | triage → PATCH / skip recipe / abort |
| Deploy | health | endpoint up + 1 successful generation | triage → PATCH flags/TP / skip |
| Each eval | `gate_run.py` | complete, all samples scored, no judge/parse failure | retry / triage |
| Compare | `gate_compare.py` | every task within <1% drop | REGRESSION → report; ANOMALOUS → human |

It returns a **decision**, not a raw artifact: `ACCEPT` (publishable, with report) / `REGRESSION` (which tasks failed the threshold) / `ANOMALOUS` / `INFEASIBLE` — plus the workspace path and MLflow run IDs.
### Testing

Tested in two layers — deterministic control flow (CI-able) separate from the agentic stages (integration-level):

- **Gate-script unit tests** — `tests/test_gates.py`, **21 cases, all passing** (`python -m pytest .claude/skills/day0-release/tests/test_gates.py`). Covers the pass path plus each `failure_class` branch for all three gates: `gate_compare` (ACCEPT / REGRESSION / ANOMALOUS-on-implausible-gain / ANOMALOUS-out-of-range / mismatched-task-sets / relative-threshold / non-numeric-score), `gate_run` (valid / dropped-samples / judge-error / missing-score / non-numeric-score / timeout-non-terminal / running-non-terminal / no-tasks), `gate_ptq` (pass / not-smaller / zero-coverage→MODEL_UNSUPPORTED / unexpected-unquantized / metadata-diff / unknown-recipe). No GPU or cluster needed.
- **Routing assertions** (`tests/evals.json`, 5 cases): documents expected routing — fires on "release model X at day-0", does **not** fire on "just quantize X" / "run MMLU on this endpoint", and the gate-blocking / regression-reports-and-stops behaviors. These are behavior specs for manual QA; there is no automated skill-routing harness yet (tracked as "Remaining Work" in the design doc).
- **End-to-end integration (run on aws-pdx / B300, 2026-06-04)** — ✅ the full chain and **all five gate types** were exercised on **real cluster artifacts** (Qwen3-0.6B). Results below.
#### End-to-end integration results

Every gate fired correctly on real PTQ / serve / eval outputs, and **both** failure-routing branches were exercised:

| Stage | Gate | Real artifact | Verdict | ✓ |
|---|---|---|---|---|
| Setup | reachability | aws-pdx SSH + SLURM | PASS | ✅ |
| PTQ (happy) | `gate_ptq` | FP8 ckpt, 0.50×, FP8=196 layers, unexpected=0 | `pass:true` | ✅ |
| **PTQ (failure fixture)** | `gate_ptq` | `nvfp4_experts_only` on a **dense** model → NVFP4=0, 196 silently-unquantized | **`MODEL_UNSUPPORTED`** → chain **STOPS** before deploy | ✅ |
| Deploy | health | 2 vLLM endpoints, `/health`=200 + generation | PASS | ✅ |
| Eval ×2 | `gate_run` | real run-summaries, 300/300 scored, SUCCESS | `pass:true` | ✅ |
| Compare | `gate_compare` | arc_easy BF16=57.0 vs FP8=54.0 | **`REGRESSION`** (drop 3.0 > 1pt) | ✅ |

- **Failure fixture (the key control-flow proof):** an experts-only recipe on a dense model matched 0 modules, so PTQ "succeeded" in 3.5s and exported a checkpoint with `quant_algo:null` / `quantized_layers:{}` — every linear silently BF16. `gate_ptq` caught the zero-coverage and routed to `MODEL_UNSUPPORTED`, so the chain **did not** deploy/eval the bad checkpoint. This is exactly the silent coverage-miss the gate exists to stop.
- **Caveat — ACCEPT not reached on real data:** the happy path returned `REGRESSION`, which is the **correct** verdict — Qwen3-0.6B FP8 genuinely loses ~3pt on arc_easy (real, ~2 SE at 300 samples; weight-only FP8 = 54.0 was no better than KV-FP8 = 54.67, so the KV-cache wasn't the cause). Tiny models are quantization-fragile and don't meet the <1% criterion; the gate correctly refused to rubber-stamp it. The threshold was **not** loosened to manufacture a pass. The `ACCEPT` terminal + publish-recommendation path therefore remains validated by unit test (`test_compare_accept_within_threshold`) only, not end-to-end — demonstrating it on real data needs a larger model (e.g. Qwen3-4B/8B) where FP8 is near-lossless, deferred as optional follow-up.
- **Infra bugs shaken out (incidental, not in this skill's code):** (1) the ModelOpt launcher's `/hf-local` bind mount has no host dir on aws-pdx → set `SLURM_HF_LOCAL=<lustre dir>`; (2) `lm-eval` is unusable inside `vllm/vllm-openai:cu130-nightly` (ships transformers 5.x, which removed `AutoModelForVision2Seq` that lm-eval imports at load) → used a dependency-free `/v1/completions` log-likelihood scorer.

### Scope

**In v1:** the linear chain + gate scripts + `ACCEPT`/fail-with-report outcomes. On `REGRESSION`, v1 *reports* "recipe R regressed on tasks [...]" and stops.

**Deferred (follow-up PR):** the evaluator-optimizer recipe loop (compare → pick next recipe → re-PTQ), which needs the `bigpareto` integration and the shared `modelopt-agent-protocols` schema adopted on both sides.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ (new skill; no change to existing skills)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A (no new deps)
- Did you write any new necessary tests?: ✅ (21 gate-script unit tests + 5 routing evals; end-to-end integration run on aws-pdx — see Testing)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅
- Did you get Claude approval on this PR?: ⬜ (run `/claude review` before marking ready)

### Additional Information

Design rationale: ModelOpt Agent Skills Design doc — this skill implements the "deterministic day-0 chain driver" (prompt-chaining-with-gates, the code-driven orchestration pattern from Anthropic's *Building Effective Agents*). The gate scripts double as the data source for the Observability stage metrics, and `gate_compare.py`'s verdict is the entry point for the deferred evaluator-optimizer recipe loop.
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added a deterministic day0-release skill: linear gated flow (setup → PTQ → deploy → eval → compare) that halts on failed gates.
* **New Tools**
  * Added CLI gates for PTQ checkpoint validation, run/evaluation validation, and baseline-vs-candidate comparison (scale-aware thresholds, anomaly detection, clear verdicts and exit codes).
* **Tests**
  * Added unit tests covering pass/regress/anomaly outcomes, failure-class triage, and gate scenarios.
* **Documentation**
  * Added spec and changelog entry describing inputs, gating rules, triage table, and closeout reporting.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 56c4af2 commit e2c3e03

9 files changed

Lines changed: 1097 additions & 2 deletions

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
---
name: day0-release
description: Deterministic end-to-end driver for day-0 quantized-checkpoint releases — chains PTQ → evaluation → comparison with enforced gates between stages (the evaluation stage deploys the checkpoint itself), and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Use when the user asks to "release a model at day-0", "quantize and validate model X is within N% of baseline and tell me if it's publishable", or "run the full day-0 workflow". Do NOT use for single-stage requests — quantizing only (use ptq), serving only (use deployment), evaluating only (use evaluation), or comparing two existing runs (use compare-results).
license: Apache-2.0
---

# Day-0 Release

Drive a model from a pretrained checkpoint to a publish decision for a quantized
checkpoint, in a fixed sequence with a gate after every stage. This skill is a
**conductor**: it sequences the existing domain skills and enforces the gates —
it does not re-implement quantization, serving, evaluation, or comparison.

**Goal (the default day-0 criterion):** a quantized checkpoint smaller than the
source, with accuracy drop within the threshold (default <1%) on the standard
benchmark set versus the matching baseline, plus a publish recommendation.

## When to use

Use only for the full goal-driven release. For a single stage, route to the
domain skill directly: quantize → **ptq**, serve → **deployment**, evaluate →
**evaluation**, compare two existing runs → **compare-results**.

## Inputs

Resolve these before starting (ask the user for anything missing):

- **Model** — HF handle or checkpoint path.
- **Recipe / qformat** — e.g. `nvfp4`, `fp8`, or a recipe path. One candidate for v1.
- **Cluster / launcher** — from `clusters.yaml` (see `skills/common/environment-setup.md`).
- **Eval set** — defaults to the AA suite (`evaluation/recipes/tasks/aa/`).
- **Threshold** — max accuracy drop; default `0.01` (1%).

## The chain

```text
setup ─▶ PTQ ─▶ baseline-eval ─▶ quantized-eval ─▶ compare ─▶ closeout
          │           │                │              │
      gate_ptq    gate_run         gate_run     gate_compare
```

The **evaluation** skill deploys the model it evaluates (it stands up its own
endpoint per run), so there is no separate deploy stage — a serving failure
surfaces through the eval stage's gate (`DEPLOYMENT_HEALTH_FAILED`) and triages
to the **deployment** skill to debug serving in isolation (see Step 4).

Run each stage by invoking the domain skill, then run its gate before
proceeding. **Do not advance past a failed gate.** Copy this checklist and track
progress:

```text
- [ ] Step 0: Resolve inputs; confirm threshold and eval set
- [ ] Step 1: Setup gate — creds present, cluster reachable
- [ ] Step 2: PTQ (ptq skill) → gate_ptq.py
- [ ] Step 3: Baseline eval (evaluation skill, deploys source) → gate_run.py [skip if cached, see below]
- [ ] Step 4: Quantized eval (evaluation skill, deploys candidate) → gate_run.py
- [ ] Step 5: Compare (compare-results skill) → gate_compare.py → decision
- [ ] Step 6: Closeout — report + publish recommendation
```
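
The "do not advance past a failed gate" rule can be pictured as a small driver loop. This is a minimal sketch, not the skill's implementation: `Stage`, `run_chain`, and the stub gate functions are hypothetical names introduced for illustration, and only the `{pass, failure_class, detail}` result shape is taken from the gate scripts.

```python
from typing import Callable, NamedTuple


class Stage(NamedTuple):
    name: str
    gate: Callable[[], dict]  # returns {"pass": bool, "failure_class": ..., "detail": ...}


def run_chain(stages):
    """Run gates in order; stop at the first failed gate and report where."""
    completed = []
    for stage in stages:
        result = stage.gate()
        if not result["pass"]:
            return {
                "stopped_at": stage.name,
                "failure_class": result.get("failure_class"),
                "detail": result.get("detail"),
                "completed": completed,
            }
        completed.append(stage.name)
    return {"stopped_at": None, "failure_class": None,
            "detail": "all gates passed", "completed": completed}


# Hypothetical stub gates standing in for the real gate scripts.
def gate_ok():
    return {"pass": True, "failure_class": None, "detail": "ok"}


def gate_ptq_zero_coverage():
    return {"pass": False, "failure_class": "MODEL_UNSUPPORTED", "detail": "0 layers quantized"}


outcome = run_chain([
    Stage("setup", gate_ok),
    Stage("ptq", gate_ptq_zero_coverage),
    Stage("baseline-eval", gate_ok),  # never reached
])
print(outcome["stopped_at"], outcome["failure_class"])
```

In this sketch the later stages are never invoked once a gate fails, which is the property the checklist asks the agent to preserve.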

### Step 1 — Setup gate

Confirm credentials (`skills/common/credentials.md`) and cluster reachability
(`skills/common/remote-execution.md`). If either fails, stop with
`SYSTEMIC` — do not start PTQ.

### Step 2 — PTQ

Invoke the **ptq** skill to produce the quantized checkpoint. Then gate:

```bash
# The ptq skill's post-PTQ validation produces a validation-summary JSON (size
# ratio + layer-precision counts + metadata diffs; see
# ptq/references/checkpoint-validation.md). v1 gates on that summary:
python .agents/skills/day0-release/scripts/gate_ptq.py --summary <validation-summary.json>
# add `--recipe <qformat>` to override the recipe recorded in the summary
```

`gate_ptq.py` returns JSON `{pass, failure_class, detail}`. On `pass: false`,
branch on `failure_class` (see **Triage** below). Do not evaluate an
unvalidated checkpoint.
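
To make the gate's contract concrete, here is a toy decision function over a validation summary. It is a sketch under stated assumptions, not the code in `gate_ptq.py`: the summary field names come from the commit message, the zero-coverage → `MODEL_UNSUPPORTED` mapping comes from its failure fixture, and everything else (the `EXPECTED_PRECISION` table, the `{precision: count}` shape, and the failure classes marked "assumed") is hypothetical.

```python
# Hypothetical recipe -> precision-label table; the real gate's mapping may differ.
EXPECTED_PRECISION = {"fp8": "FP8", "nvfp4": "NVFP4"}


def _fail(failure_class, detail):
    return {"pass": False, "failure_class": failure_class, "detail": detail}


def evaluate_ptq_sketch(summary):
    """Toy PTQ gate: the recipe must cover layers and the checkpoint must shrink."""
    recipe = summary.get("recipe")
    expected = EXPECTED_PRECISION.get(recipe)
    if expected is None:
        # Assumed class: the source does not state how unknown recipes are classified.
        return _fail("USER_CONFIG_ERROR", f"unknown recipe {recipe!r}")
    counts = summary.get("layer_precision_counts", {})
    covered = counts.get(expected, 0)
    if covered == 0:
        # Zero coverage -> MODEL_UNSUPPORTED, as in the commit's failure fixture.
        return _fail("MODEL_UNSUPPORTED", f"recipe {recipe!r} quantized 0 layers")
    source_bytes = summary.get("source_bytes", 0)
    output_bytes = summary.get("output_bytes", 0)
    if source_bytes <= 0 or output_bytes >= source_bytes:
        # Assumed class for a checkpoint that did not shrink.
        return _fail("QUANT_COVERAGE_FAILURE", "output is not smaller than source")
    if summary.get("metadata_diffs"):
        # Assumed class for inconsistent metadata.
        return _fail("UNKNOWN", f"metadata diffs: {summary['metadata_diffs']}")
    ratio = output_bytes / source_bytes
    return {"pass": True, "failure_class": None,
            "detail": f"size ratio {ratio:.2f}, {covered} {expected} layers"}


happy = {"recipe": "fp8", "source_bytes": 1200, "output_bytes": 600,
         "layer_precision_counts": {"FP8": 196}, "metadata_diffs": []}
# Same summary, but the recipe matched nothing: every layer stayed BF16.
zero_coverage = {**happy, "recipe": "nvfp4",
                 "layer_precision_counts": {"NVFP4": 0, "BF16": 196}}

print(evaluate_ptq_sketch(happy)["detail"])
print(evaluate_ptq_sketch(zero_coverage)["failure_class"])
```

The coverage check runs before the size check in this sketch so that a silently unquantized checkpoint is reported as a coverage problem rather than a size problem; the real script's ordering is not shown in the source.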

### Step 3 — Baseline eval

The baseline is the **source** (pre-quantization) model on the same task set and
sampling params. **Look it up first** — if a matching baseline run already
exists in MLflow (same model, task set, sampling params), reuse it and skip this
stage. Otherwise run it via the **evaluation** skill (which deploys the source
model itself). Gate with `gate_run.py`.
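
The "same model, task set, sampling params" match can be sketched as a canonical key comparison. This is an illustration only: it makes no MLflow calls, and the record shape (`run_id`, `model`, `tasks`, `sampling`) plus the helper names `baseline_key` and `find_cached_baseline` are hypothetical.

```python
import json


def baseline_key(model, tasks, sampling):
    """Canonical key: equal model, task set, and sampling params give equal strings."""
    return json.dumps(
        {"model": model, "tasks": sorted(tasks), "sampling": sampling},
        sort_keys=True,
    )


def find_cached_baseline(runs, model, tasks, sampling):
    """Return the run_id of the first prior run whose key matches, else None."""
    wanted = baseline_key(model, tasks, sampling)
    for run in runs:
        if baseline_key(run["model"], run["tasks"], run["sampling"]) == wanted:
            return run["run_id"]
    return None


# Hypothetical prior-run records, e.g. built from run metadata.
prior_runs = [
    {"run_id": "run-001", "model": "org/model", "tasks": ["task_b", "task_a"],
     "sampling": {"temperature": 0.0, "max_tokens": 512}},
]

hit = find_cached_baseline(prior_runs, "org/model", ["task_a", "task_b"],
                           {"max_tokens": 512, "temperature": 0.0})
miss = find_cached_baseline(prior_runs, "org/model", ["task_a", "task_b"],
                            {"max_tokens": 512, "temperature": 0.7})
print(hit, miss)
```

Sorting the task list and the JSON keys makes the match independent of ordering, so only a real difference in tasks or sampling params forces a fresh baseline run.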

### Step 4 — Quantized eval

Invoke the **evaluation** skill on the quantized checkpoint, matching the
baseline's task set and sampling params. The evaluation skill stands up the
serving endpoint itself (it builds the `deployment.command`, e.g. a
`vllm serve …`), so a serving failure surfaces here as a failed `gate_run.py`
with `DEPLOYMENT_HEALTH_FAILED`. When that happens, **drop to the deployment
skill** to reproduce and debug serving in isolation (serve the checkpoint
standalone, confirm `/health` + one generation, iterate on flags / TP / image /
env vars) rather than burning full eval cycles on a broken endpoint — then carry
the working command back into NEL's `deployment.command` and resume the eval. If
the checkpoint genuinely can't serve, `POINT_INFEASIBLE`. Gate:

```bash
python .agents/skills/day0-release/scripts/gate_run.py --run <run-summary.json>
```

A `pass: false` here means the run is incomplete or invalid (judge/parse error,
dropped samples) — do **not** compare scores from it.
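
What "incomplete or invalid" means can be sketched as a toy run check. This is not the code in `gate_run.py`, whose input schema is not shown here: the summary shape (`status`, per-task `expected` / `scored` / `score` / `judge_errors`) is hypothetical, and the failure classes marked "assumed" are guesses rather than the script's documented behavior.

```python
import math


def _fail(failure_class, detail):
    return {"pass": False, "failure_class": failure_class, "detail": detail}


def evaluate_run_sketch(run):
    """Toy run gate over a hypothetical summary {"status": ..., "tasks": {name: {...}}}."""
    if run.get("status") != "SUCCESS":
        # Assumed class for a run that is still running, timed out, or failed.
        return _fail("UNKNOWN", f"run not complete: status={run.get('status')!r}")
    tasks = run.get("tasks") or {}
    if not tasks:
        # Assumed class for an empty task list.
        return _fail("USER_CONFIG_ERROR", "no tasks in run summary")
    for name, t in sorted(tasks.items()):
        if t.get("judge_errors", 0) > 0:
            return _fail("EVAL_JUDGE_FAILED", f"{name}: {t['judge_errors']} judge error(s)")
        if t.get("scored") != t.get("expected"):
            return _fail("SAMPLE_ACCOUNTING_FAILED",
                         f"{name}: scored {t.get('scored')}/{t.get('expected')}")
        score = t.get("score")
        if isinstance(score, bool) or not isinstance(score, (int, float)) or not math.isfinite(score):
            # Assumed class for a missing or non-numeric score.
            return _fail("SAMPLE_ACCOUNTING_FAILED", f"{name}: score {score!r} is not a finite number")
    return {"pass": True, "failure_class": None, "detail": f"{len(tasks)} task(s) complete"}


complete = {"status": "SUCCESS",
            "tasks": {"arc_easy": {"expected": 300, "scored": 300, "score": 54.0, "judge_errors": 0}}}
dropped = {"status": "SUCCESS",
           "tasks": {"arc_easy": {"expected": 300, "scored": 287, "score": 54.0, "judge_errors": 0}}}

print(evaluate_run_sketch(complete)["pass"])
print(evaluate_run_sketch(dropped)["failure_class"])
```

The point of the sketch is that a score is only trusted once the run is terminal, every expected sample is accounted for, and the score itself is a finite number.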

### Step 5 — Compare

Invoke the **compare-results** skill to produce per-task deltas, then gate:

```bash
python .agents/skills/day0-release/scripts/gate_compare.py \
    --baseline <baseline_scores.json> --candidate <candidate_scores.json> \
    --threshold 0.01
```

The threshold is a fraction of each task's score scale. Most AA tasks report
0-100, but some (e.g. `tau2_bench_telecom` `Result`) report 0-1; the gate infers
each task's scale (0-1 if both scores are within [0, 1], else 0-100) and
normalizes the drop accordingly, so `--threshold 0.01` means "≤1 pt on a 0-100
task / ≤0.01 on a 0-1 task" uniformly. Pass `--scales '{"task": max}'` to
override inference if a task's scores happen to fall in an ambiguous range.
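
The scale rule above can be checked with a few lines of arithmetic. The sketch below mirrors the inference rule of `_infer_scale` in `gate_compare.py` (0-1 if every score is within [0, 1], else 0-100); the helper names `infer_scale` and `drop_fraction` are illustrative stand-ins, and the scores other than the 57.0 vs 54.0 pair from the commit message are made up.

```python
def infer_scale(*scores):
    """1.0 if every score fits in [0, 1], else 100.0 (same rule as _infer_scale)."""
    return 1.0 if all(0.0 <= s <= 1.0 for s in scores) else 100.0


def drop_fraction(baseline, candidate, scale=None):
    """Drop as a fraction of the task's scale; positive means the candidate lost accuracy."""
    scale = scale or infer_scale(baseline, candidate)
    return (baseline - candidate) / scale


THRESHOLD = 0.01

# 0-100 task: 57.0 -> 54.0 is a 3-point drop, i.e. 0.03 of the scale.
print(drop_fraction(57.0, 54.0) <= THRESHOLD)             # False
# 0-1 task: 0.62 -> 0.615 is about 0.005 of the scale.
print(drop_fraction(0.62, 0.615) <= THRESHOLD)            # True
# Ambiguous: a 0-100 task whose scores both fall below 1.0 is inferred as 0-1 ...
print(drop_fraction(0.8, 0.5) <= THRESHOLD)               # False
# ... unless the scale is supplied explicitly, which is what --scales is for.
print(drop_fraction(0.8, 0.5, scale=100.0) <= THRESHOLD)  # True
```

The last two lines show why the override exists: the same pair of raw scores passes or fails depending on which scale they are read against.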

Decision from `gate_compare.py`:

- **ACCEPT** — every task within threshold → go to Step 6.
- **REGRESSION** — one or more tasks exceed threshold. **v1 stops here and
  reports** which tasks regressed by how much. (Picking the next recipe and
  re-running is deferred — see Scope.)
- **ANOMALOUS** — scores present but implausible (e.g. baseline lower than
  candidate by a large margin, or a task score outside its valid range) →
  surface to the user.
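
The three outcomes and their precedence can be condensed into a short sketch. It assumes absolute (scale-normalized) drops as input and reuses the 0.05 implausible-gain fraction from `gate_compare.py`; the function name `decide` and the task names are hypothetical, and the range and type checks of the real gate are left out.

```python
IMPLAUSIBLE_GAIN = 0.05  # same value as _IMPLAUSIBLE_GAIN_FRAC in gate_compare.py


def decide(drop_fractions, threshold=0.01):
    """Map {task: scale-normalized drop} to (decision, offending tasks).

    A negative drop means the candidate scored above the baseline.
    """
    gained = sorted(t for t, d in drop_fractions.items() if -d > IMPLAUSIBLE_GAIN)
    if gained:
        # Anomalies outrank regressions: implausible scores are not trusted at all.
        return "ANOMALOUS", gained
    regressed = sorted(t for t, d in drop_fractions.items() if d > threshold)
    if regressed:
        return "REGRESSION", regressed
    return "ACCEPT", []


print(decide({"task_a": 0.005, "task_b": -0.002}))
print(decide({"task_a": 0.03, "task_b": 0.004}))
print(decide({"task_a": 0.03, "task_b": -0.08}))
```

In the third call a task regresses and another gains implausibly; the sketch returns `ANOMALOUS`, matching the order in which `evaluate_comparison` checks its `anomalies` and `regressed` lists.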

### Step 6 — Closeout

Report the decision with: source vs output size + ratio, per-task baseline /
candidate / delta / within-threshold, MLflow run IDs, and a publish
recommendation (publish / do-not-publish / needs-human). Archive artifacts to
the workspace.

## Triage (gate failure → decision)

Map a gate's `failure_class` to the next action:

| `failure_class` | Action |
| --- | --- |
| `INFRA_TRANSIENT` | Retry the stage once; if it recurs, `SYSTEMIC`. |
| `MODEL_UNSUPPORTED` | PATCH: fix the recipe pattern / add model support (ptq skill owns the patch loop), then retry. If unpatchable, `POINT_INFEASIBLE`. |
| `QUANT_COVERAGE_FAILURE` | PATCH: fix the recipe wildcard so intended layers are covered; re-run PTQ. |
| `DEPLOYMENT_HEALTH_FAILED` | Drop to the **deployment** skill: reproduce serving standalone (`/health` + one generation), debug flags / image / TP / env, then carry the working command into NEL's `deployment.command` and retry the eval. If it can't serve, `POINT_INFEASIBLE`. |
| `EVAL_JUDGE_FAILED` | Usually transient (auth / rate limit) — wait and retry. |
| `SAMPLE_ACCOUNTING_FAILED` | Investigate dropped/failed samples before trusting scores. |
| `USER_CONFIG_ERROR` | Stop and ask the user. |
| `UNKNOWN` | Stop and surface to the user (`NEEDS_HUMAN`). |

`SYSTEMIC` (cluster down, dataset unavailable) aborts the whole run.
`POINT_INFEASIBLE` means this (model, recipe) can't work as configured.
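
The triage table reads naturally as a lookup. The sketch below encodes it that way, assuming hypothetical action labels (`retry_stage`, `patch_recipe`, and so on) that are shorthand for the table's prose and not identifiers used by the skill; only the `failure_class` names come from the table.

```python
# Action labels are illustrative shorthand for the table rows above.
TRIAGE = {
    "INFRA_TRANSIENT": "retry_stage",
    "MODEL_UNSUPPORTED": "patch_recipe",
    "QUANT_COVERAGE_FAILURE": "patch_recipe",
    "DEPLOYMENT_HEALTH_FAILED": "debug_with_deployment_skill",
    "EVAL_JUDGE_FAILED": "wait_and_retry",
    "SAMPLE_ACCOUNTING_FAILED": "investigate_samples",
    "USER_CONFIG_ERROR": "ask_user",
    "UNKNOWN": "needs_human",
}


def next_action(failure_class, attempt=1):
    """Pick the next action for a failed gate; `attempt` counts tries of this stage."""
    if failure_class == "INFRA_TRANSIENT" and attempt > 1:
        # The table allows one retry; a recurrence escalates to SYSTEMIC.
        return "abort_systemic"
    # Anything not in the table is treated like UNKNOWN.
    return TRIAGE.get(failure_class, "needs_human")


print(next_action("INFRA_TRANSIENT"))
print(next_action("INFRA_TRANSIENT", attempt=2))
print(next_action("MODEL_UNSUPPORTED"))
```

Defaulting unrecognized classes to `needs_human` keeps the sketch conservative: a new failure class stops the chain instead of being retried blindly.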

## Output

Return a decision, not a raw artifact:

- `ACCEPT` + report + publish recommendation
- `REGRESSION` + which tasks failed the threshold and by how much
- `ANOMALOUS` / `INFEASIBLE` / `NEEDS_HUMAN` + reason
- Always: workspace path + MLflow run IDs for traceability

## Scope (v1)

In v1: the linear chain + gates + report. On `REGRESSION`, v1 reports and stops.
Deferred to a follow-up: the evaluator-optimizer recipe loop (compare → pick the
next recipe → re-run PTQ), which needs the bigpareto integration and a shared
config/result schema.
Lines changed: 219 additions & 0 deletions
@@ -0,0 +1,219 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Day-0 compare gate.

Decides whether a quantized candidate is within the accuracy threshold of its
baseline, per task. Pure decision logic in ``evaluate_comparison`` (unit-tested
without GPU/cluster); ``main`` reads score JSON files and prints the verdict.

Score files are ``{task_name: score}`` dicts. Most AA task references report
``*_avg_of_N`` on a 0-100 scale, but some tasks (e.g. ``tau2_bench_telecom``
``Result``) report on a 0-1 scale. The gate is therefore scale-aware: each
task's scale is inferred per task (0-1 if both scores are within [0, 1], else
0-100) or supplied explicitly via ``--scales``, and the drop is normalized to a
fraction of that scale so the threshold applies uniformly. The drop is an
absolute (scale-normalized) delta unless ``--relative`` is passed.
"""

from __future__ import annotations

import argparse
import json
import math
import sys


def _is_valid_score(val):
    """True only for a finite real number in [_SCORE_MIN, _SCORE_MAX] (not bool)."""
    return (
        isinstance(val, (int, float))
        and not isinstance(val, bool)
        and math.isfinite(val)
        and _SCORE_MIN <= val <= _SCORE_MAX
    )


# Decisions
ACCEPT = "ACCEPT"
REGRESSION = "REGRESSION"
ANOMALOUS = "ANOMALOUS"

# Plausibility bounds. Scores may be on a 0-1 or 0-100 scale (see _infer_scale);
# the upper bound is the larger of the two so both are accepted.
_SCORE_MIN = 0.0
_SCORE_MAX = 100.0
# A candidate scoring this fraction of its scale ABOVE baseline is implausible
# for quantization (quantization should not meaningfully improve accuracy); flag
# it rather than silently passing. 0.05 = 5 pts on a 0-100 task, 0.05 on a 0-1 task.
_IMPLAUSIBLE_GAIN_FRAC = 0.05


def _infer_scale(*vals):
    """Infer a task's score scale: 1.0 if every score is within [0, 1], else 100.0.

    Most AA tasks report 0-100; a few (e.g. ``tau2_bench_telecom``) report 0-1.
    Without scale metadata in the score files, we treat a task as 0-1 only when
    every score for it fits in [0, 1] — a 0-100 task with sub-1.0 accuracy is
    degenerate and caught elsewhere. Pass an explicit scale to override.
    """
    return 1.0 if all(0.0 <= v <= 1.0 for v in vals) else 100.0


def evaluate_comparison(baseline, candidate, threshold=0.01, relative=False, scales=None):
    """Compare candidate vs baseline scores per task.

    Args:
        baseline: dict ``{task: score}``.
        candidate: dict ``{task: score}``.
        threshold: max allowed drop, as a fraction of the task's scale
            (0.01 = 1 percentage point on a 0-100 task / 0.01 on a 0-1 task,
            or 1% relative if ``relative``).
        relative: if True, drop is measured relative to the baseline score
            (scale-invariant).
        scales: optional dict ``{task: max_scale}`` to override per-task scale
            inference (e.g. ``{"tau2_bench_telecom": 1.0}``).

    Returns:
        dict ``{pass, decision, failure_class, detail, per_task}``.
    """
    scales = scales or {}
    if not isinstance(scales, dict) or any(
        not (isinstance(v, (int, float)) and not isinstance(v, bool) and math.isfinite(v) and v > 0)
        for v in scales.values()
    ):
        return {
            "pass": False,
            "decision": ANOMALOUS,
            "failure_class": "USER_CONFIG_ERROR",
            "detail": f"invalid scales: expected {{task: positive finite number}}, got {scales!r}",
            "per_task": {},
        }
    missing = sorted((set(baseline) | set(candidate)) - (set(baseline) & set(candidate)))
    if missing:
        return {
            "pass": False,
            "decision": ANOMALOUS,
            "failure_class": "SAMPLE_ACCOUNTING_FAILED",
            "detail": f"task sets differ; missing on one side: {missing}",
            "per_task": {},
        }
    if not baseline:
        return {
            "pass": False,
            "decision": ANOMALOUS,
            "failure_class": "USER_CONFIG_ERROR",
            "detail": "no tasks to compare",
            "per_task": {},
        }

    per_task = {}
    regressed = []
    anomalies = []
    for task in sorted(baseline):
        b, c = baseline[task], candidate[task]
        invalid = False
        for label, val in (("baseline", b), ("candidate", c)):
            if not _is_valid_score(val):
                anomalies.append(f"{task}: {label} score {val!r} not a finite number in [0, 100]")
                invalid = True
        if invalid:
            # Don't compute deltas on non-numeric/out-of-range scores (would raise
            # TypeError); record the anomaly and move on — the run is ANOMALOUS.
            per_task[task] = {
                "baseline": b,
                "candidate": c,
                "drop": None,
                "within_threshold": False,
            }
            continue
        scale = scales.get(task) or _infer_scale(b, c)
        delta = b - c  # native units, for reporting
        if relative:
            drop = delta / b if b else 0.0  # fraction of baseline (scale-invariant)
        else:
            drop = delta / scale  # fraction of the task's scale
        within = drop <= threshold
        gain = (c - b) / scale
        if gain > _IMPLAUSIBLE_GAIN_FRAC:
            anomalies.append(
                f"{task}: candidate exceeds baseline by {c - b:.4g} ({gain:.1%} of scale, implausible)"
            )
        per_task[task] = {
            "baseline": b,
            "candidate": c,
            "drop": round(delta, 4),
            "drop_fraction": round(drop, 4),
            "scale": scale,
            "within_threshold": within,
        }
        if not within:
            regressed.append(task)

    if anomalies:
        return {
            "pass": False,
            "decision": ANOMALOUS,
            "failure_class": "UNKNOWN",
            "detail": "; ".join(anomalies),
            "per_task": per_task,
        }
    if regressed:
        return {
            "pass": False,
            "decision": REGRESSION,
            "failure_class": None,
            "detail": f"tasks exceeding threshold ({threshold}): {regressed}",
            "per_task": per_task,
        }
    return {
        "pass": True,
        "decision": ACCEPT,
        "failure_class": None,
        "detail": f"all {len(per_task)} task(s) within threshold {threshold}",
        "per_task": per_task,
    }


def main(argv=None):
    """CLI entry point: read baseline/candidate score JSON and print the verdict."""
    p = argparse.ArgumentParser(description="Day-0 compare gate")
    p.add_argument("--baseline", required=True, help="baseline score JSON {task: score}")
    p.add_argument("--candidate", required=True, help="candidate score JSON {task: score}")
    p.add_argument("--threshold", type=float, default=0.01, help="max drop fraction (default 0.01)")
    p.add_argument("--relative", action="store_true", help="measure drop relative to baseline")
    p.add_argument(
        "--scales",
        help="optional JSON {task: max_scale} to override per-task scale inference",
    )
    args = p.parse_args(argv)

    try:
        with open(args.baseline) as f:
            baseline = json.load(f)
        with open(args.candidate) as f:
            candidate = json.load(f)
        scales = json.loads(args.scales) if args.scales else None
    except (OSError, json.JSONDecodeError) as e:
        print(json.dumps({"pass": False, "failure_class": "USER_CONFIG_ERROR", "detail": str(e)}))
        return 2

    result = evaluate_comparison(baseline, candidate, args.threshold, args.relative, scales)
    print(json.dumps(result, indent=2))
    return 0 if result["pass"] else 1


if __name__ == "__main__":
    sys.exit(main())
