Commit e2c3e03
Add day0-release orchestration skill with enforced gates (#1596)
### What does this PR do?
Type of change: new feature (Claude agent skill)
Adds a `day0-release` orchestration skill that chains the existing
domain skills (`ptq` → `deployment` → `evaluation` → `compare-results`)
into the day-0 release happy path, with **code-enforced gates** between
stages.
**The problem it solves.** Today the day-0 release sequence is
*model-driven* — the agent re-decides the PTQ → deploy → eval → compare
order each session, and can silently skip a gate (e.g. report scores
from an incomplete eval, or hand off a checkpoint that missed
quantization coverage). These are failure modes we hit repeatedly in
live trials. `day0-release` makes the sequence and the gates
deterministic so the common case is repeatable.
**What it is — and isn't.** It's a *conductor*, not a new instrument. It
owns the **sequence, the gates, and the accept/retry decision**; the
domain skills still own the actual work. It is **not** for single-stage
asks ("just quantize X" → `ptq`; "run MMLU on this endpoint" →
`evaluation`) — its negative trigger excludes those. It fires only for
the full goal-driven release.
**Goal it drives toward** (the documented day-0 criterion): a quantized
checkpoint smaller than the original, with <1% accuracy drop on the
standard set vs the matching baseline, plus a publish recommendation.
**Contents:**
- `.claude/skills/day0-release/SKILL.md` — the chain + gate calls +
decision logic.
- `.claude/skills/day0-release/scripts/gate_ptq.py`, `gate_run.py`,
`gate_compare.py` — deterministic gate scripts. Each is a pure decision
function (`evaluate_*`) plus a thin file-reading `main`, returning JSON
`{pass, failure_class, detail, ...}`. The `failure_class` values match
the `modelopt-agent-protocols` strawman (`MODEL_UNSUPPORTED`,
`QUANT_COVERAGE_FAILURE`, `EVAL_JUDGE_FAILED`,
`SAMPLE_ACCOUNTING_FAILED`, …).
- `.claude/skills/day0-release/tests/test_gates.py` — 21 unit tests for
the gate decision functions (GPU-free, no cluster).
- `.claude/skills/day0-release/tests/evals.json` — 5 routing + behavior
assertions.
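
The "pure decision function plus thin file-reading `main`" shape listed above could look roughly like the sketch below, using a run gate as the example. Only the output keys (`pass`, `failure_class`, `detail`), the `--summary` flag style, and the class names `EVAL_JUDGE_FAILED` and `SAMPLE_ACCOUNTING_FAILED` come from this description. The summary fields (`status`, `tasks`, `expected_samples`, `scored_samples`, `judge_errors`, `score`) and the class name `RUN_NOT_TERMINAL` are assumptions for illustration, not the schema shipped in this PR.

```python
import argparse
import json


def _fail(failure_class: str, detail: str) -> dict:
    return {"pass": False, "failure_class": failure_class, "detail": detail}


def evaluate_run(summary: dict) -> dict:
    """Pure decision function: takes a parsed summary and touches no files."""
    if summary.get("status") != "SUCCESS":
        # Hypothetical class name for a timed-out or still-running job.
        return _fail("RUN_NOT_TERMINAL", f"status={summary.get('status')!r}")
    tasks = summary.get("tasks") or {}
    if not tasks:
        return _fail("SAMPLE_ACCOUNTING_FAILED", "summary lists no tasks")
    for name, task in tasks.items():
        if task.get("judge_errors", 0) > 0:
            return _fail("EVAL_JUDGE_FAILED", f"{name}: judge/parse errors")
        expected = task.get("expected_samples")
        if expected is None or task.get("scored_samples") != expected:
            return _fail("SAMPLE_ACCOUNTING_FAILED", f"{name}: samples dropped")
        score = task.get("score")
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            return _fail("SAMPLE_ACCOUNTING_FAILED", f"{name}: missing or non-numeric score")
    return {"pass": True, "failure_class": None, "detail": f"{len(tasks)} task(s) valid"}


def main(argv: list[str]) -> int:
    """Thin wrapper: read the JSON summary, print the verdict, return an exit code."""
    parser = argparse.ArgumentParser(description="Sketch of a run gate")
    parser.add_argument("--summary", required=True, help="path to a run-summary JSON file")
    args = parser.parse_args(argv)
    with open(args.summary, encoding="utf-8") as fh:
        verdict = evaluate_run(json.load(fh))
    print(json.dumps(verdict))
    return 0 if verdict["pass"] else 1
```

Keeping the decision in `evaluate_run` is what lets the unit tests run without a GPU or cluster; a real entry point would end with something like `sys.exit(main(sys.argv[1:]))`.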
**As-built note — `gate_ptq.py` input.** v1 takes a `--summary
<validation.json>` (size scan + hf_ptq quant summary: `source_bytes`,
`output_bytes`, `recipe`, `layer_precision_counts`, `metadata_diffs`)
rather than reading the safetensors checkpoint directly. The
`--checkpoint/--source/--recipe` args are reserved stubs; wiring the
gate to build that summary from the exported checkpoint is a follow-up.
`gate_run.py` and `gate_compare.py` likewise read small JSON summaries
the agent produces from the run artifacts.
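
As a rough illustration of how a gate might consume that summary, the sketch below checks the five listed fields in turn. The top-level field names come from the note above; everything else is an assumption: the shape of `layer_precision_counts` (precision label to layer count, plus an `unexpected_unquantized` counter), the recipe-to-precision table, and the class names `UNKNOWN_RECIPE`, `SIZE_NOT_REDUCED`, and `METADATA_MISMATCH`. It is not the code in `gate_ptq.py`.

```python
# Hypothetical recipe -> expected precision label; the real mapping may differ.
EXPECTED_PRECISION = {"fp8": "FP8", "nvfp4": "NVFP4", "nvfp4_experts_only": "NVFP4"}


def evaluate_ptq(summary: dict) -> dict:
    """Decide whether a PTQ export is safe to hand to the deploy stage."""

    def fail(failure_class: str, detail: str) -> dict:
        return {"pass": False, "failure_class": failure_class, "detail": detail}

    recipe = summary.get("recipe")
    if recipe not in EXPECTED_PRECISION:
        return fail("UNKNOWN_RECIPE", f"no coverage rule for recipe {recipe!r}")

    precision = EXPECTED_PRECISION[recipe]
    counts = summary.get("layer_precision_counts") or {}
    quantized = counts.get(precision, 0)
    if quantized == 0:
        # The recipe matched nothing, e.g. an experts-only recipe on a dense model.
        return fail("MODEL_UNSUPPORTED", f"recipe {recipe!r} quantized 0 layers")

    unexpected = counts.get("unexpected_unquantized", 0)
    if unexpected > 0:
        return fail("QUANT_COVERAGE_FAILURE", f"{unexpected} layers left unquantized")

    source = summary.get("source_bytes") or 0
    output = summary.get("output_bytes") or 0
    if source <= 0 or output >= source:
        return fail("SIZE_NOT_REDUCED", f"output_bytes={output}, source_bytes={source}")

    diffs = summary.get("metadata_diffs") or []
    if diffs:
        return fail("METADATA_MISMATCH", f"{len(diffs)} metadata difference(s)")

    ratio = output / source
    return {
        "pass": True,
        "failure_class": None,
        "detail": f"size ratio {ratio:.2f}, {quantized} {precision} layers",
        "size_ratio": ratio,
    }
```

Checking coverage before size means a silently unquantized export is reported as a coverage problem rather than as a size problem.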
### Usage
Ask Claude Code:
```
Release `<org>/<model>` at day-0: quantize to NVFP4, validate accuracy is within
1% of the BF16 baseline on the AA suite, and tell me if it's publishable.
Run on <cluster>.
```
The skill then runs the chain, enforcing a gate after each stage:
```text
setup ─▶ PTQ ─▶ deploy ─▶ baseline-eval ─▶ quantized-eval ─▶ compare ─▶ closeout
          │       │             │                │              │
      gate_ptq  health      gate_run         gate_run     gate_compare
```
| After stage | Gate | Pass condition | On fail |
|---|---|---|---|
| Setup | reachability | creds present, cluster SSH ok | SYSTEMIC → abort |
| PTQ | `gate_ptq.py` | size ratio <1, layer coverage matches recipe, metadata consistent | triage → PATCH / skip recipe / abort |
| Deploy | health | endpoint up + 1 successful generation | triage → PATCH flags/TP / skip |
| Each eval | `gate_run.py` | complete, all samples scored, no judge/parse failure | retry / triage |
| Compare | `gate_compare.py` | every task within <1% drop | REGRESSION → report; ANOMALOUS → human |
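
The halt-on-first-failure control flow in the table can be pictured with the short sketch below. It is illustrative only: the skill expresses the chain in `SKILL.md` and calls the gate scripts from there, and `run_chain` plus the stage callables are hypothetical names introduced here.

```python
from typing import Callable

# A gate is any zero-argument callable returning a verdict dict with a "pass" key.
Gate = Callable[[], dict]


def run_chain(stages: list[tuple[str, Gate]]) -> dict:
    """Run gates in order and stop at the first verdict whose pass is not true."""
    passed: list[str] = []
    for name, gate in stages:
        verdict = gate()
        if not verdict.get("pass", False):
            # Later stages are never invoked once a gate fails.
            return {"stopped_at": name, "passed": passed, "verdict": verdict}
        passed.append(name)
    return {"stopped_at": None, "passed": passed, "verdict": None}
```

With a failing PTQ gate in second position, the deploy and eval callables are never reached, which is the behavior the failure fixture in the Testing section relies on.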
It returns a **decision**, not a raw artifact: `ACCEPT` (publishable,
with report) / `REGRESSION` (which tasks failed the threshold) /
`ANOMALOUS` / `INFEASIBLE` — plus the workspace path and MLflow run IDs.
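
A minimal sketch of how a compare gate could map per-task scores to those verdicts is shown below. It assumes scores on a 0-100 scale and an absolute threshold in points; the function name, the `max_gain` sanity bound, and the choice to treat mismatched task sets as `ANOMALOUS` are assumptions for illustration, and the real `gate_compare.py` also supports a relative threshold that is not shown here.

```python
def evaluate_compare(baseline: dict, candidate: dict,
                     max_drop: float = 1.0, max_gain: float = 5.0) -> dict:
    """Compare per-task scores (assumed 0-100) and return a verdict dict."""

    def anomalous(detail: str) -> dict:
        return {"verdict": "ANOMALOUS", "failed_tasks": [], "detail": detail}

    if set(baseline) != set(candidate):
        return anomalous("baseline and candidate task sets differ")

    regressed = []
    for task in sorted(baseline):
        base, cand = baseline[task], candidate[task]
        for score in (base, cand):
            is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
            if not is_number or not 0 <= score <= 100:
                return anomalous(f"{task}: score {score!r} is not a number in [0, 100]")
        if cand - base > max_gain:
            # A quantized model beating its baseline by a wide margin is suspicious.
            return anomalous(f"{task}: implausible gain of {cand - base:.1f} points")
        if base - cand >= max_drop:
            # The pass condition is a drop strictly below the threshold.
            regressed.append(task)

    if regressed:
        return {"verdict": "REGRESSION", "failed_tasks": regressed,
                "detail": f"{len(regressed)} task(s) dropped by {max_drop} points or more"}
    return {"verdict": "ACCEPT", "failed_tasks": [],
            "detail": f"all {len(baseline)} task(s) within threshold"}
```

Called with `{"arc_easy": 57.0}` as the baseline and `{"arc_easy": 54.0}` as the candidate, the sketch returns `REGRESSION` with `arc_easy` listed as the failed task.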
### Testing
Tested in three layers: deterministic gate logic (CI-able), routing
behavior specs (manual QA), and the agentic stages (integration-level):
- **Gate-script unit tests** — `tests/test_gates.py`, **21 cases, all
passing** (`python -m pytest
.claude/skills/day0-release/tests/test_gates.py`). Covers the pass path
plus each `failure_class` branch for all three gates: `gate_compare`
(ACCEPT / REGRESSION / ANOMALOUS-on-implausible-gain /
ANOMALOUS-out-of-range / mismatched-task-sets / relative-threshold /
non-numeric-score), `gate_run` (valid / dropped-samples / judge-error /
missing-score / non-numeric-score / timeout-non-terminal /
running-non-terminal / no-tasks), `gate_ptq` (pass / not-smaller /
zero-coverage→MODEL_UNSUPPORTED / unexpected-unquantized / metadata-diff
/ unknown-recipe). No GPU or cluster needed.
- **Routing assertions** (`tests/evals.json`, 5 cases): documents
expected routing — fires on "release model X at day-0", does **not**
fire on "just quantize X" / "run MMLU on this endpoint", and the
gate-blocking / regression-reports-and-stops behaviors. These are
behavior specs for manual QA; there is no automated skill-routing
harness yet (tracked as "Remaining Work" in the design doc).
- **End-to-end integration (run on aws-pdx / B300, 2026-06-04)** — ✅ the
full chain and **all five gate types** were exercised on **real cluster
artifacts** (Qwen3-0.6B). Results below.
#### End-to-end integration results
Every gate fired correctly on real PTQ / serve / eval outputs, and
**both** failure-routing branches were exercised:
| Stage | Gate | Real artifact | Verdict | ✓ |
|---|---|---|---|---|
| Setup | reachability | aws-pdx SSH + SLURM | PASS | ✅ |
| PTQ (happy) | `gate_ptq` | FP8 ckpt, 0.50×, FP8=196 layers, unexpected=0 | `pass:true` | ✅ |
| **PTQ (failure fixture)** | `gate_ptq` | `nvfp4_experts_only` on a **dense** model → NVFP4=0, 196 silently-unquantized | **`MODEL_UNSUPPORTED`** → chain **STOPS** before deploy | ✅ |
| Deploy | health | 2 vLLM endpoints, `/health`=200 + generation | PASS | ✅ |
| Eval ×2 | `gate_run` | real run-summaries, 300/300 scored, SUCCESS | `pass:true` | ✅ |
| Compare | `gate_compare` | arc_easy BF16=57.0 vs FP8=54.0 | **`REGRESSION`** (drop 3.0 > 1pt) | ✅ |
- **Failure fixture (the key control-flow proof):** an experts-only
recipe on a dense model matched 0 modules, so PTQ "succeeded" in 3.5s
and exported a checkpoint with `quant_algo:null` / `quantized_layers:{}`
— every linear silently BF16. `gate_ptq` caught the zero-coverage and
routed to `MODEL_UNSUPPORTED`, so the chain **did not** deploy/eval the
bad checkpoint. This is exactly the silent coverage-miss the gate exists
to stop.
- **Caveat — ACCEPT not reached on real data:** the happy path returned
`REGRESSION`, which is the **correct** verdict — Qwen3-0.6B FP8
genuinely loses ~3pt on arc_easy (real, ~2 SE at 300 samples;
weight-only FP8 = 54.0 was no better than KV-FP8 = 54.67, so the
KV-cache wasn't the cause). Tiny models are quantization-fragile and
don't meet the <1% criterion; the gate correctly refused to rubber-stamp
it. The threshold was **not** loosened to manufacture a pass. The
`ACCEPT` terminal + publish-recommendation path therefore remains
validated by unit test (`test_compare_accept_within_threshold`) only,
not end-to-end — demonstrating it on real data needs a larger model
(e.g. Qwen3-4B/8B) where FP8 is near-lossless, deferred as optional
follow-up.
- **Infra bugs shaken out (incidental, not in this skill's code):** (1)
the ModelOpt launcher's `/hf-local` bind mount has no host dir on
aws-pdx → set `SLURM_HF_LOCAL=<lustre dir>`; (2) `lm-eval` is unusable
inside `vllm/vllm-openai:cu130-nightly` (ships transformers 5.x, which
removed `AutoModelForVision2Seq` that lm-eval imports at load) → used a
dependency-free `/v1/completions` log-likelihood scorer.
### Scope
**In v1:** the linear chain + gate scripts + `ACCEPT`/fail-with-report
outcomes. On `REGRESSION`, v1 *reports* "recipe R regressed on tasks
[...]" and stops.
**Deferred (follow-up PR):** the evaluator-optimizer recipe loop
(compare → pick next recipe → re-PTQ), which needs the `bigpareto`
integration and the shared `modelopt-agent-protocols` schema adopted on
both sides.
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅ (new skill; no change to
existing skills)
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A (no new
deps)
- Did you write any new necessary tests?: ✅ (21 gate-script unit tests +
5 routing evals; end-to-end integration run on aws-pdx — see Testing)
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅
- Did you get Claude approval on this PR?: ⬜ (run `/claude review`
before marking ready)
### Additional Information
Design rationale: ModelOpt Agent Skills Design doc — this skill
implements the "deterministic day-0 chain driver"
(prompt-chaining-with-gates, the code-driven orchestration pattern from
Anthropic's *Building Effective Agents*). The gate scripts double as the
data source for the Observability stage metrics, and `gate_compare.py`'s
verdict is the entry point for the deferred evaluator-optimizer recipe
loop.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * Added a deterministic day0-release skill: linear gated flow (setup → PTQ → deploy → eval → compare) that halts on failed gates.
* **New Tools**
  * Added CLI gates for PTQ checkpoint validation, run/evaluation validation, and baseline-vs-candidate comparison (scale-aware thresholds, anomaly detection, clear verdicts and exit codes).
* **Tests**
  * Added unit tests covering pass/regress/anomaly outcomes, failure-class triage, and gate scenarios.
* **Documentation**
  * Added spec and changelog entry describing inputs, gating rules, triage table, and closeout reporting.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 56c4af2 commit e2c3e03
9 files changed
Lines changed: 1097 additions & 2 deletions
File tree
- .agents/skills/day0-release
- scripts
- tests
- .github/workflows