PR #1067 hit two CI failures the per-item phases could not see:
- Two CLI integration tests asserted a model-validator behaviour that
the new OLA -> SequencingToMinimizeWeightedCompletionTime reduction
intentionally relaxed; closed-loop and unit tests for the new rule
all passed.
- `make paper` blew up on `intersect` (Typst expected `inter`), an
orphan bib key, and a typo'd key; no phase ran the Typst compile.
Adds an orchestrator-owned Step 2.5 that runs
`cargo test --workspace --features "ilp-highs example-db"` and
`make paper` on PR HEAD between Phase 2 (run-pipeline) and Phase 3
(review-pipeline). Any failure parks the card on OnHold for human
triage — codex rescue is not appropriate because the failing artefact
lives outside the issue's files.
In batch mode (multiple issues stacked on one branch with the PR
opened at the end), this gate is the only thing that catches
accumulated cross-item breakage before review.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
.claude/skills/auto-pipeline/SKILL.md (85 additions & 3 deletions)
@@ -7,7 +7,7 @@ description: Use when you want to take a Backlog issue all the way to Final revi
Take **one** Backlog issue all the way from quality gate to **Final review** without human intervention. The merge step itself is still left to the human (see `/final-review`).
- This skill is an **orchestrator**: it never runs the heavy work itself. Each phase is delegated to a fresh-context subagent that invokes the relevant existing skill (`check-issue`, `fix-issue`, `run-pipeline`, `review-pipeline`). The only thing the main agent does directly is:
+ This skill is an **orchestrator**: it never runs the heavy work itself. Each phase is delegated to a fresh-context subagent. Most phases invoke an existing skill (`check-issue`, `fix-issue`, `run-pipeline`, `review-pipeline`); Phase 2.5 is owned by the orchestrator and runs raw `cargo test --workspace` + `make paper` to catch breakage the per-item sub-skills cannot see. The only thing the main agent does directly is:
1. pick the issue,
2. read structured reports from subagents,
@@ -68,6 +68,7 @@ digraph auto_pipeline {
"Move to OnHold + comment" [shape=box, style=filled, fillcolor="#ffcccc"];
"Phase 3: review-pipeline (subagent)" -> "Final report";
}
```
@@ -328,7 +331,7 @@ Return ONLY this JSON shape:
When the subagent returns:
- - **`outcome == "success"`** → continue to Step 3.
+ - **`outcome == "success"`** → continue to Step 2.5.
  - **`outcome == "failure"`** → STOP. The `run-pipeline` skill already moves the card to OnHold and posts a diagnostic comment, so we do not duplicate. Print:
```
@@ -341,6 +344,83 @@ When the subagent returns:
Do NOT call codex to rescue here — implementation failures are CI/code-shape problems that need human eyes.
`run-pipeline` and the sub-skills it invokes (`add-model`, `add-rule`, `review-structural`) test each item **in isolation** — the new rule's closed-loop, the new model's unit tests, the rule's own paper entry. None of them runs the full workspace test suite or `make paper`. Two classes of breakage slip past:
+
+ - **Cross-crate test regressions.** A model change (validator relaxation, schema field rename, removed enum variant) can break pre-existing tests in `problemreductions-cli/tests/` or other crates that the per-item tests never exercise.
+ - **Paper compile errors.** A typo'd math-mode token (`intersect` vs Typst's `inter`), an orphan bib key (cited but absent from `references.bib`), or a stale key (`@lawler1978a` vs `@lawler1978`) is invisible until `typst compile` resolves references — neither `add-rule` nor `review-structural` runs `make paper`.
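The orphan and stale bib keys described in the second bullet can often be spotted before the Typst compile. The following is a minimal sketch, not part of this commit: it assumes citations are written in the `@key` form and that the bibliography uses ordinary BibTeX entry headers. The function names and the `smith2000` key are hypothetical. Because Typst also uses `@name` for labels on figures and equations, anything it reports is a candidate to check, not a confirmed error.

```python
import re

# Typst-style reference "@key". The lookbehind skips email-like text
# such as "noreply@anthropic.com".
CITE_RE = re.compile(r"(?<![\w.])@([A-Za-z0-9_][A-Za-z0-9_:.\-]*)")
# BibTeX entry header, e.g. "@article{lawler1978,".
BIB_KEY_RE = re.compile(r"@\w+\s*\{\s*([^,\s{}]+)\s*,")


def cited_keys(typst_source: str) -> set[str]:
    """Collect every @key, dropping trailing sentence punctuation."""
    keys = {m.group(1).rstrip(".:") for m in CITE_RE.finditer(typst_source)}
    return {k for k in keys if k}


def defined_keys(bib_source: str) -> set[str]:
    """Collect the keys declared in a BibTeX file."""
    return set(BIB_KEY_RE.findall(bib_source))


def orphan_keys(typst_source: str, bib_source: str) -> list[str]:
    """Keys that are cited but have no matching bibliography entry."""
    return sorted(cited_keys(typst_source) - defined_keys(bib_source))


# Hypothetical inputs: one stale key (lawler1978a) and one valid key.
paper = "Lawler's rule @lawler1978a extends @smith2000."
bib = "@article{lawler1978,\n  title = {...}\n}\n@book{smith2000,\n  title = {...}\n}"
print(orphan_keys(paper, bib))  # ['lawler1978a']
```

A check like this only narrows the search. `make paper` remains the authoritative test because only the real compile resolves labels and math-mode symbols.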
+
+ CI catches both, but only after the PR opens. In **batch mode** (multiple issues stacked on a shared branch with one PR opened at the end — outside this skill's `one-issue-one-PR` default), the breakage accumulates silently across N commits. Running this gate after every Phase 2 success keeps the cadence right regardless of PR strategy.
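As a rough illustration of what the gate does, the sketch below runs the two commands named in the commit message and records a pass/fail result plus the first error-looking line for each. It is an assumption-laden outline, not the skill's implementation (the skill delegates this to a subagent). `StepResult`, `run_step`, `run_gate`, and the "first failure" heuristic are all hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass

# The two commands from the commit message, as argv lists.
GATE_COMMANDS = {
    "tests": ["cargo", "test", "--workspace", "--features", "ilp-highs example-db"],
    "paper": ["make", "paper"],
}


@dataclass
class StepResult:
    name: str
    passed: bool
    first_failure: str  # empty when the step passed


def run_step(name, argv, cwd=None):
    """Run one command and summarise its outcome."""
    try:
        proc = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        return StepResult(name, False, f"could not start {argv[0]}: {exc}")
    if proc.returncode == 0:
        return StepResult(name, True, "")
    output = proc.stderr + "\n" + proc.stdout
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # Heuristic (an assumption): prefer a line that mentions an error or failure.
    hit = next(
        (ln for ln in lines if "error" in ln.lower() or "failed" in ln.lower()),
        None,
    )
    first = hit or (lines[0] if lines else f"exit code {proc.returncode}")
    return StepResult(name, False, first)


def run_gate(commands=GATE_COMMANDS, cwd=None):
    """Run every step, even after a failure, so both results get reported."""
    return [run_step(name, argv, cwd) for name, argv in commands.items()]


def gate_passed(results):
    return all(r.passed for r in results)


# Demo with a stand-in command so the sketch runs without cargo or Typst.
demo = run_step("demo", [sys.executable, "-c", "import sys; sys.exit('error: boom')"])
print(demo.passed, demo.first_failure)  # False error: boom
```

Running both steps unconditionally matches the halt report format, which has separate `Tests:` and `Paper:` lines. Stopping at the first failure would leave the second line unknown.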
+
+ ### 2.5a. Dispatch the integration-gate subagent
+
+ Use the `Agent` tool with `subagent_type=general-purpose`. The orchestrator main agent does NOT own a worktree, so this subagent enters the PR branch itself. The subagent is NOT invoking an existing skill — these are raw commands.
+
+ **Prompt template:**
+
+ ```
+ Run the auto-pipeline integration gate on PR #<PR> at HEAD.
+ HEAD: \`$HEAD_SHA\`. Per-item closed-loop tests passed, but workspace-wide tests or the Typst paper compile broke. Common causes: validator change relaxes a constraint that a CLI integration test asserted; new \`reduction-rule\` cites a bib key not in \`references.bib\`; math-mode uses an English word Typst does not know (e.g. \`intersect\` should be \`inter\`).
+ uv run --project scripts scripts/pipeline_board.py move "$ITEM_ID" on-hold
+ ```
+
+ ```
+ Auto-pipeline halted at integration gate:
+ Issue: #<ISSUE>
+ PR: #<PR>
+ Tests: <pass|fail> — <first_failure>
+ Paper: <pass|fail> — <first_failure>
+ Board: Review pool -> OnHold
+ Next: human triage; cross-crate regressions and paper-compile bugs
+ are not eligible for codex rescue because the failing artefact
+ lives outside the new issue's files.
+ ```
+
+ Do NOT auto-dispatch `codex:codex-rescue` here. The failure mode is "this issue's implementation broke something the per-item tests do not cover," which is a focused investigative task that benefits from human eyes (and the offending code usually lives in a file the issue body does not reference, so the codex prompt would be misleading).
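For reference, the halt message above could be assembled from the gate results roughly as follows. This is a sketch under assumptions, not code from the commit: `format_halt_report` and its parameters are hypothetical, and printing `none` for a passing step is a guess, since the template does not say what `<first_failure>` holds on a pass.

```python
def format_halt_report(issue, pr, tests_ok, tests_failure, paper_ok, paper_failure):
    """Render the integration-gate halt message from per-step results."""

    def status_line(label, ok, failure):
        status = "pass" if ok else "fail"
        # Assumption: a passing step shows "none" in the failure slot.
        detail = failure if (not ok and failure) else "none"
        # \u2014 is the dash used in the template's "<pass|fail> ... <first_failure>".
        return f"{label}: {status} \u2014 {detail}"

    return "\n".join([
        "Auto-pipeline halted at integration gate:",
        f"Issue: #{issue}",
        f"PR: #{pr}",
        status_line("Tests", tests_ok, tests_failure),
        status_line("Paper", paper_ok, paper_failure),
        "Board: Review pool -> OnHold",
        "Next: human triage; cross-crate regressions and paper-compile bugs",
        "are not eligible for codex rescue because the failing artefact",
        "lives outside the new issue's files.",
    ])


# Hypothetical issue/PR numbers and failure text, for illustration only.
print(format_halt_report(101, 202, True, "", False, "unknown symbol in math mode"))
```

Keeping the message a pure function of the results makes it easy to verify that the report never suggests a codex rescue, which this step rules out.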
Dispatch the existing `review-pipeline` skill against the PR:
@@ -383,3 +463,5 @@ Auto-pipeline complete:
| Letting the codex subagent edit GitHub | The orchestrator owns all `gh issue edit` calls — codex only returns text |
| Treating implementation failures as substantive issue problems | Step 2 failures go straight to a stop; they are not eligible for codex rescue |
| Picking from a non-Backlog column when no issue number is given | Auto-pick must read from Backlog only — never from OnHold, Ready, or elsewhere |
+ | Skipping Step 2.5 because Phase 2 reported `success` | Phase 2's success is scoped to the new item's own tests. Workspace-wide regressions (e.g. CLI integration tests asserting a model behaviour the new rule just relaxed) and paper compile errors (`intersect` vs `inter`, orphan bib keys) are only visible after `cargo test --workspace` and `make paper`. Always run Step 2.5 before Step 3. |
+ | Treating Step 2.5 failures as eligible for codex rescue | The failing artefact lives outside the new issue's files (CLI tests in another crate, a bib key elsewhere in `references.bib`, a Typst symbol in an unrelated proof). Codex prompts seeded from the issue body would be misleading. Halt + human triage. |