
Commit 5a52fb7

isPANN and claude committed
Add auto-pipeline integration gate (Step 2.5)
PR #1067 hit two CI failures the per-item phases could not see:

- Two CLI integration tests asserted a model-validator behaviour that the new OLA -> SequencingToMinimizeWeightedCompletionTime reduction intentionally relaxed; closed-loop and unit tests for the new rule all passed.
- `make paper` blew up on `intersect` (Typst expected `inter`), an orphan bib key, and a typo'd key; no phase ran the Typst compile.

Adds an orchestrator-owned Step 2.5 that runs `cargo test --workspace --features "ilp-highs example-db"` and `make paper` on PR HEAD between Phase 2 (run-pipeline) and Phase 3 (review-pipeline). Any failure parks the card on OnHold for human triage; codex rescue is not appropriate because the failing artefact lives outside the issue's files.

In batch mode (multiple issues stacked on one branch with the PR opened at the end), this gate is the only thing that catches accumulated cross-item breakage before review.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent f4bad09 commit 5a52fb7

1 file changed

Lines changed: 85 additions & 3 deletions

.claude/skills/auto-pipeline/SKILL.md

@@ -7,7 +7,7 @@ description: Use when you want to take a Backlog issue all the way to Final revi

 Take **one** Backlog issue all the way from quality gate to **Final review** without human intervention. The merge step itself is still left to the human (see `/final-review`).

-This skill is an **orchestrator**: it never runs the heavy work itself. Each phase is delegated to a fresh-context subagent that invokes the relevant existing skill (`check-issue`, `fix-issue`, `run-pipeline`, `review-pipeline`). The only thing the main agent does directly is:
+This skill is an **orchestrator**: it never runs the heavy work itself. Each phase is delegated to a fresh-context subagent. Most phases invoke an existing skill (`check-issue`, `fix-issue`, `run-pipeline`, `review-pipeline`); Phase 2.5 is owned by the orchestrator and runs raw `cargo test --workspace` + `make paper` to catch breakage the per-item sub-skills cannot see. The only thing the main agent does directly is:

 1. pick the issue,
 2. read structured reports from subagents,
@@ -68,6 +68,7 @@ digraph auto_pipeline {
     "Move to OnHold + comment" [shape=box, style=filled, fillcolor="#ffcccc"];
     "Move to Ready" [shape=box];
     "Phase 2: run-pipeline (subagent)" [shape=box, style=filled, fillcolor="#cce0ff"];
+    "Phase 2.5: integration gate (subagent)" [shape=box, style=filled, fillcolor="#cce0ff"];
     "Phase 3: review-pipeline (subagent)" [shape=box, style=filled, fillcolor="#cce0ff"];
     "Final report" [shape=box, style=filled, fillcolor="#ccffcc"];

@@ -83,8 +84,10 @@ digraph auto_pipeline {
     "Substantive loop counter" -> "Phase 1: check-issue (subagent)" [label="< 2 retries"];
     "Substantive loop counter" -> "Move to OnHold + comment" [label=">= 2 retries"];
     "Move to Ready" -> "Phase 2: run-pipeline (subagent)";
-    "Phase 2: run-pipeline (subagent)" -> "Phase 3: review-pipeline (subagent)" [label="success"];
+    "Phase 2: run-pipeline (subagent)" -> "Phase 2.5: integration gate (subagent)" [label="success"];
     "Phase 2: run-pipeline (subagent)" -> "Final report" [label="fail (stop)"];
+    "Phase 2.5: integration gate (subagent)" -> "Phase 3: review-pipeline (subagent)" [label="all pass"];
+    "Phase 2.5: integration gate (subagent)" -> "Move to OnHold + comment" [label="any fail"];
     "Phase 3: review-pipeline (subagent)" -> "Final report";
 }
 ```
@@ -328,7 +331,7 @@ Return ONLY this JSON shape:

 When the subagent returns:

-- **`outcome == "success"`** → continue to Step 3.
+- **`outcome == "success"`** → continue to Step 2.5.
 - **`outcome == "failure"`** → STOP. The `run-pipeline` skill already moves the card to OnHold and posts a diagnostic comment, so we do not duplicate. Print:

 ```
@@ -341,6 +344,83 @@ When the subagent returns:

 Do NOT call codex to rescue here: implementation failures are CI/code-shape problems that need human eyes.
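
The routing that the bullets and the `digraph` edges describe after this change can be read as a small lookup table. The following is a minimal Python sketch under stated assumptions, not code from the repository: `TRANSITIONS` and `next_step` are hypothetical names, the short phase keys stand in for the graph's node labels, and `"done"` is an invented label for the unlabelled Phase 3 edge.

```python
# Hypothetical routing table; keys are (phase, outcome) pairs and values are
# the next node. Outcome strings mirror the edge labels in the digraph,
# except "done", which is an assumption (that edge carries no label).
TRANSITIONS = {
    ("phase-2", "success"): "phase-2.5",
    ("phase-2", "fail (stop)"): "final-report",  # run-pipeline already parked the card
    ("phase-2.5", "all pass"): "phase-3",
    ("phase-2.5", "any fail"): "on-hold",
    ("phase-3", "done"): "final-report",
}


def next_step(phase: str, outcome: str) -> str:
    """Return the node that follows `phase` for the given `outcome`."""
    try:
        return TRANSITIONS[(phase, outcome)]
    except KeyError:
        # An unknown pair means the orchestrator received a report it cannot route.
        raise ValueError(f"no edge for {phase!r} with outcome {outcome!r}") from None
```

Keeping the edges in one table makes the effect of this commit easy to see: the only way to reach `phase-3` is through the `phase-2.5` row.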

+## Step 2.5: Integration Gate (orchestrator-owned)
+
+`run-pipeline` and the sub-skills it invokes (`add-model`, `add-rule`, `review-structural`) test each item **in isolation**: the new rule's closed-loop, the new model's unit tests, the rule's own paper entry. None of them runs the full workspace test suite or `make paper`. Two classes of breakage slip past:
+
+- **Cross-crate test regressions.** A model change (validator relaxation, schema field rename, removed enum variant) can break pre-existing tests in `problemreductions-cli/tests/` or other crates that the per-item tests never exercise.
+- **Paper compile errors.** A typo'd math-mode token (`intersect` vs Typst's `inter`), an orphan bib key (cited but absent from `references.bib`), or a stale key (`@lawler1978a` vs `@lawler1978`) is invisible until `typst compile` resolves references; neither `add-rule` nor `review-structural` runs `make paper`.
+
+CI catches both, but only after the PR opens. In **batch mode** (multiple issues stacked on a shared branch with one PR opened at the end, outside this skill's `one-issue-one-PR` default), the breakage accumulates silently across N commits. Running this gate after every Phase 2 success keeps the cadence right regardless of PR strategy.
+
+### 2.5a. Dispatch the integration-gate subagent
+
+Use the `Agent` tool with `subagent_type=general-purpose`. The orchestrator main agent does NOT own a worktree, so this subagent enters the PR branch itself. The subagent is NOT invoking an existing skill; these are raw commands.
+
+**Prompt template:**
+
+```
+Run the auto-pipeline integration gate on PR #<PR> at HEAD.
+
+1. Enter the PR branch in a fresh worktree:
+   WT=$(python3 scripts/pipeline_worktree.py enter \
+     --name "auto-pipeline-gate-<PR>" --format json \
+     | python3 -c "import sys,json; print(json.load(sys.stdin)['worktree_dir'])")
+   cd "$WT"
+   gh pr checkout <PR>
+
+2. Run these two commands sequentially. Do NOT modify any files.
+   Capture the FIRST failure message from each, if any:
+   a. cargo test --workspace --features "ilp-highs example-db"
+      (same flags the CI Test job uses)
+   b. make paper
+
+3. Clean up:
+   cd <REPO_ROOT>
+   python3 scripts/pipeline_worktree.py cleanup --worktree "$WT"
+
+Return ONLY this JSON shape (no prose):
+{
+  "head_sha": "<git rev-parse HEAD>",
+  "tests": {"outcome": "pass" | "fail", "first_failure": "<test name + assertion, or empty>"},
+  "paper": {"outcome": "pass" | "fail", "first_failure": "<typst/make error + file:line, or empty>"}
+}
+```
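
Step 2 of the prompt template and its return shape can be pictured as one small function: run both checks in order, never short-circuit, and pack the results into the report. The sketch below is a hedged illustration in Python, not the subagent's actual behaviour. `GATE_COMMANDS`, `run_check`, and `build_report` are hypothetical names, and taking the first non-empty output line as `first_failure` is a crude placeholder, since real `cargo` and Typst output would need proper parsing to find the first failing test or error.

```python
import subprocess

# The two command lines are quoted from the prompt template; everything else
# in this sketch is illustrative.
GATE_COMMANDS = {
    "tests": ["cargo", "test", "--workspace", "--features", "ilp-highs example-db"],
    "paper": ["make", "paper"],
}


def run_check(cmd: list[str]) -> dict:
    """Run one command and summarise it as {"outcome", "first_failure"}."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return {"outcome": "pass", "first_failure": ""}
    # Placeholder heuristic: first non-empty line of stderr, then stdout.
    combined = proc.stderr + "\n" + proc.stdout
    lines = [ln.strip() for ln in combined.splitlines() if ln.strip()]
    first = lines[0] if lines else f"exit code {proc.returncode}"
    return {"outcome": "fail", "first_failure": first}


def build_report(head_sha: str, commands: dict[str, list[str]] = GATE_COMMANDS) -> dict:
    """Run every check, even after a failure, and return the report shape."""
    report = {"head_sha": head_sha}
    for name, cmd in commands.items():
        report[name] = run_check(cmd)
    return report
```

Running both commands even when the first one fails matches the template's request for a `first_failure` from each check, so a single halt comment can report a test regression and a paper error together.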
+
+### 2.5b. Branch on the report
+
+- **Both `tests.outcome` and `paper.outcome == "pass"`** → continue to Step 3.
+- **Either fails** → STOP. Post a diagnostic PR comment with both `first_failure` strings, move the project card to OnHold, and print:
+
+```bash
+COMMENT_FILE=$(mktemp)
+cat > "$COMMENT_FILE" <<EOF
+**auto-pipeline integration gate failed**
+
+- \`cargo test --workspace\`: \`$TESTS_OUTCOME\`$TESTS_FIRST_FAILURE
+- \`make paper\`: \`$PAPER_OUTCOME\`$PAPER_FIRST_FAILURE
+
+HEAD: \`$HEAD_SHA\`. Per-item closed-loop tests passed, but workspace-wide tests or the Typst paper compile broke. Common causes: validator change relaxes a constraint that a CLI integration test asserted; new \`reduction-rule\` cites a bib key not in \`references.bib\`; math-mode uses an English word Typst does not know (e.g. \`intersect\` should be \`inter\`).
+EOF
+python3 scripts/pipeline_pr.py comment --repo "$REPO" --pr "$PR" --body-file "$COMMENT_FILE"
+rm -f "$COMMENT_FILE"
+uv run --project scripts scripts/pipeline_board.py move "$ITEM_ID" on-hold
+```
+
+```
+Auto-pipeline halted at integration gate:
+  Issue: #<ISSUE>
+  PR: #<PR>
+  Tests: <pass|fail> - <first_failure>
+  Paper: <pass|fail> - <first_failure>
+  Board: Review pool -> OnHold
+  Next: human triage; cross-crate regressions and paper-compile bugs
+        are not eligible for codex rescue because the failing artefact
+        lives outside the new issue's files.
+```
+
+Do NOT auto-dispatch `codex:codex-rescue` here. The failure mode is "this issue's implementation broke something the per-item tests do not cover," which is a focused investigative task that benefits from human eyes (and the offending code usually lives in a file the issue body does not reference, so the codex prompt would be misleading).
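
The branch in 2.5b reduces to one predicate over the report plus a formatter for the halt banner. The following is a rough Python sketch of that logic, assuming the report shape from 2.5a. `gate_passed` and `halt_message` are hypothetical helpers, the banner omits the trailing `Next:` lines for brevity, and the real skill performs the comment and board move through the shell commands shown above.

```python
def gate_passed(report: dict) -> bool:
    """True only when both the test run and the paper build report "pass"."""
    return all(report[key]["outcome"] == "pass" for key in ("tests", "paper"))


def halt_message(issue: int, pr: int, report: dict) -> str:
    """Render the halt banner that is printed when the gate fails."""

    def summarise(key: str) -> str:
        check = report[key]
        # Append the failure detail only when there is one to show.
        detail = f" - {check['first_failure']}" if check["first_failure"] else ""
        return f"{check['outcome']}{detail}"

    return "\n".join(
        [
            "Auto-pipeline halted at integration gate:",
            f"  Issue: #{issue}",
            f"  PR: #{pr}",
            f"  Tests: {summarise('tests')}",
            f"  Paper: {summarise('paper')}",
            "  Board: Review pool -> OnHold",
        ]
    )
```

A caller would check `gate_passed(report)` first and continue to Step 3 on `True`; only on `False` does it build the banner, post the comment, and move the card.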
+
 ## Step 3: Agentic Review (`review-pipeline` subagent)

 Dispatch the existing `review-pipeline` skill against the PR:
@@ -383,3 +463,5 @@ Auto-pipeline complete:
 | Letting the codex subagent edit GitHub | The orchestrator owns all `gh issue edit` calls; codex only returns text |
 | Treating implementation failures as substantive issue problems | Step 2 failures go straight to a stop; they are not eligible for codex rescue |
 | Picking from a non-Backlog column when no issue number is given | Auto-pick must read from Backlog only, never from OnHold, Ready, or elsewhere |
+| Skipping Step 2.5 because Phase 2 reported `success` | Phase 2's success is scoped to the new item's own tests. Workspace-wide regressions (e.g. CLI integration tests asserting a model behaviour the new rule just relaxed) and paper compile errors (`intersect` vs `inter`, orphan bib keys) are only visible after `cargo test --workspace` and `make paper`. Always run Step 2.5 before Step 3. |
+| Treating Step 2.5 failures as eligible for codex rescue | The failing artefact lives outside the new issue's files (CLI tests in another crate, a bib key elsewhere in `references.bib`, a Typst symbol in an unrelated proof). Codex prompts seeded from the issue body would be misleading. Halt + human triage. |
