
Commit 9a4f728

isPANN and claude committed
Slim down auto-pipeline Step 2.5
CI-class failures (stale tests, typo'd bib keys, math-mode typos) are small and mechanical; hand them straight to codex-rescue with a one-line failure summary rather than walking the orchestrator through a long JSON contract, PR-comment template, and explicit OnHold dance. Re-run Step 2.5 once after codex; OnHold only if still failing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
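The policy this message describes (run the gate, hand a failure to codex once, re-run once, then park) can be sketched as below. This is a minimal illustration under stated assumptions, not code from the repository: `run_gate` and `rescue` are hypothetical callables standing in for the gate-subagent dispatch and the `codex:codex-rescue` hand-off, and the return strings are invented labels.

```python
from typing import Callable, Dict

Report = Dict[str, str]


def gate_passed(report: Report) -> bool:
    """True only when both the test run and the paper build passed."""
    return report.get("tests") == "pass" and report.get("paper") == "pass"


def run_integration_gate(
    run_gate: Callable[[], Report],
    rescue: Callable[[str], None],
) -> str:
    """Return "continue" or "on_hold" following the one-retry policy.

    run_gate and rescue are hypothetical stand-ins for dispatching the
    gate subagent and codex:codex-rescue; neither is a real API.
    """
    report = run_gate()
    if gate_passed(report):
        return "continue"

    # Hand over only the one-line failure summary, then re-run the gate once.
    rescue(report.get("first_failure", ""))
    if gate_passed(run_gate()):
        return "continue"

    # Still failing after a single rescue attempt: park the item.
    return "on_hold"
```

The single retry is deliberate in this sketch: a second failure after the rescue pass suggests the problem is not the small, mechanical kind the commit message has in mind.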
1 parent 5a52fb7 commit 9a4f728

1 file changed

Lines changed: 11 additions & 71 deletions

.claude/skills/auto-pipeline/SKILL.md

@@ -346,80 +346,21 @@ When the subagent returns:
 
 ## Step 2.5: Integration Gate (orchestrator-owned)
 
-`run-pipeline` and the sub-skills it invokes (`add-model`, `add-rule`, `review-structural`) test each item **in isolation** — the new rule's closed-loop, the new model's unit tests, the rule's own paper entry. None of them runs the full workspace test suite or `make paper`. Two classes of breakage slip past:
+The per-item sub-skills only test the new item in isolation, so cross-crate regressions (e.g. a relaxed model validator breaking pre-existing CLI tests) and paper-compile errors (orphan bib keys, math-mode typos like `intersect` vs Typst's `inter`) slip through Phase 2 and Phase 3. CI catches both, but in batch mode (many issues on one branch) breakage accumulates silently. Running this gate after every Phase 2 success closes the loop.
 
-- **Cross-crate test regressions.** A model change (validator relaxation, schema field rename, removed enum variant) can break pre-existing tests in `problemreductions-cli/tests/` or other crates that the per-item tests never exercise.
-- **Paper compile errors.** A typo'd math-mode token (`intersect` vs Typst's `inter`), an orphan bib key (cited but absent from `references.bib`), or a stale key (`@lawler1978a` vs `@lawler1978`) is invisible until `typst compile` resolves references — neither `add-rule` nor `review-structural` runs `make paper`.
+Dispatch a fresh subagent (`subagent_type=general-purpose`, not invoking any existing skill):
 
-CI catches both, but only after the PR opens. In **batch mode** (multiple issues stacked on a shared branch with one PR opened at the end — outside this skill's `one-issue-one-PR` default), the breakage accumulates silently across N commits. Running this gate after every Phase 2 success keeps the cadence right regardless of PR strategy.
-
-### 2.5a. Dispatch the integration-gate subagent
-
-Use the `Agent` tool with `subagent_type=general-purpose`. The orchestrator main agent does NOT own a worktree, so this subagent enters the PR branch itself. The subagent is NOT invoking an existing skill — these are raw commands.
-
-**Prompt template:**
-
-```
-Run the auto-pipeline integration gate on PR #<PR> at HEAD.
-
-1. Enter the PR branch in a fresh worktree:
-   WT=$(python3 scripts/pipeline_worktree.py enter \
-     --name "auto-pipeline-gate-<PR>" --format json \
-     | python3 -c "import sys,json; print(json.load(sys.stdin)['worktree_dir'])")
-   cd "$WT"
-   gh pr checkout <PR>
-
-2. Run these two commands sequentially. Do NOT modify any files.
-   Capture the FIRST failure message from each, if any:
-   a. cargo test --workspace --features "ilp-highs example-db"
-      (same flags the CI Test job uses)
-   b. make paper
-
-3. Clean up:
-   cd <REPO_ROOT>
-   python3 scripts/pipeline_worktree.py cleanup --worktree "$WT"
-
-Return ONLY this JSON shape (no prose):
-{
-  "head_sha": "<git rev-parse HEAD>",
-  "tests": {"outcome": "pass" | "fail", "first_failure": "<test name + assertion, or empty>"},
-  "paper": {"outcome": "pass" | "fail", "first_failure": "<typst/make error + file:line, or empty>"}
-}
 ```
+Run the auto-pipeline integration gate on PR #<PR>. Check out the PR
+branch in a fresh worktree, run `make check` then `make paper`, clean up.
+Do not modify files. Return ONLY:
 
-### 2.5b. Branch on the report
-
-- **Both `tests.outcome` and `paper.outcome == "pass"`** → continue to Step 3.
-- **Either fails** → STOP. Post a diagnostic PR comment with both `first_failure` strings, move the project card to OnHold, and print:
-
-```bash
-COMMENT_FILE=$(mktemp)
-cat > "$COMMENT_FILE" <<EOF
-**auto-pipeline integration gate failed**
-
-- \`cargo test --workspace\`: \`$TESTS_OUTCOME\`$TESTS_FIRST_FAILURE
-- \`make paper\`: \`$PAPER_OUTCOME\`$PAPER_FIRST_FAILURE
-
-HEAD: \`$HEAD_SHA\`. Per-item closed-loop tests passed, but workspace-wide tests or the Typst paper compile broke. Common causes: validator change relaxes a constraint that a CLI integration test asserted; new \`reduction-rule\` cites a bib key not in \`references.bib\`; math-mode uses an English word Typst does not know (e.g. \`intersect\` should be \`inter\`).
-EOF
-python3 scripts/pipeline_pr.py comment --repo "$REPO" --pr "$PR" --body-file "$COMMENT_FILE"
-rm -f "$COMMENT_FILE"
-uv run --project scripts scripts/pipeline_board.py move "$ITEM_ID" on-hold
-```
-
-```
-Auto-pipeline halted at integration gate:
-Issue: #<ISSUE>
-PR: #<PR>
-Tests: <pass|fail> — <first_failure>
-Paper: <pass|fail> — <first_failure>
-Board: Review pool -> OnHold
-Next: human triage; cross-crate regressions and paper-compile bugs
-are not eligible for codex rescue because the failing artefact
-lives outside the new issue's files.
-```
+{"tests": "pass" | "fail", "paper": "pass" | "fail",
+"first_failure": "<first failing test or typst error, or empty>"}
+```
 
-Do NOT auto-dispatch `codex:codex-rescue` here. The failure mode is "this issue's implementation broke something the per-item tests do not cover," which is a focused investigative task that benefits from human eyes (and the offending code usually lives in a file the issue body does not reference, so the codex prompt would be misleading).
+- Both `pass` → continue to Step 3.
+- Either `fail` → hand the `first_failure` to `codex:codex-rescue` for a fix-it pass (CI-class problems are usually small: deleting a stale test, fixing a typo'd bib key, swapping `intersect` for `inter`). After codex returns, re-run Step 2.5 once. If still failing, park on OnHold.
 
 ## Step 3: Agentic Review (`review-pipeline` subagent)
 
@@ -463,5 +404,4 @@ Auto-pipeline complete:
 | Letting the codex subagent edit GitHub | The orchestrator owns all `gh issue edit` calls — codex only returns text |
 | Treating implementation failures as substantive issue problems | Step 2 failures go straight to a stop; they are not eligible for codex rescue |
 | Picking from a non-Backlog column when no issue number is given | Auto-pick must read from Backlog only — never from OnHold, Ready, or elsewhere |
-| Skipping Step 2.5 because Phase 2 reported `success` | Phase 2's success is scoped to the new item's own tests. Workspace-wide regressions (e.g. CLI integration tests asserting a model behaviour the new rule just relaxed) and paper compile errors (`intersect` vs `inter`, orphan bib keys) are only visible after `cargo test --workspace` and `make paper`. Always run Step 2.5 before Step 3. |
-| Treating Step 2.5 failures as eligible for codex rescue | The failing artefact lives outside the new issue's files (CLI tests in another crate, a bib key elsewhere in `references.bib`, a Typst symbol in an unrelated proof). Codex prompts seeded from the issue body would be misleading. Halt + human triage. |
+| Skipping Step 2.5 because Phase 2 reported `success` | Phase 2 success is scoped to the new item's own tests; workspace-wide regressions and paper-compile bugs are only visible from `make check` + `make paper`. |
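
Because the slimmed-down prompt asks the subagent to return only a small JSON object, an orchestrator may want to validate that reply before branching on it. The following is a minimal Python sketch, assuming the three-field shape (`tests`, `paper`, `first_failure`) shown in the new prompt; `parse_gate_report` is a hypothetical helper and does not exist in the repository.

```python
import json

# The only outcomes the new prompt allows for "tests" and "paper".
VALID_OUTCOMES = ("pass", "fail")


def parse_gate_report(raw: str) -> dict:
    """Parse and validate the gate subagent's JSON reply.

    Hypothetical helper: raises ValueError when the reply does not match
    the three-field shape requested in the prompt.
    """
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"gate reply is not valid JSON: {exc}") from exc

    if not isinstance(report, dict):
        raise ValueError("gate reply must be a JSON object")

    for key in ("tests", "paper"):
        if report.get(key) not in VALID_OUTCOMES:
            raise ValueError(f"{key!r} must be 'pass' or 'fail'")

    # An absent first_failure is treated as empty, matching "or empty".
    first_failure = report.get("first_failure", "")
    if not isinstance(first_failure, str):
        raise ValueError("'first_failure' must be a string")

    return {
        "tests": report["tests"],
        "paper": report["paper"],
        "first_failure": first_failure,
    }
```

Rejecting a malformed reply early keeps a chatty or truncated subagent response from being mistaken for a pass.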
