|
| 1 | +--- |
| 2 | +description: Triage failing Claude-authored PRs — read sweep logs, debug, and push a candidate fix per PR |
| 3 | +--- |
| 4 | + |
| 5 | +For each open Claude-authored PR (`claude/*` branch) whose full-sweep validation produced at least one **FAILED** check, fetch the failing run's logs, diagnose the root cause, and push a candidate fix to the PR's branch. |
| 6 | + |
| 7 | +This command modifies remote PR branches. **Pause for user confirmation** after listing the candidate PRs and again before pushing each fix. |
| 8 | + |
| 9 | +## Step 1 — find failing `claude/*` PRs whose sweep actually ran |
| 10 | + |
| 11 | +A PR qualifies only if: |
| 12 | +- `headRefName` starts with `claude/` |
| 13 | +- At least one `Run Sweep` check has conclusion `SUCCESS` **or** `FAILURE` (i.e. the sweep was enabled and produced real results — not all skipped) |
| 14 | +- At least one check has conclusion `FAILURE`, `CANCELLED`, or `TIMED_OUT` |
| 15 | + |
| 16 | +`gh pr list --json statusCheckRollup` truncates rollups, so enumerate candidates first, then re-query each PR individually. |
| 17 | + |
| 18 | +```bash |
| 19 | +gh pr list --repo SemiAnalysisAI/InferenceX --state open --limit 200 \ |
| 20 | + --json number,title,headRefName \ |
| 21 | + --jq '.[] | select(.headRefName | startswith("claude/")) | "\(.number)\t\(.headRefName)\t\(.title)"' \ |
| 22 | + > /tmp/claude_pr_candidates.tsv |
| 23 | + |
| 24 | +: > /tmp/claude_prs_failing.tsv |
| 25 | +while IFS=$'\t' read -r pr branch title; do |
| 26 | + qualifies=$(gh pr view "$pr" --repo SemiAnalysisAI/InferenceX --json statusCheckRollup --jq ' |
| 27 | + def state: if (.conclusion // "") != "" then .conclusion else .status end; |
| 28 | + . as $p |
| 29 | + | ($p.statusCheckRollup | any(.workflowName == "Run Sweep" and (state == "SUCCESS" or state == "FAILURE"))) as $swept |
| 30 | + | ($p.statusCheckRollup | any(state == "FAILURE" or state == "CANCELLED" or state == "TIMED_OUT")) as $failed |
| 31 | + | ($swept and $failed)') |
| 32 | + if [ "$qualifies" = "true" ]; then |
| 33 | + printf '%s\t%s\t%s\n' "$pr" "$branch" "$title" >> /tmp/claude_prs_failing.tsv |
| 34 | + fi |
| 35 | +done < /tmp/claude_pr_candidates.tsv |
| 36 | +cat /tmp/claude_prs_failing.tsv |
| 37 | +``` |
| 38 | + |
| 39 | +Render the candidates as a markdown table with clickable PR links and **stop**. Confirm with the user which PRs to attempt fixes on (default: all). If none qualify, print a short no-results message and exit. |
| 40 | + |
| 41 | +## Step 2 — per-PR diagnosis & fix loop |
| 42 | + |
| 43 | +For each confirmed PR, run the following loop. Do **not** parallelize — keep state local and obvious. |
| 44 | + |
| 45 | +### 2a. Check out the PR branch in a worktree |
| 46 | + |
| 47 | +Use a worktree so the loop never disturbs the user's working tree: |
| 48 | + |
| 49 | +```bash |
| 50 | +git fetch origin "$BRANCH" |
| 51 | +WT="/tmp/fix-pr-$PR" |
| 52 | +rm -rf "$WT" |
| 53 | +git worktree add "$WT" "origin/$BRANCH" |
| 54 | +``` |
| 55 | + |
| 56 | +### 2b. Identify and download the failing run's logs |
| 57 | + |
| 58 | +The `Run Sweep` workflow produces many matrix jobs. Find the failing job(s) on the PR's head SHA, then download just the failing step logs to keep context small: |
| 59 | + |
| 60 | +```bash |
| 61 | +HEAD_SHA=$(gh pr view "$PR" --repo SemiAnalysisAI/InferenceX --json headRefOid --jq .headRefOid) |
| 62 | + |
| 63 | +# Find the most recent Run Sweep run for this commit |
| 64 | +RUN_ID=$(gh run list --repo SemiAnalysisAI/InferenceX --workflow "Run Sweep" \ |
| 65 | + --commit "$HEAD_SHA" --limit 1 --json databaseId --jq '.[0].databaseId') |
| 66 | + |
| 67 | +# Failing job IDs + names |
| 68 | +gh run view "$RUN_ID" --repo SemiAnalysisAI/InferenceX --json jobs \ |
| 69 | + --jq '.jobs[] | select(.conclusion == "failure") | "\(.databaseId)\t\(.name)"' \ |
| 70 | + > /tmp/failed_jobs_$PR.tsv |
| 71 | + |
| 72 | +# Failure-only log dump (uses --log-failed) |
| 73 | +gh run view "$RUN_ID" --repo SemiAnalysisAI/InferenceX --log-failed \ |
| 74 | + > /tmp/sweep_failed_log_$PR.txt |
| 75 | +wc -l /tmp/sweep_failed_log_$PR.txt |
| 76 | +``` |
| 77 | + |
| 78 | +If the log file is very large (>2000 lines), grep it for the actual error signatures before reading — common patterns: `Error`, `Traceback`, `RuntimeError`, `CUDA`, `HIP`, `OOM`, `assert`, `KeyError`, `ModuleNotFound`, `connection refused`, `exit code`, `failed to launch`. Read the surrounding context (~50 lines) around each hit. |
| 79 | + |
| 80 | +### 2c. Diagnose |
| 81 | + |
| 82 | +Inspect the PR diff (`git -C "$WT" diff origin/main...HEAD`) and the failing-log excerpts together. Most `claude/issue-1154-*` PRs are image-bump PRs that touch a `*.yaml` recipe — failures are usually: |
| 83 | + |
| 84 | +- Image tag typo / unavailable tag → fix the image reference. |
| 85 | +- Engine arg incompatibility with new image version → add/remove the affected flag in the recipe. |
| 86 | +- New required env var or container path → patch the recipe. |
| 87 | +- Resource ask too high for the runner → drop concurrency or tp. |
| 88 | +- Flaky infra (network, runner pickup) → not a code fix; flag and skip. |
| 89 | + |
| 90 | +State the suspected root cause in one or two sentences before proposing any edit. |
| 91 | + |
| 92 | +### 2d. Apply a minimal fix, then push |
| 93 | + |
| 94 | +Make the smallest possible edit that addresses the diagnosed cause. Run any local validation that's cheap (e.g. `python -c "import yaml; yaml.safe_load(open('<file>'))"`). Then: |
| 95 | + |
| 96 | +```bash |
| 97 | +git -C "$WT" add -A |
| 98 | +git -C "$WT" -c user.name="claude-fix-bot" -c user.email="claude-fix-bot@local" commit -m "fix(<recipe>): <one-line root cause summary>" |
| 99 | +``` |
| 100 | + |
| 101 | +**Show the diff to the user and ask for confirmation before pushing.** On confirm: |
| 102 | + |
| 103 | +```bash |
| 104 | +git -C "$WT" push origin "HEAD:$BRANCH" |
| 105 | +git worktree remove "$WT" |
| 106 | +``` |
| 107 | + |
| 108 | +If diagnosis is inconclusive (e.g. infra flake, unclear log, or the fix would be too large to be a one-shot patch), do **not** push. Record the PR as "needs human triage" with a short note on why. |
| 109 | + |
| 110 | +## Step 3 — final report |
| 111 | + |
| 112 | +Print a summary table: |
| 113 | + |
| 114 | +| PR | Action | Note | |
| 115 | +|---|---|---| |
| 116 | +| [#NNNN](https://github.com/SemiAnalysisAI/InferenceX/pull/NNNN) | fix pushed (`<sha>`) | one-line diagnosis | |
| 117 | +| [#NNNN](https://github.com/SemiAnalysisAI/InferenceX/pull/NNNN) | skipped | reason (flake / unclear / too large) | |
| 118 | + |
| 119 | +Do **not** merge anything. The pushed commit will re-trigger sweep on the PR; review results via `/list-claude-pr-status` later. |
0 commit comments