Skip to content

Commit 3db0363

Browse files
authored
Merge branch 'main' into wzhao/update-dsv4-b200
2 parents 11d0585 + c9798a7 commit 3db0363

176 files changed

Lines changed: 18599 additions & 3015 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.
Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
---
2+
description: Find Claude-authored PRs with all-green full-sweep validation and confirm before merging
3+
---
4+
5+
Find open PRs authored by Claude (branches starting with `claude/`) whose full-sweep validation has completed all-green, then prompt the user before merging.
6+
7+
## Step 1 — list candidate `claude/*` PRs
8+
9+
`gh pr list --json statusCheckRollup` truncates each PR's rollup, so it can't be trusted for the per-check filter. Use it only to get the candidate numbers, then re-query each PR individually.
10+
11+
```bash
12+
gh pr list --repo SemiAnalysisAI/InferenceX --state open --limit 200 \
13+
--json number,title,headRefName \
14+
--jq '.[] | select(.headRefName | startswith("claude/")) | .number' \
15+
> /tmp/claude_pr_candidates.txt
16+
```
17+
18+
## Step 2 — per-PR all-green check
19+
20+
For each candidate, fetch the full rollup with `gh pr view`. A PR qualifies only if **all** of the following hold:
21+
22+
- No check has conclusion `FAILURE`, `CANCELLED`, or `TIMED_OUT`
23+
- No check has status `QUEUED`, `IN_PROGRESS`, or `PENDING` (sweep finished, not still running)
24+
- At least one `Run Sweep` check has conclusion `SUCCESS` (sweep actually ran — not all skipped)
25+
26+
Note: `gh` returns `conclusion: ""` (empty string, not `null`) for in-flight checks, so jq's `//` operator does **not** fall through to `.status`. Each check's effective state must be computed as `if conclusion is non-empty then conclusion else status`.
27+
28+
```bash
29+
: > /tmp/claude_prs_green.txt
30+
while read -r pr; do
31+
is_green=$(gh pr view "$pr" --repo SemiAnalysisAI/InferenceX --json statusCheckRollup --jq '
32+
def state: if (.conclusion // "") != "" then .conclusion else .status end;
33+
. as $p
34+
| ([$p.statusCheckRollup[] | state]) as $s
35+
| ($s | any(. == "FAILURE" or . == "CANCELLED" or . == "TIMED_OUT" or . == "QUEUED" or . == "IN_PROGRESS" or . == "PENDING")) as $bad
36+
| ([$p.statusCheckRollup[] | select(.workflowName == "Run Sweep" and (state) == "SUCCESS")] | length > 0) as $swept
37+
| (($bad | not) and $swept)')
38+
if [ "$is_green" = "true" ]; then
39+
title=$(gh pr view "$pr" --repo SemiAnalysisAI/InferenceX --json title --jq '.title')
40+
printf '%s\thttps://github.com/SemiAnalysisAI/InferenceX/pull/%s\t%s\n' "$pr" "$pr" "$title" >> /tmp/claude_prs_green.txt
41+
fi
42+
done < /tmp/claude_pr_candidates.txt
43+
cat /tmp/claude_prs_green.txt
44+
```
45+
46+
## Step 3 — present the list and ask for confirmation
47+
48+
Show the matching PRs as a table with PR number, link, and title. Then **stop and ask the user to confirm** before doing anything else. Do not auto-merge.
49+
50+
If the user confirms, invoke `/merge-prs <pr-numbers...>` with the confirmed PR numbers.
51+
If the user declines or wants a subset, run `/merge-prs` only on the subset they specify.
Lines changed: 119 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,119 @@
1+
---
2+
description: Triage failing Claude-authored PRs — read sweep logs, debug, and push a candidate fix per PR
3+
---
4+
5+
For each open Claude-authored PR (`claude/*` branch) whose full-sweep validation produced at least one **FAILED** check, fetch the failing run's logs, diagnose the root cause, and push a candidate fix to the PR's branch.
6+
7+
This command modifies remote PR branches. **Pause for user confirmation** after listing the candidate PRs and again before pushing each fix.
8+
9+
## Step 1 — find failing `claude/*` PRs whose sweep actually ran
10+
11+
A PR qualifies only if:
12+
- `headRefName` starts with `claude/`
13+
- At least one `Run Sweep` check has conclusion `SUCCESS` **or** `FAILURE` (i.e. the sweep was enabled and produced real results — not all skipped)
14+
- At least one check has conclusion `FAILURE`, `CANCELLED`, or `TIMED_OUT`
15+
16+
`gh pr list --json statusCheckRollup` truncates rollups, so enumerate candidates first, then re-query each PR individually.
17+
18+
```bash
19+
gh pr list --repo SemiAnalysisAI/InferenceX --state open --limit 200 \
20+
--json number,title,headRefName \
21+
--jq '.[] | select(.headRefName | startswith("claude/")) | "\(.number)\t\(.headRefName)\t\(.title)"' \
22+
> /tmp/claude_pr_candidates.tsv
23+
24+
: > /tmp/claude_prs_failing.tsv
25+
while IFS=$'\t' read -r pr branch title; do
26+
qualifies=$(gh pr view "$pr" --repo SemiAnalysisAI/InferenceX --json statusCheckRollup --jq '
27+
def state: if (.conclusion // "") != "" then .conclusion else .status end;
28+
. as $p
29+
| ($p.statusCheckRollup | any(.workflowName == "Run Sweep" and (state == "SUCCESS" or state == "FAILURE"))) as $swept
30+
| ($p.statusCheckRollup | any(state == "FAILURE" or state == "CANCELLED" or state == "TIMED_OUT")) as $failed
31+
| ($swept and $failed)')
32+
if [ "$qualifies" = "true" ]; then
33+
printf '%s\t%s\t%s\n' "$pr" "$branch" "$title" >> /tmp/claude_prs_failing.tsv
34+
fi
35+
done < /tmp/claude_pr_candidates.tsv
36+
cat /tmp/claude_prs_failing.tsv
37+
```
38+
39+
Render the candidates as a markdown table with clickable PR links and **stop**. Confirm with the user which PRs to attempt fixes on (default: all). If none qualify, print a short no-results message and exit.
40+
41+
## Step 2 — per-PR diagnosis & fix loop
42+
43+
For each confirmed PR, run the following loop. Do **not** parallelize — keep state local and obvious.
44+
45+
### 2a. Check out the PR branch in a worktree
46+
47+
Use a worktree so the loop never disturbs the user's working tree:
48+
49+
```bash
50+
git fetch origin "$BRANCH"
51+
WT="/tmp/fix-pr-$PR"
52+
rm -rf "$WT"
53+
git worktree add "$WT" "origin/$BRANCH"
54+
```
55+
56+
### 2b. Identify and download the failing run's logs
57+
58+
The `Run Sweep` workflow produces many matrix jobs. Find the failing job(s) on the PR's head SHA, then download just the failing step logs to keep context small:
59+
60+
```bash
61+
HEAD_SHA=$(gh pr view "$PR" --repo SemiAnalysisAI/InferenceX --json headRefOid --jq .headRefOid)
62+
63+
# Find the most recent Run Sweep run for this commit
64+
RUN_ID=$(gh run list --repo SemiAnalysisAI/InferenceX --workflow "Run Sweep" \
65+
--commit "$HEAD_SHA" --limit 1 --json databaseId --jq '.[0].databaseId')
66+
67+
# Failing job IDs + names
68+
gh run view "$RUN_ID" --repo SemiAnalysisAI/InferenceX --json jobs \
69+
--jq '.jobs[] | select(.conclusion == "failure") | "\(.databaseId)\t\(.name)"' \
70+
> /tmp/failed_jobs_$PR.tsv
71+
72+
# Failure-only log dump (uses --log-failed)
73+
gh run view "$RUN_ID" --repo SemiAnalysisAI/InferenceX --log-failed \
74+
> /tmp/sweep_failed_log_$PR.txt
75+
wc -l /tmp/sweep_failed_log_$PR.txt
76+
```
77+
78+
If the log file is very large (>2000 lines), grep it for the actual error signatures before reading — common patterns: `Error`, `Traceback`, `RuntimeError`, `CUDA`, `HIP`, `OOM`, `assert`, `KeyError`, `ModuleNotFound`, `connection refused`, `exit code`, `failed to launch`. Read the surrounding context (~50 lines) around each hit.
79+
80+
### 2c. Diagnose
81+
82+
Inspect the PR diff (`git -C "$WT" diff origin/main...HEAD`) and the failing-log excerpts together. Most `claude/issue-1154-*` PRs are image-bump PRs that touch a `*.yaml` recipe — failures are usually:
83+
84+
- Image tag typo / unavailable tag → fix the image reference.
85+
- Engine arg incompatibility with new image version → add/remove the affected flag in the recipe.
86+
- New required env var or container path → patch the recipe.
87+
- Resource ask too high for the runner → drop concurrency or tp.
88+
- Flaky infra (network, runner pickup) → not a code fix; flag and skip.
89+
90+
State the suspected root cause in one or two sentences before proposing any edit.
91+
92+
### 2d. Apply a minimal fix, then push
93+
94+
Make the smallest possible edit that addresses the diagnosed cause. Run any local validation that's cheap (e.g. `python -c "import yaml; yaml.safe_load(open('<file>'))"`). Then:
95+
96+
```bash
97+
git -C "$WT" add -A
98+
git -C "$WT" -c user.name="claude-fix-bot" -c user.email="claude-fix-bot@local" commit -m "fix(<recipe>): <one-line root cause summary>"
99+
```
100+
101+
**Show the diff to the user and ask for confirmation before pushing.** On confirm:
102+
103+
```bash
104+
git -C "$WT" push origin "HEAD:$BRANCH"
105+
git worktree remove "$WT"
106+
```
107+
108+
If diagnosis is inconclusive (e.g. infra flake, unclear log, or the fix would be too large to be a one-shot patch), do **not** push. Record the PR as "needs human triage" with a short note on why.
109+
110+
## Step 3 — final report
111+
112+
Print a summary table:
113+
114+
| PR | Action | Note |
115+
|---|---|---|
116+
| [#NNNN](https://github.com/SemiAnalysisAI/InferenceX/pull/NNNN) | fix pushed (`<sha>`) | one-line diagnosis |
117+
| [#NNNN](https://github.com/SemiAnalysisAI/InferenceX/pull/NNNN) | skipped | reason (flake / unclear / too large) |
118+
119+
Do **not** merge anything. The pushed commit will re-trigger sweep on the PR; review results via `/list-claude-pr-status` later.

0 commit comments

Comments
 (0)