Skip to content

Commit 048f38f

Browse files
fix(workflows): make impl pipeline resilient to transient Claude failures (#5410)
## Summary The implementation pipeline was leaving PRs and issues stuck after a single Claude Code Action hiccup. Three fixes restore self-healing behavior: - **`impl-generate.yml`**: cap raised from **2 → 3** generation attempts, aligning with the existing `impl:{lib}:failed` label description (*"max retries exhausted (3 attempts)"*) and the repair phase's 3-attempt budget. Failure comments now read `Attempt N/3`. - **`impl-repair.yml`**: previously had no failure handler — when the Claude Code Action itself crashed, the workflow ended with `ai-rejected` already removed and re-review never fired, leaving the PR silently stuck. Added a `Handle repair failure` step that restores `ai-rejected` and auto-retries the same attempt **once** via a marker comment, then falls back to manual. - **`impl-review.yml`**: both failure paths (Claude crash → `Handle review failure`, and score=0 from missing `quality_score.txt` → `Validate review output`) immediately surfaced `ai-review-failed`, requiring manual rerun. Both now auto-retry **once** via `repository_dispatch` with a shared marker comment before giving up. The `>=50% after 3 attempts` merge logic in `impl-review.yml` was already correct and is unchanged — these fixes only ensure PRs reach that gate instead of stalling earlier. ## Concrete trigger (not added to the PR but motivated it) Issue #5365 (`area-mountain-panorama`) had **4/9 libraries hard-failed** without ever creating a PR (transient Claude crashes during generate, capped at 2 attempts), **1 PR stuck** with `ai-review-failed` (plotnine #5372), and **1 PR stuck** mid-repair (altair #5370 — repair workflow itself crashed on attempt 1). Manual recovery was triggered earlier in the conversation. ## Test plan - [ ] Trigger a generate that fails twice (e.g., simulate or wait for transient flake) — should auto-retry to attempt 3 instead of stopping at 2 - [ ] Trigger a repair where Claude Code Action crashes — should restore `ai-rejected` and auto-retry the same attempt once via marker comment - [ ] Trigger a review where Claude crashes — should auto-retry via `repository_dispatch` once before adding `ai-review-failed` - [ ] Trigger a review where Claude runs but writes no `quality_score.txt` — same auto-retry behavior - [ ] Verify markers prevent infinite retry loops (each marker only allows one auto-retry) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 76c2880 commit 048f38f

3 files changed

Lines changed: 137 additions & 29 deletions

File tree

.github/workflows/impl-generate.yml

Lines changed: 18 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -755,16 +755,22 @@ jobs:
755755
run: |
756756
echo "::notice::Handling generation failure for $LIBRARY/$SPEC_ID"
757757
758-
# Count previous failures via hidden marker comments (more reliable than workflow runs)
758+
# Count previous failures via hidden marker comments (more reliable than workflow runs).
759+
# Paginate so the marker is found even on issues with >30 comments
760+
# (which is common because all 9 library impls land on the same issue).
759761
MARKER="<!-- impl-fail:${SPEC_ID}:${LIBRARY} -->"
760-
FAILURE_COUNT=$(gh api "repos/${{ github.repository }}/issues/${ISSUE}/comments" \
761-
--jq "[.[] | select(.body | contains(\"$MARKER\"))] | length" 2>/dev/null || echo "0")
762+
FAILURE_COUNT=$(gh api --paginate "repos/${{ github.repository }}/issues/${ISSUE}/comments?per_page=100" \
763+
--jq "[.[] | select(.body != null and (.body | contains(\"$MARKER\")))] | length" 2>/dev/null || echo "0")
762764
763765
echo "::notice::Previous failures for ${LIBRARY}/${SPEC_ID}: $FAILURE_COUNT"
764766
765-
# After 1 previous failure (= this is attempt 2) → mark as failed
766-
if [ "$FAILURE_COUNT" -ge 1 ]; then
767-
echo "::warning::Marking $LIBRARY as failed after 2 generation attempts"
767+
# FAILURE_COUNT counts marker comments BEFORE this run.
768+
# 0 → this is attempt 1 fail, 1 → attempt 2 fail, 2 → attempt 3 fail.
769+
ATTEMPT=$((FAILURE_COUNT + 1))
770+
771+
# After 2 previous failures (= this is attempt 3) → mark as failed
772+
if [ "$FAILURE_COUNT" -ge 2 ]; then
773+
echo "::warning::Marking $LIBRARY as failed after 3 generation attempts"
768774
769775
# Create failed label if needed
770776
gh label create "impl:${LIBRARY}:failed" --color "d73a4a" \
@@ -777,9 +783,9 @@ jobs:
777783
778784
# Post final failure comment with marker
779785
gh issue comment "$ISSUE" --body "${MARKER}
780-
## :x: ${LIBRARY} Failed (Attempt 2/2)
786+
## :x: ${LIBRARY} Failed (Attempt 3/3)
781787
782-
The **${LIBRARY}** implementation for \`${SPEC_ID}\` failed after 2 attempts.
788+
The **${LIBRARY}** implementation for \`${SPEC_ID}\` failed after 3 attempts.
783789
784790
**Reason:** Claude Code failed to create the implementation file.
785791
@@ -792,11 +798,11 @@ jobs:
792798
:robot: *[impl-generate](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})*"
793799
794800
else
795-
# First failure → post comment with marker and auto-retry
801+
# Attempt 1 or 2 failed → post comment with marker and auto-retry
796802
gh issue comment "$ISSUE" --body "${MARKER}
797-
## :warning: ${LIBRARY} Generation Failed (Attempt 1/2)
803+
## :warning: ${LIBRARY} Generation Failed (Attempt ${ATTEMPT}/3)
798804
799-
First attempt failed. Automatically retrying...
805+
Attempt ${ATTEMPT} failed. Automatically retrying...
800806
801807
---
802808
:robot: *[impl-generate](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})*"
@@ -810,5 +816,5 @@ jobs:
810816
-f library="${LIBRARY}" \
811817
-f issue_number="${ISSUE}"
812818
813-
echo "::notice::Triggered automatic retry for ${LIBRARY}/${SPEC_ID}"
819+
echo "::notice::Triggered automatic retry for ${LIBRARY}/${SPEC_ID} (attempt $((ATTEMPT + 1)))"
814820
fi

.github/workflows/impl-repair.yml

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -239,3 +239,57 @@ jobs:
239239
-f event_type=review-pr \
240240
-f "client_payload[pr_number]=$PR_NUM"
241241
echo "::notice::Triggered impl-review.yml via repository_dispatch for PR #$PR_NUM"
242+
243+
# ========================================================================
244+
# Failure handling: when the repair workflow itself crashes (e.g. Claude
245+
# Code Action transient failure), restore the ai-rejected label and
246+
# auto-retry once. After one auto-retry, fall back to manual.
247+
# ========================================================================
248+
- name: Handle repair failure
249+
if: failure()
250+
env:
251+
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
252+
PR_NUM: ${{ inputs.pr_number }}
253+
SPEC_ID: ${{ inputs.specification_id }}
254+
LIBRARY: ${{ inputs.library }}
255+
ATTEMPT: ${{ inputs.attempt }}
256+
run: |
257+
# Restore ai-rejected label that was removed at start of repair so the
258+
# PR state stays consistent (otherwise it looks "stuck approved").
259+
gh pr edit "$PR_NUM" --add-label "ai-rejected" 2>/dev/null || true
260+
261+
MARKER="<!-- repair-retry:${SPEC_ID}:${LIBRARY}:attempt-${ATTEMPT} -->"
262+
# Paginate so the marker is found even on PRs with >30 comments.
263+
RETRY_COUNT=$(gh api --paginate "repos/${{ github.repository }}/issues/${PR_NUM}/comments?per_page=100" \
264+
--jq "[.[] | select(.body != null and (.body | contains(\"$MARKER\")))] | length" 2>/dev/null || echo "0")
265+
266+
if [ "$RETRY_COUNT" -ge 1 ]; then
267+
echo "::error::Repair attempt ${ATTEMPT} crashed twice — giving up"
268+
gh pr comment "$PR_NUM" --body "${MARKER}
269+
## :x: Repair Workflow Crashed (Attempt ${ATTEMPT}/3, retry exhausted)
270+
271+
The repair workflow itself failed twice for this attempt — likely a persistent Claude Code Action issue.
272+
273+
**Manual restart:**
274+
\`\`\`
275+
gh workflow run impl-repair.yml -f pr_number=${PR_NUM} -f specification_id=${SPEC_ID} -f library=${LIBRARY} -f attempt=${ATTEMPT}
276+
\`\`\`
277+
278+
---
279+
:robot: *[impl-repair](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})*"
280+
else
281+
echo "::warning::Repair attempt ${ATTEMPT} crashed — auto-retrying once"
282+
gh pr comment "$PR_NUM" --body "${MARKER}
283+
## :wrench: Repair Workflow Crashed (Attempt ${ATTEMPT}/3) — Auto-Retrying
284+
285+
The repair workflow failed (probably a transient Claude Code Action issue). Automatically re-triggering this attempt...
286+
287+
---
288+
:robot: *[impl-repair](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})*"
289+
290+
gh workflow run impl-repair.yml \
291+
-f pr_number="$PR_NUM" \
292+
-f specification_id="$SPEC_ID" \
293+
-f library="$LIBRARY" \
294+
-f attempt="$ATTEMPT"
295+
fi

.github/workflows/impl-review.yml

Lines changed: 65 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -172,33 +172,53 @@ jobs:
172172
env:
173173
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
174174
PR_NUM: ${{ steps.pr.outputs.pr_number }}
175+
SPEC_ID: ${{ steps.pr.outputs.specification_id }}
176+
LIBRARY: ${{ steps.pr.outputs.library }}
175177
run: |
176-
echo "::error::AI Review did not produce valid output files"
177-
echo "::error::Expected quality_score.txt but got score=0"
178-
echo "::error::This indicates Claude Code Action ran but didn't complete the review task"
178+
echo "::error::AI Review did not produce valid output files (score=0)"
179179
180-
# Add ai-review-failed label so it's visible
181-
gh pr edit "$PR_NUM" --add-label "ai-review-failed" 2>/dev/null || true
180+
MARKER="<!-- review-retry:${SPEC_ID}:${LIBRARY} -->"
181+
# Paginate so the marker is found even on PRs with >30 comments.
182+
RETRY_COUNT=$(gh api --paginate "repos/${{ github.repository }}/issues/${PR_NUM}/comments?per_page=100" \
183+
--jq "[.[] | select(.body != null and (.body | contains(\"$MARKER\")))] | length" 2>/dev/null || echo "0")
182184
183-
# Post error comment on PR
184-
gh pr comment "$PR_NUM" --body "## :x: AI Review Failed
185+
if [ "$RETRY_COUNT" -ge 1 ]; then
186+
# Already auto-retried once → final fail, require manual rerun
187+
gh pr edit "$PR_NUM" --add-label "ai-review-failed" 2>/dev/null || true
188+
gh pr comment "$PR_NUM" --body "${MARKER}
189+
## :x: AI Review Failed (auto-retry exhausted)
185190
186-
The AI review action completed but did not produce valid output files.
191+
The AI review action completed but did not produce valid output files. Auto-retry already tried once.
187192
188193
**What happened:**
189194
- The Claude Code Action ran
190195
- No \`quality_score.txt\` file was created
191-
- No review data was extracted
192196
193-
**Action required:**
194-
Re-run the impl-review workflow manually:
197+
**Manual rerun:**
195198
\`\`\`
196199
gh workflow run impl-review.yml -f pr_number=$PR_NUM
197200
\`\`\`
198201
199202
---
200203
:robot: *[impl-review](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})*"
204+
exit 1
205+
fi
201206
207+
# First failure — post marker and auto-retry via repository_dispatch
208+
gh pr comment "$PR_NUM" --body "${MARKER}
209+
## :wrench: AI Review Produced No Score — Auto-Retrying
210+
211+
The Claude Code Action ran but didn't write \`quality_score.txt\`. Auto-retrying review once...
212+
213+
---
214+
:robot: *[impl-review](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})*"
215+
216+
gh api repos/${{ github.repository }}/dispatches \
217+
-f event_type=review-pr \
218+
-f "client_payload[pr_number]=$PR_NUM"
219+
echo "::notice::Auto-re-triggered impl-review.yml for PR #$PR_NUM"
220+
# Mark this run as failed so the run status reflects that no verdict
221+
# was produced. The auto-retry runs in a separate workflow run.
202222
exit 1
203223
204224
- name: Add quality score label
@@ -448,19 +468,47 @@ jobs:
448468
env:
449469
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
450470
PR_NUM: ${{ steps.pr.outputs.pr_number }}
471+
SPEC_ID: ${{ steps.pr.outputs.specification_id }}
472+
LIBRARY: ${{ steps.pr.outputs.library }}
451473
run: |
452-
gh pr edit "$PR_NUM" --add-label "ai-review-failed"
453-
gh pr comment "$PR_NUM" --body "## :warning: AI Review Failed
474+
MARKER="<!-- review-retry:${SPEC_ID}:${LIBRARY} -->"
475+
# Paginate so the marker is found even on PRs with >30 comments.
476+
RETRY_COUNT=$(gh api --paginate "repos/${{ github.repository }}/issues/${PR_NUM}/comments?per_page=100" \
477+
--jq "[.[] | select(.body != null and (.body | contains(\"$MARKER\")))] | length" 2>/dev/null || echo "0")
454478
455-
The AI review action failed or timed out.
479+
if [ "$RETRY_COUNT" -ge 1 ]; then
480+
gh pr edit "$PR_NUM" --add-label "ai-review-failed" 2>/dev/null || true
481+
gh pr comment "$PR_NUM" --body "${MARKER}
482+
## :x: AI Review Failed (auto-retry exhausted)
456483
457-
**Options:**
458-
1. Re-run the workflow manually
459-
2. Request manual human review
484+
The AI review action failed or timed out twice in a row.
485+
486+
**Manual rerun:**
487+
\`\`\`
488+
gh workflow run impl-review.yml -f pr_number=$PR_NUM
489+
\`\`\`
490+
491+
---
492+
:robot: *[impl-review](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})*"
493+
exit 0
494+
fi
495+
496+
gh pr comment "$PR_NUM" --body "${MARKER}
497+
## :wrench: AI Review Crashed — Auto-Retrying
498+
499+
The Claude Code Action failed or timed out. Auto-retrying review once...
460500
461501
---
462502
:robot: *[impl-review](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})*"
463503
504+
gh api repos/${{ github.repository }}/dispatches \
505+
-f event_type=review-pr \
506+
-f "client_payload[pr_number]=$PR_NUM"
507+
echo "::notice::Auto-re-triggered impl-review.yml for PR #$PR_NUM"
508+
# Mark this run as failed so the run status reflects that no verdict
509+
# was produced. The auto-retry runs in a separate workflow run.
510+
exit 1
511+
464512
- name: Add verdict label and take action
465513
if: steps.review.conclusion == 'success' && steps.score.outputs.score != '0'
466514
env:

0 commit comments

Comments
 (0)