Commit 68c7766

[github-actions] Retry transient workflow failures (#175) (#176)
* [github-actions] Retry transient workflow failures (#175)
* Update wiki submodule pointer for PR #176

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
1 parent 02c737d commit 68c7766

5 files changed

Lines changed: 202 additions & 2 deletions

.github/wiki

Submodule wiki updated from 2a1fdf6 to 250cdc0
Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
+name: Retry Transient Workflow Failures
+
+on:
+  workflow_run:
+    workflows:
+      - Project Board Automation
+      - Changelog Automation
+      - Pull Request Label Sync
+      - Generate Reports and Deploy to GitHub Pages
+      - Rigorous Pull Request Review
+      - Run PHPUnit Tests
+      - Maintain Wiki
+      - Maintain Wiki Publication
+      - Update Wiki Preview
+      - Update Wiki
+    types:
+      - completed
+
+permissions:
+  actions: write
+  contents: read
+
+concurrency:
+  group: retry-transient-run-${{ github.event.workflow_run.id }}
+  cancel-in-progress: false
+
+jobs:
+  retry:
+    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
+    name: Retry Failed Jobs When GitHub Infrastructure Looks Transient
+    runs-on: ubuntu-latest
+
+    steps:
+      - id: retry
+        uses: actions/github-script@v8
+        with:
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          script: |
+            const transientPatterns = [
+              /RPC failed; HTTP 5\d\d/i,
+              /expected flush after ref listing/i,
+              /expected 'packfile'/i,
+              /remote:\s+Internal Server Error/i,
+              /requested URL returned error:\s*5\d\d/i,
+              /fatal:\s+unable to access 'https:\/\/github\.com\/.*': The requested URL returned error:\s*5\d\d/i,
+            ];
+
+            const [owner, repo] = process.env.GITHUB_REPOSITORY.split('/');
+            const runId = Number.parseInt(`${{ github.event.workflow_run.id }}`, 10);
+            const runAttempt = Number.parseInt(`${{ github.event.workflow_run.run_attempt }}`, 10);
+            const workflowName = `${{ github.event.workflow_run.name }}`;
+            const maxRunAttempts = 2;
+
+            const buildSummary = ({ status, failedJobs = [], matchedJobs = [] }) => {
+              const lines = [
+                '## Transient Failure Retry Summary',
+                '',
+                `- Workflow: \`${workflowName}\``,
+                `- Run ID: \`${runId}\``,
+                `- Run attempt: \`${runAttempt}\``,
+                `- Retry status: \`${status}\``,
+              ];
+
+              if (failedJobs.length > 0) {
+                lines.push(`- Failed jobs inspected: ${failedJobs.map((job) => `\`${job}\``).join(', ')}`);
+              }
+
+              if (matchedJobs.length > 0) {
+                lines.push(`- Jobs with transient GitHub failure signatures: ${matchedJobs.map((job) => `\`${job}\``).join(', ')}`);
+              }
+
+              if (status === 'rerun-requested') {
+                lines.push('- Action: Requested a rerun of failed jobs because every failed job matched transient GitHub-side error signatures.');
+              }
+
+              if (status === 'skipped-run-attempt-limit') {
+                lines.push('- Action: Skipped rerun because the run already reached the configured retry limit.');
+              }
+
+              if (status === 'skipped-no-failed-jobs') {
+                lines.push('- Action: Skipped rerun because the workflow reported failure without failed jobs to inspect.');
+              }
+
+              if (status === 'skipped-no-transient-match') {
+                lines.push('- Action: Skipped rerun because at least one failed job did not match the transient GitHub-side signatures.');
+              }
+
+              return lines.join('\n');
+            };
+
+            if (runAttempt >= maxRunAttempts) {
+              const summary = buildSummary({ status: 'skipped-run-attempt-limit' });
+              core.setOutput('status', 'skipped-run-attempt-limit');
+              core.setOutput('summary', summary);
+
+              return;
+            }
+
+            const jobsResponse = await github.rest.actions.listJobsForWorkflowRun({
+              owner,
+              repo,
+              run_id: runId,
+              per_page: 100,
+            });
+
+            const failedJobs = jobsResponse.data.jobs.filter((job) => job.conclusion === 'failure');
+
+            if (failedJobs.length === 0) {
+              const summary = buildSummary({ status: 'skipped-no-failed-jobs' });
+              core.setOutput('status', 'skipped-no-failed-jobs');
+              core.setOutput('summary', summary);
+
+              return;
+            }
+
+            const matchedJobs = [];
+
+            for (const job of failedJobs) {
+              const logsResponse = await fetch(`https://api.github.com/repos/${owner}/${repo}/actions/jobs/${job.id}/logs`, {
+                headers: {
+                  Accept: 'application/vnd.github+json',
+                  Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
+                  'X-GitHub-Api-Version': '2022-11-28',
+                },
+                redirect: 'follow',
+              });
+
+              if (!logsResponse.ok) {
+                throw new Error(`Failed to download logs for job ${job.name}: ${logsResponse.status} ${logsResponse.statusText}`);
+              }
+
+              const logText = await logsResponse.text();
+              const hasTransientMatch = transientPatterns.some((pattern) => pattern.test(logText));
+
+              if (!hasTransientMatch) {
+                const summary = buildSummary({
+                  status: 'skipped-no-transient-match',
+                  failedJobs: failedJobs.map((failedJob) => failedJob.name),
+                  matchedJobs,
+                });
+
+                core.setOutput('status', 'skipped-no-transient-match');
+                core.setOutput('summary', summary);
+
+                return;
+              }
+
+              matchedJobs.push(job.name);
+            }
+
+            await github.request('POST /repos/{owner}/{repo}/actions/runs/{run_id}/rerun-failed-jobs', {
+              owner,
+              repo,
+              run_id: runId,
+            });
+
+            const summary = buildSummary({
+              status: 'rerun-requested',
+              failedJobs: failedJobs.map((job) => job.name),
+              matchedJobs,
+            });
+
+            core.setOutput('status', 'rerun-requested');
+            core.setOutput('summary', summary);
+
+      - name: Write step summary
+        env:
+          RETRY_SUMMARY: ${{ steps.retry.outputs.summary }}
+        run: printf '%s\n' "$RETRY_SUMMARY" >> "$GITHUB_STEP_SUMMARY"
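
The log check in the workflow script can be exercised on its own, outside GitHub Actions. The following is a minimal sketch, assuming plain Node.js: the `transientPatterns` array is copied from the workflow, while the `looksTransient` helper and the two sample log strings are hypothetical and exist only for illustration.

```javascript
// Patterns copied from the workflow's script block.
const transientPatterns = [
  /RPC failed; HTTP 5\d\d/i,
  /expected flush after ref listing/i,
  /expected 'packfile'/i,
  /remote:\s+Internal Server Error/i,
  /requested URL returned error:\s*5\d\d/i,
  /fatal:\s+unable to access 'https:\/\/github\.com\/.*': The requested URL returned error:\s*5\d\d/i,
];

// Hypothetical helper (not part of the commit): true when any pattern
// matches somewhere in the job's log text.
function looksTransient(logText) {
  return transientPatterns.some((pattern) => pattern.test(logText));
}

// Illustrative log excerpts, not taken from a real run.
const checkoutLog = 'error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500';
const testLog = 'FAILURES!\nTests: 12, Assertions: 30, Failures: 1.';

console.log(looksTransient(checkoutLog)); // true
console.log(looksTransient(testLog));     // false
```

None of the patterns use the `g` flag, so repeated `test()` calls do not carry `lastIndex` state from one job's log to the next.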

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Changed

+- Retry failed GitHub Actions jobs once when failed workflow logs match transient GitHub-side checkout or transport errors (#175)
 - Teach the review and pull-request agent skills to treat workflow-managed wiki pointer updates as expected state and to prefer fresh follow-up issues plus PRs over reviving closed deleted branches (#147)
 - Require GitHub issue write readback verification in the github-issues skill (#165)
 - Standardize cache flags and nested cache-dir propagation across cache-aware commands (#162)

README.md

Lines changed: 5 additions & 1 deletion
@@ -201,7 +201,11 @@ and logged failures emit native workflow error annotations, including file and
 line metadata when commands provide it. The packaged tests, reports, wiki, and
 changelog workflows also append concise Markdown outcomes to
 `GITHUB_STEP_SUMMARY` so maintainers can scan versions, URLs, preview refs,
-verification status, and release results without expanding full logs.
+verification status, and release results without expanding full logs. This
+repository also keeps a bounded retry workflow that reruns failed jobs once
+when failed job logs match transient GitHub-side checkout or transport errors
+such as HTTP 500 fetch failures, while leaving genuine logic and quality
+failures untouched.

 When the packaged changelog workflow is synchronized into a consumer
 repository, pull requests are expected to add a notable changelog entry before

docs/usage/github-actions.rst

Lines changed: 26 additions & 0 deletions
@@ -249,3 +249,29 @@ Maintenance Workflows
 * Promotes all ``Release Prepared`` work into ``Released`` when the release-preparation pull request is merged and the GitHub release is published.
 * Uses the built-in workflow token for project updates.
 * **Label Sync**: Synchronizes repository labels with ecosystem standards.
+
+Transient Failure Retry
+-----------------------
+
+This repository also keeps a local ``retry-transient-failures.yml`` workflow
+that watches completed workflow runs and decides whether a failed run looks
+like a transient GitHub-side infrastructure problem rather than a logic bug in
+the workflow itself.
+
+**Behavior:**
+* Runs only after one of the repository's core workflows finishes with a
+  failure.
+* Inspects failed job logs for transient GitHub-side signatures such as
+  checkout or fetch HTTP 500 failures, Git transport RPC errors, and related
+  internal-server-error patterns.
+* Requests a rerun of failed jobs only when every failed job matches those
+  transient signatures.
+* Stops after one rerun attempt, so repeated failures still surface clearly
+  to maintainers.
+* Appends a deterministic summary describing whether a rerun was requested or
+  skipped.
+
+**Non-goals:**
+* It does not retry PHPUnit failures, lint failures, changelog validation,
+  or other logic or quality-signal regressions.
+* It does not introduce unbounded rerun loops.
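
The behavior and non-goals documented above reduce to a small decision function. The following is a minimal sketch, assuming each failed job's log text has already been downloaded. `decideRetry`, its argument shape, and the sample jobs are hypothetical. The status strings and the attempt limit of 2 mirror the workflow script, and the single regex stands in for the workflow's full pattern list.

```javascript
// Hypothetical pure version of the workflow's decision logic.
// `failedJobs` is an array of { name, log } objects; `isTransient` is any
// predicate over log text (the workflow itself tests a list of regexes).
function decideRetry({ runAttempt, failedJobs, isTransient, maxRunAttempts = 2 }) {
  // Bounded retries: a run that is already on its second attempt is left alone.
  if (runAttempt >= maxRunAttempts) return 'skipped-run-attempt-limit';

  // A run can report failure without exposing any failed job to inspect.
  if (failedJobs.length === 0) return 'skipped-no-failed-jobs';

  // Rerun only when every failed job looks transient.
  const allTransient = failedJobs.every((job) => isTransient(job.log));
  return allTransient ? 'rerun-requested' : 'skipped-no-transient-match';
}

// Simplified stand-in for the workflow's pattern list (assumption).
const isTransient = (log) => /requested URL returned error:\s*5\d\d/i.test(log);

// Illustrative jobs, not taken from a real run.
const flaky = { name: 'checkout', log: 'The requested URL returned error: 503' };
const broken = { name: 'phpunit', log: 'FAILURES! Tests: 12, Failures: 1.' };

console.log(decideRetry({ runAttempt: 1, failedJobs: [flaky], isTransient }));         // rerun-requested
console.log(decideRetry({ runAttempt: 1, failedJobs: [flaky, broken], isTransient })); // skipped-no-transient-match
console.log(decideRetry({ runAttempt: 2, failedJobs: [flaky], isTransient }));         // skipped-run-attempt-limit
```

Requiring every failed job to match, rather than any one of them, is what keeps a genuine test failure from being rerun when it happens alongside a flaky checkout in the same run.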
