[AIROCMLIR-376] Auto re-kick nightly/weekly builds on transient failures#2409
[AIROCMLIR-376] Auto re-kick nightly/weekly builds on transient failures#2409bogdan-petkovic wants to merge 7 commits into
Conversation
Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>
There was a problem hiding this comment.
Pull request overview
This PR adds automatic “re-kick” behavior to the MLIR nightly/weekly Jenkins pipeline so that builds which fail due to transient hardware/infrastructure issues can be retried automatically (up to a capped number of attempts), while preserving the existing behavior for genuine test/logic failures. It also surfaces the re-kick status in the Teams notification.
Changes:
- Add a log-tail classifier (
isTransientHardwareFailure) to distinguish transient infra/hardware failures from real test/logic failures. - Introduce
rekickEnabled(kill-switch) andrekickAttempt(counter) parameters, plus aMAX_REKICK_ATTEMPTScap, and trigger an asynchronousbuild job: ...re-run when eligible. - Add a breadcrumb in
withHealthyNodeand include “Auto re-kick” details in the Teams adaptive card.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Verdict: REQUEST_CHANGES -- submitted as COMMENT (automated reviews are advisory) · Findings: 1 (0 Critical, 1 Major, 0 Minor)
Scope
Single-file change to mlir/utils/jenkins/Jenkinsfile adding auto re-kick of official nightly/weekly builds that fail for transient hardware/infra reasons: a new isTransientHardwareFailure() log-tail classifier (pre-veto hardware loss / veto genuine failures / post-veto transient infra), a rekickParameters() forwarder, two @Field constants, a rekickEnabled kill-switch + internal rekickAttempt param, a Teams-card line, and the gated build job: re-trigger in the post block.
Findings
mlir/utils/jenkins/Jenkinsfile:2337— re-kick usesbuild job: jobNamewherejobName == env.JOB_NAMEis the full path (MLIR/mlir-nightly-all); thebuildstep resolves a non-absolute name relative to the current job's folder (MLIR/), so the full path likely resolves toMLIR/MLIR/mlir-nightly-alland fails to find the job. See inline comment.
Notes
- Self-termination is sound:
MAX_REKICK_ATTEMPTS = 2withrekickAttemptforwarded and incremented bounds the chain (0→1→2, then2 < 2is false). Re-kick is correctly gated onresult == 'FAILURE',rekickEnabled,isOfficialNightlyOrWeekly, andnightly||weekly, all of which also gatelogTextpopulation, so the classifier always sees the log. rekickParameters()forwards all 24 declared pipeline parameters — verified one-to-one against theparameters {}block (lines 1234-1300).- The
[withHealthyNode] TRANSIENTbreadcrumb (line 1228) matches the lowercased'[withhealthynode] transient'hardware signal, so the "no healthy node" cause is classifiable from the post block. - The veto-before-transient ordering is intentional (documented as collateral-failure defense); the pre-veto hardware list is the deliberate exception. No correctness issue, just a design tradeoff worth watching during live monitoring.
- Test plan is post-merge live monitoring only (no pre-merge validation), which raises the cost of the
build job:path issue above — worth confirming before relying on the feature.
CI status
No non-self CI checks report fail/cancel; the substantive jobs (Build and Test, MIGraphX, Jenkins, etc.) are still pending. The auto-review pipeline's own check being in-progress is expected.
…tify Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>
There was a problem hiding this comment.
Verdict: APPROVE -- submitted as COMMENT (automated reviews are advisory) · New findings: 0 (0 Critical, 0 Major, 0 Minor)
Scope
Adds auto re-kick of official nightly/weekly Jenkins builds (MLIR/mlir-nightly-all, MLIR/mlir-weekly) on transient hardware/infra failures, in mlir/utils/jenkins/Jenkinsfile. Introduces an isTransientHardwareFailure classifier (pre-veto hardware loss / veto genuine failures / post-veto transient infra), a bounded rekickAttempt counter (cap MAX_REKICK_ATTEMPTS = 2), a rekickEnabled kill-switch, a Teams-card re-kick line, and a withHealthyNode breadcrumb.
Findings
No blocking issues found.
The previously flagged issue (re-kick used a folder-relative job: path that would resolve to MLIR/MLIR/... and never re-trigger) is resolved: line 2344 now uses the absolute build job: "/${jobName}".
The three Copilot-flagged items were also addressed in this revision:
- Negative
rekickAttemptis now clamped viaMath.max(0, ...)(line 2314). - The Teams
node('build-only')block is wrapped in try/catch so a notification failure no longer blocks the re-kick (lines 2333-2339). FAILURE_LOG_TAIL_BYTESwas renamed toFAILURE_LOG_TAIL_CHARSwith a corrected comment (line 14), matching its character-basedsubstringusage.
Notes
- Spot-checked classifier wiring:
willRekickis only evaluated forresult == 'FAILURE'whenlogTextis populated (lines 2292-2316); the breadcrumb echo[withHealthyNode] TRANSIENTlowercases to the signal[withhealthynode] transient, so the "no healthy node" cause is classifiable. Veto ordering (hardware-loss before genuine-failure before transient) matches the PR description. isRetriableScmCheckoutErrorreused by the classifier (line 76) is pre-existing (referenced at line 204), so the reuse is sound and DRY.- The bulk of the PR review checklist targets C++/MLIR sources and does not apply to this Groovy pipeline change.
CI status
No non-self checks report fail/cancel; premerge checks were still in progress at review time. The pipeline's own review check being in-progress is expected.
justinrosner
left a comment
There was a problem hiding this comment.
Does this mean that we are now going to have multiple notifications in the CI channel for the same nightly? Since we have CI notifications for rocmlirTriton and rocMLIR, this could get very polluted very quickly. Would it make sense to only post the teams notification once (i.e., the result of the re-kicked run)?
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #2409 +/- ##
===========================================
+ Coverage 82.57% 82.68% +0.12%
===========================================
Files 120 120
Lines 42852 42828 -24
Branches 7110 7106 -4
===========================================
+ Hits 35381 35411 +30
+ Misses 4815 4790 -25
+ Partials 2656 2627 -29
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Right now it'd fire a card for the run that fails and then another for the re-kicked one, so 2-3 per nightly, maybe too noisy. I'll just skip the notification when a run is going to be re-kicked and only let the final run post, so it's one card. I'll still add a short note on it like "re-kicked once after a GPU hang, passed now" so we don't lose why it happened. Same thing for weekly. What do you think? |
Yeah, I think that would be great! |
Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>
There was a problem hiding this comment.
Verdict: APPROVE -- submitted as COMMENT (automated reviews are advisory) · New findings: 0 (0 Critical, 0 Major, 0 Minor)
Scope
Updates mlir/utils/jenkins/Jenkinsfile to auto re-kick official nightly/weekly builds (MLIR/mlir-nightly-all, MLIR/mlir-weekly) that fail for transient hardware/infra reasons, while vetoing genuine test/logic failures. Adds an isTransientHardwareFailure classifier, a bounded re-kick (MAX_REKICK_ATTEMPTS = 2) with a forwarded rekickAttempt counter, a rekickEnabled kill-switch, and a Teams-card re-kick breadcrumb.
Findings
No blocking issues found. All four findings from the previous review round are addressed in this revision:
- Re-kick now targets an absolute job path (
build job: "/${jobName}", line 2331), fixing the folder-relative resolution bug. rekickAttemptis clamped non-negative (Math.max(0, ...), line 2320), closing the negative-counter cap bypass.- The Teams-notification
node('build-only')block is wrapped in try/catch (lines 2355-2361) so notification failures no longer block the re-kick. - The tail-window constant was renamed
FAILURE_LOG_TAIL_CHARS(line 14) to match its character-based usage.
Notes
rekickParametersforwards all 25 declared pipeline parameters, so a re-kicked run preserves the original configuration; no parameter silently resets to its default.- The classifier's tail is lowercased before matching, and
isRetriableScmCheckoutErrormatches lowercased signals, so case handling is consistent. - Suppression logic yields exactly one Teams card per logical build chain: cards are suppressed only on a successful re-kick handoff, and the attempt cap guarantees a terminal card once
rekickAttemptreaches the limit. - This is a Jenkins/Groovy CI script, so the C++/MLIR-specific checklist items (license headers, op verifiers, Lit/E2E coverage) do not apply.
CI status
No non-self checks report fail or cancel; premerge C/C++, Python format/lint, and Python performance script checks pass, with the remaining Jenkins/Windows checks still pending.
|
|
||
| // Forwards all current parameters unchanged, with the incremented re-kick attempt counter. | ||
| List rekickParameters(int nextAttempt, String reason) { | ||
| return [ |
There was a problem hiding this comment.
Looks like it will rekick entire job instead of just rekicking on the server on which it is failing on. Would it be possible to just rekick on the failing server ?
There was a problem hiding this comment.
I pulled the full nightly/weekly console logs via the Jenkins API and went through the failures. They split into two groups:
- Per-server transients (occur inside a single matrix row, on one node):
- GPU lost/hung: hipErrorNoDevice, GPU Hang, unable to reset GPU, dead-GPU garbage arch (unsupported hip gpu architecture: n/a)
- Node/agent died mid-run: removed or offline, AgentOfflineException, launcher creation, broken pipe/closed channel, exit code -1/-2
- Docker/OOM on the node: failed to run image, OutOfMemoryError, ninja exit 137
- SCM checkout transients: connection reset, curl 18, transfer closed, sideband, bad pack header, clone/fetch errors
- Whole-job transients (no single server to retry on):
- no healthy node found for an arch (withHealthyNode exhausted all candidate nodes)
- Controller / queue / pipeline-level issues, not tied to a specific row
Your suggestion, retry only on the failing server is doable for the first group: we'd extend withHealthyNode to treat those as node-specific, blacklist the node and re-run that arch on a fresh one, which also avoids failFast killing the other arches. For the second group there's no per-server retry possible, so a whole-job re-kick stays the only option.
I'd propose a hybrid, per-server retry for the first group, and keep the whole-job re-kick as the fallback for the second group.
Also, re #1806 I think I originally scoped it as whole-job re-kick on any transient, this per-server refinement is better, so what do you think about that hybrid approach?
Motivation
Nightly and weekly CI builds frequently fail for reasons that have nothing to do with the code under test — a GPU drops off the bus, an agent goes offline, a docker pull or SCM checkout times out, a node OOMs. Today these spurious failures require a human to notice and manually re-trigger the job, which wastes time and pollutes the failure signal. This PR automatically re-kicks nightly/weekly builds that fail for transient hardware/infrastructure reasons, while leaving genuine test/logic failures untouched.
Technical Details
New isTransientHardwareFailure(logText) classifier scans the last 1 MB of the build log (the decisive cause sits at the end) using a three-tier rule:
Re-kick is gated to the official MLIR/mlir-nightly-all and MLIR/mlir-weekly jobs only, bounded to MAX_REKICK_ATTEMPTS = 2 via a forwarded rekickAttempt counter parameter.
rekickEnabled boolean parameter (default true) is a kill-switch to disable re-kicks from the Jenkins UI without a revert.
The re-kick decision is surfaced on the Teams notification card.
A breadcrumb echo in withHealthyNode ensures the "no healthy node" cause is in the log before the stage fails, so the post block can classify it.
Test Plan
Test Result
Submission Checklist