[AIROCMLIR-376] Auto re-kick nightly/weekly builds on transient failures by bogdan-petkovic · Pull Request #2409 · ROCm/rocMLIR

bogdan-petkovic · 2026-06-15T10:50:39Z

Motivation

Nightly and weekly CI builds frequently fail for reasons that have nothing to do with the code under test — a GPU drops off the bus, an agent goes offline, a docker pull or SCM checkout times out, a node OOMs. Today these spurious failures require a human to notice and manually re-trigger the job, which wastes time and pollutes the failure signal. This PR automatically re-kicks nightly/weekly builds that fail for transient hardware/infrastructure reasons, while leaving genuine test/logic failures untouched.

Technical Details

New isTransientHardwareFailure(logText) classifier scans the last 1 MB of the build log (the decisive cause sits at the end) using a three-tier rule:
- Pre-veto (hardware loss): dead GPU, hipErrorNoDevice, GPU reset failures, no healthy node found, etc. These win even over the veto, because a lost GPU also makes lit report spurious failures.
- Veto (genuine failures): lit failed tests, FileCheck error: no match found / filecheck error, conv/perf sweep summaries, tuning detected errors, invalid MLIR — never re-kicked.
- Post-veto (transient infra): agent offline, docker failed to run image, OOM (137), broken pipe, GPU hang / HW exception, and retriable SCM checkout errors.
Re-kick is gated to the official MLIR/mlir-nightly-all and MLIR/mlir-weekly jobs only, bounded to MAX_REKICK_ATTEMPTS = 2 via a forwarded rekickAttempt counter parameter.
rekickEnabled boolean parameter (default true) is a kill-switch to disable re-kicks from the Jenkins UI without a revert.
The re-kick decision is surfaced on the Teams notification card.
A breadcrumb echo in withHealthyNode ensures the "no healthy node" cause is in the log before the stage fails, so the post block can classify it.

Test Plan

Post-merge: once merged to develop, the feature runs on the real MLIR/mlir-nightly-all and MLIR/mlir-weekly jobs. We will monitor the next nightly/weekly cycles via the Teams card — which now shows the re-kick decision and attempt count — to confirm that only genuine transient failures get re-kicked and that the attempt cap (MAX_REKICK_ATTEMPTS = 2) holds.
Kill-switch: the rekickEnabled parameter (default true) is intentionally exposed so that, if a misclassification or any other problem shows up during live monitoring, re-kicks can be disabled immediately from the Jenkins UI (uncheck the parameter on the next run) without reverting this PR.

Test Result

Live results to be observed on the first nightly/weekly cycles after merge; rekickEnabled provides an immediate off-switch if any adjustment is needed.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>

Copilot

Pull request overview

This PR adds automatic “re-kick” behavior to the MLIR nightly/weekly Jenkins pipeline so that builds which fail due to transient hardware/infrastructure issues can be retried automatically (up to a capped number of attempts), while preserving the existing behavior for genuine test/logic failures. It also surfaces the re-kick status in the Teams notification.

Changes:

Add a log-tail classifier (isTransientHardwareFailure) to distinguish transient infra/hardware failures from real test/logic failures.
Introduce rekickEnabled (kill-switch) and rekickAttempt (counter) parameters, plus a MAX_REKICK_ATTEMPTS cap, and trigger an asynchronous build job: ... re-run when eligible.
Add a breadcrumb in withHealthyNode and include “Auto re-kick” details in the Teams adaptive card.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

rocmlir-pr-reviewer

Verdict: REQUEST_CHANGES -- submitted as COMMENT (automated reviews are advisory) · Findings: 1 (0 Critical, 1 Major, 0 Minor)

Scope

Single-file change to mlir/utils/jenkins/Jenkinsfile adding auto re-kick of official nightly/weekly builds that fail for transient hardware/infra reasons: a new isTransientHardwareFailure() log-tail classifier (pre-veto hardware loss / veto genuine failures / post-veto transient infra), a rekickParameters() forwarder, two @Field constants, a rekickEnabled kill-switch + internal rekickAttempt param, a Teams-card line, and the gated build job: re-trigger in the post block.

Findings

mlir/utils/jenkins/Jenkinsfile:2337 — re-kick uses build job: jobName where jobName == env.JOB_NAME is the full path (MLIR/mlir-nightly-all); the build step resolves a non-absolute name relative to the current job's folder (MLIR/), so the full path likely resolves to MLIR/MLIR/mlir-nightly-all and fails to find the job. See inline comment.

Notes

Self-termination is sound: MAX_REKICK_ATTEMPTS = 2 with rekickAttempt forwarded and incremented bounds the chain (0→1→2, then 2 < 2 is false). Re-kick is correctly gated on result == 'FAILURE', rekickEnabled, isOfficialNightlyOrWeekly, and nightly||weekly, all of which also gate logText population, so the classifier always sees the log.
rekickParameters() forwards all 24 declared pipeline parameters — verified one-to-one against the parameters {} block (lines 1234-1300).
The [withHealthyNode] TRANSIENT breadcrumb (line 1228) matches the lowercased '[withhealthynode] transient' hardware signal, so the "no healthy node" cause is classifiable from the post block.
The veto-before-transient ordering is intentional (documented as collateral-failure defense); the pre-veto hardware list is the deliberate exception. No correctness issue, just a design tradeoff worth watching during live monitoring.
Test plan is post-merge live monitoring only (no pre-merge validation), which raises the cost of the build job: path issue above — worth confirming before relying on the feature.

CI status

No non-self CI checks report fail/cancel; the substantive jobs (Build and Test, MIGraphX, Jenkins, etc.) are still pending. The auto-review pipeline's own check being in-progress is expected.

…tify Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>

rocmlir-pr-reviewer

Verdict: APPROVE -- submitted as COMMENT (automated reviews are advisory) · New findings: 0 (0 Critical, 0 Major, 0 Minor)

Scope

Adds auto re-kick of official nightly/weekly Jenkins builds (MLIR/mlir-nightly-all, MLIR/mlir-weekly) on transient hardware/infra failures, in mlir/utils/jenkins/Jenkinsfile. Introduces an isTransientHardwareFailure classifier (pre-veto hardware loss / veto genuine failures / post-veto transient infra), a bounded rekickAttempt counter (cap MAX_REKICK_ATTEMPTS = 2), a rekickEnabled kill-switch, a Teams-card re-kick line, and a withHealthyNode breadcrumb.

Findings

No blocking issues found.

The previously flagged issue (re-kick used a folder-relative job: path that would resolve to MLIR/MLIR/... and never re-trigger) is resolved: line 2344 now uses the absolute build job: "/${jobName}".

The three Copilot-flagged items were also addressed in this revision:

Negative rekickAttempt is now clamped via Math.max(0, ...) (line 2314).
The Teams node('build-only') block is wrapped in try/catch so a notification failure no longer blocks the re-kick (lines 2333-2339).
FAILURE_LOG_TAIL_BYTES was renamed to FAILURE_LOG_TAIL_CHARS with a corrected comment (line 14), matching its character-based substring usage.

Notes

Spot-checked classifier wiring: willRekick is only evaluated for result == 'FAILURE' when logText is populated (lines 2292-2316); the breadcrumb echo [withHealthyNode] TRANSIENT lowercases to the signal [withhealthynode] transient, so the "no healthy node" cause is classifiable. Veto ordering (hardware-loss before genuine-failure before transient) matches the PR description.
isRetriableScmCheckoutError reused by the classifier (line 76) is pre-existing (referenced at line 204), so the reuse is sound and DRY.
The bulk of the PR review checklist targets C++/MLIR sources and does not apply to this Groovy pipeline change.

CI status

No non-self checks report fail/cancel; premerge checks were still in progress at review time. The pipeline's own review check being in-progress is expected.

justinrosner

Does this mean that we are now going to have multiple notifications in the CI channel for the same nightly? Since we have CI notifications for rocmlirTriton and rocMLIR, this could get very polluted very quickly. Would it make sense to only post the teams notification once (i.e., the result of the re-kicked run)?

codecov · 2026-06-15T17:51:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #2409      +/-   ##
===========================================
+ Coverage    82.57%   82.68%   +0.12%     
===========================================
  Files          120      120              
  Lines        42852    42828      -24     
  Branches      7110     7106       -4     
===========================================
+ Hits         35381    35411      +30     
+ Misses        4815     4790      -25     
+ Partials      2656     2627      -29

Flag	Coverage Δ
gfx120x	`82.57% <ø> (+0.04%)`	⬆️
gfx950	`82.42% <ø> (+0.07%)`	⬆️
mfma	`82.47% <ø> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.
see 28 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

bogdan-petkovic · 2026-06-16T09:20:17Z

Does this mean that we are now going to have multiple notifications in the CI channel for the same nightly? Since we have CI notifications for rocmlirTriton and rocMLIR, this could get very polluted very quickly. Would it make sense to only post the teams notification once (i.e., the result of the re-kicked run)?

Right now it'd fire a card for the run that fails and then another for the re-kicked one, so 2-3 per nightly, maybe too noisy. I'll just skip the notification when a run is going to be re-kicked and only let the final run post, so it's one card. I'll still add a short note on it like "re-kicked once after a GPU hang, passed now" so we don't lose why it happened. Same thing for weekly. What do you think?

justinrosner · 2026-06-16T12:44:32Z

Does this mean that we are now going to have multiple notifications in the CI channel for the same nightly? Since we have CI notifications for rocmlirTriton and rocMLIR, this could get very polluted very quickly. Would it make sense to only post the teams notification once (i.e., the result of the re-kicked run)?

Right now it'd fire a card for the run that fails and then another for the re-kicked one, so 2-3 per nightly, maybe too noisy. I'll just skip the notification when a run is going to be re-kicked and only let the final run post, so it's one card. I'll still add a short note on it like "re-kicked once after a GPU hang, passed now" so we don't lose why it happened. Same thing for weekly. What do you think?

Yeah, I think that would be great!

Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>

…tly-weekly

rocmlir-pr-reviewer

Verdict: APPROVE -- submitted as COMMENT (automated reviews are advisory) · New findings: 0 (0 Critical, 0 Major, 0 Minor)

Scope

Updates mlir/utils/jenkins/Jenkinsfile to auto re-kick official nightly/weekly builds (MLIR/mlir-nightly-all, MLIR/mlir-weekly) that fail for transient hardware/infra reasons, while vetoing genuine test/logic failures. Adds an isTransientHardwareFailure classifier, a bounded re-kick (MAX_REKICK_ATTEMPTS = 2) with a forwarded rekickAttempt counter, a rekickEnabled kill-switch, and a Teams-card re-kick breadcrumb.

Findings

No blocking issues found. All four findings from the previous review round are addressed in this revision:

Re-kick now targets an absolute job path (build job: "/${jobName}", line 2331), fixing the folder-relative resolution bug.
rekickAttempt is clamped non-negative (Math.max(0, ...), line 2320), closing the negative-counter cap bypass.
The Teams-notification node('build-only') block is wrapped in try/catch (lines 2355-2361) so notification failures no longer block the re-kick.
The tail-window constant was renamed FAILURE_LOG_TAIL_CHARS (line 14) to match its character-based usage.

Notes

rekickParameters forwards all 25 declared pipeline parameters, so a re-kicked run preserves the original configuration; no parameter silently resets to its default.
The classifier's tail is lowercased before matching, and isRetriableScmCheckoutError matches lowercased signals, so case handling is consistent.
Suppression logic yields exactly one Teams card per logical build chain: cards are suppressed only on a successful re-kick handoff, and the attempt cap guarantees a terminal card once rekickAttempt reaches the limit.
This is a Jenkins/Groovy CI script, so the C++/MLIR-specific checklist items (license headers, op verifiers, Lit/E2E coverage) do not apply.

CI status

No non-self checks report fail or cancel; premerge C/C++, Python format/lint, and Python performance script checks pass, with the remaining Jenkins/Windows checks still pending.

…tly-weekly

umangyadav · 2026-06-17T16:10:59Z

+
+// Forwards all current parameters unchanged, with the incremented re-kick attempt counter.
+List rekickParameters(int nextAttempt, String reason) {
+    return [


Looks like it will rekick entire job instead of just rekicking on the server on which it is failing on. Would it be possible to just rekick on the failing server ?

I pulled the full nightly/weekly console logs via the Jenkins API and went through the failures. They split into two groups:

Per-server transients (occur inside a single matrix row, on one node):

GPU lost/hung: hipErrorNoDevice, GPU Hang, unable to reset GPU, dead-GPU garbage arch (unsupported hip gpu architecture: n/a)

Node/agent died mid-run: removed or offline, AgentOfflineException, launcher creation, broken pipe/closed channel, exit code -1/-2

Docker/OOM on the node: failed to run image, OutOfMemoryError, ninja exit 137

SCM checkout transients: connection reset, curl 18, transfer closed, sideband, bad pack header, clone/fetch errors

Whole-job transients (no single server to retry on):

no healthy node found for an arch (withHealthyNode exhausted all candidate nodes)

Controller / queue / pipeline-level issues, not tied to a specific row

Your suggestion, retry only on the failing server is doable for the first group: we'd extend withHealthyNode to treat those as node-specific, blacklist the node and re-run that arch on a fresh one, which also avoids failFast killing the other arches. For the second group there's no per-server retry possible, so a whole-job re-kick stays the only option.
I'd propose a hybrid, per-server retry for the first group, and keep the whole-job re-kick as the fallback for the second group.
Also, re #1806 I think I originally scoped it as whole-job re-kick on any transient, this per-server refinement is better, so what do you think about that hybrid approach?

…tly-weekly

[AIROCMLIR-376] Auto re-kick nightly/weekly builds on transient failures

a1fc80d

Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>

bogdan-petkovic self-assigned this Jun 15, 2026

bogdan-petkovic added the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 15, 2026

bogdan-petkovic requested a review from Copilot June 15, 2026 12:33

Copilot started reviewing on behalf of bogdan-petkovic June 15, 2026 12:34 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Comment thread mlir/utils/jenkins/Jenkinsfile Outdated

Comment thread mlir/utils/jenkins/Jenkinsfile Outdated

Comment thread mlir/utils/jenkins/Jenkinsfile Outdated

rocmlir-pr-reviewer Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread mlir/utils/jenkins/Jenkinsfile Outdated

rocmlir-pr-reviewer Bot reviewed Jun 15, 2026

View reviewed changes

rocmlir-pr-reviewer Bot removed the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 15, 2026

[AIROCMLIR-376] Address review: clamp attempt, abs job path, guard no…

764db3c

…tify Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>

bogdan-petkovic added the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 15, 2026

rocmlir-pr-reviewer Bot reviewed Jun 15, 2026

View reviewed changes

rocmlir-pr-reviewer Bot removed the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 15, 2026

bogdan-petkovic requested review from dhernandez0, justinrosner, mirza-halilcevic and umangyadav June 15, 2026 14:13

justinrosner reviewed Jun 15, 2026

View reviewed changes

bogdan-petkovic marked this pull request as ready for review June 16, 2026 10:03

bogdan-petkovic requested a review from causten as a code owner June 16, 2026 10:03

bogdan-petkovic added 3 commits June 16, 2026 14:17

[AIROCMLIR-376] Post a single Teams card per re-kicked nightly/weekly

0ef766d

Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>

Merge branch 'develop' into users/bpetkovi/auto-rekick-transient-nigh…

1ba97b3

…tly-weekly

Merge branch 'develop' into users/bpetkovi/auto-rekick-transient-nigh…

42ff25d

…tly-weekly

bogdan-petkovic added the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 17, 2026

bogdan-petkovic requested a review from justinrosner June 17, 2026 10:10

rocmlir-pr-reviewer Bot reviewed Jun 17, 2026

View reviewed changes

rocmlir-pr-reviewer Bot removed the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 17, 2026

Merge branch 'develop' into users/bpetkovi/auto-rekick-transient-nigh…

f63acd9

…tly-weekly

umangyadav requested review from dorde-antic and leo-automation June 17, 2026 16:07

umangyadav reviewed Jun 17, 2026

View reviewed changes

Comment thread mlir/utils/jenkins/Jenkinsfile

umangyadav reviewed Jun 17, 2026

View reviewed changes

Merge branch 'develop' into users/bpetkovi/auto-rekick-transient-nigh…

acfb3d1

…tly-weekly

Uh oh!

Conversation

bogdan-petkovic commented Jun 15, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rocmlir-pr-reviewer Bot left a comment

Choose a reason for hiding this comment

Scope

Findings

Notes

CI status

Uh oh!

rocmlir-pr-reviewer Bot left a comment

Choose a reason for hiding this comment

Scope

Findings

Notes

CI status

Uh oh!

justinrosner left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

bogdan-petkovic commented Jun 16, 2026

Uh oh!

justinrosner commented Jun 16, 2026

Uh oh!

rocmlir-pr-reviewer Bot left a comment

Choose a reason for hiding this comment

Scope

Findings

Notes

CI status

Uh oh!

Uh oh!

umangyadav Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

bogdan-petkovic Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov Bot commented Jun 15, 2026 •

edited

Loading

bogdan-petkovic Jun 17, 2026 •

edited

Loading