Skip to content

[AIROCMLIR-376] Auto re-kick nightly/weekly builds on transient failures#2409

Open
bogdan-petkovic wants to merge 7 commits into
developfrom
users/bpetkovi/auto-rekick-transient-nightly-weekly
Open

[AIROCMLIR-376] Auto re-kick nightly/weekly builds on transient failures#2409
bogdan-petkovic wants to merge 7 commits into
developfrom
users/bpetkovi/auto-rekick-transient-nightly-weekly

Conversation

@bogdan-petkovic

Copy link
Copy Markdown
Contributor

Motivation

Nightly and weekly CI builds frequently fail for reasons that have nothing to do with the code under test — a GPU drops off the bus, an agent goes offline, a docker pull or SCM checkout times out, a node OOMs. Today these spurious failures require a human to notice and manually re-trigger the job, which wastes time and pollutes the failure signal. This PR automatically re-kicks nightly/weekly builds that fail for transient hardware/infrastructure reasons, while leaving genuine test/logic failures untouched.

Technical Details

  • New isTransientHardwareFailure(logText) classifier scans the last 1 MB of the build log (the decisive cause sits at the end) using a three-tier rule:

    • Pre-veto (hardware loss): dead GPU, hipErrorNoDevice, GPU reset failures, no healthy node found, etc. These win even over the veto, because a lost GPU also makes lit report spurious failures.
    • Veto (genuine failures): lit failed tests, FileCheck error: no match found / filecheck error, conv/perf sweep summaries, tuning detected errors, invalid MLIR — never re-kicked.
    • Post-veto (transient infra): agent offline, docker failed to run image, OOM (137), broken pipe, GPU hang / HW exception, and retriable SCM checkout errors.
  • Re-kick is gated to the official MLIR/mlir-nightly-all and MLIR/mlir-weekly jobs only, bounded to MAX_REKICK_ATTEMPTS = 2 via a forwarded rekickAttempt counter parameter.

  • rekickEnabled boolean parameter (default true) is a kill-switch to disable re-kicks from the Jenkins UI without a revert.

  • The re-kick decision is surfaced on the Teams notification card.

  • A breadcrumb echo in withHealthyNode ensures the "no healthy node" cause is in the log before the stage fails, so the post block can classify it.

Test Plan

  • Post-merge: once merged to develop, the feature runs on the real MLIR/mlir-nightly-all and MLIR/mlir-weekly jobs. We will monitor the next nightly/weekly cycles via the Teams card — which now shows the re-kick decision and attempt count — to confirm that only genuine transient failures get re-kicked and that the attempt cap (MAX_REKICK_ATTEMPTS = 2) holds.
  • Kill-switch: the rekickEnabled parameter (default true) is intentionally exposed so that, if a misclassification or any other problem shows up during live monitoring, re-kicks can be disabled immediately from the Jenkins UI (uncheck the parameter on the next run) without reverting this PR.

Test Result

  • Live results to be observed on the first nightly/weekly cycles after merge; rekickEnabled provides an immediate off-switch if any adjustment is needed.

Submission Checklist

Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>
@bogdan-petkovic bogdan-petkovic self-assigned this Jun 15, 2026
@bogdan-petkovic bogdan-petkovic added the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 15, 2026
@bogdan-petkovic bogdan-petkovic requested a review from Copilot June 15, 2026 12:33

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds automatic “re-kick” behavior to the MLIR nightly/weekly Jenkins pipeline so that builds which fail due to transient hardware/infrastructure issues can be retried automatically (up to a capped number of attempts), while preserving the existing behavior for genuine test/logic failures. It also surfaces the re-kick status in the Teams notification.

Changes:

  • Add a log-tail classifier (isTransientHardwareFailure) to distinguish transient infra/hardware failures from real test/logic failures.
  • Introduce rekickEnabled (kill-switch) and rekickAttempt (counter) parameters, plus a MAX_REKICK_ATTEMPTS cap, and trigger an asynchronous build job: ... re-run when eligible.
  • Add a breadcrumb in withHealthyNode and include “Auto re-kick” details in the Teams adaptive card.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mlir/utils/jenkins/Jenkinsfile Outdated
Comment thread mlir/utils/jenkins/Jenkinsfile Outdated
Comment thread mlir/utils/jenkins/Jenkinsfile Outdated
Comment thread mlir/utils/jenkins/Jenkinsfile Outdated

@rocmlir-pr-reviewer rocmlir-pr-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: REQUEST_CHANGES -- submitted as COMMENT (automated reviews are advisory)  ·  Findings: 1 (0 Critical, 1 Major, 0 Minor)


Scope

Single-file change to mlir/utils/jenkins/Jenkinsfile adding auto re-kick of official nightly/weekly builds that fail for transient hardware/infra reasons: a new isTransientHardwareFailure() log-tail classifier (pre-veto hardware loss / veto genuine failures / post-veto transient infra), a rekickParameters() forwarder, two @Field constants, a rekickEnabled kill-switch + internal rekickAttempt param, a Teams-card line, and the gated build job: re-trigger in the post block.

Findings

  • mlir/utils/jenkins/Jenkinsfile:2337 — re-kick uses build job: jobName where jobName == env.JOB_NAME is the full path (MLIR/mlir-nightly-all); the build step resolves a non-absolute name relative to the current job's folder (MLIR/), so the full path likely resolves to MLIR/MLIR/mlir-nightly-all and fails to find the job. See inline comment.

Notes

  • Self-termination is sound: MAX_REKICK_ATTEMPTS = 2 with rekickAttempt forwarded and incremented bounds the chain (0→1→2, then 2 < 2 is false). Re-kick is correctly gated on result == 'FAILURE', rekickEnabled, isOfficialNightlyOrWeekly, and nightly||weekly, all of which also gate logText population, so the classifier always sees the log.
  • rekickParameters() forwards all 24 declared pipeline parameters — verified one-to-one against the parameters {} block (lines 1234-1300).
  • The [withHealthyNode] TRANSIENT breadcrumb (line 1228) matches the lowercased '[withhealthynode] transient' hardware signal, so the "no healthy node" cause is classifiable from the post block.
  • The veto-before-transient ordering is intentional (documented as collateral-failure defense); the pre-veto hardware list is the deliberate exception. No correctness issue, just a design tradeoff worth watching during live monitoring.
  • Test plan is post-merge live monitoring only (no pre-merge validation), which raises the cost of the build job: path issue above — worth confirming before relying on the feature.

CI status

No non-self CI checks report fail/cancel; the substantive jobs (Build and Test, MIGraphX, Jenkins, etc.) are still pending. The auto-review pipeline's own check being in-progress is expected.

@rocmlir-pr-reviewer rocmlir-pr-reviewer Bot removed the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 15, 2026
…tify

Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>
@bogdan-petkovic bogdan-petkovic added the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 15, 2026

@rocmlir-pr-reviewer rocmlir-pr-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: APPROVE -- submitted as COMMENT (automated reviews are advisory)  ·  New findings: 0 (0 Critical, 0 Major, 0 Minor)


Scope

Adds auto re-kick of official nightly/weekly Jenkins builds (MLIR/mlir-nightly-all, MLIR/mlir-weekly) on transient hardware/infra failures, in mlir/utils/jenkins/Jenkinsfile. Introduces an isTransientHardwareFailure classifier (pre-veto hardware loss / veto genuine failures / post-veto transient infra), a bounded rekickAttempt counter (cap MAX_REKICK_ATTEMPTS = 2), a rekickEnabled kill-switch, a Teams-card re-kick line, and a withHealthyNode breadcrumb.

Findings

No blocking issues found.

The previously flagged issue (re-kick used a folder-relative job: path that would resolve to MLIR/MLIR/... and never re-trigger) is resolved: line 2344 now uses the absolute build job: "/${jobName}".

The three Copilot-flagged items were also addressed in this revision:

  • Negative rekickAttempt is now clamped via Math.max(0, ...) (line 2314).
  • The Teams node('build-only') block is wrapped in try/catch so a notification failure no longer blocks the re-kick (lines 2333-2339).
  • FAILURE_LOG_TAIL_BYTES was renamed to FAILURE_LOG_TAIL_CHARS with a corrected comment (line 14), matching its character-based substring usage.

Notes

  • Spot-checked classifier wiring: willRekick is only evaluated for result == 'FAILURE' when logText is populated (lines 2292-2316); the breadcrumb echo [withHealthyNode] TRANSIENT lowercases to the signal [withhealthynode] transient, so the "no healthy node" cause is classifiable. Veto ordering (hardware-loss before genuine-failure before transient) matches the PR description.
  • isRetriableScmCheckoutError reused by the classifier (line 76) is pre-existing (referenced at line 204), so the reuse is sound and DRY.
  • The bulk of the PR review checklist targets C++/MLIR sources and does not apply to this Groovy pipeline change.

CI status

No non-self checks report fail/cancel; premerge checks were still in progress at review time. The pipeline's own review check being in-progress is expected.

@rocmlir-pr-reviewer rocmlir-pr-reviewer Bot removed the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 15, 2026

@justinrosner justinrosner left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we are now going to have multiple notifications in the CI channel for the same nightly? Since we have CI notifications for rocmlirTriton and rocMLIR, this could get very polluted very quickly. Would it make sense to only post the teams notification once (i.e., the result of the re-kicked run)?

@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2409      +/-   ##
===========================================
+ Coverage    82.57%   82.68%   +0.12%     
===========================================
  Files          120      120              
  Lines        42852    42828      -24     
  Branches      7110     7106       -4     
===========================================
+ Hits         35381    35411      +30     
+ Misses        4815     4790      -25     
+ Partials      2656     2627      -29     
Flag Coverage Δ
gfx120x 82.57% <ø> (+0.04%) ⬆️
gfx950 82.42% <ø> (+0.07%) ⬆️
mfma 82.47% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.
see 28 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bogdan-petkovic

Copy link
Copy Markdown
Contributor Author

Does this mean that we are now going to have multiple notifications in the CI channel for the same nightly? Since we have CI notifications for rocmlirTriton and rocMLIR, this could get very polluted very quickly. Would it make sense to only post the teams notification once (i.e., the result of the re-kicked run)?

Right now it'd fire a card for the run that fails and then another for the re-kicked one, so 2-3 per nightly, maybe too noisy. I'll just skip the notification when a run is going to be re-kicked and only let the final run post, so it's one card. I'll still add a short note on it like "re-kicked once after a GPU hang, passed now" so we don't lose why it happened. Same thing for weekly. What do you think?

@bogdan-petkovic bogdan-petkovic marked this pull request as ready for review June 16, 2026 10:03
@bogdan-petkovic bogdan-petkovic requested a review from causten as a code owner June 16, 2026 10:03
@justinrosner

Copy link
Copy Markdown
Contributor

Does this mean that we are now going to have multiple notifications in the CI channel for the same nightly? Since we have CI notifications for rocmlirTriton and rocMLIR, this could get very polluted very quickly. Would it make sense to only post the teams notification once (i.e., the result of the re-kicked run)?

Right now it'd fire a card for the run that fails and then another for the re-kicked one, so 2-3 per nightly, maybe too noisy. I'll just skip the notification when a run is going to be re-kicked and only let the final run post, so it's one card. I'll still add a short note on it like "re-kicked once after a GPU hang, passed now" so we don't lose why it happened. Same thing for weekly. What do you think?

Yeah, I think that would be great!

@bogdan-petkovic bogdan-petkovic added the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 17, 2026

@rocmlir-pr-reviewer rocmlir-pr-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: APPROVE -- submitted as COMMENT (automated reviews are advisory)  ·  New findings: 0 (0 Critical, 0 Major, 0 Minor)


Scope

Updates mlir/utils/jenkins/Jenkinsfile to auto re-kick official nightly/weekly builds (MLIR/mlir-nightly-all, MLIR/mlir-weekly) that fail for transient hardware/infra reasons, while vetoing genuine test/logic failures. Adds an isTransientHardwareFailure classifier, a bounded re-kick (MAX_REKICK_ATTEMPTS = 2) with a forwarded rekickAttempt counter, a rekickEnabled kill-switch, and a Teams-card re-kick breadcrumb.

Findings

No blocking issues found. All four findings from the previous review round are addressed in this revision:

  • Re-kick now targets an absolute job path (build job: "/${jobName}", line 2331), fixing the folder-relative resolution bug.
  • rekickAttempt is clamped non-negative (Math.max(0, ...), line 2320), closing the negative-counter cap bypass.
  • The Teams-notification node('build-only') block is wrapped in try/catch (lines 2355-2361) so notification failures no longer block the re-kick.
  • The tail-window constant was renamed FAILURE_LOG_TAIL_CHARS (line 14) to match its character-based usage.

Notes

  • rekickParameters forwards all 25 declared pipeline parameters, so a re-kicked run preserves the original configuration; no parameter silently resets to its default.
  • The classifier's tail is lowercased before matching, and isRetriableScmCheckoutError matches lowercased signals, so case handling is consistent.
  • Suppression logic yields exactly one Teams card per logical build chain: cards are suppressed only on a successful re-kick handoff, and the attempt cap guarantees a terminal card once rekickAttempt reaches the limit.
  • This is a Jenkins/Groovy CI script, so the C++/MLIR-specific checklist items (license headers, op verifiers, Lit/E2E coverage) do not apply.

CI status

No non-self checks report fail or cancel; premerge C/C++, Python format/lint, and Python performance script checks pass, with the remaining Jenkins/Windows checks still pending.

@rocmlir-pr-reviewer rocmlir-pr-reviewer Bot removed the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 17, 2026
Comment thread mlir/utils/jenkins/Jenkinsfile

// Forwards all current parameters unchanged, with the incremented re-kick attempt counter.
List rekickParameters(int nextAttempt, String reason) {
return [

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it will rekick entire job instead of just rekicking on the server on which it is failing on. Would it be possible to just rekick on the failing server ?

@bogdan-petkovic bogdan-petkovic Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled the full nightly/weekly console logs via the Jenkins API and went through the failures. They split into two groups:

  1. Per-server transients (occur inside a single matrix row, on one node):
  • GPU lost/hung: hipErrorNoDevice, GPU Hang, unable to reset GPU, dead-GPU garbage arch (unsupported hip gpu architecture: n/a)
  • Node/agent died mid-run: removed or offline, AgentOfflineException, launcher creation, broken pipe/closed channel, exit code -1/-2
  • Docker/OOM on the node: failed to run image, OutOfMemoryError, ninja exit 137
  • SCM checkout transients: connection reset, curl 18, transfer closed, sideband, bad pack header, clone/fetch errors
  1. Whole-job transients (no single server to retry on):
  • no healthy node found for an arch (withHealthyNode exhausted all candidate nodes)
  • Controller / queue / pipeline-level issues, not tied to a specific row

Your suggestion, retry only on the failing server is doable for the first group: we'd extend withHealthyNode to treat those as node-specific, blacklist the node and re-run that arch on a fresh one, which also avoids failFast killing the other arches. For the second group there's no per-server retry possible, so a whole-job re-kick stays the only option.
I'd propose a hybrid, per-server retry for the first group, and keep the whole-job re-kick as the fallback for the second group.
Also, re #1806 I think I originally scoped it as whole-job re-kick on any transient, this per-server refinement is better, so what do you think about that hybrid approach?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants