Spre 5085 kueue predictive alerts #1093

Open
idankatzle wants to merge 1 commit into redhat-appstudio:main from idankatzle:spre-5085-kueue-predictive-alerts

Contributor

@idankatzle idankatzle commented Apr 26, 2026

https://redhat.atlassian.net/issues?filter=-1&selectedIssue=SPRE-5085
Description
This PR introduces predictive alerting for Kueue to identify potential infrastructure issues before they impact users. These alerts were designed based on the analysis of itn-2026-00086.

Included Alerts:

KueuePredictiveQuotaStarvation: Detects a gap between admitted and active workloads, which indicates a possible quota leak.

KueueHighPriorityBuildDelay: Alerts when critical build tasks wait in the queue for more than 15 minutes.

KueueMintmakerStarvation: Monitors long-term starvation (>12h) for background dependency updates.
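
To illustrate the quota-leak idea behind KueuePredictiveQuotaStarvation, here is a minimal Python sketch. It is not the PromQL expression from this PR. The `Sample` type, the `min_gap` parameter, and the "gap in every sample of the window" rule are assumptions made for illustration; the last one stands in for a Prometheus `for:` clause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One observation of a queue (hypothetical shape, not a Kueue type)."""
    admitted: int  # workloads that currently hold quota
    active: int    # workloads that are actually running


def quota_gap_persists(samples: list[Sample], min_gap: int = 1) -> bool:
    """Return True when every sample shows admitted - active >= min_gap.

    Requiring the gap in every sample means a single transient gap
    does not fire; min_gap is an assumed, tunable threshold.
    """
    if not samples:
        return False
    return all(s.admitted - s.active >= min_gap for s in samples)
```

A window such as `[Sample(5, 3), Sample(5, 3)]` would be flagged, while a window in which the counts converge again at any point would not.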

Pre-Submission Checklist
[x] Jira Ticket: Linked in PR name and commit (SPRE-5085).

[x] Alert Tests: Unit tests added in test/promql/tests/data_plane/kueue_predictive_alerts_test.yaml and verified with promtool.

[x] SOP / Runbook: Alerts include descriptive annotations to guide the on-call engineer.

[ ] Dashboards Addition: N/A.

[x] Contribution Guides: Follows all repository conventions.

[x] Pipeline Finished Successfully: Local promtool validation passed.

@idankatzle idankatzle self-assigned this Apr 26, 2026
@idankatzle idankatzle force-pushed the spre-5085-kueue-predictive-alerts branch from e777d88 to 0b98d41 Compare April 26, 2026 10:29
@idankatzle idankatzle force-pushed the spre-5085-kueue-predictive-alerts branch from 0b98d41 to 8538f57 Compare May 6, 2026 14:02
@idankatzle idankatzle force-pushed the spre-5085-kueue-predictive-alerts branch from 8538f57 to e17ca4c Compare May 6, 2026 14:14
@idankatzle idankatzle requested review from eisraeli, gbenhaim and gcpsoares and removed request for eisraeli, gbenhaim and gcpsoares May 6, 2026 14:19
@idankatzle
Contributor Author

Added KueuePredictiveQuotaStarvation to detect potential quota leaks.
Implemented latency alerts with differentiated thresholds: 15m for pre-merge and 12h for mintmaker to avoid false positives.
Verified logic with new unit tests in kueue_alerts_test.yaml.
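
The differentiated thresholds can be pictured as a simple per-class lookup. The sketch below is a rough Python illustration, not the rule definitions themselves: the 15m and 12h values come from this PR, while the dictionary keys and the function name are hypothetical.

```python
from datetime import timedelta

# Threshold values are from the PR text; the class names used as keys are assumptions.
WAIT_THRESHOLDS = {
    "pre-merge": timedelta(minutes=15),
    "mintmaker": timedelta(hours=12),
}


def exceeds_wait_threshold(workload_class: str, waited: timedelta) -> bool:
    """Return True when a workload has waited longer than its class allows."""
    return waited > WAIT_THRESHOLDS[workload_class]
```

With these values, a 20-minute wait would trip the pre-merge check but not the mintmaker one, which is how the longer threshold avoids false positives for background dependency updates.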

@idankatzle idankatzle requested review from eisraeli, gbenhaim and gcpsoares and removed request for eisraeli and gcpsoares May 6, 2026 14:32