Restrict test retries to an allowlist of test names by mkowalski · Pull Request #31222 · openshift/origin

mkowalski · 2026-05-27T12:48:47Z

Summary

Replace the image-tag-based and hardcoded retry eligibility in shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure.
Previously, any test from a permitted image tag could retry, and 6 specific tests had hardcoded exceptions. Now, a test must be explicitly listed in the YAML to get a retry.
The allowlist is populated from BigQuery retry_statistics data: 1395 tests that historically failed first but passed on retry are included. The 6 previously hardcoded exception tests are covered by the BigQuery data and no longer need special treatment in Go code.

Behavior

Scenario	Retries?
Test listed in the YAML allowlist	Yes
Test NOT in the allowlist	No
Empty allowlist (`tests: []`)	No

How to allow retries for a test

Add the full test name to pkg/test/ginkgo/retry_allowed_tests.yaml:

tests:
  - "[sig-network] Services should be rejected for evicted pods ..."
  - "[sig-node] Pods Extended Pod Container lifecycle ..."

The file is embedded at compile time (go:embed), so changes require rebuilding openshift-tests.

How to regenerate the allowlist from BigQuery

SELECT DISTINCT TestName
FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics`
WHERE FinalOutcome = 'flaky'
  AND TestName IS NOT NULL AND TestName != ''
ORDER BY TestName

The goal is to reduce this list over time as flaky tests are fixed, eventually reaching an empty list (no retries for anyone).

Summary by CodeRabbit

Tests
- Retry behavior refactored to use an embedded allowlist: only tests present in the allowlist are eligible for a retry.
- Added tests verifying allowlist-driven retry decisions and the allowlist loader returns a valid result.
- Retry logic now logs allow/deny outcomes and falls back to no retries if the allowlist cannot be parsed.

openshift-ci · 2026-05-27T12:53:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [mkowalski]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-27T12:53:31Z

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a20686ea-3775-4a44-bedf-d4574080a352

📥 Commits

Reviewing files that changed from the base of the PR and between 355916d and 0412516.

📒 Files selected for processing (3)

pkg/test/ginkgo/retries.go
pkg/test/ginkgo/retries_test.go
pkg/test/ginkgo/retry_allowed_tests.yaml

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

Walkthrough

The pull request refactors test retry eligibility in pkg/test/ginkgo/retries.go from complex in-function rules to an embedded YAML allowlist. RetryOnceStrategy now stores and uses an in-memory allowlist of retry-eligible test names; shouldRetryTest permits retries only when a test name is in the allowlist. Tests verify both the allowlist loader and the updated retry behavior.

Changes

Test Retry Allowlist Refactor

Layer / File(s)	Summary
Allowlist infrastructure and initialization `pkg/test/ginkgo/retries.go`	Embeds `retry_allowed_tests.yaml` and defines YAML schema via `loadRetryAllowedTests()`, which unmarshals into a `map[string]bool`. `RetryOnceStrategy` struct gains `AllowedRetryTests` field and initializes it in `NewRetryOnceStrategy()`.
Retry eligibility logic refactor `pkg/test/ginkgo/retries.go`	`shouldRetryTest` now permits retries only when `test.name` is present in `AllowedRetryTests` with debug logging; prior image-tag and binary-presence rules are removed.
Test coverage for allowlist mechanism `pkg/test/ginkgo/retries_test.go`	Table-driven test (`TestRetryOnceStrategy_ShouldRetryTest_Allowlist`) validates retry behavior with a supplied allowlist map. New `TestLoadRetryAllowedTests` verifies the embedded loader returns a non-nil map.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (14 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Restrict test retries to an allowlist of test names' accurately describes the main change: implementing an allowlist mechanism to control which tests can be retried.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR does not declare or modify Ginkgo test titles. All 1,395 test names in the retry allowlist are static, deterministic, and descriptive with no dynamic content.
Test Structure And Quality	✅ Passed	The custom check targets Ginkgo BDD tests, but retries_test.go uses standard Go testing (func Test...). No Describe, It, BeforeEach/AfterEach, or cluster interactions present.
Microshift Test Compatibility	✅ Passed	This PR modifies test retry framework code and configuration, not Ginkgo e2e tests. No new e2e tests (It/Describe/Context/When) are added, so the MicroShift compatibility check is not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	PR modifies only framework/infrastructure code (retries.go) and unit tests (retries_test.go), not new Ginkgo e2e tests. Check is not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	PR modifies only test infrastructure code (pkg/test/ginkgo/) for retry logic, not deployment manifests, operators, or controllers; no scheduling constraints are introduced.
Ote Binary Stdout Contract	✅ Passed	No stdout violations found. Code contains no process-level stdout writes—no init(), TestMain(), or suite setup functions emit to stdout, and logrus defaults to stderr.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No new Ginkgo e2e tests (It(), Describe(), Context(), When()) are added in this PR. Files modified (retries.go, retries_test.go) are infrastructure/utility code and unit tests, not e2e test specs.
No-Weak-Crypto	✅ Passed	No weak cryptography patterns found. The PR modifies retry logic using YAML parsing and map lookups only, with no crypto imports, algorithms, or secret comparisons.
Container-Privileges	✅ Passed	PR modifies test retry logic in Go code and test configuration; contains no container specs, K8s manifests, or privilege escalation configurations to flag.
No-Sensitive-Data-In-Logs	✅ Passed	All logging statements log non-sensitive information: test names, counts, durations, filenames. No passwords, tokens, API keys, PII, or sensitive data exposed.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-merge-bot · 2026-05-27T12:54:07Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/test/ginkgo/retries_test.go`:
- Around line 101-107: The test TestLoadRetryAllowedJobs is not detecting parse
failures because loadRetryAllowedJobs hides errors; change the loader to expose
parse errors and update the test to assert them: modify loadRetryAllowedJobs (or
add a new helper like loadRetryAllowedJobsWithError) to return (map[string]bool,
error) and have the test call that function, assert error == nil and the
returned map is non-nil/non-empty; update callers accordingly (or keep a thin
wrapper loadRetryAllowedJobs that panics on error for production code) so the
test will fail when the embedded YAML is malformed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9a11f0a6-4dbc-4d47-8df4-a2e6327e42d7

📥 Commits

Reviewing files that changed from the base of the PR and between 355916d and f7fa00e.

📒 Files selected for processing (3)

pkg/test/ginkgo/retries.go
pkg/test/ginkgo/retries_test.go
pkg/test/ginkgo/retry_allowed_jobs.yaml

Replace the image-tag-based and hardcoded retry eligibility logic in shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure. Previously, any test from a permitted image tag could retry, and 6 specific tests had hardcoded exceptions. Now, a test must be explicitly listed in the YAML file to get a retry. Tests not on the list must pass on the first attempt. The allowlist is populated from BigQuery retry_statistics data: all 1395 tests that historically failed first but passed on retry are included. The 6 previously hardcoded exception tests (OCPBUGS-57477, OCPBUGS-57665, OCPBUGS-57658, OCPBUGS-61287, OCPBUGS-61734, OCPBUGS-66967) are covered by the BigQuery data and no longer need special treatment in Go code. The list should be reduced over time as flaky tests are fixed. To disable retries entirely, empty the YAML list.

mkowalski · 2026-05-27T13:25:07Z

@coderabbitai help

coderabbitai · 2026-05-27T13:25:14Z

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai rate limit to show your current review rate limit status.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
@coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
@coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai generate unit tests to generate unit tests for this PR.
@coderabbitai emit path instructions to emit generated path-specific review instructions for this repository.
@coderabbitai resolve merge conflict to automatically resolve merge conflicts.
@coderabbitai autofix to automatically fix issues identified in unresolved review comments.
@coderabbitai resolve to resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

Visit our status page to check the current availability of CodeRabbit.
Create a ticket on our support page for assistance with any issues or questions.
Visit our documentation site for detailed information on how to use CodeRabbit.
Join our Discord community to connect with other users and get help from the community.
Follow us on X/Twitter for updates and announcements.

mkowalski · 2026-05-27T13:25:39Z

@coderabbitai summary
@coderabbitai full review

coderabbitai · 2026-05-27T13:25:53Z

@mkowalski Sure! I'll regenerate the PR summary and kick off a full review of all the changes.

✅ Actions performed

Summary regeneration triggered.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/test/ginkgo/retries.go`:
- Around line 16-37: The current code implements a test-level allowlist
(retryAllowedTestsYAML, retryAllowedTestsConfig, loadRetryAllowedTests) but the
PR requires a job-level allowlist driven by JOB_NAME; update the implementation
to match by adding a job-level loader (e.g., loadRetryAllowedJobs reading
retry_allowed_jobs.yaml into a struct like Jobs []string) and change the retry
decision logic to consult os.Getenv("JOB_NAME") against that allowlist before
allowing retries, or alternatively update the PR text to state this is
test-level behavior; locate uses of loadRetryAllowedTests and the retry decision
point and replace or supplement them to check JOB_NAME membership in the new
jobs set.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3d187d45-503b-4721-9d24-973c1daa5c5b

📥 Commits

Reviewing files that changed from the base of the PR and between f7fa00e and 0412516.

📒 Files selected for processing (3)

pkg/test/ginkgo/retries.go
pkg/test/ginkgo/retries_test.go
pkg/test/ginkgo/retry_allowed_tests.yaml

openshift-ci · 2026-05-27T13:52:03Z

@mkowalski: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

mkowalski · 2026-05-27T14:18:18Z

/cc @dgoodwin

dgoodwin · 2026-05-27T14:55:51Z

The query claude came up with is better than I would have, I did not know that table exists, and it's perfect for our purposes.

I suggest this query:

  SELECT
    rs.TestName,
    COUNT(*) AS flake_count,
    COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
    SUM(rs.TotalAttempts) AS total_attempts,
    SUM(rs.FailedAttempts) AS total_failed_attempts
  FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
  INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
    ON rs.JobName = jv.job_name
    AND jv.variant_name = 'Release'
    AND jv.variant_value = '4.22'
  WHERE rs.FinalOutcome = 'flaky'
    AND rs.TestName IS NOT NULL AND rs.TestName != ''
    AND rs.PartitionTime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  GROUP BY rs.TestName
  HAVING COUNT(*) >= 5
  ORDER BY flake_count DESC

We're not concerned with flakes in old releases, and filtering to the past 30 days seems logical. 4.22 has the best data over the past month as it's the most stable.

If a test flaked only once in the last month, I'm fine removing it's right to flake. We may have to add more to this list over time, things could surface in CR, but hopefully that will be rare. 5 seems like a good threshold, and that should be about 207 tests.

Threshold options from claude:

┌───────────┬────────────┬────────────┬──────────────────────┐
│ Threshold │ Tests kept │ % of tests │ Flake events covered │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 2 │ 395 │ 81% │ 99.0% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 3 │ 316 │ 65% │ 97.3% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 5 │ 207 │ 42.5% │ 93.2% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 10 │ 113 │ 23.2% │ 86.6% │
└───────────┴────────────┴────────────┴──────────────────────┘

We could go lower though, 2-3 would also be reasonable IMO.

dgoodwin · 2026-05-27T15:03:50Z


 func NewRetryOnceStrategy() *RetryOnceStrategy {
 	return &RetryOnceStrategy{
-		PermittedRetryImageTags: []string{"tests"}, // tests = openshift-tests image


@stbenjam this looks safe to me, but would appreciate if you can confirm. Mat is closing the retry gap at last, with an explicit list of tests that need retries only allowed.

dgoodwin · 2026-05-27T15:06:18Z

-		"[sig-node] Pods Extended Pod Container lifecycle evicted pods should be terminal [Suite:openshift/conformance/parallel] [Suite:k8s]",                                                                                                                                           // https://issues.redhat.com/browse/OCPBUGS-57658
-		"[sig-cli] Kubectl logs all pod logs the Deployment has 2 replicas and each pod has 2 containers should get logs from each pod and each container in Deployment [Suite:openshift/conformance/parallel] [Suite:k8s]",                                                             // https://issues.redhat.com/browse/OCPBUGS-61287
-		"[sig-cli] Kubectl Port forwarding Shutdown client connection while the remote stream is writing data to the port-forward connection port-forward should keep working after detect broken connection [Suite:openshift/conformance/parallel] [Suite:k8s]",                        // https://issues.redhat.com/browse/OCPBUGS-61734
-		"[sig-storage] OCP CSI Volumes [Driver: csi-hostpath-groupsnapshot] [OCPFeatureGate:VolumeGroupSnapshot] [Testpattern:  (delete policy)] volumegroupsnapshottable [Feature:volumegroupsnapshot] VolumeGroupSnapshottable should create snapshots for multiple volumes in a pod", // https://issues.redhat.com/browse/OCPBUGS-66967


Please check if all these are in the list with the new query I added in the PR, if not we should add them.

Also could you ensure the list is sorted, the query I provided isn't. :)

dgoodwin · 2026-05-27T15:07:05Z

Code looks good, just a revision to the query and some sorting concerns.

openshift-ci Bot requested review from deads2k and sjenning May 27, 2026 12:51

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2026

coderabbitai Bot requested changes May 27, 2026

View reviewed changes

Comment thread pkg/test/ginkgo/retries_test.go Outdated

mkowalski force-pushed the retry-allowlist branch from 9d45683 to 0412516 Compare May 27, 2026 13:21

mkowalski changed the title ~~Restrict test retries to an allowlist of CI jobs~~ Restrict test retries to an allowlist of test names May 27, 2026

coderabbitai Bot requested changes May 27, 2026

View reviewed changes

Comment thread pkg/test/ginkgo/retries.go

coderabbitai Bot approved these changes May 27, 2026

View reviewed changes

openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 27, 2026

openshift-ci Bot requested a review from dgoodwin May 27, 2026 14:18

dgoodwin reviewed May 27, 2026

View reviewed changes

Conversation

mkowalski commented May 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Behavior

How to allow retries for a test

How to regenerate the allowlist from BigQuery

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

openshift-merge-bot Bot commented May 27, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mkowalski commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Chat

CodeRabbit commands

Other keywords and placeholders

Status, support, documentation and community

Uh oh!

mkowalski commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

openshift-ci Bot commented May 27, 2026

Uh oh!

mkowalski commented May 27, 2026

Uh oh!

dgoodwin commented May 27, 2026

Uh oh!

dgoodwin May 27, 2026

Choose a reason for hiding this comment

Uh oh!

dgoodwin May 27, 2026

Choose a reason for hiding this comment

Uh oh!

dgoodwin commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mkowalski commented May 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading