Skip to content

Restrict test retries to an allowlist of test names#31222

Open
mkowalski wants to merge 1 commit into
openshift:mainfrom
mkowalski:retry-allowlist
Open

Restrict test retries to an allowlist of test names#31222
mkowalski wants to merge 1 commit into
openshift:mainfrom
mkowalski:retry-allowlist

Conversation

@mkowalski
Copy link
Copy Markdown
Contributor

@mkowalski mkowalski commented May 27, 2026

Summary

  • Replace the image-tag-based and hardcoded retry eligibility in shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure.
  • Previously, any test from a permitted image tag could retry, and 6 specific tests had hardcoded exceptions. Now, a test must be explicitly listed in the YAML to get a retry.
  • The allowlist is populated from BigQuery retry_statistics data: 1395 tests that historically failed first but passed on retry are included. The 6 previously hardcoded exception tests are covered by the BigQuery data and no longer need special treatment in Go code.

Behavior

Scenario Retries?
Test listed in the YAML allowlist Yes
Test NOT in the allowlist No
Empty allowlist (tests: []) No

How to allow retries for a test

Add the full test name to pkg/test/ginkgo/retry_allowed_tests.yaml:

tests:
  - "[sig-network] Services should be rejected for evicted pods ..."
  - "[sig-node] Pods Extended Pod Container lifecycle ..."

The file is embedded at compile time (go:embed), so changes require rebuilding openshift-tests.

How to regenerate the allowlist from BigQuery

SELECT DISTINCT TestName
FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics`
WHERE FinalOutcome = 'flaky'
  AND TestName IS NOT NULL AND TestName != ''
ORDER BY TestName

The goal is to reduce this list over time as flaky tests are fixed, eventually reaching an empty list (no retries for anyone).

Summary by CodeRabbit

  • Tests
    • Retry behavior refactored to use an embedded allowlist: only tests present in the allowlist are eligible for a retry.
    • Added tests verifying allowlist-driven retry decisions and the allowlist loader returns a valid result.
    • Retry logic now logs allow/deny outcomes and falls back to no retries if the allowlist cannot be parsed.

@openshift-ci openshift-ci Bot requested review from deads2k and sjenning May 27, 2026 12:51
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a20686ea-3775-4a44-bedf-d4574080a352

📥 Commits

Reviewing files that changed from the base of the PR and between 355916d and 0412516.

📒 Files selected for processing (3)
  • pkg/test/ginkgo/retries.go
  • pkg/test/ginkgo/retries_test.go
  • pkg/test/ginkgo/retry_allowed_tests.yaml

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting


Walkthrough

The pull request refactors test retry eligibility in pkg/test/ginkgo/retries.go from complex in-function rules to an embedded YAML allowlist. RetryOnceStrategy now stores and uses an in-memory allowlist of retry-eligible test names; shouldRetryTest permits retries only when a test name is in the allowlist. Tests verify both the allowlist loader and the updated retry behavior.

Changes

Test Retry Allowlist Refactor

Layer / File(s) Summary
Allowlist infrastructure and initialization
pkg/test/ginkgo/retries.go
Embeds retry_allowed_tests.yaml and defines YAML schema via loadRetryAllowedTests(), which unmarshals into a map[string]bool. RetryOnceStrategy struct gains AllowedRetryTests field and initializes it in NewRetryOnceStrategy().
Retry eligibility logic refactor
pkg/test/ginkgo/retries.go
shouldRetryTest now permits retries only when test.name is present in AllowedRetryTests with debug logging; prior image-tag and binary-presence rules are removed.
Test coverage for allowlist mechanism
pkg/test/ginkgo/retries_test.go
Table-driven test (TestRetryOnceStrategy_ShouldRetryTest_Allowlist) validates retry behavior with a supplied allowlist map. New TestLoadRetryAllowedTests verifies the embedded loader returns a non-nil map.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Restrict test retries to an allowlist of test names' accurately describes the main change: implementing an allowlist mechanism to control which tests can be retried.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR does not declare or modify Ginkgo test titles. All 1,395 test names in the retry allowlist are static, deterministic, and descriptive with no dynamic content.
Test Structure And Quality ✅ Passed The custom check targets Ginkgo BDD tests, but retries_test.go uses standard Go testing (func Test...). No Describe, It, BeforeEach/AfterEach, or cluster interactions present.
Microshift Test Compatibility ✅ Passed This PR modifies test retry framework code and configuration, not Ginkgo e2e tests. No new e2e tests (It/Describe/Context/When) are added, so the MicroShift compatibility check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR modifies only framework/infrastructure code (retries.go) and unit tests (retries_test.go), not new Ginkgo e2e tests. Check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only test infrastructure code (pkg/test/ginkgo/) for retry logic, not deployment manifests, operators, or controllers; no scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed No stdout violations found. Code contains no process-level stdout writes—no init(), TestMain(), or suite setup functions emit to stdout, and logrus defaults to stderr.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests (It(), Describe(), Context(), When()) are added in this PR. Files modified (retries.go, retries_test.go) are infrastructure/utility code and unit tests, not e2e test specs.
No-Weak-Crypto ✅ Passed No weak cryptography patterns found. The PR modifies retry logic using YAML parsing and map lookups only, with no crypto imports, algorithms, or secret comparisons.
Container-Privileges ✅ Passed PR modifies test retry logic in Go code and test configuration; contains no container specs, K8s manifests, or privilege escalation configurations to flag.
No-Sensitive-Data-In-Logs ✅ Passed All logging statements log non-sensitive information: test names, counts, durations, filenames. No passwords, tokens, API keys, PII, or sensitive data exposed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/test/ginkgo/retries_test.go`:
- Around line 101-107: The test TestLoadRetryAllowedJobs is not detecting parse
failures because loadRetryAllowedJobs hides errors; change the loader to expose
parse errors and update the test to assert them: modify loadRetryAllowedJobs (or
add a new helper like loadRetryAllowedJobsWithError) to return (map[string]bool,
error) and have the test call that function, assert error == nil and the
returned map is non-nil/non-empty; update callers accordingly (or keep a thin
wrapper loadRetryAllowedJobs that panics on error for production code) so the
test will fail when the embedded YAML is malformed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9a11f0a6-4dbc-4d47-8df4-a2e6327e42d7

📥 Commits

Reviewing files that changed from the base of the PR and between 355916d and f7fa00e.

📒 Files selected for processing (3)
  • pkg/test/ginkgo/retries.go
  • pkg/test/ginkgo/retries_test.go
  • pkg/test/ginkgo/retry_allowed_jobs.yaml

Comment thread pkg/test/ginkgo/retries_test.go Outdated
Replace the image-tag-based and hardcoded retry eligibility logic in
shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml)
that is the sole source of truth for which tests may retry on failure.

Previously, any test from a permitted image tag could retry, and 6
specific tests had hardcoded exceptions. Now, a test must be explicitly
listed in the YAML file to get a retry. Tests not on the list must pass
on the first attempt.

The allowlist is populated from BigQuery retry_statistics data: all 1395
tests that historically failed first but passed on retry are included.
The 6 previously hardcoded exception tests (OCPBUGS-57477, OCPBUGS-57665,
OCPBUGS-57658, OCPBUGS-61287, OCPBUGS-61734, OCPBUGS-66967) are covered
by the BigQuery data and no longer need special treatment in Go code.

The list should be reduced over time as flaky tests are fixed. To
disable retries entirely, empty the YAML list.
@mkowalski mkowalski changed the title Restrict test retries to an allowlist of CI jobs Restrict test retries to an allowlist of test names May 27, 2026
@mkowalski
Copy link
Copy Markdown
Contributor Author

@coderabbitai help

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai rate limit to show your current review rate limit status.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
  • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
  • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai emit path instructions to emit generated path-specific review instructions for this repository.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@mkowalski
Copy link
Copy Markdown
Contributor Author

@coderabbitai summary
@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

@mkowalski Sure! I'll regenerate the PR summary and kick off a full review of all the changes.

✅ Actions performed

Summary regeneration triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/test/ginkgo/retries.go`:
- Around line 16-37: The current code implements a test-level allowlist
(retryAllowedTestsYAML, retryAllowedTestsConfig, loadRetryAllowedTests) but the
PR requires a job-level allowlist driven by JOB_NAME; update the implementation
to match by adding a job-level loader (e.g., loadRetryAllowedJobs reading
retry_allowed_jobs.yaml into a struct like Jobs []string) and change the retry
decision logic to consult os.Getenv("JOB_NAME") against that allowlist before
allowing retries, or alternatively update the PR text to state this is
test-level behavior; locate uses of loadRetryAllowedTests and the retry decision
point and replace or supplement them to check JOB_NAME membership in the new
jobs set.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3d187d45-503b-4721-9d24-973c1daa5c5b

📥 Commits

Reviewing files that changed from the base of the PR and between f7fa00e and 0412516.

📒 Files selected for processing (3)
  • pkg/test/ginkgo/retries.go
  • pkg/test/ginkgo/retries_test.go
  • pkg/test/ginkgo/retry_allowed_tests.yaml

Comment thread pkg/test/ginkgo/retries.go
@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 27, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 27, 2026

@mkowalski: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@mkowalski
Copy link
Copy Markdown
Contributor Author

/cc @dgoodwin

@openshift-ci openshift-ci Bot requested a review from dgoodwin May 27, 2026 14:18
@dgoodwin
Copy link
Copy Markdown
Contributor

The query claude came up with is better than I would have, I did not know that table exists, and it's perfect for our purposes.

I suggest this query:

  SELECT
    rs.TestName,
    COUNT(*) AS flake_count,
    COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
    SUM(rs.TotalAttempts) AS total_attempts,
    SUM(rs.FailedAttempts) AS total_failed_attempts
  FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
  INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
    ON rs.JobName = jv.job_name
    AND jv.variant_name = 'Release'
    AND jv.variant_value = '4.22'
  WHERE rs.FinalOutcome = 'flaky'
    AND rs.TestName IS NOT NULL AND rs.TestName != ''
    AND rs.PartitionTime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  GROUP BY rs.TestName
  HAVING COUNT(*) >= 5
  ORDER BY flake_count DESC

We're not concerned with flakes in old releases, and filtering to the past 30 days seems logical. 4.22 has the best data over the past month as it's the most stable.

If a test flaked only once in the last month, I'm fine removing it's right to flake. We may have to add more to this list over time, things could surface in CR, but hopefully that will be rare. 5 seems like a good threshold, and that should be about 207 tests.

Threshold options from claude:

┌───────────┬────────────┬────────────┬──────────────────────┐
│ Threshold │ Tests kept │ % of tests │ Flake events covered │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 2 │ 395 │ 81% │ 99.0% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 3 │ 316 │ 65% │ 97.3% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 5 │ 207 │ 42.5% │ 93.2% │
├───────────┼────────────┼────────────┼──────────────────────┤
│ >= 10 │ 113 │ 23.2% │ 86.6% │
└───────────┴────────────┴────────────┴──────────────────────┘

We could go lower though, 2-3 would also be reasonable IMO.


func NewRetryOnceStrategy() *RetryOnceStrategy {
return &RetryOnceStrategy{
PermittedRetryImageTags: []string{"tests"}, // tests = openshift-tests image
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stbenjam this looks safe to me, but would appreciate if you can confirm. Mat is closing the retry gap at last, with an explicit list of tests that need retries only allowed.

"[sig-node] Pods Extended Pod Container lifecycle evicted pods should be terminal [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-57658
"[sig-cli] Kubectl logs all pod logs the Deployment has 2 replicas and each pod has 2 containers should get logs from each pod and each container in Deployment [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-61287
"[sig-cli] Kubectl Port forwarding Shutdown client connection while the remote stream is writing data to the port-forward connection port-forward should keep working after detect broken connection [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-61734
"[sig-storage] OCP CSI Volumes [Driver: csi-hostpath-groupsnapshot] [OCPFeatureGate:VolumeGroupSnapshot] [Testpattern: (delete policy)] volumegroupsnapshottable [Feature:volumegroupsnapshot] VolumeGroupSnapshottable should create snapshots for multiple volumes in a pod", // https://issues.redhat.com/browse/OCPBUGS-66967
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check if all these are in the list with the new query I added in the PR, if not we should add them.

Also could you ensure the list is sorted, the query I provided isn't. :)

@dgoodwin
Copy link
Copy Markdown
Contributor

Code looks good, just a revision to the query and some sorting concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants