Restrict test retries to an allowlist of test names#31222
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mkowalski The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
Cache: Disabled due to data retention organization setting Knowledge base: Disabled due to data retention organization setting WalkthroughThe pull request refactors test retry eligibility in ChangesTest Retry Allowlist Refactor
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/test/ginkgo/retries_test.go`:
- Around line 101-107: The test TestLoadRetryAllowedJobs is not detecting parse
failures because loadRetryAllowedJobs hides errors; change the loader to expose
parse errors and update the test to assert them: modify loadRetryAllowedJobs (or
add a new helper like loadRetryAllowedJobsWithError) to return (map[string]bool,
error) and have the test call that function, assert error == nil and the
returned map is non-nil/non-empty; update callers accordingly (or keep a thin
wrapper loadRetryAllowedJobs that panics on error for production code) so the
test will fail when the embedded YAML is malformed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9a11f0a6-4dbc-4d47-8df4-a2e6327e42d7
📒 Files selected for processing (3)
pkg/test/ginkgo/retries.gopkg/test/ginkgo/retries_test.gopkg/test/ginkgo/retry_allowed_jobs.yaml
Replace the image-tag-based and hardcoded retry eligibility logic in shouldRetryTest() with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure. Previously, any test from a permitted image tag could retry, and 6 specific tests had hardcoded exceptions. Now, a test must be explicitly listed in the YAML file to get a retry. Tests not on the list must pass on the first attempt. The allowlist is populated from BigQuery retry_statistics data: all 1395 tests that historically failed first but passed on retry are included. The 6 previously hardcoded exception tests (OCPBUGS-57477, OCPBUGS-57665, OCPBUGS-57658, OCPBUGS-61287, OCPBUGS-61734, OCPBUGS-66967) are covered by the BigQuery data and no longer need special treatment in Go code. The list should be reduced over time as flaky tests are fixed. To disable retries entirely, empty the YAML list.
9d45683 to
0412516
Compare
|
@coderabbitai help |
ChatThere are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
Status, support, documentation and community
|
|
@coderabbitai summary |
|
✅ Actions performedSummary regeneration triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/test/ginkgo/retries.go`:
- Around line 16-37: The current code implements a test-level allowlist
(retryAllowedTestsYAML, retryAllowedTestsConfig, loadRetryAllowedTests) but the
PR requires a job-level allowlist driven by JOB_NAME; update the implementation
to match by adding a job-level loader (e.g., loadRetryAllowedJobs reading
retry_allowed_jobs.yaml into a struct like Jobs []string) and change the retry
decision logic to consult os.Getenv("JOB_NAME") against that allowlist before
allowing retries, or alternatively update the PR text to state this is
test-level behavior; locate uses of loadRetryAllowedTests and the retry decision
point and replace or supplement them to check JOB_NAME membership in the new
jobs set.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3d187d45-503b-4721-9d24-973c1daa5c5b
📒 Files selected for processing (3)
pkg/test/ginkgo/retries.gopkg/test/ginkgo/retries_test.gopkg/test/ginkgo/retry_allowed_tests.yaml
|
@mkowalski: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/cc @dgoodwin |
|
The query claude came up with is better than I would have, I did not know that table exists, and it's perfect for our purposes. I suggest this query: SELECT
rs.TestName,
COUNT(*) AS flake_count,
COUNT(DISTINCT rs.JobName) AS distinct_jobs_affected,
SUM(rs.TotalAttempts) AS total_attempts,
SUM(rs.FailedAttempts) AS total_failed_attempts
FROM `openshift-ci-data-analysis.ci_data_autodl.retry_statistics` rs
INNER JOIN `openshift-gce-devel.ci_analysis_us.job_variants` jv
ON rs.JobName = jv.job_name
AND jv.variant_name = 'Release'
AND jv.variant_value = '4.22'
WHERE rs.FinalOutcome = 'flaky'
AND rs.TestName IS NOT NULL AND rs.TestName != ''
AND rs.PartitionTime >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY rs.TestName
HAVING COUNT(*) >= 5
ORDER BY flake_count DESCWe're not concerned with flakes in old releases, and filtering to the past 30 days seems logical. 4.22 has the best data over the past month as it's the most stable. If a test flaked only once in the last month, I'm fine removing it's right to flake. We may have to add more to this list over time, things could surface in CR, but hopefully that will be rare. 5 seems like a good threshold, and that should be about 207 tests. Threshold options from claude: ┌───────────┬────────────┬────────────┬──────────────────────┐ We could go lower though, 2-3 would also be reasonable IMO. |
|
|
||
| func NewRetryOnceStrategy() *RetryOnceStrategy { | ||
| return &RetryOnceStrategy{ | ||
| PermittedRetryImageTags: []string{"tests"}, // tests = openshift-tests image |
There was a problem hiding this comment.
@stbenjam this looks safe to me, but would appreciate if you can confirm. Mat is closing the retry gap at last, with an explicit list of tests that need retries only allowed.
| "[sig-node] Pods Extended Pod Container lifecycle evicted pods should be terminal [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-57658 | ||
| "[sig-cli] Kubectl logs all pod logs the Deployment has 2 replicas and each pod has 2 containers should get logs from each pod and each container in Deployment [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-61287 | ||
| "[sig-cli] Kubectl Port forwarding Shutdown client connection while the remote stream is writing data to the port-forward connection port-forward should keep working after detect broken connection [Suite:openshift/conformance/parallel] [Suite:k8s]", // https://issues.redhat.com/browse/OCPBUGS-61734 | ||
| "[sig-storage] OCP CSI Volumes [Driver: csi-hostpath-groupsnapshot] [OCPFeatureGate:VolumeGroupSnapshot] [Testpattern: (delete policy)] volumegroupsnapshottable [Feature:volumegroupsnapshot] VolumeGroupSnapshottable should create snapshots for multiple volumes in a pod", // https://issues.redhat.com/browse/OCPBUGS-66967 |
There was a problem hiding this comment.
Please check if all these are in the list with the new query I added in the PR, if not we should add them.
Also could you ensure the list is sorted, the query I provided isn't. :)
|
Code looks good, just a revision to the query and some sorting concerns. |
Summary
shouldRetryTest()with a single YAML allowlist (retry_allowed_tests.yaml) that is the sole source of truth for which tests may retry on failure.retry_statisticsdata: 1395 tests that historically failed first but passed on retry are included. The 6 previously hardcoded exception tests are covered by the BigQuery data and no longer need special treatment in Go code.Behavior
tests: [])How to allow retries for a test
Add the full test name to
pkg/test/ginkgo/retry_allowed_tests.yaml:The file is embedded at compile time (
go:embed), so changes require rebuildingopenshift-tests.How to regenerate the allowlist from BigQuery
The goal is to reduce this list over time as flaky tests are fixed, eventually reaching an empty list (no retries for anyone).
Summary by CodeRabbit