OCPBUGS-85426: improve DCM tests to reduce flakiness in endpoints#31173
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@jcmoraisjr: This pull request references Jira Issue OCPBUGS-85426, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR improves router e2e test observability and reliability by adding tabular Endpoints and EndpointSlice output helpers, deployment state fetch/print infrastructure used throughout scale and churn operations, and more robust Endpoints polling with improved EndpointSlice synchronization. ChangesRouter Diagnostics and Robustness
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 12 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@jcmoraisjr: This pull request references Jira Issue OCPBUGS-85426, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling required tests: |
3dcf0fe to
f272628
Compare
|
Scheduling required tests: |
|
/jira-refresh |
|
cc @rhamini3 |
|
/jira refresh |
|
@melvinjoseph86: This pull request references Jira Issue OCPBUGS-85426, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
f272628 to
1a4ff75
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/router/config_manager_ingress.go`:
- Around line 71-83: The list calls swallow errors; change each call to capture
the error (routes, err :=
routeClient.RouteV1().Routes(oc.Namespace()).List(...); endpoints, err :=
kubeClient.CoreV1().Endpoints(oc.Namespace()).List(...); epsList, err :=
kubeClient.DiscoveryV1().EndpointSlices(oc.Namespace()).List(...)) and if err !=
nil log the error with context (e.g. include namespace and operation) before
attempting to output items; update the branches that currently call
outputIngress, outputEndpoints, outputEndpointSlice to still run when results
exist but ensure the err is logged (using the test/logger available in the file,
e.g. framework.Logf/t.Logf or the existing logger) so failed specs include
actionable diagnostics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 30a7ec20-63d2-4427-8031-a2cc9653b227
📒 Files selected for processing (2)
test/extended/router/config_manager_ingress.gotest/extended/router/stress.go
|
/test verify |
|
/assign @bentito |
|
Scheduling required tests: |
1a4ff75 to
7ae081b
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
test/extended/router/stress.go (2)
699-699: 💤 Low valueNit: simplify boolean comparison.
*ready == truecan be written as just*ready.go vet/linters typically flag this asS1002.Proposed change
- if ready := ep.Conditions.Ready; ready == nil || *ready == true { + if ready := ep.Conditions.Ready; ready == nil || *ready {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/router/stress.go` at line 699, The boolean comparison is overly verbose: in the condition that checks ep.Conditions.Ready, replace the explicit comparison "*ready == true" with the simpler dereference "*ready" so the if becomes "if ready := ep.Conditions.Ready; ready == nil || *ready { ... }"; update the conditional around ep.Conditions.Ready accordingly to satisfy linters like S1002.
667-679: 💤 Low valueOptional: hoist
resumeAddrsout of the subset loop and simplify boolean check.
resumeAddrsis redefined on everySubsetsiteration; it can be a package-level helper (or defined once above the loop). Also consider renaming tosummarizeAddrsfor clarity — "resume" reads as the Portuguese/Spanish cognate of "summarize" rather than the English meaning. Purely stylistic; no behavior impact.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/router/stress.go` around lines 667 - 679, The inline function resumeAddrs defined inside the Subsets iteration should be hoisted so it’s defined once (either as a package-level helper or above the loop) and renamed to summarizeAddrs for clarity; update calls inside the loop to use summarizeAddrs(addrs). While moving it, simplify the logic that picks the address by returning the first non-empty of addr.IP or addr.Hostname (instead of nested if/else) and keep the join behavior the same so the behavior of functions referencing Subsets remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/extended/router/stress.go`:
- Line 699: The boolean comparison is overly verbose: in the condition that
checks ep.Conditions.Ready, replace the explicit comparison "*ready == true"
with the simpler dereference "*ready" so the if becomes "if ready :=
ep.Conditions.Ready; ready == nil || *ready { ... }"; update the conditional
around ep.Conditions.Ready accordingly to satisfy linters like S1002.
- Around line 667-679: The inline function resumeAddrs defined inside the
Subsets iteration should be hoisted so it’s defined once (either as a
package-level helper or above the loop) and renamed to summarizeAddrs for
clarity; update calls inside the loop to use summarizeAddrs(addrs). While moving
it, simplify the logic that picks the address by returning the first non-empty
of addr.IP or addr.Hostname (instead of nested if/else) and keep the join
behavior the same so the behavior of functions referencing Subsets remains
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 108561fd-89f8-4529-b2a1-2862fa2cac51
📒 Files selected for processing (2)
test/extended/router/config_manager_ingress.gotest/extended/router/stress.go
|
Scheduling required tests: |
7ae081b to
a817ecf
Compare
|
@mdbooth I refactored a method that was causing the race you reported, thanks for reaching out. @bentito I rebased the commit so it's hard to find the difference. I only refactored /unhold |
|
Scheduling required tests: |
|
/retest-required |
|
/test e2e-gcp-ovn-upgrade |
1 similar comment
|
/test e2e-gcp-ovn-upgrade |
| return false | ||
| } | ||
| for _, ep := range eps { | ||
| if len(ep.Addresses) == 0 || !slices.ContainsFunc(targetAddresses, func(addr corev1.EndpointAddress) bool { return addr.IP == ep.Addresses[0] }) { |
There was a problem hiding this comment.
I think [0] might be brittle, better to iterate and see if contains?
There was a problem hiding this comment.
This is by the spec:
// EndpointSlices generated by the EndpointSlice
// controller will always have exactly 1 address. No semantics are defined for
// additional addresses beyond the first, and kube-proxy does not look at them.
|
just one nit comment. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bentito, jcmoraisjr The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/test e2e-gcp-ovn-techpreview |
|
@rhamini3: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
confirmed that all DCM tests in e2e-gcp-techpreview job passed /verified by CI |
|
@rhamini3: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test verify-deps |
|
/retest-required |
|
Scheduling required tests: |
|
/retest |
|
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: a817ecf
New tests seen in this PR at sha: a817ecf
|
|
/test e2e-gcp-ovn |
|
/test e2e-gcp-ovn |
|
@jcmoraisjr: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@jcmoraisjr: Jira Issue Verification Checks: Jira Issue OCPBUGS-85426 Jira Issue OCPBUGS-85426 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Fix included in release 5.0.0-0.nightly-2026-06-27-125119 |
Addressing a race when patching endpoints resource:
Also, dumping all endpoints and endpointSlice in the test namespace in case of a test failure, which helps to correlate the current state and what should be expected in the test scenario.
Example of deployments state log:
Example of Endpoints and EndpointSlice resources listed if the test fails:
https://redhat.atlassian.net/browse/OCPBUGS-85426
Summary by CodeRabbit