OCPBUGS-81507: UPSTREAM: <138159>: Tests: Call removeNodeTaint variadically instead of in a loop#2637
Conversation
|
@jacobsee: This pull request references Jira Issue OCPBUGS-81507, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@jacobsee: the contents of this pull request could not be automatically validated. The following commits could not be validated and must be approved by a top-level approver:
Comment |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThe Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jacobsee The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
89898f8 to
fac1b62
Compare
|
@jacobsee: the contents of this pull request could not be automatically validated. The following commits could not be validated and must be approved by a top-level approver:
Comment |
I believe there is a sneaky timing/caching issue here. When RemoveTaintsOffNode removes taints one at a time in a loop, each iteration reads the node from the API server cache, removes one taint, and patches. If the cache is stale for any iteration (still reflecting the state before the previous patch's watch event propagated) the patch sends the full stale list minus only the current taint, effectively re-adding the taint that was just removed in the prior step. Since the loop has already moved past that taint, it's never cleaned up. verifyThatTaintIsGone (previously called once per loop) doesn't catch this because it only checks whether the current taint was removed, not whether previously removed taints were re-introduced. The fix is to remove all taints in a single cycle instead of looping, which removeNodeTaint's variadic signature already supports.
fac1b62 to
8c97863
Compare
|
@jacobsee: the contents of this pull request could not be automatically validated. The following commits could not be validated and must be approved by a top-level approver:
Comment |
|
@jacobsee: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-azure-ovn-serial-runc 10 |
|
@jacobsee: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command |
|
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-serial 10 |
|
@jacobsee: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0e67d300-2e2c-11f1-94ff-54087b0b4ae0-0 |
|
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-serial 10 |
|
@jacobsee: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b357aa00-2ed3-11f1-8d9f-ed430f46d238-0 |
|
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-serial 10 |
|
@jacobsee: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/fc87ec90-2f0e-11f1-902a-f57c7b79fa99-0 |
What type of PR is this?
/kind flake
What this PR does / why we need it:
I believe there is a sneaky timing/caching issue here. When RemoveTaintsOffNode removes taints one at a time in a loop, each iteration reads the node from the API server cache, removes one taint, and patches. If the cache is stale for any iteration (still reflecting the state before the previous patch's watch event propagated) the patch sends the full stale list minus only the current taint, effectively re-adding the taint that was just removed in the prior step. Since the loop has already moved past that taint, it's never cleaned up. verifyThatTaintIsGone (previously called once per loop) doesn't catch this because it only checks whether the current taint was removed, not whether previously removed taints were re-introduced. The fix is to remove all taints in a single cycle instead of looping, which removeNodeTaint's variadic signature already supports.
Which issue(s) this PR is related to:
OCPBUGS-81507
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: