WIP: lease controller: cancel in-flight sync from Stop to prevent shutdown…#139176
WIP: lease controller: cancel in-flight sync from Stop to prevent shutdown…#139176willie-yao wants to merge 1 commit into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: willie-yao The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/retest |
|
/test pull-kubernetes-integration |
48e87c5 to
d2cef5b
Compare
… deadlock Signed-off-by: William Yao <william2000yao@gmail.com>
d2cef5b to
ab21151
Compare
|
/test pull-kubernetes-integration |
11 similar comments
|
/test pull-kubernetes-integration |
|
/test pull-kubernetes-integration |
|
/test pull-kubernetes-integration |
|
/test pull-kubernetes-integration |
|
/test pull-kubernetes-integration |
|
/test pull-kubernetes-integration |
|
/test pull-kubernetes-integration |
|
/test pull-kubernetes-integration |
|
/test pull-kubernetes-integration |
|
/test pull-kubernetes-integration |
|
/test pull-kubernetes-integration |
… deadlock
What type of PR is this?
/kind flake
What this PR does / why we need it:
Fixes a deadlock in the apiserver-identity-lease controller's shutdown sequence when sync is stuck on an API call blocked by a misbehaving admission webhook.
Which issue(s) this PR is related to:
Flake issue: #138608
Background
TestBrokenWebhookregisters a fail-closedValidatingWebhookConfigurationthat points at a non-existent service, then restarts the apiserver. The webhook intercepts all mutating API operations, including the writes the apiserver-identity-lease controller makes againstcoordination.k8s.io/leases.If the lease controller's first
Create(issued during apiserver startup) races against the test installing the webhook and loses,latestLeasestays nil. From that point on, every subsequentsynccall entersbackoffEnsureLease, which loops forever callingCreateuntil either the call succeeds or its context cancels. With the broken webhook, the call never succeeds.The deadlock chain at shutdown:
TestBrokenWebhooktimes out at 10m as a result. See verified failing run ci-kubernetes-integration-1-36/2056150753738756096:controller.go:201 "Failed to ensure lease exists, will retry"log lines continue every 7s from23:14:47(Restarting apiserver) through the test timeout, with no"Cleaning up lease on exit"ever appearing.What this PR changes
Adds a
context.CancelFuncfield on the controller, wires it up inStart()viacontext.WithCancel(wait.ContextForChannel(stopCh)), and invokes it inStop()before acquiringreconcilingLock. This givesStop()a direct way to interrupt sync's in-flight API call so the lock is released promptly, without depending on caller-side cancellation.The cancel function is protected by a mutex so
Start()andStop()can safely race during apiserver shutdown. IfStop()is called beforeStart()finishes wiring the cancel function,Start()cancels the run context immediately after installing it.This same pattern is seen in
decorated_watcher.go: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/decorated_watcher.go#L33Special notes for your reviewer:
Reviewers may want to spot-check that
Stop()'s invariants are preserved: cleanupDeletestill runs afterreconcilingLockis acquired, sync's serialization is unchanged, and theStop/syncrace fix from #119083 still holds.Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: