Skip to content

WIP: lease controller: cancel in-flight sync from Stop to prevent shutdown…#139176

Open
willie-yao wants to merge 1 commit into
kubernetes:masterfrom
willie-yao:admissionwebhook-flake
Open

WIP: lease controller: cancel in-flight sync from Stop to prevent shutdown…#139176
willie-yao wants to merge 1 commit into
kubernetes:masterfrom
willie-yao:admissionwebhook-flake

Conversation

@willie-yao
Copy link
Copy Markdown
Contributor

@willie-yao willie-yao commented May 20, 2026

… deadlock

What type of PR is this?

/kind flake

What this PR does / why we need it:

Fixes a deadlock in the apiserver-identity-lease controller's shutdown sequence when sync is stuck on an API call blocked by a misbehaving admission webhook.

Which issue(s) this PR is related to:

Flake issue: #138608

Background

TestBrokenWebhook registers a fail-closed ValidatingWebhookConfiguration that points at a non-existent service, then restarts the apiserver. The webhook intercepts all mutating API operations, including the writes the apiserver-identity-lease controller makes against coordination.k8s.io/leases.

If the lease controller's first Create (issued during apiserver startup) races against the test installing the webhook and loses, latestLease stays nil. From that point on, every subsequent sync call enters backoffEnsureLease, which loops forever calling Create until either the call succeeds or its context cancels. With the broken webhook, the call never succeeds.

The deadlock chain at shutdown:

Stop() waits on reconcilingLock
  └─ sync holds reconcilingLock, stuck in backoffEnsureLease
       └─ backoffEnsureLease only exits on ctx.Done()
            └─ ctx is derived from RunPostStartHooks's ctx, which uses
               WithoutCancel of the parent — only cancels when
               notAcceptingNewRequestCh signals
                 └─ notAcceptingNewRequestCh signals only after
                    preShutdownHooksHasStoppedCh signals
                      └─ preShutdownHooksHasStoppedCh signals only after
                         RunPreShutdownHooks returns
                           └─ RunPreShutdownHooks waits for Stop() to return
                              └─ back to the top

TestBrokenWebhook times out at 10m as a result. See verified failing run ci-kubernetes-integration-1-36/2056150753738756096: controller.go:201 "Failed to ensure lease exists, will retry" log lines continue every 7s from 23:14:47 (Restarting apiserver) through the test timeout, with no "Cleaning up lease on exit" ever appearing.

What this PR changes

Adds a context.CancelFunc field on the controller, wires it up in Start() via context.WithCancel(wait.ContextForChannel(stopCh)), and invokes it in Stop() before acquiring reconcilingLock. This gives Stop() a direct way to interrupt sync's in-flight API call so the lock is released promptly, without depending on caller-side cancellation.

The cancel function is protected by a mutex so Start() and Stop() can safely race during apiserver shutdown. If Stop() is called before Start() finishes wiring the cancel function, Start() cancels the run context immediately after installing it.

This same pattern is seen in decorated_watcher.go: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/decorated_watcher.go#L33

Special notes for your reviewer:

Reviewers may want to spot-check that Stop()'s invariants are preserved: cleanup Delete still runs after reconcilingLock is acquired, sync's serialization is unchanged, and the Stop/sync race fix from #119083 still holds.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. labels May 20, 2026
@k8s-ci-robot k8s-ci-robot requested review from cici37 and jpbetz May 20, 2026 00:13
@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 20, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: willie-yao
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 20, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 20, 2026
@willie-yao
Copy link
Copy Markdown
Contributor Author

/retest
/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao willie-yao force-pushed the admissionwebhook-flake branch from 48e87c5 to d2cef5b Compare May 20, 2026 21:33
… deadlock

Signed-off-by: William Yao <william2000yao@gmail.com>
@willie-yao willie-yao force-pushed the admissionwebhook-flake branch from d2cef5b to ab21151 Compare May 20, 2026 22:31
@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

11 similar comments
@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@willie-yao
Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants