Add configurable timeout for pending backend dials #856

Open
inerplat wants to merge 1 commit into kubernetes-sigs:master from inerplat:fix/backend-dial-timeout-pending-dial

@inerplat

Summary

This PR adds an opt-in --backend-dial-timeout option to the proxy server.

When configured, the proxy server bounds how long a frontend dial can remain pending, both while sending DIAL_REQ to the selected backend and while waiting for DIAL_RSP. On timeout, the pending dial is removed and the frontend receives an error within the configured bound rather than waiting indefinitely.

The default value is 0, so existing behavior is preserved unless operators explicitly enable the timeout.

Motivation

In HTTPConnect deployments used by kube-apiserver egress-selector setups, we observed a failure mode where backend connections remained registered while new HTTP CONNECT dials accumulated in pending_backend_dials. This can surface as admission webhook failures when the server-to-agent path stops making forward progress but the backend is still considered available.

The new option gives operators a guardrail against unbounded pending dial accumulation without changing default behavior.

Changes

  • Add --backend-dial-timeout to proxy-server options.
  • Keep the default disabled with 0.
  • Apply the timeout while sending frontend DIAL_REQ packets to the selected backend.
  • Apply the timeout while waiting for a backend DIAL_RSP before the HTTPConnect tunnel is established.
  • Remove pending dial entries on timeout.
  • Add a backend_dial_timeout dial failure metric reason.
  • Add a reduced HTTPConnect regression test for blocked backend dial send.

Compatibility

The default behavior is unchanged because --backend-dial-timeout defaults to 0.

Operators can opt in with a value such as:

--backend-dial-timeout=30s

Testing

Ran:

go test ./cmd/server/app/options ./pkg/server -count=1

Added test:

TestHTTPConnectTunnelBlockedBackendDialSendTimesOutAndCleansPendingDial

The test starts the HTTPConnect tunnel handler, blocks backend Send(DIAL_REQ), verifies the dial enters pending state, and then verifies the configured timeout returns HTTP 504 and cleans up the pending dial.

Related Issue

#855

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 28, 2026
@k8s-ci-robot
Contributor

Welcome @inerplat!

It looks like this is your first PR to kubernetes-sigs/apiserver-network-proxy 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/apiserver-network-proxy has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @inerplat. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: inerplat
Once this PR has been reviewed and has the lgtm label, please assign ipochi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 28, 2026
@k8s-ci-robot k8s-ci-robot requested review from elmiko and tallclair May 28, 2026 01:50
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 28, 2026
@cheftako
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 28, 2026
@inerplat inerplat force-pushed the fix/backend-dial-timeout-pending-dial branch 2 times, most recently from df2db32 to dac1468 Compare May 29, 2026 00:54
Add a configurable backend dial timeout for frontend dial requests. The option defaults to 0 to preserve existing behavior, and operators can opt in to bound DIAL_REQ send and DIAL_RSP wait time.

Clean up pending dials on timeout and report backend_dial_timeout as a dial failure reason. Add an HTTP CONNECT regression test for a blocked backend DIAL_REQ send.

Signed-off-by: DH Kim <inerplat@gmail.com>
@inerplat inerplat force-pushed the fix/backend-dial-timeout-pending-dial branch from dac1468 to 2fef652 Compare May 29, 2026 01:26
@inerplat
Author

@cheftako
The golangci-lint failure has been fixed. I also found and fixed an additional issue where a timed-out backend dial send could leave the backend connection wedged.

Could you please retrigger the CI for the latest commit?

Thanks.
