Add configurable timeout for pending backend dials #856
/ok-to-test
df2db32 to dac1468
Add a configurable backend dial timeout for frontend dial requests.

The option defaults to 0 to preserve existing behavior, and operators can opt in to bound DIAL_REQ send and DIAL_RSP wait time. Clean up pending dials on timeout and report backend_dial_timeout as a dial failure reason. Add an HTTP CONNECT regression test for a blocked backend DIAL_REQ send.

Signed-off-by: DH Kim <inerplat@gmail.com>
dac1468 to 2fef652
@cheftako Could you please retrigger the CI for the latest commit? Thanks.
Summary

This PR adds an opt-in `--backend-dial-timeout` option to the proxy server.

When configured, the proxy server bounds how long a frontend dial can remain pending while sending `DIAL_REQ` to the selected backend or waiting for `DIAL_RSP`. On timeout, the pending dial is removed and the frontend receives an error (HTTP 504 for HTTPConnect) instead of waiting indefinitely.

The default value is `0`, so existing behavior is preserved unless operators explicitly enable the timeout.

Motivation
In HTTPConnect deployments used by kube-apiserver egress-selector setups, we observed a failure mode where backend connections remained registered while new HTTP CONNECT dials accumulated in `pending_backend_dials`. This can surface as admission webhook failures when the server-to-agent path stops making forward progress but the backend is still considered available.

The new option gives operators a guardrail against unbounded pending dial accumulation without changing default behavior.
Changes

- Add `--backend-dial-timeout` to proxy-server options.
- Default the option to `0`.
- Bound sending `DIAL_REQ` packets to the selected backend.
- Bound waiting for `DIAL_RSP` before the HTTPConnect tunnel is established.
- Add a `backend_dial_timeout` dial failure metric reason.

Compatibility
The default behavior is unchanged because `--backend-dial-timeout` defaults to `0`.

Operators can opt in with a value such as:

`--backend-dial-timeout=30s`

Testing
Ran:

`go test ./cmd/server/app/options ./pkg/server -count=1`

Added test:

`TestHTTPConnectTunnelBlockedBackendDialSendTimesOutAndCleansPendingDial`

The test starts the HTTPConnect tunnel handler, blocks backend `Send(DIAL_REQ)`, verifies the dial enters pending state, and then verifies the configured timeout returns HTTP 504 and cleans up the pending dial.

Related Issue

#855