Fix subchannel permanently stuck in TransientFailure after ConnectTimeout #2736
Merged
JamesNK merged 3 commits on May 20, 2026
…eout
When ConnectTimeout fires and TryConnectAsync returns Failure (not Timeout), ThrowIfCancellationRequested() throws OperationCanceledException. The catch block did not transition the subchannel state, leaving it stuck in TransientFailure with no running connect loop. RequestConnection() could never restart the loop because it only starts from Idle state.
Transition to Idle in the OperationCanceledException catch block so that RequestConnection() can start a new connect loop.
Fixes grpc#2734
BrennanConroy approved these changes on May 19, 2026
Problem
Fixes #2734
When ConnectTimeout fires while TryConnectAsync is running and the transport returns ConnectResult.Failure (e.g. from a SocketException), the ThrowIfCancellationRequested() at line 395 throws OperationCanceledException. The same can happen at line 425 when a backoff delay is interrupted after the connect timeout has already fired.
The catch (OperationCanceledException) block logs the cancellation but does not transition the subchannel state. Since the subchannel is in TransientFailure at this point (set by the failed TryConnectAsync), the connect loop exits and the subchannel is permanently stuck, because RequestConnection() only starts a new connect loop from Idle state.
Fix
Transition to Idle in the catch (OperationCanceledException) block. This restores the invariant that RequestConnection() relies on: if the retry loop is not running, the subchannel must be in Idle so a new loop can be started.
Test
Added ConnectTimeout_FiresDuringFailedConnect_SubchannelRecoverable, which:
- sets ConnectTimeout (100ms)
- has the connect attempt return Failure
- checks that OperationCanceledException is caught (via the ConnectCanceled log)
- checks that RequestConnection() can restart the connect loop and reach Ready
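The stuck state and the fix described above can be illustrated with a small toy model. This is only a sketch in Python, not the grpc-dotnet C# source: the names ToySubchannel, ConnectCanceled, request_connection, the transition_to_idle_on_cancel flag, and the scripted outcome strings are all hypothetical stand-ins, and the real connect loop retries with backoff rather than consuming a fixed script.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CONNECTING = auto()
    READY = auto()
    TRANSIENT_FAILURE = auto()


class ConnectCanceled(Exception):
    """Hypothetical stand-in for OperationCanceledException."""


class ToySubchannel:
    """Toy model of the subchannel state machine (assumption, not real API)."""

    def __init__(self, transition_to_idle_on_cancel):
        self.state = State.IDLE
        self.loop_running = False
        # False models the behavior before the fix, True models it after.
        self.transition_to_idle_on_cancel = transition_to_idle_on_cancel

    def request_connection(self, outcomes):
        """Start a connect loop, but only from IDLE. Returns True if started."""
        if self.state is not State.IDLE:
            return False
        self._connect_loop(outcomes)
        return True

    def _connect_loop(self, outcomes):
        # outcomes is a script of attempt results:
        # "ready", "failure", or "failure_after_timeout".
        self.loop_running = True
        self.state = State.CONNECTING
        try:
            for outcome in outcomes:
                if outcome == "ready":
                    self.state = State.READY
                    return
                # A failed attempt puts the subchannel in TRANSIENT_FAILURE.
                self.state = State.TRANSIENT_FAILURE
                if outcome == "failure_after_timeout":
                    # Models ThrowIfCancellationRequested() after the timeout fired.
                    raise ConnectCanceled()
            raise ValueError("script must end with 'ready' or 'failure_after_timeout'")
        except ConnectCanceled:
            if self.transition_to_idle_on_cancel:
                # The fix: restore the invariant "loop not running => IDLE".
                self.state = State.IDLE
        finally:
            self.loop_running = False


def demo(fixed):
    """Fail one attempt after the timeout fired, then try to reconnect."""
    sc = ToySubchannel(transition_to_idle_on_cancel=fixed)
    sc.request_connection(["failure_after_timeout"])
    state_after_cancel = sc.state
    restarted = sc.request_connection(["ready"])
    return state_after_cancel, restarted, sc.state


if __name__ == "__main__":
    for fixed in (False, True):
        after_cancel, restarted, final = demo(fixed)
        print(fixed, after_cancel.name, restarted, final.name)
```

With the flag off, the model stays in TRANSIENT_FAILURE and the second request_connection call is a no-op. With the flag on, the model returns to IDLE, a new loop starts, and it reaches READY, which mirrors what the added test checks.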