Fix subchannel permanently stuck in TransientFailure after ConnectTimeout #2736

Merged
JamesNK merged 3 commits into grpc:master from JamesNK:fix/subchannel-stuck-transient-failure on May 20, 2026

JamesNK commented May 14, 2026

Problem

Fixes #2734

When ConnectTimeout fires while TryConnectAsync is running and the transport returns ConnectResult.Failure (e.g. from a SocketException), the ThrowIfCancellationRequested() at line 395 throws OperationCanceledException. The same can happen at line 425 when a backoff delay is interrupted after the connect timeout has already fired.

The catch (OperationCanceledException) block logs the cancellation but does not transition the subchannel state. At this point the subchannel is already in TransientFailure (set by the failed TryConnectAsync), so once the connect loop exits the subchannel is permanently stuck: RequestConnection() only starts a new connect loop from the Idle state.

Fix

Transition to Idle in the catch (OperationCanceledException) block. This restores the invariant that RequestConnection() relies on: if the retry loop is not running, the subchannel must be in Idle so a new loop can be started.
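
The state invariant can be illustrated with a small model. This is only a sketch in Python, not the Grpc.Net.Client code: the `Subchannel` class, the `transition_to_idle_on_cancel` flag, the `attempt` callable and `ConnectCanceled` (standing in for OperationCanceledException) are all hypothetical names, and backoff and retries are omitted.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CONNECTING = auto()
    READY = auto()
    TRANSIENT_FAILURE = auto()


class ConnectCanceled(Exception):
    """Hypothetical stand-in for OperationCanceledException."""


class Subchannel:
    """Toy model of the connect loop; not the real implementation."""

    def __init__(self, transition_to_idle_on_cancel):
        self.state = State.IDLE
        self.transition_to_idle_on_cancel = transition_to_idle_on_cancel

    def request_connection(self, attempt):
        # A new connect loop is only started from Idle.
        if self.state is not State.IDLE:
            return
        self.state = State.CONNECTING
        try:
            self._connect_loop(attempt)
        except ConnectCanceled:
            # Before the fix: cancellation was logged, state left untouched.
            # After the fix: go back to Idle so a new loop can be started.
            if self.transition_to_idle_on_cancel:
                self.state = State.IDLE

    def _connect_loop(self, attempt):
        # attempt() returns (succeeded, connect_timeout_fired).
        succeeded, timeout_fired = attempt()
        if succeeded:
            self.state = State.READY
            return
        self.state = State.TRANSIENT_FAILURE
        if timeout_fired:
            # Models ThrowIfCancellationRequested() after a failed connect.
            raise ConnectCanceled()


def run_scenario(fix_enabled):
    sub = Subchannel(transition_to_idle_on_cancel=fix_enabled)
    sub.request_connection(lambda: (False, True))   # failure + timeout fired
    sub.request_connection(lambda: (True, False))   # would succeed if retried
    return sub.state
```

In this model, `run_scenario(False)` ends in `State.TRANSIENT_FAILURE` because the second `request_connection` call is ignored, while `run_scenario(True)` ends in `State.READY`.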

Test

Added ConnectTimeout_FiresDuringFailedConnect_SubchannelRecoverable which:

  1. Configures a short ConnectTimeout (100ms)
  2. Has the transport take longer than the timeout, then return Failure
  3. Verifies the OperationCanceledException is caught (via ConnectCanceled log)
  4. Asserts that RequestConnection() can restart the connect loop and reach Ready
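
The four steps above can be mirrored in a rough, self-contained sketch. It is written in Python with asyncio rather than against the real test harness, so `FakeSubchannel`, `run_test`, the in-memory `logs` list and the shortened timings (10 ms timeout, 50 ms transport delay) are all assumptions made for illustration. The real test's helpers are not shown in this PR description.

```python
import asyncio


class FakeSubchannel:
    """Hypothetical stand-in for a subchannel with the fix applied."""

    def __init__(self, transport, connect_timeout):
        self.transport = transport
        self.connect_timeout = connect_timeout
        self.state = "Idle"
        self.logs = []
        self.loop_task = None

    def request_connection(self):
        # A connect loop is only started from Idle.
        if self.state != "Idle":
            return
        self.state = "Connecting"
        loop = asyncio.get_running_loop()
        self.loop_task = loop.create_task(self._connect_loop())

    async def _connect_loop(self):
        loop = asyncio.get_running_loop()
        deadline = loop.time() + self.connect_timeout
        connected = await self.transport()
        if connected:
            self.state = "Ready"
            return
        self.state = "TransientFailure"
        if loop.time() >= deadline:
            # Timeout fired while the failed connect was in progress:
            # log the cancellation and return to Idle (the fix).
            self.logs.append("ConnectCanceled")
            self.state = "Idle"


async def run_test():
    calls = 0

    async def transport():
        nonlocal calls
        calls += 1
        if calls == 1:
            await asyncio.sleep(0.05)  # step 2: slower than the timeout
            return False               # ...then report failure
        return True                    # later attempts succeed

    sub = FakeSubchannel(transport, connect_timeout=0.01)  # step 1
    sub.request_connection()
    await sub.loop_task
    cancel_logged = "ConnectCanceled" in sub.logs          # step 3

    sub.request_connection()                               # step 4
    await sub.loop_task
    return cancel_logged, sub.state
```

Running `asyncio.run(run_test())` in this sketch should show that the cancellation was logged and that the second `request_connection()` call reaches `"Ready"`.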

JamesNK added 3 commits May 14, 2026 09:35
…eout

When ConnectTimeout fires and TryConnectAsync returns Failure (not Timeout),
ThrowIfCancellationRequested() throws OperationCanceledException. The catch
block did not transition the subchannel state, leaving it stuck in
TransientFailure with no running connect loop. RequestConnection() could
never restart the loop because it only starts from Idle state.

Transition to Idle in the OperationCanceledException catch block so that
RequestConnection() can start a new connect loop.

Fixes grpc#2734
@JamesNK JamesNK merged commit c042687 into grpc:master May 20, 2026
12 of 13 checks passed

Development

Successfully merging this pull request may close these issues.

Subchannel permanently stuck in TransientFailure after ConnectTimeout fires