What version of gRPC and what language are you using?
- Grpc.Net.Client v2.71.0 (where we first observed the issue in production)
- Confirmed still present in v2.76.0 (latest stable) and master as of May 2026 — the affected code in
Subchannel.ConnectTransportAsync() is unchanged.
What operating system (Linux, Windows,...) and version?
- Linux (production incident)
- macOS 15.4 (local reproduction)
What runtime / compiler are you using (e.g. .NET Core SDK version dotnet --info)
- .NET 8.0 (production)
- .NET 10.0 (reproduction)
What did you do?
A gRPC server became temporarily unreachable due to a network routing issue (~60 seconds). After the network recovered, the subchannel remained permanently stuck in TransientFailure — it never reconnected despite the server being fully available. All RPCs failed indefinitely with StatusCode.Unavailable "Error connecting to subchannel.". The only recovery was to restart the client process.
A memory dump of the stuck process confirmed:
Subchannel._state = TransientFailure
ConnectTransportAsync task status = RAN_TO_COMPLETION (retry loop exited)
_delayInterruptTcs = null (no pending backoff delay)
ConnectContext.Disposed = true, IsConnectCanceled = false (connect timeout fired, not explicit cancel)
_currentEndPoint = null, _activeStreams.Count = 0 (no connection)
Minimal reproduction (works with any gRPC server):
using Grpc.Net.Client;
using System.Net.Http;
// Prerequisites: start any gRPC server on localhost:5001, e.g.:
// dotnet new grpc -o TestServer && cd TestServer && dotnet run --urls http://localhost:5001
var channel = GrpcChannel.ForAddress("http://localhost:5001", new GrpcChannelOptions
{
    HttpHandler = new SocketsHttpHandler
    {
        ConnectTimeout = TimeSpan.FromSeconds(5),
        EnableMultipleHttp2Connections = true,
    },
});
// 1. Establish connection
await channel.ConnectAsync();
Console.WriteLine($"1. Connected. State: {channel.State}"); // Ready
// 2. Kill the server process externally (port returns ECONNREFUSED)
Console.WriteLine(">> Stop the server now, then press Enter...");
Console.ReadLine();
// 3. Wait for socket ping timer to detect dead socket (Ready → Idle)
Console.WriteLine("3. Waiting 12s for dead socket detection...");
await Task.Delay(TimeSpan.FromSeconds(12));
Console.WriteLine($" State: {channel.State}"); // Idle
// 4. Trigger ConnectTransportAsync retry loop
try { using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2)); await channel.ConnectAsync(cts.Token); } catch { }
// 5. Wait ~4s (just before ConnectTimeout fires at 5s), then rapidly interrupt backoff
await Task.Delay(TimeSpan.FromSeconds(4));
Console.WriteLine("5. Sending ConnectAsync() interrupts for 4s...");
var start = DateTime.UtcNow;
while (DateTime.UtcNow - start < TimeSpan.FromSeconds(4))
{
    try { using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(1)); await channel.ConnectAsync(cts.Token); } catch { }
    await Task.Delay(20);
}
await Task.Delay(TimeSpan.FromSeconds(3)); // settle
Console.WriteLine($" State: {channel.State}"); // TransientFailure — STUCK!
// 6. Restart the server
Console.WriteLine(">> Restart the server now, then press Enter...");
Console.ReadLine();
await Task.Delay(TimeSpan.FromSeconds(5));
// 7. Verify: channel never recovers
Console.WriteLine($"7. State after server restart: {channel.State}"); // TransientFailure
for (int i = 0; i < 5; i++)
{
    try { using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2)); await channel.ConnectAsync(cts.Token); Console.WriteLine($" Attempt {i}: OK"); break; }
    catch { Console.WriteLine($" Attempt {i}: FAILED — state={channel.State}"); }
}
// All attempts fail. Channel is permanently stuck.
What did you expect to see?
After the server recovers, the subchannel should reconnect — transitioning from TransientFailure → Idle → Connecting → Ready.
What did you see instead?
The subchannel stays in TransientFailure permanently. Every pick returns PickResultFailure with the stale error from the original connection failure. ConnectTransportAsync has exited (RAN_TO_COMPLETION) and nothing restarts it.
How it happens:
The ConnectTransportAsync retry loop is wrapped in a try block whose catch (OperationCanceledException) handler (lines 432-435) logs the cancellation but does not transition the state:
catch (OperationCanceledException)
{
    SubchannelLog.ConnectCanceled(_logger, Id);
    // BUG: no UpdateConnectivityState(Idle), so the state stays TransientFailure
}
The OperationCanceledException is thrown by connectContext.CancellationToken.ThrowIfCancellationRequested() after ConnectTimeout fires. There are two trigger paths:
Path A — Line 395 (race condition, likely production trigger):
TryConnectAsync fails with a SocketException (e.g., connection refused, TCP RST) and returns ConnectResult.Failure. The ConnectResult.Timeout check in TryConnectAsync (lines 214-220) does not fire because the exception was a SocketException, not an OperationCanceledException. The connectContext.CancellationToken.ThrowIfCancellationRequested() call on line 395 then throws because the connect timeout has already fired. This race is realistic with remote servers, where connection failures take ~10-100ms.
Path B — Line 425 (backoff interruption, used in reproduction):
The backoff delay is interrupted via _delayInterruptTcs (triggered by channel.ConnectAsync() → RequestConnection()) after ConnectTimeout has fired. Line 425 connectContext.CancellationToken.ThrowIfCancellationRequested() throws because the token is already cancelled.
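Both paths share one mechanic: the attempt ends without an OperationCanceledException, and a later ThrowIfCancellationRequested() call on the already-cancelled token is what exits the loop. The following is a minimal sketch of that mechanic, not the Grpc.Net.Client source. ConnectResult, TryConnect, and ConnectRaceSketch are hypothetical stand-ins, and the explicit Cancel() call stands in for the ConnectTimeout timer firing.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;

// Hypothetical stand-in for the library's connect result; illustrative only.
public enum ConnectResult { Success, Failure, Timeout }

public static class ConnectRaceSketch
{
    // Simulates a connect attempt that fails with a SocketException.
    // Because no OperationCanceledException is thrown inside the attempt,
    // the Timeout classification is never reached.
    public static ConnectResult TryConnect(CancellationToken token)
    {
        try
        {
            throw new SocketException((int)SocketError.ConnectionRefused);
        }
        catch (OperationCanceledException) when (token.IsCancellationRequested)
        {
            return ConnectResult.Timeout;
        }
        catch (SocketException)
        {
            return ConnectResult.Failure;
        }
    }

    // Runs one simulated loop iteration and describes how it ends.
    public static string RunOnce()
    {
        using var cts = new CancellationTokenSource();
        try
        {
            ConnectResult result = TryConnect(cts.Token);

            // Assumption: the connect timeout fires right after the failed attempt.
            cts.Cancel();

            // Analogous to the checks described at lines 395 and 425.
            cts.Token.ThrowIfCancellationRequested();
            return $"loop continues after {result}";
        }
        catch (OperationCanceledException)
        {
            // Analogous to the catch block that logs and exits the loop.
            return "loop exited via OperationCanceledException";
        }
    }
}
```

Calling ConnectRaceSketch.RunOnce() returns "loop exited via OperationCanceledException" even though the attempt itself was classified as Failure, which is the combination the Timeout handling does not anticipate.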
After the OCE is caught (lines 432-435), the method exits. The subchannel is now stuck because RequestConnection() only starts a new ConnectTransportAsync from the Idle state. For TransientFailure, it only tries to interrupt a backoff delay that no longer exists:
case ConnectivityState.Idle:
    connectionRequested = true; // starts new loop
    break;
case ConnectivityState.TransientFailure:
    _delayInterruptTcs?.TrySetResult(null); // no-op: loop already exited, TCS is null
    return;
Anything else we should know about your project / environment?
Why ConnectResult.Timeout doesn't fully fix this:
The ConnectResult.Timeout path in TryConnectAsync correctly handles the case where the OCE is thrown inside socket.ConnectAsync (the connect timeout cancels the token, socket throws OCE, and TryConnectAsync returns ConnectResult.Timeout → subchannel transitions to Idle). However, it does not cover:
- Line 395:
ThrowIfCancellationRequested() called after TryConnectAsync already returned ConnectResult.Failure
- Line 425:
ThrowIfCancellationRequested() called when a backoff delay is interrupted
Suggested fix:
Transition to Idle in the OperationCanceledException catch block:
catch (OperationCanceledException)
{
    SubchannelLog.ConnectCanceled(_logger, Id);
    UpdateConnectivityState(ConnectivityState.Idle, "Connect canceled."); // ← ADD THIS
}
This restores the invariant that RequestConnection() relies on: if the retry loop is not running, the subchannel must be in Idle so a new loop can be started. The same change should be applied to the catch (Exception ex) block (line 436), which currently sets TransientFailure — it should transition to Idle instead.
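To illustrate the invariant, here is a toy model of the state machine with the proposed transition behind a flag. It is a simplification under stated assumptions, not the library code: ToyState, ToySubchannel, and ToyScenario are hypothetical names, a failed attempt is assumed to always end on the cancellation path, and backoff is not modelled.

```csharp
using System;

public enum ToyState { Idle, Connecting, Ready, TransientFailure }

// Toy model of the subchannel; mirrors the behavior described in this issue only.
public sealed class ToySubchannel
{
    private readonly bool _applyFix;

    public ToySubchannel(bool applyFix) => _applyFix = applyFix;

    public ToyState State { get; private set; } = ToyState.Idle;

    public void RequestConnection(bool serverUp)
    {
        // As in the quoted switch: only Idle starts a new connect loop.
        // In TransientFailure there is no backoff delay left to interrupt.
        if (State != ToyState.Idle)
        {
            return;
        }

        State = ToyState.Connecting;
        if (serverUp)
        {
            State = ToyState.Ready;
            return;
        }

        // Attempt failed; assume the connect timeout then cancels the loop.
        State = ToyState.TransientFailure;
        OnConnectCanceled();
    }

    // Stand-in for the catch (OperationCanceledException) block.
    private void OnConnectCanceled()
    {
        if (_applyFix)
        {
            State = ToyState.Idle; // the proposed transition
        }
    }
}

public static class ToyScenario
{
    // The server is down for the first request and back up for the second.
    public static ToyState Run(bool applyFix)
    {
        var subchannel = new ToySubchannel(applyFix);
        subchannel.RequestConnection(serverUp: false);
        subchannel.RequestConnection(serverUp: true);
        return subchannel.State;
    }
}
```

In this model, ToyScenario.Run(applyFix: false) ends in TransientFailure because the second request is ignored, while ToyScenario.Run(applyFix: true) ends in Ready.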