fix(spanner): fix grpc-gcp affinity cleanup and multiplexed channel usage leaks #12726

Open
rahul2393 wants to merge 10 commits into main from fix-12693-b-494614610
Conversation

Contributor

@rahul2393 rahul2393 commented Apr 9, 2026

Summary

This change fixes two regressions in the Spanner Java client introduced around multiplexed session initialization and
grpc-gcp affinity handling.

1. Fix static CHANNEL_USAGE leak in MultiplexedSessionDatabaseClient

Fixes: #12693
MultiplexedSessionDatabaseClient stored per-SpannerImpl channel usage state in a static map but never removed
entries on close. After multiplexed client creation became unconditional in SpannerImpl.getDatabaseClient(),
applications that repeatedly created and closed Spanner instances could retain closed SpannerImpl objects, gRPC
channels, and related transport state indefinitely.

This change:

  • replaces the static Map<SpannerImpl, BitSet> with reference-counted shared state
  • removes the map entry when the last MultiplexedSessionDatabaseClient for a given SpannerImpl closes
  • preserves sharing semantics for multiple database clients created from the same SpannerImpl
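A minimal sketch of the reference-counting scheme described above (class and method names are illustrative, not the exact client code): usage state is shared per owning Spanner instance, and the last release removes the map entry so closed instances are no longer retained.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: reference-counted shared channel-usage state keyed by
// the owning Spanner instance. Sharing semantics are preserved while the
// instance is in use; the last release removes the map entry.
final class SharedChannelUsage {
  final BitSet channelUsage;
  private int refCount;

  private static final Map<Object, SharedChannelUsage> CHANNEL_USAGE = new HashMap<>();

  private SharedChannelUsage(int numChannels) {
    this.channelUsage = new BitSet(numChannels);
  }

  // Called when a multiplexed client is created for the given Spanner instance.
  static synchronized SharedChannelUsage retain(Object spanner, int numChannels) {
    SharedChannelUsage shared =
        CHANNEL_USAGE.computeIfAbsent(spanner, k -> new SharedChannelUsage(numChannels));
    shared.refCount++;
    return shared;
  }

  // Called from close(); removes the entry when the last client releases it,
  // so a closed SpannerImpl is not pinned by the static map.
  static synchronized void release(Object spanner) {
    SharedChannelUsage shared = CHANNEL_USAGE.get(spanner);
    if (shared != null && --shared.refCount == 0) {
      CHANNEL_USAGE.remove(spanner);
    }
  }

  static synchronized int trackedInstances() {
    return CHANNEL_USAGE.size();
  }
}
```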

2. Stop using bitset / bounded % numChannels affinity for grpc-gcp

For grpc-gcp-enabled paths, channel affinity should use the raw random channel hint and rely on explicit unbind /
cleanup, rather than:

  • bitset reservation
  • collapsing the hint with % numChannels

This change:

  • keeps grpc-gcp on raw random affinity keys in GapicSpannerRpc
  • removes % numChannels mapping from the grpc-gcp call path
  • keeps the old non-grpc-gcp GAX affinity behavior unchanged
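The difference between the two affinity schemes can be sketched as follows (hypothetical helper methods, not the actual GapicSpannerRpc code): collapsing the hint with % numChannels makes distinct transactions collide on a small set of keys, which defeats per-key unbind/cleanup, so the grpc-gcp path now passes the raw hint through.

```java
// Illustrative sketch of the two affinity-key schemes.
final class AffinityKeys {
  // Old GAX-pool behavior (unchanged for non-grpc-gcp): the random hint is
  // collapsed onto a fixed-size channel pool.
  static long gaxChannelIndex(long channelHint, int numChannels) {
    return Math.abs(channelHint % numChannels);
  }

  // grpc-gcp behavior after this change: the raw random hint is used as the
  // affinity key directly, and cleanup is handled by explicit unbind.
  static long grpcGcpAffinityKey(long channelHint) {
    return channelHint;
  }
}
```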

3. Add explicit grpc-gcp affinity cleanup for multi-use read-only transaction close

Multi-use read-only transactions reuse a single random channel hint for the lifetime of the transaction. That hint
should remain stable across all reads in the transaction, then be explicitly cleaned up when the transaction closes.

This change:

  • adds a new RPC cleanup hook that carries both transaction id and channel hint
  • invokes that cleanup from multi-use read-only transaction close()
  • unbinds grpc-gcp affinity on transaction close without issuing an ExecuteSql RPC to Spanner
  • handles both:
    • location API disabled: cleanup via the underlying grpc-gcp channel
    • location API enabled: cleanup via KeyAwareChannel, using the routed endpoint associated with the transaction
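The shape of the cleanup hook can be sketched like this (the hook name clearTransactionAndChannelAffinity appears in this PR; the surrounding class and field names here are illustrative): the transaction holds one channel hint for its lifetime and unbinds it on close() without any round-trip to Spanner.

```java
// Illustrative sketch of the cleanup hook carrying both transaction id and
// channel hint, invoked from multi-use read-only transaction close().
interface TransactionAffinityCleaner {
  void clearTransactionAndChannelAffinity(String transactionId, long channelHint);
}

final class MultiUseReadOnlyTx implements AutoCloseable {
  private final String transactionId;
  private final long channelHint; // stable across all reads in this transaction
  private final TransactionAffinityCleaner rpc;

  MultiUseReadOnlyTx(String transactionId, long channelHint, TransactionAffinityCleaner rpc) {
    this.transactionId = transactionId;
    this.channelHint = channelHint;
    this.rpc = rpc;
  }

  @Override
  public void close() {
    // Unbind grpc-gcp affinity locally; no ExecuteSql RPC is issued.
    rpc.clearTransactionAndChannelAffinity(transactionId, channelHint);
  }
}
```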

4. Apply grpc-gcp affinity cleanup settings without implicitly enabling DCP

grpc-gcp affinity cleanup settings (affinityKeyLifetime, cleanupInterval) should be applied whether or not dynamic channel pool (DCP) is enabled.

This change:

  • always passes grpc-gcp channel-pool options when grpc-gcp is enabled
  • preserves full pool settings only when DCP is enabled
  • passes cleanup-only channel-pool options when DCP is disabled:
    • affinityKeyLifetime
    • cleanupInterval
    • dynamic scaling explicitly disabled

@rahul2393 rahul2393 requested review from a team as code owners April 9, 2026 07:47
@rahul2393 rahul2393 requested a review from olavloite April 9, 2026 07:49
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements explicit cleanup for gRPC-GCP channel affinity and introduces reference counting for shared channel usage in MultiplexedSessionDatabaseClient to ensure proper resource management. Key changes include the addition of clearTransactionAndChannelAffinity to the SpannerRpc interface and its implementations, allowing the system to unbind affinity keys when transactions or single-use operations complete. Additionally, GapicSpannerRpc was updated to propagate affinity cleanup settings even when dynamic channel pooling is disabled. Review feedback suggests adding a null check in getGrpcGcpChannelPoolOptions to prevent a potential NullPointerException and recommends consolidating the clearChannelHintAffinity implementation to avoid code duplication between KeyAwareChannel and GapicSpannerRpc.

Comment on lines +829 to +838
    GcpChannelPoolOptions channelPoolOptions = options.getGcpChannelPoolOptions();
    if (options.isDynamicChannelPoolEnabled()) {
      return channelPoolOptions;
    }

    return GcpChannelPoolOptions.newBuilder()
        .disableDynamicScaling()
        .setAffinityKeyLifetime(channelPoolOptions.getAffinityKeyLifetime())
        .setCleanupInterval(channelPoolOptions.getCleanupInterval())
        .build();
Contributor

medium

The method getGrpcGcpChannelPoolOptions assumes that options.getGcpChannelPoolOptions() returns a non-null value. If it returns null, a NullPointerException will occur when calling getAffinityKeyLifetime() or getCleanupInterval(). While SpannerOptions typically provides a default, adding a null check or ensuring a non-null value is safer for robustness.

  static GcpChannelPoolOptions getGrpcGcpChannelPoolOptions(SpannerOptions options) {
    GcpChannelPoolOptions channelPoolOptions = options.getGcpChannelPoolOptions();
    if (options.isDynamicChannelPoolEnabled()) {
      return channelPoolOptions;
    }

    GcpChannelPoolOptions.Builder builder = GcpChannelPoolOptions.newBuilder().disableDynamicScaling();
    if (channelPoolOptions != null) {
      builder.setAffinityKeyLifetime(channelPoolOptions.getAffinityKeyLifetime())
          .setCleanupInterval(channelPoolOptions.getCleanupInterval());
    }
    return builder.build();
  }

Contributor Author

gcpChannelPoolOptions is always initialized to the merged user options, and the builder setter also rejects null; adding a defensive null branch would just hide invariant breaks.

# Conflicts:
#	java-spanner/google-cloud-spanner/src/test/java/com/google/cloud/spanner/spi/v1/GapicSpannerRpcTest.java
@rahul2393 rahul2393 force-pushed the fix-12693-b-494614610 branch from 3599670 to ce64c9e on April 9, 2026 07:57
    synchronized (CHANNEL_USAGE) {
      CHANNEL_USAGE.putIfAbsent(sessionClient.getSpanner(), new BitSet(numChannels));
      this.channelUsage = CHANNEL_USAGE.get(sessionClient.getSpanner());
      SharedChannelUsage sharedChannelUsage = CHANNEL_USAGE.get(this.spanner);

Contributor

I think that we should seriously consider removing this manual channel distribution logic from this class. It was introduced when using the Gax channel pool to prevent low-QPS from sticking to just one channel. My understanding is that grpc-gcp will do that automatically, as it falls back to a round-robin scheme when there is low load. That would significantly simplify this class without breaking the purpose of what this was intended to do.

Contributor Author

Customers still have the option to disable grpc-gcp and switch to GAX, so this cleanup should be done later.

    // read-only transactions tend to keep picking the same idle channel, so keep reads
    // overlapping to verify distribution across the fixed-size pool.
    mockSpanner.setExecuteStreamingSqlExecutionTime(
        SimulatedExecutionTime.ofMinimumAndRandomTime(500, 0));

Contributor

Would it be possible to just freeze the mock server, wait for the server to contain N requests, and then unfreeze it, instead of adding 500ms execution time for each query? I think that would achieve the same with less execution time. (There should be a util method in the mock server for 'waitForRequests' or something like that.)

Contributor Author

I tried this approach, but the current mock server freeze() is global and blocks earlier RPCs like session creation/transaction setup before enough ExecuteStreamingSql requests are enqueued. That made the test deadlock/time out. I kept the per-query streaming delay for now because it is deterministic and keeps the overlap in the specific RPC path we want to exercise.

If we want to switch to freeze/wait/unfreeze, I think we first need a mock-server utility that can either freeze only ExecuteStreamingSql or wait for an exact request count without globally blocking unrelated RPCs.
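Such a utility could look roughly like this (a hypothetical sketch, not an existing mock-server API): count only ExecuteStreamingSql requests and block the test until N of them have arrived, leaving unrelated RPCs like session creation unfrozen.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a per-RPC request waiter that avoids a global freeze.
// The mock server would call onExecuteStreamingSql() from its handler; the
// test calls waitForRequests(n, timeout) before letting queries complete.
final class RequestWaiter {
  private final AtomicInteger count = new AtomicInteger();

  void onExecuteStreamingSql() {
    count.incrementAndGet();
  }

  void waitForRequests(int n, long timeoutMillis) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (count.get() < n) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException("timed out waiting for " + n + " requests");
      }
      Thread.sleep(10); // simple polling; a latch or Condition would also work
    }
  }
}
```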

@rahul2393 rahul2393 requested a review from olavloite April 9, 2026 11:08

Development

Successfully merging this pull request may close these issues.

[spanner]: MultiplexedSessionDatabaseClient.CHANNEL_USAGE static HashMap leaks SpannerImpl instances on close
