fix(spanner): don't set sticky AFFINITY_KEY for multiplexed sessions #12725
Open
akash329d wants to merge 1 commit into googleapis:main
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Code Review
This pull request modifies the channel affinity logic in GapicSpannerRpc.java. Specifically, it disables the setting of AFFINITY_KEY when dynamic channel pooling is disabled to prevent sticky binding issues under concurrent load, allowing the system to perform fresh least-busy channel picks instead. I have no feedback to provide.
Under grpc-gcp (default since 6.105.0, googleapis#4239), newCallContext set GcpManagedChannel.AFFINITY_KEY on every data RPC. GcpManagedChannel binds each new key via pickLeastBusyChannel, which reads activeStreamsCount before any concurrent caller's start() has incremented it (tiebreak channelRefs[0]). A high-concurrency cold start therefore binds keys to channel 0 and the bindings are sticky, so RPCs funnel through one HTTP/2 connection and queue at MAX_CONCURRENT_STREAMS.
The static-numChannels path bounded the key to ~2*numChannels-1 distinct values, making the collapse permanent (~6x throughput regression at 400 concurrent). The dynamic-channel-pool path used per-transaction random keys and largely self-corrected, but a few BitSet-recycled hints still sticky-bound, leaving a p99 tail.
Multiplexed sessions get no backend-locality benefit from sticky per-transaction channel affinity, so don't set the key under grpc-gcp at all. getChannelRef(null) does a fresh per-call least-busy pick with no sticky binding and no affinity-map growth.
Drops the now-unobservable distinct-AFFINITY_KEY assertions from RetryOnDifferentGrpcChannelMockServerTest; the request-count and session assertions still cover the retry loop.
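The cold-start race described in the commit message can be illustrated with a small deterministic model. This is a sketch under stated assumptions, not the grpc-gcp implementation: the class and method names below (LeastBusyRaceSketch, pickLeastBusy, burstBindings, sequentialBindings) are hypothetical, and the "burst" is modelled by letting every caller pick a channel before any caller increments a stream count.

```java
import java.util.Arrays;

// Hypothetical model of a least-busy pick with a lowest-index tiebreak.
// It does not call grpc-gcp; it only mimics the ordering described above.
public class LeastBusyRaceSketch {

    // Returns the index with the lowest count; ties go to the lowest index.
    static int pickLeastBusy(int[] activeStreams) {
        int best = 0;
        for (int i = 1; i < activeStreams.length; i++) {
            if (activeStreams[i] < activeStreams[best]) {
                best = i;
            }
        }
        return best;
    }

    // Cold-start burst: all callers pick before any count is incremented.
    static int[] burstBindings(int numChannels, int callers) {
        int[] activeStreams = new int[numChannels];
        int[] picks = new int[callers];
        for (int c = 0; c < callers; c++) {
            picks[c] = pickLeastBusy(activeStreams); // counts are still all zero
        }
        for (int p : picks) {
            activeStreams[p]++; // the increments arrive only after every pick
        }
        return activeStreams;
    }

    // Gentle warmup: each caller picks and starts before the next one picks.
    static int[] sequentialBindings(int numChannels, int callers) {
        int[] activeStreams = new int[numChannels];
        for (int c = 0; c < callers; c++) {
            activeStreams[pickLeastBusy(activeStreams)]++;
        }
        return activeStreams;
    }

    public static void main(String[] args) {
        System.out.println("burst:      " + Arrays.toString(burstBindings(4, 8)));
        System.out.println("sequential: " + Arrays.toString(sequentialBindings(4, 8)));
    }
}
```

With 4 channels and 8 callers, the burst model puts all 8 streams on channel 0, while the sequential model spreads them 2 per channel. In the real client the bindings are additionally sticky per key, which the model leaves out.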
55a3136 to 4738cd6
Problem
Since 6.105.0 (#4239 enabled grpc-gcp by default), a Spanner client whose first traffic is a high-concurrency burst sees throughput collapse and p50 latency climb. disableGrpcGcpExtension() restores the expected scaling.
numChannels=32: qps / p50 / p99
(single multiplexed-session client, SELECT 1; each cell is a fresh client whose first traffic is at the target concurrency. Dynamic-pool row from a separate probe with enableDynamicChannelPool() set on the builder.)
Root cause
newCallContext sets GcpManagedChannel.AFFINITY_KEY on every data RPC under grpc-gcp. GcpManagedChannel.getChannelRef(key) binds each new key via pickLeastBusyChannel() (1754-1790), which reads activeStreamsCount and breaks ties in favor of channelRefs.get(0). The count isn't incremented until later in GcpClientCall.start() (:284), so the callers in a concurrent first burst all bind to channel 0, and the bindings are sticky. Subsequent RPCs funnel through one HTTP/2 connection and queue at MAX_CONCURRENT_STREAMS.
The static-numChannels path used affinity.intValue() % numChannels (≈63 distinct keys), so the collapse is permanent. The dynamic-pool path mostly self-corrects (p50 matches OFF) because most hints are per-transaction random longs. However, MultiplexedSessionDatabaseClient.getSingleUseChannelHint allocates the first numChannels concurrent hints from a recycled BitSet (values 0..N-1), and those few sticky-bind during the warmup race and keep getting reused, leaving a p99 tail. (Separately: in our testing, setting enableDynamicChannelPool=true via JDBC/connection properties did not fully propagate to GcpManagedChannel, so DCP is not currently a workaround for connection-API users.)
Fix
Don't set AFFINITY_KEY when grpc-gcp is on. Multiplexed sessions are a single session, so sticky per-transaction channel affinity provides no backend-locality benefit. With no key, getChannelRef(null) does a fresh per-call least-busy pick with no sticky binding and no affinity-map growth, which matches the OFF curve. (Math.floorMod alone doesn't help; the race still ties bounded keys to channel 0.)
RetryOnDifferentGrpcChannelMockServerTest previously asserted distinct AFFINITY_KEY values via an interceptor; with no key set those assertions are unobservable, so they're removed (the request-count and session assertions still cover the retry loop).
Note that under grpc-gcp the opt-in spanner.retry_deadline_exceeded_on_different_channel feature now relies on grpc-gcp's per-call least-busy pick rather than a forced distinct channel. The wedged channel will normally have a higher active-stream count, so least-busy usually picks a different one, but that is no longer guaranteed. The GAX withChannelAffinity path (used when grpc-gcp is off) is unchanged.
Repro
The trigger is the client's first traffic burst being high-concurrency (e.g., a connection pool warming many connections at once); a gentle low-C warmup spreads the keys and masks the bug. Standalone single-file reproducer (only dependency: google-cloud-spanner): SpannerAffinityRepro.java
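The ≈63 distinct keys quoted in the root-cause discussion for numChannels=32 follow from how Java's % operator treats negative operands. The sketch below is illustrative only: the class and method names are hypothetical, and the hint range of -1000..1000 is an assumption chosen to cover both signs. It counts the distinct keys produced by % and by Math.floorMod.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of the key space produced by bounding an int
// hint with % versus Math.floorMod. Not taken from the Spanner client.
public class AffinityKeyRangeSketch {

    // Java's % keeps the sign of the dividend, so keys span -(n-1)..(n-1).
    static Set<Integer> remainderKeys(int numChannels, int from, int to) {
        Set<Integer> keys = new TreeSet<>();
        for (int hint = from; hint <= to; hint++) {
            keys.add(hint % numChannels);
        }
        return keys;
    }

    // Math.floorMod always returns a value in 0..(n-1) for a positive n.
    static Set<Integer> floorModKeys(int numChannels, int from, int to) {
        Set<Integer> keys = new TreeSet<>();
        for (int hint = from; hint <= to; hint++) {
            keys.add(Math.floorMod(hint, numChannels));
        }
        return keys;
    }

    public static void main(String[] args) {
        int numChannels = 32;
        System.out.println("% keys:        " + remainderKeys(numChannels, -1000, 1000).size());
        System.out.println("floorMod keys: " + floorModKeys(numChannels, -1000, 1000).size());
    }
}
```

For numChannels=32 this gives 63 keys with % (2*numChannels-1) and 32 with Math.floorMod. Either way the key space stays small and bounded, which is consistent with the note above that Math.floorMod alone does not avoid the sticky binding to channel 0.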