
chore(spanner): optimize lock contention and skipped tablet reporting #12719

Open
rahul2393 wants to merge 3 commits into main from optimize-locks-skipped-tablet-uid

Conversation

Contributor

@rahul2393 rahul2393 commented Apr 8, 2026

Summary

  • Replace synchronized with ReadWriteLock in KeyRangeCache to reduce lock contention on the hot path. Read operations (findServer, getActiveAddresses, size, debugString) now take a read lock, allowing concurrent lookups. Group updates are performed outside the write lock to minimize critical section duration.
  • Move cache updates to an async executor in ChannelFinder via updateAsync(), using a sequential executor backed by a shared daemon thread pool. This prevents CacheUpdate processing from blocking the gRPC response listener thread.
  • Report skipped tablets for recently evicted TRANSIENT_FAILURE endpoints. Previously, when an endpoint was evicted after repeated TRANSIENT_FAILURE probes and not yet recreated, its tablets were silently skipped — the server never learned the client considered them unhealthy. Now EndpointLifecycleManager tracks addresses evicted for TRANSIENT_FAILURE and KeyRangeCache includes their tablet UIDs in skipped_tablets, giving the server better routing visibility.
  • Deduplicate skipped tablet UIDs to avoid reporting the same tablet multiple times when it appears across replicas.
  • Report known TRANSIENT_FAILURE replicas even when an earlier healthy replica was already selected, so the server gets a complete picture of unhealthy tablets for the group.
  • Mark CachedRange.lastAccess as volatile for safe cross-thread reads.
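
The read/write lock split in the first bullet can be sketched as follows. This is a minimal illustration of the pattern, not the client library's actual API; the class and method names are stand-ins.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the hot-path pattern described above: lookups take a shared
// read lock so they can run concurrently, while mutations take the
// exclusive write lock. Names are illustrative, not the PR's real code.
public class RangeCacheSketch {
  private final ReadWriteLock rwLock = new ReentrantReadWriteLock();
  // Maps a range start key to the server that owns the range.
  private final TreeMap<String, String> ranges = new TreeMap<>();

  // Read path (e.g. findServer): many threads may hold the read lock at once.
  public String findServer(String key) {
    rwLock.readLock().lock();
    try {
      Map.Entry<String, String> e = ranges.floorEntry(key);
      return e == null ? null : e.getValue();
    } finally {
      rwLock.readLock().unlock();
    }
  }

  // Write path (e.g. cache updates): exclusive, so kept as short as possible.
  public void putRange(String startKey, String server) {
    rwLock.writeLock().lock();
    try {
      ranges.put(startKey, server);
    } finally {
      rwLock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    RangeCacheSketch cache = new RangeCacheSketch();
    cache.putRange("a", "server-1");
    cache.putRange("m", "server-2");
    System.out.println(cache.findServer("k")); // prints server-1
  }
}
```

Because `ReentrantReadWriteLock` admits any number of concurrent readers, lookups no longer serialize behind one another; only updates block readers, which is why keeping group updates outside the write lock shortens the critical section.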

@rahul2393 rahul2393 requested review from a team as code owners April 8, 2026 18:32
@rahul2393 rahul2393 requested a review from olavloite April 8, 2026 18:32

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces asynchronous cache updates in ChannelFinder and KeyAwareChannel to improve performance, and enhances EndpointLifecycleManager to track and report transient failure evictions. Additionally, it refactors KeyRangeCache to use ReentrantReadWriteLock for better concurrency. My review highlights critical issues: the addRanges method in KeyRangeCache is not exception-safe and risks reference count leaks, the readLock release in fillRoutingHint creates a race condition that requires CachedGroup to be fully synchronized, and the use of Executors.newCachedThreadPool in ChannelFinder poses a risk of resource exhaustion.

Comment on lines 102 to 128
  public void addRanges(CacheUpdate cacheUpdate) {
    List<CachedGroup> touchedGroups = new ArrayList<>();
    writeLock.lock();
    try {
      for (Group groupIn : cacheUpdate.getGroupList()) {
        touchedGroups.add(findOrInsertGroup(groupIn.getGroupUid()));
      }
    } finally {
      writeLock.unlock();
    }

    for (int i = 0; i < cacheUpdate.getGroupCount(); i++) {
      touchedGroups.get(i).update(cacheUpdate.getGroup(i));
    }

    writeLock.lock();
    try {
      for (Range rangeIn : cacheUpdate.getRangeList()) {
        replaceRangeIfNewer(rangeIn);
      }
      for (CachedGroup group : touchedGroups) {
        unref(group);
      }
    } finally {
      writeLock.unlock();
    }
  }

high

The addRanges implementation is not exception-safe. If an exception occurs during update() or replaceRangeIfNewer(), the unref() calls will be skipped, leading to a reference count leak. Following repository guidelines for resource management, the unref() logic should be placed in a finally block and executed in reverse order (LIFO) to ensure all resources are released even if exceptions occur.

  public void addRanges(CacheUpdate cacheUpdate) {
    List<CachedGroup> touchedGroups = new ArrayList<>();
    writeLock.lock();
    try {
      for (Group groupIn : cacheUpdate.getGroupList()) {
        touchedGroups.add(findOrInsertGroup(groupIn.getGroupUid()));
      }
    } finally {
      writeLock.unlock();
    }

    try {
      for (int i = 0; i < cacheUpdate.getGroupCount(); i++) {
        touchedGroups.get(i).update(cacheUpdate.getGroup(i));
      }

      writeLock.lock();
      try {
        for (Range rangeIn : cacheUpdate.getRangeList()) {
          replaceRangeIfNewer(rangeIn);
        }
      } finally {
        writeLock.unlock();
      }
    } finally {
      writeLock.lock();
      try {
        for (int i = touchedGroups.size() - 1; i >= 0; i--) {
          unref(touchedGroups.get(i));
        }
      } finally {
        writeLock.unlock();
      }
    }
  }
References
  1. Ensure resources are released in reverse order of creation (LIFO) and that the implementation is exception-safe using finally blocks to prevent leaks.

Comment on lines +154 to +158
readLock.lock();
try {
targetRange = findRangeLocked(key, hintBuilder.getLimitKey(), rangeMode);
} finally {
readLock.unlock();

high

Releasing the readLock before calling targetRange.group.fillRoutingHint() introduces a race condition because addRanges now performs group updates outside the global lock. Since CachedGroup relies on monitor-based synchronization (as seen in getActiveAddresses at line 180), both CachedGroup.update and CachedGroup.fillRoutingHint must be synchronized to ensure thread safety. Without this, concurrent updates and reads of the group's internal state (like tablets and leaderIndex) could lead to inconsistent behavior or data corruption.
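
The fix the review asks for can be sketched as below: `CachedGroup` guards its own mutable state with monitor-based synchronization, so an `update` running outside the cache-wide lock cannot race with a routing-hint read. The field and method signatures here are assumptions for illustration, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: both the writer path (update) and the reader path
// are synchronized on the group's own monitor, so a reader always sees a
// consistent tablets/leaderIndex pair even while updates run outside the
// global write lock. Names are stand-ins, not the library's real API.
public class CachedGroupSketch {
  private List<String> tablets = new ArrayList<>();
  private int leaderIndex = -1;

  // Writer path: invoked from addRanges after the global write lock is dropped.
  public synchronized void update(List<String> newTablets, int newLeaderIndex) {
    this.tablets = new ArrayList<>(newTablets);
    this.leaderIndex = newLeaderIndex;
  }

  // Reader path: invoked from routing-hint code after the read lock is
  // released; the monitor guarantees it never observes a torn update.
  public synchronized String leaderTablet() {
    return (leaderIndex >= 0 && leaderIndex < tablets.size())
        ? tablets.get(leaderIndex)
        : null;
  }

  public static void main(String[] args) {
    CachedGroupSketch group = new CachedGroupSketch();
    group.update(Arrays.asList("tablet-1", "tablet-2"), 1);
    System.out.println(group.leaderTablet()); // prints tablet-2
  }
}
```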

Comment on lines +54 to +60
private static final ExecutorService CACHE_UPDATE_POOL =
Executors.newCachedThreadPool(
r -> {
Thread t = new Thread(r, "spanner-cache-update");
t.setDaemon(true);
return t;
});

medium

Using Executors.newCachedThreadPool() for a static shared pool can lead to unbounded thread creation if many ChannelFinder instances (e.g., in a multi-database environment) receive updates simultaneously. Consider using a ThreadPoolExecutor with a defined maximum pool size to prevent potential resource exhaustion.
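
One way to apply this suggestion is sketched below. The cap of 4 workers and the 60-second idle timeout are illustrative placeholders, not values taken from the PR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hedged sketch of a bounded replacement for newCachedThreadPool(): the pool
// never grows past a fixed cap, and excess tasks queue instead of spawning
// new threads. Pool sizes here are example values only.
public class BoundedPoolSketch {
  static final ExecutorService CACHE_UPDATE_POOL = newBoundedPool();

  private static ExecutorService newBoundedPool() {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(
            4, 4,                        // fixed cap of 4 worker threads
            60L, TimeUnit.SECONDS,       // let idle workers exit
            new LinkedBlockingQueue<>(), // excess tasks queue rather than spawn
            r -> {
              Thread t = new Thread(r, "spanner-cache-update");
              t.setDaemon(true);
              return t;
            });
    // Without this, core threads never time out even when idle.
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }

  public static void main(String[] args) throws Exception {
    Future<Integer> f = CACHE_UPDATE_POOL.submit(() -> 42);
    System.out.println(f.get()); // prints 42
    CACHE_UPDATE_POOL.shutdown();
  }
}
```

Note that with an unbounded `LinkedBlockingQueue`, `ThreadPoolExecutor` only creates threads beyond `corePoolSize` once the queue is full, which never happens; that is why the core and maximum sizes are set to the same cap, with `allowCoreThreadTimeOut(true)` to reclaim idle workers.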
