
chore(spanner): optimize lock contention and skipped tablet reporting #12719

Open
rahul2393 wants to merge 3 commits into main from optimize-locks-skipped-tablet-uid

Conversation

Contributor

@rahul2393 rahul2393 commented Apr 8, 2026

Summary

  • Replace synchronized with ReadWriteLock in KeyRangeCache to reduce lock contention on the hot path. Read operations (findServer, getActiveAddresses, size, debugString) now take a read lock, allowing concurrent lookups. Group updates are performed outside the write lock to minimize critical section duration.
  • Move cache updates to an async executor in ChannelFinder via updateAsync(), using a sequential executor backed by a shared daemon thread pool. This prevents CacheUpdate processing from blocking the gRPC response listener thread.
  • Report skipped tablets for recently evicted TRANSIENT_FAILURE endpoints. Previously, when an endpoint was evicted after repeated TRANSIENT_FAILURE probes and not yet recreated, its tablets were silently skipped — the server never learned the client considered them unhealthy. Now EndpointLifecycleManager tracks addresses evicted for TRANSIENT_FAILURE and KeyRangeCache includes their tablet UIDs in skipped_tablets, giving the server better routing visibility.
  • Deduplicate skipped tablet UIDs to avoid reporting the same tablet multiple times when it appears across replicas.
  • Report known TRANSIENT_FAILURE replicas even when an earlier healthy replica was already selected, so the server gets a complete picture of unhealthy tablets for the group.
  • Mark CachedRange.lastAccess as volatile for safe cross-thread reads.
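
The read/write lock split in the first bullet can be sketched as follows. This is a minimal illustration of the pattern, not the client library's actual API; the class and method names are stand-ins.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the hot-path pattern described above: lookups take a shared
// read lock so they can run concurrently, while mutations take the
// exclusive write lock. Names are illustrative, not the PR's real code.
public class RangeCacheSketch {
  private final ReadWriteLock rwLock = new ReentrantReadWriteLock();
  // Maps a range start key to the server that owns the range.
  private final TreeMap<String, String> ranges = new TreeMap<>();

  // Read path (e.g. findServer): many threads may hold the read lock at once.
  public String findServer(String key) {
    rwLock.readLock().lock();
    try {
      Map.Entry<String, String> e = ranges.floorEntry(key);
      return e == null ? null : e.getValue();
    } finally {
      rwLock.readLock().unlock();
    }
  }

  // Write path (e.g. cache updates): exclusive, so kept as short as possible.
  public void putRange(String startKey, String server) {
    rwLock.writeLock().lock();
    try {
      ranges.put(startKey, server);
    } finally {
      rwLock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    RangeCacheSketch cache = new RangeCacheSketch();
    cache.putRange("a", "server-1");
    cache.putRange("m", "server-2");
    System.out.println(cache.findServer("k")); // prints server-1
  }
}
```

Because `ReentrantReadWriteLock` admits any number of concurrent readers, lookups no longer serialize behind one another; only updates block readers, which is why keeping group updates outside the write lock shortens the critical section.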

@rahul2393 rahul2393 requested review from a team as code owners April 8, 2026 18:32
@rahul2393 rahul2393 requested a review from olavloite April 8, 2026 18:32

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces asynchronous cache updates in ChannelFinder and KeyAwareChannel to improve performance, and enhances EndpointLifecycleManager to track and report transient failure evictions. Additionally, it refactors KeyRangeCache to use ReentrantReadWriteLock for better concurrency. My review highlights critical issues: the addRanges method in KeyRangeCache is not exception-safe and risks reference count leaks, the readLock release in fillRoutingHint creates a race condition that requires CachedGroup to be fully synchronized, and the use of Executors.newCachedThreadPool in ChannelFinder poses a risk of resource exhaustion.

Comment on lines 102 to 128
  public void addRanges(CacheUpdate cacheUpdate) {
    List<CachedGroup> touchedGroups = new ArrayList<>();
    writeLock.lock();
    try {
      for (Group groupIn : cacheUpdate.getGroupList()) {
        touchedGroups.add(findOrInsertGroup(groupIn.getGroupUid()));
      }
    } finally {
      writeLock.unlock();
    }

    for (int i = 0; i < cacheUpdate.getGroupCount(); i++) {
      touchedGroups.get(i).update(cacheUpdate.getGroup(i));
    }

    writeLock.lock();
    try {
      for (Range rangeIn : cacheUpdate.getRangeList()) {
        replaceRangeIfNewer(rangeIn);
      }
      for (CachedGroup group : touchedGroups) {
        unref(group);
      }
    } finally {
      writeLock.unlock();
    }
  }

high

The addRanges implementation is not exception-safe. If an exception occurs during update() or replaceRangeIfNewer(), the unref() calls will be skipped, leading to a reference count leak. Following repository guidelines for resource management, the unref() logic should be placed in a finally block and executed in reverse order (LIFO) to ensure all resources are released even if exceptions occur.

  public void addRanges(CacheUpdate cacheUpdate) {
    List<CachedGroup> touchedGroups = new ArrayList<>();
    writeLock.lock();
    try {
      for (Group groupIn : cacheUpdate.getGroupList()) {
        touchedGroups.add(findOrInsertGroup(groupIn.getGroupUid()));
      }
    } finally {
      writeLock.unlock();
    }

    try {
      for (int i = 0; i < cacheUpdate.getGroupCount(); i++) {
        touchedGroups.get(i).update(cacheUpdate.getGroup(i));
      }

      writeLock.lock();
      try {
        for (Range rangeIn : cacheUpdate.getRangeList()) {
          replaceRangeIfNewer(rangeIn);
        }
      } finally {
        writeLock.unlock();
      }
    } finally {
      writeLock.lock();
      try {
        for (int i = touchedGroups.size() - 1; i >= 0; i--) {
          unref(touchedGroups.get(i));
        }
      } finally {
        writeLock.unlock();
      }
    }
  }
References
  1. Ensure resources are released in reverse order of creation (LIFO) and that the implementation is exception-safe using finally blocks to prevent leaks.

Comment on lines +154 to +158
readLock.lock();
try {
targetRange = findRangeLocked(key, hintBuilder.getLimitKey(), rangeMode);
} finally {
readLock.unlock();

high

Releasing the readLock before calling targetRange.group.fillRoutingHint() introduces a race condition because addRanges now performs group updates outside the global lock. Since CachedGroup relies on monitor-based synchronization (as seen in getActiveAddresses at line 180), both CachedGroup.update and CachedGroup.fillRoutingHint must be synchronized to ensure thread safety. Without this, concurrent updates and reads of the group's internal state (like tablets and leaderIndex) could lead to inconsistent behavior or data corruption.
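
The fix the review asks for can be sketched as below: `CachedGroup` guards its own mutable state with monitor-based synchronization, so an `update` running outside the cache-wide lock cannot race with a routing-hint read. The field and method signatures here are assumptions for illustration, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: both the writer path (update) and the reader path
// are synchronized on the group's own monitor, so a reader always sees a
// consistent tablets/leaderIndex pair even while updates run outside the
// global write lock. Names are stand-ins, not the library's real API.
public class CachedGroupSketch {
  private List<String> tablets = new ArrayList<>();
  private int leaderIndex = -1;

  // Writer path: invoked from addRanges after the global write lock is dropped.
  public synchronized void update(List<String> newTablets, int newLeaderIndex) {
    this.tablets = new ArrayList<>(newTablets);
    this.leaderIndex = newLeaderIndex;
  }

  // Reader path: invoked from routing-hint code after the read lock is
  // released; the monitor guarantees it never observes a torn update.
  public synchronized String leaderTablet() {
    return (leaderIndex >= 0 && leaderIndex < tablets.size())
        ? tablets.get(leaderIndex)
        : null;
  }

  public static void main(String[] args) {
    CachedGroupSketch group = new CachedGroupSketch();
    group.update(Arrays.asList("tablet-1", "tablet-2"), 1);
    System.out.println(group.leaderTablet()); // prints tablet-2
  }
}
```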

Comment on lines +54 to +60
private static final ExecutorService CACHE_UPDATE_POOL =
Executors.newCachedThreadPool(
r -> {
Thread t = new Thread(r, "spanner-cache-update");
t.setDaemon(true);
return t;
});

medium

Using Executors.newCachedThreadPool() for a static shared pool can lead to unbounded thread creation if many ChannelFinder instances (e.g., in a multi-database environment) receive updates simultaneously. Consider using a ThreadPoolExecutor with a defined maximum pool size to prevent potential resource exhaustion.
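
One way to apply this suggestion is sketched below. The cap of 4 workers and the 60-second idle timeout are illustrative placeholders, not values taken from the PR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hedged sketch of a bounded replacement for newCachedThreadPool(): the pool
// never grows past a fixed cap, and excess tasks queue instead of spawning
// new threads. Pool sizes here are example values only.
public class BoundedPoolSketch {
  static final ExecutorService CACHE_UPDATE_POOL = newBoundedPool();

  private static ExecutorService newBoundedPool() {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(
            4, 4,                        // fixed cap of 4 worker threads
            60L, TimeUnit.SECONDS,       // let idle workers exit
            new LinkedBlockingQueue<>(), // excess tasks queue rather than spawn
            r -> {
              Thread t = new Thread(r, "spanner-cache-update");
              t.setDaemon(true);
              return t;
            });
    // Without this, core threads never time out even when idle.
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }

  public static void main(String[] args) throws Exception {
    Future<Integer> f = CACHE_UPDATE_POOL.submit(() -> 42);
    System.out.println(f.get()); // prints 42
    CACHE_UPDATE_POOL.shutdown();
  }
}
```

Note that with an unbounded `LinkedBlockingQueue`, `ThreadPoolExecutor` only creates threads beyond `corePoolSize` once the queue is full, which never happens; that is why the core and maximum sizes are set to the same cap, with `allowCoreThreadTimeOut(true)` to reclaim idle workers.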
