
fix(spanner): preserve all async cache updates#12740

Merged
rahul2393 merged 3 commits into main from cache-update-lossless-fix on Apr 10, 2026

Conversation

Contributor

@rahul2393 rahul2393 commented Apr 10, 2026

Summary

This change fixes cache updates being missed after cache application moved off the request hot path. It replaces the lossy async updater with a lossless queue, skips truly empty async cache updates, and moves endpoint lifecycle reconciliation off the cache-apply critical path.

Problem

After moving cache application off the request hot path, cache updates could either be lost or applied too slowly. In production this showed up as:

  • cache update backlog growing continuously
  • pending cache update objects increasing memory usage
  • long delays in applying updates on the executor path
  • logs showing many cache updates with little or no change in cache size

We also observed that a large number of responses carried a cache_update field that was present but effectively empty, which added queue pressure without changing routing state.
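To make the "present but effectively empty" distinction concrete, here is a minimal sketch. The `CacheUpdate` shape, `entries` map, and `isEffectivelyEmpty` helper are illustrative assumptions, not the actual response proto or client API: the point is only that a field being set is not the same as it carrying information that changes routing state.

```java
import java.util.Map;

public class EmptyUpdateDemo {
    // Hypothetical simplified stand-in for the response's cache_update field.
    record CacheUpdate(Map<String, String> entries, long databaseId) {}

    // The field can be present (non-null) yet carry nothing actionable.
    static boolean isEffectivelyEmpty(CacheUpdate u) {
        return u != null && u.entries().isEmpty() && u.databaseId() == 0;
    }

    public static void main(String[] args) {
        System.out.println(isEffectivelyEmpty(new CacheUpdate(Map.of(), 0)));        // true
        System.out.println(isEffectivelyEmpty(new CacheUpdate(Map.of("k", "v"), 0))); // false
    }
}
```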

Root Cause

There were a few separate issues:

  • async cache update handling used a lossy pending-update model, so intermediate updates could be overwritten before being applied
  • databaseId=0 updates could clear an already initialized finder state even when they did not carry meaningful cache contents
  • endpoint lifecycle reconciliation still ran in the cache update path, adding unnecessary work to cache application
  • many empty cache updates were still being enqueued and drained even though they were no-ops
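The first bullet can be sketched as follows. This is a hypothetical reconstruction of a lossy pending-update model, not the actual `ChannelFinder` code: a single slot holds "the" pending update, so a newer update silently overwrites an older one that has not been applied yet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class LossySlotDemo {
    // One slot for the pending update: setting it discards any unapplied predecessor.
    static final AtomicReference<String> pendingUpdate = new AtomicReference<>();
    static final List<String> applied = new ArrayList<>();

    static void enqueue(String update) {
        pendingUpdate.set(update); // overwrites any update not yet drained
    }

    static void drain() {
        String u = pendingUpdate.getAndSet(null);
        if (u != null) applied.add(u);
    }

    public static void main(String[] args) {
        enqueue("update-1");
        enqueue("update-2"); // update-1 is dropped before it is ever applied
        drain();
        System.out.println(applied); // [update-2]
    }
}
```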

What This PR Changes

  • replace the lossy async cache updater with a lossless FIFO queue so all material cache updates are preserved
  • keep zero-database-id updates from clearing existing cache state
  • skip truly empty async cache updates while still preserving:
    • all non-empty updates
    • database-id-only transitions that can invalidate old cache state
  • batch queue draining to reduce scheduling and lock churn
  • decouple lifecycle address reconciliation from the cache update critical section
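The queueing, filtering, and batch-drain changes above can be sketched together. This is a simplified illustration under assumed names (`CacheUpdate`, `isMaterial`, `drainAll` are not the actual `ChannelFinder` API); it uses a `ConcurrentLinkedQueue` as the lossless FIFO, as the review comment below describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class LosslessQueueDemo {
    // Hypothetical simplified update: a database id plus a count of cache entries.
    record CacheUpdate(long databaseId, int entryCount) {
        // Material if it carries entries, or is a database-id transition
        // that can invalidate previously cached state.
        boolean isMaterial(long currentDatabaseId) {
            return entryCount > 0
                || (databaseId != 0 && databaseId != currentDatabaseId);
        }
    }

    final ConcurrentLinkedQueue<CacheUpdate> queue = new ConcurrentLinkedQueue<>();
    final List<CacheUpdate> appliedUpdates = new ArrayList<>();
    long currentDatabaseId = 0;

    void offer(CacheUpdate u) {
        if (!u.isMaterial(currentDatabaseId)) {
            return; // skip truly empty updates before they add queue pressure
        }
        queue.offer(u); // lossless: nothing material is ever overwritten
    }

    // Batch drain: apply everything queued in one pass, reducing
    // per-update scheduling and lock churn.
    void drainAll() {
        CacheUpdate u;
        while ((u = queue.poll()) != null) {
            if (u.databaseId() != 0) {
                currentDatabaseId = u.databaseId(); // zero ids never clear state
            }
            appliedUpdates.add(u);
        }
    }

    public static void main(String[] args) {
        LosslessQueueDemo d = new LosslessQueueDemo();
        d.offer(new CacheUpdate(0, 0)); // empty: skipped
        d.offer(new CacheUpdate(7, 3)); // material: kept
        d.offer(new CacheUpdate(7, 2)); // material: kept, not overwritten
        d.drainAll();
        System.out.println(d.appliedUpdates.size() + " " + d.currentDatabaseId); // 2 7
    }
}
```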

Result

Based on production logs after the fix:

  • empty cache updates are being skipped in large numbers
  • material pending cache updates are shrinking instead of growing
  • cache apply time remains very small
  • routing cache size remains stable and traffic converges away from the default endpoint quickly

This indicates the main issue was queue pressure from lossy handling and empty/no-op updates, not slow cache map mutation itself.

@rahul2393 rahul2393 requested review from a team as code owners April 10, 2026 09:10
@rahul2393 rahul2393 requested a review from olavloite April 10, 2026 09:10
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the ChannelFinder class to support queuing multiple asynchronous cache updates using a ConcurrentLinkedQueue, replacing the previous logic that only retained the latest pending update. It introduces helper methods to filter out non-material updates and improves the handling of database ID transitions to prevent unnecessary cache clears. The update draining mechanism was updated to ensure all queued tasks are processed, and awaitPendingUpdates was adjusted for better synchronization in tests. New unit tests were added to verify that updates are processed in order without being dropped and to validate edge cases for database ID changes. I have no feedback to provide as there were no review comments to assess.

@rahul2393 rahul2393 enabled auto-merge (squash) April 10, 2026 10:57
@rahul2393 rahul2393 merged commit b8bf432 into main Apr 10, 2026
102 of 103 checks passed
@rahul2393 rahul2393 deleted the cache-update-lossless-fix branch April 10, 2026 11:14