Fix background database account refresh stopping in multi-writer accounts #48758
Draft
jeet1995 wants to merge 3 commits into Azure:main
…unts
In multi-writer accounts, refreshLocationPrivateAsync() stops the background refresh timer when shouldRefreshEndpoints() returns false. This means topology changes (e.g., multi-write to single-write transitions) go undetected until the next explicit refresh trigger.
The .NET SDK (azure-cosmos-dotnet-v3) correctly continues the background refresh loop unconditionally - the loop only stops when canRefreshInBackground is explicitly false, not when shouldRefreshEndpoints returns false.
This fix adds startRefreshLocationTimerAsync() to the else-branch of refreshLocationPrivateAsync(), ensuring the background timer always reschedules itself regardless of whether endpoints currently need refreshing.
Without this fix, after a multi-write -> single-write -> multi-write transition, reads remain stuck on the primary region because the SDK never re-reads account metadata to learn about the restored multi-write topology.
Unit tests updated:
- backgroundRefreshForMultiMaster: assertTrue (timer must keep running)
- backgroundRefreshDetectsTopologyChangeForMultiMaster: new test proving MW->SW transition detection via mock
Related: PR Azure#6139 (point #4 in description acknowledged this bug)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from c95fb7b to 2048abe
…W switch, SW offline)
Kusto-backed evidence with charts for PR Azure#48758 validation.
Accounts: bgrefresh-mw-test-440 (multi-writer), bgrefresh-sw-test-440 (single-writer)
Branch: fix/background-refresh-multi-writer @ 2048abe
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tions, SW switch, SW offline)" This reverts commit c9fc5c4.
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Problem
The GlobalEndpointManager background refresh timer silently stops in multi-writer accounts, preventing the SDK from detecting topology changes (e.g., multi-write to single-write transitions).
Root Cause
In refreshLocationPrivateAsync(), when LocationCache.shouldRefreshEndpoints() returns false, the timer is never restarted.
For multi-writer accounts, shouldRefreshEndpoints() returns false when the preferred write endpoint matches the current primary -- a steady-state condition. Once that happens, no further background refreshes occur for the lifetime of the client. The bug has existed since PR #6139 (Nov 2019, point #4 in description).
Behavioral Difference with .NET SDK
The .NET SDK handles this correctly in StartLocationBackgroundRefreshLoop() -- it only terminates when canRefreshInBackground is explicitly false, continuing even when ShouldRefreshEndpoints() returns false.
Fix
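As a rough illustration of the change, the following is a simplified, self-contained Java model of the refresh loop. It is not the SDK's code: the class `RefreshLoopSketch`, the flag `rescheduleInElseBranch`, and the helper `simulate` are hypothetical names introduced here, and the real implementation is asynchronous and timer-driven rather than a plain loop. The sketch only assumes what the description states: each timer expiry re-reads the account, and the timer is re-armed from inside `refreshLocationPrivateAsync()`.

```java
import java.util.function.BooleanSupplier;

/** Simplified, hypothetical model of the background refresh loop; not the SDK's actual code. */
final class RefreshLoopSketch {
    // Stands in for LocationCache.shouldRefreshEndpoints().
    private final BooleanSupplier shouldRefreshEndpoints;
    // false models the behavior before the fix, true the behavior after it.
    private final boolean rescheduleInElseBranch;
    // Assumption: the one-shot timer is armed once when the client starts.
    private boolean timerScheduled = true;
    // Counts how often the account metadata was re-read.
    private int accountReads = 0;

    RefreshLoopSketch(BooleanSupplier shouldRefreshEndpoints, boolean rescheduleInElseBranch) {
        this.shouldRefreshEndpoints = shouldRefreshEndpoints;
        this.rescheduleInElseBranch = rescheduleInElseBranch;
    }

    /** Models one timer expiry: read the account, then decide whether to re-arm the timer. */
    void fireTimerIfScheduled() {
        if (!timerScheduled) {
            return; // the loop has stopped, so the account is never re-read
        }
        timerScheduled = false; // the timer is one-shot
        accountReads++;
        refreshLocationPrivate();
    }

    /** Models the branch structure of refreshLocationPrivateAsync() as described above. */
    private void refreshLocationPrivate() {
        if (shouldRefreshEndpoints.getAsBoolean()) {
            timerScheduled = true; // endpoints need refreshing: keep the loop going
        } else if (rescheduleInElseBranch) {
            timerScheduled = true; // the fix: re-arm the timer even when nothing needs refreshing
        }
        // Before the fix, the else branch returned without re-arming the timer.
    }

    int accountReads() {
        return accountReads;
    }

    /** Fires the timer up to `intervals` times and returns the number of account reads. */
    static int simulate(boolean rescheduleInElseBranch, int intervals) {
        // Multi-writer steady state: shouldRefreshEndpoints() keeps returning false.
        RefreshLoopSketch loop = new RefreshLoopSketch(() -> false, rescheduleInElseBranch);
        for (int i = 0; i < intervals; i++) {
            loop.fireTimerIfScheduled();
        }
        return loop.accountReads();
    }

    public static void main(String[] args) {
        System.out.println("before fix: " + simulate(false, 5) + " account read(s)");
        System.out.println("after fix:  " + simulate(true, 5) + " account read(s)");
    }
}
```

In this model the pre-fix loop reads the account once and then goes quiet, while the post-fix loop reads it on every interval, which is what lets a later topology change be noticed.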
Add startRefreshLocationTimerAsync() to the else branch of refreshLocationPrivateAsync().
Unit Tests
6/6 pass:
- backgroundRefreshForMultiMaster: Updated assertion -- timer must keep running
- backgroundRefreshDetectsTopologyChangeForMultiMaster: New -- simulates MW-to-SW transition via mock
Live DR Drill Validation (4 Scenarios)
Date: 2026-04-10 22:10Z -- 2026-04-11 00:32Z | Branch: fix/background-refresh-multi-writer @ 2048abeca
All scenarios used Direct + Gateway modes simultaneously. Kusto data from BackendEndRequest5M (Direct) and Request5M (Gateway).
Accounts
- bgrefresh-mw-test-440 (multi-writer)
- bgrefresh-sw-test-440 (single-writer)
Scenario 1: MW -- Offline Secondary Region
Global endpoint, preferred = West US. Offline West US, observe failover to East US.
PASS -- Failover to East US in ~4 min. 32 GEM refreshes. West US traffic resumed after restore.
Scenario 2: MW -- MW-to-SW-to-MW Transition (Core PR validation)
Regional endpoint (westus.documents.azure.com), no preferred region. Disable then re-enable multi-write.
PASS -- Both transitions detected. MW-to-SW in ~3.5 min (writes shifted to EUS). SW-to-MW in ~1 min (writes returned to WUS). 28 GEM refreshes.
Scenario 3: SW -- Switch Write Region
Global endpoint, preferred = East US. Switch write EUS-to-WUS.
PASS -- Writes on WUS within 1 Kusto bucket. 20 GEM refreshes.
Scenario 4: SW -- Offline Write Region
Global endpoint, preferred = East US. Offline East US.
PASS -- Full failover to WUS in ~3 min. 32 GEM refreshes.
Backend Success Rates
Direct mode (BackendEndRequest5M)
- dr-off-direct-write
- dr-off-direct-read
- dr-mwsw-direct-write
- dr-mwsw-direct-read
- dr-direct-write
- dr-direct-read
- dr-off-direct-write
- dr-off-direct-read
Gateway mode (Request5M)
- dr-off-gw-write
- dr-off-gw-read
- dr-mwsw-gw-write
- dr-mwsw-gw-read
- dr-gw-write
- dr-gw-read
- dr-off-gw-write
- dr-off-gw-read
Verdict
Kusto Queries Used
Changes
- GlobalEndpointManager.java
- GlobalEndPointManagerTest.java