Skip to content

Commit 4b1814a

Browse files
jeet1995Copilot
andcommitted
Add DR drill test results and analysis charts
Charts and report from live DR drill validation of the background refresh fix on account bgrefresh-mw-test-440 (East US + West US). 3 drills over 51 minutes, 103 background refreshes, 305 reads, 0 errors: - Drill 1: MW→SW→MW transition (both detected) - Drill 2: Region removal + re-add (detected, DIRECT mode connections persist) - Drill 3: Failover priority change (fastest propagation ~1 min) Config: Global endpoint, preferred=West US, DIRECT mode, 30s refresh (overridden via COSMOS.UNAVAILABLE_LOCATIONS_EXPIRATION_TIME_IN_SECONDS). Unpatched SDK: 0 refreshes (timer stops after init). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent ba0cbd5 commit 4b1814a

File tree

5 files changed

+155
-0
lines changed

5 files changed

+155
-0
lines changed
Lines changed: 155 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,155 @@
1+
# DR Drill Report — Background Refresh Fix Validation
2+
3+
**Account**: `bgrefresh-mw-test-440`
4+
**Resource Group**: `rg-bgrefresh-test`
5+
**Tenant**: `022026052026cdbprod.onmicrosoft.com` (ephemeral)
6+
**Regions**: East US (hub) + West US
7+
**SDK Version**: 4.80.0-beta.1 (patched with background refresh fix)
8+
**Client Config**: Global endpoint, preferred region = West US, DIRECT mode, 30s background refresh
9+
**Drill Duration**: 13:05 – 13:56 UTC (51 minutes)
10+
**Date**: 2026-04-10
11+
12+
---
13+
14+
## Executive Summary
15+
16+
| Metric | Value |
17+
|--------|-------|
18+
| Background refreshes | **103** (every ~30s, continuous) |
19+
| Total point reads | **305** |
20+
| Errors | **0** |
21+
| Metadata transitions detected | **9** |
22+
| Read availability | **100%** |
23+
24+
The patched SDK's background refresh timer ran continuously through all 3 drills, detecting every topology change. Zero read failures occurred. This validates the fix in [PR #48758](https://github.com/Azure/azure-sdk-for-java/pull/48758).
25+
26+
**Contrast with unpatched SDK**: The unpatched 4.80.0-beta.1 stopped the background refresh timer after init (0 refreshes), and topology changes were never detected.
27+
28+
---
29+
30+
## Latency Summary
31+
32+
| Scope | Count | Avg | P50 | P90 | P99 | Max |
33+
|-------|-------|-----|-----|-----|-----|-----|
34+
| **All reads** | 305 | 165ms | 145ms | 207ms | 687ms | 2007ms |
35+
| Drill 1 (MW↔SW) | 77 | 174ms | 145ms || 725ms | 725ms |
36+
| Drill 2 (Region removal) | 141 | 166ms | 142ms || 819ms | 2007ms |
37+
| Drill 3 (Failover change) | 87 | 156ms | 157ms || 270ms | 270ms |
38+
39+
**Notable latency spikes**:
40+
- 13:05 (init): 725ms — cold start, first TCP connections to West US replicas
41+
- 13:24 (West US removed): 819ms spike, then bimodal 40-100ms (gateway rerouting) / 140ms (direct)
42+
- 13:36 (West US re-added): **2007ms** — highest spike as SDK re-establishes connections to newly available region
43+
44+
---
45+
46+
## Drill 1: MW → SW → MW Transition
47+
48+
**Purpose**: Validate that the background refresh timer continues running in multi-write accounts and detects `enableMultipleWriteLocations` changes.
49+
50+
| Time (UTC) | Control Plane Action | SDK Detection (via background refresh) | Lag |
51+
|---|---|---|---|
52+
| 13:07:11 | MW DISABLED |||
53+
| 13:10:02 ||`MW=false W=1 R=2` | ~3 min |
54+
| 13:13:15 | MW RE-ENABLED |||
55+
| 13:14:56 ||`MW=true W=2 R=2` | ~2 min |
56+
57+
**Verdict**: ✅ **PASS** — Both transitions detected. `writableLocations` changed from `[EUS, WUS]``[EUS]``[EUS, WUS]`. This is the exact scenario that FAILS on the unpatched SDK (timer stops, transitions never detected).
58+
59+
```
60+
13:05-13:10 MW=true W=2 R=2 reads: West US ~145ms
61+
┌─ MW DISABLED at 13:07
62+
13:10-13:15 MW=false W=1 R=2 reads: West US ~145ms ← SDK detected!
63+
┌─ MW RE-ENABLED at 13:13
64+
13:15-13:18 MW=true W=2 R=2 reads: West US ~145ms ← SDK detected!
65+
```
66+
67+
---
68+
69+
## Drill 2: Region Removal + Re-add
70+
71+
**Purpose**: Validate SDK detection of region topology changes (readableLocations/writableLocations shrink and expand).
72+
73+
| Time (UTC) | Control Plane Action | SDK Detection | Lag |
74+
|---|---|---|---|
75+
| 13:18:15 | Remove West US (started) |||
76+
| 13:29:22 | Remove West US (completed) |||
77+
| 13:24:15 ||`MW=true W=1 R=1` | ~6 min from start |
78+
| 13:36:26 | West US re-added + MW enabled |||
79+
| 13:36:16 ||`MW=true W=2 R=2` | immediate |
80+
81+
**Verdict**: ✅ **PASS** — Region removal and re-addition both detected.
82+
83+
**DIRECT mode observation**: Even after West US was removed from the account metadata (`R=1`), reads continued on West US via existing RNTBD connections. This is expected — DIRECT mode maintains active TCP sockets to replicas, which keep serving until decommissioned. Latency shifted to bimodal pattern (~40-100ms vs ~140ms), suggesting some gateway-level rerouting.
84+
85+
```
86+
13:18-13:24 MW=true W=2 R=2 reads: West US ~145ms
87+
┌─ West US REMOVED
88+
13:24-13:36 MW=true W=1 R=1 reads: West US ~95ms (RNTBD connections alive)
89+
┌─ West US RE-ADDED (2007ms spike!)
90+
13:36-13:42 MW=true W=2 R=2 reads: West US ~160ms (connections re-established)
91+
```
92+
93+
---
94+
95+
## Drill 3: Failover Priority Change
96+
97+
**Purpose**: Validate SDK detection of write region swap via `failover-priority-change`.
98+
99+
| Time (UTC) | Control Plane Action | SDK Detection | Lag |
100+
|---|---|---|---|
101+
| 13:43:06 | MW disabled |||
102+
| 13:44:09 | Write swapped EUS→WUS |||
103+
| 13:45:06 ||`MW=false W=1 R=2` | ~2 min |
104+
| 13:49:09 | Write swapped WUS→EUS |||
105+
| 13:49:31 ||`MW=false W=2 R=2` (transitional) | <1 min |
106+
| 13:50:01 ||`MW=false W=1 R=2` (settled) | ~1 min |
107+
| 13:51:16 | MW re-enabled |||
108+
| 13:55:06 ||`MW=true W=2 R=2` | ~4 min |
109+
110+
**Verdict**: ✅ **PASS** — All transitions detected. Fastest detection was <1 min (failover-priority-change propagates quickly via global endpoint). Transient `W=2` state at 13:49 reflects the failover mid-transition.
111+
112+
```
113+
13:42-13:45 MW=false W=1 R=2 (MW just disabled)
114+
┌─ Write swapped EUS→WUS at 13:44
115+
13:45-13:49 MW=false W=1 R=2 (WUS is now write region)
116+
┌─ Write swapped WUS→EUS at 13:49
117+
13:49-13:50 MW=false W=2 R=2 (transitional)
118+
13:50-13:55 MW=false W=1 R=2 (EUS restored as write)
119+
┌─ MW RE-ENABLED at 13:51
120+
13:55+ MW=true W=2 R=2 ← Full cycle complete
121+
```
122+
123+
---
124+
125+
## Background Refresh Continuity
126+
127+
The core validation of the fix — the background refresh timer must never stop:
128+
129+
| Period | Refreshes | Status |
130+
|--------|-----------|--------|
131+
| Startup (13:05) | 1 (init) ||
132+
| Drill 1: MW=true steady state | ~10 | ✅ Timer kept running |
133+
| Drill 1: MW=false (SW mode) | ~10 | ✅ Timer kept running |
134+
| Drill 1: MW=true restored | ~6 | ✅ Timer kept running |
135+
| Drill 2: Region removal | ~24 | ✅ Timer kept running |
136+
| Drill 2: Region re-add | ~12 | ✅ Timer kept running |
137+
| Drill 3: Failover changes | ~28 | ✅ Timer kept running |
138+
| Drill 3: MW restored | ~12 | ✅ Timer kept running |
139+
| **Total** | **103** | ✅ Continuous |
140+
141+
**Unpatched comparison**: 0 refreshes after init. Timer stopped. None of these transitions would have been detected.
142+
143+
---
144+
145+
## Key Findings
146+
147+
1. **Fix validated**: Background refresh timer runs continuously in MW accounts — the fix in `GlobalEndpointManager.refreshLocationPrivateAsync()` works as designed.
148+
149+
2. **Metadata propagation latency**: Global endpoint reflects changes in 1-6 minutes depending on the operation type. `failover-priority-change` is fastest (~1 min), region removal is slowest (~6 min).
150+
151+
3. **DIRECT mode resilience**: Region removal doesn't immediately cause read failures — RNTBD connections to replicas persist after metadata removal. This is by design for DIRECT mode.
152+
153+
4. **Zero errors**: All 305 reads succeeded (HTTP 200) across all 3 drills. The SDK handled every topology change gracefully.
154+
155+
5. **6-year-old bug**: This bug existed since [PR #6139](https://github.com/Azure/azure-sdk-for-java/pull/6139) (Nov 2019). The PR author explicitly acknowledged it in point #4 but it was never fixed until now.
183 KB
Loading
198 KB
Loading
231 KB
Loading
114 KB
Loading

0 commit comments

Comments
 (0)