|
| 1 | +# DR Drill Report — Background Refresh Fix Validation |
| 2 | + |
| 3 | +**Account**: `bgrefresh-mw-test-440` |
| 4 | +**Resource Group**: `rg-bgrefresh-test` |
| 5 | +**Tenant**: `022026052026cdbprod.onmicrosoft.com` (ephemeral) |
| 6 | +**Regions**: East US (hub) + West US |
| 7 | +**SDK Version**: 4.80.0-beta.1 (patched with background refresh fix) |
| 8 | +**Client Config**: Global endpoint, preferred region = West US, DIRECT mode, 30s background refresh |
| 9 | +**Drill Duration**: 13:05 – 13:56 UTC (51 minutes) |
| 10 | +**Date**: 2026-04-10 |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## Executive Summary |
| 15 | + |
| 16 | +| Metric | Value | |
| 17 | +|--------|-------| |
| 18 | +| Background refreshes | **103** (every ~30s, continuous) | |
| 19 | +| Total point reads | **305** | |
| 20 | +| Errors | **0** | |
| 21 | +| Metadata transitions detected | **9** | |
| 22 | +| Read availability | **100%** | |
| 23 | + |
| 24 | +The patched SDK's background refresh timer ran continuously through all 3 drills, detecting every topology change. Zero read failures occurred. This validates the fix in [PR #48758](https://github.com/Azure/azure-sdk-for-java/pull/48758). |
| 25 | + |
| 26 | +**Contrast with unpatched SDK**: The unpatched 4.80.0-beta.1 stopped the background refresh timer after init (0 refreshes), and topology changes were never detected. |
| 27 | + |
| 28 | +--- |
| 29 | + |
| 30 | +## Latency Summary |
| 31 | + |
| 32 | +| Scope | Count | Avg | P50 | P90 | P99 | Max | |
| 33 | +|-------|-------|-----|-----|-----|-----|-----| |
| 34 | +| **All reads** | 305 | 165ms | 145ms | 207ms | 687ms | 2007ms | |
| 35 | +| Drill 1 (MW↔SW) | 77 | 174ms | 145ms | — | 725ms | 725ms | |
| 36 | +| Drill 2 (Region removal) | 141 | 166ms | 142ms | — | 819ms | 2007ms | |
| 37 | +| Drill 3 (Failover change) | 87 | 156ms | 157ms | — | 270ms | 270ms | |
| 38 | + |
| 39 | +**Notable latency spikes**: |
| 40 | +- 13:05 (init): 725ms — cold start, first TCP connections to West US replicas |
| 41 | +- 13:24 (West US removed): 819ms spike, then bimodal 40-100ms (gateway rerouting) / 140ms (direct) |
| 42 | +- 13:36 (West US re-added): **2007ms** — highest spike as SDK re-establishes connections to newly available region |
| 43 | + |
| 44 | +--- |
| 45 | + |
| 46 | +## Drill 1: MW → SW → MW Transition |
| 47 | + |
| 48 | +**Purpose**: Validate that the background refresh timer continues running in multi-write accounts and detects `enableMultipleWriteLocations` changes. |
| 49 | + |
| 50 | +| Time (UTC) | Control Plane Action | SDK Detection (via background refresh) | Lag | |
| 51 | +|---|---|---|---| |
| 52 | +| 13:07:11 | MW DISABLED | — | — | |
| 53 | +| 13:10:02 | — | ✅ `MW=false W=1 R=2` | ~3 min | |
| 54 | +| 13:13:15 | MW RE-ENABLED | — | — | |
| 55 | +| 13:14:56 | — | ✅ `MW=true W=2 R=2` | ~2 min | |
| 56 | + |
| 57 | +**Verdict**: ✅ **PASS** — Both transitions detected. `writableLocations` changed from `[EUS, WUS]` → `[EUS]` → `[EUS, WUS]`. This is the exact scenario that FAILS on the unpatched SDK (timer stops, transitions never detected). |
| 58 | + |
| 59 | +``` |
| 60 | +13:05-13:10 MW=true W=2 R=2 reads: West US ~145ms |
| 61 | + ┌─ MW DISABLED at 13:07 |
| 62 | +13:10-13:15 MW=false W=1 R=2 reads: West US ~145ms ← SDK detected! |
| 63 | + ┌─ MW RE-ENABLED at 13:13 |
| 64 | +13:15-13:18 MW=true W=2 R=2 reads: West US ~145ms ← SDK detected! |
| 65 | +``` |
| 66 | + |
| 67 | +--- |
| 68 | + |
| 69 | +## Drill 2: Region Removal + Re-add |
| 70 | + |
| 71 | +**Purpose**: Validate SDK detection of region topology changes (readableLocations/writableLocations shrink and expand). |
| 72 | + |
| 73 | +| Time (UTC) | Control Plane Action | SDK Detection | Lag | |
| 74 | +|---|---|---|---| |
| 75 | +| 13:18:15 | Remove West US (started) | — | — | |
| 76 | +| 13:29:22 | Remove West US (completed) | — | — | |
| 77 | +| 13:24:15 | — | ✅ `MW=true W=1 R=1` | ~6 min from start | |
| 78 | +| 13:36:26 | West US re-added + MW enabled | — | — | |
| 79 | +| 13:36:16 | — | ✅ `MW=true W=2 R=2` | immediate | |
| 80 | + |
| 81 | +**Verdict**: ✅ **PASS** — Region removal and re-addition both detected. |
| 82 | + |
| 83 | +**DIRECT mode observation**: Even after West US was removed from the account metadata (`R=1`), reads continued on West US via existing RNTBD connections. This is expected — DIRECT mode maintains active TCP sockets to replicas, which keep serving until decommissioned. Latency shifted to bimodal pattern (~40-100ms vs ~140ms), suggesting some gateway-level rerouting. |
| 84 | + |
| 85 | +``` |
| 86 | +13:18-13:24 MW=true W=2 R=2 reads: West US ~145ms |
| 87 | + ┌─ West US REMOVED |
| 88 | +13:24-13:36 MW=true W=1 R=1 reads: West US ~95ms (RNTBD connections alive) |
| 89 | + ┌─ West US RE-ADDED (2007ms spike!) |
| 90 | +13:36-13:42 MW=true W=2 R=2 reads: West US ~160ms (connections re-established) |
| 91 | +``` |
| 92 | + |
| 93 | +--- |
| 94 | + |
| 95 | +## Drill 3: Failover Priority Change |
| 96 | + |
| 97 | +**Purpose**: Validate SDK detection of write region swap via `failover-priority-change`. |
| 98 | + |
| 99 | +| Time (UTC) | Control Plane Action | SDK Detection | Lag | |
| 100 | +|---|---|---|---| |
| 101 | +| 13:43:06 | MW disabled | — | — | |
| 102 | +| 13:44:09 | Write swapped EUS→WUS | — | — | |
| 103 | +| 13:45:06 | — | ✅ `MW=false W=1 R=2` | ~2 min | |
| 104 | +| 13:49:09 | Write swapped WUS→EUS | — | — | |
| 105 | +| 13:49:31 | — | ✅ `MW=false W=2 R=2` (transitional) | <1 min | |
| 106 | +| 13:50:01 | — | ✅ `MW=false W=1 R=2` (settled) | ~1 min | |
| 107 | +| 13:51:16 | MW re-enabled | — | — | |
| 108 | +| 13:55:06 | — | ✅ `MW=true W=2 R=2` | ~4 min | |
| 109 | + |
| 110 | +**Verdict**: ✅ **PASS** — All transitions detected. Fastest detection was <1 min (failover-priority-change propagates quickly via global endpoint). Transient `W=2` state at 13:49 reflects the failover mid-transition. |
| 111 | + |
| 112 | +``` |
| 113 | +13:42-13:45 MW=false W=1 R=2 (MW just disabled) |
| 114 | + ┌─ Write swapped EUS→WUS at 13:44 |
| 115 | +13:45-13:49 MW=false W=1 R=2 (WUS is now write region) |
| 116 | + ┌─ Write swapped WUS→EUS at 13:49 |
| 117 | +13:49-13:50 MW=false W=2 R=2 (transitional) |
| 118 | +13:50-13:55 MW=false W=1 R=2 (EUS restored as write) |
| 119 | + ┌─ MW RE-ENABLED at 13:51 |
| 120 | +13:55+ MW=true W=2 R=2 ← Full cycle complete |
| 121 | +``` |
| 122 | + |
| 123 | +--- |
| 124 | + |
| 125 | +## Background Refresh Continuity |
| 126 | + |
| 127 | +The core validation of the fix — the background refresh timer must never stop: |
| 128 | + |
| 129 | +| Period | Refreshes | Status | |
| 130 | +|--------|-----------|--------| |
| 131 | +| Startup (13:05) | 1 (init) | ✅ | |
| 132 | +| Drill 1: MW=true steady state | ~10 | ✅ Timer kept running | |
| 133 | +| Drill 1: MW=false (SW mode) | ~10 | ✅ Timer kept running | |
| 134 | +| Drill 1: MW=true restored | ~6 | ✅ Timer kept running | |
| 135 | +| Drill 2: Region removal | ~24 | ✅ Timer kept running | |
| 136 | +| Drill 2: Region re-add | ~12 | ✅ Timer kept running | |
| 137 | +| Drill 3: Failover changes | ~28 | ✅ Timer kept running | |
| 138 | +| Drill 3: MW restored | ~12 | ✅ Timer kept running | |
| 139 | +| **Total** | **103** | ✅ Continuous | |
| 140 | + |
| 141 | +**Unpatched comparison**: 0 refreshes after init. Timer stopped. None of these transitions would have been detected. |
| 142 | + |
| 143 | +--- |
| 144 | + |
| 145 | +## Key Findings |
| 146 | + |
| 147 | +1. **Fix validated**: Background refresh timer runs continuously in MW accounts — the fix in `GlobalEndpointManager.refreshLocationPrivateAsync()` works as designed. |
| 148 | + |
| 149 | +2. **Metadata propagation latency**: Global endpoint reflects changes in 1-6 minutes depending on the operation type. `failover-priority-change` is fastest (~1 min), region removal is slowest (~6 min). |
| 150 | + |
| 151 | +3. **DIRECT mode resilience**: Region removal doesn't immediately cause read failures — RNTBD connections to replicas persist after metadata removal. This is by design for DIRECT mode. |
| 152 | + |
| 153 | +4. **Zero errors**: All 305 reads succeeded (HTTP 200) across all 3 drills. The SDK handled every topology change gracefully. |
| 154 | + |
| 155 | +5. **6-year-old bug**: This bug existed since [PR #6139](https://github.com/Azure/azure-sdk-for-java/pull/6139) (Nov 2019). The PR author explicitly acknowledged it in point #4 but it was never fixed until now. |
0 commit comments