Early yield on 429 throttling on barrier requests#48914
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates Cosmos direct connectivity quorum/barrier logic to “yield early” when replica reads are uniformly throttled (HTTP 429), allowing the existing ResourceThrottleRetryPolicy to apply appropriate backoff instead of progressing into additional quorum/primary/barrier attempts.
Changes:
- Add
StoreResult.isThrottledExceptionto cheaply detect 429 responses. - In
QuorumReader, propagate 429 immediately when all collected replica results are throttled (including barrier paths). - In
ConsistencyWriter, track throttling during write barriers and, when retries are exhausted and the last attempt was fully throttled, throw aRequestTimeoutExceptionwith a new substatus code; add unit tests for the new behaviors.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/StoreResult.java | Adds a computed flag to identify throttling (429) on replica results. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/QuorumReader.java | Early-yields on replica-wide throttling to let throttle retry policy handle backoff. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/ConsistencyWriter.java | Tracks throttling during write barriers and surfaces a distinct timeout substatus when retries are exhausted. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/HttpConstants.java | Introduces a new substatus code for write-barrier throttling exhaustion. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/QuorumReaderTest.java | Adds unit tests covering 429 propagation and Gone+429 interactions. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/ConsistencyWriterTest.java | Adds unit tests for write-barrier behavior under sustained throttling and mixed outcomes. |
|
@sdkReviewAgent |
|
✅ Review complete (49:16) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
/azp run java - cosmos - tests |
|
No pipelines are associated with this pull request. |
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Port of .NET PR #1667829: When receiving repeated 429 (Too Many Requests) responses with strong consistency, QuorumReader and ConsistencyWriter now handle throttling more efficiently. QuorumReader (reads): - waitForReadBarrierAsync: yield early when all replicas return 429 in both single-region and multi-region barrier loops - ensureQuorumSelectedStoreResponse: yield early when all replicas throttled during initial quorum read - All cases throw the 429 exception to let ResourceThrottleRetryPolicy handle retry with appropriate backoff ConsistencyWriter (writes): - waitForWriteBarrierAsync: track lastAttemptWasThrottled flag per iteration - Do NOT yield early (preserves idempotency guarantees) - When all retries exhausted due to consistent throttling, throw RequestTimeoutException (408) with substatus SERVER_WRITE_BARRIER_THROTTLED (21013) instead of returning barrier-not-met Other changes: - Added isThrottledException field to StoreResult - Added SERVER_WRITE_BARRIER_THROTTLED (21013) substatus code - Unit tests for all throttling scenarios Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ed replica yields early Port of .NET test ValidatesReadMultipleReplicaAsyncExcludesGoneReplicas. Validates that when replicas return a mix of 410 (Gone) and 429 (TooManyRequests): - Gone replicas are excluded from results by StoreReader (isValid=false for GONE) - The 429 replica with valid LSN headers is kept (isValid=true for non-GONE with lsn>=0) - Since all remaining replicas are throttled, early yield triggers - The 429 exception propagates to ResourceThrottleRetryPolicy Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes from xinlian12 (blocking): - Fix lastAttemptWasThrottled stale state: reset flag before avoidQuorumSelection early return to prevent incorrect 408 when prior iteration was throttled but current iteration hits 410 - Fix readStrong_AllReplicasThrottled_Returns429 false positive: set LSN on exception so StoreReader marks isValid=true, ensuring the early yield path is actually exercised. Add transport invocation count assertions to verify primary read is NOT attempted. - Add readStrong_BarrierRequestsThrottled_Returns429 test covering the waitForReadBarrierAsync barrier path (quorum succeeds, then barrier HEAD requests return 429) Fixes from Copilot review: - Fix checkstyle: add missing spaces around = operator (2 places) - Fix log wording: 'All replicas' -> 'All contacted replicas' (more accurate since not all replicas may be contacted per attempt) - Fix ConsistencyWriter log: 'consistent throttling' -> 'last attempt was throttled' (flag only tracks last attempt, not all attempts) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tling Unit tests (4 new, 10 total throttling tests): - writeBarrier_AvoidQuorumSelectionAfterThrottling_NoFalse408: validates lastAttemptWasThrottled reset on avoidQuorumSelection path (stale state fix) - writeBarrier_NRegionCommit_AllReplicasThrottled_Returns408: N-region synchronous commit barrier throttling produces 408/21013 - readStrong_QuorumNotSelected_PrimaryThrottled_Returns429: primary 429 propagates correctly through QuorumNotSelected → readPrimary path - readStrong_BarrierPartialThrottle_StillSucceeds: barrier succeeds when one replica is throttled but other meets LSN (no false-negative yield) Fault injection E2E tests (3 new, require strong consistency account): - faultInjection_readBarrierThrottled_yieldsEarly: inject 429 on HEAD_COLLECTION + GCLSN interceptor → verify early yield on reads - faultInjection_writeBarrierThrottled_returns408: inject 429 on HEAD_COLLECTION + GCLSN interceptor → verify 408 on writes - faultInjection_readBarrierThrottled_thenRecovers: inject 429 with hitLimit(2) → verify read succeeds after throttle clears Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Read and write barrier requests are only triggered on multi-region strong consistency accounts (numberOfReadRegions > 0). The emulator is single-region, so the GCLSN interceptor never triggers barriers and the tests fail with empty supplementalResponseStatisticsList. Added accountLevelReadRegions.size() > 1 skip check to all three E2E fault injection tests so they correctly skip on single-region environments. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
28b1fae to
3cd6163
Compare
…_ON_429 flag Align with the updated .NET implementation (PR 2117699), which puts the early-yield-on-429 barrier behavior behind a feature flag defaulting to off. - Add Configs.isBarrierEarlyYieldOn429Enabled() following the existing COSMOS.* / COSMOS_* system-property/env-var pattern, default false. - QuorumReader: keep the 'all replicas returned 429' diagnostic logs unconditional, but gate the three early-yield Mono/Flux.error returns behind the flag. - ConsistencyWriter: gate lastAttemptWasThrottled tracking and the 408 (substatus 21013) throw behind the flag; logs remain unconditional. - Tests: enable the flag for the barrier-throttle unit/E2E tests and add a flag-off default test; restore the system property after each test. - CHANGELOG: note the behavior is now opt-in and disabled by default. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- FaultInjectionServerErrorRuleOnDirectTests: the write-barrier E2E test now asserts the thrown CosmosException is 408 with substatus 21013 (SERVER_WRITE_BARRIER_THROTTLED) instead of only checking the fault rule hit count, so it actually validates the write-side behavior. Also move the System.setProperty for the feature flag inside the try block so the finally always clears it even if fault-rule setup throws. - QuorumReader: the unconditional 'all replicas returned 429' detection log no longer claims 'Yielding early'; the yield action is now logged inside the feature-flag guard so the log isn't misleading when the flag is off. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…throttling # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md
- StoreResult: exclude throughput-control 429s from isThrottledException so client-side throttling does not falsely trigger early yield - ConsistencyWriter: simplify lastAttemptWasThrottled assignment to set(enableBarrierEarlyYieldOn429) - E2E write test: fail when createItem unexpectedly succeeds - Add E2E test for default flag-off behavior asserting the 21013 early-yield substatus is not produced Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kushagraThapar
left a comment
There was a problem hiding this comment.
Thanks @mbhaskar for tackling this — the throttle amplification on barrier loops is real, and gating it behind a default-OFF feature flag (with the .NET parity story for future enablement) is exactly the right shape. The Copilot / xinlian12 / Praveen-Msft round-trips landed clean fixes; the notes below are the gaps I see after that.
Inline notes (7):
- M1 — Building on xinlian12's
:520thread: 408/21013 currently routes throughClientRetryPolicy.shouldRetryOnRequestTimeoutto PPAF endpoint-marking, not toResourceThrottleRetryPolicy. Means a throttle storm becomes a region failover, which doesn't fix RU exhaustion at the source. - M2 — The synthesized
RequestTimeoutException(message, URI, subStatusCode)overload drops both the underlying 429 cause AND the server'sx-ms-retry-after-msheader. Capturing into anAtomicReference<CosmosException>parallel tolastAttemptWasThrottledwould preserve both. - M3 — Write-side terminal yield may preempt dynamic-quorum convergence —
waitForWriteBarrierAsyncpreviously polled until replicas committed; now we yield terminally when one replica is hot. Worth pinging the replication owners on the Strong-vs-Eventual semantics (chapter 05#write-path-consistencywriter-internals). - M4 — Building on xinlian12's
:520thread: 21013 has a 3-part documentation cliff —StatusCodes.mdrow missing, CHANGELOG bullet doesn't mention the new substatus value, chapter 17 § 5.1 needs the 408-substatus entry. - M5 — Three near-identical 429-allMatch early-yield blocks in
QuorumReader(:443, :713, :810). Drift risk especially since xinlian12's still-openretryAfterInMsthread will need to touch all three. - M6 — The 20-arg positional
new StoreResult(...)ctor is duplicated across 5+ new test methods inConsistencyWriterTest(plus more inQuorumReaderTest). This is the third PR in recent memory adding to that pattern — would love aStoreResultTestFactoryextraction as a follow-up. - m16 —
SERVER_WRITE_BARRIER_THROTTLEDnaming suggests server-origin but is client-synthesized; sibling precedent is*_EXCEEDED_RETRY_LIMIT. Cheaper to rename pre-merge than post-release.
Cross-SDK 21013 dispute resolution (re xinlian12's note at HttpConstants.java:504):
xinlian12 asked whether SERVER_WRITE_BARRIER_THROTTLED = 21013 exists in .NET; the reply indicated .NET does have something equivalent. I tried to verify this end-to-end against public sources:
Azure/azure-cosmos-dotnet-v3main (SHAe45ca1976cea0986455403422ce0e7380157d613):git grep -wn "21013" -- '*.cs'→ zero source hits (only test JSON false positives)grep -rn "BarrierThrottled\|WriteBarrierThrottled" --include="*.cs"→ zero hitsConfigurationManager.cs(~30AZURE_COSMOS_*env vars) → zero hits for any barrier / early-yield / 429 flag
So in the public .NET v3 surface, the feature isn't present. The reply is most likely referring to the closed Microsoft.Azure.Cosmos.Direct NuGet (which hosts ConsistencyWriter for .NET) or to internal MSDATA PR #2117699 — both of which I can't see from here.
The reason this matters: 21013 is client-synthesized by Java, not server-emitted. If MSDATA picks a different numeric value (or a different substatus name) when they add the same feature, cross-SDK diagnostics keyed on substatus (e.g. customer dashboards correlating .NET vs Java telemetry) will permanently diverge. Suggesting a quick sync with the .NET team to lock the numeric value + canonical name before this merges. Happy to be wrong if there's already cross-SDK alignment I'm not seeing.
Design-doc follow-ups (separate PR against tvaron3/cosmosdb-design-docs):
- D1 — chapter 17
17-status-codes-and-sdk-retries.md§ 5.1 "408 substatus reference" — add the 21010 / 21011 / 21012 / 21013 row with the local-commit-state column + retry-policy mapping - D2 — chapter 05
05-replication-consistency.md— add a "Barrier HEAD request throttling" subsection covering both the read early-yield (returning the underlying 429 toResourceThrottleRetryPolicy) and the write terminal 408/21013 asymmetry - D3 — chapter 11
11-request-units-throttling.md— cross-ref note that barrier HEAD requests share the per-physical RU bucket with data-plane reads, so a single hot logical partition can throttle barriers indefinitely
Also FYI — PR currently shows mergeable: CONFLICTING so a rebase is needed regardless; might be cheap to fold the above into that rebase pass.
Approving with notes — none of this is blocking the early-yield mechanic itself, which I think is the right architectural call given the flag-gating + .NET parity story.
| } | ||
|
|
||
| if (writeBarrierRetryCount.get() == 0) { | ||
| if (this.enableBarrierEarlyYieldOn429 && lastAttemptWasThrottled.get()) { |
There was a problem hiding this comment.
One thing I'd love your read on (and probably worth pinging the replication owners):
The write-side waitForWriteBarrierAsync barrier loop polls replicas waiting for globalCommittedLsn to catch up — i.e. it's the dynamic-quorum convergence loop. If one replica's RU bucket is hot enough that it returns 429 on every barrier poll, but the OTHER replicas eventually catch up and commit, the prior behavior (return Mono.just(false) after retries exhausted → caller falls back to recomputing quorum) was effectively eventual-consistency-friendly. The new behavior (throw RequestTimeoutException(21013) on the same condition) is terminal — we never get to see whether the slow-but-not-throttled replicas converged.
For Strong / BoundedStaleness this might actually be the right call (we'd rather surface a clear timeout than serve a stale read). For Eventual / Session it feels slightly aggressive. Wondering if the design intent is to gate this on consistency level, or whether the chapter 05 05-replication-consistency.md#write-path-consistencywriter-internals contract explicitly permits the terminal yield regardless. Could be worth a comment thread with the replication owners before this merges.
| public static final int SERVER_GENERATED_408 = 21010; | ||
| public static final int FAILED_TO_PARSE_SERVER_RESPONSE = 21011; | ||
| public static final int GLOBAL_N_REGION_COMMIT_WRITE_BARRIER_NOT_MET = 21012; | ||
| public static final int SERVER_WRITE_BARRIER_THROTTLED = 21013; |
There was a problem hiding this comment.
Building on xinlian12's thread at HttpConstants.java:504:
Even setting aside the cross-SDK numeric coordination question (covered in the summary), 21013 as a customer-observable substatus has a documentation cliff:
sdk/cosmos/azure-cosmos/docs/StatusCodes.mdlists every other 408 substatus (21010SERVER_GENERATED_408, 21011FAILED_TO_PARSE_SERVER_RESPONSE, 21012GLOBAL_N_REGION_COMMIT_WRITE_BARRIER_NOT_MET) with semantics + retry guidance. 21013 needs a parallel row so customers who see "408 substatus 21013" in their telemetry can self-serve diagnose.- The CHANGELOG bullet currently reads "Added early yield on 429 throttling on barrier requests" — it doesn't mention the new substatus value or what it means for downstream alerting / dashboards. Customers with substatus-keyed monitoring will see a brand-new 408 substatus appear without warning.
- Design-doc follow-up (separate PR): chapter 17
17-status-codes-and-sdk-retries.mdhas section 5.1 listing 408 substatus codes — 21013 should be added there with the retry-policy mapping.
verified by: grep -n "21013\|SERVER_WRITE_BARRIER_THROTTLED" sdk/cosmos/azure-cosmos/docs/StatusCodes.md → no output (confirmed undocumented)
| // Check if all contacted replicas returned 429 Too Many Requests. | ||
| // Yield early to let ResourceThrottleRetryPolicy handle the retry with appropriate backoff, | ||
| // instead of returning QuorumNotSelected which would trigger an unnecessary primary read attempt. | ||
| if (!responseResult.isEmpty() && responseResult.stream().allMatch(r -> r.isThrottledException)) { |
There was a problem hiding this comment.
This isEmpty() && allMatch(isThrottledException) predicate is duplicated at three sites in this file — here at :443, again at :713 (waitForReadBarrierAsync first throttle gate), and at :810 (waitForReadBarrierAsync second throttle gate). The three blocks are structurally identical except for the log strings and the early-return target.
Two reasons to extract a helper before merge:
- Pure drift risk — when xinlian12's still-open
retryAfterInMsthread (currently on:445) gets addressed, the fix needs to land in all three sites consistently. With the current copy-paste shape it's easy to miss one. - The early-yield decision is the new authoritative invariant of this PR. Having it expressed once (something like
private Mono<X> tryEarlyYieldOn429(List<StoreResult>, ...)returningOptional.empty()when not all-throttled) makes the policy testable in isolation and removes the chance of three sites diverging behaviorally.
Would also let the unit-test footprint shrink — currently each new test exercises one of the three sites in a near-identical way.
…throttling # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md
…ottled 408 A 408 synthesized from write-barrier throttling (substatus SERVER_WRITE_BARRIER_THROTTLED) is an RU hot-spot signal, not an endpoint failure. Guard shouldRetryOnRequestTimeout() to surface it as terminal (no retry) without marking the endpoint unavailable or triggering per-partition automatic failover. Addresses review concern #1. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the write-barrier loop exhausts retries under consistent throttling and synthesizes a 408 (substatus SERVER_WRITE_BARRIER_THROTTLED), capture the last throttled 429 in an AtomicReference and pass it as the inner exception, and propagate its server-advised x-ms-retry-after-ms header. This keeps the diagnostics cause-chain intact and gives a downstream retry policy honoring the 408 the server hint to work with. Addresses review concern Azure#2. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Flip COSMOS.ENABLE_BARRIER_EARLY_YIELD_ON_429 default from false to true to match the .NET SDK (default-on with opt-out). Update CHANGELOG and test comments, and switch the opt-out E2E test to explicitly set the flag to false (clearing the property now means enabled). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@copilot resolve the merge conflicts in this pull request |
Description
This PR introduces early yield on 429 (throttling) responses during barrier requests for strong/bounded-staleness consistency.
When receiving 429s under strong consistency, the quorum reader/writer barrier code does not yield early enough. Instead it keeps retrying the barrier until the overall request timeout, producing multiple stack traces and accumulating resource pressure on the client. With this change, when all barrier responses are throttled (429), the SDK yields early rather than spinning until timeout:
QuorumReader): when the barrier responses are all 429s, the original 429 is surfaced/yielded early instead of being retried to timeout.ConsistencyWriter): when the write barrier is throttled, a 408 (RequestTimeout) with substatus21013(SERVER_WRITE_BARRIER_THROTTLED) is returned early.Feature flag (gated, default OFF)
This behavior is disabled by default and gated behind a static
Configsfeature flag, mirroring the updated .NET implementation which placed the same behavior behind a feature flag. The Java public SDK has no per-tenant config path, so a staticConfigsflag is used following existing patterns (e.g.isSessionTokenFalseProgressMergeEnabled()):COSMOS.ENABLE_BARRIER_EARLY_YIELD_ON_429COSMOS_ENABLE_BARRIER_EARLY_YIELD_ON_429falseConfigs.isBarrierEarlyYieldOn429Enabled()When the flag is off, behavior is identical to before this PR. Diagnostic detection logs are emitted unconditionally for livesite visibility; only the behavioral side effects (early-yield return, 408 throw, throttle tracking) and the "yielding early" action log are gated behind the flag.
Reference
Aligns with the updated .NET implementation, which puts this behavior behind a feature flag:
https://msdata.visualstudio.com/CosmosDB/_git/CosmosDB/pullrequest/2117699
Changes
Configs.java: new flag constants +isBarrierEarlyYieldOn429Enabled()accessor.QuorumReader.java: gate the three read barrier early-yield paths behind the flag.ConsistencyWriter.java: gate the write barrier 408/21013early-yield and throttle tracking behind the flag.HttpConstants.java: add substatusSERVER_WRITE_BARRIER_THROTTLED = 21013.StoreResult.java: addisThrottledExceptionhelper.QuorumReaderTest,ConsistencyWriterTest) for both flag-on and flag-off behavior, plus fault-injection E2E tests.All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines