Cosmos: Adds Cross-Region Hedging Design Spec to Driver Crate #4330
kundadebdatta wants to merge 6 commits into
Conversation
NaluTripician
left a comment
Cross-checked the spec against the actual .NET source in azure-cosmos-dotnet-v3 (CrossRegionHedgingAvailabilityStrategy.cs, AvailabilityStrategy.cs, DocumentClient.InitializePartitionLevelFailoverWithDefaultHedging). Overall the spec is well-grounded — author clearly read the .NET code, not just the docs. One material issue around the SDK-default write-hedging behavior on multi-master, plus a handful of minor nits inline.
```rust
request_number,
region: regions[request_number].clone(),
result: Err(cancelled_error()),
}
```
Missing app-cancellation re-raise behavior from .NET. The .NET impl deliberately awaits the faulted task when the app token is the source of cancellation (lines 209–212 of CrossRegionHedgingAvailabilityStrategy.cs):
```csharp
if (applicationProvidedCancellationToken.IsCancellationRequested)
{
    await (Task<HedgingResponse>)completedTask;
}
```

This is what allows RequestSenderAndResultCheckAsync to rethrow as CosmosOperationCanceledException with the trace attached (lines 358–363). The Rust pseudocode collapses both "hedge cancellation" and "app cancellation" into a generic cancelled_error(), losing the trace context that .NET preserves.
Suggest distinguishing app-token cancellation from hedge-token cancellation in the orchestrator and propagating the former with full diagnostics, mirroring .NET's behavior.
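For illustration, a minimal Rust sketch of that split — all names here (`CancellationKind`, `classify_cancellation`, the `HedgeDiagnostics` placeholder) are hypothetical, not the spec's API:

```rust
use tokio_util::sync::CancellationToken;

/// Placeholder for the diagnostics type sketched elsewhere in the spec (§10).
pub struct HedgeDiagnostics;

/// Hypothetical classification of a cancelled hedge attempt.
pub enum CancellationKind {
    /// The application's token fired: re-raise as an app-visible cancellation
    /// error that still carries the diagnostics/trace gathered so far
    /// (mirrors .NET's CosmosOperationCanceledException path).
    Application { diagnostics: HedgeDiagnostics },
    /// The orchestrator cancelled a losing hedge: swallow quietly, never
    /// surface it as the operation's result.
    HedgeLoser,
}

pub fn classify_cancellation(
    app_token: &CancellationToken,
    diagnostics: HedgeDiagnostics,
) -> CancellationKind {
    if app_token.is_cancelled() {
        CancellationKind::Application { diagnostics }
    } else {
        CancellationKind::HedgeLoser
    }
}
```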
> |---:|-----------|--------|
> | 1 | No strategy resolved (or `AvailabilityStrategy::Disabled`) | No |
> | 2 | Application preferred-region list empty (no fan-out targets) | No |
> | 3 | `ResourceType != Document` | No |
This rule won't survive Phase 3. §16 Phase 3 adds metadata operations (Database / Container / Offer / Throughput), all of which have non-Document ResourceType. As written, this row is a hard reject for all non-Document ops — when Phase 3 lands, this row needs to become phase-/resource-type-gated rather than a blanket exclusion.
Worth either:
- Tagging this row with "Phase 1 only — see §16 Phase 3 for metadata coverage", or
- Restating it as `ResourceType not in <phase-allowed set>` so the eligibility rule's evolution is encoded in one place rather than getting rewritten in Phase 3.
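For illustration, a minimal sketch of the phase-gated form, with hypothetical names (`HedgingPhase`, `phase_allowed_resource_types`) that are not in the spec:

```rust
// Hypothetical phase + resource-type gating; names are illustrative only.
#[derive(Clone, Copy, PartialEq, Eq)]
enum ResourceType {
    Document,
    Database,
    Container,
    Offer,
}

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum HedgingPhase {
    Phase1,
    Phase2,
    Phase3,
}

/// Eligibility row 3, restated once: "ResourceType not in <phase-allowed set>".
fn phase_allowed_resource_types(phase: HedgingPhase) -> &'static [ResourceType] {
    match phase {
        // Early phases: document operations only.
        HedgingPhase::Phase1 | HedgingPhase::Phase2 => &[ResourceType::Document],
        // Later phase adds metadata resource types (per §16).
        HedgingPhase::Phase3 => &[
            ResourceType::Document,
            ResourceType::Database,
            ResourceType::Container,
            ResourceType::Offer,
        ],
    }
}

fn resource_type_allowed(phase: HedgingPhase, resource_type: ResourceType) -> bool {
    phase_allowed_resource_types(phase).contains(&resource_type)
}
```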
Pull request overview
Adds a new in-repo design specification (HEDGING_SPEC.md) to the Cosmos driver crate docs, describing the planned cross-region hedging (availability strategy) feature and its intended integration with existing routing/retry systems.
Changes:
- Adds a comprehensive design spec for cross-region request hedging (configuration, eligibility, algorithm, diagnostics, and phased rollout plan).
- Documents intended interactions with PPAF/PPCB, session consistency, throughput control, deadlines, and cancellation.
> ### Operation-type scope (phased)
>
> | Operation type | Phase 1 | Phase 2 | Future |
> |---|:---:|:---:|:---:|
> | Document point reads (GetItem) | ✅ | ✅ | ✅ |
> | Document point writes on multi-master (Create/Replace/Upsert/Delete/Patch) | ✅ | ✅ | ✅ |
> | Queries (`QueryItems`) | ✅ | ✅ | ✅ |
> | `ReadMany` | ✅ | ✅ | ✅ |
> | Change feed | ✅ | ✅ | ✅ |
> | Metadata operations (Database / Container / Offer / Throughput) | ❌ | ✅ | ✅ |
> | Stored procedures / triggers / UDFs execution | ❌ | ❌ | 🟡 candidate |
> ### Solution: Speculative Hedging
>
> **Hedging** sends the same request to an alternate region after a latency threshold
> is exceeded, and returns whichever response arrives first. This bounds tail latency
Suggested change:
```diff
-is exceeded, and returns whichever response arrives first. This bounds tail latency
+is exceeded, and returns whichever finite response arrives first. This bounds tail latency
```
> **Hedging** sends the same request to an alternate region after a latency threshold
> is exceeded, and returns whichever response arrives first. This bounds tail latency
> at roughly `threshold + cross-region-RTT` instead of waiting for the slow region to
> respond.
Should frequently choosing the hedged response from the second region also trigger PPCB? Like when identifying that at least for one partition the second region is always faster - why not skip even trying the first region after a few iterations? This is a bit like Bhaskar's initial proposal for hedging - which was more dynamic than what we have now - please follow up with Bhaskar and let him give you the docs he had - we should re-evaluate whether at least some aspects of his initial idea would make sense to be adopted now in Rust. We were not able to convince customers of Java V4 at the time - but it is definitely worth looking at it again.
> 4. **Complementary to failover** — hedging handles *latency*; PPAF/PPCB handle
>    *failures*. They compose without interference.
> 5. **Resource-safe** — hedged requests that lose the race are cancelled promptly to
>    avoid wasted RU/s and transport resources.
Intent makes sense - but it is not perfectly "safe" - so I would set expectations a bit more carefully.
> | `ReadMany` | ✅ | ✅ | ✅ |
> | Change feed | ✅ | ✅ | ✅ |
> | Metadata operations (Database / Container / Offer / Throughput) | ❌ | ✅ | ✅ |
> | Stored procedures / triggers / UDFs execution | ❌ | ❌ | 🟡 candidate |
Triggers are never executed separately - a header on a normal point operation is sent to the service indicating that the service should execute the trigger - so this is completely independent of hedging: if you hedge the createItem it is hedged with or without triggers. Just remove triggers and UDFs here - only stored procedure execution is its own "operation".
> operation coverage where it is safe and cheap. Sprocs / triggers / UDFs
> are deferred to Future because their server-side execution model
> interacts with hedging in non-obvious ways (server-side state,
> idempotency). See §16 for the full rollout plan.
Triggers / UDFs are irrelevant here.
Stored procedures and point writes all have the same issue with idempotency in multi-master. I think in Phase 1 (and maybe all we ever do) we should stick to read/query only. Multi-master with hedging could result in a significantly higher number of conflicts - and that is usually not something you would want - so, my 2 cents: it was a mistake to ever enable hedging for writes in multi-master in Java - and very few customers if any have enabled it (it also requires opting into enabling retriable writes). I would scope this down to Phase 1 - reads, Phase 2 - metadata operations, and explicitly not enable hedging for writes.
> so a PPAF-enabled deployment ends up running with all three (PPAF + PPCB +
> hedging) active simultaneously.
>
> **The Rust driver matches .NET exactly:** the SDK-default hedging strategy
To me we should simplify this. Always enable PPAF (if the server allows), PPCB, and hedging. For PPCB and hedging we could allow an opt-out as an escape hatch. PPAF will always follow the server signal. Whether we ship with the escape hatch or force enablement of PPCB and hedging we can decide after some more stress tests.
> **Rationale:** Hedging must operate above the retry loop because each hedged
> request needs its own independent retry state, session tokens, and endpoint
> resolution. The operation pipeline already handles per-region retries; hedging
Needs to dynamically add excluded regions - also we should make sure that any cloning/allocations only happen when the hedging threshold kicks in - any request not needing hedging should be free/super cheap. I would call this out here explicitly as a design constraint.
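A rough sketch of that constraint, assuming hypothetical names (`execute_maybe_hedged`, `execute_operation_pipeline_in_next_region`) and the `Arc`-backed `CosmosOperation` clone described in the spec — nothing hedging-related is allocated unless the threshold timer actually fires:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Assumed shapes (CosmosOperation, Response, Error, pipeline functions);
// this is a sketch of the "hedging must be free until it triggers" constraint,
// not the spec's final orchestrator.
async fn execute_maybe_hedged(
    op: CosmosOperation,
    hedging_threshold: Duration,
) -> Result<Response, Error> {
    // Fast path: no clone, no extra allocation, no cancellation token —
    // identical cost to a non-hedged request until the threshold elapses.
    let primary = execute_operation_pipeline(&op);
    tokio::pin!(primary);

    tokio::select! {
        result = &mut primary => result,
        _ = sleep(hedging_threshold) => {
            // Only now do we pay for the clone (Arc-backed body => cheap) and
            // spawn the single hedge against the next preferred region, with
            // the primary region added to that hedge's excluded regions.
            let hedge_op = op.clone();
            let hedge = execute_operation_pipeline_in_next_region(&hedge_op);
            tokio::pin!(hedge);
            tokio::select! {
                result = &mut primary => result,
                result = &mut hedge => result,
            }
        }
    }
}
```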
Also I am wondering whether we really need to ever hedge to more than one region - two Azure regions being out at once is an extremely unlikely scenario. And when clients trigger hedging to more regions because the issue is on the client side, it usually makes things worse. My 2 cents - gate hedging to at most one region. The default routing policy in Rust uses proximity - if a customer intentionally chooses a slow 2nd preferred region we should honor that choice even if a 3rd region might be faster. That will simplify this and IMO is the right design from what we have learned so far.
> cancelled when a winner is found.
> 3. **Immutable request cloning** — the `CosmosOperation` (which contains `&[u8]`
>    body, headers, partition key) is cheap to clone (bytes are `Arc`-backed).
> 4. **Respect existing systems** — hedging does not interfere with PPAF/PPCB,
I think you might want to reconsider this for PPCB - if the first region is so slow that the hedged second region wins repeatedly, this should probably trigger PPCB? Like e2e timeouts would do.
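As a strawman only (none of this is in the spec): a per-partition counter that feeds a PPCB-style signal once the hedged region keeps winning could look roughly like this; the threshold value and names are illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical: consecutive hedge wins for a (partition, region) pair before
// the caller feeds a PPCB signal (e.g. marking the primary region's partition
// replica as unavailable, the same way repeated e2e timeouts would).
const HEDGE_WIN_TRIP_THRESHOLD: u32 = 10;

pub struct HedgeWinTracker {
    consecutive_hedge_wins: AtomicU32,
}

impl HedgeWinTracker {
    pub const fn new() -> Self {
        Self {
            consecutive_hedge_wins: AtomicU32::new(0),
        }
    }

    /// The primary region won the race: reset the streak.
    pub fn record_primary_win(&self) {
        self.consecutive_hedge_wins.store(0, Ordering::Relaxed);
    }

    /// The hedged region won. Returns true when the streak is long enough that
    /// the caller should feed a PPCB-style signal for the primary region.
    pub fn record_hedge_win(&self) -> bool {
        let wins = self.consecutive_hedge_wins.fetch_add(1, Ordering::Relaxed) + 1;
        wins >= HEDGE_WIN_TRIP_THRESHOLD
    }
}
```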
```rust
/// Configuration error returned by fallible `HedgingStrategy` constructors.
#[derive(Debug, thiserror::Error)]
pub enum HedgingConfigError {
    #[error("hedging threshold must be > 0, got {0:?}")]
```
I think instead of making it fallible I would use newtypes for Duration that enforce the guard rails natively - that way you do not have to worry about validations.
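A minimal sketch of that newtype approach — `HedgingThreshold` and the clamp bounds are illustrative, not the spec's API:

```rust
use std::time::Duration;

/// Hypothetical non-zero, clamped hedging threshold. Construction is the only
/// place the guard rails live, so downstream code never re-validates.
#[derive(Clone, Copy, Debug)]
pub struct HedgingThreshold(Duration);

impl HedgingThreshold {
    /// Illustrative bounds only.
    const MIN: Duration = Duration::from_millis(50);
    const MAX: Duration = Duration::from_millis(4_000);

    /// Clamp instead of failing: any input yields a usable threshold, so there
    /// is no fallible constructor and no error enum to propagate.
    pub fn new(requested: Duration) -> Self {
        Self(requested.clamp(Self::MIN, Self::MAX))
    }

    pub fn get(self) -> Duration {
        self.0
    }
}
```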
> > read account that has only one *write* region should still hedge writes
> > only when the write list has ≥ 2 entries.
>
> ### 5.2 Default Hedging Enablement Driven by PPAF
The only reason why we combined PPAF and enabling hedging in .NET and Java was that PPAF was an opt-in customers have to do explicitly anyway - and it was the only way to enable hedging as automatically as possible without possibly being breaking.
In Rust these are independent features - and hedging should be on by default, independent of PPAF - maybe we allow an opt-out (could be by just allowing a threshold that is artificially high) - no threshold-step is needed anymore if we gate hedging to at most one region. That simplifies the API surface a lot IMO.
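If hedging is gated to at most one extra region, the surface could plausibly shrink to a single knob along these lines (struct name and default value are hypothetical):

```rust
use std::time::Duration;

/// Hypothetical minimized surface: hedging is on by default, fires at most one
/// hedge to the next preferred region, and the threshold is the only knob.
#[derive(Clone, Debug)]
pub struct HedgingOptions {
    /// Latency after which the single hedge is issued. Setting this very high
    /// (e.g. minutes) is the de-facto opt-out — no separate enable/disable
    /// flag, no threshold_step, no per-region fan-out settings.
    pub threshold: Duration,
}

impl Default for HedgingOptions {
    fn default() -> Self {
        Self {
            // Illustrative default only.
            threshold: Duration::from_millis(500),
        }
    }
}
```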
> > still pending.

```rust
async fn execute_with_hedging(
```
I skipped this for now - it will become much simpler when we allow hedging to at most one region.
```rust
/// Diagnostic information about a hedging execution, attached to the winning
/// response.
#[derive(Clone, Debug)]
pub struct HedgeDiagnostics {
```
Seems too verbose - didn't we align on a much narrower surface area in .NET / for IC3?
LGTM overall - but I think there are a few areas that need further iteration:
- No correlation at all between hedging and PPAF
- Limit hedging to at most one region?
- Don't allow hedging for writes at all?
- Can the config surface area be minimized - do we need more than the threshold? Can that be the way to opt out?
- Should a second-region hedging winner impact PPCB?
- Alignment with TRANSPORT_PIPELINE_SPEC.md
- Alignment with the FeedOperation spec Ashley is working on - hedging at the level proposed might not work well with queries (where hedging really should happen for each "Page" individually).
> ## 3. Architectural Overview
>
> ### 3.1 Where Hedging Sits in the Driver
This contradicts TRANSPORT_PIPELINE_SPEC §4.2 - let us discuss what is the better approach - but it might need reconciliation with the pipeline spec.
🔴 Blocking — Direct contradiction with TRANSPORT_PIPELINE_SPEC.md §4.2 — needs reconciliation, not parallel specs
File: HEDGING_SPEC.md §3.1, §6 vs. TRANSPORT_PIPELINE_SPEC.md §4.2
The two specs disagree on essentially every design axis:
| Axis | TRANSPORT_PIPELINE_SPEC §4.2 | HEDGING_SPEC (this PR) |
|---|---|---|
| Layer | `OperationAction::Hedge { secondary_routing }` returned by `evaluate_transport_result` inside the pipeline loop | Orchestrator wraps `execute_operation_pipeline()` from above |
| Fan-out | Single secondary region (max 2 concurrent) | Up to N regions, progressive timer |
| Default | Enabled by default for all ops (writes only on MWR) | Off by default; auto-enabled only by PPAF |
| Threshold | Dynamic, P99-based, clamped 50–4000 ms | Static (`min(1000ms, RT/2)` / 500ms) |
| Decision enum | New `OperationAction::Hedge` variant (TPS line 463) | No new variant; orchestration is external |
| `ExecutionContext` | Has dedicated `Hedging` value (TPS line 300) | Not addressed |
> |---|:---:|:---:|:---:|
> | Document point reads (GetItem) | ✅ | ✅ | ✅ |
> | Document point writes on multi-master (Create/Replace/Upsert/Delete/Patch) | ✅ | ✅ | ✅ |
> | Queries (`QueryItems`) | ✅ | ✅ | ✅ |
For FeedOperations the model with hedging on top of execute_operation is non-trivial and needs careful integration with the query pipeline and Ashley's FeedRange spec. IMO this is a bit of an open question and will need some more thought and alignment between you and Ashley. The TRANSPORT_PIPELINE_SPEC model makes it a bit simpler - but in either case this needs some more investigation.
simorenoh
left a comment
Just nits - looks good. Interested in what Fabian mentioned about hedging and PPCB basically playing together after a region has been picked several times through hedging, and in the decision on enabling this by default like PPCB / using a single additional region only.
> ### 2.3 Eligibility — `ShouldHedge()`
>
> Hedging applies **only** to document-level point operations:
Since we mention queries / ReadMany above as part of the inclusion:

Suggested change:
```diff
-Hedging applies **only** to document-level point operations:
+Hedging applies **only** to document-level operations:
```
```rust
/// Sentinel value used to disable hedging for a specific operation when a
/// client-level strategy is configured.
```
If we're enabling hedging by default per the other design docs, this also applies to the entire client.
```csharp
public static AvailabilityStrategy CrossRegionHedgingStrategy(
    TimeSpan threshold,                         // Time before first hedge fires
    TimeSpan? thresholdStep,                    // Time between subsequent hedges
    bool enableMultiWriteRegionHedge = false);  // Opt-in for writes on MM
```
Seems risky, but as long as we have an explicit config for this it makes sense.
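For comparison, a Rust-flavored sketch of the same explicit opt-in (builder name and method are hypothetical, not the spec's final API):

```rust
use std::time::Duration;

/// Hypothetical builder mirroring the .NET factory above: write hedging on
/// multi-master stays off unless explicitly requested.
#[derive(Clone, Debug)]
pub struct HedgingStrategyBuilder {
    threshold: Duration,
    threshold_step: Option<Duration>,
    enable_multi_write_region_hedge: bool,
}

impl HedgingStrategyBuilder {
    pub fn new(threshold: Duration) -> Self {
        Self {
            threshold,
            threshold_step: None,
            // Reads/queries only by default.
            enable_multi_write_region_hedge: false,
        }
    }

    /// Explicit, deliberate opt-in for hedging writes on multi-master accounts.
    pub fn enable_multi_write_region_hedge(mut self, enabled: bool) -> Self {
        self.enable_multi_write_region_hedge = enabled;
        self
    }
}
```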
NaluTripician
left a comment
Hedging ↔ x-ms-cosmos-hub-region-processing-only header coordination
Reviewing #4330 alongside two related PRs surfaced a coordination gap that the spec
should record before any orchestrator code lands. Posting the proposed addition here
as a review with four inline suggestions you can apply à la carte; each one is
independently mergeable.
Context — the two related PRs
- Rust PR #4389 added the `x-ms-cosmos-hub-region-processing-only: True` header. It is emitted on the retry triggered by the first `404 / 1002 (READ_SESSION_NOT_AVAILABLE)` on a single-master, data-plane operation and on every subsequent attempt within that operation. The latch is a `bool` field on `OperationRetryState` (components.rs:108-125), set in `build_session_retry_state` (retry_evaluation.rs:355-381), and consumed in `apply_hub_region_header` (operation_pipeline.rs:945-968). Full normative spec: HUB_REGION_PROCESSING_HEADER_SPEC.md. The header tells the gateway "this client has already discovered the hub region — process the request only in that region", which lets the operation skip the multi-region discovery round-trip on every retry after the first 1002.
- .NET azure-cosmos-dotnet-v3#5815 fixed the exact failure mode this comment is about for the .NET v3 SDK. Before #5815, each hedged request in `CrossRegionHedgingAvailabilityStrategy` carried its own copy of the latch state, so every hedge independently re-ran the 404/1002 discovery cycle and the header's "one round-trip per operation" guarantee held only inside an individual hedge, not across the fan-out. The fix: "added a new `CrossRegionAvailabilityContext` (with the property `ShouldAddHubRegionProcessingOnlyHeader`) that is propagated through the `RequestMessage.Properties` to every cloned hedge request" — quoted in §9.5.1. Because `Properties` is a `Dictionary<string, object>` whose `Clone()` is shallow, every clone gets a shared reference to one `CrossRegionAvailabilityContext` instance; latching it once is observable from all hedges.
The gap as it lands today in Rust
Spec §8.2 of this PR explicitly lists OperationRetryState as cloned per hedge.
PR #4389 stores the hub-region latch as a bool on OperationRetryState. So the
two PRs in isolation are each correct, but composed they recreate the exact
pre-fix .NET v3 behavior — every hedge in the fan-out independently goes through its
own 404/1002 discovery cycle, and the header buys nothing under hedging except for
the one hedge that happens to observe 1002 first.
The proposed fix
Rust counterpart of .NET v3's CrossRegionAvailabilityContext: extend
OperationRetryState with a shared_hub_region_latch: Option<Arc<AtomicBool>>
that is Some only while running under execute_with_hedging(). Arc<AtomicBool>
is the Rust idiom equivalent to "shared mutable object propagated via
RequestMessage.Properties"; the Option lets the non-hedged path keep PR #4389’s
behavior bit-for-bit when shared_hub_region_latch = None.
The four suggestions on this review record the design in this PR's spec:
| Suggestion | Anchor | What it adds |
|---|---|---|
| §8.2 carve-out | line 1110 | New bullet in the "Items shared (via Arc or reference)" list pointing to §9.5. |
| §9.5 full section | line 1292 | The new normative section between §9.4 and §10. |
| §15.1 unit tests | line 1715 | Five rows covering: latch initialization, the two negative cases (non-hedged path, multi-master/metadata), cross-hedge propagation, and the no-1002-no-header invariant. |
| §15.2 fault-injection test | line 1732 | One row asserting end-to-end propagation under a 2-region SM data-plane fault injection. |
I've also left small inline suggestions on
#4389 for the three
load-bearing docstrings in the implementation (OperationRetryState::hub_region_processing_only,
build_session_retry_state, apply_hub_region_header) so the forward reference
to §9.5 lives next to the latch sites themselves.
No code in PR #4389 needs to ship a behavior change for this PR to merge — these
suggestions are doc only. The behavior change lives in the orchestrator PR that
introduces execute_with_hedging(), and §9.5 of this spec is what that PR will
point at for its acceptance criteria.
> - `CosmosOperation` — immutable; body is `Bytes` (cheaply cloneable)
> - `LocationStateStore` — lock-free; multiple readers are safe
> - `SessionManager` — designed for concurrent access
> - `Credential` — `Arc`-wrapped
Suggested addition — §8.2 carve-out for the hub-region latch.
Adds the shared Arc<AtomicBool> to the “Items shared (via Arc or reference)” bullet list and forward-references the new §9.5 for full rationale. This is the core hand-off — without this carve-out, the §8.2 “items cloned per hedge” rule (which covers OperationRetryState) silently makes the hub-region latch per-hedge and defeats the discovery propagation.
Suggested change:
```diff
 - `Credential` — `Arc`-wrapped
+- **Hub-region-processing-only latch** — a single `Arc<AtomicBool>` is
+  shared across the primary and every hedge for the lifetime of the
+  outer operation. See §9.5 for the full rationale; the short version
+  is that the per-`OperationRetryState` `hub_region_processing_only`
+  field added by [PR #4389](https://github.com/Azure/azure-sdk-for-rust/pull/4389)
+  is otherwise per-hedge, which would force every hedge to independently
+  re-discover the hub region via its own 404/1002 cycle. .NET v3 hit and
+  fixed this in [PR #5815](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5815)
+  via the `CrossRegionAvailabilityContext` shared object; Rust must
+  adopt the equivalent shared signal.
```
> Implication: late hedges have less time budget. If the deadline is 5s and the
> threshold is 3s, the hedge has only ~2s to complete.
>
> ---
Suggested addition — new §9.5 “Hub-Region-Processing-Only Header”.
This is the substantive content. It records:
- §9.5.1 — the correctness gap: per-hedge `OperationRetryState` clones each carry an independent `hub_region_processing_only: bool` from PR #4389, so without coordination every hedge re-runs its own 404/1002 discovery cycle. Includes a verbatim quote from azure-cosmos-dotnet-v3#5815 where .NET hit and fixed the same gap via `CrossRegionAvailabilityContext`.
- §9.5.2 — the required Rust design: extend `OperationRetryState` with `shared_hub_region_latch: Option<Arc<AtomicBool>>`, construct one `Arc<AtomicBool>` in `execute_with_hedging()` before fan-out, CAS-set it in `build_session_retry_state` alongside the per-state flag (Release), OR it into emission in `apply_hub_region_header` (Acquire). Includes the two minimal code samples.
- §9.5.3 — eligibility gates (data-plane + single-master + `regions.len() > 1`); when any fails, `shared_hub_region_latch = None` and the PR #4389 ("Emit x-ms-cosmos-hub-region-processing-only header") behavior is preserved bit-for-bit.
- §9.5.4 — non-interference with the §8.4 local-only-retry invariant.
- §9.5.5 — concurrency / memory-ordering notes.
---

### 9.5 Hub-Region-Processing-Only Header

The driver emits the `x-ms-cosmos-hub-region-processing-only: True` request header on retries triggered by a `404 / 1002 (READ_SESSION_NOT_AVAILABLE)` response, scoped to **single-master data-plane** operations. The header is specified in [`HUB_REGION_PROCESSING_HEADER_SPEC.md`](../../azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md) and implemented in [Rust PR #4389](https://github.com/Azure/azure-sdk-for-rust/pull/4389) / [.NET PR #5447](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5447) (parity baseline).

#### 9.5.1 The hedging-specific correctness gap

The Rust latch lives on `OperationRetryState` ([`components.rs::OperationRetryState::hub_region_processing_only`](../src/driver/pipeline/components.rs)) and is set in [`retry_evaluation.rs::build_session_retry_state`](../src/driver/pipeline/retry_evaluation.rs) when all four conditions hold (cf. spec §7.1):

1. `is_dataplane`
2. `!can_use_multiple_write_locations` (single-master account)
3. `session_token_retry_count == 0` (first 1002 within the operation)
4. `!hub_region_processing_only` (idempotency)

It is consumed in [`operation_pipeline.rs::apply_hub_region_header`](../src/driver/pipeline/operation_pipeline.rs) on every subsequent transport attempt of the same operation.

Per §8.2, **each hedge has its own `OperationRetryState`**. Without additional coordination, this means each hedge — primary, hedge 1, hedge 2, … — would independently observe its own first 1002, then independently re-issue the next attempt with the header set. Every hedge pays the full hub-discovery latency cost; the header's purpose (*bound the discovery cycle to a single 1002 round-trip per operation*) is defeated for everyone except the lucky hedge that observes 1002 first.

This is the same gap .NET v3 had after its first hub-region header PR ([#5447](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5447)) and **explicitly fixed** in [PR #5815 — *Read Consistency Strategy: Adds hub region header for LastCommittedWriteRegion strategy*](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5815), in the section *"Hedging request with hub region header"*:

> When `CrossRegionHedgingAvailabilityStrategy` is active, the primary request may discover the hub region mid-flight … Hedged requests are clones of the original and run with their own `ClientRetryPolicy` instance, so they would normally repeat the entire hub discovery cycle independently. To avoid this redundant retry overhead, we introduce a `CrossRegionAvailabilityContext` — a lightweight shared object with a volatile `bool ShouldAddHubRegionProcessingOnlyHeader` flag. This context is injected into `RequestMessage.Properties` before the clone loop in `CrossRegionHedgingAvailabilityStrategy`. Since `Clone()` performs a shallow dictionary copy, all clones (primary + hedges) share the same `CrossRegionAvailabilityContext` reference. When the primary's `ClientRetryPolicy` sets the hub flag after 2× 404/1002, it also sets the flag on the shared context. Each hedge's `ClientRetryPolicy.OnBeforeSendRequest` reads this shared flag on every attempt and attaches the `x-ms-cosmos-hub-region-processing-only` header immediately — without needing to go through its own 404/1002 discovery.

The Rust orchestrator MUST adopt the equivalent design.

#### 9.5.2 Required design — `Arc<AtomicBool>` shared latch

Construct a single `Arc<AtomicBool>` in `execute_with_hedging()` **before any hedge is spawned**, and thread it into every pipeline invocation (primary and hedges). Concretely:

```rust
// In execute_with_hedging(), before the spawn-and-race loop:
let shared_hub_region_latch: Arc<AtomicBool> = Arc::new(AtomicBool::new(false));

// When constructing each hedge's pipeline params:
let retry_state = OperationRetryState::initial(/* … */)
    .with_shared_hub_region_latch(shared_hub_region_latch.clone());
```
This requires a small extension to `OperationRetryState`:

```rust
pub struct OperationRetryState {
    // … existing fields …

    /// Per-operation hub-region-processing-only latch.
    /// Sticky for the lifetime of this `OperationRetryState`.
    pub hub_region_processing_only: bool,

    /// Cross-hedge shared latch. `Some(_)` only when this operation is
    /// running inside `execute_with_hedging()` — `None` on the
    /// non-hedged code path so today's allocator behavior is preserved.
    ///
    /// Mirrors .NET v3's `CrossRegionAvailabilityContext` injected into
    /// `RequestMessage.Properties` before the clone loop
    /// (azure-cosmos-dotnet-v3 PR #5815).
    pub shared_hub_region_latch: Option<Arc<AtomicBool>>,
}
```

The two existing call sites are then extended:
- `build_session_retry_state` (latch-set side). When the four trigger conditions fire and the new state sets `hub_region_processing_only = true`, also CAS-set `shared_hub_region_latch` to `true` if present:

  ```rust
  if let Some(shared) = &retry_state.shared_hub_region_latch {
      shared.store(true, Ordering::Release);
  }
  ```

  `Release` is sufficient — the only thing being published is the bool itself; no further state hangs off it.

- `apply_hub_region_header` (header-emission side). Emit the header when either the per-state latch is set or the shared latch is set:

  ```rust
  let emit = retry_state.hub_region_processing_only
      || retry_state
          .shared_hub_region_latch
          .as_ref()
          .map(|shared| shared.load(Ordering::Acquire))
          .unwrap_or(false);

  if emit {
      transport_request.headers.insert(
          HeaderName::from_static(request_header_names::HUB_REGION_PROCESSING_ONLY),
          HeaderValue::from_static("True"),
      );
  }
  ```
This preserves the §5/§7/§8 invariants of
HUB_REGION_PROCESSING_HEADER_SPEC.md (account-level scope, data-plane
scope, idempotency / sticky semantics) on a per-hedge basis while
also propagating the discovery from any hedge to every other hedge as
soon as it happens.
#### 9.5.3 Eligibility — when the shared latch is actually wired

The shared latch is only populated when all of the following are true at the start of `execute_with_hedging()`:

| Condition | Why |
|---|---|
| Operation is data-plane (`is_dataplane`) | Mirrors the §1.5 scope of HUB_REGION_PROCESSING_HEADER_SPEC.md. |
| Account is single-master (`!can_use_multiple_write_locations`) | Mirrors AC-4 of HUB_REGION_PROCESSING_HEADER_SPEC.md; multi-master accounts have a separate recovery path and the header is never emitted. |
| Hedging actually fans out (`regions.len() > 1`) | When the orchestrator falls through to the single-region path (§6.4), the per-state latch alone is sufficient — there is no second hedge to propagate to. |

When any condition fails, `shared_hub_region_latch` is `None` and the existing per-state behavior from PR #4389 is preserved bit-for-bit.
#### 9.5.4 Interaction with §8.4 (Local-only retries inside a hedge)

The §8.4 local-only-retry contract is unaffected by the shared latch: the latch governs only which request header is emitted, not the endpoint resolution. `ExcludeRegions` continues to pin each hedge to its own region across retries; the shared latch merely ensures every hedge's retries — within their pinned region — also carry the hub-region hint once any hedge has observed 1002. No new retry trigger paths or region-fallback edges are introduced.
#### 9.5.5 Concurrency notes

- `AtomicBool` with `Release`/`Acquire` ordering is sufficient — the bool is the only thing being shared and there is no dependent state. `Relaxed` would also be functionally correct (single-flag race with monotonic 0 → 1 transition) but `Release`/`Acquire` is preferred for reader/code-author clarity and costs nothing on every architecture the Rust SDK targets.
- The latch is monotonic 0 → 1 and never reset within an operation — matches the "sticky" semantics of the per-state latch (components.rs:108-125).
- The `Arc` is scoped to one outer `execute_with_hedging()` call, so it is dropped when the orchestrator returns (no global state, no leak across operations).
- A losing hedge whose transport already responded after cancellation (cf. §14.2) may still observe and CAS-set the shared latch — this is benign: the orchestrator has already returned a winner, and the next observer of the dropped `Arc` is no one.
> | `hedging_config_requires_explicit_step` | `threshold_step` must be provided explicitly; constructor does not default it from `threshold` |
> | `region_exclusion_for_hedge_n` | Correct ExcludeRegions per hedge |
> | `exclude_regions_honored_by_every_retry_trigger` | For each retry trigger class — PPAF write retry, PPCB markdown failback, transport-layer 503, throttling 429, session-token 1002 — fault-inject the trigger inside a hedge and assert the retry attempt does **not** route to a region listed in the hedge's `ExcludeRegions`. Encodes the §8.4 cross-cutting invariant; new retry triggers added in later phases must extend this test. |
> | `app_cancel_preserves_hedge_diagnostics` | Cancel the application token mid-fan-out; assert the returned error carries `HedgeDiagnostics` from the most-advanced in-flight hedge (covers §6.5 invariant #6). |
Suggested addition — five §15.1 unit-test rows for §9.5.
One row per acceptance criterion in §9.5.2/§9.5.3, including the cross-hedge propagation test (shared_hub_region_latch_propagates_first_1002_to_other_hedges) which is the Rust counterpart of .NET PR #5815’s CrossRegionAvailabilityContext_PropagatesHubHeaderFlagToHedgedRequests test.
Suggested change:
```diff
 | `app_cancel_preserves_hedge_diagnostics` | Cancel the application token mid-fan-out; assert the returned error carries `HedgeDiagnostics` from the most-advanced in-flight hedge (covers §6.5 invariant #6). |
+| `shared_hub_region_latch_initialized_when_eligible` | `execute_with_hedging()` invoked on a data-plane / single-master operation with `regions.len() > 1`; assert every hedge's `OperationRetryState.shared_hub_region_latch` is `Some(_)` and points to the same `Arc<AtomicBool>` instance (encodes §9.5.2 / §9.5.3). |
+| `shared_hub_region_latch_none_on_non_hedged_path` | `execute_with_hedging()` falls through to `execute_operation_pipeline` because `regions.len() <= 1`; assert `shared_hub_region_latch` is `None` (preserves PR #4389 baseline allocator behavior — §9.5.2). |
+| `shared_hub_region_latch_none_on_multi_master_or_metadata` | Multi-master *or* metadata pipeline; assert `shared_hub_region_latch` is `None` even when hedging fans out, matching `HUB_REGION_PROCESSING_HEADER_SPEC.md` §5 account-level / §1.5 data-plane gates (§9.5.3). |
+| `shared_hub_region_latch_propagates_first_1002_to_other_hedges` | Drive 1002 through `build_session_retry_state` on hedge 0; assert (a) hedge 0's per-state `hub_region_processing_only` is `true`, (b) the shared `Arc<AtomicBool>` is `true`, (c) on the next transport attempt, hedge 1 and hedge 2 — whose per-state latches are still `false` — both have `apply_hub_region_header` emit the header. Encodes the §9.5 cross-hedge invariant and is the Rust counterpart of .NET PR #5815's `CrossRegionAvailabilityContext_PropagatesHubHeaderFlagToHedgedRequests` test. |
+| `shared_hub_region_latch_no_1002_emits_no_header` | No hedge observes 1002; assert no hedge calls `apply_hub_region_header` with the header set. |
```
> | `hedging_with_ppcb` | 503 on Region A reads; PPCB enabled | PPCB and hedging both apply; circuit breaker tripped AND hedge succeeds |
> | `hedging_cancels_losers` | Delay on Region A | Region B wins; verify Region A task cancelled (hit_count ≤ expected) |
> | `hedging_failback_to_primary` | Region A initially slow, then fast | First few reads hedged; after threshold tightened, primary wins again |
> | `hedging_exclude_regions_under_503_retry` | Region B inside hedge returns 503 (triggers transport retry) while Region C is healthy and excluded by that hedge's `ExcludeRegions` | Hedge B's retry stays pinned to Region B (does NOT fall back to Region C) — fault-injection counterpart to the §8.4 invariant unit test. |
Suggested addition — one §15.2 fault-injection integration test for §9.5.
Fault-injects 1002 on the primary's first attempt against Region A and asserts the header propagates to the Region-B hedge — even though Region B never observes 1002 itself. The end-to-end counterpart of the §15.1 unit test, and the direct Rust analogue of .NET PR #5815's emulator-level coverage.
Suggested change:
```diff
 | `hedging_exclude_regions_under_503_retry` | Region B inside hedge returns 503 (triggers transport retry) while Region C is healthy and excluded by that hedge's `ExcludeRegions` | Hedge B's retry stays pinned to Region B (does NOT fall back to Region C) — fault-injection counterpart to the §8.4 invariant unit test. |
+| `hedging_hub_region_header_propagates_across_hedges` | 2-region SM data-plane account; fault-inject `404/1002` on the primary's first attempt against Region A, healthy 200 on Region B after threshold | Primary's retry against Region A emits `x-ms-cosmos-hub-region-processing-only: True` (per-state latch) **and** the hedge spawned against Region B emits the same header on every attempt — without itself ever observing a 1002 (per the shared `Arc<AtomicBool>` from §9.5). Encodes the cross-hedge propagation invariant under fault injection; counterpart of .NET PR #5815's `CrossRegionAvailabilityContext_PropagatesHubHeader…` emulator tests. |
```
Summary
Adds `HEDGING_SPEC.md` to the `azure_data_cosmos_driver` crate's `docs/` folder. This is a doc-only PR — no production code, no API changes, no test changes, no `Cargo.toml` changes. Single file, +1569 lines.

The spec is the design document for the cross-region hedging (`AvailabilityStrategy` / `HedgingStrategy`) feature that will be implemented in a follow-up series of PRs. It is being landed on the previews branch ahead of implementation so reviewers can iterate on the design in-tree alongside the companion specs already there.

Why hedging?
When a Cosmos DB region is degraded but not fully down (elevated tail latency, slow GC, partial network blip) the existing failover paths — PPAF (per-partition automatic failover) and PPCB (per-partition circuit breaker) — do not trigger because the region eventually returns successful responses. Applications see p99 / p99.9 latency spikes on requests routed to the slow region.
Cross-region hedging issues a speculative second request to an alternate region after a configurable threshold and returns whichever response arrives first, bounding tail latency at roughly `threshold + cross-region-RTT`.

Scope of this PR

In scope: Design document only.

Out of scope (follow-up PRs):
- `HedgingStrategy` / `AvailabilityStrategy` types
- `should_hedge()` / `is_final_result()` pure functions
- `execute_with_hedging()` orchestrator
- `HedgeDiagnostics`
- `cosmos_driver.rs`

What the spec covers (17 sections, ~1,569 lines)
- … (`CrossRegionHedgingAvailabilityStrategy`), including PPAF/PPCB integration
- … (`HedgingStrategy`, builder, environment variables)
- … (`should_hedge()` decision matrix) and default hedging enablement driven by PPAF with a full activation truth table
- … `threshold + N · step`, `tokio::select!` race, drain loop
- … (`403/403/3` rows)
- … (`ExcludeRegions` invariant for retries inside a hedge)
- … (`HedgeDiagnostics` shape, attachment contract, reserved `cosmos.hedge.*` tracing/metrics surface)
- … (`tokio_util::sync::CancellationToken` hierarchy)
Design highlights
- … does not auto-enable hedging — PPCB is failure-driven and does not by itself signal a desire for latency hedging.
- … `min(1000ms, request_timeout / 2)` / `500ms`, mirroring .NET (Java's static `500ms / 100ms` is documented in the cross-SDK comparison for reference but not adopted).
- … `tokio_util::sync::CancellationToken` for cancellation, `Arc<Bytes>` for zero-copy hedge body sharing.
- … `AvailabilityStrategy::Disabled` enum variant rather than .NET's nullable `TimeSpan?` / sentinel object pattern.
- … `HedgeDiagnostics` attached whenever a strategy was active, avoiding .NET's two-shape (fast-path vs drain-path) bookkeeping.
- … retry contract (§8.4), diagnostics attachment contract (§10.1), preferred-regions precondition (§5.2).
- … `cosmos.hedge.*` for tracing events and metrics, ready for a future observability PR without breaking changes.
Cross-SDK alignment
The spec was reviewed against both the .NET v3 (`CrossRegionHedgingAvailabilityStrategy`) and Java v4 (`ThresholdBasedAvailabilityStrategy`) implementations. Final alignment summary: .NET defaults to `min(1000ms, RT/2)` / `500ms`, Java to a static `500ms / 100ms`, and the Rust spec adopts `min(1000ms, RT/2)` / `500ms` (= .NET).

Validation
- … `Azure/azure-cosmos-dotnet-v3` (main) and `Azure/azure-sdk-for-java` (main)
- … `§N.N` references resolve to existing spec sections
- … `.rs`, `.toml`, `Cargo.lock`, or test changes; CI affecting only the doc-render job
Branch / target
- Branch: `users/kundadebdatta/3935_add_hedging_spec`
- Target: `release/azure_data_cosmos-previews`
- … (`Code changes to add hedging spec.` → `Code changes to update hedging spec.`)
The implementation will land in subsequent PRs roughly tracking §16 phases:
`HedgingStrategy` types + `should_hedge()` + `is_final_result()`