
Cosmos: Adds Cross-Region Hedging Design Spec to Driver Crate#4330

Open
kundadebdatta wants to merge 6 commits into
release/azure_data_cosmos-previews from
users/kundadebdatta/3935_add_hedging_spec

Conversation

@kundadebdatta
Member

@kundadebdatta kundadebdatta commented May 3, 2026

Summary

Adds HEDGING_SPEC.md to the azure_data_cosmos_driver crate's docs/ folder. This is a doc-only PR — no production code, no API changes, no test changes, no Cargo.toml changes. Single file, +1569 lines.

The spec is the design document for the cross-region hedging (AvailabilityStrategy / HedgingStrategy) feature that will be implemented in a follow-up series of PRs. It is being landed on the previews branch ahead of implementation so reviewers can iterate on the design in-tree alongside the companion specs already there.

Why hedging?

When a Cosmos DB region is degraded but not fully down (elevated tail latency, slow GC, partial network blip) the existing failover paths — PPAF (per-partition automatic failover) and PPCB (per-partition circuit breaker) — do not trigger because the region eventually returns successful responses. Applications see p99 / p99.9 latency spikes on requests routed to the slow region.

Cross-region hedging issues a speculative second request to an alternate region after a configurable threshold and returns whichever response arrives first, bounding tail latency at roughly threshold + cross-region-RTT.

Scope of this PR

In scope: Design document only.

Out of scope (follow-up PRs):

  • HedgingStrategy / AvailabilityStrategy types
  • should_hedge() / is_final_result() pure functions
  • execute_with_hedging() orchestrator
  • HedgeDiagnostics
  • Integration into cosmos_driver.rs
  • PPAF default-strategy auto-enable wiring
  • Unit + fault-injection tests

What the spec covers (17 sections, ~1,569 lines)

§ Topic
1 Goals, non-goals, and the phased operation-type rollout (Phase 1 reads + writes; Phase 2 Query + ReadMany; Phase 3 ChangeFeed + metadata; Future sprocs/triggers/UDFs)
2 Background: full walkthrough of the .NET v3 reference implementation (CrossRegionHedgingAvailabilityStrategy), including PPAF/PPCB integration
3 Architectural overview
4 Configuration surface (HedgingStrategy, builder, environment variables)
5 Eligibility rules (should_hedge() decision matrix) and default hedging enablement driven by PPAF with a full activation truth table
6 Hedging algorithm — primary at t=0, hedge fan-out at threshold + N · step, tokio::select! race, drain loop
7 Final-vs-transient status code classification (incl. explicit 403 / 403/3 rows)
8 Operation-pipeline integration, including the explicit local-only-retry contract (ExcludeRegions invariant for retries inside a hedge)
9 Interaction with PPAF, PPCB, session consistency, throughput control, end-to-end timeout
10 Diagnostics & observability (HedgeDiagnostics shape, attachment contract, reserved cosmos.hedge.* tracing/metrics surface)
11 Options API design and layered resolution priority table (operation > client > SDK default > none)
12 Cancellation & resource cleanup (lock-free tokio_util::sync::CancellationToken hierarchy)
13 Multi-write region write hedging (409/412 risk, idempotency considerations)
14 Error handling & edge cases (RU accounting, late-hedge budget)
15 Test plan
16 Implementation phases
17 Open questions (most resolved during spec review)

Design highlights

  • PPAF-only auto-enable, matching .NET v3 exactly. Enabling PPCB alone
    does not auto-enable hedging — PPCB is failure-driven and does not by
    itself signal a desire for latency hedging.
  • Default thresholds when auto-enabled by PPAF:
    min(1000ms, request_timeout / 2) / 500ms, mirroring .NET
    (Java's static 500ms / 100ms is documented in the cross-SDK comparison
    for reference but not adopted).
  • Lock-free design throughout: tokio_util::sync::CancellationToken
    for cancellation, Arc<Bytes> for zero-copy hedge body sharing.
  • Type-safe disabled sentinel via AvailabilityStrategy::Disabled enum
    variant rather than .NET's nullable TimeSpan? / sentinel object pattern.
  • Always-full HedgeDiagnostics attached whenever a strategy was active,
    avoiding .NET's two-shape (fast-path vs drain-path) bookkeeping.
  • Explicit contracts that .NET and Java rely on implicitly: local-only
    retry contract (§8.4), diagnostics attachment contract (§10.1),
    preferred-regions precondition (§5.2).
  • Reserved telemetry namespace under cosmos.hedge.* for tracing events
    and metrics, ready for a future observability PR without breaking changes.

Cross-SDK alignment

The spec was reviewed against both the .NET v3 (CrossRegionHedgingAvailabilityStrategy) and Java v4 (ThresholdBasedAvailabilityStrategy) implementations. Final alignment summary:

| Capability | .NET v3 | Java v4 | Rust spec |
|---|---|---|---|
| Auto-enabled by PPAF | | | |
| Auto-enabled by PPCB alone | | | |
| Default thresholds | min(1000ms, RT/2) / 500ms | 500ms / 100ms | min(1000ms, RT/2) / 500ms (= .NET) |
| Phase-1 read hedging | | | |
| Phase-1 write hedging on multi-master | | | |
| Query / ReadMany | | | Phase 2 |
| ChangeFeed + metadata | ✅ / ❌ | ❌ / ❌ | Phase 3 |

Validation

  • Markdown only — renders correctly in GitHub preview
  • All cross-SDK source citations link to specific file/line ranges on
    Azure/azure-cosmos-dotnet-v3 (main) and Azure/azure-sdk-for-java (main)
  • All internal §N.N references resolve to existing spec sections
  • No .rs, .toml, Cargo.lock, or test changes; only the doc-render CI job
    is affected

Branch / target

  • Source branch: users/kundadebdatta/3935_add_hedging_spec
  • Target branch: release/azure_data_cosmos-previews
  • Commits: 2 (Code changes to add hedging spec.
    Code changes to update hedging spec.)
  • Files changed: 1
  • Diff: +1,569 / -0

Follow-up work tracker

The implementation will land in subsequent PRs roughly tracking §16 phases:

  1. Phase 1 — HedgingStrategy types + should_hedge() + is_final_result()
    • orchestrator + reads/writes (multi-master) + PPAF auto-enable
  2. Phase 2 — Query / ReadMany hedging
  3. Phase 3 — Change feed / metadata operation hedging
  4. Future — sprocs/triggers/UDFs, adaptive thresholds, telemetry surface

@kundadebdatta kundadebdatta changed the title Code changes to add hedging spec. Cosmos: Adds Cross-Region Hedging Design Spec to Driver Crate May 4, 2026
@kundadebdatta kundadebdatta self-assigned this May 4, 2026
@kundadebdatta kundadebdatta moved this from Todo to In Progress in CosmosDB Go/Rust Crew May 4, 2026

@NaluTripician NaluTripician left a comment


Cross-checked the spec against the actual .NET source in azure-cosmos-dotnet-v3 (CrossRegionHedgingAvailabilityStrategy.cs, AvailabilityStrategy.cs, DocumentClient.InitializePartitionLevelFailoverWithDefaultHedging). Overall the spec is well-grounded — author clearly read the .NET code, not just the docs. One material issue around the SDK-default write-hedging behavior on multi-master, plus a handful of minor nits inline.

Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md
Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md Outdated
request_number,
region: regions[request_number].clone(),
result: Err(cancelled_error()),
}

Missing app-cancellation re-raise behavior from .NET. The .NET impl deliberately awaits the faulted task when the app token is the source of cancellation (lines 209–212 of CrossRegionHedgingAvailabilityStrategy.cs):

if (applicationProvidedCancellationToken.IsCancellationRequested)
{
    await (Task<HedgingResponse>)completedTask;
}

This is what allows RequestSenderAndResultCheckAsync to rethrow as CosmosOperationCanceledException with the trace attached (lines 358–363). The Rust pseudocode collapses both "hedge cancellation" and "app cancellation" into a generic cancelled_error(), losing the trace context that .NET preserves.

Suggest distinguishing app-token cancellation from hedge-token cancellation in the orchestrator and propagating the former with full diagnostics, mirroring .NET's behavior.

|---:|-----------|--------|
| 1 | No strategy resolved (or `AvailabilityStrategy::Disabled`) | No |
| 2 | Application preferred-region list empty (no fan-out targets) | No |
| 3 | `ResourceType != Document` | No |

This rule won't survive Phase 3. §16 Phase 3 adds metadata operations (Database / Container / Offer / Throughput), all of which have non-Document ResourceType. As written, this row is a hard reject for all non-Document ops — when Phase 3 lands, this row needs to become phase-/resource-type-gated rather than a blanket exclusion.

Worth either:

  • Tagging this row with "Phase 1 only — see §16 Phase 3 for metadata coverage", or
  • Restating it as ResourceType not in <phase-allowed set> so the eligibility rule's evolution is encoded in one place rather than getting rewritten in Phase 3.

Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md
@kundadebdatta kundadebdatta marked this pull request as ready for review May 5, 2026 22:58
@kundadebdatta kundadebdatta requested a review from a team as a code owner May 5, 2026 22:58
Copilot AI review requested due to automatic review settings May 5, 2026 22:58
Contributor

Copilot AI left a comment


Pull request overview

Adds a new in-repo design specification (HEDGING_SPEC.md) to the Cosmos driver crate docs, describing the planned cross-region hedging (availability strategy) feature and its intended integration with existing routing/retry systems.

Changes:

  • Adds a comprehensive design spec for cross-region request hedging (configuration, eligibility, algorithm, diagnostics, and phased rollout plan).
  • Documents intended interactions with PPAF/PPCB, session consistency, throughput control, deadlines, and cancellation.

Comment on lines +71 to +81
### Operation-type scope (phased)

| Operation type | Phase 1 | Phase 2 | Future |
|---|:---:|:---:|:---:|
| Document point reads (GetItem) | ✅ | ✅ | ✅ |
| Document point writes on multi-master (Create/Replace/Upsert/Delete/Patch) | ✅ | ✅ | ✅ |
| Queries (`QueryItems`) | ✅ | ✅ | ✅ |
| `ReadMany` | ✅ | ✅ | ✅ |
| Change feed | ✅ | ✅ | ✅ |
| Metadata operations (Database / Container / Offer / Throughput) | ❌ | ✅ | ✅ |
| Stored procedures / triggers / UDFs execution | ❌ | ❌ | 🟡 candidate |
Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md
### Solution: Speculative Hedging

**Hedging** sends the same request to an alternate region after a latency threshold
is exceeded, and returns whichever response arrives first. This bounds tail latency
Member


Suggested change
is exceeded, and returns whichever response arrives first. This bounds tail latency
is exceeded, and returns whichever final response arrives first. This bounds tail latency

**Hedging** sends the same request to an alternate region after a latency threshold
is exceeded, and returns whichever response arrives first. This bounds tail latency
at roughly `threshold + cross-region-RTT` instead of waiting for the slow region to
respond.
Member

@FabianMeiswinkel FabianMeiswinkel May 8, 2026


Should frequently choosing the hedged response from the second region also trigger PPCB? Like when identifying, at least for one partition, that the second region is always faster: why not skip even trying the first region after a few iterations? This is a bit like Bhaskar's initial proposal for hedging, which was more dynamic than what we have now. Please follow up with Bhaskar and let him give you the docs he had; we should re-evaluate whether at least some aspects of his initial idea would now make sense to adopt in Rust. We were not able to convince customers of Java V4 at the time, but it is definitely worth looking at again.

4. **Complementary to failover** — hedging handles *latency*; PPAF/PPCB handle
*failures*. They compose without interference.
5. **Resource-safe** — hedged requests that lose the race are cancelled promptly to
avoid wasted RU/s and transport resources.
Member


Intent makes sense, but it is not perfectly "safe", so I would set expectations a bit more carefully.

| `ReadMany` | ✅ | ✅ | ✅ |
| Change feed | ✅ | ✅ | ✅ |
| Metadata operations (Database / Container / Offer / Throughput) | ❌ | ✅ | ✅ |
| Stored procedures / triggers / UDFs execution | ❌ | ❌ | 🟡 candidate |
Member


Triggers are never executed separately: a header on a normal point operation is sent to the service indicating that the service should execute the trigger, so this is completely independent of hedging. If you hedge the createItem it is hedged with or without triggers. Just remove triggers and UDFs here; only stored procedure execution is its own "operation".

operation coverage where it is safe and cheap. Sprocs / triggers / UDFs
are deferred to Future because their server-side execution model
interacts with hedging in non-obvious ways (server-side state,
idempotency). See §16 for the full rollout plan.
Member


Triggers / UDFs are irrelevant here.

Stored procedures and point writes all have the same issue with idempotency in multi-master. I think in Phase 1 (and maybe all we ever do) we should stick to reads/queries only. Multi-master with hedging could result in a significantly higher number of conflicts, and that is usually not something you would want. So, my 2 cents: it was a mistake to ever enable hedging for writes in multi-master in Java, and very few customers if any have enabled it (it also requires opting into retriable writes). I would scope this down to Phase 1 reads, Phase 2 metadata operations, and explicitly not enable hedging for writes.

so a PPAF-enabled deployment ends up running with all three (PPAF + PPCB +
hedging) active simultaneously.

**The Rust driver matches .NET exactly:** the SDK-default hedging strategy
Member


To me we should simplify this. Always enable PPAF (if the server allows), PPCB, and hedging. For PPCB and hedging we could allow an opt-out as an escape hatch. PPAF will always follow the server signal. Whether we ship with the escape hatch or force enablement of PPCB and hedging we can decide after some more stress tests.


**Rationale:** Hedging must operate above the retry loop because each hedged
request needs its own independent retry state, session tokens, and endpoint
resolution. The operation pipeline already handles per-region retries; hedging
Member


This needs to dynamically add excluded regions. Also, we should make sure that any cloning/allocations only happen when the hedging threshold kicks in: any request not needing hedging should be free/super cheap. I would call this out here explicitly as a design constraint.

Member


Also I am wondering whether we ever really need to hedge to more than one region. Two Azure regions out at once is an extremely unlikely scenario, and when clients trigger hedging to more regions because the issue is on the client side, it usually makes things worse. My 2 cents: gate hedging to at most one region. The default routing policy in Rust uses proximity; if a customer intentionally chooses a slow second preferred region we should honor that choice even if a third region might be faster. That will simplify this and IMO is the right design from what we have learned so far.

cancelled when a winner is found.
3. **Immutable request cloning** — the `CosmosOperation` (which contains `&[u8]`
body, headers, partition key) is cheap to clone (bytes are `Arc`-backed).
4. **Respect existing systems** — hedging does not interfere with PPAF/PPCB,
Member


I think you might want to reconsider this for PPCB: if the first region is so slow that the hedged second region wins repeatedly, this should probably trigger PPCB? Like e2e timeouts would do.

/// Configuration error returned by fallible `HedgingStrategy` constructors.
#[derive(Debug, thiserror::Error)]
pub enum HedgingConfigError {
#[error("hedging threshold must be > 0, got {0:?}")]
Member

@FabianMeiswinkel FabianMeiswinkel May 8, 2026


I think instead of making it fallible I would use newtypes for Duration that enforce the guard rails natively; that way you do not have to worry about validation.

> read account that has only one *write* region should still hedge writes
> only when the write list has ≥ 2 entries.

### 5.2 Default Hedging Enablement Driven by PPAF
Member


The only reason we combined PPAF and enabling hedging in .NET and Java was that PPAF was an opt-in customers had to do explicitly anyway, and it was the only way to enable hedging as automatically as possible without possibly being breaking.

In Rust these are independent features, and hedging should be on by default, independent of PPAF. Maybe we allow an opt-out (could be by just allowing an artificially high threshold). No threshold-step is needed anymore if we gate hedging to at most one region. That simplifies the API surface a lot IMO.

> still pending.

```rust
async fn execute_with_hedging(
Member


I skipped this for now; it will become much simpler when we allow hedging to at most one region.

/// Diagnostic information about a hedging execution, attached to the winning
/// response.
#[derive(Clone, Debug)]
pub struct HedgeDiagnostics {
Member


Seems too verbose - didn't we align on a much narrower surface area in .Net/for IC3?

Member

@FabianMeiswinkel FabianMeiswinkel left a comment


LGTM overall, but I think there are a few areas that need further iteration:

  • No correlation at all between hedging and PPAF
  • Limit hedging to at most one region?
  • Don't allow hedging for writes at all?
  • Can the config surface area be minimized? Do we need more than the threshold, and can that be the way to opt out?
  • Should a second-region hedging winner impact PPCB?
  • Alignment with TRANSPORT_PIPELINE_SPEC.md
  • Alignment with the FeedOperation spec Ashley is working on; hedging at the level proposed might not work well with queries, where hedging really should happen for each "Page" individually

@github-project-automation github-project-automation Bot moved this from In Progress to Changes Requested in CosmosDB Go/Rust Crew May 8, 2026

## 3. Architectural Overview

### 3.1 Where Hedging Sits in the Driver
Member

@FabianMeiswinkel FabianMeiswinkel May 8, 2026


This contradicts TRANSPORT_PIPELINE_SPEC §4.2. Let us discuss which is the better approach, but this might need reconciliation with the pipeline spec.

🔴 Blocking — Direct contradiction with TRANSPORT_PIPELINE_SPEC.md §4.2 — needs reconciliation, not parallel specs

File: HEDGING_SPEC.md §3.1, §6 vs. TRANSPORT_PIPELINE_SPEC.md §4.2

The two specs disagree on essentially every design axis:

| Axis | TRANSPORT_PIPELINE_SPEC §4.2 | HEDGING_SPEC (this PR) |
|---|---|---|
| Layer | `OperationAction::Hedge { secondary_routing }` returned by `evaluate_transport_result` inside the pipeline loop | Orchestrator wraps `execute_operation_pipeline()` from above |
| Fan-out | Single secondary region (max 2 concurrent) | Up to N regions, progressive timer |
| Default | Enabled by default for all ops (writes only on MWR) | Off by default; auto-enabled only by PPAF |
| Threshold | Dynamic, P99-based, clamped 50–4000 ms | Static (`min(1000ms, RT/2)` / 500ms) |
| Decision enum | New `OperationAction::Hedge` variant (TPS line 463) | No new variant; orchestration is external |
| ExecutionContext | Has dedicated `Hedging` value (TPS line 300) | Not addressed |

|---|:---:|:---:|:---:|
| Document point reads (GetItem) | ✅ | ✅ | ✅ |
| Document point writes on multi-master (Create/Replace/Upsert/Delete/Patch) | ✅ | ✅ | ✅ |
| Queries (`QueryItems`) | ✅ | ✅ | ✅ |
Member


For FeedOperations the model with hedging on top of execute_operation is non-trivial and needs careful integration with the query pipeline and Ashley's FeedRange spec. IMO this is a bit of an open question and will need some more thought and alignment between you and Ashley. The TRANSPORT_PIPELINE_SPEC model makes it a bit simpler, but in either case this needs more investigation.

Member

@simorenoh simorenoh left a comment


Just nits - looks good. Interested in what Fabian mentioned about hedging and PPCB basically playing together after a region has been picked several times through hedging.

And in the decision on enabling this by default like PPCB / using a single additional region only.


### 2.3 Eligibility — `ShouldHedge()`

Hedging applies **only** to document-level point operations:
Member


Since we mentioned queries / ReadMany above as part of the inclusion:

Suggested change
Hedging applies **only** to document-level point operations:
Hedging applies **only** to document-level operations:


```rust
/// Sentinel value used to disable hedging for a specific operation when a
/// client-level strategy is configured.
Member

@simorenoh simorenoh May 8, 2026


If we're enabling hedging by default per the other design docs this also applies to the entire client

public static AvailabilityStrategy CrossRegionHedgingStrategy(
TimeSpan threshold, // Time before first hedge fires
TimeSpan? thresholdStep, // Time between subsequent hedges
bool enableMultiWriteRegionHedge = false); // Opt-in for writes on MM
Member


Seems risky, but as long as we have an explicit config for this it makes sense.


@NaluTripician NaluTripician left a comment


Hedging ↔ x-ms-cosmos-hub-region-processing-only header coordination

Reviewing #4330 alongside two related PRs surfaced a coordination gap that the spec
should record before any orchestrator code lands. Posting the proposed addition here
as a review with four inline suggestions you can apply à la carte; each one is
independently mergeable.

Context — the two related PRs

  1. Rust PR #4389 added
    the x-ms-cosmos-hub-region-processing-only: True header. It is emitted on the
    retry triggered by the first 404 / 1002 (READ_SESSION_NOT_AVAILABLE) on a
    single-master, data-plane operation and on every subsequent attempt within
    that operation. The latch is a bool field on OperationRetryState
    (components.rs:108-125),
    set in build_session_retry_state
    (retry_evaluation.rs:355-381),
    and consumed in apply_hub_region_header
    (operation_pipeline.rs:945-968).
    Full normative spec:
    HUB_REGION_PROCESSING_HEADER_SPEC.md.
    The header tells the gateway "this client has already discovered the hub region —
    process the request only in that region", which lets the operation skip the
    multi-region discovery round-trip on every retry after the first 1002.

  2. .NET azure-cosmos-dotnet-v3#5815 fixed the
    exact failure mode this comment is about for the .NET v3 SDK. Before #5815, each
    hedged request in CrossRegionHedgingAvailabilityStrategy carried its own copy of
    the latch state, so every hedge independently re-ran the 404/1002 discovery cycle
    and the header's "one round-trip per operation" guarantee held only inside an
    individual hedge, not across the fan-out. The fix:

    "added a new CrossRegionAvailabilityContext (with the property
    ShouldAddHubRegionProcessingOnlyHeader) that is propagated through the
    RequestMessage.Properties to every cloned hedge request" — quoted in §9.5.1.

    Because Properties is a Dictionary<string, object> whose Clone() is shallow,
    every clone gets a shared reference to one CrossRegionAvailabilityContext
    instance; latching it once is observable from all hedges.

The gap as it lands today in Rust

Spec §8.2 of this PR explicitly lists OperationRetryState as cloned per hedge.
PR #4389 stores the hub-region latch as a bool on OperationRetryState. So the
two PRs in isolation are each correct, but composed they recreate the exact
pre-fix .NET v3 behavior — every hedge in the fan-out independently goes through its
own 404/1002 discovery cycle, and the header buys nothing under hedging except for
the one hedge that happens to observe 1002 first.

The proposed fix

Rust counterpart of .NET v3's CrossRegionAvailabilityContext: extend
OperationRetryState with a shared_hub_region_latch: Option<Arc<AtomicBool>>
that is Some only while running under execute_with_hedging(). Arc<AtomicBool>
is the Rust idiom equivalent to "shared mutable object propagated via
RequestMessage.Properties"; the Option lets the non-hedged path keep PR #4389’s
behavior bit-for-bit when shared_hub_region_latch = None.

The four suggestions on this review record the design in this PR's spec:

| Suggestion | Anchor | What it adds |
|---|---|---|
| §8.2 carve-out | line 1110 | New bullet in the "Items shared (via `Arc` or reference)" list pointing to §9.5. |
| §9.5 full section | line 1292 | The new normative section between §9.4 and §10. |
| §15.1 unit tests | line 1715 | Five rows covering: latch initialization, the two negative cases (non-hedged path, multi-master/metadata), cross-hedge propagation, and the no-1002-no-header invariant. |
| §15.2 fault-injection test | line 1732 | One row asserting end-to-end propagation under a 2-region SM data-plane fault injection. |

I've also left small inline suggestions on
#4389 for the three
load-bearing docstrings in the implementation (OperationRetryState::hub_region_processing_only,
build_session_retry_state, apply_hub_region_header) so the forward reference
to §9.5 lives next to the latch sites themselves.

No code in PR #4389 needs to ship a behavior change for this PR to merge — these
suggestions are doc only. The behavior change lives in the orchestrator PR that
introduces execute_with_hedging(), and §9.5 of this spec is what that PR will
point at for its acceptance criteria.

- `CosmosOperation` — immutable; body is `Bytes` (cheaply cloneable)
- `LocationStateStore` — lock-free; multiple readers are safe
- `SessionManager` — designed for concurrent access
- `Credential` — `Arc`-wrapped


Suggested addition — §8.2 carve-out for the hub-region latch.

Adds the shared Arc<AtomicBool> to the “Items shared (via Arc or reference)” bullet list and forward-references the new §9.5 for full rationale. This is the core hand-off — without this carve-out, the §8.2 “items cloned per hedge” rule (which covers OperationRetryState) silently makes the hub-region latch per-hedge and defeats the discovery propagation.

Suggested change
- `Credential` — `Arc`-wrapped
- `Credential` — `Arc`-wrapped
- **Hub-region-processing-only latch** — a single `Arc<AtomicBool>` is
shared across the primary and every hedge for the lifetime of the
outer operation. See §9.5 for the full rationale; the short version
is that the per-`OperationRetryState` `hub_region_processing_only`
field added by [PR #4389](https://github.com/Azure/azure-sdk-for-rust/pull/4389)
is otherwise per-hedge, which would force every hedge to independently
re-discover the hub region via its own 404/1002 cycle. .NET v3 hit and
fixed this in [PR #5815](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5815)
via the `CrossRegionAvailabilityContext` shared object; Rust must
adopt the equivalent shared signal.

Implication: late hedges have less time budget. If the deadline is 5s and the
threshold is 3s, the hedge has only ~2s to complete.

---


Suggested addition — new §9.5 “Hub-Region-Processing-Only Header”.

This is the substantive content. It records:

- §9.5.1 — the correctness gap: per-hedge `OperationRetryState` clones each carry an independent `hub_region_processing_only: bool` from PR #4389, so without coordination every hedge re-runs its own 404/1002 discovery cycle. Includes a verbatim quote from azure-cosmos-dotnet-v3#5815, where .NET hit and fixed the same gap via `CrossRegionAvailabilityContext`.
- §9.5.2 — the required Rust design: extend `OperationRetryState` with `shared_hub_region_latch: Option<Arc<AtomicBool>>`, construct one `Arc<AtomicBool>` in `execute_with_hedging()` before fan-out, set it in `build_session_retry_state` alongside the per-state flag (`Release` store), and OR it into emission in `apply_hub_region_header` (`Acquire` load). Includes the two minimal code samples.
- §9.5.3 — eligibility gates (data-plane + single-master + `regions.len() > 1`); when any gate fails, `shared_hub_region_latch = None` and the behavior of PR #4389 (*Emit x-ms-cosmos-hub-region-processing-only header*) is preserved bit-for-bit.
- §9.5.4 — non-interference with the §8.4 local-only-retry invariant.
- §9.5.5 — concurrency / memory-ordering notes.
Suggested change
---
### 9.5 Hub-Region-Processing-Only Header
The driver emits the `x-ms-cosmos-hub-region-processing-only: True`
request header on retries triggered by a `404 / 1002
(READ_SESSION_NOT_AVAILABLE)` response, scoped to **single-master
data-plane** operations. The header is specified in
[`HUB_REGION_PROCESSING_HEADER_SPEC.md`](../../azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md)
and implemented in [Rust PR
#4389](https://github.com/Azure/azure-sdk-for-rust/pull/4389) /
[.NET PR #5447](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5447)
(parity baseline).
#### 9.5.1 The hedging-specific correctness gap
The Rust latch lives on `OperationRetryState`
([`components.rs::OperationRetryState::hub_region_processing_only`](../src/driver/pipeline/components.rs))
and is set in
[`retry_evaluation.rs::build_session_retry_state`](../src/driver/pipeline/retry_evaluation.rs)
when all four conditions hold (cf. spec §7.1):
1. `is_dataplane`
2. `!can_use_multiple_write_locations` (single-master account)
3. `session_token_retry_count == 0` (first 1002 within the operation)
4. `!hub_region_processing_only` (idempotency)
It is consumed in
[`operation_pipeline.rs::apply_hub_region_header`](../src/driver/pipeline/operation_pipeline.rs)
on every subsequent transport attempt of the same operation.
Per §8.2, **each hedge has its own `OperationRetryState`**. Without
additional coordination, this means each hedge — primary, hedge 1,
hedge 2, … — would independently observe its own first 1002, then
independently re-issue the next attempt with the header set. Every
hedge pays the full hub-discovery latency cost; the header's purpose
(*bound the discovery cycle to a single 1002 round-trip per operation*)
is defeated for everyone except the lucky hedge that observes 1002 first.
This is the same gap .NET v3 had after its first hub-region header PR
([#5447](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5447)) and
**explicitly fixed** in
[PR #5815 — *Read Consistency Strategy: Adds hub region header for
LastCommittedWriteRegion strategy*](https://github.com/Azure/azure-cosmos-dotnet-v3/pull/5815),
in the section *"Hedging request with hub region header"*:
> When `CrossRegionHedgingAvailabilityStrategy` is active, the primary
> request may discover the hub region mid-flight … Hedged requests are
> clones of the original and run with their own `ClientRetryPolicy`
> instance, so they would normally repeat the entire hub discovery cycle
> independently. To avoid this redundant retry overhead, we introduce a
> `CrossRegionAvailabilityContext` — a lightweight shared object with a
> volatile `bool ShouldAddHubRegionProcessingOnlyHeader` flag. This
> context is injected into `RequestMessage.Properties` before the clone
> loop in `CrossRegionHedgingAvailabilityStrategy`. Since `Clone()`
> performs a shallow dictionary copy, all clones (primary + hedges)
> share the same `CrossRegionAvailabilityContext` reference. When the
> primary's `ClientRetryPolicy` sets the hub flag after 2× 404/1002, it
> also sets the flag on the shared context. Each hedge's
> `ClientRetryPolicy.OnBeforeSendRequest` reads this shared flag on
> every attempt and attaches the
> `x-ms-cosmos-hub-region-processing-only` header immediately — without
> needing to go through its own 404/1002 discovery.
The Rust orchestrator MUST adopt the equivalent design.
#### 9.5.2 Required design — `Arc<AtomicBool>` shared latch
Construct a single `Arc<AtomicBool>` in `execute_with_hedging()`
**before any hedge is spawned**, and thread it into every pipeline
invocation (primary and hedges). Concretely:
```rust
// In execute_with_hedging(), before the spawn-and-race loop:
let shared_hub_region_latch: Arc<AtomicBool> = Arc::new(AtomicBool::new(false));

// When constructing each hedge's pipeline params:
let retry_state = OperationRetryState::initial(/**/)
    .with_shared_hub_region_latch(shared_hub_region_latch.clone());
```

This requires a small extension to `OperationRetryState`:

```rust
pub struct OperationRetryState {
    // … existing fields …

    /// Per-operation hub-region-processing-only latch.
    /// Sticky for the lifetime of this `OperationRetryState`.
    pub hub_region_processing_only: bool,

    /// Cross-hedge shared latch. `Some(_)` only when this operation is
    /// running inside `execute_with_hedging()` — `None` on the
    /// non-hedged code path so today's allocator behavior is preserved.
    ///
    /// Mirrors .NET v3's `CrossRegionAvailabilityContext` injected into
    /// `RequestMessage.Properties` before the clone loop
    /// (azure-cosmos-dotnet-v3 PR #5815).
    pub shared_hub_region_latch: Option<Arc<AtomicBool>>,
}
```

The two existing call sites are then extended:

- **`build_session_retry_state`** (latch-set side). When the four trigger
  conditions fire and the new state sets `hub_region_processing_only = true`,
  also set `shared_hub_region_latch` to `true` if present (a plain `store`
  suffices for a monotonic 0 → 1 flag; no compare-and-swap is needed):

  ```rust
  if let Some(shared) = &retry_state.shared_hub_region_latch {
      shared.store(true, Ordering::Release);
  }
  ```

  `Release` is sufficient — the only thing being published is the bool
  itself; no further state hangs off it.

- **`apply_hub_region_header`** (header-emission side). Emit the header when
  either the per-state latch or the shared latch is set:

  ```rust
  let emit = retry_state.hub_region_processing_only
      || retry_state
          .shared_hub_region_latch
          .as_ref()
          .map(|shared| shared.load(Ordering::Acquire))
          .unwrap_or(false);
  if emit {
      transport_request.headers.insert(
          HeaderName::from_static(request_header_names::HUB_REGION_PROCESSING_ONLY),
          HeaderValue::from_static("True"),
      );
  }
  ```

This preserves the §5/§7/§8 invariants of
`HUB_REGION_PROCESSING_HEADER_SPEC.md` (account-level scope, data-plane
scope, idempotency / sticky semantics) on a per-hedge basis while
also propagating the discovery from any hedge to every other hedge as
soon as it happens.

#### 9.5.3 Eligibility — when the shared latch is actually wired

The shared latch is only populated when all of the following are
true at the start of `execute_with_hedging()`:

| Condition | Why |
| --- | --- |
| Operation is data-plane (`is_dataplane`) | Mirrors the §1.5 scope of `HUB_REGION_PROCESSING_HEADER_SPEC.md`. |
| Account is single-master (`!can_use_multiple_write_locations`) | Mirrors AC-4 of `HUB_REGION_PROCESSING_HEADER_SPEC.md`; multi-master accounts have a separate recovery path and the header is never emitted. |
| Hedging actually fans out (`regions.len() > 1`) | When the orchestrator falls through to the single-region path (§6.4), the per-state latch alone is sufficient — there is no second hedge to propagate to. |

When any condition fails, `shared_hub_region_latch` is `None` and the
existing per-state behavior from PR #4389 is preserved bit-for-bit.

#### 9.5.4 Interaction with §8.4 (local-only retries inside a hedge)

The §8.4 local-only-retry contract is unaffected by the shared latch:
the latch governs only which request header is emitted, not the
endpoint resolution. `ExcludeRegions` continues to pin each hedge to
its own region across retries; the shared latch merely ensures every
hedge's retries — within their pinned region — also carry the
hub-region hint once any hedge has observed a 1002. No new retry-trigger
paths or region-fallback edges are introduced.

#### 9.5.5 Concurrency notes

- `AtomicBool` with `Release`/`Acquire` ordering is sufficient — the
  bool is the only thing being shared and there is no dependent state.
  `Relaxed` would also be functionally correct (a single-flag race with a
  monotonic 0 → 1 transition), but `Release`/`Acquire` is preferred for
  reader clarity and costs nothing on any architecture the Rust SDK
  targets.
- The latch is monotonic 0 → 1 and never reset within an operation —
  matching the "sticky" semantics of the per-state latch
  (`components.rs:108-125`).
- The `Arc` is scoped to one outer `execute_with_hedging()` call, so it
  is dropped when the orchestrator returns (no global state, no leakage
  across operations).
- A losing hedge whose transport already responded after cancellation
  (cf. §14.2) may still observe and set the shared latch — this is
  benign: the orchestrator has already returned a winner, and no later
  reader of the soon-to-be-dropped `Arc` exists.

| `hedging_config_requires_explicit_step` | `threshold_step` must be provided explicitly; constructor does not default it from `threshold` |
| `region_exclusion_for_hedge_n` | Correct `ExcludeRegions` per hedge |
| `exclude_regions_honored_by_every_retry_trigger` | For each retry trigger class — PPAF write retry, PPCB markdown failback, transport-layer 503, throttling 429, session-token 1002 — fault-inject the trigger inside a hedge and assert the retry attempt does **not** route to a region listed in the hedge's `ExcludeRegions`. Encodes the §8.4 cross-cutting invariant; new retry triggers added in later phases must extend this test. |
| `app_cancel_preserves_hedge_diagnostics` | Cancel the application token mid-fan-out; assert the returned error carries `HedgeDiagnostics` from the most-advanced in-flight hedge (covers §6.5 invariant #6). |

Suggested addition — five §15.1 unit-test rows for §9.5.

One row per acceptance criterion in §9.5.2/§9.5.3, including the cross-hedge propagation test (shared_hub_region_latch_propagates_first_1002_to_other_hedges) which is the Rust counterpart of .NET PR #5815’s CrossRegionAvailabilityContext_PropagatesHubHeaderFlagToHedgedRequests test.

Suggested change
| `app_cancel_preserves_hedge_diagnostics` | Cancel the application token mid-fan-out; assert the returned error carries `HedgeDiagnostics` from the most-advanced in-flight hedge (covers §6.5 invariant #6). |
| `app_cancel_preserves_hedge_diagnostics` | Cancel the application token mid-fan-out; assert the returned error carries `HedgeDiagnostics` from the most-advanced in-flight hedge (covers §6.5 invariant #6). |
| `shared_hub_region_latch_initialized_when_eligible` | `execute_with_hedging()` invoked on a data-plane / single-master operation with `regions.len() > 1`; assert every hedge's `OperationRetryState.shared_hub_region_latch` is `Some(_)` and points to the same `Arc<AtomicBool>` instance (encodes §9.5.2 / §9.5.3). |
| `shared_hub_region_latch_none_on_non_hedged_path` | `execute_with_hedging()` falls through to `execute_operation_pipeline` because `regions.len() <= 1`; assert `shared_hub_region_latch` is `None` (preserves PR #4389 baseline allocator behavior — §9.5.2). |
| `shared_hub_region_latch_none_on_multi_master_or_metadata` | Multi-master *or* metadata pipeline; assert `shared_hub_region_latch` is `None` even when hedging fans out, matching `HUB_REGION_PROCESSING_HEADER_SPEC.md` §5 account-level / §1.5 data-plane gates (§9.5.3). |
| `shared_hub_region_latch_propagates_first_1002_to_other_hedges` | Drive 1002 through `build_session_retry_state` on hedge 0; assert (a) hedge 0's per-state `hub_region_processing_only` is `true`, (b) the shared `Arc<AtomicBool>` is `true`, (c) on the next transport attempt, hedge 1 and hedge 2 — whose per-state latches are still `false` — both have `apply_hub_region_header` emit the header. Encodes the §9.5 cross-hedge invariant and is the Rust counterpart of .NET PR #5815's `CrossRegionAvailabilityContext_PropagatesHubHeaderFlagToHedgedRequests` test. |
| `shared_hub_region_latch_no_1002_emits_no_header` | No hedge observes 1002; assert no hedge calls `apply_hub_region_header` with the header set. |

| `hedging_with_ppcb` | 503 on Region A reads; PPCB enabled | PPCB and hedging both apply; circuit breaker tripped AND hedge succeeds |
| `hedging_cancels_losers` | Delay on Region A | Region B wins; verify Region A task cancelled (hit_count ≤ expected) |
| `hedging_failback_to_primary` | Region A initially slow, then fast | First few reads hedged; after threshold tightened, primary wins again |
| `hedging_exclude_regions_under_503_retry` | Region B inside hedge returns 503 (triggers transport retry) while Region C is healthy and excluded by that hedge's `ExcludeRegions` | Hedge B's retry stays pinned to Region B (does NOT fall back to Region C) — fault-injection counterpart to the §8.4 invariant unit test. |

Suggested addition — one §15.2 fault-injection integration test for §9.5.

Fault-injects 1002 on the primary's first attempt against Region A and asserts the header propagates to the Region-B hedge — even though Region B never observes 1002 itself. The end-to-end counterpart of the §15.1 unit test, and the direct Rust analogue of .NET PR #5815's emulator-level coverage.

Suggested change
| `hedging_exclude_regions_under_503_retry` | Region B inside hedge returns 503 (triggers transport retry) while Region C is healthy and excluded by that hedge's `ExcludeRegions` | Hedge B's retry stays pinned to Region B (does NOT fall back to Region C) — fault-injection counterpart to the §8.4 invariant unit test. |
| `hedging_exclude_regions_under_503_retry` | Region B inside hedge returns 503 (triggers transport retry) while Region C is healthy and excluded by that hedge's `ExcludeRegions` | Hedge B's retry stays pinned to Region B (does NOT fall back to Region C) — fault-injection counterpart to the §8.4 invariant unit test. |
| `hedging_hub_region_header_propagates_across_hedges` | 2-region SM data-plane account; fault-inject `404/1002` on the primary's first attempt against Region A, healthy 200 on Region B after threshold | Primary's retry against Region A emits `x-ms-cosmos-hub-region-processing-only: True` (per-state latch) **and** the hedge spawned against Region B emits the same header on every attempt — without itself ever observing a 1002 (per the shared `Arc<AtomicBool>` from §9.5). Encodes the cross-hedge propagation invariant under fault injection; counterpart of .NET PR #5815's `CrossRegionAvailabilityContext_PropagatesHubHeader…` emulator tests. |

Labels: Cosmos (The azure_cosmos crate), high-availability

Status: Changes Requested