Skip to content

Cosmos: hub-region-processing-only header spec (rebased to previews)#4320

Open
NaluTripician wants to merge 9 commits intoAzure:release/azure_data_cosmos-previewsfrom
NaluTripician:spec/issue-4303-hub-region-header-previews
Open

Cosmos: hub-region-processing-only header spec (rebased to previews)#4320
NaluTripician wants to merge 9 commits intoAzure:release/azure_data_cosmos-previewsfrom
NaluTripician:spec/issue-4303-hub-region-header-previews

Conversation

@NaluTripician
Copy link
Copy Markdown

@NaluTripician NaluTripician commented Apr 30, 2026

Rebases the hub-region-processing-only header spec from #4310 onto release/azure_data_cosmos-previews and pivots its implementation guidance from the SDK retry layer to the driver, where customer reads/writes actually flow on this branch.

Tracks #4303. Supersedes #4310.

Why retarget

The feature lands on release/azure_data_cosmos-previews, not main. previews is 79 commits ahead of main, and the SDK-vs-driver wiring differs substantially from what the spec was originally written against:

  • Every user-facing data-plane op (read_item, query, create, replace, delete, etc. across container/database/cosmos/offers clients) calls self.context.driver.execute_operation(...) directly. CosmosPipeline → BackOffRetryHandler → ClientRetryPolicy is not reached on the read path on previews — implementing the spec verbatim would build but have no runtime effect on customer reads.
  • azure_data_cosmos/Cargo.toml already declares azure_data_cosmos_driver = { workspace = true }.
  • The driver's CosmosHeadersPolicy / AuthorizationPolicy / TrackedTransportPolicy chain has been removed; apply_cosmos_headers() is a free function and request signing is a direct module call inside CosmosDriver::execute_operation.
  • The driver gained evaluate_transport_result (a status-code retry hook that already classifies 404/1002 → SessionRetry), RetryState (per-operation state), and AccountMetadataCache::write_region() — i.e. the prerequisites that the original spec listed as blocking a driver-resident implementation are now met or near-met.

What changed in the spec (commit 67420ab)

The spec is now anchored to the previews driver shape, with azure_data_cosmos-targeted instructions retained only as parity reference for the .NET behavioral shape.

  • Opening metadata — driver is the implementation home on previews; SDK is parity reference. Confidence block reframed; case-sensitivity for the header value explicitly marked assumed (matches .NET's bool.TrueString, no backend contract reference verified).
  • §1 Primary Target / §1.5 Two-crate landscape — replaced the old SDK-as-host table with a previews-architecture table (driver owns customer dispatch / headers / signing / status-code retry / transport retry / topology; SDK depends on driver via Cargo workspace). Removed-policy callout enumerates CosmosHeadersPolicy / AuthorizationPolicy / TrackedTransportPolicy with replacement-comment citations. Implication paragraph enumerates each prerequisite with a verified line number.
  • §2 Architectural Overview — new §2.1 (Driver-resident architecture, previews) with a Mermaid diagram and mapping table (emission at operation_pipeline::build_transport_request:442, trigger inside evaluate_transport_result's 404/1002 arm at retry_evaluation.rs:70, latch on RetryState, single-master gate via AccountMetadataCache, header constant in request_header_names at cosmos_headers.rs:17). New §2.2 reframes §3–§4 as parity reference. Old "Forward target" subsection (anchored to the obsolete Vec<Arc<dyn Policy>> chain) removed.
  • §4 Code Changes — opens with a branch-note callout: implementers must not edit azure_data_cosmos::ClientRetryPolicy for this feature on previews. §4.2(b) case-sensitivity reconciled with the Confidence block. §4.2(c) line numbers updated to previews (global_endpoint_manager.rs:39 / :303 / :350 / :430; main numbers retained parenthetically). §4.3 CHANGELOG retargeted to the existing 0.34.0 (Unreleased) block (was 0.32.0); fallback bump is 0.35.0.
  • ALT-6 — promoted to "primary path on previews". Prerequisite status updated: status-code retry hook ✅ (retry_evaluation.rs:31/:70), account topology ✅ partial (AccountMetadataCache::write_region() at :175; small additive multi-write accessor still needed), per-operation state ✅ (RetryState already threaded through evaluate_transport_result). Cross-walk table maps every cell to a concrete previews site; all cells are mechanical or small-additive (none "design-pending").
  • SE-006 / AG-5 / OQ-2 / Future Work — SE-006 marked 🟢 Resolved on previews. AG-5 (driver-migration anchor) reduced to a one-line cross-reference comment, since no migration is needed. OQ-2 retargeted to 0.34.0. Future Work "Driver migration" bullet retained only as a forward note for any future merge back into main.

Review-comment status

All 18 review threads have been replied-to and resolved (commit 67420ab):

  • 9 spec-author review comments: all addressed by the driver-pivot edits above.
  • 8 Copilot auto-review comments: all addressed (CHANGELOG version, driver-pipeline framing, SDK-vs-driver dependency, removed policies, status-code retry, case-sensitivity inconsistency, §4.1/§4.2 SDK landing, opening-metadata scope).
  • 1 acknowledgment-only thread (ALT-1..ALT-5 use SDK vocabulary): the alternatives reason about behavioral shape (latch vs no-latch, trigger location, retry-budget extension, skip-set rotation), which is home-agnostic; with §4 reframed as parity reference and ALT-6 promoted to primary, the SDK names there now read as parity vocabulary.

Stats: 1 file, +255 / -183 lines.

Rebases the spec from PR Azure#4310 onto release/azure_data_cosmos-previews.
Spec content is identical to PR Azure#4310's head; it has not yet been
revised against the previews-branch architecture. Known drift versus
previews source is being tracked in review comments on this PR for
follow-up edits.

Tracks Azure#4303. Supersedes Azure#4310.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Author

@NaluTripician NaluTripician left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local deep-review of the rebased spec against release/azure_data_cosmos-previews.

Severity summary: 4 🔴 Blocking, 3 🟡 Recommendation, 1 🟢 Suggestion, 1 💬 Observation.

The dominant issue is #1 (architectural mismatch): on previews, every user-facing data-plane op routes through CosmosDriver::execute_operation, bypassing CosmosPipeline/ClientRetryPolicy. The spec's chosen trigger and emission point are dead code on the customer read path on this branch. Findings #2#4 stack on top of #1. Finding #5 is a clean line-number sweep that is safe to apply independently.

Scope of verification: every spec source-line citation cross-checked against the previews-branch source; every architectural claim re-verified against cosmos_driver.rs, cosmos_headers.rs, request_signing.rs, container_client.rs, Cargo.toml, and clients/mod.rs.

Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a design spec document for emitting x-ms-cosmos-hub-region-processing-only on retries after 404/1002 (session-not-available) to mitigate session-token storms during region failback, intended to be the planning artifact for a follow-up implementation PR.

Changes:

  • Introduces HUB_REGION_PROCESSING_HEADER_SPEC.md describing trigger conditions, latch behavior, and proposed implementation/testing strategy.
  • Documents a forward-looking migration plan for relocating the behavior to azure_data_cosmos_driver.

Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Pivots the spec to a driver-resident implementation on

release/azure_data_cosmos-previews and addresses all 17 review

threads:

- Opening metadata: rebased to previews; driver = primary path

- Section 1.5: previews two-crate landscape (driver owns customer dispatch)

- Section 2.1: new driver-resident architecture diagram and mapping table

- Section 2.2: section 4 reframed as parity reference

- Section 4: branch note clarifying SDK landing not reached on previews

- Section 4.2(b): case-sensitivity reconciled with opening Confidence block

- Section 4.2(c): updated global_endpoint_manager.rs line refs (39/303/350/430)

- Section 4.3: CHANGELOG retargeted to existing 0.34.0 (Unreleased)

- ALT-6: promoted to primary path; prerequisite status updated to met/near-met

- SE-006: marked resolved on previews (no future migration needed)

- AG-5: marked not applicable on previews; replaced with cross-ref comment

- OQ-2 / Future Work: realigned to 0.34.0 and resolved-on-previews wording

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread sdk/cosmos/azure_data_cosmos/docs/HUB_REGION_PROCESSING_HEADER_SPEC.md Outdated
Copy link
Copy Markdown
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall IMO way too verbose for a relatively small feature. That said I am ok with most of it except few comments.

@github-project-automation github-project-automation Bot moved this from Todo to Changes Requested in CosmosDB Go/Rust Crew Apr 30, 2026
@NaluTripician
Copy link
Copy Markdown
Author

Overall IMO way too verbose for a relatively small feature. That said I am ok with most of it except few comments.

Will fix your comments, then move to a smaller, less verbose spec (will keep the longer one locally for context)

NaluTripician and others added 7 commits April 30, 2026 16:37
Adds three pieces of context flagged in review:

- New section 1.6 'Gateway-mode interaction (GW v1 vs GW v2)':

  Rust always uses Gateway mode. GW v2 returns failure to the SDK after

  one local-region replica iteration so the cross-region retry (and the

  latched header) fires promptly. GW v1 performs extensive intra-region

  retries until Gateway offers RemoteRegionPreferred, widening the

  pre-trigger window for latency-sensitive workloads. Client-side spec

  is correct for both; only timing-to-trigger differs.

- Primary Target broadened to enumerate read-shaped op types (read_item,

  point reads, query, read_many_*, change-feed) that share the trigger.

  Notes read_many_* and change-feed are not yet implemented in the

  driver; the latch fires automatically when they land because the

  trigger is centralized in evaluate_transport_result's 404/1002 arm.

- New PPAD non-goal: partition-scoped hub-region resolution is

  explicitly out of scope. The header is emitted account-scoped to

  match the .NET PR; partition-scoped routing happens server-side via

  PKR-level state. Implementation PR's tests must include a

  PPAD-enabled single-master scenario to validate routing. A

  partition-aware variant remains as future work if telemetry shows

  misrouting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Gateway mode is implicit for the Rust SDK; keep only the GW v1 vs GW v2
behavioral contrast in 1.6.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CI cspell job flagged three domain terms in the spec. Add a per-file
cspell:words directive to whitelist them.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…iews' into spec/issue-4303-hub-region-header-previews
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cosmos The azure_cosmos crate

Projects

Status: Changes Requested

Development

Successfully merging this pull request may close these issues.

3 participants