Cosmos: hub-region-processing-only header spec (rebased to previews)#4320
Conversation
Rebases the spec from PR Azure#4310 onto release/azure_data_cosmos-previews. Spec content is identical to PR Azure#4310's head; it has not yet been revised against the previews-branch architecture. Known drift versus previews source is being tracked in review comments on this PR for follow-up edits. Tracks Azure#4303. Supersedes Azure#4310. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NaluTripician
left a comment
There was a problem hiding this comment.
Local deep-review of the rebased spec against release/azure_data_cosmos-previews.
Severity summary: 4 🔴 Blocking, 3 🟡 Recommendation, 1 🟢 Suggestion, 1 💬 Observation.
The dominant issue is #1 (architectural mismatch): on previews, every user-facing data-plane op routes through CosmosDriver::execute_operation, bypassing CosmosPipeline/ClientRetryPolicy. The spec's chosen trigger and emission point are dead code on the customer read path on this branch. Findings #2–#4 stack on top of #1. Finding #5 is a clean line-number sweep that is safe to apply independently.
Scope of verification: every spec source-line citation cross-checked against the previews-branch source; every architectural claim re-verified against cosmos_driver.rs, cosmos_headers.rs, request_signing.rs, container_client.rs, Cargo.toml, and clients/mod.rs.
There was a problem hiding this comment.
Pull request overview
Adds a design spec document for emitting x-ms-cosmos-hub-region-processing-only on retries after 404/1002 (session-not-available) to mitigate session-token storms during region failback, intended to be the planning artifact for a follow-up implementation PR.
Changes:
- Introduces
HUB_REGION_PROCESSING_HEADER_SPEC.mddescribing trigger conditions, latch behavior, and proposed implementation/testing strategy. - Documents a forward-looking migration plan for relocating the behavior to
azure_data_cosmos_driver.
Pivots the spec to a driver-resident implementation on release/azure_data_cosmos-previews and addresses all 17 review threads: - Opening metadata: rebased to previews; driver = primary path - Section 1.5: previews two-crate landscape (driver owns customer dispatch) - Section 2.1: new driver-resident architecture diagram and mapping table - Section 2.2: section 4 reframed as parity reference - Section 4: branch note clarifying SDK landing not reached on previews - Section 4.2(b): case-sensitivity reconciled with opening Confidence block - Section 4.2(c): updated global_endpoint_manager.rs line refs (39/303/350/430) - Section 4.3: CHANGELOG retargeted to existing 0.34.0 (Unreleased) - ALT-6: promoted to primary path; prerequisite status updated to met/near-met - SE-006: marked resolved on previews (no future migration needed) - AG-5: marked not applicable on previews; replaced with cross-ref comment - OQ-2 / Future Work: realigned to 0.34.0 and resolved-on-previews wording Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
Overall IMO way too verbose for a relatively small feature. That said I am ok with most of it except few comments.
Will fix your comments, then move to a smaller, less verbose spec (will keep the longer one locally for context) |
Adds three pieces of context flagged in review: - New section 1.6 'Gateway-mode interaction (GW v1 vs GW v2)': Rust always uses Gateway mode. GW v2 returns failure to the SDK after one local-region replica iteration so the cross-region retry (and the latched header) fires promptly. GW v1 performs extensive intra-region retries until Gateway offers RemoteRegionPreferred, widening the pre-trigger window for latency-sensitive workloads. Client-side spec is correct for both; only timing-to-trigger differs. - Primary Target broadened to enumerate read-shaped op types (read_item, point reads, query, read_many_*, change-feed) that share the trigger. Notes read_many_* and change-feed are not yet implemented in the driver; the latch fires automatically when they land because the trigger is centralized in evaluate_transport_result's 404/1002 arm. - New PPAD non-goal: partition-scoped hub-region resolution is explicitly out of scope. The header is emitted account-scoped to match the .NET PR; partition-scoped routing happens server-side via PKR-level state. Implementation PR's tests must include a PPAD-enabled single-master scenario to validate routing. A partition-aware variant remains as future work if telemetry shows misrouting. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Gateway mode is implicit for the Rust SDK; keep only the GW v1 vs GW v2 behavioral contrast in 1.6. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…3-hub-region-header-previews
…3-hub-region-header-previews
CI cspell job flagged three domain terms in the spec. Add a per-file cspell:words directive to whitelist them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…iews' into spec/issue-4303-hub-region-header-previews
…3-hub-region-header-previews
Rebases the hub-region-processing-only header spec from #4310 onto
release/azure_data_cosmos-previewsand pivots its implementation guidance from the SDK retry layer to the driver, where customer reads/writes actually flow on this branch.Tracks #4303. Supersedes #4310.
Why retarget
The feature lands on
release/azure_data_cosmos-previews, notmain.previewsis 79 commits ahead ofmain, and the SDK-vs-driver wiring differs substantially from what the spec was originally written against:read_item,query,create,replace,delete, etc. across container/database/cosmos/offers clients) callsself.context.driver.execute_operation(...)directly.CosmosPipeline → BackOffRetryHandler → ClientRetryPolicyis not reached on the read path on previews — implementing the spec verbatim would build but have no runtime effect on customer reads.azure_data_cosmos/Cargo.tomlalready declaresazure_data_cosmos_driver = { workspace = true }.CosmosHeadersPolicy/AuthorizationPolicy/TrackedTransportPolicychain has been removed;apply_cosmos_headers()is a free function and request signing is a direct module call insideCosmosDriver::execute_operation.evaluate_transport_result(a status-code retry hook that already classifies 404/1002 →SessionRetry),RetryState(per-operation state), andAccountMetadataCache::write_region()— i.e. the prerequisites that the original spec listed as blocking a driver-resident implementation are now met or near-met.What changed in the spec (commit 67420ab)
The spec is now anchored to the previews driver shape, with
azure_data_cosmos-targeted instructions retained only as parity reference for the .NET behavioral shape.bool.TrueString, no backend contract reference verified).CosmosHeadersPolicy/AuthorizationPolicy/TrackedTransportPolicywith replacement-comment citations. Implication paragraph enumerates each prerequisite with a verified line number.operation_pipeline::build_transport_request:442, trigger insideevaluate_transport_result's 404/1002 arm atretry_evaluation.rs:70, latch onRetryState, single-master gate viaAccountMetadataCache, header constant inrequest_header_namesatcosmos_headers.rs:17). New §2.2 reframes §3–§4 as parity reference. Old "Forward target" subsection (anchored to the obsoleteVec<Arc<dyn Policy>>chain) removed.azure_data_cosmos::ClientRetryPolicyfor this feature on previews. §4.2(b) case-sensitivity reconciled with the Confidence block. §4.2(c) line numbers updated to previews (global_endpoint_manager.rs:39 / :303 / :350 / :430;mainnumbers retained parenthetically). §4.3 CHANGELOG retargeted to the existing0.34.0 (Unreleased)block (was0.32.0); fallback bump is0.35.0.retry_evaluation.rs:31/:70), account topology ✅ partial (AccountMetadataCache::write_region()at:175; small additive multi-write accessor still needed), per-operation state ✅ (RetryStatealready threaded throughevaluate_transport_result). Cross-walk table maps every cell to a concrete previews site; all cells are mechanical or small-additive (none "design-pending").0.34.0. Future Work "Driver migration" bullet retained only as a forward note for any future merge back intomain.Review-comment status
All 18 review threads have been replied-to and resolved (commit
67420ab):Stats: 1 file, +255 / -183 lines.