Skip to content

Decouple Cosmos account-metadata fetch from caller cancellation#4607

Closed
NaluTripician wants to merge 1 commit into
Azure:mainfrom
NaluTripician:nalutripician/cosmos-driver-metadata-cancel
Closed

Decouple Cosmos account-metadata fetch from caller cancellation#4607
NaluTripician wants to merge 1 commit into
Azure:mainfrom
NaluTripician:nalutripician/cosmos-driver-metadata-cancel

Conversation

@NaluTripician

Copy link
Copy Markdown
Contributor

Summary

Re-authored fix for #4253 against the current driver-crate architecture (the earlier PR #4606 was branched from a stale fork and edited files that no longer exist upstream — closing it in favor of this one).

A cold account-metadata fetch (CosmosDriver::fetch_account_properties) ran its network call and inline cross-region failover chain (refresh_account_properties -> handle_refresh_failure -> refresh_via_regional_endpoints) directly under the caller's future. Because cancellation in Rust is future-drop, dropping the caller while the first attempt against an unhealthy region was in flight dropped the whole chain, preempting the cross-region failover.

Fix

  • New generic helper spawn_detached_bounded(future, deadline, on_deadline) runs future on a task spawned through the runtime abstraction (whose handle detaches, not aborts, on drop) and delivers the result over a futures oneshot. A dropped caller no longer cancels the in-flight work; the failover completes. The caller observes its own cancellation only on the response path.
  • fetch_account_properties now runs refresh_account_properties (over cloned Arc runtime/transport + cloned AccountReference) through that helper, bounded by a 5-minute hard deadline so the detached work can never leak.
  • Behavior on the normal (non-dropped) path is preserved (still passes previous_props = None; shared transport HTTP-version state is mutated via the same Arc<ArcSwap<..>>).
  • Unit tests for the helper: happy path, caller-drop survival (the core [cosmos] Metadata retry: dropped caller future silently cancels cross-region failover #4253 regression), and hard-deadline.

⚠️ Open design questions for maintainers (why this is a draft)

  1. Detached task vs. structural cancellation. The 0.4.0 hedging work deliberately cancels the losing leg "structurally (no detached tasks)". This fix introduces a bounded detached task for the idempotent, GET-only account-metadata path. Is that acceptable here, or would you prefer a structural approach (e.g. a grace-window shield like the .NET #5806 fix) so no detached task is used?
  2. Scope. This covers the account-metadata path only. Container and partition-key-range metadata reads route back through execute_operation_pipeline (via execute_singleton_operation) and are not covered — fully covering them needs either an Arc<Self> handle to detach those fetches or a pipeline-level cancellation-decoupling, which is a larger change and likely the same design decision as (1). If the customer repro centers on the collection / partition-key-range caches, a follow-up covering those paths is required.

Validation

  • cargo clippy --all-targets clean; cargo fmt --check clean.
  • Full driver lib suite: 1811 passed, 0 failed (64 ignored emulator/integration); 3 new helper tests pass.

Fixes #4253

A cold account-metadata fetch (CosmosDriver::fetch_account_properties) ran its
network call and inline cross-region failover chain
(refresh_account_properties -> handle_refresh_failure ->
refresh_via_regional_endpoints) directly under the caller's future. Because Rust
cancellation is future-drop, dropping the caller while the first attempt against
an unhealthy region was in flight dropped the whole chain, preempting the
failover to the next region (issue Azure#4253; same class as dotnet #5805/#5806).

The on-demand fetch now runs on a detached, internally-bounded task via a new
spawn_detached_bounded helper: the work is spawned through the runtime
abstraction (whose handle detaches rather than aborts on drop) and its result is
delivered over a futures oneshot, so a dropped caller no longer cancels the
in-flight attempt and the cross-region failover completes. The caller observes
its own cancellation only on the response path. The detached work is bounded by
a 5-minute hard deadline so it can never leak.

Scope: this covers the idempotent (GET-only) account-metadata path. Container
and partition-key-range metadata reads route through the operation pipeline and
are not yet covered. Introducing a detached task is an intentional exception to
the structural-cancellation model used for hedging; opened as a draft to confirm
direction (bounded detached task vs. extending structural hedging to metadata)
with maintainers.

Adds unit tests for the helper (happy path, caller-drop survival, hard deadline).

Fixes Azure#4253

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@NaluTripician

Copy link
Copy Markdown
Contributor Author

Superseded by #4608, which reworks this fix to use the driver's structural cross-region hedging (extending hedge eligibility to container/collection reads) instead of a detached background task — aligning with the crate's no-detached-tasks model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cosmos The azure_cosmos crate

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[cosmos] Metadata retry: dropped caller future silently cancels cross-region failover

1 participant