Decouple Cosmos account-metadata fetch from caller cancellation#4607
Closed
NaluTripician wants to merge 1 commit into
Closed
Decouple Cosmos account-metadata fetch from caller cancellation#4607NaluTripician wants to merge 1 commit into
NaluTripician wants to merge 1 commit into
Conversation
A cold account-metadata fetch (CosmosDriver::fetch_account_properties) ran its network call and inline cross-region failover chain (refresh_account_properties -> handle_refresh_failure -> refresh_via_regional_endpoints) directly under the caller's future. Because Rust cancellation is future-drop, dropping the caller while the first attempt against an unhealthy region was in flight dropped the whole chain, preempting the failover to the next region (issue Azure#4253; same class as dotnet #5805/#5806). The on-demand fetch now runs on a detached, internally-bounded task via a new spawn_detached_bounded helper: the work is spawned through the runtime abstraction (whose handle detaches rather than aborts on drop) and its result is delivered over a futures oneshot, so a dropped caller no longer cancels the in-flight attempt and the cross-region failover completes. The caller observes its own cancellation only on the response path. The detached work is bounded by a 5-minute hard deadline so it can never leak. Scope: this covers the idempotent (GET-only) account-metadata path. Container and partition-key-range metadata reads route through the operation pipeline and are not yet covered. Introducing a detached task is an intentional exception to the structural-cancellation model used for hedging; opened as a draft to confirm direction (bounded detached task vs. extending structural hedging to metadata) with maintainers. Adds unit tests for the helper (happy path, caller-drop survival, hard deadline). Fixes Azure#4253 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
Author
|
Superseded by #4608, which reworks this fix to use the driver's structural cross-region hedging (extending hedge eligibility to container/collection reads) instead of a detached background task — aligning with the crate's no-detached-tasks model. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Re-authored fix for #4253 against the current driver-crate architecture (the earlier PR #4606 was branched from a stale fork and edited files that no longer exist upstream — closing it in favor of this one).
A cold account-metadata fetch (
CosmosDriver::fetch_account_properties) ran its network call and inline cross-region failover chain (refresh_account_properties->handle_refresh_failure->refresh_via_regional_endpoints) directly under the caller's future. Because cancellation in Rust is future-drop, dropping the caller while the first attempt against an unhealthy region was in flight dropped the whole chain, preempting the cross-region failover.Fix
spawn_detached_bounded(future, deadline, on_deadline)runsfutureon a task spawned through the runtime abstraction (whose handle detaches, not aborts, on drop) and delivers the result over afuturesoneshot. A dropped caller no longer cancels the in-flight work; the failover completes. The caller observes its own cancellation only on the response path.fetch_account_propertiesnow runsrefresh_account_properties(over clonedArcruntime/transport + clonedAccountReference) through that helper, bounded by a 5-minute hard deadline so the detached work can never leak.previous_props = None; shared transport HTTP-version state is mutated via the sameArc<ArcSwap<..>>).execute_operation_pipeline(viaexecute_singleton_operation) and are not covered — fully covering them needs either anArc<Self>handle to detach those fetches or a pipeline-level cancellation-decoupling, which is a larger change and likely the same design decision as (1). If the customer repro centers on the collection / partition-key-range caches, a follow-up covering those paths is required.Validation
cargo clippy --all-targetsclean;cargo fmt --checkclean.Fixes #4253