[Internal] Add dynamic timeout escalation spec #4244

Open
tvaron3 wants to merge 21 commits into Azure:release/azure_data_cosmos-previews from tvaron3:tvaron3/dynamicTimeouts

Conversation

@tvaron3
Member

@tvaron3 tvaron3 commented Apr 21, 2026

Dynamic Timeout Spec

Adds DYNAMIC_TIMEOUTS_SPEC.md to sdk/cosmos/azure_data_cosmos_driver/docs/ describing the design for escalating per-attempt request timeouts on transport retries plus adaptive connection-timeout tuning.

This PR replaces the previously-closed #3871. It contains the original spec plus the revisions made in response to that PR's deep review (see "Changes since #3871" below).

Motivation

When a request times out and is retried with the same timeout, it often fails again — particularly with slow connections, regional latency spikes, or first-byte tail latencies. Escalating the timeout on each retry attempt gives the operation a meaningfully better chance of succeeding without making the initial attempt unnecessarily slow.

Key Design Decisions

  • Data plane request timeouts escalate 6s → 10s → 65s per retry attempt.
  • Metadata request timeouts escalate 5s → 10s → 20s, aligned with the Java SDK DatabaseAccount ladder.
  • No new retry budget — max_failover_retries = 3 already maps cleanly onto the 3-tier ladder; a hard MAX_TIMEOUT_RETRIES bound prevents unbounded growth if that default ever changes (see the ladder sketch after this list).
  • Fixed defaults — the ladder is not user-configurable. Existing ConnectionPoolOptions min/max bounds still clamp the effective values.
  • E2E deadline still wins — the operation-level deadline takes precedence over per-attempt ladder values.
  • Adaptive connection timeout starts at 1s and escalates to 5s after >3 consecutive connection failures per endpoint (one-time persistent transition in ShardedHttpTransport).
  • Per-attempt timeout via reqwest::RequestBuilder::timeout — threaded as a Duration parameter through the dispatch chain rather than being folded into request.deadline, so the deadline retains its E2E meaning.
  • Driver-internal DriverHttpClient wrapper exposing the concrete reqwest::Client so the dispatch site can actually call .timeout() (the previous Arc<dyn HttpClient> made the per-request setter unreachable).
  • Removes builder.timeout(config.request_timeout) at http_client_factory.rs:173 — that client-wide timeout would override every per-attempt value.
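
A minimal sketch of the ladder and clamping order these bullets describe — the free-function-plus-constants shape the spec calls for, with all names (`DATA_PLANE_LADDER`, `attempt_timeout`, the pool-bound parameters) illustrative rather than final:

```rust
use std::time::Duration;

// Ladder values from the spec; constant names are illustrative.
const DATA_PLANE_LADDER: [Duration; 3] = [
    Duration::from_secs(6),
    Duration::from_secs(10),
    Duration::from_secs(65),
];
const METADATA_LADDER: [Duration; 3] = [
    Duration::from_secs(5),
    Duration::from_secs(10),
    Duration::from_secs(20),
];
// Hard bound so the ladder saturates even if max_failover_retries grows.
const MAX_TIMEOUT_RETRIES: u32 = 3;

/// Effective per-attempt timeout: pick the ladder tier (saturating at the
/// top), clamp to pool bounds first, then cap by the remaining E2E deadline.
fn attempt_timeout(
    ladder: &[Duration; 3],
    timeout_retries: u32,
    pool_min: Duration, // ConnectionPoolOptions lower bound (min <= max invariant)
    pool_max: Duration, // ConnectionPoolOptions upper bound
    remaining_deadline: Duration,
) -> Duration {
    let tier = timeout_retries.min(MAX_TIMEOUT_RETRIES - 1) as usize;
    let tier = tier.min(ladder.len() - 1);
    ladder[tier].clamp(pool_min, pool_max).min(remaining_deadline)
}
```

Note the function never touches `request.deadline`: the remaining deadline comes in as an input and merely caps the result, which is how the E2E deadline keeps precedence.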

Cross-SDK Alignment

The Java SDK already implements timeout escalation for gateway/metadata HTTP calls (QueryPlan: 0.5s→5s→10s, DatabaseAccount: 5s→10s→20s). This spec extends the pattern to both data plane and metadata requests in the Rust driver.

Changes since #3871

The spec was substantially revised in response to the deep review on the previous PR. Highlights:

  • Driver-owned HTTP abstraction acknowledged throughout — references the new HttpClientFactory and the DriverHttpClient wrapper that exposes reqwest::Client to the dispatch site.
  • Per-attempt Duration plumbing — explicitly threaded through dispatch instead of mutating request.deadline; counter-ordering table added to make the interaction with E2E deadline unambiguous.
  • Hard MAX_TIMEOUT_RETRIES bound added to defend against future increases to max_failover_retries.
  • ErrorKind::Connection used for failure classification instead of reqwest::Error::is_connect() (the latter doesn't survive the abstraction boundary).
  • Transient-blip degradation for the adaptive connection timeout is explicitly acknowledged as an accepted trade-off.
  • min_*_request_timeout migration note clarifies how existing ConnectionPoolOptions values continue to clamp the new ladder.
  • azure_core::RetryPolicy rationale explains why a custom escalation policy is needed at the transport layer.
  • Thin-client ladder explicitly marked deferred.
  • "Users have no per-attempt knobs" added as an explicit non-goal.

Deferred

tvaron3 and others added 15 commits March 5, 2026 15:00
Add DYNAMIC_TIMEOUTS_SPEC.md to the driver docs describing
escalating connection and request timeouts on transport retries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve 3 of 4 open questions based on review feedback:

- Commit to Context-based approach (Option A) for delivering
  per-attempt timeouts, removing Options B/C. No changes needed
  in azure_core or typespec_client_core.
- Decide hedged attempts always start at tier 0 of the timeout
  ladder since they target independent regional endpoints.
- Record effective per-attempt timeout values in DiagnosticsContext
  for observability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the dynamic timeout spec to reflect the new pipeline
architecture from PR Azure#3887 (Driver Transport step 02):

- Replace MAX_TRANSPORT_RETRIES/TransportRetry references with
  the new FailoverRetry/SessionRetry model and failover_retry_count
- Move retry loop references from cosmos_driver.rs to the 7-stage
  loop in operation_pipeline.rs
- Change timeout delivery from Context type-map to TransportRequest
  field, matching the new pipeline data flow
- Update goal from 'increase retry count' to 'leverage existing
  retry budget' since max_failover_retries already defaults to 3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Propose adding an optional per-request timeout field to
typespec_client_core::http::Request with set_timeout()/timeout()
methods. The reqwest HttpClient implementation applies it via
reqwest::RequestBuilder::timeout().

For connection timeout escalation, the reqwest client is built
with the maximum ladder value (5s) and the per-request timeout
provides the tighter overall bound on each attempt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove all connection timeout escalation from the spec.
Connection timeouts remain static (configured via
ConnectionPoolOptions). Only request timeouts are escalated
per retry attempt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Connection timeouts use a failure-rate adaptive model instead
of per-retry escalation:

- Start at 1s (sufficient for any cloud/datacenter network)
- ShardedHttpTransport monitors connection failure rate
- On sustained failures, create new HttpClient instances with
  5s connect_timeout (one-time persistent transition)
- No azure_core changes needed — entirely internal to the
  sharded transport

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove direct mode non-goal (won't be implemented)
- Resolve connection failure threshold: >3 consecutive failures
  per endpoint triggers 1s to 5s escalation
- Old shards are actively marked unhealthy on escalation for
  immediate reclamation by health sweep
- Explicitly state per-endpoint failure tracking scope
- Remove resolved open questions 2 and 3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ShardedHttpTransport is no longer 'future integration' — it
  exists in sharded_transport.rs with per-shard health sweeps
- Fix HttpClientFactory::create() → build()
- Fix HttpClientConfig connect_timeout claim — connect_timeout
  is sourced from ConnectionPoolOptions::max_connect_timeout()
  inside the factory build() method, not as a config field
- Reference existing per-request timeout mechanism in transport
  pipeline (azure_core::sleep racing the HTTP future)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix section numbering (missing §9)
- Add rationale for 65s tier-3 jump
- Add cross-reference for ladder saturation in §5
- Add typespec_client_core cross-crate PR strategy
- Clarify adaptive connection timeout: idempotent transition,
  per-endpoint aggregation, no-fallback rationale

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Major architectural changes to align with codebase patterns:

- Move timeout escalation to transport pipeline level, indexed
  by transport-level timeout retries (not operation failover)
- Use deadline-based pattern (Instant) instead of adding a
  Duration field to TransportRequest — feeds into existing
  remaining_request_timeout() mechanism
- Remove TimeoutLadder struct; use free function + constants
  matching DOP style
- Remove AttemptTimeoutDiagnostics struct; record flat field
  directly on RequestDiagnostics
- Clarify sleep-race is enforcement mechanism; Request::set_timeout
  is long-term replacement (never both active)
- Use AtomicBool/AtomicU32 for adaptive connection timeout state
  matching lock-free hot-path pattern
- Reference PipelineType for ladder selection
- Explicit clamping order: pool bounds first, deadline last
- Add note that min_* timeout fields are unused placeholders

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
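
As a sketch of the lock-free pattern this commit names (AtomicBool/AtomicU32 on the hot path), with assumed type and field names — the spec places this state per endpoint inside ShardedHttpTransport, and the real values are additionally clamped to the ConnectionPoolOptions connect-timeout bounds:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::time::Duration;

const ESCALATION_THRESHOLD: u32 = 3; // >3 consecutive failures escalates
const INITIAL_CONNECT_TIMEOUT: Duration = Duration::from_secs(1);
const ESCALATED_CONNECT_TIMEOUT: Duration = Duration::from_secs(5);

/// Hypothetical per-endpoint state; names are illustrative.
struct EndpointConnectState {
    consecutive_connect_failures: AtomicU32,
    escalated: AtomicBool, // one-way, persistent transition
}

impl EndpointConnectState {
    fn record_connect_failure(&self) {
        let failures =
            self.consecutive_connect_failures.fetch_add(1, Ordering::Relaxed) + 1;
        if failures > ESCALATION_THRESHOLD {
            // Idempotent: repeated stores are harmless; there is no fallback.
            self.escalated.store(true, Ordering::Relaxed);
        }
    }

    fn record_connect_success(&self) {
        // Resets the streak but never de-escalates (the accepted trade-off).
        self.consecutive_connect_failures.store(0, Ordering::Relaxed);
    }

    fn connect_timeout(&self) -> Duration {
        if self.escalated.load(Ordering::Relaxed) {
            ESCALATED_CONNECT_TIMEOUT
        } else {
            INITIAL_CONNECT_TIMEOUT
        }
    }
}
```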
- Fix metadata min timeout max bound: 65s → 6s (matches code)
- Reconcile adaptive 1s initial vs existing 5s default: 1s is
  a new internal initial value, transitions to max_connect_timeout
- Reference Step 6 (PR Azure#3957, now merged) as the introducing PR
  for ShardedHttpTransport
- Pull latest from release/azure_data_cosmos-previews

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace typespec_client_core/azure_core dependency with
driver-internal approach using reqwest::RequestBuilder::timeout()
directly per request. No cross-crate changes needed.

Key changes:
- Remove all typespec_client_core Request/ClientMethodOptions
  proposals — driver controls reqwest directly via HttpClientFactory
- Enforce per-attempt timeout via reqwest::RequestBuilder::timeout()
  instead of azure_core::sleep() race
- Add options hierarchy section showing how users control timeout
  bounds (ConnectionPoolOptions) and e2e deadline (RuntimeOptions)
- Add thin client (Gateway 2.0) future ladder note: 6s, 6s, 10s
- Fix markdown lint: table alignment, heading levels, list formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
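
For illustration, the reqwest mechanism this commit settles on — connect timeout fixed when the client is built, overall per-attempt bound applied per request. The URL and durations here are placeholders; `ClientBuilder::connect_timeout` and `RequestBuilder::timeout` are real reqwest APIs:

```rust
use std::time::Duration;

// Connect timeout is a client-level setting; escalation swaps in a new
// client rather than changing this per request.
fn build_client(connect_timeout: Duration) -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .connect_timeout(connect_timeout)
        .build()
}

// Each retry attempt dispatches with its own overall timeout.
async fn dispatch_attempt(
    client: &reqwest::Client,
    url: &str,
    attempt_timeout: Duration,
) -> reqwest::Result<reqwest::Response> {
    client
        .get(url)
        .timeout(attempt_timeout) // per-request: covers connect through body
        .send()
        .await
}
```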
Addresses 15 findings from the PR Deep Reviewer pass:

- §1: Reframe Goal #4 as a transport-level timeout-retry counter; add
  non-goal clarifying users have no per-attempt knobs.
- §3: Note that per-attempt enforcement needs concrete reqwest::Client
  access (HttpClient trait exposes no per-request timeout); add
  'Why not azure_core::RetryPolicy' rationale.
- §4: Mark thin-client (Gateway 2.0) ladder as deferred.
- §5: Add counter interactions table with strict evaluation order,
  MAX_TIMEOUT_RETRIES hard bound, and a single canonical pseudocode
  that does not mutate request.deadline.
- §6: Add 'Per-Attempt Deadline ≠ request.deadline' subsection.
- §7: Cite min ≤ max invariant; document that min_*_request_timeout
  becomes a hard floor (migration note).
- §8: Rewrite dispatch strategy — driver-internal DriverHttpClient
  wrapper exposing the concrete reqwest::Client; require removing the
  unconditional builder.timeout() in http_client_factory.rs:173;
  document per-attempt Duration threading through AdaptiveTransport
  and ShardedHttpTransport.
- §9: Use ErrorKind::Connection (not reqwest::Error::is_connect);
  acknowledge long-lived-process transient-blip degradation as an
  accepted trade-off; distinguish new per-endpoint
  consecutive_connect_failures from the existing per-shard counter.
- Header date bumped to 2026-04-21.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 21, 2026 21:33
@tvaron3 tvaron3 requested a review from a team as a code owner April 21, 2026 21:33
@github-actions github-actions bot added the Cosmos (The azure_cosmos crate) label Apr 21, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds an internal design spec documenting how azure_data_cosmos_driver should implement per-attempt request timeout escalation (laddered timeouts on transport retries) plus an adaptive connect-timeout elevation mechanism in ShardedHttpTransport.

Changes:

  • Introduces a new DYNAMIC_TIMEOUTS_SPEC.md describing timeout ladder values for data-plane vs metadata requests.
  • Specifies retry-loop integration details (counter ordering, clamping order vs ConnectionPoolOptions and E2E deadline).
  • Specifies an adaptive connection-timeout model (1s → 5s) based on consecutive connection failures.

Comment threads: 4 outdated threads on sdk/cosmos/azure_data_cosmos_driver/docs/DYNAMIC_TIMEOUTS_SPEC.md
tvaron3 and others added 3 commits April 21, 2026 16:07
Applies Copilot reviewer suggestions and fixes the CI Build Analyze
(cspell) failure on the dynamic timeouts spec:

- §1 Non-goals: use instance-method form
  `OperationOptions::new().with_end_to_end_latency_policy(...)` instead
  of the inaccurate associated-function form.
- §4 Thin client ladder: rewritten to match the actual types
  (`HttpClientConfig::dataplane_gateway20`, `AdaptiveTransport::gateway20`,
  `TransportMode::Gateway20`) and clarify that "thin client" refers to
  a future Gateway20 mode with distinct retry semantics, not the existing
  HTTP/2 prior-knowledge Gateway20 transport.
- §7 Dispatch signatures: drop stale line-number citations for
  `AdaptiveTransport::send_with_dispatch` and `ShardedHttpTransport::send`
  and switch to symbol-based references. Mark the hedging path as future
  (not yet implemented).
- §9 Adaptive connection timeout: clamp the 1s initial value to
  `ConnectionPoolOptions::{min,max}_connect_timeout()` so the adaptive
  mechanism never starts outside the configured bounds. Update the flow
  diagram to reflect the clamp.
- Replace "cutover" (now gone with the §4 rewrite) and "unmutated"
  (→ "unchanged") to clear the cspell failures reported by the
  `Build Analyze` job.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Drop the Java SDK cross-reference tables (gateway/direct mode
timeouts) and the 'Why not azure_core::RetryPolicy?' subsection
from the dynamic timeouts spec. These sections are no longer
needed in the design document.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove §2 'Current Behavior' which detailed static timeout defaults
and transport retry internals. These are implementation details that
don't belong in the design spec. Renumber remaining sections 3-11
to 2-10 and update all cross-references.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@analogrelay analogrelay left a comment


Looks good. Some general clarifications, but I think it's solid.

Comment on lines +31 to +33
Progressively increasing the timeout on each retry attempt gives the operation a better chance of
succeeding without requiring the first attempt to use an unnecessarily long timeout — which would
add latency to the common case.

It's not so much that increasing the initial timeout causes latency, because we still conclude the request when the response is received no matter what the timeout was. It's that long timeouts mean we can't react as quickly to network issues. Maybe we just need to use a fresh connection to resolve the issues. It's like hedging, but instead of issuing a second request to a different region, we cancel the current one and start a new one to try and "reset" things. A tighter initial timeout means we can react quickly to intermittent network issues. But keeping it small means we risk thrashing the network with a bunch of retries that make things worse. Scaling the timeout interval up gives us a balance of the two.

Comment on lines +208 to +210
When hedging is enabled (see `TRANSPORT_PIPELINE_SPEC.md` §3.2), the hedged attempt runs in a
secondary region as a speculative execution. **Hedged attempts always start at tier 0 of the
timeout ladder**, independent of the primary attempt's ladder position.

And, to be clear, Hedging is an Operation Pipeline action, right? So from the transport pipeline's perspective, there's no existing timeout ladder state for a hedged request.

Comment on lines +422 to +426
1. `HttpClientFactory::build()` returns a driver-owned wrapper that exposes the underlying
concrete `reqwest::Client` (e.g., `Arc<DriverHttpClient>` where `DriverHttpClient` holds a
`reqwest::Client` directly). The wrapper still implements `azure_core::http::HttpClient` for
any other consumer that needs the trait, but the driver's transport layer holds and uses the
concrete type.

Call it HttpClient, just move it into the driver :), and there's no need to still implement azure_core::http::HttpClient IMO.
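
A rough sketch of the wrapper the quoted lines describe, folding in this suggestion (driver-internal name, no azure_core trait impl); the struct and method names are assumptions, while `Request::timeout_mut` is the concrete reqwest hook that the `Arc<dyn HttpClient>` abstraction hides:

```rust
use std::time::Duration;

// Driver-owned client holding the concrete reqwest::Client directly.
struct HttpClient {
    inner: reqwest::Client,
}

impl HttpClient {
    async fn dispatch_with_timeout(
        &self,
        mut req: reqwest::Request,
        attempt_timeout: Duration,
    ) -> reqwest::Result<reqwest::Response> {
        // Per-request timeout on the concrete type — unreachable through
        // the trait-object abstraction the spec moves away from.
        *req.timeout_mut() = Some(attempt_timeout);
        self.inner.execute(req).await
    }
}
```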

Comment on lines +427 to +430
2. `ClientShard` (`sharded_transport.rs`) stores the `Arc<DriverHttpClient>` instead of
`Arc<dyn HttpClient>`. `ShardedHttpTransport::send()` calls
`client.dispatch_with_timeout(req, attempt_timeout)` which internally builds the
`reqwest::Request` and applies `RequestBuilder::timeout(attempt_timeout)` before `.send()`.

We could just put the timeout on our transport request structure, that would be similar to how reqwest does it. Just a style choice rather than a correctness thing though.
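
The shape of that style choice, with a hypothetical `TransportRequest` — the per-attempt timeout rides on the request structure instead of being threaded as a separate parameter:

```rust
use std::time::Duration;

// Hypothetical driver transport request; only the timeout field is shown.
struct TransportRequest {
    // ... method, url, headers, body ...
    attempt_timeout: Option<Duration>,
}

impl TransportRequest {
    // Dispatch applies the carried timeout, mirroring how reqwest stores a
    // per-request timeout on its own Request type.
    fn apply_timeout(&self, builder: reqwest::RequestBuilder) -> reqwest::RequestBuilder {
        match self.attempt_timeout {
            Some(t) => builder.timeout(t),
            None => builder,
        }
    }
}
```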

Comment on lines +441 to +445
This is intentionally a **driver-internal** concrete-reqwest dispatch path — it does not change
`typespec_client_core` or the public `azure_core::http::HttpClient` trait. Consumers outside the
driver who pass an `Arc<dyn HttpClient>` into the driver are not supported for timeout
escalation; the driver constructs its dispatch clients itself via `HttpClientFactory` regardless
of any externally supplied `HttpClient`.

We shouldn't allow using Arc<dyn azure_core::http::HttpClient>. We need our own abstraction now anyway.

tvaron3 and others added 3 commits April 28, 2026 13:55
Sync code references with latest codebase:
- Factory returns Arc<dyn TransportClient> not Arc<dyn HttpClient>
- TODO is at transport_pipeline.rs:225-227 (not 247-249)
- OperationOptions.end_to_end_latency_policy is a public field,
  not a builder method — update prose and pseudocode examples
- Merge upstream/release/azure_data_cosmos-previews (session token
  helpers, upsert fix, response headers, query/item driver cutover)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document how reqwest per-request timeouts affect the underlying
connection for HTTP/1.1 vs HTTP/2:

- HTTP/1.1: dropping the response future cancels the hyper H1
  callback, the dispatcher calls close() on the connection, and
  it is not returned to the pool. Each timeout kills a TCP conn.
- HTTP/2: dropping the response future sends RST_STREAM(CANCEL)
  for that stream only; the multiplexed connection stays alive
  and is reused immediately.

Also covers implications for ShardedHttpTransport shard health
and h2 reset-stream accounting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- **Shard health**: The shard health tracking in `ShardedHttpTransport` should count
timeout-induced connection loss the same as any other connection failure for HTTP/1.1. For
HTTP/2, a stream-level timeout is **not** a shard-health signal — the connection is still
healthy.

We should probably track stream failures, and refresh connections if they hit a certain threshold. But that can be a future iteration on this.
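
That future iteration could look roughly like the following — an illustrative per-shard counter with an arbitrary threshold; none of these names or values come from the spec:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const STREAM_FAILURE_REFRESH_THRESHOLD: u32 = 8; // illustrative value only

/// Hypothetical HTTP/2 stream-failure accounting for a single shard.
struct ShardStreamHealth {
    stream_failures: AtomicU32,
}

impl ShardStreamHealth {
    /// Returns true once the shard's connection should be refreshed.
    fn record_stream_failure(&self) -> bool {
        self.stream_failures.fetch_add(1, Ordering::Relaxed) + 1
            >= STREAM_FAILURE_REFRESH_THRESHOLD
    }
}
```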
