[Internal] Add dynamic timeout escalation spec #4244

Open
tvaron3 wants to merge 21 commits into Azure:release/azure_data_cosmos-previews from tvaron3:tvaron3/dynamicTimeouts

Conversation

@tvaron3
Member

@tvaron3 tvaron3 commented Apr 21, 2026

Dynamic Timeout Spec

Adds DYNAMIC_TIMEOUTS_SPEC.md to sdk/cosmos/azure_data_cosmos_driver/docs/ describing the design for escalating per-attempt request timeouts on transport retries plus adaptive connection-timeout tuning.

This PR replaces the previously-closed #3871. It contains the original spec plus the revisions made in response to that PR's deep review (see "Changes since #3871" below).

Motivation

When a request times out and is retried with the same timeout, it often fails again — particularly with slow connections, regional latency spikes, or first-byte tail latencies. Escalating the timeout on each retry attempt gives the operation a meaningfully better chance of succeeding without making the initial attempt unnecessarily slow.

Key Design Decisions

  • Data plane request timeouts escalate 6s → 10s → 65s per retry attempt.
  • Metadata request timeouts escalate 5s → 10s → 20s, aligned with the Java SDK DatabaseAccount ladder.
  • No new retry budget — max_failover_retries = 3 already maps cleanly onto the 3-tier ladder; a hard MAX_TIMEOUT_RETRIES bound prevents unbounded growth if that default ever changes (see the ladder sketch after this list).
  • Fixed defaults — the ladder is not user-configurable. Existing ConnectionPoolOptions min/max bounds still clamp the effective values.
  • E2E deadline still wins — the operation-level deadline takes precedence over per-attempt ladder values.
  • Adaptive connection timeout starts at 1s and escalates to 5s after >3 consecutive connection failures per endpoint (one-time persistent transition in ShardedHttpTransport).
  • Per-attempt timeout via reqwest::RequestBuilder::timeout — threaded as a Duration parameter through the dispatch chain rather than being folded into request.deadline, so the deadline retains its E2E meaning.
  • Driver-internal DriverHttpClient wrapper exposing the concrete reqwest::Client so the dispatch site can actually call .timeout() (the previous Arc<dyn HttpClient> made the per-request setter unreachable).
  • Removes builder.timeout(config.request_timeout) at http_client_factory.rs:173 — that client-wide timeout would override every per-attempt value.
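
A minimal sketch of the ladder and clamping order these bullets describe — the free-function-plus-constants shape the spec calls for, with all names (`DATA_PLANE_LADDER`, `attempt_timeout`, the pool-bound parameters) illustrative rather than final:

```rust
use std::time::Duration;

// Ladder values from the spec; constant names are illustrative.
const DATA_PLANE_LADDER: [Duration; 3] = [
    Duration::from_secs(6),
    Duration::from_secs(10),
    Duration::from_secs(65),
];
const METADATA_LADDER: [Duration; 3] = [
    Duration::from_secs(5),
    Duration::from_secs(10),
    Duration::from_secs(20),
];
// Hard bound so the ladder saturates even if max_failover_retries grows.
const MAX_TIMEOUT_RETRIES: u32 = 3;

/// Effective per-attempt timeout: pick the ladder tier (saturating at the
/// top), clamp to pool bounds first, then cap by the remaining E2E deadline.
fn attempt_timeout(
    ladder: &[Duration; 3],
    timeout_retries: u32,
    pool_min: Duration, // ConnectionPoolOptions lower bound (min <= max invariant)
    pool_max: Duration, // ConnectionPoolOptions upper bound
    remaining_deadline: Duration,
) -> Duration {
    let tier = timeout_retries.min(MAX_TIMEOUT_RETRIES - 1) as usize;
    let tier = tier.min(ladder.len() - 1);
    ladder[tier].clamp(pool_min, pool_max).min(remaining_deadline)
}
```

Note the function never touches `request.deadline`: the remaining deadline comes in as an input and merely caps the result, which is how the E2E deadline keeps precedence.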

Cross-SDK Alignment

The Java SDK already implements timeout escalation for gateway/metadata HTTP calls (QueryPlan: 0.5s→5s→10s, DatabaseAccount: 5s→10s→20s). This spec extends the pattern to both data plane and metadata requests in the Rust driver.

Changes since #3871

The spec was substantially revised in response to the deep review on the previous PR. Highlights:

  • Driver-owned HTTP abstraction acknowledged throughout — references the new HttpClientFactory and the DriverHttpClient wrapper that exposes reqwest::Client to the dispatch site.
  • Per-attempt Duration plumbing — explicitly threaded through dispatch instead of mutating request.deadline; counter-ordering table added to make the interaction with E2E deadline unambiguous.
  • Hard MAX_TIMEOUT_RETRIES bound added to defend against future increases to max_failover_retries.
  • ErrorKind::Connection used for failure classification instead of reqwest::Error::is_connect() (the latter doesn't survive the abstraction boundary).
  • Transient-blip degradation for the adaptive connection timeout is explicitly acknowledged as an accepted trade-off.
  • min_*_request_timeout migration note clarifies how existing ConnectionPoolOptions values continue to clamp the new ladder.
  • azure_core::RetryPolicy rationale explains why a custom escalation policy is needed at the transport layer.
  • Thin-client ladder explicitly marked deferred.
  • "Users have no per-attempt knobs" added as an explicit non-goal.

Deferred

tvaron3 and others added 15 commits March 5, 2026 15:00
Add DYNAMIC_TIMEOUTS_SPEC.md to the driver docs describing
escalating connection and request timeouts on transport retries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve 3 of 4 open questions based on review feedback:

- Commit to Context-based approach (Option A) for delivering
  per-attempt timeouts, removing Options B/C. No changes needed
  in azure_core or typespec_client_core.
- Decide hedged attempts always start at tier 0 of the timeout
  ladder since they target independent regional endpoints.
- Record effective per-attempt timeout values in DiagnosticsContext
  for observability.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update the dynamic timeout spec to reflect the new pipeline
architecture from PR Azure#3887 (Driver Transport step 02):

- Replace MAX_TRANSPORT_RETRIES/TransportRetry references with
  the new FailoverRetry/SessionRetry model and failover_retry_count
- Move retry loop references from cosmos_driver.rs to the 7-stage
  loop in operation_pipeline.rs
- Change timeout delivery from Context type-map to TransportRequest
  field, matching the new pipeline data flow
- Update goal from 'increase retry count' to 'leverage existing
  retry budget' since max_failover_retries already defaults to 3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Propose adding an optional per-request timeout field to
typespec_client_core::http::Request with set_timeout()/timeout()
methods. The reqwest HttpClient implementation applies it via
reqwest::RequestBuilder::timeout().

For connection timeout escalation, the reqwest client is built
with the maximum ladder value (5s) and the per-request timeout
provides the tighter overall bound on each attempt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove all connection timeout escalation from the spec.
Connection timeouts remain static (configured via
ConnectionPoolOptions). Only request timeouts are escalated
per retry attempt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Connection timeouts use a failure-rate adaptive model instead
of per-retry escalation:

- Start at 1s (sufficient for any cloud/datacenter network)
- ShardedHttpTransport monitors connection failure rate
- On sustained failures, create new HttpClient instances with
  5s connect_timeout (one-time persistent transition)
- No azure_core changes needed — entirely internal to the
  sharded transport

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove direct mode non-goal (won't be implemented)
- Resolve connection failure threshold: >3 consecutive failures
  per endpoint triggers 1s to 5s escalation
- Old shards are actively marked unhealthy on escalation for
  immediate reclamation by health sweep
- Explicitly state per-endpoint failure tracking scope
- Remove resolved open questions 2 and 3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ShardedHttpTransport is no longer 'future integration' — it
  exists in sharded_transport.rs with per-shard health sweeps
- Fix HttpClientFactory::create() → build()
- Fix HttpClientConfig connect_timeout claim — connect_timeout
  is sourced from ConnectionPoolOptions::max_connect_timeout()
  inside the factory build() method, not as a config field
- Reference existing per-request timeout mechanism in transport
  pipeline (azure_core::sleep racing the HTTP future)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix section numbering (missing §9)
- Add rationale for 65s tier-3 jump
- Add cross-reference for ladder saturation in §5
- Add typespec_client_core cross-crate PR strategy
- Clarify adaptive connection timeout: idempotent transition,
  per-endpoint aggregation, no-fallback rationale

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Major architectural changes to align with codebase patterns:

- Move timeout escalation to transport pipeline level, indexed
  by transport-level timeout retries (not operation failover)
- Use deadline-based pattern (Instant) instead of adding a
  Duration field to TransportRequest — feeds into existing
  remaining_request_timeout() mechanism
- Remove TimeoutLadder struct; use free function + constants
  matching DOP style
- Remove AttemptTimeoutDiagnostics struct; record flat field
  directly on RequestDiagnostics
- Clarify sleep-race is enforcement mechanism; Request::set_timeout
  is long-term replacement (never both active)
- Use AtomicBool/AtomicU32 for adaptive connection timeout state
  matching lock-free hot-path pattern
- Reference PipelineType for ladder selection
- Explicit clamping order: pool bounds first, deadline last
- Add note that min_* timeout fields are unused placeholders

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
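
As a sketch of the lock-free pattern this commit names (AtomicBool/AtomicU32 on the hot path), with assumed type and field names — the spec places this state per endpoint inside ShardedHttpTransport, and the real values are additionally clamped to the ConnectionPoolOptions connect-timeout bounds:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::time::Duration;

const ESCALATION_THRESHOLD: u32 = 3; // >3 consecutive failures escalates
const INITIAL_CONNECT_TIMEOUT: Duration = Duration::from_secs(1);
const ESCALATED_CONNECT_TIMEOUT: Duration = Duration::from_secs(5);

/// Hypothetical per-endpoint state; names are illustrative.
struct EndpointConnectState {
    consecutive_connect_failures: AtomicU32,
    escalated: AtomicBool, // one-way, persistent transition
}

impl EndpointConnectState {
    fn record_connect_failure(&self) {
        let failures =
            self.consecutive_connect_failures.fetch_add(1, Ordering::Relaxed) + 1;
        if failures > ESCALATION_THRESHOLD {
            // Idempotent: repeated stores are harmless; there is no fallback.
            self.escalated.store(true, Ordering::Relaxed);
        }
    }

    fn record_connect_success(&self) {
        // Resets the streak but never de-escalates (the accepted trade-off).
        self.consecutive_connect_failures.store(0, Ordering::Relaxed);
    }

    fn connect_timeout(&self) -> Duration {
        if self.escalated.load(Ordering::Relaxed) {
            ESCALATED_CONNECT_TIMEOUT
        } else {
            INITIAL_CONNECT_TIMEOUT
        }
    }
}
```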
- Fix metadata min timeout max bound: 65s → 6s (matches code)
- Reconcile adaptive 1s initial vs existing 5s default: 1s is
  a new internal initial value, transitions to max_connect_timeout
- Reference Step 6 (PR Azure#3957, now merged) as the introducing PR
  for ShardedHttpTransport
- Pull latest from release/azure_data_cosmos-previews

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace typespec_client_core/azure_core dependency with
driver-internal approach using reqwest::RequestBuilder::timeout()
directly per request. No cross-crate changes needed.

Key changes:
- Remove all typespec_client_core Request/ClientMethodOptions
  proposals — driver controls reqwest directly via HttpClientFactory
- Enforce per-attempt timeout via reqwest::RequestBuilder::timeout()
  instead of azure_core::sleep() race
- Add options hierarchy section showing how users control timeout
  bounds (ConnectionPoolOptions) and e2e deadline (RuntimeOptions)
- Add thin client (Gateway 2.0) future ladder note: 6s, 6s, 10s
- Fix markdown lint: table alignment, heading levels, list formatting

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
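
For illustration, the reqwest mechanism this commit settles on — connect timeout fixed when the client is built, overall per-attempt bound applied per request. The URL and durations here are placeholders; `ClientBuilder::connect_timeout` and `RequestBuilder::timeout` are real reqwest APIs:

```rust
use std::time::Duration;

// Connect timeout is a client-level setting; escalation swaps in a new
// client rather than changing this per request.
fn build_client(connect_timeout: Duration) -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .connect_timeout(connect_timeout)
        .build()
}

// Each retry attempt dispatches with its own overall timeout.
async fn dispatch_attempt(
    client: &reqwest::Client,
    url: &str,
    attempt_timeout: Duration,
) -> reqwest::Result<reqwest::Response> {
    client
        .get(url)
        .timeout(attempt_timeout) // per-request: covers connect through body
        .send()
        .await
}
```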
Addresses 15 findings from the PR Deep Reviewer pass:

- §1: Reframe Goal #4 as a transport-level timeout-retry counter; add
  non-goal clarifying users have no per-attempt knobs.
- §3: Note that per-attempt enforcement needs concrete reqwest::Client
  access (HttpClient trait exposes no per-request timeout); add
  'Why not azure_core::RetryPolicy' rationale.
- §4: Mark thin-client (Gateway 2.0) ladder as deferred.
- §5: Add counter interactions table with strict evaluation order,
  MAX_TIMEOUT_RETRIES hard bound, and a single canonical pseudocode
  that does not mutate request.deadline.
- §6: Add 'Per-Attempt Deadline ≠ request.deadline' subsection.
- §7: Cite min ≤ max invariant; document that min_*_request_timeout
  becomes a hard floor (migration note).
- §8: Rewrite dispatch strategy — driver-internal DriverHttpClient
  wrapper exposing the concrete reqwest::Client; require removing the
  unconditional builder.timeout() in http_client_factory.rs:173;
  document per-attempt Duration threading through AdaptiveTransport
  and ShardedHttpTransport.
- §9: Use ErrorKind::Connection (not reqwest::Error::is_connect);
  acknowledge long-lived-process transient-blip degradation as an
  accepted trade-off; distinguish new per-endpoint
  consecutive_connect_failures from the existing per-shard counter.
- Header date bumped to 2026-04-21.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 21, 2026 21:33
@tvaron3 tvaron3 requested a review from a team as a code owner April 21, 2026 21:33
@github-actions github-actions bot added the Cosmos (The azure_cosmos crate) label Apr 21, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds an internal design spec documenting how azure_data_cosmos_driver should implement per-attempt request timeout escalation (laddered timeouts on transport retries) plus an adaptive connect-timeout elevation mechanism in ShardedHttpTransport.

Changes:

  • Introduces a new DYNAMIC_TIMEOUTS_SPEC.md describing timeout ladder values for data-plane vs metadata requests.
  • Specifies retry-loop integration details (counter ordering, clamping order vs ConnectionPoolOptions and E2E deadline).
  • Specifies an adaptive connection-timeout model (1s → 5s) based on consecutive connection failures.

Comment threads: 4 outdated threads on sdk/cosmos/azure_data_cosmos_driver/docs/DYNAMIC_TIMEOUTS_SPEC.md
tvaron3 and others added 3 commits April 21, 2026 16:07
Applies Copilot reviewer suggestions and fixes the CI Build Analyze
(cspell) failure on the dynamic timeouts spec:

- §1 Non-goals: use instance-method form
  `OperationOptions::new().with_end_to_end_latency_policy(...)` instead
  of the inaccurate associated-function form.
- §4 Thin client ladder: rewritten to match the actual types
  (`HttpClientConfig::dataplane_gateway20`, `AdaptiveTransport::gateway20`,
  `TransportMode::Gateway20`) and clarify that "thin client" refers to
  a future Gateway20 mode with distinct retry semantics, not the existing
  HTTP/2 prior-knowledge Gateway20 transport.
- §7 Dispatch signatures: drop stale line-number citations for
  `AdaptiveTransport::send_with_dispatch` and `ShardedHttpTransport::send`
  and switch to symbol-based references. Mark the hedging path as future
  (not yet implemented).
- §9 Adaptive connection timeout: clamp the 1s initial value to
  `ConnectionPoolOptions::{min,max}_connect_timeout()` so the adaptive
  mechanism never starts outside the configured bounds. Update the flow
  diagram to reflect the clamp.
- Replace "cutover" (now gone with the §4 rewrite) and "unmutated"
  (→ "unchanged") to clear the cspell failures reported by the
  `Build Analyze` job.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Drop the Java SDK cross-reference tables (gateway/direct mode
timeouts) and the 'Why not azure_core::RetryPolicy?' subsection
from the dynamic timeouts spec. These sections are no longer
needed in the design document.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove §2 'Current Behavior' which detailed static timeout defaults
and transport retry internals. These are implementation details that
don't belong in the design spec. Renumber remaining sections 3-11
to 2-10 and update all cross-references.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@analogrelay analogrelay left a comment


Looks good. Some general clarifications, but I think it's solid.

Comment on lines +31 to +33
Progressively increasing the timeout on each retry attempt gives the operation a better chance of
succeeding without requiring the first attempt to use an unnecessarily long timeout — which would
add latency to the common case.

It's not so much that increasing the initial timeout causes latency, because we still conclude the request when the response is received no matter what the timeout was. It's that long timeouts mean we can't react as quickly to network issues. Maybe we just need to use a fresh connection to resolve the issues. It's like hedging, but instead of issuing a second request to a different region, we cancel the current one and start a new one to try and "reset" things. A tighter initial timeout means we can react quickly to intermittent network issues. But keeping it small means we risk thrashing the network with a bunch of retries that make things worse. Scaling the timeout interval up gives us a balance of the two.

Comment on lines +208 to +210
When hedging is enabled (see `TRANSPORT_PIPELINE_SPEC.md` §3.2), the hedged attempt runs in a
secondary region as a speculative execution. **Hedged attempts always start at tier 0 of the
timeout ladder**, independent of the primary attempt's ladder position.

And, to be clear, Hedging is an Operation Pipeline action, right? So from the transport pipeline's perspective, there's no existing timeout ladder state for a hedged request.

Comment on lines +422 to +426
1. `HttpClientFactory::build()` returns a driver-owned wrapper that exposes the underlying
concrete `reqwest::Client` (e.g., `Arc<DriverHttpClient>` where `DriverHttpClient` holds a
`reqwest::Client` directly). The wrapper still implements `azure_core::http::HttpClient` for
any other consumer that needs the trait, but the driver's transport layer holds and uses the
concrete type.

Call it HttpClient, just move it into the driver :), and there's no need to still implement azure_core::http::HttpClient IMO.
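
A rough sketch of the wrapper the quoted lines describe, folding in this suggestion (driver-internal name, no azure_core trait impl); the struct and method names are assumptions, while `Request::timeout_mut` is the concrete reqwest hook that the `Arc<dyn HttpClient>` abstraction hides:

```rust
use std::time::Duration;

// Driver-owned client holding the concrete reqwest::Client directly.
struct HttpClient {
    inner: reqwest::Client,
}

impl HttpClient {
    async fn dispatch_with_timeout(
        &self,
        mut req: reqwest::Request,
        attempt_timeout: Duration,
    ) -> reqwest::Result<reqwest::Response> {
        // Per-request timeout on the concrete type — unreachable through
        // the trait-object abstraction the spec moves away from.
        *req.timeout_mut() = Some(attempt_timeout);
        self.inner.execute(req).await
    }
}
```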

Comment on lines +427 to +430
2. `ClientShard` (`sharded_transport.rs`) stores the `Arc<DriverHttpClient>` instead of
`Arc<dyn HttpClient>`. `ShardedHttpTransport::send()` calls
`client.dispatch_with_timeout(req, attempt_timeout)` which internally builds the
`reqwest::Request` and applies `RequestBuilder::timeout(attempt_timeout)` before `.send()`.

We could just put the timeout on our transport request structure, that would be similar to how reqwest does it. Just a style choice rather than a correctness thing though.
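
The shape of that style choice, with a hypothetical `TransportRequest` — the per-attempt timeout rides on the request structure instead of being threaded as a separate parameter:

```rust
use std::time::Duration;

// Hypothetical driver transport request; only the timeout field is shown.
struct TransportRequest {
    // ... method, url, headers, body ...
    attempt_timeout: Option<Duration>,
}

impl TransportRequest {
    // Dispatch applies the carried timeout, mirroring how reqwest stores a
    // per-request timeout on its own Request type.
    fn apply_timeout(&self, builder: reqwest::RequestBuilder) -> reqwest::RequestBuilder {
        match self.attempt_timeout {
            Some(t) => builder.timeout(t),
            None => builder,
        }
    }
}
```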

Comment on lines +441 to +445
This is intentionally a **driver-internal** concrete-reqwest dispatch path — it does not change
`typespec_client_core` or the public `azure_core::http::HttpClient` trait. Consumers outside the
driver who pass an `Arc<dyn HttpClient>` into the driver are not supported for timeout
escalation; the driver constructs its dispatch clients itself via `HttpClientFactory` regardless
of any externally supplied `HttpClient`.

We shouldn't allow using Arc<dyn azure_core::http::HttpClient>. We need our own abstraction now anyway.

tvaron3 and others added 3 commits April 28, 2026 13:55
Sync code references with latest codebase:
- Factory returns Arc<dyn TransportClient> not Arc<dyn HttpClient>
- TODO is at transport_pipeline.rs:225-227 (not 247-249)
- OperationOptions.end_to_end_latency_policy is a public field,
  not a builder method — update prose and pseudocode examples
- Merge upstream/release/azure_data_cosmos-previews (session token
  helpers, upsert fix, response headers, query/item driver cutover)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document how reqwest per-request timeouts affect the underlying
connection for HTTP/1.1 vs HTTP/2:

- HTTP/1.1: dropping the response future cancels the hyper H1
  callback, the dispatcher calls close() on the connection, and
  it is not returned to the pool. Each timeout kills a TCP conn.
- HTTP/2: dropping the response future sends RST_STREAM(CANCEL)
  for that stream only; the multiplexed connection stays alive
  and is reused immediately.

Also covers implications for ShardedHttpTransport shard health
and h2 reset-stream accounting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- **Shard health**: The shard health tracking in `ShardedHttpTransport` should count
timeout-induced connection loss the same as any other connection failure for HTTP/1.1. For
HTTP/2, a stream-level timeout is **not** a shard-health signal — the connection is still
healthy.

We should probably track stream failures, and refresh connections if they hit a certain threshold. But that can be a future iteration on this.
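
That future iteration could look roughly like the following — an illustrative per-shard counter with an arbitrary threshold; none of these names or values come from the spec:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const STREAM_FAILURE_REFRESH_THRESHOLD: u32 = 8; // illustrative value only

/// Hypothetical HTTP/2 stream-failure accounting for a single shard.
struct ShardStreamHealth {
    stream_failures: AtomicU32,
}

impl ShardStreamHealth {
    /// Returns true once the shard's connection should be refreshed.
    fn record_stream_failure(&self) -> bool {
        self.stream_failures.fetch_add(1, Ordering::Relaxed) + 1
            >= STREAM_FAILURE_REFRESH_THRESHOLD
    }
}
```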
