Skip to content

metadata-timeout-propagation-fix#47529

Open
dibahlfi wants to merge 2 commits into
mainfrom
users/dibahl/metadata-timeout-propagation-fix
Open

metadata-timeout-propagation-fix#47529
dibahlfi wants to merge 2 commits into
mainfrom
users/dibahl/metadata-timeout-propagation-fix

Conversation

@dibahlfi

@dibahlfi dibahlfi commented Jun 16, 2026

Copy link
Copy Markdown
Member

Issue - when a customer passes timeouts to an operation like single query_items(...) call, the SDK was honoring them on data plane operations but it was not consistently enforcing them on behind-the-scenes metadata classs the SDK has to make - reading the container, listing partitions (/pkranges), and fetching the query plan. On those calls the SDK fell back to the client/policy defaults instead of the value client set.
After this change, a per-call timeout bounds the whole query — metadata calls included.

Concrete symptoms this removes
read_timeout ignored on /pkranges. A customer that sets a tight client default and raises read_timeout for one slow query would see that query die on a sub-second routing fetch, with an error naming the wrong number (the client default, not the value they set) — pointing them at the page fetch that never ran.

connection_timeout ignored on all setup calls. A service that sets a fail-fast socket-open budget got the full 5s default on exactly the calls a cold operation is made of.

timeout= deadline didn't cover setup. A hard end-to-end bound meant to keep a slow dependency from tying up a request thread didn't actually bound /pkranges or the query plan, so slow setup could run past it.

What changed:
The per-call timeouts (and the operation's start mark) are carried into the container read, /pkranges, and query-plan calls, so they reach the request layer the same way the page fetch always has.
The query-plan call now omits read_timeout when you didn't set one, instead of forwarding None (which previously disabled that call's read timeout). It now falls back to the policy default like every other call.
The async retry loop runs its after-the-call deadline check on the normal request path, restoring sync/async parity.

No new timers or defaults are added.
Failover stays fast. The forced-short failover probes (3s account probe, 6s recovery probe) are untouched, so nothing a caller passes through a query can slow a regional failover.
No public API change.

Testing:
Unit : value is picked up and handed along at every step; the /pkranges options formatter keeps the timers; an unset timer is never forced to None; sync and async deadline parity.
Fault-injection integration (sync + async): the values actually reach the wire on the container read, query plan, and /pkranges - on a cold client and after a partition-split refresh - the account probe keeps its short budget, and a tight timeout = stops the query during setup.

- propagate per-call read_timeout, connection_timeout, and timeout (operation deadline) options across query setup metadata calls (container read, query plan, /pkranges) in sync and async paths
- extend test coverage for timeout propagation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dibahlfi dibahlfi requested a review from a team as a code owner June 16, 2026 17:50
Copilot AI review requested due to automatic review settings June 16, 2026 17:50
@dibahlfi

Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes and validates propagation of per-call timeout settings (read_timeout, connection_timeout, and operation timeout/deadline) across the Cosmos query “setup” metadata requests (container read, query plan fetch, and /pkranges) for both sync and async execution paths.

Changes:

  • Adds shared helpers in _base.py to carry/copy per-call timeout settings (and operation start time) into downstream request kwargs/options.
  • Wires these helpers into query execution (query plan), container read, and hybrid-search /pkranges metadata flows (sync + async).
  • Adds unit + live/emulator tests to assert timeout propagation and deadline behavior; updates CHANGELOG entry.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
sdk/cosmos/azure-cosmos/azure/cosmos/_base.py Adds helper utilities/constants to propagate per-call timeouts and operation start time into options/kwargs and ensures pk-range options retain them.
sdk/cosmos/azure-cosmos/azure/cosmos/_constants.py Introduces Kwargs.CONNECTION_TIMEOUT constant and updates inline documentation/comments for kwargs constants.
sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py Uses the new helper to copy per-call timeouts/start time from options into request kwargs for query feed calls.
sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py Async equivalent: copies per-call timeouts/start time from options into request kwargs for query feed calls.
sdk/cosmos/azure-cosmos/azure/cosmos/container.py Uses shared helper to forward per-call timeouts/start time into container read requests built from options.
sdk/cosmos/azure-cosmos/azure/cosmos/aio/_container.py Async equivalent: forwards per-call timeouts/start time into container read requests.
sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/execution_dispatcher.py Updates query-plan dispatch to forward additional per-call timeout/deadline settings via kwargs.
sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/aio/execution_dispatcher.py Async equivalent: updates query-plan dispatch timeout/deadline forwarding.
sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/hybrid_search_aggregator.py Ensures hybrid-search all-ranges /pkranges path carries per-call timeout/deadline options.
sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/aio/hybrid_search_aggregator.py Async equivalent: carries per-call timeout/deadline options for hybrid-search all-ranges /pkranges.
sdk/cosmos/azure-cosmos/tests/test_timeout_propagation_unit.py Adds mock-light unit tests validating timeout/deadline propagation behavior through key helpers and dispatchers.
sdk/cosmos/azure-cosmos/tests/test_metadata_timeout_propagation.py Adds end-to-end fault-injection tests to confirm timeouts reach metadata setup calls and operation deadlines halt setup.
sdk/cosmos/azure-cosmos/tests/test_container_rid_header_unit.py Extends pk-range option sanitization tests for timeouts and improves mocks to seed required internal sidecars.
sdk/cosmos/azure-cosmos/CHANGELOG.md Documents the timeout propagation bug fix under “Bugs Fixed”.

Comment thread sdk/cosmos/azure-cosmos/tests/test_timeout_propagation_unit.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/execution_dispatcher.py Outdated
@dibahlfi

Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi

Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi

Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi

Copy link
Copy Markdown
Member Author

@sdkReviewAgent

1 similar comment
@xinlian12

Copy link
Copy Markdown
Member

@sdkReviewAgent

class Kwargs:
"""Keyword arguments used in the azure-cosmos package"""

# Whether to retry write operations if they fail. Used either at client level or request level.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💬 Observation — Scope: unrelated style change bundled with the bugfix

This hunk does two things: it adds the new CONNECTION_TIMEOUT (which the PR needs), and it converts the pre-existing RETRY_WRITE and AVAILABILITY_STRATEGY from PEP 224 attribute docstrings (triple-quoted strings after the assignment) into # comments before the assignment. The diff confirms it:

-        RETRY_WRITE: Literal["retry_write"] = "retry_write"
-        """Whether to retry write operations if they fail. Used either at client level or request level."""
+        # Whether to retry write operations if they fail. Used either at client level or request level.
+        RETRY_WRITE: Literal["retry_write"] = "retry_write"

The style conversion is unrelated to timeout propagation and has a small cost: # comments are not available to Sphinx autoattribute extensions or PyCharm-style attribute-doc tooling that pick up PEP 224 strings, whereas the previous form was. The class is private (_Constants.Kwargs), so external docs are unlikely to suffer — but the change makes this PR slightly harder to review (mixed-purpose hunks) and quietly downgrades IDE/Sphinx behavior for these constants.

Suggested: either revert the conversion for the two existing keys and add only CONNECTION_TIMEOUT in the existing PEP 224 style, or pull the style conversion into a separate PR. Either keeps this PR focused on the bug it advertises.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

# The three per-call timeout keys a caller can set on one request. The deadline
# tuple below adds OperationStartTime to these; the carry helpers iterate that
# 4-key tuple, not this one.
_PER_CALL_TIMEOUT_OPTION_KEYS: Tuple[str, ...] = (

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion — Maintainability: _PER_CALL_TIMEOUT_OPTION_KEYS is vestigial

This 3-key tuple has no production consumer. Both helpers (_carry_per_call_timeout_options, _copy_per_call_timeouts_to_kwargs) iterate _PER_CALL_DEADLINE_OPTION_KEYS (the 4-key tuple at line 1102). The 3-key tuple is referenced only in tests/test_timeout_propagation_unit.py:122 to assert its own shape.

The two named constants look like a hierarchy ("just the timeouts" vs "timeouts + anchor"), but only the 4-key one is real. A new contributor reasonably reaching for "the canonical set of per-call timeout option keys" would import this tuple and write a carry loop that silently drops OperationStartTime — defeating the operation-deadline semantics this PR is establishing.

Suggested fix: either inline the three keys directly into _PER_CALL_DEADLINE_OPTION_KEYS and drop the 3-key tuple + its test, or, if you want to keep the distinction, add a comment that this tuple is for "name-only" use and not for iteration. The current shape is one rename away from a regression that wouldn't be caught by the existing tests (which exercise the 4-key path).

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

# kwargs, where _Request reads them. A value is copied only when set, so
# an unset timeout falls back to the client/policy default instead of
# None; setdefault keeps any explicit kwarg the caller already placed.
base._copy_per_call_timeouts_to_kwargs(options, kwargs)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Recommendation — Coverage: change-feed /pkranges setup still drops per-call timeouts

The PR description promises "a per-call timeout bounds the whole query — metadata calls included", and the new helper here lifts the per-call timeouts out of options into kwargs so the page-fetch path picks them up. Good — for regular queries.

But a few lines below, the change-feed branch builds a fresh feed_options dict and copies only excludedLocations into it:

change_feed_state: Optional[ChangeFeedState] = options.get("changeFeedState")
if change_feed_state is not None:
    feed_options = {}
    if 'excludedLocations' in options:
        feed_options['excludedLocations'] = options['excludedLocations']
    change_feed_state.populate_request_headers(self._routing_map_provider, headers, feed_options)

populate_request_headers then drives routing_provider.get_overlapping_ranges(..., feed_options) to do the /pkranges lookup (_change_feed/change_feed_state.py:289-293). Because feed_options doesn't include read_timeout / connection_timeout / timeout / OperationStartTime, this metadata /pkranges fetch falls back to the client/policy defaults — the exact symptom the PR was written to remove.

A customer calling container.query_items_change_feed(feed_range=fr, read_timeout=30, connection_timeout=2, timeout=10) on a cold client (or after a routing-cache invalidation from a 410 / partition split) will still see the misleading error message the PR's commit log calls out — just on the change-feed surface instead of the query surface.

The async sibling at aio/_cosmos_client_connection_async.py:3110-3114 has the same shape and the same gap.

Suggested fix: add base._carry_per_call_timeout_options(options, feed_options) after the excludedLocations copy in both files. Mechanically identical to what the PR already does for the hybrid all-ranges path in _execution_context/hybrid_search_aggregator.py. A parallel test modeled on test_metadata_timeout_propagation.py would lock it in.

If this is intentionally out of scope, please file a tracking issue and call it out in the CHANGELOG so customers know the change-feed surface isn't covered yet.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

"""
if not source:
return
for key in _PER_CALL_DEADLINE_OPTION_KEYS:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Suggestion — Defensive consistency: _carry_per_call_timeout_options copies explicit None while its sibling skips it

The two helpers handle None values asymmetrically:

  • _carry_per_call_timeout_options (this function, lines 1124-1126): copies any key that is present, including explicit None.
  • _copy_per_call_timeouts_to_kwargs (lines 1149-1152): skips None values via if value is not None.

This is fine today because every terminal lift into request kwargs goes through the None-skipping helper before reaching _synchronized_request._Request, where kwargs.pop("read_timeout", connection_policy.ReadTimeout) would otherwise return None and disable the socket read timeout (the exact bug Copilot flagged earlier in this PR for the query-plan dispatcher).

The latent risk is that a future code path might use the result of _carry_per_call_timeout_options to build options that are handed directly to the request layer (or used as the source for another carry that doesn't go through _copy_per_call_timeouts_to_kwargs). The intermediate dict would carry None through and re-introduce the same bug — silently, because the helpers' names don't hint at the asymmetry.

Suggested fix: make _carry_per_call_timeout_options also skip None (mirror the if value is not None guard). One-line change; eliminates the asymmetry; no behavior change today.

⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

@xinlian12

Copy link
Copy Markdown
Member

Review complete (49:30)

Posted 4 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants