Untyped ValueError/AssertionError bug by dibahlfi · Pull Request #47091 · Azure/azure-sdk-for-python

dibahlfi · 2026-05-22T22:14:51Z

This PR fixes two related crash paths in the partition key range cache that can happen when the gateway returns a transiently inconsistent paginated /pkranges snapshot:

ValueError("Ranges overlap ...")
AssertionError("code bug: returned overlapping ranges ... is empty")

Both are caused by the same class of transient metadata inconsistency, but with different shapes:

overlap: two ranges claim the same key space
gap: no range claims part of key space

Why this change is needed

We’ve seen this in a real high-scale scenario (very high partition count, long-running async scan).
The SDK previously handled these inconsistencies asymmetrically:

overlap could escape as a hard ValueError
gap could flow through as an incomplete map and later crash in SmartRoutingMapProvider with the empty-overlap assertion

So transient metadata blips could fail user workloads instead of being treated as retryable.

What changed

Unified handling for full-load transient snapshot inconsistencies (overlap + gap):
detect inconsistency
retry with bounded jittered backoff
if still inconsistent after retry budget, raise typed CosmosHttpResponseError(status_code=503) so existing retry policy can absorb it
Kept incremental-path safety:
overlap from incremental merge is converted into incremental-merge failure/fallback behavior (with guarded conversion so unrelated ValueErrors still surface normally)
Added dedup-by-range-id before validation to tolerate duplicate IDs across paginated snapshots from slightly skewed gateway views.
Applied consistently in sync and async routing map providers.

Reviewer-relevant behavior change
Before:

transient overlap/gap could surface as internal exceptions (ValueError / assertion)

After:

transient inconsistency is retried internally
persistent inconsistency surfaces as retryable 503 (instead of non-retryable internal crash)

Retry budget and jittered backoff
When the gateway returns an inconsistent /pkranges snapshot, it's almost always because a partition split is mid-propagation across the gateway's internal caches — different gateway nodes see slightly different partition layouts for a few hundred milliseconds, and a request that lands on a lagging node sees overlap or gap. A short pause lets the gateway catch up to itself. This is different from the usual "too many requests, back off and spread out" scenario, so the retry schedule is tuned to always wait a minimum amount of time before retrying (the floor) rather than risk retrying so fast we just see the same stale snapshot again..

4 attempts total (3 sleeps before surfacing 503). Worst-case cumulative blocking time is ~1.4 s; expected ~0.775 s when all three retries occur.
Exponential schedule with a base of 0.2 s: deterministic upper bounds are 0.2 s, 0.4 s, 0.8 s.
Per-sleep cap of 2.0 s so a future bump to the attempt count cannot accidentally produce multi-second blocking sleeps.
Floored full jitter: each sleep is drawn uniformly from [floor, upper] where floor = min(0.05 s, upper / 4).
The non-zero floor eliminates the near-zero-sleep tail of pure full jitter, which would otherwise burn a retry attempt before the gateway has had any time to converge.
The uniform distribution across [floor, upper] (rather than max(floor, uniform(0, upper))) preserves full jitter's fleet-wide herd dispersion without creating a probability spike at the floor.
The upper / 4 clamp keeps the jitter range at ≥75% of the deterministic upper bound on the smallest attempts, so the floor never collapses an attempt into a near-constant wait.

Scope
This PR is intentionally scoped to PK-range cache resilience for transient /pkranges inconsistency handling; it does not change public API surface.

dibahlfi · 2026-05-22T22:15:31Z

@sdkReviewAgent-2

dibahlfi · 2026-05-22T22:16:06Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-22T22:16:23Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

This PR addresses a Cosmos routing-map construction failure where transient, paginated /pkranges snapshot inconsistencies could surface as a bare ValueError("Ranges overlap") (or be swallowed into empty results), by converting overlap into a typed sentinel + bounded retry/backoff and ultimately surfacing a transient CosmosHttpResponseError(503) when the retry budget is exhausted.

Changes:

Introduces _OverlapDetected as a dedicated overlap sentinel and improves overlap diagnostics in CollectionRoutingMap.is_complete_set_of_range.
Adds a shared overlap retry/backoff policy helper (with jitter) and integrates it into both sync and async routing-map providers.
Adds regression + end-to-end tests covering transient overlap recovery, persistent overlap → 503, and incremental-merge overlap fallback behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py	Adds overlap sentinel, dedups full-load ranges by id, converts overlap `ValueError` into `_OverlapDetected`, and improves overlap error messages.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py	Centralizes overlap retry budget/backoff policy (with jitter) and converts incremental-merge overlap `ValueError` into `_IncrementalMergeFailed`.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py	Sync provider catches `_OverlapDetected`, applies retry/backoff, and surfaces 503 on budget exhaustion.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py	Async provider mirrors the same `_OverlapDetected` retry/backoff behavior via `await asyncio.sleep`.
sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py	Adds sync-focused unit/e2e tests for overlap retry/backoff and persistent-overlap 503 surfacing.
sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit_async.py	Adds async equivalents plus an incremental-overlap regression test for `_IncrementalMergeFailed` conversion.
sdk/cosmos/azure-cosmos/tests/routing/test_collection_routing_map.py	Adds routing-map builder regression tests for the three observed overlap modes and improved overlap diagnostics.

xinlian12 · 2026-05-22T23:08:34Z

✅ Review complete (52:47)

Posted 4 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-23T03:40:59Z

/azp run python - cosmos - tests

dibahlfi · 2026-05-23T03:41:09Z

@sdkReviewAgent-2

azure-pipelines · 2026-05-23T03:41:14Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-23T04:33:17Z

✅ Review complete (51:24)

Posted 2 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-23T04:51:33Z

/azp run python - cosmos - tests

dibahlfi · 2026-05-23T04:51:37Z

@sdkReviewAgent-2

azure-pipelines · 2026-05-23T04:51:47Z

Azure Pipelines successfully started running 1 pipeline(s).

xinlian12 · 2026-05-25T14:33:12Z

✅ Review complete (40:25)

Posted 1 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-27T01:03:59Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-27T01:04:14Z

Azure Pipelines successfully started running 1 pipeline(s).

…etry tests Two emulator-pipeline failures: 1) routing_map_provider.py / aio: change _shared_cache_lock from Lock to RLock. `__init__` holds this lock while indexing _shared_routing_map_cache; indexing can synchronously trigger GC of a prior PartitionKeyRangeCache on the same thread (e.g. MagicMock attribute access via _mock_set_magics, or any production scenario where GC sweeps a prior instance during __init__). The swept instance's __del__ -> release() re-acquires the same lock -> deadlock under non-reentrant Lock. Adds regression tests in test_shared_pk_range_cache{,_async}.py that hang under Lock and pass under RLock. 2) test_service_retry_policies{,_async}.py: when host points at the emulator (localhost/127.0.0.1), _setup_read_regions / _setup_write_regions now route every 'region' to the real emulator URL instead of synthesizing localhost-westus:8081 etc. After the perf change keyed the shared pkrange cache by endpoint URL and the testing-gap commit started calling update_location_cache(), the SDK actually routed pkranges/ReadFeed to the synthetic per-region URLs, which fail DNS resolution. The real ServiceRequestError surfaced before the mock could run, leaving mf.counter at 0 (sync) or drifting (async). Mapping all regions to the actual host keeps both the routing-map cache valid and the request reachable. Also tightens test_per_partition_circuit_breaker_mm_async to skip when the recovery-phase precondition isn't met instead of asserting a flaky timing window.

dibahlfi · 2026-05-27T04:42:42Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-27T04:42:57Z

Azure Pipelines successfully started running 1 pipeline(s).

dibahlfi · 2026-05-27T04:57:34Z

@sdkReviewAgent-2

xinlian12 · 2026-05-27T14:29:59Z

✅ Review complete (32:07)

Posted 3 inline comment(s).

_{Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage}

dibahlfi · 2026-05-27T16:39:45Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-27T16:40:02Z

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3

LGTM

kushagraThapar

LGTM, thanks @dibahlfi !

dibahlfi · 2026-05-27T22:48:14Z

also not going to change the code for offline status as gateway is actually rmeoving offlined ranges and we dont have the same check in other SDKs.

dibahlfi · 2026-05-27T23:28:28Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-27T23:28:43Z

Azure Pipelines successfully started running 1 pipeline(s).

…fra issue on slit pipelines.

dibahlfi · 2026-05-28T06:48:42Z

/azp run python - cosmos - tests

azure-pipelines · 2026-05-28T06:48:56Z

Azure Pipelines successfully started running 1 pipeline(s).

fixing ValueError bug

34447a0

dibahlfi requested a review from a team as a code owner May 22, 2026 22:14

Copilot AI review requested due to automatic review settings May 22, 2026 22:14

Copilot started reviewing on behalf of dibahlfi May 22, 2026 22:15 View session

github-actions Bot added the Cosmos label May 22, 2026

github-project-automation Bot added this to CosmosDB Python Eco-System May 22, 2026

Copilot AI reviewed May 22, 2026

View reviewed changes

xinlian12 reviewed May 22, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py

xinlian12 reviewed May 22, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py

xinlian12 reviewed May 22, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py

xinlian12 reviewed May 22, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Outdated

dibahlfi changed the title ~~fixing ValueError bug~~ Untyped ValueError/AssertionError bug May 23, 2026

fixing copilot comments

15a1c68

xinlian12 reviewed May 23, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit_async.py

xinlian12 reviewed May 23, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py

fixing tests

d439dad

xinlian12 reviewed May 25, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py

fixing test

2a52bdc

dibahlfi added 2 commits May 26, 2026 22:48

fixing testing gap

6f3494c

xinlian12 reviewed May 27, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py Outdated

xinlian12 reviewed May 27, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_retry_utility.py

xinlian12 reviewed May 27, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Outdated

addressing pylint errors

668cfc4

tvaron3 reviewed May 27, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/dev_requirements.txt Outdated

tvaron3 reviewed May 27, 2026

View reviewed changes

Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py

tvaron3 approved these changes May 27, 2026

View reviewed changes

kushagraThapar approved these changes May 27, 2026

View reviewed changes

adding new RLock specific tests

dbe2df9

tvaron3 approved these changes May 28, 2026

View reviewed changes

fixing two dormant issues which got explosed by rlock and the test in…

7e19f16

…fra issue on slit pipelines.

simorenoh approved these changes May 28, 2026

View reviewed changes

simorenoh merged commit 779c8e1 into main May 28, 2026
58 checks passed

simorenoh deleted the users/dibahl/pkrangevalue-error-fix branch May 28, 2026 15:44

github-project-automation Bot moved this to Done in CosmosDB Python Eco-System May 28, 2026

Conversation

dibahlfi commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dibahlfi commented May 22, 2026

Uh oh!

dibahlfi commented May 22, 2026

Uh oh!

azure-pipelines Bot commented May 22, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xinlian12 commented May 22, 2026

Uh oh!

dibahlfi commented May 23, 2026

Uh oh!

dibahlfi commented May 23, 2026

Uh oh!

azure-pipelines Bot commented May 23, 2026

Uh oh!

Uh oh!

Uh oh!

xinlian12 commented May 23, 2026

Uh oh!

dibahlfi commented May 23, 2026

Uh oh!

dibahlfi commented May 23, 2026

Uh oh!

azure-pipelines Bot commented May 23, 2026

Uh oh!

Uh oh!

xinlian12 commented May 25, 2026

Uh oh!

dibahlfi commented May 27, 2026

Uh oh!

azure-pipelines Bot commented May 27, 2026

Uh oh!

dibahlfi commented May 27, 2026

Uh oh!

azure-pipelines Bot commented May 27, 2026

Uh oh!

dibahlfi commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xinlian12 commented May 27, 2026

Uh oh!

dibahlfi commented May 27, 2026

Uh oh!

azure-pipelines Bot commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

tvaron3 left a comment

Choose a reason for hiding this comment

Uh oh!

kushagraThapar left a comment

Choose a reason for hiding this comment

Uh oh!

dibahlfi commented May 27, 2026

Uh oh!

dibahlfi commented May 27, 2026

Uh oh!

azure-pipelines Bot commented May 27, 2026

Uh oh!

dibahlfi commented May 28, 2026

Uh oh!

azure-pipelines Bot commented May 28, 2026

dibahlfi commented May 22, 2026 •

edited

Loading