Skip to content

Untyped ValueError/AssertionError bug#47091

Merged
simorenoh merged 14 commits into
mainfrom
users/dibahl/pkrangevalue-error-fix
May 28, 2026
Merged

Untyped ValueError/AssertionError bug#47091
simorenoh merged 14 commits into
mainfrom
users/dibahl/pkrangevalue-error-fix

Conversation

@dibahlfi

@dibahlfi dibahlfi commented May 22, 2026

Copy link
Copy Markdown
Member

This PR fixes two related crash paths in the partition key range cache that can happen when the gateway returns a transiently inconsistent paginated /pkranges snapshot:

  1. ValueError("Ranges overlap ...")
  2. AssertionError("code bug: returned overlapping ranges ... is empty")

Both are caused by the same class of transient metadata inconsistency, but with different shapes:

  • overlap: two ranges claim the same key space
  • gap: no range claims part of key space

Why this change is needed

We’ve seen this in a real high-scale scenario (very high partition count, long-running async scan).
The SDK previously handled these inconsistencies asymmetrically:

  • overlap could escape as a hard ValueError
  • gap could flow through as an incomplete map and later crash in SmartRoutingMapProvider with the empty-overlap assertion

So transient metadata blips could fail user workloads instead of being treated as retryable.

What changed

  • Unified handling for full-load transient snapshot inconsistencies (overlap + gap):
  • detect inconsistency
  • retry with bounded jittered backoff
  • if still inconsistent after retry budget, raise typed CosmosHttpResponseError(status_code=503) so existing retry policy can absorb it
  • Kept incremental-path safety:
  • overlap from incremental merge is converted into incremental-merge failure/fallback behavior (with guarded conversion so unrelated ValueErrors still surface normally)
  • Added dedup-by-range-id before validation to tolerate duplicate IDs across paginated snapshots from slightly skewed gateway views.
  • Applied consistently in sync and async routing map providers.

Reviewer-relevant behavior change
Before:

  • transient overlap/gap could surface as internal exceptions (ValueError / assertion)

After:

  • transient inconsistency is retried internally
  • persistent inconsistency surfaces as retryable 503 (instead of non-retryable internal crash)

Retry budget and jittered backoff
When the gateway returns an inconsistent /pkranges snapshot, it's almost always because a partition split is mid-propagation across the gateway's internal caches — different gateway nodes see slightly different partition layouts for a few hundred milliseconds, and a request that lands on a lagging node sees overlap or gap. A short pause lets the gateway catch up to itself. This is different from the usual "too many requests, back off and spread out" scenario, so the retry schedule is tuned to always wait a minimum amount of time before retrying (the floor) rather than risk retrying so fast we just see the same stale snapshot again..

4 attempts total (3 sleeps before surfacing 503). Worst-case cumulative blocking time is ~1.4 s; expected ~0.775 s when all three retries occur.
Exponential schedule with a base of 0.2 s: deterministic upper bounds are 0.2 s, 0.4 s, 0.8 s.
Per-sleep cap of 2.0 s so a future bump to the attempt count cannot accidentally produce multi-second blocking sleeps.
Floored full jitter: each sleep is drawn uniformly from [floor, upper] where floor = min(0.05 s, upper / 4).
The non-zero floor eliminates the near-zero-sleep tail of pure full jitter, which would otherwise burn a retry attempt before the gateway has had any time to converge.
The uniform distribution across [floor, upper] (rather than max(floor, uniform(0, upper))) preserves full jitter's fleet-wide herd dispersion without creating a probability spike at the floor.
The upper / 4 clamp keeps the jitter range at ≥75% of the deterministic upper bound on the smallest attempts, so the floor never collapses an attempt into a near-constant wait.

Scope
This PR is intentionally scoped to PK-range cache resilience for transient /pkranges inconsistency handling; it does not change public API surface.

@dibahlfi dibahlfi requested a review from a team as a code owner May 22, 2026 22:14
Copilot AI review requested due to automatic review settings May 22, 2026 22:14
@dibahlfi

Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@dibahlfi

Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a Cosmos routing-map construction failure where transient, paginated /pkranges snapshot inconsistencies could surface as a bare ValueError("Ranges overlap") (or be swallowed into empty results), by converting overlap into a typed sentinel + bounded retry/backoff and ultimately surfacing a transient CosmosHttpResponseError(503) when the retry budget is exhausted.

Changes:

  • Introduces _OverlapDetected as a dedicated overlap sentinel and improves overlap diagnostics in CollectionRoutingMap.is_complete_set_of_range.
  • Adds a shared overlap retry/backoff policy helper (with jitter) and integrates it into both sync and async routing-map providers.
  • Adds regression + end-to-end tests covering transient overlap recovery, persistent overlap → 503, and incremental-merge overlap fallback behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py Adds overlap sentinel, dedups full-load ranges by id, converts overlap ValueError into _OverlapDetected, and improves overlap error messages.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Centralizes overlap retry budget/backoff policy (with jitter) and converts incremental-merge overlap ValueError into _IncrementalMergeFailed.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Sync provider catches _OverlapDetected, applies retry/backoff, and surfaces 503 on budget exhaustion.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py Async provider mirrors the same _OverlapDetected retry/backoff behavior via await asyncio.sleep.
sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py Adds sync-focused unit/e2e tests for overlap retry/backoff and persistent-overlap 503 surfacing.
sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit_async.py Adds async equivalents plus an incremental-overlap regression test for _IncrementalMergeFailed conversion.
sdk/cosmos/azure-cosmos/tests/routing/test_collection_routing_map.py Adds routing-map builder regression tests for the three observed overlap modes and improved overlap diagnostics.

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py Outdated
@xinlian12

Copy link
Copy Markdown
Member

Review complete (52:47)

Posted 4 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi dibahlfi changed the title fixing ValueError bug Untyped ValueError/AssertionError bug May 23, 2026
@dibahlfi

Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@dibahlfi

Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12

Copy link
Copy Markdown
Member

Review complete (51:24)

Posted 2 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi

Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@dibahlfi

Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py
@xinlian12

Copy link
Copy Markdown
Member

Review complete (40:25)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi

Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

dibahlfi added 2 commits May 26, 2026 22:48
…etry tests

Two emulator-pipeline failures:

1) routing_map_provider.py / aio: change _shared_cache_lock from Lock to RLock. `__init__` holds this lock while indexing _shared_routing_map_cache; indexing can synchronously trigger GC of a prior PartitionKeyRangeCache on the same thread (e.g. MagicMock attribute access via _mock_set_magics, or any production scenario where GC sweeps a prior instance during __init__). The swept instance's __del__ -> release() re-acquires the same lock -> deadlock under non-reentrant Lock. Adds regression tests in test_shared_pk_range_cache{,_async}.py that hang under Lock and pass under RLock.

2) test_service_retry_policies{,_async}.py: when host points at the emulator (localhost/127.0.0.1), _setup_read_regions / _setup_write_regions now route every 'region' to the real emulator URL instead of synthesizing localhost-westus:8081 etc. After the perf change keyed the shared pkrange cache by endpoint URL and the testing-gap commit started calling update_location_cache(), the SDK actually routed pkranges/ReadFeed to the synthetic per-region URLs, which fail DNS resolution. The real ServiceRequestError surfaced before the mock could run, leaving mf.counter at 0 (sync) or drifting (async). Mapping all regions to the actual host keeps both the routing-map cache valid and the request reachable.

Also tightens test_per_partition_circuit_breaker_mm_async to skip when the recovery-phase precondition isn't met instead of asserting a flaky timing window.
@dibahlfi

Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi

Copy link
Copy Markdown
Member Author

@sdkReviewAgent-2

Comment thread sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_retry_utility.py
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py Outdated
@xinlian12

Copy link
Copy Markdown
Member

Review complete (32:07)

Posted 3 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@dibahlfi

Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

Comment thread sdk/cosmos/azure-cosmos/dev_requirements.txt Outdated

@tvaron3 tvaron3 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kushagraThapar kushagraThapar left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @dibahlfi !

@dibahlfi

Copy link
Copy Markdown
Member Author

also not going to change the code for offline status as gateway is actually rmeoving offlined ranges and we dont have the same check in other SDKs.

@dibahlfi

Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@dibahlfi

Copy link
Copy Markdown
Member Author

/azp run python - cosmos - tests

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@simorenoh simorenoh merged commit 779c8e1 into main May 28, 2026
58 checks passed
@simorenoh simorenoh deleted the users/dibahl/pkrangevalue-error-fix branch May 28, 2026 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants