Untyped ValueError/AssertionError bug#47091
Conversation
|
@sdkReviewAgent-2 |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
This PR addresses a Cosmos routing-map construction failure where transient, paginated /pkranges snapshot inconsistencies could surface as a bare ValueError("Ranges overlap") (or be swallowed into empty results), by converting overlap into a typed sentinel + bounded retry/backoff and ultimately surfacing a transient CosmosHttpResponseError(503) when the retry budget is exhausted.
Changes:
- Introduces
_OverlapDetectedas a dedicated overlap sentinel and improves overlap diagnostics inCollectionRoutingMap.is_complete_set_of_range. - Adds a shared overlap retry/backoff policy helper (with jitter) and integrates it into both sync and async routing-map providers.
- Adds regression + end-to-end tests covering transient overlap recovery, persistent overlap → 503, and incremental-merge overlap fallback behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/collection_routing_map.py | Adds overlap sentinel, dedups full-load ranges by id, converts overlap ValueError into _OverlapDetected, and improves overlap error messages. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/_routing_map_provider_common.py | Centralizes overlap retry budget/backoff policy (with jitter) and converts incremental-merge overlap ValueError into _IncrementalMergeFailed. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/routing_map_provider.py | Sync provider catches _OverlapDetected, applies retry/backoff, and surfaces 503 on budget exhaustion. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py | Async provider mirrors the same _OverlapDetected retry/backoff behavior via await asyncio.sleep. |
| sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit.py | Adds sync-focused unit/e2e tests for overlap retry/backoff and persistent-overlap 503 surfacing. |
| sdk/cosmos/azure-cosmos/tests/test_routing_map_provider_unit_async.py | Adds async equivalents plus an incremental-overlap regression test for _IncrementalMergeFailed conversion. |
| sdk/cosmos/azure-cosmos/tests/routing/test_collection_routing_map.py | Adds routing-map builder regression tests for the three observed overlap modes and improved overlap diagnostics. |
|
✅ Review complete (52:47) Posted 4 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
/azp run python - cosmos - tests |
|
@sdkReviewAgent-2 |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
✅ Review complete (51:24) Posted 2 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
/azp run python - cosmos - tests |
|
@sdkReviewAgent-2 |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
✅ Review complete (40:25) Posted 1 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…etry tests
Two emulator-pipeline failures:
1) routing_map_provider.py / aio: change _shared_cache_lock from Lock to RLock. `__init__` holds this lock while indexing _shared_routing_map_cache; indexing can synchronously trigger GC of a prior PartitionKeyRangeCache on the same thread (e.g. MagicMock attribute access via _mock_set_magics, or any production scenario where GC sweeps a prior instance during __init__). The swept instance's __del__ -> release() re-acquires the same lock -> deadlock under non-reentrant Lock. Adds regression tests in test_shared_pk_range_cache{,_async}.py that hang under Lock and pass under RLock.
2) test_service_retry_policies{,_async}.py: when host points at the emulator (localhost/127.0.0.1), _setup_read_regions / _setup_write_regions now route every 'region' to the real emulator URL instead of synthesizing localhost-westus:8081 etc. After the perf change keyed the shared pkrange cache by endpoint URL and the testing-gap commit started calling update_location_cache(), the SDK actually routed pkranges/ReadFeed to the synthetic per-region URLs, which fail DNS resolution. The real ServiceRequestError surfaced before the mock could run, leaving mf.counter at 0 (sync) or drifting (async). Mapping all regions to the actual host keeps both the routing-map cache valid and the request reachable.
Also tightens test_per_partition_circuit_breaker_mm_async to skip when the recovery-phase precondition isn't met instead of asserting a flaky timing window.
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@sdkReviewAgent-2 |
|
✅ Review complete (32:07) Posted 3 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
kushagraThapar
left a comment
There was a problem hiding this comment.
LGTM, thanks @dibahlfi !
|
also not going to change the code for offline status as gateway is actually rmeoving offlined ranges and we dont have the same check in other SDKs. |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…fra issue on slit pipelines.
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This PR fixes two related crash paths in the partition key range cache that can happen when the gateway returns a transiently inconsistent paginated /pkranges snapshot:
Both are caused by the same class of transient metadata inconsistency, but with different shapes:
Why this change is needed
We’ve seen this in a real high-scale scenario (very high partition count, long-running async scan).
The SDK previously handled these inconsistencies asymmetrically:
So transient metadata blips could fail user workloads instead of being treated as retryable.
What changed
Reviewer-relevant behavior change
Before:
After:
Retry budget and jittered backoff
When the gateway returns an inconsistent /pkranges snapshot, it's almost always because a partition split is mid-propagation across the gateway's internal caches — different gateway nodes see slightly different partition layouts for a few hundred milliseconds, and a request that lands on a lagging node sees overlap or gap. A short pause lets the gateway catch up to itself. This is different from the usual "too many requests, back off and spread out" scenario, so the retry schedule is tuned to always wait a minimum amount of time before retrying (the floor) rather than risk retrying so fast we just see the same stale snapshot again..
4 attempts total (3 sleeps before surfacing 503). Worst-case cumulative blocking time is ~1.4 s; expected ~0.775 s when all three retries occur.
Exponential schedule with a base of 0.2 s: deterministic upper bounds are 0.2 s, 0.4 s, 0.8 s.
Per-sleep cap of 2.0 s so a future bump to the attempt count cannot accidentally produce multi-second blocking sleeps.
Floored full jitter: each sleep is drawn uniformly from [floor, upper] where floor = min(0.05 s, upper / 4).
The non-zero floor eliminates the near-zero-sleep tail of pure full jitter, which would otherwise burn a retry attempt before the gateway has had any time to converge.
The uniform distribution across [floor, upper] (rather than max(floor, uniform(0, upper))) preserves full jitter's fleet-wide herd dispersion without creating a probability spike at the floor.
The upper / 4 clamp keeps the jitter range at ≥75% of the deterministic upper bound on the smallest attempts, so the floor never collapses an attempt into a near-constant wait.
Scope
This PR is intentionally scoped to PK-range cache resilience for transient /pkranges inconsistency handling; it does not change public API surface.