cosmos: fix partition-key equality/IN query panic (#4574)#4638
cosmos: fix partition-key equality/IN query panic (#4574)#4638ananth7592 wants to merge 4 commits into
Conversation
PR Azure#4588 renamed PartitionFailoverConfig to PartitionFailoverOptions (now private fields, u32 thresholds) but left test helpers in operation_pipeline.rs referencing the old name, breaking the crate test build. Re-add a #[cfg(test)] alias and use the public getter (write_failure_threshold() as i32) so the tests compile again. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A cross-partition query filtering on the partition key with an equality or IN predicate (e.g. `WHERE c.pk = @pk`) made the gateway return a closed point EPK range [X, X]. The planner had no point representation, so it built a half-open [X, X) range that collapsed to the empty set; intersecting it with topology returned None and the .expect() panicked the query worker, hanging the caller indefinitely. Add a dedicated FeedRangeRepr::EpkPoint variant for a bare hashed EPK (no logical partition key) and make routing point-aware: - query_range_to_feed_range emits EpkPoint for a closed point. - Topology resolves a point by containment (min <= X < max) to its owning partition; the request targets the whole partition by partition-key-range id and lets the SQL predicate filter (matching the .NET SDK), so no start/end-epk header is sent. - Co-located IN values de-duplicate to one request per partition, preventing duplicate documents. - FeedRange::overlaps is point-aware, so a point can never masquerade as an empty interval. Per review feedback, inclusivity flags are not threaded through the driver; the single legitimate point shape gets its own variant instead. As defense-in-depth, the planner's topology-intersection .expect() now returns a CosmosError so any future unservable plan surfaces as an error instead of a worker-thread panic. Fixes Azure#4574. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes a worker-thread panic (and resulting caller hang) in the Cosmos driver when a cross-partition query filters on the partition key with an equality or IN predicate (e.g. WHERE c.id IN (@a, @b)). The gateway returns a closed point EPK range [X, X] for such predicates, which the half-open FeedRange model collapsed to the empty set [X, X), causing intersect_feed_ranges(...).expect(...) to panic. The fix introduces a first-class point representation and makes routing point-aware, matching the .NET SDK's "EpkRange spans exactly one physical partition" behavior.
Changes:
- Adds a new
FeedRangeRepr::EpkPointvariant with point-awareoverlaps/containment, plus a cache lookup (resolve_range_containing) and topology branch that route a point to its owning partition bymin <= X < maxcontainment. - Makes the planner emit whole-partition requests for points (pk-range-id only, no
start/end-epk), de-duplicates co-locatedINvalues, and replaces the two panicking.expect()calls with a newCLIENT_QUERY_PLAN_RANGE_NOT_COVERED_BY_TOPOLOGY(500/20307) error. - Includes a separate commit adding a
#[cfg(test)]PartitionFailoverConfigalias to fix a pre-existing test-build break from #4588.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
models/feed_range.rs |
New EpkPoint variant, epk_point/as_epk_point/single_epk accessors, point-aware overlaps, and tests. |
error/cosmos_status.rs |
New CLIENT_QUERY_PLAN_RANGE_NOT_COVERED_BY_TOPOLOGY sub-status (20307) and 500 status pairing. |
driver/routing/partition_endpoint_state.rs |
Test-only compatibility alias re-introducing PartitionFailoverConfig to fix #4588 test breakage. |
driver/pipeline/operation_pipeline.rs |
Test helper updated to call write_failure_threshold() accessor with as i32 cast. |
driver/dataflow/topology.rs |
Routes bare EPK points via containment vs. interval overlap; adds boundary test. |
driver/dataflow/query_plan.rs |
Updates QueryRange comment to note is_max_inclusive is now honored. |
driver/dataflow/planner.rs |
Point-aware fan-out + de-dup, query_range_to_feed_range point conversion, error helper, tests. |
driver/cache/partition_key_range_cache.rs |
Adds resolve_range_containing for point containment resolution. |
azure_data_cosmos_driver/CHANGELOG.md |
Adds the #4574 bug-fix entry. |
The change is well-documented and thoroughly tested (fresh-path routing, de-dup, boundary resolution, and a regression test preserving the logical-partition path). My comments are all minor: a changelog link that deviates from the /pull/ convention, reducible branch duplication in plan_fresh, and a missing test for the resume-path point handling. Because the change touches core partition-routing/planning logic with subtle correctness implications (boundary containment, de-duplication, resume semantics) and alters a pub API (overlaps) consumed elsewhere, it warrants human review.
- Add resume-path regression tests for IN-predicate point routing: one
where the cursor has drained the first point's partition (clipping),
one where co-located IN points dedupe while carrying the saved
continuation. Confirms the resume path neither drops nor duplicates a
partition.
- Refactor plan_fresh to compute the request range first, then a single
RequestTarget construction shared by the point and half-open branches
(mirrors plan_resume; no behavior change).
- Point-aware diagnostics: topology-not-overlapping error now renders a
single-EPK point as `point X` instead of `[X, X)` (which reads as
empty and hides the equality predicate shape).
- Document the EpkPoint vs degenerate Range{min==max} duality on
single_epk, and the assumption that the gateway never emits a
non-inclusive min==max range (the structured error is the safety net).
- CHANGELOG entry links to Azure/pull/4638 (repo convention).
- Add a removal TODO on the #[cfg(test)] PartitionFailoverConfig alias.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
Fix should just force normalization to EPK (MinInclusive and MaxExclusive)
NaluTripician
left a comment
There was a problem hiding this comment.
Reviewed this end-to-end (traced the change through feed_range.rs, planner.rs, topology.rs, the routing-map cache, and effective_partition_key.rs). Nice fix — giving the bare hashed point its own EpkPoint variant instead of overloading a degenerate [X, X) range is the right call, the containment routing matches .NET, and converting the worker-thread .expect() into a propagating CosmosError genuinely unblocks the caller (it rides the same ? path that validate_query_plan already uses, so it reaches the caller rather than dying on the worker thread).
No blocking issues. A few things I confirmed are safe so they don't get re-litigated:
- Hash V1/V2 is a non-issue here — the point EPK is the gateway's already-hashed hex string, compared lexically against partition bounds in the same space; this path never re-hashes.
- Boundary resolution uses
min <= X < maxconsistently withresolve_overlapping_rangesandoverlaps; a point at a partitionminresolves to the upper partition everywhere. - No remaining worker-thread panics in the planner routing path — both
.expect()sites are gone and the point branch short-circuits beforeintersect_feed_ranges. - Resume path only ever operates on the resolved half-open partition range, so
EpkPointnever leaks into cursor/coverage math.
Three things worth addressing, none blocking:
1. (please confirm before merge) Dedup only guards the point branch. Half-open ranges never consult point_targeted_partitions. If the gateway can ever emit a single plan containing both a closed point and an overlapping half-open range that resolve to the same physical partition (e.g. WHERE c.pk = @x OR c.pk > @y), the point emits a whole-partition request and the half-open emits a clipped sub-range request — both carrying the full SQL predicate — and the same documents come back twice. Inline comment on the exact spot. If that mixed shape is provably unreachable, a one-line invariant assertion/comment would lock it in (same defensiveness you already added for the degenerate-range case).
2. (minor, latent) EpkPoint doesn't round-trip through serialization. to_json serializes via min_inclusive()/max_exclusive(), so a point becomes {min:X, max:X}, and from_json rebuilds it as a degenerate Range{X,X} (not an EpkPoint) — the original empty-set shape. It's safe today only because the planner serializes the resolved partition's half-open range, never the EpkPoint itself. A debug_assert!/comment on to_json guarding against serializing a point would stop a future to_string() from silently corrupting it.
3. (minor) is_subset_of wasn't made point-aware the way overlaps was — latent inconsistency (it'd treat the exclusive upper bound as contained for a point). No production caller passes a point today, so just future-proofing.
Test-wise the added cases genuinely prove the headline claims (no-epk-header whole-partition targeting, co-located dedup, cross-partition one-each, the min-boundary resolution the old code missed, the logical-PK regression, both resume variants). Gaps that'd be nice to add: empty IN list, IN across 3+ partitions, and — if reachable — the mixed point/half-open plan from #1.
Address review feedback on Azure#4638: rather than giving a gateway equality / `IN` point its own `FeedRange` shape, normalize the closed point `[X, X]` to the non-empty half-open range `[X, successor(X))` at the single point of entry (`query_range_to_feed_range`). The range then flows through the existing `[min, max)` routing, intersection, and overlap paths unchanged, so the empty-set collapse that panicked the query worker (Azure#4574) cannot occur and the `[min, max)` invariant holds universally. - Add `EffectivePartitionKey::successor()`: a big-endian byte increment (decode hex, increment the right-most non-`0xFF` byte, carry) that forms the exclusive upper bound. This mirrors the Java SDK's `FeedRangeInternal.addToEffectivePartitionKey(_, +1)`. The per-component leading-nibble mask (`<= 0x3F`) guarantees a hashed EPK is never all-`FF`, so the increment is width-preserving and never carries out of the width, including across hierarchical (MultiHash) component boundaries. - Remove the `FeedRangeRepr::EpkPoint` variant and its machinery (`epk_point`, `as_epk_point`, `single_epk`, point-aware `overlaps`), the topology point-containment branch, and the now-unused `PartitionKeyRangeCache::resolve_range_containing`. This eliminates the transitional "two spellings of a point" duality (explicit point vs. degenerate `Range { min == max }`). Behavior change: co-located `IN` values are no longer coalesced into one whole-partition request; each value is routed as its own `[X, successor(X))` effective-partition-key sub-range request. Each request is bounded to its own EPK, so results are not duplicated. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
A cross-partition query that filters on the partition key with an equality or
INpredicate (e.g.SELECT * FROM c WHERE c.pk = @pk, or the original reproWHERE c.id IN (@a, @b)) panicked the query worker and hung the caller indefinitely:The partition key is in the SQL text (not supplied as a request-option partition key), so this is the predicate-routing path, distinct from the already-working
FeedScope::Partitionpath.Root cause
The gateway query plan returns a closed point EPK range
[X, X](isMinInclusive == isMaxInclusive == true) for any partition-key equality (INis just N such points).FeedRangeis half-open[min, max), so the point became[X, X)— the empty set.intersect_feed_rangesrequiresmin < max, returnedNone, and the.expect()panicked. Because the panic is on the worker thread, no completion is posted and the caller hangs.Fix
Per review feedback (thanks @FabianMeiswinkel), keep
[MinInclusive, MaxExclusive)as the single, universal feed-range representation and normalize the closed point into it, rather than introducing a distinct point shape:query_range_to_feed_rangenormalizes a closed point[X, X]to the non-empty half-open range[X, successor(X))at the single point of entry. It then flows through the existing[min, max)routing, intersection, and overlap paths unchanged, so the empty-set collapse cannot occur.EffectivePartitionKey::successor()forms the exclusive upper bound by a big-endian byte increment (decode hex, increment the right-most non-0xFFbyte, carry). This mirrors the Java SDK'sFeedRangeInternal.addToEffectivePartitionKey(_, +1). Because each hash component's leading nibble is masked to<= 0x3F, a hashed EPK is never all-FF, so the increment is width-preserving and never carries out of the width — including across hierarchical (MultiHash) component boundaries..expect()remains a returnedCosmosError(CLIENT_QUERY_PLAN_RANGE_NOT_COVERED_BY_TOPOLOGY) so any future unservable plan surfaces as an error rather than a worker-thread panic.This removes the transitional machinery from the earlier revision of this PR: the
FeedRangeRepr::EpkPointvariant, its accessors (epk_point,as_epk_point,single_epk), point-awareoverlaps, the topology point-containment branch, and the now-unusedPartitionKeyRangeCache::resolve_range_containing. The "two spellings of a point" duality (explicit point vs. degenerateRange { min == max }) is gone.Cross-SDK verification
The approach matches how the Java SDK normalizes a
FeedRangepoint:Range.getPointRange(X)thennormalizeRange()thenaddToEffectivePartitionKey(max, +1)produces[X, X+1), computed by the same big-endian byte increment. (Java's query path keeps a closed[X, X]and routes by partition-key-range id because it has the raw PK value to send in thex-ms-documentdb-partitionkeyheader; our gateway query plan only returns the irreversible hashed EPK, so the FeedRange normalization path is the applicable precedent.) Appending"FF"remains reserved for hierarchical prefix ranges and is unchanged.Behavior change
Co-located
INvalues are no longer coalesced into a single whole-partition request. Each value is routed as its own[X, successor(X))effective-partition-key sub-range request (x-ms-start-epk/x-ms-end-epk). Because each request is bounded to its own EPK, results are not duplicated. (The previous revision coalesced co-located points to one whole-partition request; that optimization is dropped in favor of the simpler universal representation.)Testing & coverage
Unit (added/updated):
successor_*(5 tests) onEffectivePartitionKey— last-digit increment, carry across trailing0xFF, strict-greater + width-preserving, 64-char HPK width,MINsuccessor.query_range_to_feed_range_normalizes_closed_point— closed[X, X]to[X, successor(X)).closed_point_query_range_routes_to_epk_sub_range— point to normalized[30, 31)clipped to the owning partition.in_predicate_colocated_points_emit_one_request_each/..._across_partitions_emit_one_request_each—INto one EPK-scoped request per value.resolves_normalized_point_to_owning_partition_including_boundary— normalized-range resolution incl. the partition-boundary case.resume_in_predicate_*— resume path with per-sub-range continuation tokens.plans_logical_partition_pipeline_for_partition_scoped_query— regression: the complete-PKFeedScope::Partitionpath still routes via the logical-PK header, unchanged.Full suites (green): 1861 driver lib unit tests; 43
in_memory_emulatorintegration tests; 93 SDK crate tests;cargo clippy --all-targets,cargo doc(-D warnings),cargo fmt --check, and cSpell all clean.Real Cosmos account (manual, against
/pkcontainer; not committed):The two predicate queries that previously panicked now return correct results; the logical-PK path is unchanged.
Fixes #4574.