Skip to content

cosmos: fix partition-key equality/IN query panic (#4574)#4638

Open
ananth7592 wants to merge 4 commits into
Azure:mainfrom
ananth7592:fix/4574-partition-key-query-planner-panic
Open

cosmos: fix partition-key equality/IN query panic (#4574)#4638
ananth7592 wants to merge 4 commits into
Azure:mainfrom
ananth7592:fix/4574-partition-key-query-planner-panic

Conversation

@ananth7592

@ananth7592 ananth7592 commented Jun 20, 2026

Copy link
Copy Markdown
Member

Problem

A cross-partition query that filters on the partition key with an equality or IN predicate (e.g. SELECT * FROM c WHERE c.pk = @pk, or the original repro WHERE c.id IN (@a, @b)) panicked the query worker and hung the caller indefinitely:

thread 'tokio-rt-worker' panicked at .../planner.rs:
topology provider must return ranges that overlap the query plan EPK range

The partition key is in the SQL text (not supplied as a request-option partition key), so this is the predicate-routing path, distinct from the already-working FeedScope::Partition path.

Root cause

The gateway query plan returns a closed point EPK range [X, X] (isMinInclusive == isMaxInclusive == true) for any partition-key equality (IN is just N such points). FeedRange is half-open [min, max), so the point became [X, X) — the empty set. intersect_feed_ranges requires min < max, returned None, and the .expect() panicked. Because the panic is on the worker thread, no completion is posted and the caller hangs.

Fix

Per review feedback (thanks @FabianMeiswinkel), keep [MinInclusive, MaxExclusive) as the single, universal feed-range representation and normalize the closed point into it, rather than introducing a distinct point shape:

  • query_range_to_feed_range normalizes a closed point [X, X] to the non-empty half-open range [X, successor(X)) at the single point of entry. It then flows through the existing [min, max) routing, intersection, and overlap paths unchanged, so the empty-set collapse cannot occur.
  • EffectivePartitionKey::successor() forms the exclusive upper bound by a big-endian byte increment (decode hex, increment the right-most non-0xFF byte, carry). This mirrors the Java SDK's FeedRangeInternal.addToEffectivePartitionKey(_, +1). Because each hash component's leading nibble is masked to <= 0x3F, a hashed EPK is never all-FF, so the increment is width-preserving and never carries out of the width — including across hierarchical (MultiHash) component boundaries.
  • As defense-in-depth, the planner's topology-intersection .expect() remains a returned CosmosError (CLIENT_QUERY_PLAN_RANGE_NOT_COVERED_BY_TOPOLOGY) so any future unservable plan surfaces as an error rather than a worker-thread panic.

This removes the transitional machinery from the earlier revision of this PR: the FeedRangeRepr::EpkPoint variant, its accessors (epk_point, as_epk_point, single_epk), point-aware overlaps, the topology point-containment branch, and the now-unused PartitionKeyRangeCache::resolve_range_containing. The "two spellings of a point" duality (explicit point vs. degenerate Range { min == max }) is gone.

Cross-SDK verification

The approach matches how the Java SDK normalizes a FeedRange point: Range.getPointRange(X) then normalizeRange() then addToEffectivePartitionKey(max, +1) produces [X, X+1), computed by the same big-endian byte increment. (Java's query path keeps a closed [X, X] and routes by partition-key-range id because it has the raw PK value to send in the x-ms-documentdb-partitionkey header; our gateway query plan only returns the irreversible hashed EPK, so the FeedRange normalization path is the applicable precedent.) Appending "FF" remains reserved for hierarchical prefix ranges and is unchanged.

Behavior change

Co-located IN values are no longer coalesced into a single whole-partition request. Each value is routed as its own [X, successor(X)) effective-partition-key sub-range request (x-ms-start-epk / x-ms-end-epk). Because each request is bounded to its own EPK, results are not duplicated. (The previous revision coalesced co-located points to one whole-partition request; that optimization is dropped in favor of the simpler universal representation.)

Testing & coverage

Unit (added/updated):

  • successor_* (5 tests) on EffectivePartitionKey — last-digit increment, carry across trailing 0xFF, strict-greater + width-preserving, 64-char HPK width, MIN successor.
  • query_range_to_feed_range_normalizes_closed_point — closed [X, X] to [X, successor(X)).
  • closed_point_query_range_routes_to_epk_sub_range — point to normalized [30, 31) clipped to the owning partition.
  • in_predicate_colocated_points_emit_one_request_each / ..._across_partitions_emit_one_request_eachIN to one EPK-scoped request per value.
  • resolves_normalized_point_to_owning_partition_including_boundary — normalized-range resolution incl. the partition-boundary case.
  • resume_in_predicate_* — resume path with per-sub-range continuation tokens.
  • plans_logical_partition_pipeline_for_partition_scoped_queryregression: the complete-PK FeedScope::Partition path still routes via the logical-PK header, unchanged.

Full suites (green): 1861 driver lib unit tests; 43 in_memory_emulator integration tests; 93 SDK crate tests; cargo clippy --all-targets, cargo doc (-D warnings), cargo fmt --check, and cSpell all clean.

Real Cosmos account (manual, against /pk container; not committed):

>>> RUNNING [sanity: full-container scan]: SELECT * FROM c
<<< OK     [sanity: full-container scan]: returned 5 document(s)
>>> RUNNING [#4574 repro: WHERE c.pk = @pk (equality predicate)]: SELECT * FROM c WHERE c.pk = @pk
<<< OK     [#4574 repro: WHERE c.pk = @pk (equality predicate)]: returned 2 document(s)
>>> RUNNING [#4574 repro: WHERE c.pk IN (@a, @b) (IN predicate)]: SELECT * FROM c WHERE c.pk IN (@a, @b)
<<< OK     [#4574 repro: WHERE c.pk IN (@a, @b) (IN predicate)]: returned 4 document(s)
>>> RUNNING [contrast: FeedScope::Partition(alpha) (logical-PK path)]: SELECT * FROM c
<<< OK     [contrast: FeedScope::Partition(alpha) (logical-PK path)]: returned 2 document(s)
issue #4574 reproduction PASSED: predicate + IN queries return correct results without panicking.

The two predicate queries that previously panicked now return correct results; the logical-PK path is unchanged.

Fixes #4574.

ananth7592 and others added 2 commits June 19, 2026 20:13
PR Azure#4588 renamed PartitionFailoverConfig to PartitionFailoverOptions
(now private fields, u32 thresholds) but left test helpers in
operation_pipeline.rs referencing the old name, breaking the crate test
build. Re-add a #[cfg(test)] alias and use the public getter
(write_failure_threshold() as i32) so the tests compile again.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A cross-partition query filtering on the partition key with an equality
or IN predicate (e.g. `WHERE c.pk = @pk`) made the gateway return a
closed point EPK range [X, X]. The planner had no point representation,
so it built a half-open [X, X) range that collapsed to the empty set;
intersecting it with topology returned None and the .expect() panicked
the query worker, hanging the caller indefinitely.

Add a dedicated FeedRangeRepr::EpkPoint variant for a bare hashed EPK
(no logical partition key) and make routing point-aware:

- query_range_to_feed_range emits EpkPoint for a closed point.
- Topology resolves a point by containment (min <= X < max) to its
  owning partition; the request targets the whole partition by
  partition-key-range id and lets the SQL predicate filter (matching the
  .NET SDK), so no start/end-epk header is sent.
- Co-located IN values de-duplicate to one request per partition,
  preventing duplicate documents.
- FeedRange::overlaps is point-aware, so a point can never masquerade as
  an empty interval.

Per review feedback, inclusivity flags are not threaded through the
driver; the single legitimate point shape gets its own variant instead.
As defense-in-depth, the planner's topology-intersection .expect() now
returns a CosmosError so any future unservable plan surfaces as an error
instead of a worker-thread panic.

Fixes Azure#4574.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 20, 2026 03:14
@ananth7592 ananth7592 requested a review from a team as a code owner June 20, 2026 03:14
@github-actions github-actions Bot added the Cosmos The azure_cosmos crate label Jun 20, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a worker-thread panic (and resulting caller hang) in the Cosmos driver when a cross-partition query filters on the partition key with an equality or IN predicate (e.g. WHERE c.id IN (@a, @b)). The gateway returns a closed point EPK range [X, X] for such predicates, which the half-open FeedRange model collapsed to the empty set [X, X), causing intersect_feed_ranges(...).expect(...) to panic. The fix introduces a first-class point representation and makes routing point-aware, matching the .NET SDK's "EpkRange spans exactly one physical partition" behavior.

Changes:

  • Adds a new FeedRangeRepr::EpkPoint variant with point-aware overlaps/containment, plus a cache lookup (resolve_range_containing) and topology branch that route a point to its owning partition by min <= X < max containment.
  • Makes the planner emit whole-partition requests for points (pk-range-id only, no start/end-epk), de-duplicates co-located IN values, and replaces the two panicking .expect() calls with a new CLIENT_QUERY_PLAN_RANGE_NOT_COVERED_BY_TOPOLOGY (500/20307) error.
  • Includes a separate commit adding a #[cfg(test)] PartitionFailoverConfig alias to fix a pre-existing test-build break from #4588.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
models/feed_range.rs New EpkPoint variant, epk_point/as_epk_point/single_epk accessors, point-aware overlaps, and tests.
error/cosmos_status.rs New CLIENT_QUERY_PLAN_RANGE_NOT_COVERED_BY_TOPOLOGY sub-status (20307) and 500 status pairing.
driver/routing/partition_endpoint_state.rs Test-only compatibility alias re-introducing PartitionFailoverConfig to fix #4588 test breakage.
driver/pipeline/operation_pipeline.rs Test helper updated to call write_failure_threshold() accessor with as i32 cast.
driver/dataflow/topology.rs Routes bare EPK points via containment vs. interval overlap; adds boundary test.
driver/dataflow/query_plan.rs Updates QueryRange comment to note is_max_inclusive is now honored.
driver/dataflow/planner.rs Point-aware fan-out + de-dup, query_range_to_feed_range point conversion, error helper, tests.
driver/cache/partition_key_range_cache.rs Adds resolve_range_containing for point containment resolution.
azure_data_cosmos_driver/CHANGELOG.md Adds the #4574 bug-fix entry.

The change is well-documented and thoroughly tested (fresh-path routing, de-dup, boundary resolution, and a regression test preserving the logical-partition path). My comments are all minor: a changelog link that deviates from the /pull/ convention, reducible branch duplication in plan_fresh, and a missing test for the resume-path point handling. Because the change touches core partition-routing/planning logic with subtle correctness implications (boundary containment, de-duplication, resume semantics) and alters a pub API (overlaps) consumed elsewhere, it warrants human review.

Comment thread sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/driver/dataflow/planner.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/driver/dataflow/planner.rs Outdated
- Add resume-path regression tests for IN-predicate point routing: one
  where the cursor has drained the first point's partition (clipping),
  one where co-located IN points dedupe while carrying the saved
  continuation. Confirms the resume path neither drops nor duplicates a
  partition.
- Refactor plan_fresh to compute the request range first, then a single
  RequestTarget construction shared by the point and half-open branches
  (mirrors plan_resume; no behavior change).
- Point-aware diagnostics: topology-not-overlapping error now renders a
  single-EPK point as `point X` instead of `[X, X)` (which reads as
  empty and hides the equality predicate shape).
- Document the EpkPoint vs degenerate Range{min==max} duality on
  single_epk, and the assumption that the gateway never emits a
  non-inclusive min==max range (the structured error is the safety net).
- CHANGELOG entry links to Azure/pull/4638 (repo convention).
- Add a removal TODO on the #[cfg(test)] PartitionFailoverConfig alias.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/models/feed_range.rs Outdated

@FabianMeiswinkel FabianMeiswinkel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix should just force normalization to EPK (MinInclusive and MaxExclusive)

@github-project-automation github-project-automation Bot moved this from Todo to Changes Requested in CosmosDB Rust SDK and Driver Jun 24, 2026

@NaluTripician NaluTripician left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed this end-to-end (traced the change through feed_range.rs, planner.rs, topology.rs, the routing-map cache, and effective_partition_key.rs). Nice fix — giving the bare hashed point its own EpkPoint variant instead of overloading a degenerate [X, X) range is the right call, the containment routing matches .NET, and converting the worker-thread .expect() into a propagating CosmosError genuinely unblocks the caller (it rides the same ? path that validate_query_plan already uses, so it reaches the caller rather than dying on the worker thread).

No blocking issues. A few things I confirmed are safe so they don't get re-litigated:

  • Hash V1/V2 is a non-issue here — the point EPK is the gateway's already-hashed hex string, compared lexically against partition bounds in the same space; this path never re-hashes.
  • Boundary resolution uses min <= X < max consistently with resolve_overlapping_ranges and overlaps; a point at a partition min resolves to the upper partition everywhere.
  • No remaining worker-thread panics in the planner routing path — both .expect() sites are gone and the point branch short-circuits before intersect_feed_ranges.
  • Resume path only ever operates on the resolved half-open partition range, so EpkPoint never leaks into cursor/coverage math.

Three things worth addressing, none blocking:

1. (please confirm before merge) Dedup only guards the point branch. Half-open ranges never consult point_targeted_partitions. If the gateway can ever emit a single plan containing both a closed point and an overlapping half-open range that resolve to the same physical partition (e.g. WHERE c.pk = @x OR c.pk > @y), the point emits a whole-partition request and the half-open emits a clipped sub-range request — both carrying the full SQL predicate — and the same documents come back twice. Inline comment on the exact spot. If that mixed shape is provably unreachable, a one-line invariant assertion/comment would lock it in (same defensiveness you already added for the degenerate-range case).

2. (minor, latent) EpkPoint doesn't round-trip through serialization. to_json serializes via min_inclusive()/max_exclusive(), so a point becomes {min:X, max:X}, and from_json rebuilds it as a degenerate Range{X,X} (not an EpkPoint) — the original empty-set shape. It's safe today only because the planner serializes the resolved partition's half-open range, never the EpkPoint itself. A debug_assert!/comment on to_json guarding against serializing a point would stop a future to_string() from silently corrupting it.

3. (minor) is_subset_of wasn't made point-aware the way overlaps was — latent inconsistency (it'd treat the exclusive upper bound as contained for a point). No production caller passes a point today, so just future-proofing.

Test-wise the added cases genuinely prove the headline claims (no-epk-header whole-partition targeting, co-located dedup, cross-partition one-each, the min-boundary resolution the old code missed, the logical-PK regression, both resume variants). Gaps that'd be nice to add: empty IN list, IN across 3+ partitions, and — if reachable — the mixed point/half-open plan from #1.

Comment thread sdk/cosmos/azure_data_cosmos_driver/src/driver/dataflow/planner.rs Outdated
Address review feedback on Azure#4638: rather than giving a gateway equality /
`IN` point its own `FeedRange` shape, normalize the closed point `[X, X]`
to the non-empty half-open range `[X, successor(X))` at the single point of
entry (`query_range_to_feed_range`). The range then flows through the
existing `[min, max)` routing, intersection, and overlap paths unchanged,
so the empty-set collapse that panicked the query worker (Azure#4574) cannot
occur and the `[min, max)` invariant holds universally.

- Add `EffectivePartitionKey::successor()`: a big-endian byte increment
  (decode hex, increment the right-most non-`0xFF` byte, carry) that forms
  the exclusive upper bound. This mirrors the Java SDK's
  `FeedRangeInternal.addToEffectivePartitionKey(_, +1)`. The per-component
  leading-nibble mask (`<= 0x3F`) guarantees a hashed EPK is never all-`FF`,
  so the increment is width-preserving and never carries out of the width,
  including across hierarchical (MultiHash) component boundaries.
- Remove the `FeedRangeRepr::EpkPoint` variant and its machinery
  (`epk_point`, `as_epk_point`, `single_epk`, point-aware `overlaps`), the
  topology point-containment branch, and the now-unused
  `PartitionKeyRangeCache::resolve_range_containing`. This eliminates the
  transitional "two spellings of a point" duality (explicit point vs.
  degenerate `Range { min == max }`).

Behavior change: co-located `IN` values are no longer coalesced into one
whole-partition request; each value is routed as its own
`[X, successor(X))` effective-partition-key sub-range request. Each request
is bounded to its own EPK, so results are not duplicated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cosmos The azure_cosmos crate

Projects

Status: Changes Requested

Development

Successfully merging this pull request may close these issues.

Cosmos Rust SDK v0.35 still panics for full-container query with partition-key IN predicate

4 participants