[WIP] Routing: Fixes HPK partition key query returning empty results during partition split #5850
Closed
ananth7592 wants to merge 2 commits into
… partition split

When the routing map is stale (e.g., after a partition split), GetTargetPartitionKeyRangesAsync could return an empty list instead of null. Unlike null (which triggers a NotFoundException and forces a retry), an empty list silently flows through to EmptyQueryPipelineStage, causing queries with a valid partition key to return zero results without any backend contact.

Two changes:
1. CosmosQueryClientCore.GetTargetPartitionKeyRangesAsync: pass the forceRefresh flag through to TryGetOverlappingRangesAsync so that an explicit force-refresh actually refreshes the routing map cache.
2. CosmosQueryExecutionContextFactory.GetTargetPartitionKeyRangesAsync: when EffectiveRangesForPartitionKey is set (a partition key was specified) but the routing map returns zero overlapping ranges, retry immediately with forceRefresh=true before returning an empty result.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…PK point queries at LengthAware boundaries

Root cause: When useLengthAwareRangeComparer=true and a full 3-component HPK key's 96-char EPK equals TrimEnd(128-char split boundary), the LengthAwareMinComparer places the EPK at the start of PKRange2 (index 1) via exact match. However, Range<string>.CheckOverlapping uses ordinal comparison and rejects PKRange2 because boundary_128 > EPK_96 ordinal (the 128-char boundary is lexicographically greater than its 96-char prefix). This causes the loop to run but add nothing, returning 0 ranges and silently creating an EmptyQueryPipelineStage.

Fix: After the binary-search loop, if no ranges were added but the MinComparer found an exact match (minIndexRaw >= 0) at range[minIndex], fall back to checking range[minIndex-1] with ordinal CheckOverlapping. This handles the mismatch between LengthAware placement (PKRange2) and ordinal overlap checking (PKRange1 is the correct ordinal position, confirmed by production evidence: the document was found in PKRange1 in the 23:10 query).

The fallback only triggers in this specific edge case:
- LengthAware comparers enabled (non-internal SDK builds, default true)
- Full 3-component HPK key whose EPK == TrimEnd(split boundary)
- Ordinal CheckOverlapping returns false for the LengthAware-placed range

Tests added/updated:
- TestGetOverlappingRanges_PostSplitMap_FullHpkEpk_MustNotReturnEmpty: all 4 DataRows now pass including the boundary case (was: LengthAware returned 0)
- TestGetOverlappingRanges_PostSplitMap_EpkAtBoundary_MustNotReturnEmpty: both DataRow(false) and DataRow(true) now pass (was: true FAILED)
- All 18 routing-map tests pass; 2642 other unit tests unaffected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This was referenced May 21, 2026
Member
Author
Not a right fix to do
Problem
During or shortly after a partition split, HPK (Hierarchical Partition Key) point queries (full 3-component partition key) can silently return zero results with no backend contact and no exception.
Observed in production with SDK 3.58.0. Diagnostics confirmed:
Root Cause
Primary bug: LengthAware vs ordinal mismatch in GetOverlappingRanges
After a partition split, the routing map has a 128-char boundary (H1+H2+H3 + 32 trailing zeros). A full 3-component HPK point query produces a 96-char EPK = H1+H2+H3 (no trailing zeros).
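GetOverlappingRanges uses two different comparison semantics:
- Range placement: LengthAwareMinComparer treats EPK_96 as an exact match for boundary_128 and places the query at the start of PKRange2 (index 1).
- Overlap check: Range<string>.CheckOverlapping uses ordinal comparison and rejects PKRange2 because boundary_128 > EPK_96 ordinal.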
The document actually lives in PKRange1 because EPK_96 < boundary_128 in ordinal order, but GetOverlappingRanges never checks PKRange1. The query is therefore routed to EmptyQueryPipelineStage and silently returns an empty result.
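The mismatch can be reproduced outside the SDK with plain strings. The following is a minimal Python sketch, not the SDK's C# code: the hash values are made up, `length_aware_equal` is an assumed simplification of the LengthAware comparison (trailing zeros are treated as insignificant), and `ordinal_overlaps` is a hypothetical stand-in for an ordinal overlap check of a point against a [min, max) range.

```python
# Hypothetical 32-char hashes for the three HPK components (illustrative values only).
H1, H2, H3 = "0A" * 16, "1B" * 16, "2C" * 16

epk_96 = H1 + H2 + H3              # EPK of a full 3-component point query
boundary_128 = epk_96 + "0" * 32   # split boundary stored in the routing map

# Post-split routing map, as (min_inclusive, max_exclusive) pairs.
pk_range_1 = ("", boundary_128)
pk_range_2 = (boundary_128, "FF")


def length_aware_equal(a: str, b: str) -> bool:
    """Assumed length-aware semantics: trailing zeros are not significant."""
    return a.rstrip("0") == b.rstrip("0")


def ordinal_overlaps(rng: tuple, point: str) -> bool:
    """Ordinal (code-point) check of a single point against [min, max)."""
    range_min, range_max = rng
    return range_min <= point < range_max


# Placement step: the EPK looks like an exact match for PKRange2's min.
placed_in_range_2 = length_aware_equal(epk_96, pk_range_2[0])

# Overlap step: ordinally, the 96-char prefix sorts before the 128-char boundary,
# so PKRange2 is rejected and only PKRange1 would accept the EPK.
range_2_accepts = ordinal_overlaps(pk_range_2, epk_96)
range_1_accepts = ordinal_overlaps(pk_range_1, epk_96)
```

In this sketch the placement step selects PKRange2 while the overlap step accepts only PKRange1, which is the combination that leaves the result list empty.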
Secondary bug: forceRefresh not plumbed through
CosmosQueryClientCore.GetTargetPartitionKeyRangesAsync did not pass forceRefresh to TryGetOverlappingRangesAsync, so the stale-cache retry path was a no-op.
Fix
Primary fix: CollectionRoutingMap.GetOverlappingRanges fallback
After the main loop, if no ranges were added AND LengthAwareMinComparer found an exact match (minIndexRaw >= 0) AND minIndex > 0, fall back to checking orderedRanges[minIndex - 1] with ordinal CheckOverlapping. This returns PKRange1 without affecting other query shapes.
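The fallback condition can be sketched as follows. This is an illustrative Python model under stated assumptions, not the actual CollectionRoutingMap code: `find_min_index_raw` is a hypothetical linear stand-in for the binary search with the length-aware comparer (it returns -1 rather than a bitwise complement when nothing matches), ranges are simple (min_inclusive, max_exclusive) pairs, and the query is a single point.

```python
from typing import List, Tuple

Range = Tuple[str, str]  # (min_inclusive, max_exclusive)


def ordinal_overlaps(rng: Range, point: str) -> bool:
    """Ordinal check of a point against [min, max)."""
    return rng[0] <= point < rng[1]


def find_min_index_raw(ordered_ranges: List[Range], point: str) -> int:
    """Hypothetical stand-in for the length-aware search: index of the range
    whose min equals the point once trailing zeros are ignored, else -1."""
    for i, (range_min, _) in enumerate(ordered_ranges):
        if range_min.rstrip("0") == point.rstrip("0"):
            return i
    return -1


def get_overlapping_ranges(ordered_ranges: List[Range], point: str) -> List[Range]:
    min_index_raw = find_min_index_raw(ordered_ranges, point)
    if min_index_raw >= 0:
        min_index = min_index_raw
    else:
        # No exact match: start at the last range whose min is ordinally <= point.
        min_index = max(0, sum(1 for r in ordered_ranges if r[0] <= point) - 1)

    # Main loop: collect ranges from min_index onward that overlap ordinally.
    result = [r for r in ordered_ranges[min_index:] if ordinal_overlaps(r, point)]

    # Fallback: nothing added, exact length-aware match, and a previous range exists.
    if not result and min_index_raw >= 0 and min_index > 0:
        previous = ordered_ranges[min_index - 1]
        if ordinal_overlaps(previous, point):
            result.append(previous)
    return result


# Usage with made-up hashes: a 96-char EPK against a 128-char split boundary.
epk_96 = "0A" * 16 + "1B" * 16 + "2C" * 16
boundary_128 = epk_96 + "0" * 32
routing_map = [("", boundary_128), (boundary_128, "FF")]
boundary_case = get_overlapping_ranges(routing_map, epk_96)
exact_boundary = get_overlapping_ranges(routing_map, boundary_128)
```

In this model the boundary case resolves to the first range through the fallback, while a query whose EPK is the full 128-char boundary still resolves to the second range through the main loop, so the fallback is not reached for that shape.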
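Secondary fix: CosmosQueryClientCore + CosmosQueryExecutionContextFactory
- CosmosQueryClientCore.GetTargetPartitionKeyRangesAsync: passes the forceRefresh flag through to TryGetOverlappingRangesAsync so that an explicit force-refresh actually refreshes the routing map cache.
- CosmosQueryExecutionContextFactory.GetTargetPartitionKeyRangesAsync: when EffectiveRangesForPartitionKey is set (a partition key was specified) but the routing map returns zero overlapping ranges, retries immediately with forceRefresh=true before returning an empty result.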
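The secondary fix (plumbing forceRefresh through to the routing-map lookup, then retrying once when a partition key was specified but no ranges came back) can be modeled roughly as below. Everything here is hypothetical and for illustration only: `RoutingMapCache`, `try_get_overlapping_ranges`, and `get_target_ranges` are invented names, the stale map is simply empty, and the real SDK methods are asynchronous C#.

```python
from typing import List, Tuple

Range = Tuple[str, str]  # (min_inclusive, max_exclusive)


class RoutingMapCache:
    """Hypothetical cache that serves a stale map until asked to refresh."""

    def __init__(self, stale: List[Range], fresh: List[Range]) -> None:
        self._cached = stale
        self._fresh = fresh
        self.refresh_count = 0

    def get(self, force_refresh: bool) -> List[Range]:
        if force_refresh:
            self._cached = self._fresh
            self.refresh_count += 1
        return self._cached


def try_get_overlapping_ranges(
    cache: RoutingMapCache, point: str, force_refresh: bool = False
) -> List[Range]:
    # The flag must reach the cache; dropping it here makes any retry a no-op.
    ranges = cache.get(force_refresh)
    return [r for r in ranges if r[0] <= point < r[1]]


def get_target_ranges(
    cache: RoutingMapCache, point: str, partition_key_specified: bool
) -> List[Range]:
    ranges = try_get_overlapping_ranges(cache, point, force_refresh=False)
    if not ranges and partition_key_specified:
        # A specified partition key should map to some range: retry once, refreshed.
        ranges = try_get_overlapping_ranges(cache, point, force_refresh=True)
    return ranges


# Usage: the stale map yields nothing, the refreshed map yields one range.
cache = RoutingMapCache(stale=[], fresh=[("", "80"), ("80", "FF")])
resolved = get_target_ranges(cache, "90", partition_key_specified=True)
```

The retry is limited to a single refresh so that a query without a matching range still terminates with an empty result rather than looping.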
Tests Added
Five new test methods in CollectionRoutingMapTest.cs covering the exact production scenario (128-char boundary, 96-char EPK). These previously FAILED with useLengthAwareRangeComparer=true. All 18 routing-map tests now pass.