Skip to content

[Cosmos] Fix pkranges fetch: use container name for URL, RID for cache key#4047

Merged
tvaron3 merged 6 commits into
release/azure_data_cosmos-previewsfrom
fix/pkranges-use-container-name
Mar 27, 2026
Merged

[Cosmos] Fix pkranges fetch: use container name for URL, RID for cache key#4047
tvaron3 merged 6 commits into
release/azure_data_cosmos-previewsfrom
fix/pkranges-use-container-name

Conversation

@tvaron3
Copy link
Copy Markdown
Member

@tvaron3 tvaron3 commented Mar 26, 2026

Summary

Fixes #4031 — the pkranges fetch was using the collection RID in a name-based URL hierarchy (dbs/perfdb/colls/<RID>/pkranges), which Cosmos DB rejects with 404 because mixed name/RID addressing is not supported.

The previous fix (#4041) corrected URL encoding (.item().item_by_rid()) but did not fix the fundamental mixed-addressing issue. This PR resolves it by passing the container name for URL construction while keeping the RID as the cache key and request context value.

Root Cause

PR #4005 changed container_connection.rs::send() to pass self.container_ref.rid() to pk_range_cache.try_lookup(). The cache used this RID to build the pkranges URL:

dbs/perfdb/colls/pLLZAIuPigw=/pkranges
     ^^^^^^       ^^^^^^^^^^^^^
     NAME          RID  ← mixed addressing → 404

All other SDK and driver operations use name-based URLs. The pkranges fetch was the only code path using a RID in a name-based link hierarchy.

Impact (observed on continuous benchmarks)

  • 1.8M 404 requests/hour from failed pkranges fetches
  • Errors silently swallowed by try_lookupOk(routing_map.ok())
  • Errors not cached → retried on every request (write lock contention on AsyncCache)
  • Loss of client-side partition key routing → gateway must route all requests
  • Throughput regression from ~110M to ~4.2M requests/hour

Changes

partition_key_range_cache.rs

  • Added collection_name: &str parameter to try_lookup, get_routing_map_for_collection, resolve_partition_key_range_by_id, and resolve_overlapping_ranges
  • Changed .item_by_rid(collection_rid).item(collection_name) for pkranges URL construction
  • Cache key remains the RID (collection_rid.to_string())
  • resource_id on the request remains the RID
  • Updated tracing::warn! to include both collection_name and collection_rid
  • Replaced 3 RID-encoding unit tests with 2 tests verifying name-based URL construction

container_connection.rs

  • Extracted collection_name from self.container_ref.name() alongside existing collection_rid
  • Passes collection_name to all pk_range_cache method calls
  • resolved_collection_rid on request context still uses the RID (unchanged)

cosmos_fault_injection.rs

  • Added fault_injection_pkrange_readfeed_is_exercised integration test
  • Injects a transient error on MetadataPartitionKeyRanges ReadFeed with hit_limit=1
  • Verifies the fault rule is hit (proving pkrange fetch code path executes)
  • Verifies subsequent item operations succeed (proving end-to-end pkrange resolution works)

Test Results

  • ✅ 31 unit tests pass (including 2 new)
  • ✅ Build succeeds
  • Integration test requires emulator (will run in CI)

The pkranges fetch URL was using the collection RID in a name-based link
hierarchy (e.g., dbs/perfdb/colls/<RID>/pkranges), which Cosmos DB rejects
with 404 because mixed name/RID addressing is not supported.

This fix passes the container name for URL construction (matching how all
other SDK operations work) while keeping the RID as the cache key and for
the resolved_collection_rid request context.

Changes:
- partition_key_range_cache: add collection_name parameter to try_lookup,
  get_routing_map_for_collection, resolve_partition_key_range_by_id, and
  resolve_overlapping_ranges; use .item(name) for URL, RID for cache key
- container_connection: extract container name alongside RID, pass both
  to pk_range_cache methods
- Updated unit tests to verify name-based URL and RID-based cache key
- Added fault injection integration test for pkrange resolution

Fixes: #4031

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 26, 2026

API Change Check

APIView identified API level changes in this PR and created the following API reviews

azure_data_cosmos

Add passthrough status tracking to FaultInjectionRule so spy rules
can verify real service responses without injecting faults.

Replace error-injection test with spy-rule test that verifies the
pkranges request returns 200, proving the URL is correct.

- Add passthrough_statuses field to FaultInjectionRule
- Record response status in FaultClient for true spy rules only
- Normalize collection_resource_id → collection_rid param naming
- Route test through fault client pipeline for correct interception

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tvaron3 tvaron3 marked this pull request as ready for review March 26, 2026 20:25
@tvaron3 tvaron3 requested a review from a team as a code owner March 26, 2026 20:25
Copilot AI review requested due to automatic review settings March 26, 2026 20:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes Cosmos partition-key-ranges (pkranges) fetch failures caused by mixed name/RID addressing by using the container name for URL construction while continuing to use the container RID for cache keys and request context.

Changes:

  • Updates PartitionKeyRangeCache APIs to accept collection_name for building name-based pkranges URLs while keeping RID-based cache keys.
  • Updates ContainerConnection to pass both container name and RID into pk-range cache lookups.
  • Extends fault injection to support “spy” passthrough status recording and adds an emulator test to assert the pkranges path is exercised and succeeds.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs Build pkranges links with container name (not RID) while keeping RID as cache key/context; update unit tests accordingly.
sdk/cosmos/azure_data_cosmos/src/handler/container_connection.rs Pass container name + RID into pk-range cache calls during request routing.
sdk/cosmos/azure_data_cosmos/src/fault_injection/rule.rs Add storage/access for passthrough response status codes on matched “spy” rules.
sdk/cosmos/azure_data_cosmos/src/fault_injection/http_client.rs Record real service response status codes for matched passthrough (spy) rules.
sdk/cosmos/azure_data_cosmos/tests/emulator_tests/cosmos_fault_injection.rs Add emulator test covering the pkranges readfeed path via a spy rule and validating it returns 200.

Comment thread sdk/cosmos/azure_data_cosmos/src/fault_injection/rule.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos/src/fault_injection/http_client.rs Outdated
tvaron3 and others added 4 commits March 26, 2026 17:18
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-project-automation github-project-automation Bot moved this from Todo to Approved in CosmosDB Go/Rust Crew Mar 27, 2026
@tvaron3 tvaron3 merged commit 5128111 into release/azure_data_cosmos-previews Mar 27, 2026
18 checks passed
@tvaron3 tvaron3 deleted the fix/pkranges-use-container-name branch March 27, 2026 19:37
@github-project-automation github-project-automation Bot moved this from Approved to Done in CosmosDB Go/Rust Crew Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cosmos The azure_cosmos crate

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants