Fix pk_range_cache to use .item_by_rid() for correct URL fetching#4032
Conversation
|
@copilot continue |
- Change .item(collection_rid) to .item_by_rid(collection_rid) in get_routing_map_for_collection() so RIDs are not URL-encoded (e.g., '=' → '%3D') when building the pkranges link. This was causing 404s from Cosmos DB on every partition key range fetch. - Add tracing::warn! in try_lookup() before swallowing errors so silent fetch failures are visible in diagnostics. Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com> Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-rust/sessions/7e76c701-3188-407a-8da0-e18abad78ebb
Done in commit
Build and clippy pass cleanly. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
Done. Commit
All 439 unit tests pass and Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
Add three tests in partition_key_range_cache.rs that verify: - item_by_rid() preserves '=' in RIDs like "pLLZAIuPigw=" (no %3D) - item() incorrectly URL-encodes '=' to '%3D' (documenting the bug) - item_by_rid() also preserves '+' and '/' in base64 RIDs Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com> Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-rust/sessions/f515482f-d87e-4105-9421-6ff7dac1508b
There was a problem hiding this comment.
Pull request overview
Fixes Cosmos partition key range (pkranges) routing map fetches when the cache key is a collection RID by ensuring the RID is not URL-encoded in the resource link, and adds diagnostics + unit tests to prevent regressions.
Changes:
- Construct pkranges resource links with
.item_by_rid(collection_rid)to avoid percent-encoding RIDs (fixes 404s on pkranges fetch). - Emit a
tracing::warn!when routing map fetch fails before the error is converted intoOk(None). - Add unit tests validating correct vs incorrect pkranges link construction for RID inputs containing
=,+,/.
Adds test-string words used in the pkranges URL encoding unit tests to sdk/cosmos/.cspell.json's ignoreWords list so CI spell checks pass. Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com> Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-rust/sessions/97fd20b9-499f-4c04-a672-232da75c3fe6
Addressed the cspell CI failures in commit |
|
/check-enforcer override |
…ure#4032) Fixes two issues in `sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs`: In `get_routing_map_for_collection()`, changed `.item(collection_rid)` to `.item_by_rid(collection_rid)`. The `.item()` method URL-encodes the value via `LinkSegment::new()`, so RIDs like `pLLZAIuPigw=` were being encoded to `pLLZAIuPigw%3D` in the URL, causing Cosmos DB to return 404 on every partition key range fetch. The `.item_by_rid()` method uses `LinkSegment::identity()` (no encoding), producing the correct URL: ``` dbs/perfdb/colls/pLLZAIuPigw=/pkranges ← correct ``` Added `tracing::warn!` before the `.ok()` call in `try_lookup()` that was silently swallowing errors. Routing map fetch failures now emit a warning with the `collection_rid` and error details, making silent failures visible in diagnostics. Added three unit tests in `partition_key_range_cache.rs` that directly verify the URL construction behavior: - `pkranges_link_rid_with_equals_is_not_encoded` — verifies `item_by_rid()` preserves `=` in the RID path (e.g., `pLLZAIuPigw=` → `dbs/perfdb/colls/pLLZAIuPigw=/pkranges`) - `pkranges_link_item_encodes_equals_incorrectly` — documents the bug: `item()` encodes `=` to `%3D`, producing a path that causes 404s - `pkranges_link_rid_with_plus_is_not_encoded` — verifies `item_by_rid()` also preserves `+` and `/` in base64 RIDs PR Azure#4005 changed the `pk_range_cache` key from container name to collection RID. The URL construction code was not updated to use `.item_by_rid()`, causing RID URL-encoding and subsequent 404s on every pkranges fetch. Because errors were silently swallowed and `AsyncCache` does not cache errors, this failed on every single request, resulting in write lock contention and loss of client-side partition key routing (~7% throughput regression observed in benchmarks). <!-- START COPILOT ORIGINAL PROMPT --> <details> <summary>Original prompt</summary> ---- *This section details on the original issue you should resolve* <issue_title>[Cosmos] pk_range_cache uses .item() instead of .item_by_rid() causing silent 404s on every request</issue_title> <issue_description>## Bug Report PR [Azure#4005](Azure#4005) changed the `pk_range_cache` key from container **name** to collection **RID**. However, the code that constructs the URL for fetching partition key ranges still uses `.item(collection_rid)` instead of `.item_by_rid(collection_rid)`. This causes the RID to be URL-encoded (e.g., `=` → `%3D`), resulting in a **404 from Cosmos DB** on every partition key range fetch attempt. Because `try_lookup` silently swallows errors via `Ok(routing_map.ok())` and `AsyncCache` does not cache errors, this failure repeats on **every single request**, causing: 1. **1.6M extra 404 requests/hour** observed on a benchmark account after deploying the change 2. **Write lock contention** in `AsyncCache` as every concurrent operation serializes through the failed fetch path 3. **Loss of client-side partition key routing** — the gateway must route all requests instead 4. **~7% throughput regression** observed in continuous benchmarks (110M → 102M requests/hour) In `partition_key_range_cache.rs`, `get_routing_map_for_collection()`: ```rust let pk_range_link = self .database_link // dbs/perfdb .feed(ResourceType::Containers) .item(collection_rid) // ← BUG: .item() URL-encodes the RID .feed(ResourceType::PartitionKeyRanges); ``` `.item()` calls `LinkSegment::new()` which URL-encodes the value. RIDs like `pLLZAIuPigw=` get the `=` encoded to `%3D`: ``` dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges ← Cosmos DB returns 404 ``` Should use `.item_by_rid()` which calls `LinkSegment::identity()` (no encoding): ``` dbs/perfdb/colls/pLLZAIuPigw=/pkranges ← correct ``` In `partition_key_range_cache.rs` line 147, `try_lookup()`: ```rust Ok(routing_map.ok()) // Converts Err(404) → Ok(None), invisible to caller ``` The caller in `container_connection.rs` sees `Ok(None)` and skips the routing block entirely: ```rust let routing_map = self.pk_range_cache.try_lookup(collection_rid, None).await?; if let Some(routing_map) = routing_map { // SKIPPED — no client-side partition key range resolution } ``` `AsyncCache::get()` only caches successful values. When `compute()` returns `Err`, the error propagates and the cache remains empty. Every subsequent request: 1. Read lock → cache miss 2. Acquire **write lock** (serializes all concurrent operations on the same key) 3. HTTP request to Cosmos DB → **404** 4. Error propagated, cache stays empty 5. Error swallowed as `Ok(None)` 6. Routing bypassed Continuous benchmark on `cosmos-perf-rg` (4 pods, concurrency=100, 400K RU/s): | Hour (UTC) | 404 Count | Notes | |---|---|---| | 13:02 – 17:02 | 3,500 – 6,300 | Normal background | | **18:02** | **1,645,604** | After deploying commit `98d01c8` | Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression). Server-side latency actually decreased (fewer effective requests reaching the service), confirming the bottleneck is client-side. ```diff // partition_key_range_cache.rs, get_routing_map_for_collection() let pk_range_link = self .database_link .feed(ResourceType::Containers) - .item(collection_rid) + .item_by_rid(collection_rid) .feed(ResourceType::PartitionKeyRanges); ``` Additionally, consider: - Logging errors in `try_lookup` before swallowing them, to make silent failures visible - Adding negative caching (or a backoff) in `AsyncCache` to avoid retrying failed fetches on every request Commit [`98d01c8`](Azure@98d01c8) on `release/azure_data_cosmos-previews` branch (PR Azure#4005). </issue_description> <agent_instructions> @copilot fix this issue. The target branch is release/azure_data_cosmos-previews. Required changes: 1. Primary fix — In sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs, in get_routing_map_for_collection(), change .item(collection_rid) to .item_by_rid(collection_rid) so the collection RID is not URL-encoded when constructing the pkranges resource link. 2. Add error logging in try_lookup — Before the .ok() on the last line of try_lookup() in the same file, add a tracing::warn! that logs when the routing map fetch fails, including the collection_rid and the error. This ensures silent failures are visible in diagnostics. Example: let routing_map = self.routing_map_cache.get(/* ... */).await; if let Err(ref e) = routing_map { tracing::warn!( collection_rid, ... </details> <!-- START COPILOT CODING AGENT SUFFIX --> - Fixes Azure#4031 <!-- START COPILOT CODING AGENT TIPS --> --- 📍 Connect Copilot coding agent with [Jira](https://gh.io/cca-jira-docs), [Azure Boards](https://gh.io/cca-azure-boards-docs) or [Linear](https://gh.io/cca-linear-docs) to delegate work to Copilot in one click without leaving your project management tool. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com>
…ure#4032) Fixes two issues in `sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs`: In `get_routing_map_for_collection()`, changed `.item(collection_rid)` to `.item_by_rid(collection_rid)`. The `.item()` method URL-encodes the value via `LinkSegment::new()`, so RIDs like `pLLZAIuPigw=` were being encoded to `pLLZAIuPigw%3D` in the URL, causing Cosmos DB to return 404 on every partition key range fetch. The `.item_by_rid()` method uses `LinkSegment::identity()` (no encoding), producing the correct URL: ``` dbs/perfdb/colls/pLLZAIuPigw=/pkranges ← correct ``` Added `tracing::warn!` before the `.ok()` call in `try_lookup()` that was silently swallowing errors. Routing map fetch failures now emit a warning with the `collection_rid` and error details, making silent failures visible in diagnostics. Added three unit tests in `partition_key_range_cache.rs` that directly verify the URL construction behavior: - `pkranges_link_rid_with_equals_is_not_encoded` — verifies `item_by_rid()` preserves `=` in the RID path (e.g., `pLLZAIuPigw=` → `dbs/perfdb/colls/pLLZAIuPigw=/pkranges`) - `pkranges_link_item_encodes_equals_incorrectly` — documents the bug: `item()` encodes `=` to `%3D`, producing a path that causes 404s - `pkranges_link_rid_with_plus_is_not_encoded` — verifies `item_by_rid()` also preserves `+` and `/` in base64 RIDs PR Azure#4005 changed the `pk_range_cache` key from container name to collection RID. The URL construction code was not updated to use `.item_by_rid()`, causing RID URL-encoding and subsequent 404s on every pkranges fetch. Because errors were silently swallowed and `AsyncCache` does not cache errors, this failed on every single request, resulting in write lock contention and loss of client-side partition key routing (~7% throughput regression observed in benchmarks). <!-- START COPILOT ORIGINAL PROMPT --> <details> <summary>Original prompt</summary> ---- *This section details on the original issue you should resolve* <issue_title>[Cosmos] pk_range_cache uses .item() instead of .item_by_rid() causing silent 404s on every request</issue_title> <issue_description>## Bug Report PR [Azure#4005](Azure#4005) changed the `pk_range_cache` key from container **name** to collection **RID**. However, the code that constructs the URL for fetching partition key ranges still uses `.item(collection_rid)` instead of `.item_by_rid(collection_rid)`. This causes the RID to be URL-encoded (e.g., `=` → `%3D`), resulting in a **404 from Cosmos DB** on every partition key range fetch attempt. Because `try_lookup` silently swallows errors via `Ok(routing_map.ok())` and `AsyncCache` does not cache errors, this failure repeats on **every single request**, causing: 1. **1.6M extra 404 requests/hour** observed on a benchmark account after deploying the change 2. **Write lock contention** in `AsyncCache` as every concurrent operation serializes through the failed fetch path 3. **Loss of client-side partition key routing** — the gateway must route all requests instead 4. **~7% throughput regression** observed in continuous benchmarks (110M → 102M requests/hour) In `partition_key_range_cache.rs`, `get_routing_map_for_collection()`: ```rust let pk_range_link = self .database_link // dbs/perfdb .feed(ResourceType::Containers) .item(collection_rid) // ← BUG: .item() URL-encodes the RID .feed(ResourceType::PartitionKeyRanges); ``` `.item()` calls `LinkSegment::new()` which URL-encodes the value. RIDs like `pLLZAIuPigw=` get the `=` encoded to `%3D`: ``` dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges ← Cosmos DB returns 404 ``` Should use `.item_by_rid()` which calls `LinkSegment::identity()` (no encoding): ``` dbs/perfdb/colls/pLLZAIuPigw=/pkranges ← correct ``` In `partition_key_range_cache.rs` line 147, `try_lookup()`: ```rust Ok(routing_map.ok()) // Converts Err(404) → Ok(None), invisible to caller ``` The caller in `container_connection.rs` sees `Ok(None)` and skips the routing block entirely: ```rust let routing_map = self.pk_range_cache.try_lookup(collection_rid, None).await?; if let Some(routing_map) = routing_map { // SKIPPED — no client-side partition key range resolution } ``` `AsyncCache::get()` only caches successful values. When `compute()` returns `Err`, the error propagates and the cache remains empty. Every subsequent request: 1. Read lock → cache miss 2. Acquire **write lock** (serializes all concurrent operations on the same key) 3. HTTP request to Cosmos DB → **404** 4. Error propagated, cache stays empty 5. Error swallowed as `Ok(None)` 6. Routing bypassed Continuous benchmark on `cosmos-perf-rg` (4 pods, concurrency=100, 400K RU/s): | Hour (UTC) | 404 Count | Notes | |---|---|---| | 13:02 – 17:02 | 3,500 – 6,300 | Normal background | | **18:02** | **1,645,604** | After deploying commit `98d01c8` | Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression). Server-side latency actually decreased (fewer effective requests reaching the service), confirming the bottleneck is client-side. ```diff // partition_key_range_cache.rs, get_routing_map_for_collection() let pk_range_link = self .database_link .feed(ResourceType::Containers) - .item(collection_rid) + .item_by_rid(collection_rid) .feed(ResourceType::PartitionKeyRanges); ``` Additionally, consider: - Logging errors in `try_lookup` before swallowing them, to make silent failures visible - Adding negative caching (or a backoff) in `AsyncCache` to avoid retrying failed fetches on every request Commit [`98d01c8`](Azure@98d01c8) on `release/azure_data_cosmos-previews` branch (PR Azure#4005). </issue_description> <agent_instructions> @copilot fix this issue. The target branch is release/azure_data_cosmos-previews. Required changes: 1. Primary fix — In sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs, in get_routing_map_for_collection(), change .item(collection_rid) to .item_by_rid(collection_rid) so the collection RID is not URL-encoded when constructing the pkranges resource link. 2. Add error logging in try_lookup — Before the .ok() on the last line of try_lookup() in the same file, add a tracing::warn! that logs when the routing map fetch fails, including the collection_rid and the error. This ensures silent failures are visible in diagnostics. Example: let routing_map = self.routing_map_cache.get(/* ... */).await; if let Err(ref e) = routing_map { tracing::warn!( collection_rid, ... </details> <!-- START COPILOT CODING AGENT SUFFIX --> - Fixes Azure#4031 <!-- START COPILOT CODING AGENT TIPS --> --- 📍 Connect Copilot coding agent with [Jira](https://gh.io/cca-jira-docs), [Azure Boards](https://gh.io/cca-azure-boards-docs) or [Linear](https://gh.io/cca-linear-docs) to delegate work to Copilot in one click without leaving your project management tool. --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com>
Fixes two issues in
sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs:Changes Made
Primary fix — correct pkranges resource link URL construction
In
get_routing_map_for_collection(), changed.item(collection_rid)to.item_by_rid(collection_rid). The.item()method URL-encodes the value viaLinkSegment::new(), so RIDs likepLLZAIuPigw=were being encoded topLLZAIuPigw%3Din the URL, causing Cosmos DB to return 404 on every partition key range fetch. The.item_by_rid()method usesLinkSegment::identity()(no encoding), producing the correct URL:Error logging in try_lookup
Added
tracing::warn!before the.ok()call intry_lookup()that was silently swallowing errors. Routing map fetch failures now emit a warning with thecollection_ridand error details, making silent failures visible in diagnostics.Unit tests
Added three unit tests in
partition_key_range_cache.rsthat directly verify the URL construction behavior:pkranges_link_rid_with_equals_is_not_encoded— verifiesitem_by_rid()preserves=in the RID path (e.g.,pLLZAIuPigw=→dbs/perfdb/colls/pLLZAIuPigw=/pkranges)pkranges_link_item_encodes_equals_incorrectly— documents the bug:item()encodes=to%3D, producing a path that causes 404spkranges_link_rid_with_plus_is_not_encoded— verifiesitem_by_rid()also preserves+and/in base64 RIDsRoot Cause
PR #4005 changed the
pk_range_cachekey from container name to collection RID. The URL construction code was not updated to use.item_by_rid(), causing RID URL-encoding and subsequent 404s on every pkranges fetch. Because errors were silently swallowed andAsyncCachedoes not cache errors, this failed on every single request, resulting in write lock contention and loss of client-side partition key routing (~7% throughput regression observed in benchmarks).Original prompt
This section details on the original issue you should resolve
<issue_title>[Cosmos] pk_range_cache uses .item() instead of .item_by_rid() causing silent 404s on every request</issue_title>
<issue_description>## Bug Report
Summary
PR #4005 changed the
pk_range_cachekey from container name to collection RID. However, the code that constructs the URL for fetching partition key ranges still uses.item(collection_rid)instead of.item_by_rid(collection_rid). This causes the RID to be URL-encoded (e.g.,=→%3D), resulting in a 404 from Cosmos DB on every partition key range fetch attempt.Because
try_lookupsilently swallows errors viaOk(routing_map.ok())andAsyncCachedoes not cache errors, this failure repeats on every single request, causing:AsyncCacheas every concurrent operation serializes through the failed fetch pathRoot Cause (3-step chain)
Step 1: Wrong URL encoding —
.item()vs.item_by_rid()In
partition_key_range_cache.rs,get_routing_map_for_collection():.item()callsLinkSegment::new()which URL-encodes the value. RIDs likepLLZAIuPigw=get the=encoded to%3D:Should use
.item_by_rid()which callsLinkSegment::identity()(no encoding):Step 2: Error silently swallowed
In
partition_key_range_cache.rsline 147,try_lookup():The caller in
container_connection.rsseesOk(None)and skips the routing block entirely:Step 3: Errors not cached → retried on every request
AsyncCache::get()only caches successful values. Whencompute()returnsErr, the error propagates and the cache remains empty. Every subsequent request:Ok(None)Evidence from Benchmarks
Continuous benchmark on
cosmos-perf-rg(4 pods, concurrency=100, 400K RU/s):98d01c8Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression). Server-side latency actually decreased (fewer effective requests reaching the service), confirming the bottleneck is client-side.
Suggested Fix
// partition_key_range_cache.rs, get_routing_map_for_collection() let pk_range_link = self .database_link .feed(ResourceType::Containers) - .item(collection_rid) + .item_by_rid(collection_rid) .feed(ResourceType::PartitionKeyRanges);Additionally, consider:
try_lookupbefore swallowing them, to make silent failures visibleAsyncCacheto avoid retrying failed fetches on every requestAffected Version
Commit
98d01c8onrelease/azure_data_cosmos-previewsbranch (PR #4005).</issue_description>
<agent_instructions>
@copilot fix this issue. The target branch is release/azure_data_cosmos-previews.
Required changes:
RID is not URL-encoded when constructing the pkranges resource link.
error. This ensures silent failures are visible in diagnostics. Example: let routing_map = self.routing_map_cache.get(/* ...
*/).await;
if let Err(ref e) = routing_map {
tracing::warn!(
collection_rid,
...
📍 Connect Copilot coding agent with Jira, Azure Boards or Linear to delegate work to Copilot in one click without leaving your project management tool.