Skip to content

Fix pk_range_cache to use .item_by_rid() for correct URL fetching#4032

Merged
tvaron3 merged 4 commits into
mainfrom
copilot/fix-pk-range-cache-404-issue
Mar 25, 2026
Merged

Fix pk_range_cache to use .item_by_rid() for correct URL fetching#4032
tvaron3 merged 4 commits into
mainfrom
copilot/fix-pk-range-cache-404-issue

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 25, 2026

Fixes two issues in sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs:

Changes Made

Primary fix — correct pkranges resource link URL construction

In get_routing_map_for_collection(), changed .item(collection_rid) to .item_by_rid(collection_rid). The .item() method URL-encodes the value via LinkSegment::new(), so RIDs like pLLZAIuPigw= were being encoded to pLLZAIuPigw%3D in the URL, causing Cosmos DB to return 404 on every partition key range fetch. The .item_by_rid() method uses LinkSegment::identity() (no encoding), producing the correct URL:

dbs/perfdb/colls/pLLZAIuPigw=/pkranges    ← correct

Error logging in try_lookup

Added tracing::warn! before the .ok() call in try_lookup() that was silently swallowing errors. Routing map fetch failures now emit a warning with the collection_rid and error details, making silent failures visible in diagnostics.

Unit tests

Added three unit tests in partition_key_range_cache.rs that directly verify the URL construction behavior:

  • pkranges_link_rid_with_equals_is_not_encoded — verifies item_by_rid() preserves = in the RID path (e.g., pLLZAIuPigw=dbs/perfdb/colls/pLLZAIuPigw=/pkranges)
  • pkranges_link_item_encodes_equals_incorrectly — documents the bug: item() encodes = to %3D, producing a path that causes 404s
  • pkranges_link_rid_with_plus_is_not_encoded — verifies item_by_rid() also preserves + and / in base64 RIDs

Root Cause

PR #4005 changed the pk_range_cache key from container name to collection RID. The URL construction code was not updated to use .item_by_rid(), causing RID URL-encoding and subsequent 404s on every pkranges fetch. Because errors were silently swallowed and AsyncCache does not cache errors, this failed on every single request, resulting in write lock contention and loss of client-side partition key routing (~7% throughput regression observed in benchmarks).

Original prompt

This section details on the original issue you should resolve

<issue_title>[Cosmos] pk_range_cache uses .item() instead of .item_by_rid() causing silent 404s on every request</issue_title>
<issue_description>## Bug Report

Summary

PR #4005 changed the pk_range_cache key from container name to collection RID. However, the code that constructs the URL for fetching partition key ranges still uses .item(collection_rid) instead of .item_by_rid(collection_rid). This causes the RID to be URL-encoded (e.g., =%3D), resulting in a 404 from Cosmos DB on every partition key range fetch attempt.

Because try_lookup silently swallows errors via Ok(routing_map.ok()) and AsyncCache does not cache errors, this failure repeats on every single request, causing:

  1. 1.6M extra 404 requests/hour observed on a benchmark account after deploying the change
  2. Write lock contention in AsyncCache as every concurrent operation serializes through the failed fetch path
  3. Loss of client-side partition key routing — the gateway must route all requests instead
  4. ~7% throughput regression observed in continuous benchmarks (110M → 102M requests/hour)

Root Cause (3-step chain)

Step 1: Wrong URL encoding — .item() vs .item_by_rid()

In partition_key_range_cache.rs, get_routing_map_for_collection():

let pk_range_link = self
    .database_link                       // dbs/perfdb
    .feed(ResourceType::Containers)
    .item(collection_rid)                // ← BUG: .item() URL-encodes the RID
    .feed(ResourceType::PartitionKeyRanges);

.item() calls LinkSegment::new() which URL-encodes the value. RIDs like pLLZAIuPigw= get the = encoded to %3D:

dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges  ← Cosmos DB returns 404

Should use .item_by_rid() which calls LinkSegment::identity() (no encoding):

dbs/perfdb/colls/pLLZAIuPigw=/pkranges    ← correct

Step 2: Error silently swallowed

In partition_key_range_cache.rs line 147, try_lookup():

Ok(routing_map.ok())  // Converts Err(404) → Ok(None), invisible to caller

The caller in container_connection.rs sees Ok(None) and skips the routing block entirely:

let routing_map = self.pk_range_cache.try_lookup(collection_rid, None).await?;
if let Some(routing_map) = routing_map {
    // SKIPPED — no client-side partition key range resolution
}

Step 3: Errors not cached → retried on every request

AsyncCache::get() only caches successful values. When compute() returns Err, the error propagates and the cache remains empty. Every subsequent request:

  1. Read lock → cache miss
  2. Acquire write lock (serializes all concurrent operations on the same key)
  3. HTTP request to Cosmos DB → 404
  4. Error propagated, cache stays empty
  5. Error swallowed as Ok(None)
  6. Routing bypassed

Evidence from Benchmarks

Continuous benchmark on cosmos-perf-rg (4 pods, concurrency=100, 400K RU/s):

Hour (UTC) 404 Count Notes
13:02 – 17:02 3,500 – 6,300 Normal background
18:02 1,645,604 After deploying commit 98d01c8

Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression). Server-side latency actually decreased (fewer effective requests reaching the service), confirming the bottleneck is client-side.

Suggested Fix

// partition_key_range_cache.rs, get_routing_map_for_collection()
let pk_range_link = self
    .database_link
    .feed(ResourceType::Containers)
-   .item(collection_rid)
+   .item_by_rid(collection_rid)
    .feed(ResourceType::PartitionKeyRanges);

Additionally, consider:

  • Logging errors in try_lookup before swallowing them, to make silent failures visible
  • Adding negative caching (or a backoff) in AsyncCache to avoid retrying failed fetches on every request

Affected Version

Commit 98d01c8 on release/azure_data_cosmos-previews branch (PR #4005).
</issue_description>

<agent_instructions>
@copilot fix this issue. The target branch is release/azure_data_cosmos-previews.

Required changes:

  1. Primary fix — In sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs, in get_routing_map_for_collection(), change .item(collection_rid) to .item_by_rid(collection_rid) so the collection
    RID is not URL-encoded when constructing the pkranges resource link.
  2. Add error logging in try_lookup — Before the .ok() on the last line of try_lookup() in the same file, add a tracing::warn! that logs when the routing map fetch fails, including the collection_rid and the
    error. This ensures silent failures are visible in diagnostics. Example: let routing_map = self.routing_map_cache.get(/* ...
    */).await;
    if let Err(ref e) = routing_map {
    tracing::warn!(
    collection_rid,
    ...

📍 Connect Copilot coding agent with Jira, Azure Boards or Linear to delegate work to Copilot in one click without leaving your project management tool.

@tvaron3
Copy link
Copy Markdown
Member

tvaron3 commented Mar 25, 2026

@copilot continue

- Change .item(collection_rid) to .item_by_rid(collection_rid) in
  get_routing_map_for_collection() so RIDs are not URL-encoded
  (e.g., '=' → '%3D') when building the pkranges link. This was
  causing 404s from Cosmos DB on every partition key range fetch.
- Add tracing::warn! in try_lookup() before swallowing errors so
  silent fetch failures are visible in diagnostics.

Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-rust/sessions/7e76c701-3188-407a-8da0-e18abad78ebb
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 25, 2026

@copilot continue

Done in commit 0fc1e76. Both fixes applied:

  1. Changed .item(collection_rid).item_by_rid(collection_rid) in get_routing_map_for_collection() so collection RIDs are no longer URL-encoded when building the pkranges link.
  2. Added tracing::warn! in try_lookup() before swallowing errors with .ok() to make silent failures visible in diagnostics.

Build and clippy pass cleanly.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • test.documents.azure.com
    • Triggering command: /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/azure_data_cosmos-98019f27430ad2fa /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/azure_data_cosmos-98019f27430ad2fa /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_d-C /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_ddebuginfo=2 /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_d--allow=clippy::uninlined_format_args /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_d--warn=unexpected_cfgs /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_d--deny=clippy::large_futures /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_d--allow=rustdoc::bare_urls /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_d--check-cfg /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_dcfg(docsrs) /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_deriv�� /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_derive-54fde6821176ff13.clap_derive.beb3175668cf05b4-cgu.13.rcgu.o /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/clap_derive-54fde6821176ff13.clap_derive.beb3175668cf05b4-cgu.14.rcgu.o 7z9r�� l0of.0vtmjad.rcgu.o u6rw.0vtmjad.rcgu.o 8bbx.0vtmjad.rcgu.o 5gii.0vtmjad.rcgcc wi1c.0vtmjad.rcg-Wl,--version-script=/home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/tarsdk/cosmos/azure_data_cosmos/src/lib.rs evgl.0vtmjad.rcg-Wl,--no-undefined-version zti4.0vtmjad.rcg-m64 (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI changed the title [WIP] Fix pk_range_cache to use .item_by_rid() for correct URL fetching Fix pk_range_cache to use .item_by_rid() for correct URL fetching Mar 25, 2026
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 25, 2026

@copilot continue

Done. Commit 0fc1e76 makes both required changes:

  1. Primary fix: Changed .item(collection_rid).item_by_rid(collection_rid) in get_routing_map_for_collection() so the collection RID is not URL-encoded (preventing =%3D which caused 404s on every pkranges fetch).

  2. Error logging: Added tracing::warn! in try_lookup() before the .ok() call so silent routing-map fetch failures are visible in diagnostics.

All 439 unit tests pass and cargo clippy is clean.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • test.documents.azure.com
    • Triggering command: /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/azure_data_cosmos-98019f27430ad2fa /home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/deps/azure_data_cosmos-98019f27430ad2fa lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --allow=clippy::uninlined_format_args --warn=unexpected_cfgs --deny=clippy::large_futures --allow=rustdoc::bare_urls --check-cfg 64-u�� 64-REDACTED-linux-gnu/lib/libcfg_if-bcf9f0b3b383a749.rlib 64-REDACTED-linux-gnu/lib/librustc_demangle-2eaee9ec806b6b01.rlib 1g3w.10ymxeo.rcg/home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/build/azure/home/REDACTED/work/azure-sdk-for-rust/azure-sdk-for-rust/target/debug/examples/rustcIFlJWQ/symbols.o 4lji.10ymxeo.rcg/home/REDACTED/.rustup/toolchains/stable-x86_64-REDACTED-linux-gnu/bin/rustc xb6t.10ymxeo.rcg--crate-name ebug/deps/azure_multi_write ebug/deps/azure_--edition=2021 (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Add three tests in partition_key_range_cache.rs that verify:
- item_by_rid() preserves '=' in RIDs like "pLLZAIuPigw=" (no %3D)
- item() incorrectly URL-encodes '=' to '%3D' (documenting the bug)
- item_by_rid() also preserves '+' and '/' in base64 RIDs

Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-rust/sessions/f515482f-d87e-4105-9421-6ff7dac1508b
Copilot AI requested a review from tvaron3 March 25, 2026 20:39
@tvaron3 tvaron3 marked this pull request as ready for review March 25, 2026 20:42
@tvaron3 tvaron3 requested a review from a team as a code owner March 25, 2026 20:42
Copilot AI review requested due to automatic review settings March 25, 2026 20:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes Cosmos partition key range (pkranges) routing map fetches when the cache key is a collection RID by ensuring the RID is not URL-encoded in the resource link, and adds diagnostics + unit tests to prevent regressions.

Changes:

  • Construct pkranges resource links with .item_by_rid(collection_rid) to avoid percent-encoding RIDs (fixes 404s on pkranges fetch).
  • Emit a tracing::warn! when routing map fetch fails before the error is converted into Ok(None).
  • Add unit tests validating correct vs incorrect pkranges link construction for RID inputs containing =, +, /.

Adds test-string words used in the pkranges URL encoding unit tests
to sdk/cosmos/.cspell.json's ignoreWords list so CI spell checks pass.

Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-rust/sessions/97fd20b9-499f-4c04-a672-232da75c3fe6
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 25, 2026

@copilot continue

Addressed the cspell CI failures in commit dd2d39e — added the test-string words (LLZA, Pigw, perfdb) to sdk/cosmos/.cspell.json.

Copilot AI requested a review from tvaron3 March 25, 2026 21:09
@github-project-automation github-project-automation Bot moved this from Todo to Approved in CosmosDB Go/Rust Crew Mar 25, 2026
@tvaron3
Copy link
Copy Markdown
Member

tvaron3 commented Mar 25, 2026

/check-enforcer override

@tvaron3 tvaron3 merged commit 4a99745 into main Mar 25, 2026
19 checks passed
@tvaron3 tvaron3 deleted the copilot/fix-pk-range-cache-404-issue branch March 25, 2026 22:01
@github-project-automation github-project-automation Bot moved this from Approved to Done in CosmosDB Go/Rust Crew Mar 25, 2026
@tvaron3 tvaron3 restored the copilot/fix-pk-range-cache-404-issue branch March 25, 2026 22:56
tvaron3 added a commit to tvaron3/azure-sdk-for-rust that referenced this pull request Mar 25, 2026
…ure#4032)

Fixes two issues in
`sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs`:

In `get_routing_map_for_collection()`, changed `.item(collection_rid)`
to `.item_by_rid(collection_rid)`. The `.item()` method URL-encodes the
value via `LinkSegment::new()`, so RIDs like `pLLZAIuPigw=` were being
encoded to `pLLZAIuPigw%3D` in the URL, causing Cosmos DB to return 404
on every partition key range fetch. The `.item_by_rid()` method uses
`LinkSegment::identity()` (no encoding), producing the correct URL:

```
dbs/perfdb/colls/pLLZAIuPigw=/pkranges    ← correct
```

Added `tracing::warn!` before the `.ok()` call in `try_lookup()` that
was silently swallowing errors. Routing map fetch failures now emit a
warning with the `collection_rid` and error details, making silent
failures visible in diagnostics.

Added three unit tests in `partition_key_range_cache.rs` that directly
verify the URL construction behavior:

- `pkranges_link_rid_with_equals_is_not_encoded` — verifies
`item_by_rid()` preserves `=` in the RID path (e.g., `pLLZAIuPigw=` →
`dbs/perfdb/colls/pLLZAIuPigw=/pkranges`)
- `pkranges_link_item_encodes_equals_incorrectly` — documents the bug:
`item()` encodes `=` to `%3D`, producing a path that causes 404s
- `pkranges_link_rid_with_plus_is_not_encoded` — verifies
`item_by_rid()` also preserves `+` and `/` in base64 RIDs

PR Azure#4005 changed the `pk_range_cache` key from container name to
collection RID. The URL construction code was not updated to use
`.item_by_rid()`, causing RID URL-encoding and subsequent 404s on every
pkranges fetch. Because errors were silently swallowed and `AsyncCache`
does not cache errors, this failed on every single request, resulting in
write lock contention and loss of client-side partition key routing (~7%
throughput regression observed in benchmarks).

<!-- START COPILOT ORIGINAL PROMPT -->

<details>

<summary>Original prompt</summary>

----

*This section details on the original issue you should resolve*

<issue_title>[Cosmos] pk_range_cache uses .item() instead of
.item_by_rid() causing silent 404s on every request</issue_title>
<issue_description>## Bug Report

PR [Azure#4005](Azure#4005)
changed the `pk_range_cache` key from container **name** to collection
**RID**. However, the code that constructs the URL for fetching
partition key ranges still uses `.item(collection_rid)` instead of
`.item_by_rid(collection_rid)`. This causes the RID to be URL-encoded
(e.g., `=` → `%3D`), resulting in a **404 from Cosmos DB** on every
partition key range fetch attempt.

Because `try_lookup` silently swallows errors via `Ok(routing_map.ok())`
and `AsyncCache` does not cache errors, this failure repeats on **every
single request**, causing:

1. **1.6M extra 404 requests/hour** observed on a benchmark account
after deploying the change
2. **Write lock contention** in `AsyncCache` as every concurrent
operation serializes through the failed fetch path
3. **Loss of client-side partition key routing** — the gateway must
route all requests instead
4. **~7% throughput regression** observed in continuous benchmarks (110M
→ 102M requests/hour)

In `partition_key_range_cache.rs`, `get_routing_map_for_collection()`:

```rust
let pk_range_link = self
    .database_link                       // dbs/perfdb
    .feed(ResourceType::Containers)
    .item(collection_rid)                // ← BUG: .item() URL-encodes the RID
    .feed(ResourceType::PartitionKeyRanges);
```

`.item()` calls `LinkSegment::new()` which URL-encodes the value. RIDs
like `pLLZAIuPigw=` get the `=` encoded to `%3D`:
```
dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges  ← Cosmos DB returns 404
```

Should use `.item_by_rid()` which calls `LinkSegment::identity()` (no
encoding):
```
dbs/perfdb/colls/pLLZAIuPigw=/pkranges    ← correct
```

In `partition_key_range_cache.rs` line 147, `try_lookup()`:

```rust
Ok(routing_map.ok())  // Converts Err(404) → Ok(None), invisible to caller
```

The caller in `container_connection.rs` sees `Ok(None)` and skips the
routing block entirely:

```rust
let routing_map = self.pk_range_cache.try_lookup(collection_rid, None).await?;
if let Some(routing_map) = routing_map {
    // SKIPPED — no client-side partition key range resolution
}
```

`AsyncCache::get()` only caches successful values. When `compute()`
returns `Err`, the error propagates and the cache remains empty. Every
subsequent request:

1. Read lock → cache miss
2. Acquire **write lock** (serializes all concurrent operations on the
same key)
3. HTTP request to Cosmos DB → **404**
4. Error propagated, cache stays empty
5. Error swallowed as `Ok(None)`
6. Routing bypassed

Continuous benchmark on `cosmos-perf-rg` (4 pods, concurrency=100, 400K
RU/s):

| Hour (UTC) | 404 Count | Notes |
|---|---|---|
| 13:02 – 17:02 | 3,500 – 6,300 | Normal background |
| **18:02** | **1,645,604** | After deploying commit `98d01c8` |

Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression).
Server-side latency actually decreased (fewer effective requests
reaching the service), confirming the bottleneck is client-side.

```diff
// partition_key_range_cache.rs, get_routing_map_for_collection()
let pk_range_link = self
    .database_link
    .feed(ResourceType::Containers)
-   .item(collection_rid)
+   .item_by_rid(collection_rid)
    .feed(ResourceType::PartitionKeyRanges);
```

Additionally, consider:
- Logging errors in `try_lookup` before swallowing them, to make silent
failures visible
- Adding negative caching (or a backoff) in `AsyncCache` to avoid
retrying failed fetches on every request

Commit
[`98d01c8`](Azure@98d01c8)
on `release/azure_data_cosmos-previews` branch (PR
Azure#4005).
</issue_description>

<agent_instructions>
@copilot fix this issue. The target branch is
release/azure_data_cosmos-previews.

 Required changes:

1. Primary fix — In
sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs,
in get_routing_map_for_collection(), change .item(collection_rid) to
.item_by_rid(collection_rid) so the collection
 RID is not URL-encoded when constructing the pkranges resource link.
2. Add error logging in try_lookup — Before the .ok() on the last line
of try_lookup() in the same file, add a tracing::warn! that logs when
the routing map fetch fails, including the collection_rid and the
error. This ensures silent failures are visible in diagnostics. Example:
let routing_map = self.routing_map_cache.get(/* ...
   */).await;
   if let Err(ref e) = routing_map {
       tracing::warn!(
           collection_rid,
  ...

</details>

<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes Azure#4031

<!-- START COPILOT CODING AGENT TIPS -->
---

📍 Connect Copilot coding agent with [Jira](https://gh.io/cca-jira-docs),
[Azure Boards](https://gh.io/cca-azure-boards-docs) or
[Linear](https://gh.io/cca-linear-docs) to delegate work to Copilot in
one click without leaving your project management tool.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com>
tvaron3 added a commit to tvaron3/azure-sdk-for-rust that referenced this pull request Mar 25, 2026
…ure#4032)

Fixes two issues in
`sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs`:

In `get_routing_map_for_collection()`, changed `.item(collection_rid)`
to `.item_by_rid(collection_rid)`. The `.item()` method URL-encodes the
value via `LinkSegment::new()`, so RIDs like `pLLZAIuPigw=` were being
encoded to `pLLZAIuPigw%3D` in the URL, causing Cosmos DB to return 404
on every partition key range fetch. The `.item_by_rid()` method uses
`LinkSegment::identity()` (no encoding), producing the correct URL:

```
dbs/perfdb/colls/pLLZAIuPigw=/pkranges    ← correct
```

Added `tracing::warn!` before the `.ok()` call in `try_lookup()` that
was silently swallowing errors. Routing map fetch failures now emit a
warning with the `collection_rid` and error details, making silent
failures visible in diagnostics.

Added three unit tests in `partition_key_range_cache.rs` that directly
verify the URL construction behavior:

- `pkranges_link_rid_with_equals_is_not_encoded` — verifies
`item_by_rid()` preserves `=` in the RID path (e.g., `pLLZAIuPigw=` →
`dbs/perfdb/colls/pLLZAIuPigw=/pkranges`)
- `pkranges_link_item_encodes_equals_incorrectly` — documents the bug:
`item()` encodes `=` to `%3D`, producing a path that causes 404s
- `pkranges_link_rid_with_plus_is_not_encoded` — verifies
`item_by_rid()` also preserves `+` and `/` in base64 RIDs

PR Azure#4005 changed the `pk_range_cache` key from container name to
collection RID. The URL construction code was not updated to use
`.item_by_rid()`, causing RID URL-encoding and subsequent 404s on every
pkranges fetch. Because errors were silently swallowed and `AsyncCache`
does not cache errors, this failed on every single request, resulting in
write lock contention and loss of client-side partition key routing (~7%
throughput regression observed in benchmarks).

<!-- START COPILOT ORIGINAL PROMPT -->

<details>

<summary>Original prompt</summary>

----

*This section details on the original issue you should resolve*

<issue_title>[Cosmos] pk_range_cache uses .item() instead of
.item_by_rid() causing silent 404s on every request</issue_title>
<issue_description>## Bug Report

PR [Azure#4005](Azure#4005)
changed the `pk_range_cache` key from container **name** to collection
**RID**. However, the code that constructs the URL for fetching
partition key ranges still uses `.item(collection_rid)` instead of
`.item_by_rid(collection_rid)`. This causes the RID to be URL-encoded
(e.g., `=` → `%3D`), resulting in a **404 from Cosmos DB** on every
partition key range fetch attempt.

Because `try_lookup` silently swallows errors via `Ok(routing_map.ok())`
and `AsyncCache` does not cache errors, this failure repeats on **every
single request**, causing:

1. **1.6M extra 404 requests/hour** observed on a benchmark account
after deploying the change
2. **Write lock contention** in `AsyncCache` as every concurrent
operation serializes through the failed fetch path
3. **Loss of client-side partition key routing** — the gateway must
route all requests instead
4. **~7% throughput regression** observed in continuous benchmarks (110M
→ 102M requests/hour)

In `partition_key_range_cache.rs`, `get_routing_map_for_collection()`:

```rust
let pk_range_link = self
    .database_link                       // dbs/perfdb
    .feed(ResourceType::Containers)
    .item(collection_rid)                // ← BUG: .item() URL-encodes the RID
    .feed(ResourceType::PartitionKeyRanges);
```

`.item()` calls `LinkSegment::new()` which URL-encodes the value. RIDs
like `pLLZAIuPigw=` get the `=` encoded to `%3D`:
```
dbs/perfdb/colls/pLLZAIuPigw%3D/pkranges  ← Cosmos DB returns 404
```

Should use `.item_by_rid()` which calls `LinkSegment::identity()` (no
encoding):
```
dbs/perfdb/colls/pLLZAIuPigw=/pkranges    ← correct
```

In `partition_key_range_cache.rs` line 147, `try_lookup()`:

```rust
Ok(routing_map.ok())  // Converts Err(404) → Ok(None), invisible to caller
```

The caller in `container_connection.rs` sees `Ok(None)` and skips the
routing block entirely:

```rust
let routing_map = self.pk_range_cache.try_lookup(collection_rid, None).await?;
if let Some(routing_map) = routing_map {
    // SKIPPED — no client-side partition key range resolution
}
```

`AsyncCache::get()` only caches successful values. When `compute()`
returns `Err`, the error propagates and the cache remains empty. Every
subsequent request:

1. Read lock → cache miss
2. Acquire **write lock** (serializes all concurrent operations on the
same key)
3. HTTP request to Cosmos DB → **404**
4. Error propagated, cache stays empty
5. Error swallowed as `Ok(None)`
6. Routing bypassed

Continuous benchmark on `cosmos-perf-rg` (4 pods, concurrency=100, 400K
RU/s):

| Hour (UTC) | 404 Count | Notes |
|---|---|---|
| 13:02 – 17:02 | 3,500 – 6,300 | Normal background |
| **18:02** | **1,645,604** | After deploying commit `98d01c8` |

Throughput dropped from ~110M req/hr to ~102M req/hr (~7% regression).
Server-side latency actually decreased (fewer effective requests
reaching the service), confirming the bottleneck is client-side.

```diff
// partition_key_range_cache.rs, get_routing_map_for_collection()
let pk_range_link = self
    .database_link
    .feed(ResourceType::Containers)
-   .item(collection_rid)
+   .item_by_rid(collection_rid)
    .feed(ResourceType::PartitionKeyRanges);
```

Additionally, consider:
- Logging errors in `try_lookup` before swallowing them, to make silent
failures visible
- Adding negative caching (or a backoff) in `AsyncCache` to avoid
retrying failed fetches on every request

Commit
[`98d01c8`](Azure@98d01c8)
on `release/azure_data_cosmos-previews` branch (PR
Azure#4005).
</issue_description>

<agent_instructions>
@copilot fix this issue. The target branch is
release/azure_data_cosmos-previews.

 Required changes:

1. Primary fix — In
sdk/cosmos/azure_data_cosmos/src/routing/partition_key_range_cache.rs,
in get_routing_map_for_collection(), change .item(collection_rid) to
.item_by_rid(collection_rid) so the collection
 RID is not URL-encoded when constructing the pkranges resource link.
2. Add error logging in try_lookup — Before the .ok() on the last line
of try_lookup() in the same file, add a tracing::warn! that logs when
the routing map fetch fails, including the collection_rid and the
error. This ensures silent failures are visible in diagnostics. Example:
let routing_map = self.routing_map_cache.get(/* ...
   */).await;
   if let Err(ref e) = routing_map {
       tracing::warn!(
           collection_rid,
  ...

</details>

<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes Azure#4031

<!-- START COPILOT CODING AGENT TIPS -->
---

📍 Connect Copilot coding agent with [Jira](https://gh.io/cca-jira-docs),
[Azure Boards](https://gh.io/cca-azure-boards-docs) or
[Linear](https://gh.io/cca-linear-docs) to delegate work to Copilot in
one click without leaving your project management tool.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tvaron3 <70857381+tvaron3@users.noreply.github.com>
tvaron3 added a commit that referenced this pull request Mar 25, 2026
tvaron3 added a commit that referenced this pull request Mar 26, 2026
… fetching (#4041)

Cherry-pick of #4032 into `release/azure_data_cosmos-previews`.

Fixes pk_range_cache to use `.item_by_rid()` for correct URL fetching,
ensuring partition key range lookups use the correct container resource
ID.

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cosmos The azure_cosmos crate

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Cosmos] pk_range_cache uses .item() instead of .item_by_rid() causing silent 404s on every request

5 participants