[Documentation] Cosmos: Optimization Plan for PPCB High Memory Utilization #4419

Merged: kundadebdatta merged 4 commits into release/azure_data_cosmos-previews from users/kundadebdatta/4361_ppcb_memory_optimization_plan_temp on May 18, 2026

@kundadebdatta (Member)

What this PR does

Documentation-only port. Adds a single new file:

  • sdk/cosmos/azure_data_cosmos_benchmarks/docs/PPCB_MEMORY_ANALYSIS.md (+1,571 lines)

No source, manifest, test, or CI changes. No behavioral impact on any shipped crate.

Why

The Per-Partition Circuit Breaker (PPCB) subsystem in azure_data_cosmos_driver carries lazy routing state whose worst-case footprint scales with partition_count × failed_regions_per_partition. Large-container customers (~80k physical partitions) have asked us for concrete memory-cost guidance. This document is the consolidated answer, captured under azure_data_cosmos_benchmarks/docs so it lives next to the harness that produced the numbers.
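As a rough illustration of the scaling claim above, the worst-case number of failed-region records is the product of the two factors. The function and parameter names below are hypothetical and do not come from the driver; this is a sketch of the arithmetic only, not of how the driver tracks state.

```rust
/// Worst-case count of (partition, failed region) records that lazy
/// routing state could hold. Hypothetical helper, not part of the driver.
fn worst_case_failed_region_records(
    partition_count: u64,
    failed_regions_per_partition: u64,
) -> u64 {
    // Saturate rather than overflow on unrealistically large inputs.
    partition_count.saturating_mul(failed_regions_per_partition)
}
```

For the ~80k-partition, 2-failed-regions case discussed in this PR, `worst_case_failed_region_records(80_000, 2)` evaluates to 160,000 records.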

It also documents an already-applied v2 optimization (SmallVec<[CosmosEndpoint; 4]> with the union feature replacing HashSet<CosmosEndpoint>, plus CompactString for PartitionKeyRangeId) and validates it against both a synthesized harness and a real Cosmos DB INT account.
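To make the HashSet-to-SmallVec idea concrete, here is a std-only stand-in that stores a handful of items inline, so a small set needs no heap block of its own. It is not the `smallvec` crate and not the driver's code: `InlineSet` and `RegionId` are hypothetical names, and unlike a real `SmallVec`, which spills to the heap once its inline capacity is exceeded, this sketch simply rejects the extra item.

```rust
/// Hypothetical stand-in for an endpoint identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RegionId(u8);

/// Fixed-capacity set stored entirely inline (no heap allocation).
/// Illustrative only; the driver uses `SmallVec<[CosmosEndpoint; 4]>`.
#[derive(Debug)]
struct InlineSet<T, const N: usize> {
    items: [Option<T>; N],
    len: usize,
}

impl<T: PartialEq + Copy, const N: usize> InlineSet<T, N> {
    fn new() -> Self {
        Self { items: [None; N], len: 0 }
    }

    /// Linear scan; cheap for the handful of elements expected here.
    fn contains(&self, value: &T) -> bool {
        self.items[..self.len]
            .iter()
            .any(|slot| slot.as_ref() == Some(value))
    }

    /// Ok(true) if inserted, Ok(false) if already present,
    /// Err(value) if the inline capacity is exhausted.
    fn insert(&mut self, value: T) -> Result<bool, T> {
        if self.contains(&value) {
            return Ok(false);
        }
        if self.len == N {
            return Err(value);
        }
        self.items[self.len] = Some(value);
        self.len += 1;
        Ok(true)
    }

    fn len(&self) -> usize {
        self.len
    }
}
```

With an inline capacity of 4, inserting two distinct `RegionId` values and one duplicate leaves `len()` at 2, and the whole set lives inside its parent struct. The trade-off is a linear scan for membership, which is reasonable only because the number of failed regions per partition is expected to be small.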

Document structure

| Section | Content |
|---|---|
| §1–§9 | Methodology, harness, workload, headline numbers, per-allocation decomposition, per-entry cost model for the v1 (pre-optimization) driver. |
| §10–§14 | Findings, recommendations, risk/operational implications, caveats, reproduction steps. |
| Appendix A/B | Raw DHAT stack frame table + glossary. |
| §15 | Verified v2 optimization results: code changes applied, type-size impact, DHAT before/after, re-projected scaling. |
| §16 | Real-account validation (v2): N = 115 partitions on a real Cosmos INT account, cross-checked against §15 projections. |
| §17 | Real-account v1 vs v2 comparison: apples-to-apples Δ on the same INT account. |
| §18 | Simulated v1 vs v2 at 1k and 10k partitions: closed-form model derived from §15/§17 per-entry slopes; predictions for fleet-scale N. |
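The §18 model itself lives in the linked document and is not reproduced here. As a minimal sketch of what a closed-form projection from a per-entry slope can look like, the helper below assumes a linear form, heap(N) ≈ fixed + slope × N. The linear form, the function names, and the choice of fixed term are all assumptions made for illustration.

```rust
/// Assumed linear heap model: fixed_bytes + bytes_per_entry * partitions.
/// Hypothetical helper; not the model used in §18 of the document.
fn projected_heap_bytes(fixed_bytes: u64, bytes_per_entry: u64, partitions: u64) -> u64 {
    fixed_bytes.saturating_add(bytes_per_entry.saturating_mul(partitions))
}

/// Convert a byte count to MiB (1 MiB = 1,048,576 bytes).
fn bytes_to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

Plugging in the v1 headline figures from this PR (1,380 B with PPCB disabled as the fixed term, ~308 B per entry) gives 309,380 B at N = 1,000 and 3,081,380 B at N = 10,000. These are outputs of the assumed model only and should not be read as the predictions reported in §18.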

Headline numbers documented

v1 (HashSet-based) at 80k partitions, fully tripped, 2 failed regions/partition:

| Dimension | PPCB Disabled | PPCB Enabled | Delta |
|---|---|---|---|
| Peak live heap | 1,380 B | 24,604,302 B (≈ 23.46 MiB) | +24,602,922 B |
| Peak live blocks | 13 | 160,014 | +160,001 |
| Bytes / partition entry | | ~308 B | |
| Heap blocks / partition entry | | 2 | |
| Leaks | 0 | 0 | |
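The per-entry rows are consistent with dividing the enabled-minus-disabled delta by the 80k partition count. The sketch below redoes that arithmetic; `HeapPeak` and `per_entry_cost` are hypothetical names, and the assumption that the per-entry figures were derived this way is mine rather than something the PR states.

```rust
/// Peak live heap figures for one configuration (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct HeapPeak {
    bytes: u64,
    blocks: u64,
}

/// Returns (bytes per entry, blocks per entry) computed from the delta
/// between two peaks, or None if the inputs cannot yield a valid delta.
fn per_entry_cost(disabled: HeapPeak, enabled: HeapPeak, partitions: u64) -> Option<(f64, f64)> {
    if partitions == 0 {
        return None;
    }
    // checked_sub guards against an "enabled" peak below the baseline.
    let delta_bytes = enabled.bytes.checked_sub(disabled.bytes)?;
    let delta_blocks = enabled.blocks.checked_sub(disabled.blocks)?;
    let n = partitions as f64;
    Some((delta_bytes as f64 / n, delta_blocks as f64 / n))
}
```

With the table's values (1,380 B / 13 blocks disabled, 24,604,302 B / 160,014 blocks enabled, 80,000 partitions) this yields roughly 307.5 bytes and 2.0 blocks per entry, matching the rounded ~308 B and 2 shown above.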

v2 (SmallVec + CompactString) impact:

  • Block-count pressure at 80k partitions: 160,014 → ~14 (−99.99 %) — the dominant operational win (eliminates ~160k allocator round-trips on partition-event storms).
  • Peak heap at 80k partitions: ~23.5 MiB → ~17 MiB (−28 %) — also drops the single contiguous 20 MiB HashMap backing allocation, reducing fragmentation risk.
  • PartitionFailoverEntry shrank by 8 B (an unexpected secondary win from SmallVec's union feature reusing the discriminant niche).
  • Confirmed on a real Cosmos INT account at N = 115 (§16/§17) and modeled at N = 1k / 10k (§18).
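The quoted reductions can be rechecked with a one-line relative-change calculation. `percent_change` below is a hypothetical helper, and the inputs are the rounded figures from the bullets above, so the results are only as precise as those roundings.

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values indicate a reduction. Hypothetical helper.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

`percent_change(160_014.0, 14.0)` comes out at about −99.99 and `percent_change(23.5, 17.0)` at about −27.7, which agrees with the −99.99 % and −28 % stated in the list.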

Reviewer guidance

  • Pure doc PR — no diff against Cargo.toml, src/, tests/, or CI.
  • The file is 1,571 lines but is fully self-contained; reviewers don't need to run anything.
  • If you want to reproduce, §14 (synthesized harness), §16.10 / §17.7 (real account), and §18.8 (simulated 1k/10k) each have copy-paste reproduction blocks.
  • The code changes the document refers to (v2: SmallVec + CompactString) are already shipped on release/azure_data_cosmos-previews (see PR #4393, "Slim PartitionKeyRange cache", in the immediate history); this PR only adds the report that motivated and validated them.

Validation

  • git status clean after commit (only pre-existing untracked sdk/cosmos/.vscode/ remains, untouched).
  • Branch is exactly 1 commit ahead of origin/release/azure_data_cosmos-previews (962c1f7b7).
  • No build/test/clippy/fmt checks needed (Markdown-only change in a docs/ folder).

Out of scope

  • Any further memory work beyond SmallVec + CompactString (e.g., §11.3 boxing the HashMap value is explicitly deferred).
  • Driver source changes (the v2 optimizations are already merged separately).

@kundadebdatta marked this pull request as ready for review on May 18, 2026 17:25
@kundadebdatta requested a review from a team as a code owner on May 18, 2026 17:25
Copilot AI review requested due to automatic review settings on May 18, 2026 17:25

Copilot AI (Contributor) left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a comprehensive DHAT-based report documenting the PPCB routing-state memory footprint in azure_data_cosmos_driver, including methodology, measured results, and validation of an already-shipped v2 optimization.

Changes:

  • New performance/memory analysis document covering PPCB steady-state heap usage and scaling projections
  • Documents measurement methodology + reproduction steps (synth harness and real-account validation)
  • Captures before/after results for v2 (SmallVec + CompactString) and operational implications

Comment thread sdk/cosmos/azure_data_cosmos_benchmarks/docs/PPCB_MEMORY_ANALYSIS.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_benchmarks/docs/PPCB_MEMORY_ANALYSIS.md Outdated
@kundadebdatta enabled auto-merge (squash) on May 18, 2026 18:35
…datta/4361_ppcb_memory_optimization_plan_temp

@FabianMeiswinkel (Member) left a comment

LGTM - Thanks

@kundadebdatta merged commit 05e6862 into release/azure_data_cosmos-previews on May 18, 2026 (12 checks passed)
@github-project-automation (bot) moved this from Todo to Approved in CosmosDB Rust SDK and Driver on May 18, 2026
@kundadebdatta deleted the users/kundadebdatta/4361_ppcb_memory_optimization_plan_temp branch on May 18, 2026 23:57
@github-project-automation (bot) moved this from Approved to Done in CosmosDB Rust SDK and Driver on May 18, 2026

Labels

Cosmos (The azure_cosmos crate), cosmos-driver

Projects

Status: Done

3 participants