[Store] L2->L1 promotion-on-hit #2071
Conversation
Code Review
This pull request introduces a 'promotion-on-hit' feature that asynchronously promotes objects from local SSD (L2) back to DRAM (L1) when they are frequently accessed. The implementation includes master-side gating logic using a Count-Min Sketch, new RPC endpoints for promotion orchestration, and a client-side execution loop in FileStorage. Feedback focuses on critical concurrency and performance issues: processing too many promotion tasks in a single heartbeat could block the client, and the current replica commitment logic in NotifyPromotionSuccess is ambiguous and could lead to data corruption if concurrent Put operations occur. Additionally, a type mismatch between the frequency sketch and the admission threshold was identified.
Hi @yzhan1, overall the implementation follows the RFC. I'm curious about the following issues:

- The ideal fix here would be to return or prefer MEMORY replicas before LOCAL_DISK. Also, it would be nice to add a test that proves the post-promotion read actually selects MEMORY, not just that MEMORY metadata exists.
- Could you keep unprocessed tasks queued, return only a bounded batch from the heartbeat, or process the full drained map with a byte/task budget that preserves leftovers?
Overall, this PR makes sense to me from a design perspective. As a nice-to-have, would it be possible to supplement the results with a few more metrics? This would further strengthen the empirical validation, though it's by no means a blocker.
…elected e2e proof

Two PR-review issues from kvcache-ai#2071, addressed without changing the existing commit:

1. Heartbeat could silently drop leftover promotion tasks.

The prior per-tick-cap fix lived on the client side: the master drained the whole per-LDS promotion_objects map, the client processed one task, and the rest evaporated. Their per-shard promotion_tasks entries stayed refcnt-pinned until the reaper TTL (~10 min), and the dedup gate blocked re-enqueue, so a burst of hot LOCAL_DISK keys could lose all but one promotion for that window.

Move the cap server-side. MasterService::PromotionObjectHeartbeat now extracts at most kMaxPerHeartbeat (1) entries from the per-LDS promotion_objects map and leaves the remainder queued for subsequent heartbeats. The client-side per-tick cap is removed; the master is the bound. Leftover work is preserved by construction.

New unit test HeartbeatBoundedBatchPreservesLeftovers proves: with three queued keys, three successive heartbeats each return exactly one distinct key, the fourth returns empty, and all three keys' master-side state stays intact across the drain. The FakeClient harness in file_storage_promotion_test now mirrors the master's bounded-batch semantic on its mocked Heartbeat, so the existing client-side tests exercise the same drain shape production sees.

2. E2E couldn't prove subsequent reads avoided SSD.

Post-promotion the test asserted bytes-back and the existence of a MEMORY replica, but bytes-back is the same whether MEMORY or LOCAL_DISK served the read (the offload-RPC path returns correct bytes too). The RFC's core invariant, "subsequent reads avoid SSD", went unverified.

Add an instrumented counter offload_rpc_read_count on RealClient that increments at every invocation of batch_get_into_offload_object_internal, the single chokepoint for LOCAL_DISK reads served via peer offload-RPC. Expose it via the Python binding as MooncakeDistributedStore.get_offload_rpc_read_count(). The e2e snapshots the counter around the post-promotion store.get and asserts it didn't move; a non-zero delta would mean the read was served from SSD instead of the promoted MEMORY replica.
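A minimal sketch of the bounded drain described in item 1, assuming a pared-down per-LDS map of pending promotions; the struct and function names here are illustrative, not the actual MasterService code:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a queued promotion; the real entry carries
// source-replica info and timestamps and lives under a per-segment mutex.
struct PromotionEntry { /* source replica info, timestamps, ... */ };

// Extract at most max_per_heartbeat entries and leave the remainder queued
// for subsequent heartbeats: leftovers survive by construction.
std::vector<std::string> DrainBoundedBatch(
    std::unordered_map<std::string, PromotionEntry>& promotion_objects,
    size_t max_per_heartbeat) {
    std::vector<std::string> batch;
    auto it = promotion_objects.begin();
    while (it != promotion_objects.end() && batch.size() < max_per_heartbeat) {
        batch.push_back(it->first);
        it = promotion_objects.erase(it);  // erase returns the next iterator
    }
    return batch;
}
```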
Thank you for your patience. Overall, this PR looks quite complete to me. My final suggestion is regarding code quality:

[Nit] The `promotion_tasks` map doesn't need to hold `const` values:

```diff
- std::unordered_map<std::string, const PromotionTask> promotion_tasks;
+ std::unordered_map<std::string, PromotionTask> promotion_tasks;
```

```diff
- const_cast<PromotionTask&>(task_it->second).alloc_id = new_id;
+ task_it->second.alloc_id = new_id;
```

The shard RW lock is already held in both places.
@LujhCoconut thanks! I will revise based on your comment. I also have some perf tests I'd like to add and some other minor bugs I'm fixing. Will address and push shortly.
Reviewer asked for more metrics. The post-promotion offload-RPC counter already proves MEMORY served the read; adding a wall-time comparison gives the same evidence on a different axis and surfaces a concrete "this is the user-visible win" number.

Phase 3 timed the 4 admission-threshold-clearing reads as the LOCAL_DISK sample. Phase 5 expanded the single post-promotion read into 10 timed reads as the MEMORY sample, plus the existing counter assertion. The test now prints a median/min/max table for each phase and asserts the post-promotion median is at least 1.3x faster.

The threshold is intentionally loose: on docker tmpfs with small (1 MB) objects much of the wall time is Python/C++ marshalling overhead and the SSD-vs-DRAM gap shrinks to ~2-5x; on real SSDs + RDMA the gap is typically 10-50x. The counter assertion is the hard correctness check; the latency assertion catches gross regressions while the printed summary serves as the primary empirical signal.

Observed in CI (docker tmpfs, 1 MB objects):

    pre-promotion (LOCAL_DISK via offload-RPC): median 1.53 ms
    post-promotion (MEMORY direct):             median 0.31 ms
    speedup: 4.9x
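For reference, a hedged sketch of the counter instrumentation this commit leans on; only `offload_rpc_read_count`, `batch_get_into_offload_object_internal`, and `get_offload_rpc_read_count` are names from the PR, and the surrounding class shape is invented:

```cpp
#include <atomic>
#include <cstdint>

class RealClient {
   public:
    // Single chokepoint for LOCAL_DISK reads served via peer offload-RPC;
    // every invocation bumps the counter before doing the actual work.
    void batch_get_into_offload_object_internal(/* request args */) {
        offload_rpc_read_count_.fetch_add(1, std::memory_order_relaxed);
        // ... issue the offload RPC to the holder client ...
    }

    // Exposed through the Python binding as
    // MooncakeDistributedStore.get_offload_rpc_read_count(); the e2e test
    // snapshots this around a read and asserts a zero delta.
    uint64_t get_offload_rpc_read_count() const {
        return offload_rpc_read_count_.load(std::memory_order_relaxed);
    }

   private:
    std::atomic<uint64_t> offload_rpc_read_count_{0};
};
```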
Reviewer asked to keep a richer latency comparison in the repo but NOT run it as a CI test.

test_promotion_after_repeated_hits goes back to the lean correctness shape: 4 reads to clear the admission gate, 1 read post-promotion + offload-RPC counter assertion to prove MEMORY served, no timing.

The full p50/p95/p99 latency comparison moves into a sibling class BenchPromotionLatency gated by @unittest.skipUnless(MC_BENCH_PROMOTION_LATENCY), so:

    python test_promotion_on_hit.py                         # 3 tests, 1 skipped -- CI
    MC_BENCH_PROMOTION_LATENCY=1 python test_promotion_on_hit.py   # full bench

The bench retains the per-read MEMORY-detection (snapshots offload_rpc_read_count around each pre-promotion read and bails on detection of MEMORY-served reads so the LOCAL_DISK distribution stays uncontaminated) and the 1.3x p50 speedup assertion. Sample count is LATENCY_SAMPLES (default 1000; verified stable to within noise at N=10000 too; N=1000 is the sweet spot for stable p99 without paying for 10x more wall time).

Observed bench numbers on docker tmpfs (1 MiB objects, N=1000):

    pre-promotion (LOCAL_DISK via offload-RPC): p50=0.52 ms, p95=0.68 ms, p99=0.81 ms
    post-promotion (MEMORY direct):             p50=0.24 ms, p95=0.33 ms, p99=0.42 ms
    speedup: p50=2.2x | p95=2.0x | p99=1.9x

Real-hardware speedup is typically 10-50x because the SSD-vs-DRAM cost gap dominates only when objects are large and the FFI/marshalling overhead is fractional.
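The bench itself is Python, but the percentile math it reports is simple; a small C++ sketch of a floor-rank percentile over a latency sample (all names illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Floor-rank percentile: partially sorts the sample just enough to read
// off the element at index floor(pct/100 * (n-1)). Takes the vector by
// value because nth_element reorders it.
double Percentile(std::vector<double> samples_ms, double pct) {
    if (samples_ms.empty()) return 0.0;
    auto rank = static_cast<size_t>(pct / 100.0 * (samples_ms.size() - 1));
    std::nth_element(samples_ms.begin(), samples_ms.begin() + rank,
                     samples_ms.end());
    return samples_ms[rank];
}

// e.g. the bench's p50 speedup assertion, expressed with this helper:
//   Percentile(local_disk_ms, 50) / Percentile(memory_ms, 50) >= 1.3
```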
…obal cap, const, misconfig log)

Addresses reviewer feedback on PR kvcache-ai#2071.

Issue 1 -- orphaned PROCESSING MEMORY replica leak.

The promotion task reaper only dropped the source LOCAL_DISK refcnt and erased the task entry; it never popped the staged PROCESSING MEMORY replica added by PromotionAllocStart. That replica is not in shard->processing_keys, so DiscardExpiredProcessingReplicas could not sweep it and the buffer leaked until the object was removed or evicted.

Fix: in the reaper, when alloc_id != 0, call metadata.EraseReplicaByID(alloc_id) to pop the staged replica and return its buffer to the allocator.

Issue 2 -- per-shard cap was wrong for skewed workloads.

The old gate was 'shard->size() * kNumShards >= limit': approximately right for uniform workloads but ~1024x too eager on skewed workloads where hot keys cluster in few shards. Replace it with a cluster-wide std::atomic<uint64_t> promotion_in_flight_ counter, incremented in TryPushPromotionQueue after a successful emplace and decremented in NotifyPromotionSuccess and in the reaper. memory_order_relaxed suffices since the value is advisory; the per-shard mutex already serializes inserts within a shard and the dedup gate prevents duplicate work.

Issue 3 -- const_cast smell.

The promotion_tasks map held const PromotionTask values for "generic safety", forcing a const_cast<PromotionTask&> in PromotionAllocStart to set alloc_id under the shard write lock. Drop the const; PromotionAllocStart now sets task_it->second.alloc_id = new_id directly.

Misconfig log -- emit LOG(WARNING) at startup when config.promotion_on_hit=true but enable_offload=false. Promotion requires offload to produce LOCAL_DISK replicas, so it is silently disabled in that combination; the log makes the disablement discoverable to operators.

Tests
-----
- New ReaperPopsStagedMemoryReplicaOnExpiry: regression for Issue 1. Uses QuerySegments(seg).first (used bytes) to observe that the staged PROCESSING MEMORY replica's buffer is freed back to the DRAM allocator after the reaper sweeps the expired task.
- New QueueLimitRejectsCrossShard: regression for Issue 2. With queue_limit=1, proves a second admission attempt on a *different shard* is rejected -- exactly the case the old per-shard cap admitted incorrectly.
- Updated the comment on the existing QueueLimitRejectsBeyondCap to reflect the cluster-wide counter semantics.

Verification
------------
- promotion_on_hit_test: 15/15 pass (5 consecutive clean runs)
- file_storage_promotion_test: 9/9 pass
- master_service_promotion_test_for_snapshot: 5/5 pass
- offload_on_evict_test: 9/9 pass
- code_format.sh --base upstream/main (clang-format-20 in container): 3 files reformatted (master_service.h, master_service.cpp, promotion_on_hit_test.cpp); all others "Already formatted"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
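A sketch of the Issue 2 gate reduced to just the cluster-wide counter; `promotion_in_flight_` and the increment/decrement sites are from the commit, while the class and method names are invented:

```cpp
#include <atomic>
#include <cstdint>

class PromotionGate {
   public:
    explicit PromotionGate(uint64_t limit) : limit_(limit) {}

    // Mirrors the check in TryPushPromotionQueue: the counter is bumped
    // after a successful emplace into the per-shard promotion_tasks map.
    // Relaxed ordering is enough because the value is advisory; the shard
    // mutex serializes inserts and the dedup gate prevents duplicates.
    bool TryAdmit() {
        if (in_flight_.load(std::memory_order_relaxed) >= limit_)
            return false;  // cluster-wide cap, immune to shard skew
        in_flight_.fetch_add(1, std::memory_order_relaxed);
        return true;
    }

    // Called from NotifyPromotionSuccess and from the stale-task reaper.
    void Release() { in_flight_.fetch_sub(1, std::memory_order_relaxed); }

   private:
    const uint64_t limit_;
    std::atomic<uint64_t> in_flight_{0};
};
```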
…est + reaper poll

Follow-up to 2cee8eb addressing two further reviewer concerns and recovering one edit batch that did not land in that commit.

AllocStart deadline reset
-------------------------
A queued promotion task could enter the active-transfer phase (PromotionAllocStart -> client SSD read -> RDMA write -> Notify) with only a sliver of TTL remaining if the holder's heartbeat queue was backlogged (kMaxPerHeartbeat = 1 serializes work). If the active transfer of a large object then ran past the TTL, the reaper would call EraseReplicaByID on the staged MEMORY replica mid-transfer; the allocator could then hand that buffer to a concurrent Put while the client's RDMA write was still landing in it.

Fix: PromotionAllocStart now resets task.start_time when it stages the PROCESSING MEMORY replica. The queue-wait phase (alloc_id == 0, no buffer) and the active-transfer phase (alloc_id != 0, buffer in flight) each get their own full put_start_release_timeout_sec_ TTL window, measured from when they began.

New regression test AllocStartResetsTaskDeadline drives the queue_wait_seconds + alloc_active_seconds > TTL scenario and asserts via QuerySegments(seg).first that the staged buffer survives past the bare TTL when measured from admission.

NotifyPromotionSuccess counter coverage
---------------------------------------
The success path of NotifyPromotionSuccess was not exercised by any existing test; all call sites in the suite asserted error paths. 2cee8eb added a fetch_sub on the success path but had no test to catch a missing decrement.

New test NotifySuccessDecrementsCounter sets queue_limit=1, drives the full happy path (GetReplicaList -> PromotionAllocStart -> NotifyPromotionSuccess), then asserts a second admission on a different key is allowed. Without the success-path fetch_sub, the cap stays saturated at 1 and the second admission is dropped. Bug discrimination verified manually: temporarily removing fetch_sub fails this test as expected.

Reaper test poll-instead-of-sleep
---------------------------------
ReaperPopsStagedMemoryReplicaOnExpiry used a fixed 2-second sleep after a 1-second TTL. Under suite load the eviction-thread cadence plus general scheduling jitter could push past the 2-second mark ~1 in 6 runs, producing a flake matching the StalePromotionReaper class. Replace it with a 5-second polling deadline against QuerySegments(seg).first; the test now passes 5/5 stability runs.

QueueLimitRejectsBeyondCap stale block comment
----------------------------------------------
The test's docblock still described the old per-shard gate (gate: shard_size * kNumShards >= limit). Update it to reflect the cluster-wide promotion_in_flight_ counter and cross-reference QueueLimitRejectsCrossShard for the cross-shard case.

Verification
------------
- promotion_on_hit_test: 17/17 pass x 5 consecutive clean runs
- New tests pass; mutation test of NotifySuccessDecrementsCounter against a removed success-path fetch_sub fails as designed
- code_format.sh --base upstream/main (clang-format-20 in container): 3 files reformatted; all others already formatted

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
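The deadline-reset idea in miniature, assuming a task struct with only the two fields the commit discusses; everything else is placeholder:

```cpp
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

struct PromotionTask {
    uint64_t alloc_id = 0;         // 0 = queue-wait phase, no buffer staged
    Clock::time_point start_time;  // start of the *current* phase
};

// Inside PromotionAllocStart, after staging the PROCESSING MEMORY replica:
// resetting start_time gives the active-transfer phase its own full TTL
// window, so the reaper cannot free the staged buffer while an RDMA write
// is still landing in it.
void OnAllocStart(PromotionTask& task, uint64_t new_alloc_id) {
    task.alloc_id = new_alloc_id;
    task.start_time = Clock::now();  // TTL now measures the transfer phase
}

// The reaper's expiry check, measured from the current phase start.
bool Expired(const PromotionTask& task, std::chrono::seconds ttl) {
    return Clock::now() - task.start_time > ttl;
}
```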
Description
Adds the L2→L1 promotion-on-hit feature: when a key whose only complete replica lives on a peer client's local SSD (LOCAL_DISK) is read enough times to clear an admission threshold, the master enqueues a promotion task and the holder client's next FileStorage heartbeat stages a fresh MEMORY replica via Transfer Engine. Symmetric counterpart to offload-on-evict: offload moves cold data DRAM→SSD on eviction, promotion moves rediscovered-warm data SSD→DRAM on access.
The trigger lives inside `MasterService::GetReplicaList`. A CountMinSketch tracks per-key access frequency; gates filter out cases where promotion would be wasteful or unsafe (no LOCAL_DISK source, MEMORY already present, DRAM under high-watermark pressure, queue saturated, or a task already in flight for the same key). When all gates pass, the master records a `PromotionTask` in the per-shard map (refcnt-pinning the source replica) and pushes the key onto the holder's per-`LocalDiskSegment` `promotion_objects` map. The holder's FileStorage heartbeat thread drains a bounded batch from that map, calls `PromotionAllocStart` to stage a PROCESSING MEMORY replica, reads the bytes from local SSD, RDMA-WRITEs them via `PromotionWrite`, and notifies the master. `NotifyPromotionSuccess` flips the new replica COMPLETE and decrements the source LOCAL_DISK refcnt. A reaper drops expired tasks; orphaned PROCESSING replicas are reaped via the standard discarded-replicas path.

Failure handling matches offload's posture: per-key failures are logged and skipped, and the master-side reaper handles task TTL expiry on its own schedule.
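A condensed sketch of that gate chain; only the gate names come from the paragraph above, and the boolean parameters are illustrative stand-ins for the real metadata, watermark, and queue checks:

```cpp
#include <cstdint>

// estimated_freq is the Count-Min Sketch estimate for the key; threshold is
// promotion_admission_threshold, clamped to [0, 255] so both sides of the
// comparison share an 8-bit domain (the type-mismatch fix noted in review).
bool ShouldPromote(uint8_t estimated_freq, uint8_t threshold,
                   bool has_local_disk_source, bool memory_already_present,
                   bool dram_under_pressure, bool queue_saturated,
                   bool task_in_flight) {
    if (estimated_freq < threshold) return false;  // not hot enough yet
    if (!has_local_disk_source) return false;      // nothing to promote from
    if (memory_already_present) return false;      // already in DRAM
    if (dram_under_pressure) return false;         // high-watermark gate
    if (queue_saturated) return false;             // promotion_queue_limit
    if (task_in_flight) return false;              // dedup gate
    return true;  // record PromotionTask, enqueue on holder's segment
}
```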
Key design points addressed in this PR
- The master keys the commit on the `ReplicaID` of the staged MEMORY replica captured during `PromotionAllocStart`. `NotifyPromotionSuccess` commits exactly that replica via `GetReplicaByID`, so a parallel Put creating its own PROCESSING MEMORY replica on the same key cannot be confused with ours (see the sketch after this list).
- `MasterService::PromotionObjectHeartbeat` returns at most `kMaxPerHeartbeat = 1` task per call and leaves the rest queued; the client processes whatever the master returned and never blocks beyond a single task. Leftovers are preserved by construction: no silent drops.
- `promotion_admission_threshold` is clamped to [0, 255] at config parse time with a warning, giving the gate a stable comparison contract.
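A sketch of the commit-by-`ReplicaID` rule from the first bullet; `GetReplicaByID`, `PromotionAllocStart`, and `NotifyPromotionSuccess` are from the PR, while the metadata shape here is reduced to the essentials:

```cpp
#include <cstdint>
#include <unordered_map>

enum class ReplicaStatus { PROCESSING, COMPLETE };

struct Replica {
    uint64_t id = 0;
    ReplicaStatus status = ReplicaStatus::PROCESSING;
};

// Pared-down object metadata: replicas keyed by the ID handed out when the
// replica was staged.
struct ObjectMetadata {
    std::unordered_map<uint64_t, Replica> replicas;
    Replica* GetReplicaByID(uint64_t id) {
        auto it = replicas.find(id);
        return it == replicas.end() ? nullptr : &it->second;
    }
};

// NotifyPromotionSuccess commits exactly the replica staged by
// PromotionAllocStart; a parallel Put's own PROCESSING MEMORY replica has a
// different ID and can never be flipped COMPLETE by mistake.
bool CommitPromotedReplica(ObjectMetadata& md, uint64_t staged_id) {
    Replica* r = md.GetReplicaByID(staged_id);
    if (r == nullptr) return false;  // staged replica was reaped meanwhile
    r->status = ReplicaStatus::COMPLETE;
    return true;
}
```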
Config knobs (default off)

- `--promotion_on_hit=true`: feature flag
- `--promotion_admission_threshold=N`: reads per key before the gate fires (default 2; clamped to ≤255)
- `--promotion_queue_limit=N`: soft cap on in-flight promotion tasks (default 50000)
Module

- `mooncake-transfer-engine`
- `mooncake-store`
- `mooncake-ep`
- `mooncake-integration`
- `mooncake-p2p-store`
- `mooncake-wheel`
- `mooncake-pg`
- `mooncake-rl`
How Has This Been Tested?
Master-side unit tests (`mooncake-store/tests/promotion_on_hit_test.cpp`, 13 tests):

- `promotion_queue_limit` rejects beyond saturation
- `PromotionObjectHeartbeat` returns ≤ `kMaxPerHeartbeat` per call and preserves leftover work (`HeartbeatBoundedBatchPreservesLeftovers`)
- `preferred_segments` honored

Client-side unit tests (`mooncake-store/tests/file_storage_promotion_test.cpp`, 9 tests; FakeClient mocks the master RPCs).

Snapshot tests (`mooncake-store/tests/ha/snapshot/master_service_promotion_test_for_snapshot.cpp`, 5 tests).

End-to-end Python tests (`mooncake-wheel/tests/test_promotion_on_hit.py`, 2 tests in CI): `store.get` clears the admission threshold → MEMORY replica reappears → bit-exact bytes verified pre- AND post-promotion, AND the post-promotion read is proven to serve from MEMORY (not LOCAL_DISK SSD) by snapshotting `MooncakeDistributedStore.get_offload_rpc_read_count()` around the read and asserting zero delta. Gated behind `TEST_PROMOTION_ON_HIT` in `scripts/run_tests.sh`.

Opt-in latency benchmark (`BenchPromotionLatency` class in the same file, skipped by default; run with `MC_BENCH_PROMOTION_LATENCY=1`): the CI test proves MEMORY served the post-promotion read via the offload-RPC counter delta. For a richer empirical view, the benchmark times every read with `time.perf_counter`, samples `LATENCY_SAMPLES` reads per phase (default 1000), and reports p50/p95/p99 with a per-percentile speedup table. Pre-promotion reads bail the moment a read serves from MEMORY (detected via the same counter) so the LOCAL_DISK distribution stays uncontaminated.

Observed numbers from a docker-tmpfs run with 1 MiB objects, N=1000 samples:

| Phase | p50 | p95 | p99 |
|---|---|---|---|
| Pre-promotion (LOCAL_DISK via offload-RPC) | 0.52 ms | 0.68 ms | 0.81 ms |
| Post-promotion (MEMORY direct) | 0.24 ms | 0.33 ms | 0.42 ms |
| Speedup | 2.2× | 2.0× | 1.9× |
The speedup holds across the distribution — MEMORY's win isn't a median trick. The 1.3× p50 assertion baked into the bench has comfortable headroom over the observed 2×. On docker tmpfs (no real SSD, no RDMA NIC) much of the per-read wall-time is Python/C++ marshalling overhead, so the SSD-vs-DRAM cost gap is compressed; on real hardware (NVMe + RDMA) the ratio is typically 10–50×.
The same numbers replicated at N=10000 with `MOONCAKE_OFFLOAD_HEARTBEAT_INTERVAL_SECONDS=30` (so the worker doesn't race a longer pre-loop): p50/p95/p99 changed by ≤0.02 ms vs the N=1000 run, confirming N=1000 is the right default for stable tail percentiles.

Follow-ups (not in this PR)
- The per-heartbeat drain is capped at `kMaxPerHeartbeat=1` to stay inside the client-liveness window. If real workloads show the resulting drain rate (≈1 promotion per heartbeat interval) becomes a bottleneck, an obvious next step is a dedicated worker thread that pulls from a bounded queue: the heartbeat enqueues, the worker drains independently, and the cap can be relaxed (see the sketch below). ~180 LOC + tests; deferred unless needed.
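The deferred worker-thread design could look roughly like this; nothing below exists in the PR, it is only a sketch of the shape described above:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

// Heartbeat enqueues keys; a dedicated worker drains them independently,
// so kMaxPerHeartbeat no longer bounds the promotion drain rate.
class PromotionWorker {
   public:
    explicit PromotionWorker(size_t cap) : cap_(cap) {}

    bool Enqueue(std::string key) {  // called from the heartbeat thread
        std::lock_guard<std::mutex> lk(mu_);
        if (stopped_ || queue_.size() >= cap_) return false;  // bounded
        queue_.push_back(std::move(key));
        cv_.notify_one();
        return true;
    }

    void Run() {  // worker thread body
        std::unique_lock<std::mutex> lk(mu_);
        while (true) {
            cv_.wait(lk, [this] { return stopped_ || !queue_.empty(); });
            if (stopped_) return;
            std::string key = std::move(queue_.front());
            queue_.pop_front();
            lk.unlock();
            Promote(key);  // PromotionAllocStart -> SSD read -> RDMA -> Notify
            lk.lock();
        }
    }

    void Stop() {
        std::lock_guard<std::mutex> lk(mu_);
        stopped_ = true;
        cv_.notify_all();
    }

   private:
    void Promote(const std::string& /*key*/) { /* transfer pipeline */ }
    const size_t cap_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::string> queue_;
    bool stopped_ = false;
};
```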
Checklist

- Ran `./scripts/code_format.sh` before submitting.