[Store] L2->L1 promotion-on-hit #2071

Open
yzhan1 wants to merge 6 commits into kvcache-ai:main from yzhan1:feature/l2-l1-promotion-on-hit

Conversation


@yzhan1 yzhan1 commented May 10, 2026

Description

Adds the L2→L1 promotion-on-hit feature: when a key whose only complete replica lives on a peer client's local SSD (LOCAL_DISK) is read enough times to clear an admission threshold, the master enqueues a promotion task and the holder client's next FileStorage heartbeat stages a fresh MEMORY replica via Transfer Engine. Symmetric counterpart to offload-on-evict: offload moves cold data DRAM→SSD on eviction, promotion moves rediscovered-warm data SSD→DRAM on access.

The trigger lives inside MasterService::GetReplicaList. A CountMinSketch tracks per-key access frequency; gates filter out cases where promotion would be wasteful or unsafe (no LOCAL_DISK source, MEMORY already present, DRAM under high-watermark pressure, queue saturated, or task already in-flight for the same key). When all gates pass, the master records a PromotionTask in the per-shard map (refcnt-pinning the source replica) and pushes the key onto the holder's per-LocalDiskSegment promotion_objects map. The holder's FileStorage heartbeat thread drains a bounded batch from that map, calls PromotionAllocStart to stage a PROCESSING MEMORY replica, reads the bytes from local SSD, RDMA-WRITEs them via PromotionWrite, and notifies the master. NotifyPromotionSuccess flips the new replica COMPLETE and decrements the source LOCAL_DISK refcnt. A reaper drops expired tasks; orphaned PROCESSING replicas are reaped via the standard discarded-replicas path.
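For orientation, the gate sequence above can be condensed into a small sketch. Every name here (`Config`, `State`, `should_enqueue_promotion`) is illustrative Python, not the PR's actual C++ API:

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    promotion_on_hit: bool = True
    high_watermark: float = 0.9
    promotion_queue_limit: int = 50000
    promotion_admission_threshold: int = 2

@dataclass
class State:
    dram_usage_ratio: float = 0.0
    in_flight: int = 0
    promotion_tasks: set = field(default_factory=set)
    counts: dict = field(default_factory=dict)  # stand-in for the Count-Min Sketch

def should_enqueue_promotion(key, has_local_disk_source, has_memory_replica,
                             cfg, state):
    """Apply the admission gates in the order the description lists them."""
    if not cfg.promotion_on_hit:
        return False  # feature flag off
    if not has_local_disk_source:
        return False  # no LOCAL_DISK source to promote from
    if has_memory_replica:
        return False  # MEMORY replica already present
    if state.dram_usage_ratio >= cfg.high_watermark:
        return False  # DRAM under high-watermark pressure
    if state.in_flight >= cfg.promotion_queue_limit:
        return False  # promotion queue saturated
    if key in state.promotion_tasks:
        return False  # task already in flight for this key
    # frequency gate: sketch estimate must clear the admission threshold
    return state.counts.get(key, 0) >= cfg.promotion_admission_threshold
```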

Failure handling matches offload's posture: per-key failures are logged and skipped, and the master-side reaper handles task TTL expiry on its own schedule.

Key design points addressed in this PR

  • Concurrent-Put disambiguation in NotifyPromotionSuccess. The PromotionTask stores the ReplicaID of the staged MEMORY replica captured during PromotionAllocStart. NotifyPromotionSuccess commits exactly that replica via GetReplicaByID, so a parallel Put creating its own PROCESSING MEMORY replica on the same key cannot be confused with ours.
  • Heartbeat liveness. Each promotion task does a synchronous SSD read + RDMA write whose wall time scales with object size (64 MB–1 GB tensors are common in KV-cache workloads). MasterService::PromotionObjectHeartbeat returns at most kMaxPerHeartbeat = 1 task per call and leaves the rest queued; the client processes whatever the master returned and never blocks beyond a single task. Leftovers are preserved by construction — no silent drops.
  • Sketch / threshold type safety. The CountMinSketch uses 8-bit saturating counters (max 255), so promotion_admission_threshold is clamped to [0, 255] at config parse time with a warning, giving the gate a stable comparison contract.
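To make the sketch/threshold contract concrete, here is a minimal Count-Min Sketch with 8-bit saturating counters plus the parse-time clamp, in illustrative Python (the real implementation is C++; sizes and hashing are assumptions):

```python
import hashlib

class CountMinSketch8:
    """Count-Min Sketch whose counters saturate at 255 instead of wrapping."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        # one independent hash per row, via per-row salt
        for i in range(self.depth):
            h = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([i])).digest()
            yield i, int.from_bytes(h, "little") % self.width

    def add(self, key):
        for i, j in self._slots(key):
            if self.rows[i][j] < 255:      # 8-bit saturation, never wraps
                self.rows[i][j] += 1

    def estimate(self, key):
        # min over rows bounds the overestimate from hash collisions
        return min(self.rows[i][j] for i, j in self._slots(key))

def clamp_threshold(n):
    """Clamp promotion_admission_threshold to [0, 255] at config parse time."""
    return max(0, min(255, n))
```

Because counters cap at 255, any threshold above 255 could never fire; the clamp keeps the `estimate >= threshold` comparison meaningful.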

Config knobs (default off)

  • --promotion_on_hit=true — feature flag
  • --promotion_admission_threshold=N — reads per key before the gate fires (default 2; clamped to ≤255)
  • --promotion_queue_limit=N — soft per-shard cap on in-flight promotion tasks (default 50000)

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Master-side unit tests (mooncake-store/tests/promotion_on_hit_test.cpp, 13 tests):

  • Gate behavior: feature off, no LOCAL_DISK source, MEMORY already present
  • Error returns: unknown client/key on Heartbeat / AllocStart / Notify
  • State machine: dedup of racing readers, stale task reaper, key removed mid-promotion
  • Cap gates: promotion_queue_limit rejects beyond saturation; PromotionObjectHeartbeat returns ≤kMaxPerHeartbeat per call and preserves leftover work (HeartbeatBoundedBatchPreservesLeftovers)
  • Multi-segment routing: cross-host LOCAL_DISK→MEMORY allocation, preferred_segments honored

Client-side unit tests (mooncake-store/tests/file_storage_promotion_test.cpp, 9 tests, FakeClient mocks the master RPCs):

  • Empty queue / heartbeat error paths
  • Per-key error isolation: AllocStart / BatchLoad / TransferWrite / Notify failures don't abort the batch
  • NonPositiveSize / SegmentNotFound / hard-error propagation
  • FakeClient mirrors the master's bounded-batch heartbeat semantics, so the drain shape exercised matches production

Snapshot tests (mooncake-store/tests/ha/snapshot/master_service_promotion_test_for_snapshot.cpp, 5 tests):

  • LOCAL_DISK replica round-trip, mixed MEMORY+LOCAL_DISK, multiple holders, in-flight task safety

End-to-end Python tests (mooncake-wheel/tests/test_promotion_on_hit.py, 2 tests in CI):

  • Positive: a 96×1 MiB workload overflows a 32 MiB segment → eviction creates LOCAL_DISK-only keys → repeated store.get clears the admission threshold → a MEMORY replica reappears. Bit-exact bytes are verified both pre- and post-promotion, and the post-promotion read is proven to serve from MEMORY (not LOCAL_DISK SSD) by snapshotting MooncakeDistributedStore.get_offload_rpc_read_count() around the read and asserting a zero delta
  • Negative: below-watermark workload stays MEMORY-only (guards against spurious promotion / auto-LOCAL_DISK regressions)

Gated behind TEST_PROMOTION_ON_HIT in scripts/run_tests.sh.
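The MEMORY-served proof reduces to a counter-delta assertion around the read. A minimal sketch with a stand-in store (only `get_offload_rpc_read_count` mirrors the real binding name; `FakeStore` is hypothetical):

```python
class FakeStore:
    """Stand-in store: serves everything from a dict ("MEMORY") and keeps
    the offload-RPC read counter at zero, mirroring the happy path."""
    def __init__(self):
        self.mem = {}
        self._offload_reads = 0

    def get_offload_rpc_read_count(self):
        return self._offload_reads

    def get(self, key):
        return self.mem[key]

def assert_read_served_from_memory(store, key):
    """Snapshot the offload-RPC counter around a read; a zero delta means
    the read never hit the LOCAL_DISK offload-RPC chokepoint."""
    before = store.get_offload_rpc_read_count()
    data = store.get(key)
    after = store.get_offload_rpc_read_count()
    assert after - before == 0, "read was served from SSD, not MEMORY"
    return data
```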

Opt-in latency benchmark (BenchPromotionLatency class in the same file, skipped by default; run with MC_BENCH_PROMOTION_LATENCY=1):

The CI test proves MEMORY served the post-promotion read via the offload-RPC counter delta. For a richer empirical view, the benchmark times every read with time.perf_counter, samples LATENCY_SAMPLES reads per phase (default 1000), and reports p50/p95/p99 with a per-percentile speedup table. Pre-promotion reads bail the moment a read serves from MEMORY (detected via the same counter) so the LOCAL_DISK distribution stays uncontaminated.
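The percentile reporting can be sketched as follows (illustrative only; the real bench lives in BenchPromotionLatency, and `read_fn` stands in for the timed store.get):

```python
import time
import statistics

def time_reads(read_fn, n=1000):
    """Time n reads with perf_counter and report p50/p95/p99 in ms."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        read_fn()
        samples.append((time.perf_counter() - t0) * 1e3)  # ms
    # quantiles(n=100) yields 99 cut points; index k-1 is the k-th percentile
    qs = statistics.quantiles(samples, n=100)
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```

A per-percentile speedup table then falls out of dividing the pre-promotion stats by the post-promotion stats key by key.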

Observed numbers from a docker-tmpfs run with 1 MiB objects, N=1000 samples:

| Phase | n | p50 | p95 | p99 | min | max |
| --- | --- | --- | --- | --- | --- | --- |
| Pre-promotion (LOCAL_DISK via offload-RPC) | 1000 | 0.52 ms | 0.68 ms | 0.81 ms | 0.40 ms | 6.10 ms |
| Post-promotion (MEMORY direct) | 1000 | 0.24 ms | 0.34 ms | 0.42 ms | 0.18 ms | 2.36 ms |
| Speedup | | 2.1× | 2.0× | 1.9× | | |

The speedup holds across the distribution — MEMORY's win isn't a median trick. The 1.3× p50 assertion baked into the bench has comfortable headroom over the observed 2×. On docker tmpfs (no real SSD, no RDMA NIC) much of the per-read wall-time is Python/C++ marshalling overhead, so the SSD-vs-DRAM cost gap is compressed; on real hardware (NVMe + RDMA) the ratio is typically 10–50×.

The same numbers replicated at N=10000 with MOONCAKE_OFFLOAD_HEARTBEAT_INTERVAL_SECONDS=30 (so the worker doesn't race a longer pre-loop): p50/p95/p99 changed by ≤0.02 ms vs the N=1000 run, confirming N=1000 is the right default for stable tail percentiles.

Follow-ups (not in this PR)

  • Async promotion worker. Today the heartbeat thread serially executes the SSD read + RDMA write + notify per task with kMaxPerHeartbeat=1 to stay inside the client-liveness window. If real workloads show that the resulting drain rate (≈ 1 promotion per heartbeat interval) becomes a bottleneck, the obvious next step is a dedicated worker thread that pulls from a bounded queue: the heartbeat enqueues, the worker drains independently, and the cap can be relaxed. ~180 LOC + tests; deferred unless needed.
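The deferred design amounts to a bounded queue plus a worker thread. A sketch under stated assumptions (all names here, e.g. `PromotionWorker` and `try_enqueue`, are hypothetical, and `execute` stands in for the SSD read + RDMA write + notify):

```python
import queue
import threading

class PromotionWorker:
    """Heartbeat enqueues without blocking; a dedicated thread drains."""
    def __init__(self, execute, maxsize=64):
        self._q = queue.Queue(maxsize=maxsize)  # bounded: natural backpressure
        self._execute = execute
        self._t = threading.Thread(target=self._run, daemon=True)
        self._t.start()

    def try_enqueue(self, task):
        """Called from the heartbeat thread; never blocks the liveness window."""
        try:
            self._q.put_nowait(task)
            return True
        except queue.Full:
            return False  # leave the task queued master-side for a later tick

    def _run(self):
        while True:
            task = self._q.get()
            if task is None:
                break  # shutdown sentinel
            self._execute(task)
            self._q.task_done()

    def stop(self):
        self._q.put(None)
        self._t.join()
```

A full queue simply reports failure, so the master's queued state remains the source of truth, matching the "leftovers preserved by construction" posture above.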

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a 'promotion-on-hit' feature that asynchronously promotes objects from local SSD (L2) back to DRAM (L1) when they are frequently accessed. The implementation includes master-side gating logic using a Count-Min Sketch, new RPC endpoints for promotion orchestration, and a client-side execution loop in FileStorage. Feedback focuses on critical concurrency and performance issues: processing too many promotion tasks in a single heartbeat could block the client, and the current replica commitment logic in NotifyPromotionSuccess is ambiguous and could lead to data corruption if concurrent Put operations occur. Additionally, a type mismatch between the frequency sketch and the admission threshold was identified.

Comment thread mooncake-store/src/file_storage.cpp
Comment thread mooncake-store/src/master_service.cpp Outdated
Comment thread mooncake-store/src/master_service.cpp
@yzhan1 yzhan1 force-pushed the feature/l2-l1-promotion-on-hit branch 2 times, most recently from 9c38edf to 30845ca Compare May 10, 2026 06:28

codecov-commenter commented May 10, 2026

@yzhan1 yzhan1 force-pushed the feature/l2-l1-promotion-on-hit branch 3 times, most recently from 8b1d481 to fd69cf2 Compare May 10, 2026 07:55
@Srinivasoo7

Hi @yzhan1

Overall, the implementation follows the RFC.

I'm curious about the following issues:

  1. Promoted MEMORY replicas are appended after the existing LOCAL_DISK replica, but Client::Get still selects the first COMPLETE replica.

PromotionAllocStart appends the new PROCESSING MEMORY replica via metadata.AddReplicas, and NotifyPromotionSuccess only marks that appended replica COMPLETE.
However, normal Get still calls FindFirstCompleteReplica, which returns the first COMPLETE descriptor in master order. For a LOCAL_DISK-only key, the original LOCAL_DISK replica remains first, so after promotion, future Gets can still read from SSD instead of the new MEMORY replica. This means the main RFC goal, "subsequent Gets avoid SSD", is not guaranteed.

An ideal fix would be to either prefer MEMORY replicas before LOCAL_DISK in GetReplicaList, insert the promoted MEMORY replica before the LOCAL_DISK source, or update client replica selection to prefer MEMORY over LOCAL_DISK.

Also, it would be nice to add a test that proves the post-promotion read actually selects MEMORY, not just that MEMORY metadata exists.

  2. ProcessPromotionTasks drains all pending promotion objects but only processes one.

PromotionObjectHeartbeat moves the whole per-client promotion_objects map out of the master. Then FileStorage::ProcessPromotionTasks has kMaxPromotionsPerTick = 1 and breaks after the first item.
The remaining items are no longer in the client pending map, so they will not be retried on later heartbeats; their master promotion_tasks entries stay pinned until the stale-task reaper expires them. Under a burst of hot LOCAL_DISK keys, most queued promotions can be dropped and source refcounts remain pinned until timeout.

Could you keep unprocessed tasks queued, return only a bounded batch from the heartbeat, or process the full drained map with a byte/task budget that preserves leftovers?
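The bounded-batch option the reviewer suggests amounts to popping at most k entries per heartbeat while leaving the rest queued. An illustrative sketch (names are hypothetical, not the master's actual API):

```python
def drain_bounded(pending, k=1):
    """Pop at most k entries from the pending promotion map, leaving the
    remainder queued for later heartbeats — no silent drops."""
    batch = {}
    for key in list(pending)[:k]:
        batch[key] = pending.pop(key)
    return batch
```

Successive calls eventually return every queued entry exactly once, which is the "leftovers preserved by construction" property the later commits test for.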

@LujhCoconut
Collaborator

Overall, this PR makes sense to me from a design perspective. As a nice-to-have, would it be possible to supplement the results with a few more metrics? This would further strengthen the empirical validation, though it's by no means a blocker.

yzhan1 added a commit to yzhan1/Mooncake that referenced this pull request May 13, 2026
…elected e2e proof

Two PR-review issues from kvcache-ai#2071, addressed without changing the existing
commit:

1. Heartbeat could silently drop leftover promotion tasks.
   The prior fix to a per-tick cap lived on the client side: master
   drained the whole per-LDS promotion_objects map, client processed
   one, the rest evaporated. Their per-shard promotion_tasks entries
   stayed refcnt-pinned until the reaper TTL (~10 min), and the dedup
   gate blocked re-enqueue, so a burst of hot LOCAL_DISK keys could
   lose all but one promotion for that window.

   Move the cap server-side. MasterService::PromotionObjectHeartbeat
   now extracts at most kMaxPerHeartbeat (1) entries from the per-LDS
   promotion_objects map and leaves the remainder queued for subsequent
   heartbeats. The client-side per-tick cap is removed — the master is
   the bound. Leftover work is preserved by construction.

   New unit test HeartbeatBoundedBatchPreservesLeftovers proves: with
   three queued keys, three successive heartbeats each return exactly
   one distinct key, the fourth returns empty, and all three keys'
   master-side state stays intact across the drain.

   The FakeClient harness in file_storage_promotion_test now mirrors
   the master's bounded-batch semantic on its mocked Heartbeat, so the
   existing client-side tests exercise the same drain shape production
   sees.

2. E2E couldn't prove subsequent reads avoided SSD.
   Post-promotion the test asserted bytes-back and the existence of a
   MEMORY replica, but bytes-back is the same whether MEMORY or
   LOCAL_DISK served the read (the offload-RPC path returns correct
   bytes too). The RFC's core invariant — "subsequent reads avoid
   SSD" — went unverified.

   Add an instrumented counter offload_rpc_read_count on RealClient
   that increments at every invocation of
   batch_get_into_offload_object_internal, the single chokepoint for
   LOCAL_DISK reads served via peer offload-RPC. Expose via the
   Python binding as MooncakeDistributedStore.get_offload_rpc_read_count().
   The e2e snapshots the counter around the post-promotion store.get
   and asserts it didn't move — a non-zero delta would mean the read
   served from SSD instead of the promoted MEMORY replica.
@LujhCoconut
Collaborator

Thank you for your patience. Overall, this PR looks quite complete to me. My final suggestion is regarding code quality:

[Nit] The const_cast in PromotionAllocStart to backfill alloc_id is safe at runtime, but the const qualifier on the map value type declares an intent of immutability that the two-phase emplace-then-mutate pattern contradicts. Removing const from promotion_tasks's value type would make the declaration match the actual usage and eliminate the cast, reducing the reasoning burden for future readers. No behavioral change.

- std::unordered_map<std::string, const PromotionTask> promotion_tasks;
+ std::unordered_map<std::string, PromotionTask> promotion_tasks;
- const_cast<PromotionTask&>(task_it->second).alloc_id = new_id;  
+ task_it->second.alloc_id = new_id;

The shard RW lock is already held in both TryPushPromotionQueue (emplace) and PromotionAllocStart (backfill), so mutability is safely scoped.

Note that offloading_tasks is declared as const OffloadingTask because it's fully initialized at emplace time and never mutated afterward. PromotionTask is different due to the two-phase alloc_id backfill, so the const qualifier is unnecessarily restrictive here.

@yzhan1
Author

yzhan1 commented May 14, 2026

@LujhCoconut thanks! I will revise based on your comment. Also have some perf tests that I'd like to add and some other minor bugs I'm fixing. Will address and push shortly

@stmatengss
Collaborator

stmatengss commented May 14, 2026

```
ERROR: test_mooncake_backend_cpu (unittest.loader._FailedTest.test_mooncake_backend_cpu)

ImportError: Failed to import test module: test_mooncake_backend_cpu
Traceback (most recent call last):
  File "/home/runner/work/Mooncake/Mooncake/test_env/lib/python3.12/site-packages/mooncake/pg.py", line 10, in <module>
    backend_module = importlib.import_module("mooncake.pg" + version_suffix)
  File "/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1324, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'mooncake.pg_2_12_0'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/unittest/loader.py", line 137, in loadTestsFromName
    module = __import__(module_name)
  File "/home/runner/work/Mooncake/Mooncake/mooncake-wheel/tests/test_mooncake_backend_cpu.py", line 7, in <module>
    from mooncake import pg
  File "/home/runner/work/Mooncake/Mooncake/test_env/lib/python3.12/site-packages/mooncake/pg.py", line 12, in <module>
    raise ImportError(
ImportError: Mooncake PG was not built against torch==2.12.0.
Open an issue at https://github.com/kvcache-ai/Mooncake/issues.
```

yzhan1 and others added 6 commits May 14, 2026 09:11
Reviewer asked for more metrics. The post-promotion offload-RPC counter
already proves MEMORY served the read; adding a wall-time comparison
gives the same evidence on a different axis and surfaces a concrete
"this is the user-visible win" number.

Phase 3 timed the 4 admission-threshold-clearing reads as the
LOCAL_DISK sample. Phase 5 expanded the single post-promotion read into
10 timed reads as the MEMORY sample, plus the existing counter
assertion. The test now prints a median / min / max table for each
phase and asserts the post-promotion median is at least 1.3x faster.

The threshold is intentionally loose: on docker tmpfs with small (1 MB)
objects much of the wall time is Python/C++ marshalling overhead and
the SSD-vs-DRAM gap shrinks to ~2-5x; on real SSDs + RDMA the gap is
typically 10-50x. The counter assertion is the hard correctness check;
the latency assertion catches gross regressions while the printed
summary serves as the primary empirical signal.

Observed in CI (docker tmpfs, 1 MB objects):
  pre-promotion  (LOCAL_DISK via offload-RPC): median 1.53 ms
  post-promotion (MEMORY direct):              median 0.31 ms
  speedup: 4.9x
Reviewer asked to keep a richer latency comparison in the repo but
NOT run it as a CI test. test_promotion_after_repeated_hits goes back
to the lean correctness shape: 4 reads to clear the admission gate,
1 read post-promotion + offload-RPC counter assertion to prove MEMORY
served, no timing. The full p50/p95/p99 latency comparison moves into
a sibling class BenchPromotionLatency gated by
@unittest.skipUnless(MC_BENCH_PROMOTION_LATENCY), so:

  python test_promotion_on_hit.py                              # 3 tests, 1 skipped — CI
  MC_BENCH_PROMOTION_LATENCY=1 python test_promotion_on_hit.py # full bench

The bench retains the per-read MEMORY-detection (snapshots
offload_rpc_read_count around each pre-promotion read and bails on
detection of MEMORY-served reads so the LOCAL_DISK distribution stays
uncontaminated) and the 1.3x p50 speedup assertion. Sample count is
LATENCY_SAMPLES (default 1000; verified stable to within noise at
N=10000 too — N=1000 is the sweet spot for stable p99 without paying
for 10x more wall time).

Observed bench numbers on docker tmpfs (1 MiB objects, N=1000):
  pre-promotion  (LOCAL_DISK via offload-RPC): p50=0.52 ms, p95=0.68 ms, p99=0.81 ms
  post-promotion (MEMORY direct):              p50=0.24 ms, p95=0.33 ms, p99=0.42 ms
  speedup: p50=2.2x | p95=2.0x | p99=1.9x

Real-hardware speedup is typically 10-50x because the SSD-vs-DRAM cost
gap dominates only when objects are large and the FFI/marshalling
overhead is fractional.
…obal cap, const, misconfig log)

Addresses reviewer feedback on PR kvcache-ai#2071.

Issue 1 -- orphaned PROCESSING MEMORY replica leak. The promotion task
reaper only dropped the source LOCAL_DISK refcnt and erased the task
entry; it never popped the staged PROCESSING MEMORY replica added by
PromotionAllocStart. That replica is not in shard->processing_keys, so
DiscardExpiredProcessingReplicas could not sweep it and the buffer
leaked until the object was removed or evicted. Fix: in the reaper,
when alloc_id != 0, call metadata.EraseReplicaByID(alloc_id) to pop
the staged replica and return its buffer to the allocator.

Issue 2 -- per-shard cap was wrong for skewed workloads. The old gate
was 'shard->size() * kNumShards >= limit', approximately right for
uniform workloads but ~1024x too eager on skewed workloads where hot
keys cluster in few shards. Replace with a cluster-wide
std::atomic<uint64_t> promotion_in_flight_ counter. Incremented in
TryPushPromotionQueue after successful emplace; decremented in
NotifyPromotionSuccess and in the reaper. memory_order_relaxed since
the value is advisory; the per-shard mutex already serializes inserts
within a shard and the dedup gate prevents duplicate work.

Issue 3 -- const_cast smell. The promotion_tasks map held
const PromotionTask values for "generic safety", forcing a
const_cast<PromotionTask&> in PromotionAllocStart to set alloc_id
under the shard write lock. Drop the const; PromotionAllocStart now
sets task_it->second.alloc_id = new_id directly.

Misconfig log -- emit LOG(WARNING) at startup when
config.promotion_on_hit=true but enable_offload=false. Promotion
requires offload to produce LOCAL_DISK replicas, so it is silently
disabled in that combination; the log makes the disablement
discoverable to operators.

Tests
-----
- New ReaperPopsStagedMemoryReplicaOnExpiry: regression for Issue 1.
  Uses QuerySegments(seg).first (used bytes) to observe that the
  staged PROCESSING MEMORY replica's buffer is freed back to the DRAM
  allocator after the reaper sweeps the expired task.
- New QueueLimitRejectsCrossShard: regression for Issue 2. With
  queue_limit=1, proves a second admission attempt on a *different
  shard* is rejected -- exactly the case the old per-shard cap
  admitted incorrectly.
- Updated comment on existing QueueLimitRejectsBeyondCap to reflect
  the cluster-wide counter semantics.

Verification
------------
- promotion_on_hit_test: 15/15 pass (5 consecutive clean runs)
- file_storage_promotion_test: 9/9 pass
- master_service_promotion_test_for_snapshot: 5/5 pass
- offload_on_evict_test: 9/9 pass
- code_format.sh --base upstream/main (clang-format-20 in container):
  3 files reformatted (master_service.h, master_service.cpp,
  promotion_on_hit_test.cpp); all others "Already formatted"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…est + reaper poll

Follow-up to 2cee8eb addressing two further reviewer concerns and
recovering one edit batch that did not land in that commit.

AllocStart deadline reset
-------------------------
A queued promotion task could enter the active-transfer phase
(PromotionAllocStart -> client SSD read -> RDMA write -> Notify) with
only a sliver of TTL remaining if the holder's heartbeat queue was
backlogged (kMaxPerHeartbeat = 1 serializes work). If the active
transfer of a large object then ran past the TTL, the reaper would
call EraseReplicaByID on the staged MEMORY replica mid-transfer; the
allocator could then hand that buffer to a concurrent Put while the
client's RDMA write was still landing into it.

Fix: PromotionAllocStart now resets task.start_time when it stages
the PROCESSING MEMORY replica. The queue-wait phase (alloc_id == 0,
no buffer) and the active-transfer phase (alloc_id != 0, buffer
in flight) each get their own full put_start_release_timeout_sec_
TTL window, measured from when they began.

New regression test AllocStartResetsTaskDeadline drives the
queue_wait_seconds + alloc_active_seconds > TTL scenario and asserts
via QuerySegments(seg).first that the staged buffer survives past
the bare TTL when measured from admission.

NotifyPromotionSuccess counter coverage
---------------------------------------
The success path of NotifyPromotionSuccess was not exercised by any
existing test — all call sites in the suite asserted error paths.
2cee8eb added a fetch_sub on the success path but had no test to
catch a missing decrement.

New test NotifySuccessDecrementsCounter sets queue_limit=1, drives
the full happy path (GetReplicaList -> PromotionAllocStart ->
NotifyPromotionSuccess), then asserts a second admission on a
different key is allowed. Without the success-path fetch_sub, the
cap stays saturated at 1 and the second admission is dropped. Bug
discrimination verified manually: temporarily removing fetch_sub
fails this test as expected.

Reaper test poll-instead-of-sleep
---------------------------------
ReaperPopsStagedMemoryReplicaOnExpiry used a fixed 2-second sleep
after a 1-second TTL. Under suite load the eviction thread cadence
plus general scheduling jitter could push past the 2-second mark
~1 in 6 runs, producing a flake matching the StalePromotionReaper
class. Replace with a 5-second polling deadline against
QuerySegments(seg).first; the test now passes 5/5 stability runs.

QueueLimitRejectsBeyondCap stale block comment
----------------------------------------------
The test's docblock still described the old per-shard gate (gate:
shard_size * kNumShards >= limit). Update it to reflect the
cluster-wide promotion_in_flight_ counter and cross-reference
QueueLimitRejectsCrossShard for the cross-shard case.

Verification
------------
- promotion_on_hit_test: 17/17 pass × 5 consecutive clean runs
- New tests pass; mutation-test of NotifySuccessDecrementsCounter
  against a removed success-path fetch_sub fails as designed
- code_format.sh --base upstream/main (clang-format-20 in container):
  3 files reformatted; all others already-formatted

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yzhan1 yzhan1 force-pushed the feature/l2-l1-promotion-on-hit branch from 3b0a1da to 3ff606b Compare May 14, 2026 17:55