[Store] L2->L1 promotion-on-hit: end-to-end byte movement + extended test coverage by yzhan1 · Pull Request #1 · yzhan1/Mooncake

yzhan1 · 2026-05-06T18:17:29Z

Description

Mirror of the offload-on-evict feature (kvcache-ai#1899) in the reverse direction: when a key exists only as a LOCAL_DISK replica and a client reads it past the admission threshold, the master enqueues a promotion task on the holder, FileStorage drains it via heartbeat, and the bytes are staged back into a fresh MEMORY replica via Transfer Engine before the master flips it COMPLETE.

This branch has 4 commits total. The first two (d10a0bf master-side state machine, bddd0fc initial gate tests) were posted earlier; this PR adds:

d3edf27 — client byte movement + RPC wiring. RPCs through rpc_service/master_client, Client::PromotionWrite wrapping TE, FileStorage::ProcessPromotionTasks orchestration, pybind is_local_disk_replica accessor, scripts/run_tests.sh env-gated CI block. Makes the feature work end-to-end.
aecf651 — extended test coverage (concurrency, multi-segment alloc, FileStorage fault injection, snapshot round-trip, Python negative e2e). 31 tests total.

End-to-end flow after this PR:

client A: GET k  ──▶  master GetReplicaList
                        ├─ if !any_memory && any_local_disk and freq > threshold,
                        │   queue PromotionTask, push to holder's promotion_objects
                        ▼
client B (holder of LOCAL_DISK k):
   FileStorage::Heartbeat tick
     ├─ PromotionObjectHeartbeat (drain queue)
     └─ for each (key, size):
         ├─ master.PromotionAllocStart  ──▶ stage PROCESSING MEMORY
         ├─ FileStorage.AllocateBatch  (O_DIRECT staging buffer)
         ├─ storage_backend.BatchLoad  (SSD → staging)
         ├─ Client.PromotionWrite      (staging → MEMORY via TE)
         └─ master.NotifyPromotionSuccess  ──▶ flip COMPLETE

Per-key failures are logged and skipped; the master-side reaper (EvictionThreadFunc) drops orphaned PromotionTasks and decrements the source replica's refcnt after put_start_release_timeout_sec.

Module

Mooncake Store (mooncake-store)
Integration (mooncake-integration)
Python Wheel (mooncake-wheel)
CI/CD

Type of Change

New feature

How Has This Been Tested?

Suite	Tests	Result
`promotion_on_hit_test` (master gate)	12	12/12 pass (~14s) — 6 original gate + 3 concurrency/reaper + 3 multi-segment alloc
`file_storage_promotion_test` (NEW)	10	10/10 pass (~0.4s) — FakeClient subclass drives `ProcessPromotionTasks`; covers heartbeat protocol, alloc/load/write/notify failures, per-key independence
`master_service_promotion_test_for_snapshot` (NEW)	7	7/7 pass (~14s) — LOCAL_DISK + mixed replicas + multi-holder + in-flight task all round-trip via `TestSnapshotAndRestore` in TearDown
`test_promotion_on_hit.py`	2	2/2 pass (~78s) — positive (promotion fires after threshold) + negative (below-watermark workload stays MEMORY-only)

Regression check: no failures in master_service_test, offload_on_evict_test, file_storage_test.

Operator setup for the e2e:

mooncake_master \
    --enable_offload=true \
    --offload_on_evict=true \
    --promotion_on_hit=true \
    --promotion_admission_threshold=2 \
    --root_fs_dir=/path/to/local/disk

Client side requires enable_ssd_offload=True in MooncakeDistributedStore.setup. For test workloads under 256 MB / 500 keys, also set MOONCAKE_OFFLOAD_BUCKET_KEYS_LIMIT and MOONCAKE_OFFLOAD_BUCKET_SIZE_LIMIT_BYTES low enough that the bucket backend actually flushes (the defaults silently buffer in the ungrouped pool — same gotcha as the offload e2e).

Known follow-ups, not in this PR:

No metrics yet — MasterMetricManager has no promotion counters. Should land before flipping default-on for production clients.
Client::TransferRead LOCAL_DISK bug at client_service.cpp:2675 is pre-existing and orthogonal; the e2e sidesteps it by using batch_get_replica_desc (which calls GetReplicaList, the real promotion trigger) rather than store.get. Single-key store.get of a LOCAL_DISK-only key still throws "Expected DiskDescriptor".
Orphaned PROCESSING memory replica — if AllocStart succeeds but Notify never arrives, the reaper drops the task entry and source refcnt but doesn't remove the staged PROCESSING replica from metadata. Safe (invisible to readers) but leaks until master restart.

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before submitting.
I have updated the documentation.
I have added tests to prove my changes are effective.

…elected e2e proof Two PR-review issues from kvcache-ai#2071, addressed without changing the existing commit: 1. Heartbeat could silently drop leftover promotion tasks. The prior fix to a per-tick cap lived on the client side: master drained the whole per-LDS promotion_objects map, client processed one, the rest evaporated. Their per-shard promotion_tasks entries stayed refcnt-pinned until the reaper TTL (~10 min), and the dedup gate blocked re-enqueue, so a burst of hot LOCAL_DISK keys could lose all but one promotion for that window. Move the cap server-side. MasterService::PromotionObjectHeartbeat now extracts at most kMaxPerHeartbeat (1) entries from the per-LDS promotion_objects map and leaves the remainder queued for subsequent heartbeats. The client-side per-tick cap is removed — the master is the bound. Leftover work is preserved by construction. New unit test HeartbeatBoundedBatchPreservesLeftovers proves: with three queued keys, three successive heartbeats each return exactly one distinct key, the fourth returns empty, and all three keys' master-side state stays intact across the drain. The FakeClient harness in file_storage_promotion_test now mirrors the master's bounded-batch semantic on its mocked Heartbeat, so the existing client-side tests exercise the same drain shape production sees. 2. E2E couldn't prove subsequent reads avoided SSD. Post-promotion the test asserted bytes-back and the existence of a MEMORY replica, but bytes-back is the same whether MEMORY or LOCAL_DISK served the read (the offload-RPC path returns correct bytes too). The RFC's core invariant — "subsequent reads avoid SSD" — went unverified. Add an instrumented counter offload_rpc_read_count on RealClient that increments at every invocation of batch_get_into_offload_object_internal, the single chokepoint for LOCAL_DISK reads served via peer offload-RPC. Expose via the Python binding as MooncakeDistributedStore.get_offload_rpc_read_count(). The e2e snapshots the counter around the post-promotion store.get and asserts it didn't move — a non-zero delta would mean the read served from SSD instead of the promoted MEMORY replica.

Reviewer asked for more metrics. The post-promotion offload-RPC counter already proves MEMORY served the read; adding a wall-time comparison gives the same evidence on a different axis and surfaces a concrete "this is the user-visible win" number. Phase 3 timed the 4 admission-threshold-clearing reads as the LOCAL_DISK sample. Phase 5 expanded the single post-promotion read into 10 timed reads as the MEMORY sample, plus the existing counter assertion. The test now prints a median / min / max table for each phase and asserts the post-promotion median is at least 1.3x faster. The threshold is intentionally loose: on docker tmpfs with small (1 MB) objects much of the wall time is Python/C++ marshalling overhead and the SSD-vs-DRAM gap shrinks to ~2-5x; on real SSDs + RDMA the gap is typically 10-50x. The counter assertion is the hard correctness check; the latency assertion catches gross regressions while the printed summary serves as the primary empirical signal. Observed in CI (docker tmpfs, 1 MB objects): pre-promotion (LOCAL_DISK via offload-RPC): median 1.53 ms post-promotion (MEMORY direct): median 0.31 ms speedup: 4.9x

Reviewer asked to keep a richer latency comparison in the repo but NOT run it as a CI test. test_promotion_after_repeated_hits goes back to the lean correctness shape: 4 reads to clear the admission gate, 1 read post-promotion + offload-RPC counter assertion to prove MEMORY served, no timing. The full p50/p95/p99 latency comparison moves into a sibling class BenchPromotionLatency gated by @unittest.skipUnless(MC_BENCH_PROMOTION_LATENCY), so: python test_promotion_on_hit.py # 3 tests, 1 skipped — CI MC_BENCH_PROMOTION_LATENCY=1 python test_promotion_on_hit.py # full bench The bench retains the per-read MEMORY-detection (snapshots offload_rpc_read_count around each pre-promotion read and bails on detection of MEMORY-served reads so the LOCAL_DISK distribution stays uncontaminated) and the 1.3x p50 speedup assertion. Sample count is LATENCY_SAMPLES (default 1000; verified stable to within noise at N=10000 too — N=1000 is the sweet spot for stable p99 without paying for 10x more wall time). Observed bench numbers on docker tmpfs (1 MiB objects, N=1000): pre-promotion (LOCAL_DISK via offload-RPC): p50=0.52 ms, p95=0.68 ms, p99=0.81 ms post-promotion (MEMORY direct): p50=0.24 ms, p95=0.33 ms, p99=0.42 ms speedup: p50=2.2x | p95=2.0x | p99=1.9x Real-hardware speedup is typically 10-50x because the SSD-vs-DRAM cost gap dominates only when objects are large and the FFI/marshalling overhead is fractional.

…obal cap, const, misconfig log) Addresses reviewer feedback on PR kvcache-ai#2071. Issue 1 -- orphaned PROCESSING MEMORY replica leak. The promotion task reaper only dropped the source LOCAL_DISK refcnt and erased the task entry; it never popped the staged PROCESSING MEMORY replica added by PromotionAllocStart. That replica is not in shard->processing_keys, so DiscardExpiredProcessingReplicas could not sweep it and the buffer leaked until the object was removed or evicted. Fix: in the reaper, when alloc_id != 0, call metadata.EraseReplicaByID(alloc_id) to pop the staged replica and return its buffer to the allocator. Issue 2 -- per-shard cap was wrong for skewed workloads. The old gate was 'shard->size() * kNumShards >= limit', approximately right for uniform workloads but ~1024x too eager on skewed workloads where hot keys cluster in few shards. Replace with a cluster-wide std::atomic<uint64_t> promotion_in_flight_ counter. Incremented in TryPushPromotionQueue after successful emplace; decremented in NotifyPromotionSuccess and in the reaper. memory_order_relaxed since the value is advisory; the per-shard mutex already serializes inserts within a shard and the dedup gate prevents duplicate work. Issue 3 -- const_cast smell. The promotion_tasks map held const PromotionTask values for "generic safety", forcing a const_cast<PromotionTask&> in PromotionAllocStart to set alloc_id under the shard write lock. Drop the const; PromotionAllocStart now sets task_it->second.alloc_id = new_id directly. Misconfig log -- emit LOG(WARNING) at startup when config.promotion_on_hit=true but enable_offload=false. Promotion requires offload to produce LOCAL_DISK replicas, so it is silently disabled in that combination; the log makes the disablement discoverable to operators. Tests ----- - New ReaperPopsStagedMemoryReplicaOnExpiry: regression for Issue 1. Uses QuerySegments(seg).first (used bytes) to observe that the staged PROCESSING MEMORY replica's buffer is freed back to the DRAM allocator after the reaper sweeps the expired task. - New QueueLimitRejectsCrossShard: regression for Issue 2. With queue_limit=1, proves a second admission attempt on a *different shard* is rejected -- exactly the case the old per-shard cap admitted incorrectly. - Updated comment on existing QueueLimitRejectsBeyondCap to reflect the cluster-wide counter semantics. Verification ------------ - promotion_on_hit_test: 15/15 pass (5 consecutive clean runs) - file_storage_promotion_test: 9/9 pass - master_service_promotion_test_for_snapshot: 5/5 pass - offload_on_evict_test: 9/9 pass - code_format.sh --base upstream/main (clang-format-20 in container): 3 files reformatted (master_service.h, master_service.cpp, promotion_on_hit_test.cpp); all others "Already formatted"

…est + reaper poll Follow-up to 2cee8eb addressing two further reviewer concerns and recovering one edit batch that did not land in that commit. AllocStart deadline reset ------------------------- A queued promotion task could enter the active-transfer phase (PromotionAllocStart -> client SSD read -> RDMA write -> Notify) with only a sliver of TTL remaining if the holder's heartbeat queue was backlogged (kMaxPerHeartbeat = 1 serializes work). If the active transfer of a large object then ran past the TTL, the reaper would call EraseReplicaByID on the staged MEMORY replica mid-transfer; the allocator could then hand that buffer to a concurrent Put while the client's RDMA write was still landing into it. Fix: PromotionAllocStart now resets task.start_time when it stages the PROCESSING MEMORY replica. The queue-wait phase (alloc_id == 0, no buffer) and the active-transfer phase (alloc_id != 0, buffer in flight) each get their own full put_start_release_timeout_sec_ TTL window, measured from when they began. New regression test AllocStartResetsTaskDeadline drives the queue_wait_seconds + alloc_active_seconds > TTL scenario and asserts via QuerySegments(seg).first that the staged buffer survives past the bare TTL when measured from admission. NotifyPromotionSuccess counter coverage --------------------------------------- The success path of NotifyPromotionSuccess was not exercised by any existing test — all call sites in the suite asserted error paths. 2cee8eb added a fetch_sub on the success path but had no test to catch a missing decrement. New test NotifySuccessDecrementsCounter sets queue_limit=1, drives the full happy path (GetReplicaList -> PromotionAllocStart -> NotifyPromotionSuccess), then asserts a second admission on a different key is allowed. Without the success-path fetch_sub, the cap stays saturated at 1 and the second admission is dropped. Bug discrimination verified manually: temporarily removing fetch_sub fails this test as expected. Reaper test poll-instead-of-sleep --------------------------------- ReaperPopsStagedMemoryReplicaOnExpiry used a fixed 2-second sleep after a 1-second TTL. Under suite load the eviction thread cadence plus general scheduling jitter could push past the 2-second mark ~1 in 6 runs, producing a flake matching the StalePromotionReaper class. Replace with a 5-second polling deadline against QuerySegments(seg).first; the test now passes 5/5 stability runs. QueueLimitRejectsBeyondCap stale block comment ---------------------------------------------- The test's docblock still described the old per-shard gate (gate: shard_size * kNumShards >= limit). Update it to reflect the cluster-wide promotion_in_flight_ counter and cross-reference QueueLimitRejectsCrossShard for the cross-shard case. Verification ------------ - promotion_on_hit_test: 17/17 pass × 5 consecutive clean runs - New tests pass; mutation-test of NotifySuccessDecrementsCounter against a removed success-path fetch_sub fails as designed - code_format.sh --base upstream/main (clang-format-20 in container): 3 files reformatted; all others already-formatted

…ify holder check Two correctness fixes around the PromotionTask state machine, both symmetric to the staged-replica-leak class that the existing Issue 1 fix (reaper pops staged replica) addressed. Bug 1 -- AllocStart could orphan a staged replica ------------------------------------------------- If the reaper swept a PromotionTask between the holder's heartbeat returning the key and the eventual PromotionAllocStart RPC arriving (a hung client, GC pause, or HA failover stalling AllocStart past put_start_release_timeout_sec_), the old code path was: 1. allocator->Allocate(...) 2. metadata.AddReplicas(...) // stages PROCESSING MEMORY 3. shard->promotion_tasks.find(key) // returns end() 4. if (task_it != end) { record alloc_id } // silently skipped 5. return success // replica orphaned The staged PROCESSING MEMORY replica then had no PromotionTask pointing at it. DiscardExpiredProcessingReplicas only iterates shard->processing_keys (which AllocStart never populates), and the promotion-task reaper iterates shard->promotion_tasks (the entry is gone). The buffer leaked until the object itself was Remove'd or evicted. This is the symmetric counterpart of Issue 1: that one was "task survives but reaper fires after AllocStart"; this one is "AllocStart runs after task is reaped". Fix: lift the task lookup to the top of PromotionAllocStart, before allocation. Reject with REPLICA_IS_NOT_READY if the task is gone. The shard mutex is held by MetadataAccessorRW for the entire function, so the iterator stays valid across the allocator step. Regression test AllocStartRejectsReapedTask: admit a task, sleep past TTL so the reaper sweeps it, call AllocStart, assert REPLICA_IS_NOT_READY and that QuerySegments(seg).first is at baseline (no buffer was staged). Sanity-mutation verified: with the check removed the test fails as expected. Bug 2 -- NotifyPromotionSuccess didn't validate the caller ---------------------------------------------------------- PromotionTask did not record which client owned the source LOCAL_DISK replica, and NotifyPromotionSuccess accepted any client_id. Any client that knew the key could call Notify and flip the staged PROCESSING MEMORY replica to COMPLETE before the legitimate holder's RDMA write had landed, exposing torn data to readers via GetReplicaList. In single-tenant trusted-client deployments this surfaces as a silent corruption when a buggy client tracks state wrong (e.g., issuing Notify on a key it never offloaded). In multi-tenant deployments (offload RFC kvcache-ai#1836) it is a security gap. Fix: capture the holder client_id in PromotionTask at admission (TryPushPromotionQueue already resolves it via get_local_disk_client_id() for the PushPromotionQueue routing). At NotifyPromotionSuccess, return INVALID_PARAMS if task.holder_id != client_id. Regression test NotifyRejectsNonHolder: stage a promotion, call Notify with an unrelated client_id, assert INVALID_PARAMS and that the staged replica is still PROCESSING (no COMPLETE memory replica visible via GetReplicaList). Then call Notify with the legitimate holder client_id and assert it succeeds. Sanity-mutation verified: with the check removed the test fails as expected. Test churn ---------- MultiSegmentAllocPicksAvailableSegment and MultiSegmentAllocRespectsPreferred previously bypassed the gate and called PromotionAllocStart directly (the old contract permitted this; AllocStart silently no-op'd the task assignment). With the new task-existence gate they must seed a task first via GetReplicaList. Both tests now do that as a one-line setup before their multi-segment assertions. Behavior under test (segment selection) is unchanged. Verification ------------ - promotion_on_hit_test: 19/19 pass x 5 consecutive runs (was 17) - file_storage_promotion_test: 9/9 pass - master_service_promotion_test_for_snapshot: 5/5 pass - offload_on_evict_test: 9/9 pass - Both new tests verified bug-discriminating via temporary code mutation; both fail under the buggy condition and pass under the fix. - code_format.sh --base upstream/main (clang-format-20 in container): 3 files reformatted; all others already-formatted.

…lient failure Two related additions tightening the PromotionTask lifecycle, plus a ClearInvalidHandles cleanup. The HA-failover orphan path is left as-is - see "HA failover trade-off" below. AllocStart caller authorization ------------------------------- PromotionAllocStart now takes a client_id parameter and rejects calls where client_id != task.holder_id with INVALID_PARAMS. Without this gate, any client knowing the key could call AllocStart and have the master allocate a DRAM buffer plus stage a PROCESSING MEMORY replica against someone else's task; the legitimate holder's subsequent AllocStart would then race against an already-staged replica it didn't allocate. The auth check mirrors the holder-id gate already in place on NotifyPromotionSuccess and the new NotifyPromotionFailure. Also adds a size check: AllocStart rejects requests whose size doesn't match task.object_size (captured from the source LOCAL_DISK descriptor at admission). Defense-in-depth against a buggy caller requesting an arbitrary allocation size. NotifyPromotionFailure: eager release on client failure ------------------------------------------------------- New holder-side RPC for the explicit-abort case: the client got past PromotionAllocStart but a downstream step (local SSD read, RDMA write, NotifyPromotionSuccess) failed. Without this, every transient client-side failure pinned a task slot and a staged DRAM buffer for put_start_release_timeout_sec_ (~10 min default). On busy clusters that turned every flake into a 10-minute outage of promotion_queue_limit_. Semantics mirror the reaper's expiry path: holder-id auth check, drops source LOCAL_DISK refcnt, pops the staged PROCESSING MEMORY replica via EraseReplicaByID when alloc_id != 0, erases the task, decrements promotion_in_flight_, clears the per-client promotion_objects entry. Idempotent: missing task entry returns OK. file_storage.cpp calls NotifyPromotionFailure on every failure path in ProcessPromotionTasks: AllocStart failure (slot was claimed at TryPushPromotionQueue time, needs release even before any buffer was staged), AllocateBatch failure, SSD BatchLoad failure, TransferWrite failure, and NotifyPromotionSuccess failure. Each call is best-effort with the reaper as the long-stop. ClearInvalidHandles: clean up promotion_tasks on client expiry -------------------------------------------------------------- ClientMonitorFunc's ClearInvalidHandles previously cleaned up processing_keys, replication_tasks, and offloading_tasks when a client's metadata was erased, but skipped promotion_tasks. After holder-client expiry the dangling task pinned its slot for up to put_start_release_timeout_sec_; rolling restarts of many holders could saturate promotion_queue_limit_ for the full TTL. Now erased alongside the others, with matching promotion_in_flight_ decrement. HA failover trade-off (intentionally not addressed) --------------------------------------------------- The master snapshot persists metadata.replicas_ but not the per-shard task-tracking maps. After a leader change, an in-flight promotion's staged PROCESSING MEMORY replica survives in metadata but has no promotion_tasks entry pointing at it. The replica's allocator handle keeps its buffer reserved (so a subsequent Put cannot reuse the buffer - no UAF), but the buffer is leaked until the key is Remove'd, and the key itself silently loses future promotion eligibility because the HasReplica(fn_is_memory_replica) gate sees the orphan and blocks new admissions. Affected blast radius per failover is bounded by in_flight_at_snapshot * object_size. This is a pre-existing failure mode shared with PutStart, CopyStart, and MoveStart. An earlier iteration of this PR included a restore- side sweep that popped all PROCESSING replicas; on further review the sweep introduced a worse failure mode (in-flight RDMA writes from the previous epoch landing into newly-allocated buffers after the sweep released the allocator handle). A proper fix needs synchronous OpLog of pre-commit operations and is out of scope here. Tests ----- promotion_on_hit_test (24 tests, all pass): - AllocStartRejectsNonHolder, AllocStartRejectsSizeMismatch - NotifyFailureReleasesStateImmediately, NotifyFailureRejectsNonHolder - ClientExpiryClearsPromotionTask - existing tests updated to thread client_id through AllocStart file_storage_promotion_test (11 tests, all pass): - AllocStartFailureNotifiesMaster - PostAllocFailuresAllNotifyMaster Verification (in-container build + run): promotion_on_hit_test: 24/24 pass file_storage_promotion_test: 11/11 pass master_service_promotion_test_for_snapshot: 5/5 pass offload_on_evict_test: 9/9 pass master_service_test: 98/98 pass

…-mid-promotion cleanup Two reviewer-flagged correctness fixes around the PromotionTask lifecycle. promotion_admission_threshold lower-bound clamp ----------------------------------------------- `master.cpp` clamped the configured threshold at >255 but had no lower-bound clamp. With threshold=0, `master_service.cpp`'s `if (freq < threshold)` is always false (freq is uint8_t, can't be negative), so every Get on a LOCAL_DISK-only key admits a promotion and the frequency gate is silently bypassed. Operators setting the flag to 0 expecting "off" would instead get "always on", and the existing comment claiming "[1, 255]" clamping was wrong. Two-layer fix: - master.cpp: clamp 0 → 1 at flag-parse time with a LOG(WARNING) pointing operators at the intended minimum. - master_service.cpp constructor: defense-in-depth clamp on `promotion_admission_threshold_` for direct-construction paths (tests, embedded users) that bypass the flag parser. Mirrors the upper-bound clamp at 255. The misleading "[1, 255]" comment at the gate site is updated to note both bounds are enforced at parse time *and* in the constructor. Metadata-erase paths now drop in-flight PromotionTask entries ------------------------------------------------------------- Five paths erase an ObjectMetadata entry; ClearInvalidHandles was the only one that also wiped the matching promotion_tasks entry and decremented promotion_in_flight_. The other four left the task entry dangling for up to put_start_release_timeout_sec_ (~10 min default), pinning a slot of promotion_queue_limit_ and confusing subsequent NotifyPromotionSuccess RPCs on the key with a stale REPLICA_IS_NOT_READY. Sites fixed: - Remove(force=true) — master_service.cpp:1948 - RemoveByRegex — master_service.cpp:2002 - UpsertStart stale-handle erase — master_service.cpp:1242 - UpsertStart preempt erase — master_service.cpp:1282 - MetadataAccessorRW auto-cleanup ctor — master_service.h:1144-1148 All five now call a new private helper: void ErasePromotionTaskIfPresent(MetadataShardAccessorRW& shard, const std::string& key) NO_THREAD_SAFETY_ANALYSIS { if (shard->promotion_tasks.erase(key) > 0) { promotion_in_flight_.fetch_sub(1, std::memory_order_relaxed); } } Helper is documented at the declaration site only; call sites are plain one-liners. Why this is asymmetric with offload ----------------------------------- The codebase already tolerates orphan offloading_tasks at the same sites — offload has no cluster-wide in-flight counter to brick, and NotifyOffloadSuccess gracefully recreates missing metadata via AddReplica (because the bytes live on the client's SSD). Promotion can't do the same on commit (bytes live in master-allocated DRAM that gets deallocated alongside the metadata-erase, so committing those bytes risks surfacing torn or cross-key data — see the REPLICA_IS_NOT_READY rationale on NotifyPromotionSuccess). So the eager-erase is promotion-specific and addresses both the cap-slot leak and the user-visible REPLICA_IS_NOT_READY oddity during the orphan window. Tests ----- Four new tests in promotion_on_hit_test.cpp (suite now 28/28): - AdmissionThresholdZeroClampsToOne: configure threshold=0, assert the constructor lifts it to 1. - AdmissionThresholdAboveMaxClampsToMax: configure threshold=1000, assert clamp to 255. - RemoveErasesPromotionTask: with queue_limit=1, admit a task, Remove(force=true) the key, then admit a different key — must succeed because the slot was freed by Remove. - RemoveByRegexErasesPromotionTask: same shape via the regex path. Both clamp tests need friend access to the now-private promotion_admission_threshold_ field. Added a static GetPromotionAdmissionThresholdForTesting funnel on the fixture (since gtest TEST_F subclasses don't inherit friendship). The missing `friend class test::PromotionOnHitTest` declaration in master_service.h was re-added (it had been dropped during an earlier rebase). Bug-discrimination verified for each test via temporary mutation: - Clamp removed → both clamp tests fail. - ErasePromotionTaskIfPresent body removed → both Erases* tests fail. Verification (in-container build + run): promotion_on_hit_test: 28/28 pass file_storage_promotion_test: 11/11 pass master_service_promotion_test_for_snapshot: 5/5 pass offload_on_evict_test: 9/9 pass Formatted with scripts/code_format.sh --base upstream/main (clang-format-20 in container): 4 files reformatted, all others already-formatted.

…acts, dedup, shorten Comment-only cleanup, no code logic changes. - Drop internal review-discussion artifacts: "Issue 1/2/3 regression", "Bug 1/2 regression", "Pre-fix this would..." — these reference our internal triage labels that a fresh reader has no context for. Test docblocks rewritten to describe the behavior under test. - Drop numbered test enumeration (// 1., // 2., ...) in file_storage_promotion_test.cpp. - Drop a 36-line meta-narrative in promotion_on_hit_test.cpp's InjectLocalDiskReplica helper that walked through alternative implementations considered. Replaced with a 3-line description of what the helper actually does. - Dedup holder-id auth explanation: the canonical place is the PromotionTask struct doc. Inline checks in PromotionAllocStart, NotifyPromotionSuccess, and NotifyPromotionFailure collapse to a one-liner pointing back to the struct doc. - Dedup the "Mirror the reaper's expiry path" block in NotifyPromotionFailure to point at DiscardExpiredProcessingReplicas Part 4 (the canonical rationale) instead of restating each step. - Trim long comments in master_service.h (PromotionTask struct doc), master_service.cpp (AllocStart start_time reset, reaper Part 4, PromotionObjectHeartbeat cap rationale), and file_storage.cpp (post-AllocStart failure rationale, AllocStart-failure block). - Replace "v1" anchors ("in v1", "v1 doesn't pass...") with current-state phrasing. Build/tests unchanged: promotion_on_hit_test 28/28, file_storage_promotion_test 11/11. Format check passes.

The previous run's build-wheel-cu13 (3.10 and 3.12) jobs were cancelled by runner shutdown signals mid-build. Empty commit to trigger a fresh CI run. No code changes.

yzhan1 force-pushed the feature/l2-l1-promotion-on-hit branch 12 times, most recently from 8b1d481 to fd69cf2 Compare May 10, 2026 07:55

yzhan1 force-pushed the feature/l2-l1-promotion-on-hit branch 5 times, most recently from f163af1 to 672a911 Compare May 18, 2026 17:39

yzhan1 added 10 commits May 20, 2026 21:27

[Store] L2->L1 promotion-on-hit

7673cb1

yzhan1 force-pushed the feature/l2-l1-promotion-on-hit branch from 8012665 to ea3e2ff Compare May 21, 2026 04:43

[Store] L2->L1 promotion-on-hit: re-trigger CI

dee9f1a

The previous run's build-wheel-cu13 (3.10 and 3.12) jobs were cancelled by runner shutdown signals mid-build. Empty commit to trigger a fresh CI run. No code changes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Store] L2->L1 promotion-on-hit: end-to-end byte movement + extended test coverage#1

[Store] L2->L1 promotion-on-hit: end-to-end byte movement + extended test coverage#1
yzhan1 wants to merge 11 commits into
mainfrom
feature/l2-l1-promotion-on-hit

yzhan1 commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yzhan1 commented May 6, 2026

Description

Module

Type of Change

How Has This Been Tested?

Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant