[Store] L2->L1 promotion-on-hit: end-to-end byte movement + extended test coverage#1
Open
yzhan1 wants to merge 11 commits into
Open
[Store] L2->L1 promotion-on-hit: end-to-end byte movement + extended test coverage#1yzhan1 wants to merge 11 commits into
yzhan1 wants to merge 11 commits into
Conversation
8b1d481 to
fd69cf2
Compare
f163af1 to
672a911
Compare
…elected e2e proof Two PR-review issues from kvcache-ai#2071, addressed without changing the existing commit: 1. Heartbeat could silently drop leftover promotion tasks. The prior fix to a per-tick cap lived on the client side: master drained the whole per-LDS promotion_objects map, client processed one, the rest evaporated. Their per-shard promotion_tasks entries stayed refcnt-pinned until the reaper TTL (~10 min), and the dedup gate blocked re-enqueue, so a burst of hot LOCAL_DISK keys could lose all but one promotion for that window. Move the cap server-side. MasterService::PromotionObjectHeartbeat now extracts at most kMaxPerHeartbeat (1) entries from the per-LDS promotion_objects map and leaves the remainder queued for subsequent heartbeats. The client-side per-tick cap is removed — the master is the bound. Leftover work is preserved by construction. New unit test HeartbeatBoundedBatchPreservesLeftovers proves: with three queued keys, three successive heartbeats each return exactly one distinct key, the fourth returns empty, and all three keys' master-side state stays intact across the drain. The FakeClient harness in file_storage_promotion_test now mirrors the master's bounded-batch semantic on its mocked Heartbeat, so the existing client-side tests exercise the same drain shape production sees. 2. E2E couldn't prove subsequent reads avoided SSD. Post-promotion the test asserted bytes-back and the existence of a MEMORY replica, but bytes-back is the same whether MEMORY or LOCAL_DISK served the read (the offload-RPC path returns correct bytes too). The RFC's core invariant — "subsequent reads avoid SSD" — went unverified. Add an instrumented counter offload_rpc_read_count on RealClient that increments at every invocation of batch_get_into_offload_object_internal, the single chokepoint for LOCAL_DISK reads served via peer offload-RPC. Expose via the Python binding as MooncakeDistributedStore.get_offload_rpc_read_count(). The e2e snapshots the counter around the post-promotion store.get and asserts it didn't move — a non-zero delta would mean the read served from SSD instead of the promoted MEMORY replica.
Reviewer asked for more metrics. The post-promotion offload-RPC counter already proves MEMORY served the read; adding a wall-time comparison gives the same evidence on a different axis and surfaces a concrete "this is the user-visible win" number. Phase 3 timed the 4 admission-threshold-clearing reads as the LOCAL_DISK sample. Phase 5 expanded the single post-promotion read into 10 timed reads as the MEMORY sample, plus the existing counter assertion. The test now prints a median / min / max table for each phase and asserts the post-promotion median is at least 1.3x faster. The threshold is intentionally loose: on docker tmpfs with small (1 MB) objects much of the wall time is Python/C++ marshalling overhead and the SSD-vs-DRAM gap shrinks to ~2-5x; on real SSDs + RDMA the gap is typically 10-50x. The counter assertion is the hard correctness check; the latency assertion catches gross regressions while the printed summary serves as the primary empirical signal. Observed in CI (docker tmpfs, 1 MB objects): pre-promotion (LOCAL_DISK via offload-RPC): median 1.53 ms post-promotion (MEMORY direct): median 0.31 ms speedup: 4.9x
Reviewer asked to keep a richer latency comparison in the repo but NOT run it as a CI test. test_promotion_after_repeated_hits goes back to the lean correctness shape: 4 reads to clear the admission gate, 1 read post-promotion + offload-RPC counter assertion to prove MEMORY served, no timing. The full p50/p95/p99 latency comparison moves into a sibling class BenchPromotionLatency gated by @unittest.skipUnless(MC_BENCH_PROMOTION_LATENCY), so: python test_promotion_on_hit.py # 3 tests, 1 skipped — CI MC_BENCH_PROMOTION_LATENCY=1 python test_promotion_on_hit.py # full bench The bench retains the per-read MEMORY-detection (snapshots offload_rpc_read_count around each pre-promotion read and bails on detection of MEMORY-served reads so the LOCAL_DISK distribution stays uncontaminated) and the 1.3x p50 speedup assertion. Sample count is LATENCY_SAMPLES (default 1000; verified stable to within noise at N=10000 too — N=1000 is the sweet spot for stable p99 without paying for 10x more wall time). Observed bench numbers on docker tmpfs (1 MiB objects, N=1000): pre-promotion (LOCAL_DISK via offload-RPC): p50=0.52 ms, p95=0.68 ms, p99=0.81 ms post-promotion (MEMORY direct): p50=0.24 ms, p95=0.33 ms, p99=0.42 ms speedup: p50=2.2x | p95=2.0x | p99=1.9x Real-hardware speedup is typically 10-50x because the SSD-vs-DRAM cost gap dominates only when objects are large and the FFI/marshalling overhead is fractional.
…obal cap, const, misconfig log) Addresses reviewer feedback on PR kvcache-ai#2071. Issue 1 -- orphaned PROCESSING MEMORY replica leak. The promotion task reaper only dropped the source LOCAL_DISK refcnt and erased the task entry; it never popped the staged PROCESSING MEMORY replica added by PromotionAllocStart. That replica is not in shard->processing_keys, so DiscardExpiredProcessingReplicas could not sweep it and the buffer leaked until the object was removed or evicted. Fix: in the reaper, when alloc_id != 0, call metadata.EraseReplicaByID(alloc_id) to pop the staged replica and return its buffer to the allocator. Issue 2 -- per-shard cap was wrong for skewed workloads. The old gate was 'shard->size() * kNumShards >= limit', approximately right for uniform workloads but ~1024x too eager on skewed workloads where hot keys cluster in few shards. Replace with a cluster-wide std::atomic<uint64_t> promotion_in_flight_ counter. Incremented in TryPushPromotionQueue after successful emplace; decremented in NotifyPromotionSuccess and in the reaper. memory_order_relaxed since the value is advisory; the per-shard mutex already serializes inserts within a shard and the dedup gate prevents duplicate work. Issue 3 -- const_cast smell. The promotion_tasks map held const PromotionTask values for "generic safety", forcing a const_cast<PromotionTask&> in PromotionAllocStart to set alloc_id under the shard write lock. Drop the const; PromotionAllocStart now sets task_it->second.alloc_id = new_id directly. Misconfig log -- emit LOG(WARNING) at startup when config.promotion_on_hit=true but enable_offload=false. Promotion requires offload to produce LOCAL_DISK replicas, so it is silently disabled in that combination; the log makes the disablement discoverable to operators. Tests ----- - New ReaperPopsStagedMemoryReplicaOnExpiry: regression for Issue 1. Uses QuerySegments(seg).first (used bytes) to observe that the staged PROCESSING MEMORY replica's buffer is freed back to the DRAM allocator after the reaper sweeps the expired task. - New QueueLimitRejectsCrossShard: regression for Issue 2. With queue_limit=1, proves a second admission attempt on a *different shard* is rejected -- exactly the case the old per-shard cap admitted incorrectly. - Updated comment on existing QueueLimitRejectsBeyondCap to reflect the cluster-wide counter semantics. Verification ------------ - promotion_on_hit_test: 15/15 pass (5 consecutive clean runs) - file_storage_promotion_test: 9/9 pass - master_service_promotion_test_for_snapshot: 5/5 pass - offload_on_evict_test: 9/9 pass - code_format.sh --base upstream/main (clang-format-20 in container): 3 files reformatted (master_service.h, master_service.cpp, promotion_on_hit_test.cpp); all others "Already formatted"
…est + reaper poll Follow-up to 2cee8eb addressing two further reviewer concerns and recovering one edit batch that did not land in that commit. AllocStart deadline reset ------------------------- A queued promotion task could enter the active-transfer phase (PromotionAllocStart -> client SSD read -> RDMA write -> Notify) with only a sliver of TTL remaining if the holder's heartbeat queue was backlogged (kMaxPerHeartbeat = 1 serializes work). If the active transfer of a large object then ran past the TTL, the reaper would call EraseReplicaByID on the staged MEMORY replica mid-transfer; the allocator could then hand that buffer to a concurrent Put while the client's RDMA write was still landing into it. Fix: PromotionAllocStart now resets task.start_time when it stages the PROCESSING MEMORY replica. The queue-wait phase (alloc_id == 0, no buffer) and the active-transfer phase (alloc_id != 0, buffer in flight) each get their own full put_start_release_timeout_sec_ TTL window, measured from when they began. New regression test AllocStartResetsTaskDeadline drives the queue_wait_seconds + alloc_active_seconds > TTL scenario and asserts via QuerySegments(seg).first that the staged buffer survives past the bare TTL when measured from admission. NotifyPromotionSuccess counter coverage --------------------------------------- The success path of NotifyPromotionSuccess was not exercised by any existing test — all call sites in the suite asserted error paths. 2cee8eb added a fetch_sub on the success path but had no test to catch a missing decrement. New test NotifySuccessDecrementsCounter sets queue_limit=1, drives the full happy path (GetReplicaList -> PromotionAllocStart -> NotifyPromotionSuccess), then asserts a second admission on a different key is allowed. Without the success-path fetch_sub, the cap stays saturated at 1 and the second admission is dropped. Bug discrimination verified manually: temporarily removing fetch_sub fails this test as expected. Reaper test poll-instead-of-sleep --------------------------------- ReaperPopsStagedMemoryReplicaOnExpiry used a fixed 2-second sleep after a 1-second TTL. Under suite load the eviction thread cadence plus general scheduling jitter could push past the 2-second mark ~1 in 6 runs, producing a flake matching the StalePromotionReaper class. Replace with a 5-second polling deadline against QuerySegments(seg).first; the test now passes 5/5 stability runs. QueueLimitRejectsBeyondCap stale block comment ---------------------------------------------- The test's docblock still described the old per-shard gate (gate: shard_size * kNumShards >= limit). Update it to reflect the cluster-wide promotion_in_flight_ counter and cross-reference QueueLimitRejectsCrossShard for the cross-shard case. Verification ------------ - promotion_on_hit_test: 17/17 pass × 5 consecutive clean runs - New tests pass; mutation-test of NotifySuccessDecrementsCounter against a removed success-path fetch_sub fails as designed - code_format.sh --base upstream/main (clang-format-20 in container): 3 files reformatted; all others already-formatted
…ify holder check
Two correctness fixes around the PromotionTask state machine, both
symmetric to the staged-replica-leak class that the existing Issue 1
fix (reaper pops staged replica) addressed.
Bug 1 -- AllocStart could orphan a staged replica
-------------------------------------------------
If the reaper swept a PromotionTask between the holder's heartbeat
returning the key and the eventual PromotionAllocStart RPC arriving
(a hung client, GC pause, or HA failover stalling AllocStart past
put_start_release_timeout_sec_), the old code path was:
1. allocator->Allocate(...)
2. metadata.AddReplicas(...) // stages PROCESSING MEMORY
3. shard->promotion_tasks.find(key) // returns end()
4. if (task_it != end) { record alloc_id } // silently skipped
5. return success // replica orphaned
The staged PROCESSING MEMORY replica then had no PromotionTask
pointing at it. DiscardExpiredProcessingReplicas only iterates
shard->processing_keys (which AllocStart never populates), and the
promotion-task reaper iterates shard->promotion_tasks (the entry is
gone). The buffer leaked until the object itself was Remove'd or
evicted. This is the symmetric counterpart of Issue 1: that one was
"task survives but reaper fires after AllocStart"; this one is
"AllocStart runs after task is reaped".
Fix: lift the task lookup to the top of PromotionAllocStart, before
allocation. Reject with REPLICA_IS_NOT_READY if the task is gone.
The shard mutex is held by MetadataAccessorRW for the entire
function, so the iterator stays valid across the allocator step.
Regression test AllocStartRejectsReapedTask: admit a task, sleep
past TTL so the reaper sweeps it, call AllocStart, assert
REPLICA_IS_NOT_READY and that QuerySegments(seg).first is at
baseline (no buffer was staged). Sanity-mutation verified: with the
check removed the test fails as expected.
Bug 2 -- NotifyPromotionSuccess didn't validate the caller
----------------------------------------------------------
PromotionTask did not record which client owned the source
LOCAL_DISK replica, and NotifyPromotionSuccess accepted any
client_id. Any client that knew the key could call Notify and flip
the staged PROCESSING MEMORY replica to COMPLETE before the
legitimate holder's RDMA write had landed, exposing torn data to
readers via GetReplicaList.
In single-tenant trusted-client deployments this surfaces as a
silent corruption when a buggy client tracks state wrong (e.g.,
issuing Notify on a key it never offloaded). In multi-tenant
deployments (offload RFC kvcache-ai#1836) it is a security gap.
Fix: capture the holder client_id in PromotionTask at admission
(TryPushPromotionQueue already resolves it via
get_local_disk_client_id() for the PushPromotionQueue routing). At
NotifyPromotionSuccess, return INVALID_PARAMS if
task.holder_id != client_id.
Regression test NotifyRejectsNonHolder: stage a promotion, call
Notify with an unrelated client_id, assert INVALID_PARAMS and that
the staged replica is still PROCESSING (no COMPLETE memory replica
visible via GetReplicaList). Then call Notify with the legitimate
holder client_id and assert it succeeds. Sanity-mutation verified:
with the check removed the test fails as expected.
Test churn
----------
MultiSegmentAllocPicksAvailableSegment and
MultiSegmentAllocRespectsPreferred previously bypassed the gate and
called PromotionAllocStart directly (the old contract permitted
this; AllocStart silently no-op'd the task assignment). With the
new task-existence gate they must seed a task first via
GetReplicaList. Both tests now do that as a one-line setup before
their multi-segment assertions. Behavior under test (segment
selection) is unchanged.
Verification
------------
- promotion_on_hit_test: 19/19 pass x 5 consecutive runs (was 17)
- file_storage_promotion_test: 9/9 pass
- master_service_promotion_test_for_snapshot: 5/5 pass
- offload_on_evict_test: 9/9 pass
- Both new tests verified bug-discriminating via temporary code
mutation; both fail under the buggy condition and pass under the
fix.
- code_format.sh --base upstream/main (clang-format-20 in container):
3 files reformatted; all others already-formatted.
…lient failure Two related additions tightening the PromotionTask lifecycle, plus a ClearInvalidHandles cleanup. The HA-failover orphan path is left as-is - see "HA failover trade-off" below. AllocStart caller authorization ------------------------------- PromotionAllocStart now takes a client_id parameter and rejects calls where client_id != task.holder_id with INVALID_PARAMS. Without this gate, any client knowing the key could call AllocStart and have the master allocate a DRAM buffer plus stage a PROCESSING MEMORY replica against someone else's task; the legitimate holder's subsequent AllocStart would then race against an already-staged replica it didn't allocate. The auth check mirrors the holder-id gate already in place on NotifyPromotionSuccess and the new NotifyPromotionFailure. Also adds a size check: AllocStart rejects requests whose size doesn't match task.object_size (captured from the source LOCAL_DISK descriptor at admission). Defense-in-depth against a buggy caller requesting an arbitrary allocation size. NotifyPromotionFailure: eager release on client failure ------------------------------------------------------- New holder-side RPC for the explicit-abort case: the client got past PromotionAllocStart but a downstream step (local SSD read, RDMA write, NotifyPromotionSuccess) failed. Without this, every transient client-side failure pinned a task slot and a staged DRAM buffer for put_start_release_timeout_sec_ (~10 min default). On busy clusters that turned every flake into a 10-minute outage of promotion_queue_limit_. Semantics mirror the reaper's expiry path: holder-id auth check, drops source LOCAL_DISK refcnt, pops the staged PROCESSING MEMORY replica via EraseReplicaByID when alloc_id != 0, erases the task, decrements promotion_in_flight_, clears the per-client promotion_objects entry. Idempotent: missing task entry returns OK. file_storage.cpp calls NotifyPromotionFailure on every failure path in ProcessPromotionTasks: AllocStart failure (slot was claimed at TryPushPromotionQueue time, needs release even before any buffer was staged), AllocateBatch failure, SSD BatchLoad failure, TransferWrite failure, and NotifyPromotionSuccess failure. Each call is best-effort with the reaper as the long-stop. ClearInvalidHandles: clean up promotion_tasks on client expiry -------------------------------------------------------------- ClientMonitorFunc's ClearInvalidHandles previously cleaned up processing_keys, replication_tasks, and offloading_tasks when a client's metadata was erased, but skipped promotion_tasks. After holder-client expiry the dangling task pinned its slot for up to put_start_release_timeout_sec_; rolling restarts of many holders could saturate promotion_queue_limit_ for the full TTL. Now erased alongside the others, with matching promotion_in_flight_ decrement. HA failover trade-off (intentionally not addressed) --------------------------------------------------- The master snapshot persists metadata.replicas_ but not the per-shard task-tracking maps. After a leader change, an in-flight promotion's staged PROCESSING MEMORY replica survives in metadata but has no promotion_tasks entry pointing at it. The replica's allocator handle keeps its buffer reserved (so a subsequent Put cannot reuse the buffer - no UAF), but the buffer is leaked until the key is Remove'd, and the key itself silently loses future promotion eligibility because the HasReplica(fn_is_memory_replica) gate sees the orphan and blocks new admissions. Affected blast radius per failover is bounded by in_flight_at_snapshot * object_size. This is a pre-existing failure mode shared with PutStart, CopyStart, and MoveStart. An earlier iteration of this PR included a restore- side sweep that popped all PROCESSING replicas; on further review the sweep introduced a worse failure mode (in-flight RDMA writes from the previous epoch landing into newly-allocated buffers after the sweep released the allocator handle). A proper fix needs synchronous OpLog of pre-commit operations and is out of scope here. Tests ----- promotion_on_hit_test (24 tests, all pass): - AllocStartRejectsNonHolder, AllocStartRejectsSizeMismatch - NotifyFailureReleasesStateImmediately, NotifyFailureRejectsNonHolder - ClientExpiryClearsPromotionTask - existing tests updated to thread client_id through AllocStart file_storage_promotion_test (11 tests, all pass): - AllocStartFailureNotifiesMaster - PostAllocFailuresAllNotifyMaster Verification (in-container build + run): promotion_on_hit_test: 24/24 pass file_storage_promotion_test: 11/11 pass master_service_promotion_test_for_snapshot: 5/5 pass offload_on_evict_test: 9/9 pass master_service_test: 98/98 pass
…-mid-promotion cleanup
Two reviewer-flagged correctness fixes around the PromotionTask
lifecycle.
promotion_admission_threshold lower-bound clamp
-----------------------------------------------
`master.cpp` clamped the configured threshold at >255 but had no
lower-bound clamp. With threshold=0, `master_service.cpp`'s
`if (freq < threshold)` is always false (freq is uint8_t, can't be
negative), so every Get on a LOCAL_DISK-only key admits a promotion
and the frequency gate is silently bypassed. Operators setting the
flag to 0 expecting "off" would instead get "always on", and the
existing comment claiming "[1, 255]" clamping was wrong.
Two-layer fix:
- master.cpp: clamp 0 → 1 at flag-parse time with a LOG(WARNING)
pointing operators at the intended minimum.
- master_service.cpp constructor: defense-in-depth clamp on
`promotion_admission_threshold_` for direct-construction paths
(tests, embedded users) that bypass the flag parser. Mirrors the
upper-bound clamp at 255.
The misleading "[1, 255]" comment at the gate site is updated to note
both bounds are enforced at parse time *and* in the constructor.
Metadata-erase paths now drop in-flight PromotionTask entries
-------------------------------------------------------------
Five paths erase an ObjectMetadata entry; ClearInvalidHandles was the
only one that also wiped the matching promotion_tasks entry and
decremented promotion_in_flight_. The other four left the task entry
dangling for up to put_start_release_timeout_sec_ (~10 min default),
pinning a slot of promotion_queue_limit_ and confusing subsequent
NotifyPromotionSuccess RPCs on the key with a stale REPLICA_IS_NOT_READY.
Sites fixed:
- Remove(force=true) — master_service.cpp:1948
- RemoveByRegex — master_service.cpp:2002
- UpsertStart stale-handle erase — master_service.cpp:1242
- UpsertStart preempt erase — master_service.cpp:1282
- MetadataAccessorRW auto-cleanup ctor — master_service.h:1144-1148
All five now call a new private helper:
void ErasePromotionTaskIfPresent(MetadataShardAccessorRW& shard,
const std::string& key)
NO_THREAD_SAFETY_ANALYSIS {
if (shard->promotion_tasks.erase(key) > 0) {
promotion_in_flight_.fetch_sub(1, std::memory_order_relaxed);
}
}
Helper is documented at the declaration site only; call sites are
plain one-liners.
Why this is asymmetric with offload
-----------------------------------
The codebase already tolerates orphan offloading_tasks at the same
sites — offload has no cluster-wide in-flight counter to brick, and
NotifyOffloadSuccess gracefully recreates missing metadata via
AddReplica (because the bytes live on the client's SSD). Promotion
can't do the same on commit (bytes live in master-allocated DRAM
that gets deallocated alongside the metadata-erase, so committing
those bytes risks surfacing torn or cross-key data — see the
REPLICA_IS_NOT_READY rationale on NotifyPromotionSuccess). So the
eager-erase is promotion-specific and addresses both the cap-slot
leak and the user-visible REPLICA_IS_NOT_READY oddity during the
orphan window.
Tests
-----
Four new tests in promotion_on_hit_test.cpp (suite now 28/28):
- AdmissionThresholdZeroClampsToOne: configure threshold=0, assert
the constructor lifts it to 1.
- AdmissionThresholdAboveMaxClampsToMax: configure threshold=1000,
assert clamp to 255.
- RemoveErasesPromotionTask: with queue_limit=1, admit a task,
Remove(force=true) the key, then admit a different key — must
succeed because the slot was freed by Remove.
- RemoveByRegexErasesPromotionTask: same shape via the regex path.
Both clamp tests need friend access to the now-private
promotion_admission_threshold_ field. Added a static
GetPromotionAdmissionThresholdForTesting funnel on the fixture (since
gtest TEST_F subclasses don't inherit friendship). The missing
`friend class test::PromotionOnHitTest` declaration in
master_service.h was re-added (it had been dropped during an earlier
rebase).
Bug-discrimination verified for each test via temporary mutation:
- Clamp removed → both clamp tests fail.
- ErasePromotionTaskIfPresent body removed → both Erases* tests fail.
Verification (in-container build + run):
promotion_on_hit_test: 28/28 pass
file_storage_promotion_test: 11/11 pass
master_service_promotion_test_for_snapshot: 5/5 pass
offload_on_evict_test: 9/9 pass
Formatted with scripts/code_format.sh --base upstream/main
(clang-format-20 in container): 4 files reformatted, all others
already-formatted.
…acts, dedup, shorten
Comment-only cleanup, no code logic changes.
- Drop internal review-discussion artifacts: "Issue 1/2/3 regression",
"Bug 1/2 regression", "Pre-fix this would..." — these reference our
internal triage labels that a fresh reader has no context for. Test
docblocks rewritten to describe the behavior under test.
- Drop numbered test enumeration (// 1., // 2., ...) in
file_storage_promotion_test.cpp.
- Drop a 36-line meta-narrative in promotion_on_hit_test.cpp's
InjectLocalDiskReplica helper that walked through alternative
implementations considered. Replaced with a 3-line description of
what the helper actually does.
- Dedup holder-id auth explanation: the canonical place is the
PromotionTask struct doc. Inline checks in PromotionAllocStart,
NotifyPromotionSuccess, and NotifyPromotionFailure collapse to a
one-liner pointing back to the struct doc.
- Dedup the "Mirror the reaper's expiry path" block in
NotifyPromotionFailure to point at DiscardExpiredProcessingReplicas
Part 4 (the canonical rationale) instead of restating each step.
- Trim long comments in master_service.h (PromotionTask struct doc),
master_service.cpp (AllocStart start_time reset, reaper Part 4,
PromotionObjectHeartbeat cap rationale), and file_storage.cpp
(post-AllocStart failure rationale, AllocStart-failure block).
- Replace "v1" anchors ("in v1", "v1 doesn't pass...") with
current-state phrasing.
Build/tests unchanged: promotion_on_hit_test 28/28,
file_storage_promotion_test 11/11. Format check passes.
8012665 to
ea3e2ff
Compare
The previous run's build-wheel-cu13 (3.10 and 3.12) jobs were cancelled by runner shutdown signals mid-build. Empty commit to trigger a fresh CI run. No code changes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Mirror of the offload-on-evict feature (kvcache-ai#1899) in the reverse direction: when a key exists only as a LOCAL_DISK replica and a client reads it past the admission threshold, the master enqueues a promotion task on the holder, FileStorage drains it via heartbeat, and the bytes are staged back into a fresh MEMORY replica via Transfer Engine before the master flips it COMPLETE.
This branch has 4 commits total. The first two (
d10a0bfmaster-side state machine,bddd0fcinitial gate tests) were posted earlier; this PR adds:d3edf27— client byte movement + RPC wiring. RPCs throughrpc_service/master_client,Client::PromotionWritewrapping TE,FileStorage::ProcessPromotionTasksorchestration, pybindis_local_disk_replicaaccessor,scripts/run_tests.shenv-gated CI block. Makes the feature work end-to-end.aecf651— extended test coverage (concurrency, multi-segment alloc, FileStorage fault injection, snapshot round-trip, Python negative e2e). 31 tests total.End-to-end flow after this PR:
Per-key failures are logged and skipped; the master-side reaper (
EvictionThreadFunc) drops orphaned PromotionTasks and decrements the source replica's refcnt afterput_start_release_timeout_sec.Module
mooncake-store)mooncake-integration)mooncake-wheel)Type of Change
How Has This Been Tested?
promotion_on_hit_test(master gate)file_storage_promotion_test(NEW)ProcessPromotionTasks; covers heartbeat protocol, alloc/load/write/notify failures, per-key independencemaster_service_promotion_test_for_snapshot(NEW)TestSnapshotAndRestorein TearDowntest_promotion_on_hit.pyRegression check: no failures in
master_service_test,offload_on_evict_test,file_storage_test.Operator setup for the e2e:
Client side requires
enable_ssd_offload=TrueinMooncakeDistributedStore.setup. For test workloads under 256 MB / 500 keys, also setMOONCAKE_OFFLOAD_BUCKET_KEYS_LIMITandMOONCAKE_OFFLOAD_BUCKET_SIZE_LIMIT_BYTESlow enough that the bucket backend actually flushes (the defaults silently buffer in the ungrouped pool — same gotcha as the offload e2e).Known follow-ups, not in this PR:
MasterMetricManagerhas no promotion counters. Should land before flipping default-on for production clients.Client::TransferReadLOCAL_DISK bug atclient_service.cpp:2675is pre-existing and orthogonal; the e2e sidesteps it by usingbatch_get_replica_desc(which callsGetReplicaList, the real promotion trigger) rather thanstore.get. Single-keystore.getof a LOCAL_DISK-only key still throws "Expected DiskDescriptor".AllocStartsucceeds butNotifynever arrives, the reaper drops the task entry and source refcnt but doesn't remove the staged PROCESSING replica from metadata. Safe (invisible to readers) but leaks until master restart.Checklist
./scripts/code_format.shbefore submitting.