Commit 2cee8eb

yzhan1 and claude committed
[Store] L2->L1 promotion-on-hit: reviewer feedback (orphan reaper, global cap, const, misconfig log)
Addresses reviewer feedback on PR kvcache-ai#2071.

Issue 1 -- orphaned PROCESSING MEMORY replica leak. The promotion task
reaper only dropped the source LOCAL_DISK refcnt and erased the task
entry; it never popped the staged PROCESSING MEMORY replica added by
PromotionAllocStart. That replica is not in shard->processing_keys, so
DiscardExpiredProcessingReplicas could not sweep it, and the buffer
leaked until the object was removed or evicted. Fix: in the reaper, when
alloc_id != 0, call metadata.EraseReplicaByID(alloc_id) to pop the
staged replica and return its buffer to the allocator.

Issue 2 -- per-shard cap was wrong for skewed workloads. The old gate
was `shard->size() * kNumShards >= limit`, approximately right for
uniform workloads but ~1024x too eager on skewed workloads where hot
keys cluster in a few shards. Replace it with a cluster-wide
std::atomic<uint64_t> promotion_in_flight_ counter: incremented in
TryPushPromotionQueue after a successful emplace, decremented in
NotifyPromotionSuccess and in the reaper. memory_order_relaxed suffices
because the value is advisory; the per-shard mutex already serializes
inserts within a shard and the dedup gate prevents duplicate work.

Issue 3 -- const_cast smell. The promotion_tasks map held const
PromotionTask values for "generic safety", forcing a
const_cast<PromotionTask&> in PromotionAllocStart to set alloc_id under
the shard write lock. Drop the const; PromotionAllocStart now sets
task_it->second.alloc_id = new_id directly.

Misconfig log -- emit LOG(WARNING) at startup when
config.promotion_on_hit=true but enable_offload=false. Promotion
requires offload to produce LOCAL_DISK replicas, so it is silently
disabled in that combination; the log makes the disablement
discoverable to operators.

Tests
-----
- New ReaperPopsStagedMemoryReplicaOnExpiry: regression for Issue 1.
  Uses QuerySegments(seg).first (used bytes) to observe that the staged
  PROCESSING MEMORY replica's buffer is freed back to the DRAM
  allocator after the reaper sweeps the expired task.
- New QueueLimitRejectsCrossShard: regression for Issue 2. With
  queue_limit=1, proves a second admission attempt on a *different
  shard* is rejected -- exactly the case the old per-shard cap admitted
  incorrectly.
- Updated the comment on the existing QueueLimitRejectsBeyondCap to
  reflect the cluster-wide counter semantics.

Verification
------------
- promotion_on_hit_test: 15/15 pass (5 consecutive clean runs)
- file_storage_promotion_test: 9/9 pass
- master_service_promotion_test_for_snapshot: 5/5 pass
- offload_on_evict_test: 9/9 pass
- code_format.sh --base upstream/main (clang-format-20 in container):
  3 files reformatted (master_service.h, master_service.cpp,
  promotion_on_hit_test.cpp); all others "Already formatted"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e0490e9 commit 2cee8eb

3 files changed

Lines changed: 194 additions & 21 deletions


mooncake-store/include/master_service.h

Lines changed: 13 additions & 1 deletion
@@ -902,7 +902,7 @@ class MasterService {
             GUARDED_BY(mutex);
         std::unordered_map<std::string, const OffloadingTask> offloading_tasks
             GUARDED_BY(mutex);
-        std::unordered_map<std::string, const PromotionTask> promotion_tasks
+        std::unordered_map<std::string, PromotionTask> promotion_tasks
             GUARDED_BY(mutex);
     };
     std::array<MetadataShard, kNumShards> metadata_shards_;
@@ -1248,6 +1248,18 @@ class MasterService {
     bool promotion_on_hit_{false};
     uint32_t promotion_admission_threshold_{2};
     uint32_t promotion_queue_limit_{50000};
+    // Global in-flight task counter, checked against promotion_queue_limit_
+    // as the gate cap. A previous per-shard heuristic (shard->size() *
+    // kNumShards) was effectively right for uniform workloads but ~1024x
+    // tight on skewed workloads, where hot keys cluster in a few shards and
+    // would saturate one shard's projection of the cap while the cluster
+    // had near-zero in-flight tasks. Promotion specifically targets skewed
+    // access (hot keys re-accessed after eviction), so the global counter
+    // is the correct primitive. Incremented in TryPushPromotionQueue after
+    // successful enqueue; decremented in NotifyPromotionSuccess and in the
+    // promotion task reaper after the task entry is erased. Relaxed memory
+    // order is safe — the value is an advisory soft cap, not a barrier.
+    std::atomic<uint64_t> promotion_in_flight_{0};
     // Master-side frequency sketch. Constructed only when promotion_on_hit_ is
     // true. CountMinSketch is mutex-protected internally so we can call into it
     // from any GetReplicaList caller without additional locking.

mooncake-store/src/master_service.cpp

Lines changed: 40 additions & 16 deletions
@@ -185,6 +185,13 @@ MasterService::MasterService(const MasterServiceConfig& config)
     promotion_on_hit_ = enable_offload_ && config.promotion_on_hit;
     promotion_admission_threshold_ = config.promotion_admission_threshold;
     promotion_queue_limit_ = config.promotion_queue_limit;
+    if (config.promotion_on_hit && !enable_offload_) {
+        LOG(WARNING) << "promotion_on_hit=true was requested but "
+                     << "enable_offload=false; promotion is silently "
+                     << "disabled because it requires offload to produce "
+                     << "LOCAL_DISK replicas. Set enable_offload=true to "
+                     << "use this feature.";
+    }
     if (promotion_on_hit_) {
         promotion_sketch_ = std::make_unique<CountMinSketch>();
         LOG(INFO) << "Promotion-on-hit mode enabled: LOCAL_DISK-only Gets "
@@ -2404,11 +2411,15 @@ void MasterService::TryPushPromotionQueue(const std::string& key) {
         return;
     }
 
-    // Cap gate: cheap stat — read this shard's count under our held RW lock,
-    // which already gives an upper bound that's typically tight enough.
-    // Cluster-wide cap can be added in a follow-up; for v1 the per-shard
-    // count × shard count gives a soft cap.
-    if (shard->promotion_tasks.size() * kNumShards >= promotion_queue_limit_) {
+    // Cap gate: read the cluster-wide in-flight count. Soft cap — a
+    // benign TOCTOU race between this load and the emplace below can let
+    // a few extra tasks slip in, but the per-shard mutex already
+    // serializes inserts within a shard and the dedup gate above prevents
+    // duplicate work, so the worst case is N concurrent inserters across
+    // distinct shards each admitting one extra task. Atomic load is
+    // relaxed because the value is purely advisory.
+    if (promotion_in_flight_.load(std::memory_order_relaxed) >=
+        promotion_queue_limit_) {
         return;
     }
 
@@ -2443,6 +2454,7 @@ void MasterService::TryPushPromotionQueue(const std::string& key) {
             .alloc_id = 0,
             .object_size = object_size,
             .start_time = std::chrono::system_clock::now()});
+    promotion_in_flight_.fetch_add(1, std::memory_order_relaxed);
     VLOG(1) << "promotion_queued key=" << key << " size=" << object_size;
 }
 
@@ -2527,10 +2539,7 @@ auto MasterService::PromotionAllocStart(
     auto& shard = accessor.GetShard();
     auto task_it = shard->promotion_tasks.find(key);
     if (task_it != shard->promotion_tasks.end()) {
-        // const_cast: promotion_tasks holds const PromotionTask values for
-        // generic safety, but we own the entry under the shard write lock
-        // and need to record the alloc_id post-Allocate.
-        const_cast<PromotionTask&>(task_it->second).alloc_id = new_id;
+        task_it->second.alloc_id = new_id;
     }
     return PromotionAllocStartResponse{std::move(desc)};
 }
@@ -2571,6 +2580,7 @@ auto MasterService::NotifyPromotionSuccess(const UUID& client_id,
         source->dec_refcnt();
     }
     shard->promotion_tasks.erase(task_it);
+    promotion_in_flight_.fetch_sub(1, std::memory_order_relaxed);
 
     // Erase the per-client promotion_objects entry (best-effort; the
     // heartbeat may have already drained it).
@@ -2765,13 +2775,23 @@ void MasterService::DiscardExpiredProcessingReplicas(
         task_it = shard->offloading_tasks.erase(task_it);
     }
 
-    // Part 4: Discard expired promotion-on-hit tasks. Drops the source
-    // LOCAL_DISK replica's refcnt and erases the task entry; the per-client
-    // promotion_objects map is best-effort garbage collected on the next
-    // heartbeat (entries for vanished tasks are harmless — the client will
-    // allocate, transfer, then NotifyPromotionSuccess will return
-    // REPLICA_IS_NOT_READY and the new MEMORY replica is reaped via the
-    // discarded-replicas path).
+    // Part 4: Discard expired promotion-on-hit tasks. For each expired
+    // task:
+    //   - Drop the source LOCAL_DISK replica's refcnt so it can be
+    //     evicted normally.
+    //   - If PromotionAllocStart already staged a PROCESSING MEMORY
+    //     replica (alloc_id != 0), pop it via EraseReplicaByID. The
+    //     staged replica is not in shard->processing_keys, so
+    //     DiscardExpiredProcessingReplicas would never reap it; this is
+    //     the only place that does. Without this the buffer leaks until
+    //     the object itself is removed or evicted.
+    //   - Erase the task entry and decrement the global in-flight
+    //     counter.
+    // The per-client promotion_objects map is best-effort garbage
+    // collected on the next heartbeat (entries for vanished tasks are
+    // harmless — the client will allocate, transfer, then
+    // NotifyPromotionSuccess will return REPLICA_IS_NOT_READY since the
+    // task entry is gone).
     for (auto task_it = shard->promotion_tasks.begin();
          task_it != shard->promotion_tasks.end();) {
         const auto ttl =
@@ -2787,9 +2807,13 @@ void MasterService::DiscardExpiredProcessingReplicas(
             if (source != nullptr) {
                 source->dec_refcnt();
             }
+            if (task_it->second.alloc_id != 0) {
+                metadata_it->second.EraseReplicaByID(task_it->second.alloc_id);
+            }
         }
         LOG(WARNING) << "Promotion task expired for key: " << task_it->first;
         task_it = shard->promotion_tasks.erase(task_it);
+        promotion_in_flight_.fetch_sub(1, std::memory_order_relaxed);
     }
 
     if (!discarded_replicas.empty()) {

mooncake-store/tests/promotion_on_hit_test.cpp

Lines changed: 141 additions & 4 deletions
@@ -559,8 +559,9 @@ TEST_F(PromotionOnHitTest, QueueLimitRejectsBeyondCap) {
     auto r1 = service->GetReplicaList(k1);
     ASSERT_TRUE(r1.has_value());
 
-    // Second read on k2 (same shard S, different key, so no dedup) must be
-    // dropped by the cap gate: shard already has 1 task and 1*1024 >= 1.
+    // Second read on k2 (same shard S, different key, so no dedup) must
+    // be dropped by the cap gate: the cluster-wide in-flight counter is
+    // already 1, which meets promotion_queue_limit_ = 1.
     auto r2 = service->GetReplicaList(k2);
     ASSERT_TRUE(r2.has_value()) << "read itself must still succeed; "
                                 << "queue gate is silent";
@@ -569,8 +570,8 @@ TEST_F(PromotionOnHitTest, QueueLimitRejectsBeyondCap) {
     auto heartbeat = service->PromotionObjectHeartbeat(seg.client_id);
     ASSERT_TRUE(heartbeat.has_value());
     EXPECT_EQ(heartbeat->size(), 1u)
-        << "queue_limit=1 should cap the same shard at 1 task; "
-        << "k2's enqueue must be dropped";
+        << "promotion_queue_limit=1 should admit only the first task "
+        << "globally; k2's enqueue must be dropped";
     EXPECT_EQ(heartbeat->count(k1), 1u)
         << "k1 was read first and should be the surviving task";
     EXPECT_EQ(heartbeat->count(k2), 0u)
@@ -650,6 +651,142 @@ TEST_F(PromotionOnHitTest, HeartbeatBoundedBatchPreservesLeftovers) {
     service->RemoveAll();
 }
 
+// Issue 1 regression: the promotion task reaper must pop the staged
+// PROCESSING MEMORY replica added by PromotionAllocStart. Without it,
+// the staged replica was orphaned forever: it's not in
+// shard->processing_keys (so DiscardExpiredProcessingReplicas can't see
+// it) and the previous reaper code only touched the source LOCAL_DISK
+// refcnt and the task entry. The orphan held its allocator buffer
+// indefinitely.
+//
+// We can't observe the staged replica via GetReplicaList because the
+// master filters out PROCESSING entries (clients can only read COMPLETE
+// replicas), so we use QuerySegments to watch the DRAM allocator's used
+// bytes: AllocStart bumps it, and the reaper must return it to baseline.
+// NotifyPromotionSuccess on a reaped task must also fail cleanly.
+TEST_F(PromotionOnHitTest, ReaperPopsStagedMemoryReplicaOnExpiry) {
+    MasterServiceConfig config;
+    config.enable_offload = true;
+    config.promotion_on_hit = true;
+    config.promotion_admission_threshold = 1;
+    config.default_kv_lease_ttl = 2000;
+    config.put_start_discard_timeout_sec = 0;
+    config.put_start_release_timeout_sec = 1;
+    auto service = std::make_unique<MasterService>(config);
+
+    constexpr size_t seg_size = 1024 * 1024 * 16;
+    auto ctx = PrepareSegment(*service, "seg_a", kDefaultSegmentBase, seg_size);
+    ASSERT_TRUE(InjectLocalDiskReplica(*service, ctx.client_id, "k_cold", 1024,
+                                       ctx.segment_name));
+
+    // Baseline allocator usage on the DRAM segment.
+    auto seg_baseline = service->QuerySegments(ctx.segment_name);
+    ASSERT_TRUE(seg_baseline.has_value());
+    const size_t used_baseline = seg_baseline->first;
+
+    // Trigger the gate to enqueue a PromotionTask.
+    {
+        auto r = service->GetReplicaList("k_cold");
+        ASSERT_TRUE(r.has_value());
+    }
+    // Drive the AllocStart side so alloc_id != 0 — this is the exact
+    // setup that left an orphaned PROCESSING MEMORY replica pre-fix.
+    auto alloc = service->PromotionAllocStart("k_cold", 1024, {});
+    ASSERT_TRUE(alloc.has_value());
+
+    // After AllocStart, the DRAM allocator must have committed bytes for
+    // the staged PROCESSING MEMORY replica.
+    auto seg_after_alloc = service->QuerySegments(ctx.segment_name);
+    ASSERT_TRUE(seg_after_alloc.has_value());
+    EXPECT_GT(seg_after_alloc->first, used_baseline)
+        << "PromotionAllocStart should bump segment used bytes "
+        << "(allocator-tracked PROCESSING MEMORY replica)";
+
+    // Sleep past the staleness window; the eviction thread reaps the
+    // task and (with the fix) pops the staged replica via
+    // EraseReplicaByID, which releases the buffer back to the allocator.
+    std::this_thread::sleep_for(std::chrono::seconds(2));
+
+    auto seg_after_reap = service->QuerySegments(ctx.segment_name);
+    ASSERT_TRUE(seg_after_reap.has_value());
+    EXPECT_EQ(seg_after_reap->first, used_baseline)
+        << "after reap: staged PROCESSING MEMORY replica's buffer must "
+        << "be freed back to the DRAM allocator. Pre-fix the buffer "
+        << "leaked and used bytes stayed elevated until the object "
+        << "itself was removed or evicted.";
+
+    // NotifyPromotionSuccess for a reaped task must not commit anything
+    // and must return REPLICA_IS_NOT_READY (the task entry is gone, so
+    // the alloc_id lookup at the top of NotifyPromotionSuccess fails
+    // fast).
+    auto notify = service->NotifyPromotionSuccess(ctx.client_id, "k_cold");
+    ASSERT_FALSE(notify.has_value());
+    EXPECT_EQ(notify.error(), ErrorCode::REPLICA_IS_NOT_READY);
+
+    service->RemoveAll();
+}
+
+// Issue 2 regression: the cap gate must be cluster-wide. The old
+// implementation used `shard->size() * kNumShards >= promotion_queue_limit_`,
+// which made the cap fire ~1024x too eagerly on skewed workloads (hot
+// keys cluster in few shards). With a global atomic counter, a task in
+// shard A counts toward the cap that gates a task in shard B.
+TEST_F(PromotionOnHitTest, QueueLimitRejectsCrossShard) {
+    MasterServiceConfig config;
+    config.enable_offload = true;
+    config.promotion_on_hit = true;
+    config.promotion_admission_threshold = 1;
+    config.promotion_queue_limit = 1;  // 1 in-flight task globally
+    config.default_kv_lease_ttl = 2000;
+    auto service = std::make_unique<MasterService>(config);
+
+    constexpr size_t seg_size = 1024 * 1024 * 16;
+    auto seg = PrepareSegment(*service, "seg_a", kDefaultSegmentBase, seg_size);
+
+    // Find two keys hashing to *different* shards. With the old per-shard
+    // heuristic this would let both through (each shard's count is 0
+    // independently). With the global counter, only the first goes in.
+    constexpr size_t kNumShardsLocal = 1024;
+    auto shard_of = [](const std::string& k) {
+        return std::hash<std::string>{}(k) % kNumShardsLocal;
+    };
+    const std::string k1 = "xshard_first";
+    std::string k2;
+    for (int i = 0; i < 100000 && k2.empty(); ++i) {
+        std::string candidate = "xshard_other_" + std::to_string(i);
+        if (shard_of(candidate) != shard_of(k1)) {
+            k2 = candidate;
+        }
+    }
+    ASSERT_FALSE(k2.empty()) << "couldn't find a different-shard key";
+    ASSERT_NE(shard_of(k1), shard_of(k2));
+
+    ASSERT_TRUE(InjectLocalDiskReplica(*service, seg.client_id, k1, 1024,
+                                       seg.segment_name));
+    ASSERT_TRUE(InjectLocalDiskReplica(*service, seg.client_id, k2, 1024,
+                                       seg.segment_name));
+
+    auto r1 = service->GetReplicaList(k1);
+    ASSERT_TRUE(r1.has_value());
+
+    // k2 lives in a different shard, but the global cap is already met
+    // by k1's task — k2 must be rejected.
+    auto r2 = service->GetReplicaList(k2);
+    ASSERT_TRUE(r2.has_value()) << "read itself still succeeds";
+
+    auto heartbeat = service->PromotionObjectHeartbeat(seg.client_id);
+    ASSERT_TRUE(heartbeat.has_value());
+    EXPECT_EQ(heartbeat->size(), 1u)
+        << "with global cap=1 and one task already in shard " << shard_of(k1)
+        << ", a key hashing to shard " << shard_of(k2)
+        << " must be rejected by the global gate (pre-fix it would have "
+        << "been admitted since its shard's local count was 0)";
+    EXPECT_EQ(heartbeat->count(k1), 1u);
+    EXPECT_EQ(heartbeat->count(k2), 0u);
+
+    service->RemoveAll();
+}
+
 } // namespace mooncake::test
 
 int main(int argc, char** argv) {
