Skip to content

[Store] L2->L1 promotion-on-hit: end-to-end byte movement + extended test coverage#1

Open
yzhan1 wants to merge 11 commits into
mainfrom
feature/l2-l1-promotion-on-hit
Open

[Store] L2->L1 promotion-on-hit: end-to-end byte movement + extended test coverage#1
yzhan1 wants to merge 11 commits into
mainfrom
feature/l2-l1-promotion-on-hit

Conversation

@yzhan1
Copy link
Copy Markdown
Owner

@yzhan1 yzhan1 commented May 6, 2026

Description

Mirror of the offload-on-evict feature (kvcache-ai#1899) in the reverse direction: when a key exists only as a LOCAL_DISK replica and a client reads it past the admission threshold, the master enqueues a promotion task on the holder, FileStorage drains it via heartbeat, and the bytes are staged back into a fresh MEMORY replica via Transfer Engine before the master flips it COMPLETE.

This branch has 4 commits total. The first two (d10a0bf master-side state machine, bddd0fc initial gate tests) were posted earlier; this PR adds:

  1. d3edf27 — client byte movement + RPC wiring. RPCs through rpc_service/master_client, Client::PromotionWrite wrapping TE, FileStorage::ProcessPromotionTasks orchestration, pybind is_local_disk_replica accessor, scripts/run_tests.sh env-gated CI block. Makes the feature work end-to-end.
  2. aecf651 — extended test coverage (concurrency, multi-segment alloc, FileStorage fault injection, snapshot round-trip, Python negative e2e). 31 tests total.

End-to-end flow after this PR:

client A: GET k  ──▶  master GetReplicaList
                        ├─ if !any_memory && any_local_disk and freq > threshold,
                        │   queue PromotionTask, push to holder's promotion_objects
                        ▼
client B (holder of LOCAL_DISK k):
   FileStorage::Heartbeat tick
     ├─ PromotionObjectHeartbeat (drain queue)
     └─ for each (key, size):
         ├─ master.PromotionAllocStart  ──▶ stage PROCESSING MEMORY
         ├─ FileStorage.AllocateBatch  (O_DIRECT staging buffer)
         ├─ storage_backend.BatchLoad  (SSD → staging)
         ├─ Client.PromotionWrite      (staging → MEMORY via TE)
         └─ master.NotifyPromotionSuccess  ──▶ flip COMPLETE

Per-key failures are logged and skipped; the master-side reaper (EvictionThreadFunc) drops orphaned PromotionTasks and decrements the source replica's refcnt after put_start_release_timeout_sec.

Module

  • Mooncake Store (mooncake-store)
  • Integration (mooncake-integration)
  • Python Wheel (mooncake-wheel)
  • CI/CD

Type of Change

  • New feature

How Has This Been Tested?

Suite Tests Result
promotion_on_hit_test (master gate) 12 12/12 pass (~14s) — 6 original gate + 3 concurrency/reaper + 3 multi-segment alloc
file_storage_promotion_test (NEW) 10 10/10 pass (~0.4s) — FakeClient subclass drives ProcessPromotionTasks; covers heartbeat protocol, alloc/load/write/notify failures, per-key independence
master_service_promotion_test_for_snapshot (NEW) 7 7/7 pass (~14s) — LOCAL_DISK + mixed replicas + multi-holder + in-flight task all round-trip via TestSnapshotAndRestore in TearDown
test_promotion_on_hit.py 2 2/2 pass (~78s) — positive (promotion fires after threshold) + negative (below-watermark workload stays MEMORY-only)

Regression check: no failures in master_service_test, offload_on_evict_test, file_storage_test.

Operator setup for the e2e:

mooncake_master \
    --enable_offload=true \
    --offload_on_evict=true \
    --promotion_on_hit=true \
    --promotion_admission_threshold=2 \
    --root_fs_dir=/path/to/local/disk

Client side requires enable_ssd_offload=True in MooncakeDistributedStore.setup. For test workloads under 256 MB / 500 keys, also set MOONCAKE_OFFLOAD_BUCKET_KEYS_LIMIT and MOONCAKE_OFFLOAD_BUCKET_SIZE_LIMIT_BYTES low enough that the bucket backend actually flushes (the defaults silently buffer in the ungrouped pool — same gotcha as the offload e2e).

Known follow-ups, not in this PR:

  • No metrics yet — MasterMetricManager has no promotion counters. Should land before flipping default-on for production clients.
  • Client::TransferRead LOCAL_DISK bug at client_service.cpp:2675 is pre-existing and orthogonal; the e2e sidesteps it by using batch_get_replica_desc (which calls GetReplicaList, the real promotion trigger) rather than store.get. Single-key store.get of a LOCAL_DISK-only key still throws "Expected DiskDescriptor".
  • Orphaned PROCESSING memory replica — if AllocStart succeeds but Notify never arrives, the reaper drops the task entry and source refcnt but doesn't remove the staged PROCESSING replica from metadata. Safe (invisible to readers) but leaks until master restart.

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@yzhan1 yzhan1 force-pushed the feature/l2-l1-promotion-on-hit branch 12 times, most recently from 8b1d481 to fd69cf2 Compare May 10, 2026 07:55
@yzhan1 yzhan1 force-pushed the feature/l2-l1-promotion-on-hit branch 5 times, most recently from f163af1 to 672a911 Compare May 18, 2026 17:39
yzhan1 added 10 commits May 20, 2026 21:27
…elected e2e proof

Two PR-review issues from kvcache-ai#2071, addressed without changing the existing
commit:

1. Heartbeat could silently drop leftover promotion tasks.
   The prior fix to a per-tick cap lived on the client side: master
   drained the whole per-LDS promotion_objects map, client processed
   one, the rest evaporated. Their per-shard promotion_tasks entries
   stayed refcnt-pinned until the reaper TTL (~10 min), and the dedup
   gate blocked re-enqueue, so a burst of hot LOCAL_DISK keys could
   lose all but one promotion for that window.

   Move the cap server-side. MasterService::PromotionObjectHeartbeat
   now extracts at most kMaxPerHeartbeat (1) entries from the per-LDS
   promotion_objects map and leaves the remainder queued for subsequent
   heartbeats. The client-side per-tick cap is removed — the master is
   the bound. Leftover work is preserved by construction.

   New unit test HeartbeatBoundedBatchPreservesLeftovers proves: with
   three queued keys, three successive heartbeats each return exactly
   one distinct key, the fourth returns empty, and all three keys'
   master-side state stays intact across the drain.

   The FakeClient harness in file_storage_promotion_test now mirrors
   the master's bounded-batch semantic on its mocked Heartbeat, so the
   existing client-side tests exercise the same drain shape production
   sees.

2. E2E couldn't prove subsequent reads avoided SSD.
   Post-promotion the test asserted bytes-back and the existence of a
   MEMORY replica, but bytes-back is the same whether MEMORY or
   LOCAL_DISK served the read (the offload-RPC path returns correct
   bytes too). The RFC's core invariant — "subsequent reads avoid
   SSD" — went unverified.

   Add an instrumented counter offload_rpc_read_count on RealClient
   that increments at every invocation of
   batch_get_into_offload_object_internal, the single chokepoint for
   LOCAL_DISK reads served via peer offload-RPC. Expose via the
   Python binding as MooncakeDistributedStore.get_offload_rpc_read_count().
   The e2e snapshots the counter around the post-promotion store.get
   and asserts it didn't move — a non-zero delta would mean the read
   served from SSD instead of the promoted MEMORY replica.
Reviewer asked for more metrics. The post-promotion offload-RPC counter
already proves MEMORY served the read; adding a wall-time comparison
gives the same evidence on a different axis and surfaces a concrete
"this is the user-visible win" number.

Phase 3 timed the 4 admission-threshold-clearing reads as the
LOCAL_DISK sample. Phase 5 expanded the single post-promotion read into
10 timed reads as the MEMORY sample, plus the existing counter
assertion. The test now prints a median / min / max table for each
phase and asserts the post-promotion median is at least 1.3x faster.

The threshold is intentionally loose: on docker tmpfs with small (1 MB)
objects much of the wall time is Python/C++ marshalling overhead and
the SSD-vs-DRAM gap shrinks to ~2-5x; on real SSDs + RDMA the gap is
typically 10-50x. The counter assertion is the hard correctness check;
the latency assertion catches gross regressions while the printed
summary serves as the primary empirical signal.

Observed in CI (docker tmpfs, 1 MB objects):
  pre-promotion  (LOCAL_DISK via offload-RPC): median 1.53 ms
  post-promotion (MEMORY direct):              median 0.31 ms
  speedup: 4.9x
Reviewer asked to keep a richer latency comparison in the repo but
NOT run it as a CI test. test_promotion_after_repeated_hits goes back
to the lean correctness shape: 4 reads to clear the admission gate,
1 read post-promotion + offload-RPC counter assertion to prove MEMORY
served, no timing. The full p50/p95/p99 latency comparison moves into
a sibling class BenchPromotionLatency gated by
@unittest.skipUnless(MC_BENCH_PROMOTION_LATENCY), so:

  python test_promotion_on_hit.py                              # 3 tests, 1 skipped — CI
  MC_BENCH_PROMOTION_LATENCY=1 python test_promotion_on_hit.py # full bench

The bench retains the per-read MEMORY-detection (snapshots
offload_rpc_read_count around each pre-promotion read and bails on
detection of MEMORY-served reads so the LOCAL_DISK distribution stays
uncontaminated) and the 1.3x p50 speedup assertion. Sample count is
LATENCY_SAMPLES (default 1000; verified stable to within noise at
N=10000 too — N=1000 is the sweet spot for stable p99 without paying
for 10x more wall time).

Observed bench numbers on docker tmpfs (1 MiB objects, N=1000):
  pre-promotion  (LOCAL_DISK via offload-RPC): p50=0.52 ms, p95=0.68 ms, p99=0.81 ms
  post-promotion (MEMORY direct):              p50=0.24 ms, p95=0.33 ms, p99=0.42 ms
  speedup: p50=2.2x | p95=2.0x | p99=1.9x

Real-hardware speedup is typically 10-50x because the SSD-vs-DRAM cost
gap dominates only when objects are large and the FFI/marshalling
overhead is fractional.
…obal cap, const, misconfig log)

Addresses reviewer feedback on PR kvcache-ai#2071.

Issue 1 -- orphaned PROCESSING MEMORY replica leak. The promotion task
reaper only dropped the source LOCAL_DISK refcnt and erased the task
entry; it never popped the staged PROCESSING MEMORY replica added by
PromotionAllocStart. That replica is not in shard->processing_keys, so
DiscardExpiredProcessingReplicas could not sweep it and the buffer
leaked until the object was removed or evicted. Fix: in the reaper,
when alloc_id != 0, call metadata.EraseReplicaByID(alloc_id) to pop
the staged replica and return its buffer to the allocator.

Issue 2 -- per-shard cap was wrong for skewed workloads. The old gate
was 'shard->size() * kNumShards >= limit', approximately right for
uniform workloads but ~1024x too eager on skewed workloads where hot
keys cluster in few shards. Replace with a cluster-wide
std::atomic<uint64_t> promotion_in_flight_ counter. Incremented in
TryPushPromotionQueue after successful emplace; decremented in
NotifyPromotionSuccess and in the reaper. memory_order_relaxed since
the value is advisory; the per-shard mutex already serializes inserts
within a shard and the dedup gate prevents duplicate work.

Issue 3 -- const_cast smell. The promotion_tasks map held
const PromotionTask values for "generic safety", forcing a
const_cast<PromotionTask&> in PromotionAllocStart to set alloc_id
under the shard write lock. Drop the const; PromotionAllocStart now
sets task_it->second.alloc_id = new_id directly.

Misconfig log -- emit LOG(WARNING) at startup when
config.promotion_on_hit=true but enable_offload=false. Promotion
requires offload to produce LOCAL_DISK replicas, so it is silently
disabled in that combination; the log makes the disablement
discoverable to operators.

Tests
-----
- New ReaperPopsStagedMemoryReplicaOnExpiry: regression for Issue 1.
  Uses QuerySegments(seg).first (used bytes) to observe that the
  staged PROCESSING MEMORY replica's buffer is freed back to the DRAM
  allocator after the reaper sweeps the expired task.
- New QueueLimitRejectsCrossShard: regression for Issue 2. With
  queue_limit=1, proves a second admission attempt on a *different
  shard* is rejected -- exactly the case the old per-shard cap
  admitted incorrectly.
- Updated comment on existing QueueLimitRejectsBeyondCap to reflect
  the cluster-wide counter semantics.

Verification
------------
- promotion_on_hit_test: 15/15 pass (5 consecutive clean runs)
- file_storage_promotion_test: 9/9 pass
- master_service_promotion_test_for_snapshot: 5/5 pass
- offload_on_evict_test: 9/9 pass
- code_format.sh --base upstream/main (clang-format-20 in container):
  3 files reformatted (master_service.h, master_service.cpp,
  promotion_on_hit_test.cpp); all others "Already formatted"
…est + reaper poll

Follow-up to 2cee8eb addressing two further reviewer concerns and
recovering one edit batch that did not land in that commit.

AllocStart deadline reset
-------------------------
A queued promotion task could enter the active-transfer phase
(PromotionAllocStart -> client SSD read -> RDMA write -> Notify) with
only a sliver of TTL remaining if the holder's heartbeat queue was
backlogged (kMaxPerHeartbeat = 1 serializes work). If the active
transfer of a large object then ran past the TTL, the reaper would
call EraseReplicaByID on the staged MEMORY replica mid-transfer; the
allocator could then hand that buffer to a concurrent Put while the
client's RDMA write was still landing into it.

Fix: PromotionAllocStart now resets task.start_time when it stages
the PROCESSING MEMORY replica. The queue-wait phase (alloc_id == 0,
no buffer) and the active-transfer phase (alloc_id != 0, buffer
in flight) each get their own full put_start_release_timeout_sec_
TTL window, measured from when they began.

New regression test AllocStartResetsTaskDeadline drives the
queue_wait_seconds + alloc_active_seconds > TTL scenario and asserts
via QuerySegments(seg).first that the staged buffer survives past
the bare TTL when measured from admission.

NotifyPromotionSuccess counter coverage
---------------------------------------
The success path of NotifyPromotionSuccess was not exercised by any
existing test — all call sites in the suite asserted error paths.
2cee8eb added a fetch_sub on the success path but had no test to
catch a missing decrement.

New test NotifySuccessDecrementsCounter sets queue_limit=1, drives
the full happy path (GetReplicaList -> PromotionAllocStart ->
NotifyPromotionSuccess), then asserts a second admission on a
different key is allowed. Without the success-path fetch_sub, the
cap stays saturated at 1 and the second admission is dropped. Bug
discrimination verified manually: temporarily removing fetch_sub
fails this test as expected.

Reaper test poll-instead-of-sleep
---------------------------------
ReaperPopsStagedMemoryReplicaOnExpiry used a fixed 2-second sleep
after a 1-second TTL. Under suite load the eviction thread cadence
plus general scheduling jitter could push past the 2-second mark
~1 in 6 runs, producing a flake matching the StalePromotionReaper
class. Replace with a 5-second polling deadline against
QuerySegments(seg).first; the test now passes 5/5 stability runs.

QueueLimitRejectsBeyondCap stale block comment
----------------------------------------------
The test's docblock still described the old per-shard gate (gate:
shard_size * kNumShards >= limit). Update it to reflect the
cluster-wide promotion_in_flight_ counter and cross-reference
QueueLimitRejectsCrossShard for the cross-shard case.

Verification
------------
- promotion_on_hit_test: 17/17 pass × 5 consecutive clean runs
- New tests pass; mutation-test of NotifySuccessDecrementsCounter
  against a removed success-path fetch_sub fails as designed
- code_format.sh --base upstream/main (clang-format-20 in container):
  3 files reformatted; all others already-formatted
…ify holder check

Two correctness fixes around the PromotionTask state machine, both
symmetric to the staged-replica-leak class that the existing Issue 1
fix (reaper pops staged replica) addressed.

Bug 1 -- AllocStart could orphan a staged replica
-------------------------------------------------
If the reaper swept a PromotionTask between the holder's heartbeat
returning the key and the eventual PromotionAllocStart RPC arriving
(a hung client, GC pause, or HA failover stalling AllocStart past
put_start_release_timeout_sec_), the old code path was:

  1. allocator->Allocate(...)
  2. metadata.AddReplicas(...)                  // stages PROCESSING MEMORY
  3. shard->promotion_tasks.find(key)            // returns end()
  4. if (task_it != end) { record alloc_id }     // silently skipped
  5. return success                              // replica orphaned

The staged PROCESSING MEMORY replica then had no PromotionTask
pointing at it. DiscardExpiredProcessingReplicas only iterates
shard->processing_keys (which AllocStart never populates), and the
promotion-task reaper iterates shard->promotion_tasks (the entry is
gone). The buffer leaked until the object itself was Remove'd or
evicted. This is the symmetric counterpart of Issue 1: that one was
"task survives but reaper fires after AllocStart"; this one is
"AllocStart runs after task is reaped".

Fix: lift the task lookup to the top of PromotionAllocStart, before
allocation. Reject with REPLICA_IS_NOT_READY if the task is gone.
The shard mutex is held by MetadataAccessorRW for the entire
function, so the iterator stays valid across the allocator step.

Regression test AllocStartRejectsReapedTask: admit a task, sleep
past TTL so the reaper sweeps it, call AllocStart, assert
REPLICA_IS_NOT_READY and that QuerySegments(seg).first is at
baseline (no buffer was staged). Sanity-mutation verified: with the
check removed the test fails as expected.

Bug 2 -- NotifyPromotionSuccess didn't validate the caller
----------------------------------------------------------
PromotionTask did not record which client owned the source
LOCAL_DISK replica, and NotifyPromotionSuccess accepted any
client_id. Any client that knew the key could call Notify and flip
the staged PROCESSING MEMORY replica to COMPLETE before the
legitimate holder's RDMA write had landed, exposing torn data to
readers via GetReplicaList.

In single-tenant trusted-client deployments this surfaces as a
silent corruption when a buggy client tracks state wrong (e.g.,
issuing Notify on a key it never offloaded). In multi-tenant
deployments (offload RFC kvcache-ai#1836) it is a security gap.

Fix: capture the holder client_id in PromotionTask at admission
(TryPushPromotionQueue already resolves it via
get_local_disk_client_id() for the PushPromotionQueue routing). At
NotifyPromotionSuccess, return INVALID_PARAMS if
task.holder_id != client_id.

Regression test NotifyRejectsNonHolder: stage a promotion, call
Notify with an unrelated client_id, assert INVALID_PARAMS and that
the staged replica is still PROCESSING (no COMPLETE memory replica
visible via GetReplicaList). Then call Notify with the legitimate
holder client_id and assert it succeeds. Sanity-mutation verified:
with the check removed the test fails as expected.

Test churn
----------
MultiSegmentAllocPicksAvailableSegment and
MultiSegmentAllocRespectsPreferred previously bypassed the gate and
called PromotionAllocStart directly (the old contract permitted
this; AllocStart silently no-op'd the task assignment). With the
new task-existence gate they must seed a task first via
GetReplicaList. Both tests now do that as a one-line setup before
their multi-segment assertions. Behavior under test (segment
selection) is unchanged.

Verification
------------
- promotion_on_hit_test: 19/19 pass x 5 consecutive runs (was 17)
- file_storage_promotion_test: 9/9 pass
- master_service_promotion_test_for_snapshot: 5/5 pass
- offload_on_evict_test: 9/9 pass
- Both new tests verified bug-discriminating via temporary code
  mutation; both fail under the buggy condition and pass under the
  fix.
- code_format.sh --base upstream/main (clang-format-20 in container):
  3 files reformatted; all others already-formatted.
…lient failure

Two related additions tightening the PromotionTask lifecycle, plus a
ClearInvalidHandles cleanup. The HA-failover orphan path is left as-is
- see "HA failover trade-off" below.

AllocStart caller authorization
-------------------------------
PromotionAllocStart now takes a client_id parameter and rejects calls
where client_id != task.holder_id with INVALID_PARAMS. Without this
gate, any client knowing the key could call AllocStart and have the
master allocate a DRAM buffer plus stage a PROCESSING MEMORY replica
against someone else's task; the legitimate holder's subsequent
AllocStart would then race against an already-staged replica it
didn't allocate. The auth check mirrors the holder-id gate already in
place on NotifyPromotionSuccess and the new NotifyPromotionFailure.

Also adds a size check: AllocStart rejects requests whose size
doesn't match task.object_size (captured from the source LOCAL_DISK
descriptor at admission). Defense-in-depth against a buggy caller
requesting an arbitrary allocation size.

NotifyPromotionFailure: eager release on client failure
-------------------------------------------------------
New holder-side RPC for the explicit-abort case: the client got past
PromotionAllocStart but a downstream step (local SSD read, RDMA
write, NotifyPromotionSuccess) failed. Without this, every transient
client-side failure pinned a task slot and a staged DRAM buffer for
put_start_release_timeout_sec_ (~10 min default). On busy clusters
that turned every flake into a 10-minute outage of
promotion_queue_limit_.

Semantics mirror the reaper's expiry path: holder-id auth check,
drops source LOCAL_DISK refcnt, pops the staged PROCESSING MEMORY
replica via EraseReplicaByID when alloc_id != 0, erases the task,
decrements promotion_in_flight_, clears the per-client
promotion_objects entry. Idempotent: missing task entry returns OK.

file_storage.cpp calls NotifyPromotionFailure on every failure path
in ProcessPromotionTasks: AllocStart failure (slot was claimed at
TryPushPromotionQueue time, needs release even before any buffer was
staged), AllocateBatch failure, SSD BatchLoad failure, TransferWrite
failure, and NotifyPromotionSuccess failure. Each call is best-effort
with the reaper as the long-stop.

ClearInvalidHandles: clean up promotion_tasks on client expiry
--------------------------------------------------------------
ClientMonitorFunc's ClearInvalidHandles previously cleaned up
processing_keys, replication_tasks, and offloading_tasks when a
client's metadata was erased, but skipped promotion_tasks. After
holder-client expiry the dangling task pinned its slot for up to
put_start_release_timeout_sec_; rolling restarts of many holders
could saturate promotion_queue_limit_ for the full TTL. Now erased
alongside the others, with matching promotion_in_flight_ decrement.

HA failover trade-off (intentionally not addressed)
---------------------------------------------------
The master snapshot persists metadata.replicas_ but not the per-shard
task-tracking maps. After a leader change, an in-flight promotion's
staged PROCESSING MEMORY replica survives in metadata but has no
promotion_tasks entry pointing at it. The replica's allocator handle
keeps its buffer reserved (so a subsequent Put cannot reuse the
buffer - no UAF), but the buffer is leaked until the key is
Remove'd, and the key itself silently loses future promotion
eligibility because the HasReplica(fn_is_memory_replica) gate sees
the orphan and blocks new admissions. Affected blast radius per
failover is bounded by in_flight_at_snapshot * object_size.

This is a pre-existing failure mode shared with PutStart, CopyStart,
and MoveStart. An earlier iteration of this PR included a restore-
side sweep that popped all PROCESSING replicas; on further review
the sweep introduced a worse failure mode (in-flight RDMA writes
from the previous epoch landing into newly-allocated buffers after
the sweep released the allocator handle). A proper fix needs
synchronous OpLog of pre-commit operations and is out of scope here.

Tests
-----
promotion_on_hit_test (24 tests, all pass):
  - AllocStartRejectsNonHolder, AllocStartRejectsSizeMismatch
  - NotifyFailureReleasesStateImmediately, NotifyFailureRejectsNonHolder
  - ClientExpiryClearsPromotionTask
  - existing tests updated to thread client_id through AllocStart

file_storage_promotion_test (11 tests, all pass):
  - AllocStartFailureNotifiesMaster
  - PostAllocFailuresAllNotifyMaster

Verification (in-container build + run):
  promotion_on_hit_test:                       24/24 pass
  file_storage_promotion_test:                 11/11 pass
  master_service_promotion_test_for_snapshot:   5/5  pass
  offload_on_evict_test:                        9/9  pass
  master_service_test:                         98/98 pass
…-mid-promotion cleanup

Two reviewer-flagged correctness fixes around the PromotionTask
lifecycle.

promotion_admission_threshold lower-bound clamp
-----------------------------------------------
`master.cpp` clamped the configured threshold at >255 but had no
lower-bound clamp. With threshold=0, `master_service.cpp`'s
`if (freq < threshold)` is always false (freq is uint8_t, can't be
negative), so every Get on a LOCAL_DISK-only key admits a promotion
and the frequency gate is silently bypassed. Operators setting the
flag to 0 expecting "off" would instead get "always on", and the
existing comment claiming "[1, 255]" clamping was wrong.

Two-layer fix:
- master.cpp: clamp 0 → 1 at flag-parse time with a LOG(WARNING)
  pointing operators at the intended minimum.
- master_service.cpp constructor: defense-in-depth clamp on
  `promotion_admission_threshold_` for direct-construction paths
  (tests, embedded users) that bypass the flag parser. Mirrors the
  upper-bound clamp at 255.

The misleading "[1, 255]" comment at the gate site is updated to note
both bounds are enforced at parse time *and* in the constructor.

Metadata-erase paths now drop in-flight PromotionTask entries
-------------------------------------------------------------
Five paths erase an ObjectMetadata entry; ClearInvalidHandles was the
only one that also wiped the matching promotion_tasks entry and
decremented promotion_in_flight_. The other four left the task entry
dangling for up to put_start_release_timeout_sec_ (~10 min default),
pinning a slot of promotion_queue_limit_ and confusing subsequent
NotifyPromotionSuccess RPCs on the key with a stale REPLICA_IS_NOT_READY.

Sites fixed:
- Remove(force=true) — master_service.cpp:1948
- RemoveByRegex — master_service.cpp:2002
- UpsertStart stale-handle erase — master_service.cpp:1242
- UpsertStart preempt erase — master_service.cpp:1282
- MetadataAccessorRW auto-cleanup ctor — master_service.h:1144-1148

All five now call a new private helper:

  void ErasePromotionTaskIfPresent(MetadataShardAccessorRW& shard,
                                   const std::string& key)
      NO_THREAD_SAFETY_ANALYSIS {
      if (shard->promotion_tasks.erase(key) > 0) {
          promotion_in_flight_.fetch_sub(1, std::memory_order_relaxed);
      }
  }

Helper is documented at the declaration site only; call sites are
plain one-liners.

Why this is asymmetric with offload
-----------------------------------
The codebase already tolerates orphan offloading_tasks at the same
sites — offload has no cluster-wide in-flight counter to brick, and
NotifyOffloadSuccess gracefully recreates missing metadata via
AddReplica (because the bytes live on the client's SSD). Promotion
can't do the same on commit (bytes live in master-allocated DRAM
that gets deallocated alongside the metadata-erase, so committing
those bytes risks surfacing torn or cross-key data — see the
REPLICA_IS_NOT_READY rationale on NotifyPromotionSuccess). So the
eager-erase is promotion-specific and addresses both the cap-slot
leak and the user-visible REPLICA_IS_NOT_READY oddity during the
orphan window.

Tests
-----
Four new tests in promotion_on_hit_test.cpp (suite now 28/28):
- AdmissionThresholdZeroClampsToOne: configure threshold=0, assert
  the constructor lifts it to 1.
- AdmissionThresholdAboveMaxClampsToMax: configure threshold=1000,
  assert clamp to 255.
- RemoveErasesPromotionTask: with queue_limit=1, admit a task,
  Remove(force=true) the key, then admit a different key — must
  succeed because the slot was freed by Remove.
- RemoveByRegexErasesPromotionTask: same shape via the regex path.

Both clamp tests need friend access to the now-private
promotion_admission_threshold_ field. Added a static
GetPromotionAdmissionThresholdForTesting funnel on the fixture (since
gtest TEST_F subclasses don't inherit friendship). The missing
`friend class test::PromotionOnHitTest` declaration in
master_service.h was re-added (it had been dropped during an earlier
rebase).

Bug-discrimination verified for each test via temporary mutation:
- Clamp removed → both clamp tests fail.
- ErasePromotionTaskIfPresent body removed → both Erases* tests fail.

Verification (in-container build + run):
  promotion_on_hit_test:                       28/28 pass
  file_storage_promotion_test:                 11/11 pass
  master_service_promotion_test_for_snapshot:   5/5  pass
  offload_on_evict_test:                        9/9  pass

Formatted with scripts/code_format.sh --base upstream/main
(clang-format-20 in container): 4 files reformatted, all others
already-formatted.
…acts, dedup, shorten

Comment-only cleanup, no code logic changes.

- Drop internal review-discussion artifacts: "Issue 1/2/3 regression",
  "Bug 1/2 regression", "Pre-fix this would..." — these reference our
  internal triage labels that a fresh reader has no context for. Test
  docblocks rewritten to describe the behavior under test.
- Drop numbered test enumeration (// 1., // 2., ...) in
  file_storage_promotion_test.cpp.
- Drop a 36-line meta-narrative in promotion_on_hit_test.cpp's
  InjectLocalDiskReplica helper that walked through alternative
  implementations considered. Replaced with a 3-line description of
  what the helper actually does.
- Dedup holder-id auth explanation: the canonical place is the
  PromotionTask struct doc. Inline checks in PromotionAllocStart,
  NotifyPromotionSuccess, and NotifyPromotionFailure collapse to a
  one-liner pointing back to the struct doc.
- Dedup the "Mirror the reaper's expiry path" block in
  NotifyPromotionFailure to point at DiscardExpiredProcessingReplicas
  Part 4 (the canonical rationale) instead of restating each step.
- Trim long comments in master_service.h (PromotionTask struct doc),
  master_service.cpp (AllocStart start_time reset, reaper Part 4,
  PromotionObjectHeartbeat cap rationale), and file_storage.cpp
  (post-AllocStart failure rationale, AllocStart-failure block).
- Replace "v1" anchors ("in v1", "v1 doesn't pass...") with
  current-state phrasing.

Build/tests unchanged: promotion_on_hit_test 28/28,
file_storage_promotion_test 11/11. Format check passes.
@yzhan1 yzhan1 force-pushed the feature/l2-l1-promotion-on-hit branch from 8012665 to ea3e2ff Compare May 21, 2026 04:43
The previous run's build-wheel-cu13 (3.10 and 3.12) jobs were cancelled
by runner shutdown signals mid-build. Empty commit to trigger a fresh
CI run. No code changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant