[Store] feat: add graceful segment unmount APIs #2065
Conversation
Code Review
This pull request introduces a graceful unmount feature for memory segments, allowing them to remain readable for a specified duration while preventing new allocations. The implementation includes a new GracefulUnmountScheduler in the master service, a GRACEFULLY_UNMOUNTING segment status, and updates to the HTTP, Python, and C++ APIs. Feedback identifies several critical issues, including a potential data race in the master service due to insufficient locking, a race condition in the scheduler that could cause a crash during shutdown, and performance concerns regarding inefficient record removal and the holding of mutexes during synchronous RPC calls.
```cpp
        const UUID& client_id,
        uint64_t grace_period_ms)
        -> tl::expected<void, ErrorCode> {
        std::shared_lock<std::shared_mutex> shared_lock(snapshot_mutex_);
```
The GracefulUnmountSegment method modifies the state of segment_manager_ (via PrepareGracefulUnmountSegment), which typically requires an exclusive lock on snapshot_mutex_ to ensure consistency and prevent data races with other metadata operations or snapshots. Using std::shared_lock here is likely insufficient if snapshot_mutex_ is the primary synchronization primitive for the service's metadata.
Suggested change:

```diff
- std::shared_lock<std::shared_mutex> shared_lock(snapshot_mutex_);
+ std::unique_lock<std::shared_mutex> lock(snapshot_mutex_);
```
```cpp
if (!timer_running_.load()) {
    timer_running_.store(true);
    timer_thread_ = std::thread([this]() { this->TimerLoop(); });
}
```
There is a potential race condition between Schedule and Stop. Stop sets timer_running_ to false and releases the mutex before joining the thread. If Schedule is called during this window, it will see timer_running_ as false and attempt to start a new thread, assigning it to timer_thread_. Since the old thread is still joinable, this assignment will trigger std::terminate. A stopping_ flag should be used to prevent restarting the timer during or after shutdown.
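A sketch of the `stopping_` flag fix suggested above, with simplified hypothetical names (`TimerScheduler`, `TimerLoop`): the flag is checked under the same mutex that guards the thread handle, so `Schedule` can never assign over a still-joinable thread while `Stop` is shutting down.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

class TimerScheduler {
  public:
    ~TimerScheduler() { Stop(); }

    // Returns false if the scheduler is stopping or stopped.
    bool Schedule() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (stopping_) return false;  // never (re)start during/after shutdown
        if (!timer_running_) {
            timer_running_ = true;
            timer_thread_ = std::thread([this] { TimerLoop(); });
        }
        return true;
    }

    void Stop() {
        std::thread to_join;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            stopping_ = true;
            timer_running_ = false;
            cv_.notify_all();
            // Take ownership of the handle under the lock so Schedule
            // can never observe a joinable thread it might overwrite.
            to_join = std::move(timer_thread_);
        }
        if (to_join.joinable()) to_join.join();  // join outside the lock
    }

  private:
    void TimerLoop() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (timer_running_) {
            cv_.wait_for(lock, std::chrono::milliseconds(10));
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::thread timer_thread_;
    bool timer_running_ = false;
    bool stopping_ = false;
};
```

Note that `Stop` moves the thread handle out while holding the mutex but joins after releasing it, so `TimerLoop` (which needs the mutex to exit) cannot deadlock against the join.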
```cpp
void MasterService::GracefulUnmountScheduler::RemoveClientRecords(
    const UUID& client_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<Record> remaining;
    remaining.reserve(queue_.size());
    while (!queue_.empty()) {
        auto rec = queue_.top();
        queue_.pop();
        if (rec.client_id != client_id) {
            remaining.push_back(rec);
        }
    }
    for (auto& rec : remaining) {
        queue_.push(rec);
    }
}
```
The current implementation of RemoveClientRecords is inefficient ($O(N \log N)$) as it involves popping all elements from the priority queue and pushing them back. Since this is called in a loop within ClientMonitorFunc for every expired client, it could become a performance bottleneck if there are many segments in the graceful unmount queue. Consider checking client validity within the TimerLoop instead, or using a more suitable data structure like std::set or a sorted std::list if frequent removals are expected.
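A sketch of the `std::set` alternative mentioned above (the `Record` fields and `GracefulQueue` name are illustrative, not from the PR): an ordered set keyed by deadline still gives `priority_queue`-style earliest-deadline access, while per-client removal becomes a single erase pass with no re-heapify and no temporary vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>

struct Record {
    uint64_t deadline_ms;
    std::string segment_id;
    std::string client_id;
    // Order by deadline; segment_id breaks ties so records stay unique.
    bool operator<(const Record& other) const {
        if (deadline_ms != other.deadline_ms)
            return deadline_ms < other.deadline_ms;
        return segment_id < other.segment_id;
    }
};

class GracefulQueue {
  public:
    void Push(Record rec) { records_.insert(std::move(rec)); }

    // Earliest-deadline record, like priority_queue::top().
    const Record& Top() const { return *records_.begin(); }
    void PopTop() { records_.erase(records_.begin()); }

    // Erase every record owned by one client; the rest are untouched.
    std::size_t RemoveClientRecords(const std::string& client_id) {
        std::size_t removed = 0;
        for (auto it = records_.begin(); it != records_.end();) {
            if (it->client_id == client_id) {
                it = records_.erase(it);
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

    std::size_t Size() const { return records_.size(); }

  private:
    std::set<Record> records_;
};
```

This is still a linear scan per expired client; if that matters, a secondary map from `client_id` to set iterators would make removal O(k log N) for k records.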
```cpp
auto result = client_->UnmountSegmentById(
    id, grace_period_ms, std::move(cleanup_callback));
```
Calling UnmountSegmentById (which performs a synchronous RPC) while holding the mounted_segment_records_mutex_ can lead to performance issues, as it blocks all other mount/unmount operations for the duration of the network call. It is recommended to collect the necessary information under the lock and then perform the RPCs outside of the critical section.
```cpp
auto result = client_->UnmountSegmentById(
    id, grace_period_ms, std::move(cleanup_callback));
```
Calling UnmountSegmentById (which performs a synchronous RPC) while holding the allocated_segment_records_mutex_ can lead to performance issues, as it blocks all other allocation/unmount operations for the duration of the network call. It is recommended to collect the necessary information under the lock and then perform the RPCs outside of the critical section.
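A self-contained sketch of the collect-then-call pattern both comments recommend. The function signature and the `rpc` callback are illustrative stand-ins for `client_->UnmountSegmentById`; the point is only the shape: copy what you need under the lock, release it, then make the blocking calls.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Drain (id, grace_period_ms) records and unmount each one. The records
// mutex is held only long enough to snapshot the list; the blocking RPC
// runs with no lock held, so concurrent mount/unmount calls never stall
// behind network latency. Returns the ids whose RPC failed.
std::vector<std::string> UnmountAll(
    std::mutex& records_mutex,
    std::vector<std::pair<std::string, uint64_t>>& records,
    const std::function<bool(const std::string&, uint64_t)>& rpc) {
    std::vector<std::pair<std::string, uint64_t>> to_unmount;
    {
        std::lock_guard<std::mutex> lock(records_mutex);
        to_unmount.swap(records);  // copy/move the needed info under the lock
    }
    std::vector<std::string> failed;
    for (const auto& [id, grace_ms] : to_unmount) {
        if (!rpc(id, grace_ms)) {  // synchronous call, lock not held
            failed.push_back(id);
        }
    }
    return failed;
}
```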
```cpp
void Client::StartGracefulUnmountTimer(const UUID& segment_id,
                                       uint64_t grace_period_ms) {
    auto delay =
        std::chrono::milliseconds(grace_period_ms) + std::chrono::seconds(10);
    task_thread_pool_.enqueue([this, segment_id, delay]() {
        if (this->WaitForGracefulUnmountDelay(delay)) {
            return;
        }
        this->OnGracefulUnmountTimer(segment_id, /*retry_left=*/3);
    });
}
```
[use-after-free] The lambda captures this and is enqueued into task_thread_pool_, but ~Client() only sets graceful_unmount_timer_stopping_ = true and notifies — it does not wait for the pool task to finish. If the Client is destroyed while a timer task is still pending or sleeping in the pool, the task will invoke OnGracefulUnmountTimer on a dangling this pointer.
Consider using shared_from_this() / weak_ptr, or draining pending tasks in the destructor before proceeding with member teardown.
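A sketch of the `weak_ptr` option (assuming, hypothetically, that `Client` could inherit `std::enable_shared_from_this`; the `timer_fired` member is illustrative). The enqueued task captures a weak pointer and upgrades it before touching any member, so a task that outlives the `Client` degrades to a no-op instead of dereferencing a dangling `this`.

```cpp
#include <cassert>
#include <functional>
#include <memory>

class Client : public std::enable_shared_from_this<Client> {
  public:
    int timer_fired = 0;

    // Build the task that would be handed to the thread pool.
    // Captures weak_ptr<Client>, never a raw `this`.
    std::function<void()> MakeTimerTask() {
        std::weak_ptr<Client> weak_self = weak_from_this();
        return [weak_self]() {
            if (auto self = weak_self.lock()) {
                self->timer_fired++;  // safe: Client kept alive for the call
            }
            // else: Client already destroyed; skip silently.
        };
    }
};
```

This only works if `Client` instances are owned by `shared_ptr`; when they are stack- or unique_ptr-owned, draining the pool in the destructor (as the reply below describes) is the alternative.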
Thanks for pointing at this path. The exact UAF concern is mitigated by our ThreadPool semantics: stop() wakes workers and joins them after draining already queued tasks, and Client calls task_thread_pool_.stop() before member destruction. So queued timer lambdas should not outlive the Client object. However, the destructor ordering still has a real cleanup issue: it clears graceful_unmount_cleanup_callbacks_ without invoking them. I will adjust the destructor to stop/drain graceful timer work before clearing graceful state, then move and invoke pending cleanup callbacks so local mmap/allocated records are released deterministically.
```cpp
    std::lock_guard<std::mutex> lock(mounted_segments_mutex_);
    mounted_segments_.clear();
    gracefully_unmounting_segments_.clear();
    graceful_unmount_cleanup_callbacks_.clear();
}
{
    std::lock_guard<std::mutex> lock(graceful_unmount_timer_mutex_);
    graceful_unmount_timer_stopping_ = true;
```
[resource-leak] The destructor clears graceful_unmount_cleanup_callbacks_ without invoking the stored callbacks. Each callback holds a ReleaseMountedSegmentRecord call that does munmap. If the Client is destroyed while graceful unmounts are still in progress, the mmap regions and file descriptors are leaked. Invoke each callback before clearing the map.
Fixed in f1f0ccd. Client::~Client now signals graceful timer shutdown and drains the task thread pool before clearing graceful state. It also moves pending graceful cleanup callbacks out of the map and invokes them, so local mmap/allocated records are released instead of being dropped.
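The move-then-invoke shape of that destructor fix can be sketched as follows (the free function and the simplified `std::map` of callbacks are stand-ins for the `Client` members): ownership of the callbacks is taken under the lock, then each one runs after the lock is released, so every stored munmap/release action executes exactly once instead of being dropped.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

void DrainCleanupCallbacks(
    std::mutex& mounted_segments_mutex,
    std::map<std::string, std::function<void()>>& cleanup_callbacks) {
    std::map<std::string, std::function<void()>> pending;
    {
        std::lock_guard<std::mutex> lock(mounted_segments_mutex);
        pending.swap(cleanup_callbacks);  // take ownership; map left empty
    }
    // Invoke outside the lock: callbacks may do heavy work (munmap, etc.)
    for (auto& [segment_id, cb] : pending) {
        if (cb) cb();  // e.g. ReleaseMountedSegmentRecord -> munmap
    }
}
```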
```cpp
for (auto& rec : expired) {
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto generation_it = client_generations_.find(rec.client_id);
        if (generation_it != client_generations_.end() &&
            generation_it->second != rec.client_generation) {
            continue;
        }
    }
    if (service_) {
        service_->UnmountSegment(rec.segment_id, rec.client_id);
    }
}
```
[race-condition] TOCTOU window between generation check and UnmountSegment call
TOCTOU (Time-of-Check to Time-of-Use): A race condition where a state is checked at one time but used at a later time, allowing the state to change in between.
There is a TOCTOU window between the generation check (under lock) and the UnmountSegment call (after unlock). If RemoveClientRecords increments the generation in that window, the stale record will be executed instead of being skipped. If UnmountSegment is not idempotent in all edge cases, this could cause incorrect behavior.
Fixed in f1f0ccd. I removed the generation-based lazy cancellation and restored direct removal of queued records under the scheduler mutex. This avoids the check/use window between generation validation and UnmountSegment execution.
```cpp
    }
}
if (service_) {
    service_->UnmountSegment(rec.segment_id, rec.client_id);
```
Nits: Records are popped from the queue before UnmountSegment is called. If the RPC fails (network timeout, segment already cleaned up by another path), the record is silently dropped with no retry and no log message. Add at least a LOG(WARNING) on failure so operators can diagnose stuck segments.
Fixed in f1f0ccd. The scheduler now logs a warning with segment_id, client_id, and error code when the final UnmountSegment call fails after a graceful timer expires.
Thanks for this PR. The more critical issues have been raised in my earlier comments. From my perspective, there are still areas where performance could be optimized (e.g., exponential backoff for retries). However, I think it would be appropriate to merge this PR first and follow up with a dedicated performance optimization PR later.
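For the follow-up, the exponential backoff suggested here could be as small as a delay helper like the one below (base and cap values are illustrative, not from the PR): double the delay on each failed attempt, clamped to a maximum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
uint64_t BackoffDelayMs(int attempt, uint64_t base_ms = 100,
                        uint64_t max_ms = 10000) {
    uint64_t delay = base_ms;
    for (int i = 0; i < attempt && delay < max_ms; ++i) {
        delay *= 2;  // exponential growth until the cap is reached
    }
    return std::min(delay, max_ms);
}
```

The retry loop would then sleep `BackoffDelayMs(attempt)` between `UnmountSegment` attempts instead of retrying at a fixed interval.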
```cpp
    return ErrorCode::OK;
}

ErrorCode ScopedSegmentAccess::PrepareGracefulUnmountSegment(
```
PrepareGracefulUnmountSegment destroys the allocator at t=0, which invalidates the segment's object replicas immediately.
PrepareGracefulUnmountSegment (run at t=0, when the graceful-unmount request arrives) does:
- segment_manager_->allocator_manager_.removeAllocator(segment.name, allocator);
- mounted_segment.buf_allocator.reset();
Those are the only two shared_ptrs to this BufferAllocator (one in allocator_manager_, one in mounted_segment.buf_allocator). Replicas only hold a weak_ptr to it (allocator.h:79; replica.h:292-295 — has_invalid_mem_handle() → !allocator_.expired()).
So as soon as this runs, the allocator's refcount hits zero and it's destroyed, and every replica on this segment immediately becomes has_invalid_mem_handle() == true — at t=0, not when the grace period expires.
Consequences:
- ClientMonitorFunc calls ClearInvalidHandles() on every iteration, which EraseReplicas for these invalid handles; if an object's only replica was here, the whole metadata entry is erased too.
- Even before that sweep runs, a GetReplicaList for that key calls CleanupStaleHandles itself, wipes the replica on the spot, and returns object-not-found.
In other words: a Get that reaches the master during the grace period will, for any object whose data lives only on this segment, fail almost immediately (OBJECT_NOT_FOUND) — it does not survive until the end of the grace period. What the grace period actually preserves is only:
(1) the segment metadata (QuerySegmentStatus still finds it)
(2) the client-side TransferEngine registration of that memory (kept until grace_period + ~40s, when unregisterLocalMemory + munmap run).
That doesn't match the PR description ("This lets existing readers continue using the segment during the grace window"): it only helps a reader that has already obtained a buffer descriptor from the master and is about to issue / is issuing a one-sided RDMA read (descriptor in hand, memory still registered → the read works). It does not protect a reader that issues a new Get during the grace period.
Suggestion: during the graceful phase, don't destroy the allocator — just remove it from the "allocatable" set (keep the mounted_segment.buf_allocator shared_ptr, only call allocator_manager_.removeAllocator(...)), so replicas on the segment stay isAllocatorValid() for the duration. Then buf_allocator.reset() only at CommitUnmountSegment when the grace period expires. That way GetReplicaList keeps returning this segment's descriptors during the grace window, which is what "object stays readable during the grace period" should mean.
There's also a test gap: GracefulUnmountSegment_PreventAllocation only does a fresh PutStart/PutEnd after the graceful unmount — it never does a GetReplicaList for the key that already lived on segment1, so "is an existing object still readable during the grace period" isn't covered at all. Worth adding a case: mount → put object to seg → graceful unmount(seg) → GetReplicaList(key) within the grace window should still succeed and point at that seg.
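The suggested fix boils down to separating "stop allocating" from "release the memory". A self-contained sketch (with `BufferAllocator`, `MountedSegment`, and `SegmentState` as simplified stand-ins for the real types) showing how replica `weak_ptr`s stay valid through the grace window:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct BufferAllocator {};  // stand-in for the real allocator

struct MountedSegment {
    std::shared_ptr<BufferAllocator> buf_allocator;  // owning reference
};

struct SegmentState {
    // The "allocatable" set the allocator_manager_ consults for PutStart.
    std::map<std::string, std::shared_ptr<BufferAllocator>> allocatable;
    MountedSegment mounted;
};

// t=0: stop new allocations, but keep replicas readable. Only the
// allocatable-set reference is dropped; the owning shared_ptr in the
// mounted-segment record stays alive, so replica weak_ptrs do not expire.
void PrepareGracefulUnmount(SegmentState& s, const std::string& name) {
    s.allocatable.erase(name);
    // NOTE: deliberately do NOT reset s.mounted.buf_allocator here.
}

// Grace period expired: now actually release the allocator. Replica
// weak_ptrs expire at this point, not at t=0.
void CommitUnmount(SegmentState& s) {
    s.mounted.buf_allocator.reset();
}
```

The assertions below mirror the test case proposed above: a replica's `weak_ptr` survives `PrepareGracefulUnmount` (object still readable in the grace window) and expires only at `CommitUnmount`.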
Fixed in f1f0ccd. PrepareGracefulUnmountSegment now only removes the allocator from the allocatable set and keeps mounted_segment.buf_allocator alive during the grace window. The allocator is released during the final unmount path. I also extended GracefulUnmountSegment_PreventAllocation to verify that an existing object on the graceful segment remains readable via GetReplicaList during the grace period, while new allocations avoid that segment.
```cpp
// Notify graceful unmount scheduler to drop pending records
// for expired clients. The actual unmount is handled below.
for (auto& cid : expired_clients) {
    graceful_unmount_scheduler_.RemoveClientRecords(cid);
}
```
I checked the current branch and this specific state leak does not reproduce with the current PrepareUnmountSegment() implementation. It only rejects UNMOUNTING and does not reject GRACEFULLY_UNMOUNTING, so the client-expiry path can still prepare and commit cleanup for a segment already in graceful unmount. The snippet in the comment looks like it reflects an older version of the status check. I agree this state should remain covered by tests, and I can add/extend a test if needed, but the current code path should not permanently leak the segment for this reason.
LujhCoconut left a comment
Thanks for this PR. This proposal makes sense to me, and the correctness of the current implementation LGTM.
Description
This PR adds graceful segment unmount support to Mooncake Store.
The change allows callers to request a grace period when unmounting mounted segments, so the master first marks the segment as GRACEFULLY_UNMOUNTING, removes it from allocation, and only completes the unmount after the grace period expires. This lets existing readers continue using the segment during the grace window while preventing new allocations from targeting it.

The feature is exposed through both C++/pybind APIs and the Python HTTP service:

- `unmount_segment(segment_ids, grace_period_seconds=0)` for mounted shm/file-backed segments.
- `unmount_and_free_segment(segment_ids, grace_period_seconds=0)` for allocate-backed segments.
- `/api/unmount_shm` with `grace_period_seconds`.
- `/api/unmount` with `grace_period_seconds`.

The implementation also hardens cleanup behavior by polling graceful unmount completion by `segment_id` rather than reusable segment names, so local memory registrations and backing resources are released correctly even if a client remounts another segment with the same name during the cleanup window.

Module

- mooncake-transfer-engine
- mooncake-store
- mooncake-ep
- mooncake-integration
- mooncake-p2p-store
- mooncake-wheel
- mooncake-pg
- mooncake-rl
How Has This Been Tested?
The following tests/checks were run locally and passed:
Checklist

- Ran `./scripts/code_format.sh` before submitting.