[Store] feat: add graceful segment unmount APIs#2065

Open
Lin-z-w wants to merge 7 commits into kvcache-ai:main from Lin-z-w:feat/segment-graceful-unmount

Conversation

@Lin-z-w (Contributor) commented May 9, 2026

Description

This PR adds graceful segment unmount support to Mooncake Store.

The change allows callers to request a grace period when unmounting mounted segments: the master first marks the segment as GRACEFULLY_UNMOUNTING, removes it from the allocation pool, and only completes the unmount after the grace period expires. This lets existing readers continue using the segment during the grace window while preventing new allocations from targeting it.

The feature is exposed through both C++/pybind APIs and the Python HTTP service:

  • unmount_segment(segment_ids, grace_period_seconds=0) for mounted shm/file-backed segments.
  • unmount_and_free_segment(segment_ids, grace_period_seconds=0) for allocate-backed segments.
  • /api/unmount_shm with grace_period_seconds.
  • /api/unmount with grace_period_seconds.

The implementation also hardens cleanup behavior by polling graceful unmount completion by segment_id rather than reusable segment names, so local memory registrations and backing resources are released correctly even if a client remounts another segment with the same name during the cleanup window.

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

The following tests/checks were run locally and passed:

cmake --build build --target master_service_test pybind_client_test -j

build/mooncake-store/tests/master_service_test \
  --gtest_filter='MasterServiceTest.GracefulUnmountSegment*'

build/mooncake-store/tests/pybind_client_test \
  --gtest_filter='RealClientTest.MountAndAllocateUnmountApisAcceptGracePeriod:RealClientTest.MountAndAllocateUnmountApisRejectForeignSegments:RealClientTest.AllocateAndMountSegmentFreesOnTearDown'

python3 -m unittest mooncake-wheel.tests.test_mooncake_store_service_api

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@Lin-z-w Lin-z-w changed the title [Store] feat(store): add graceful segment unmount APIs [Store] feat: add graceful segment unmount APIs May 9, 2026
@gemini-code-assist Bot left a comment

Code Review

This pull request introduces a graceful unmount feature for memory segments, allowing them to remain readable for a specified duration while preventing new allocations. The implementation includes a new GracefulUnmountScheduler in the master service, a GRACEFULLY_UNMOUNTING segment status, and updates to the HTTP, Python, and C++ APIs. Feedback identifies several critical issues, including a potential data race in the master service due to insufficient locking, a race condition in the scheduler that could cause a crash during shutdown, and performance concerns regarding inefficient record removal and the holding of mutexes during synchronous RPC calls.

Comment thread mooncake-store/src/master_service.cpp Outdated
const UUID& client_id,
uint64_t grace_period_ms)
-> tl::expected<void, ErrorCode> {
std::shared_lock<std::shared_mutex> shared_lock(snapshot_mutex_);

high

The GracefulUnmountSegment method modifies the state of segment_manager_ (via PrepareGracefulUnmountSegment), which typically requires an exclusive lock on snapshot_mutex_ to ensure consistency and prevent data races with other metadata operations or snapshots. Using std::shared_lock here is likely insufficient if snapshot_mutex_ is the primary synchronization primitive for the service's metadata.

Suggested change
std::shared_lock<std::shared_mutex> shared_lock(snapshot_mutex_);
std::unique_lock<std::shared_mutex> lock(snapshot_mutex_);

Comment on lines +5305 to +5308
if (!timer_running_.load()) {
timer_running_.store(true);
timer_thread_ = std::thread([this]() { this->TimerLoop(); });
}

high

There is a potential race condition between Schedule and Stop. Stop sets timer_running_ to false and releases the mutex before joining the thread. If Schedule is called during this window, it will see timer_running_ as false and attempt to start a new thread, assigning it to timer_thread_. Since the old thread is still joinable, this assignment will trigger std::terminate. A stopping_ flag should be used to prevent restarting the timer during or after shutdown.
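
The `stopping_` flag the reviewer suggests could look like the following minimal sketch. This is a hypothetical illustration, not the PR's actual code: the class, member names, and the empty timer body are stand-ins for `GracefulUnmountScheduler` and `TimerLoop()`.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Sketch: Schedule() checks `stopping_` under the same mutex Stop() holds,
// so the timer thread can never be (re)started once shutdown has begun,
// and the joinable thread is never overwritten (which would std::terminate).
class TimerOwner {
 public:
  bool Schedule() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (stopping_) return false;  // refuse to restart during/after Stop()
    if (!timer_running_) {
      timer_running_ = true;
      timer_thread_ = std::thread([] { /* TimerLoop() body would run here */ });
    }
    return true;
  }

  void Stop() {
    std::thread to_join;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;                    // blocks any further Schedule()
      timer_running_ = false;
      to_join = std::move(timer_thread_);  // take ownership under the lock
    }
    if (to_join.joinable()) to_join.join();  // join outside the mutex
  }

 private:
  std::mutex mutex_;
  bool timer_running_ = false;
  bool stopping_ = false;
  std::thread timer_thread_;
};
```

Because `stopping_` and the thread handle are only touched under `mutex_`, a `Schedule()` racing with `Stop()` either runs before the flag is set (and its thread is joined by `Stop()`) or observes the flag and does nothing.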

Comment on lines +5313 to +5328
void MasterService::GracefulUnmountScheduler::RemoveClientRecords(
const UUID& client_id) {
std::lock_guard<std::mutex> lock(mutex_);
std::vector<Record> remaining;
remaining.reserve(queue_.size());
while (!queue_.empty()) {
auto rec = queue_.top();
queue_.pop();
if (rec.client_id != client_id) {
remaining.push_back(rec);
}
}
for (auto& rec : remaining) {
queue_.push(rec);
}
}

medium

The current implementation of RemoveClientRecords is inefficient ($O(N \log N)$) as it involves popping all elements from the priority queue and pushing them back. Since this is called in a loop within ClientMonitorFunc for every expired client, it could become a performance bottleneck if there are many segments in the graceful unmount queue. Consider checking client validity within the TimerLoop instead, or using a more suitable data structure like std::set or a sorted std::list if frequent removals are expected.
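
The `std::set` alternative mentioned above could be sketched as follows. This is an illustrative stand-in, not the PR's data structure: `Record`'s fields and the class name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>

// Sketch: an ordered set keyed by (deadline, client, segment) keeps the
// earliest deadline at begin(), like a priority queue, but supports erasing
// one client's records in place during a single pass instead of draining
// and re-pushing the entire heap.
struct Record {
  uint64_t deadline_ms;
  std::string client_id;
  std::string segment_id;
  bool operator<(const Record& o) const {  // strict weak ordering
    if (deadline_ms != o.deadline_ms) return deadline_ms < o.deadline_ms;
    if (client_id != o.client_id) return client_id < o.client_id;
    return segment_id < o.segment_id;
  }
};

class GracefulQueue {
 public:
  void Push(Record r) { records_.insert(std::move(r)); }

  // Erase only the records belonging to `client_id`; erase(it) returns the
  // next iterator, so untouched elements are visited exactly once.
  std::size_t RemoveClientRecords(const std::string& client_id) {
    std::size_t removed = 0;
    for (auto it = records_.begin(); it != records_.end();) {
      if (it->client_id == client_id) {
        it = records_.erase(it);
        ++removed;
      } else {
        ++it;
      }
    }
    return removed;
  }

  // The record with the earliest deadline is simply the first element.
  const Record* Front() const {
    return records_.empty() ? nullptr : &*records_.begin();
  }

 private:
  std::set<Record> records_;
};
```

A per-client secondary index (client id to set iterators) would further cut the removal pass to O(k log N) for k matching records, at the cost of keeping the two structures in sync.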

Comment thread mooncake-store/src/real_client.cpp Outdated
Comment on lines +1323 to +1324
auto result = client_->UnmountSegmentById(
id, grace_period_ms, std::move(cleanup_callback));

medium

Calling UnmountSegmentById (which performs a synchronous RPC) while holding the mounted_segment_records_mutex_ can lead to performance issues, as it blocks all other mount/unmount operations for the duration of the network call. It is recommended to collect the necessary information under the lock and then perform the RPCs outside of the critical section.
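
The collect-then-call pattern being recommended could be sketched like this. Everything here is illustrative: `FakeRpc` stands in for the real client's synchronous `UnmountSegmentById` RPC, and the record class is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the RPC client; a real call would block on the network.
struct FakeRpc {
  std::vector<std::string> sent;
  bool UnmountSegmentById(const std::string& id) {
    sent.push_back(id);
    return true;
  }
};

class SegmentRecords {
 public:
  void Add(const std::string& id) {
    std::lock_guard<std::mutex> lock(mutex_);
    records_[id] = true;
  }

  // Snapshot the ids under the mutex, then issue every RPC after releasing
  // it, so other mount/unmount operations are never blocked on the network.
  std::size_t UnmountAll(FakeRpc& rpc) {
    std::vector<std::string> ids;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      for (const auto& entry : records_) ids.push_back(entry.first);
      records_.clear();
    }
    std::size_t ok = 0;
    for (const auto& id : ids) {
      if (rpc.UnmountSegmentById(id)) ++ok;  // no lock held here
    }
    return ok;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, bool> records_;
};
```

The trade-off is that a record removed from the map is committed to being unmounted even if its RPC later fails, so the caller needs a retry or logging path for failures (as raised in a later comment on this PR).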

Comment thread mooncake-store/src/real_client.cpp Outdated
Comment on lines +1490 to +1491
auto result = client_->UnmountSegmentById(
id, grace_period_ms, std::move(cleanup_callback));

medium

Calling UnmountSegmentById (which performs a synchronous RPC) while holding the allocated_segment_records_mutex_ can lead to performance issues, as it blocks all other allocation/unmount operations for the duration of the network call. It is recommended to collect the necessary information under the lock and then perform the RPCs outside of the critical section.


codecov-commenter commented May 9, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 79.46768% with 108 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mooncake-store/src/client_service.cpp 49.00% 51 Missing ⚠️
mooncake-store/src/master_service.cpp 85.41% 14 Missing ⚠️
mooncake-store/src/real_client.cpp 87.20% 11 Missing ⚠️
mooncake-store/src/segment.cpp 75.00% 10 Missing ⚠️
mooncake-integration/store/store_py.cpp 0.00% 7 Missing ⚠️
mooncake-store/src/master_client.cpp 53.33% 7 Missing ⚠️
mooncake-store/src/rpc_service.cpp 78.26% 5 Missing ⚠️
mooncake-store/tests/master_service_test.cpp 97.74% 3 Missing ⚠️


@ykwd ykwd requested a review from 00fish0 May 11, 2026 04:18
Comment on lines +2268 to +2278
void Client::StartGracefulUnmountTimer(const UUID& segment_id,
uint64_t grace_period_ms) {
auto delay =
std::chrono::milliseconds(grace_period_ms) + std::chrono::seconds(10);
task_thread_pool_.enqueue([this, segment_id, delay]() {
if (this->WaitForGracefulUnmountDelay(delay)) {
return;
}
this->OnGracefulUnmountTimer(segment_id, /*retry_left=*/3);
});
}
Collaborator

[use-after-free] The lambda captures this and is enqueued into task_thread_pool_, but ~Client() only sets graceful_unmount_timer_stopping_ = true and notifies — it does not wait for the pool task to finish. If the Client is destroyed while a timer task is still pending or sleeping in the pool, the task will invoke OnGracefulUnmountTimer on a dangling this pointer.
Consider using shared_from_this() / weak_ptr, or draining pending tasks in the destructor before proceeding with member teardown.
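
The `weak_ptr` variant of that suggestion could be sketched as below. This is a hypothetical illustration: the class and `MakeTimerTask` are stand-ins, and the counter stands in for `OnGracefulUnmountTimer`.

```cpp
#include <cassert>
#include <memory>

// Sketch: the timer task captures a std::weak_ptr instead of a raw `this`,
// so a task that fires after the Client has been destroyed locks an empty
// weak_ptr and does nothing, rather than touching freed memory.
class Client : public std::enable_shared_from_this<Client> {
 public:
  int timer_fired = 0;

  // Returns the callable a thread pool would run later.
  auto MakeTimerTask() {
    std::weak_ptr<Client> weak = weak_from_this();
    return [weak]() {
      if (auto self = weak.lock()) {  // still alive: safe to proceed
        self->timer_fired++;          // stands in for OnGracefulUnmountTimer
      }                               // otherwise: silently drop the task
    };
  }
};
```

This requires the `Client` to be owned by a `shared_ptr` (e.g. created via `std::make_shared`); the alternative mentioned in the comment — draining the pool's pending tasks in the destructor before tearing down members — avoids that ownership requirement, which is the route the author takes below.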

Contributor Author

Thanks for pointing out this path. The exact UAF concern is mitigated by our ThreadPool semantics: stop() wakes workers and joins them after draining already-queued tasks, and Client calls task_thread_pool_.stop() before member destruction, so queued timer lambdas should not outlive the Client object. However, the destructor ordering still has a real cleanup issue: it clears graceful_unmount_cleanup_callbacks_ without invoking them. I will adjust the destructor to stop/drain graceful timer work before clearing graceful state, then move and invoke pending cleanup callbacks so local mmap/allocated records are released deterministically.

Comment thread mooncake-store/src/client_service.cpp Outdated
Comment on lines +139 to +146
std::lock_guard<std::mutex> lock(mounted_segments_mutex_);
mounted_segments_.clear();
gracefully_unmounting_segments_.clear();
graceful_unmount_cleanup_callbacks_.clear();
}
{
std::lock_guard<std::mutex> lock(graceful_unmount_timer_mutex_);
graceful_unmount_timer_stopping_ = true;
Collaborator

[resource-leak] The destructor clears graceful_unmount_cleanup_callbacks_ without invoking the stored callbacks. Each callback holds a ReleaseMountedSegmentRecord call that does munmap. If the Client is destroyed while graceful unmounts are still in progress, the mmap regions and file descriptors are leaked. Invoke each callback before clearing the map.
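
The "invoke before clearing" fix could look like this minimal sketch. The class is illustrative, not the PR's `Client`; only the member names mirror the comment.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Sketch: the destructor moves the pending cleanup callbacks out of the map
// under the lock, then invokes them outside it, so each stored
// ReleaseMountedSegmentRecord-style callback (munmap, fd close, ...) runs
// instead of being silently dropped by clear().
class Client {
 public:
  void RegisterCleanup(const std::string& id, std::function<void()> cb) {
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_[id] = std::move(cb);
  }

  ~Client() {
    std::map<std::string, std::function<void()>> pending;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending = std::move(callbacks_);  // take ownership, leave map empty
    }
    for (auto& entry : pending) {
      if (entry.second) entry.second();  // release the backing resources
    }
  }

 private:
  std::mutex mutex_;
  std::map<std::string, std::function<void()>> callbacks_;
};
```

Moving the map out first also means the callbacks run without the mutex held, which matters if a callback itself takes other locks during teardown.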

Contributor Author

Fixed in f1f0ccd. Client::~Client now signals graceful timer shutdown and drains the task thread pool before clearing graceful state. It also moves pending graceful cleanup callbacks out of the map and invokes them, so local mmap/allocated records are released instead of being dropped.

Comment on lines +5366 to +5378
for (auto& rec : expired) {
{
std::lock_guard<std::mutex> lock(mutex_);
auto generation_it = client_generations_.find(rec.client_id);
if (generation_it != client_generations_.end() &&
generation_it->second != rec.client_generation) {
continue;
}
}
if (service_) {
service_->UnmountSegment(rec.segment_id, rec.client_id);
}
}
Collaborator

[race-condition] TOCTOU window between generation check and UnmountSegment call

TOCTOU (Time-of-Check to Time-of-Use): A race condition where a state is checked at one time but used at a later time, allowing the state to change in between.

There is a TOCTOU window between the generation check (under lock) and the UnmountSegment call (after unlock). If RemoveClientRecords increments the generation in that window, the stale record will be executed instead of being skipped. If UnmountSegment is not idempotent in all edge cases, this could cause incorrect behavior.

Contributor Author

Fixed in f1f0ccd. I removed the generation-based lazy cancellation and restored direct removal of queued records under the scheduler mutex. This avoids the check/use window between generation validation and UnmountSegment execution.

Comment thread mooncake-store/src/master_service.cpp Outdated
}
}
if (service_) {
service_->UnmountSegment(rec.segment_id, rec.client_id);
Collaborator

Nit: records are popped from the queue before UnmountSegment is called. If the RPC fails (network timeout, segment already cleaned up by another path), the record is silently dropped with no retry and no log message. Add at least a LOG(WARNING) on failure so operators can diagnose stuck segments.

Contributor Author

Fixed in f1f0ccd. The scheduler now logs a warning with segment_id, client_id, and error code when the final UnmountSegment call fails after a graceful timer expires.

@LujhCoconut
Collaborator

Thanks for this PR. The more critical issues have been raised in my earlier comments. From my perspective, there are still areas where performance could be optimized (e.g., exponential backoff for retries). However, I think it would be appropriate to merge this PR first and follow up with a dedicated performance optimization PR later.

return ErrorCode::OK;
}

ErrorCode ScopedSegmentAccess::PrepareGracefulUnmountSegment(
Collaborator

@00fish0 00fish0 May 11, 2026

PrepareGracefulUnmountSegment destroys the allocator at t=0, which invalidates the segment's object replicas immediately.

PrepareGracefulUnmountSegment (run at t=0, when the graceful-unmount request arrives) does:

  • segment_manager_->allocator_manager_.removeAllocator(segment.name, allocator);
  • mounted_segment.buf_allocator.reset();

Those are the only two shared_ptrs to this BufferAllocator (one in allocator_manager_, one in mounted_segment.buf_allocator). Replicas only hold a weak_ptr to it (allocator.h:79; replica.h:292-295 — has_invalid_mem_handle() → !allocator_.expired()).

So as soon as this runs, the allocator's refcount hits zero and it's destroyed, and every replica on this segment immediately becomes has_invalid_mem_handle() == true — at t=0, not when the grace period expires.

Consequences:

  • ClientMonitorFunc calls ClearInvalidHandles() on every iteration, which EraseReplicas for these invalid handles; if an object's only replica was here, the whole metadata entry is erased too.
  • Even before that sweep runs, a GetReplicaList for that key calls CleanupStaleHandles itself, wipes the replica on the spot, and returns object-not-found.

In other words: a Get that reaches the master during the grace period will, for any object whose data lives only on this segment, fail almost immediately (OBJECT_NOT_FOUND) — it does not survive until the end of the grace period. What the grace period actually preserves is only:

(1) the segment metadata (QuerySegmentStatus still finds it)
(2) the client-side TransferEngine registration of that memory (kept until grace_period + ~40s, when unregisterLocalMemory + munmap run).

That doesn't match the PR description ("This lets existing readers continue using the segment during the grace window"): it only helps a reader that has already obtained a buffer descriptor from the master and is about to issue / is issuing a one-sided RDMA read (descriptor in hand, memory still registered → the read works). It does not protect a reader that issues a new Get during the grace period.

Suggestion: during the graceful phase, don't destroy the allocator — just remove it from the "allocatable" set (keep the mounted_segment.buf_allocator shared_ptr, only call allocator_manager_.removeAllocator(...)), so replicas on the segment stay isAllocatorValid() for the duration. Then buf_allocator.reset() only at CommitUnmountSegment when the grace period expires. That way GetReplicaList keeps returning this segment's descriptors during the grace window, which is what "object stays readable during the grace period" should mean.

There's also a test gap: GracefulUnmountSegment_PreventAllocation only does a fresh PutStart/PutEnd after the graceful unmount — it never does a GetReplicaList for the key that already lived on segment1, so "is an existing object still readable during the grace period" isn't covered at all. Worth adding a case: mount → put object to seg → graceful unmount(seg) → GetReplicaList(key) within the grace window should still succeed and point at that seg.
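
The ownership issue in that suggestion reduces to `shared_ptr`/`weak_ptr` lifetime ordering. The sketch below is a minimal illustration under the comment's stated assumption that replicas hold only a `weak_ptr` to the allocator; the types are stand-ins, not Mooncake's `BufferAllocator` or replica handles.

```cpp
#include <memory>

struct BufferAllocator {};

struct Segment {
  std::shared_ptr<BufferAllocator> buf_allocator;
};

// Problematic order: both shared_ptrs dropped at prepare time, so the
// replica's weak_ptr expires at t=0, before the grace period even starts.
bool replica_survives_if_reset_at_prepare() {
  auto mgr_ref = std::make_shared<BufferAllocator>();   // allocator_manager_'s ref
  Segment seg{mgr_ref};                                 // mounted_segment's ref
  std::weak_ptr<BufferAllocator> replica = mgr_ref;     // replica's view
  mgr_ref.reset();            // removeAllocator(...)
  seg.buf_allocator.reset();  // premature reset at prepare time
  return !replica.expired();
}

// Suggested order: prepare only removes the allocator from the allocatable
// set; the segment's own shared_ptr is reset at commit, after the grace
// period, so the replica stays valid for the whole window.
bool replica_survives_until_commit() {
  auto mgr_ref = std::make_shared<BufferAllocator>();
  Segment seg{mgr_ref};
  std::weak_ptr<BufferAllocator> replica = mgr_ref;
  mgr_ref.reset();            // removeAllocator(...) only
  bool alive_in_grace_window = !replica.expired();
  seg.buf_allocator.reset();  // CommitUnmountSegment()
  return alive_in_grace_window && replica.expired();
}
```

In the first ordering the allocator's refcount hits zero immediately and every dependent `weak_ptr` expires; in the second, the single surviving `shared_ptr` held by the segment keeps the weak references valid until the commit step.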

Contributor Author

Fixed in f1f0ccd. PrepareGracefulUnmountSegment now only removes the allocator from the allocatable set and keeps mounted_segment.buf_allocator alive during the grace window. The allocator is released during the final unmount path. I also extended GracefulUnmountSegment_PreventAllocation to verify that an existing object on the graceful segment remains readable via GetReplicaList during the grace period, while new allocations avoid that segment.

Comment on lines +3992 to +3997
// Notify graceful unmount scheduler to drop pending records
// for expired clients. The actual unmount is handled below.
for (auto& cid : expired_clients) {
graceful_unmount_scheduler_.RemoveClientRecords(cid);
}

Contributor Author

I checked the current branch and this specific state leak does not reproduce with the current PrepareUnmountSegment() implementation. It only rejects UNMOUNTING and does not reject GRACEFULLY_UNMOUNTING, so the client-expiry path can still prepare and commit cleanup for a segment already in graceful unmount. The snippet in the comment looks like it reflects an older version of the status check. I agree this state should remain covered by tests, and I can add/extend a test if needed, but the current code path should not permanently leak the segment for this reason.

Collaborator

@LujhCoconut LujhCoconut left a comment

Thanks for this PR. This proposal makes sense to me. The correctness of the current implementation is also LGTM.


4 participants