[Store] feat: add graceful segment unmount APIs #2065
Conversation
Code Review
This pull request introduces a graceful unmount feature for memory segments, allowing them to remain readable for a specified duration while preventing new allocations. The implementation includes a new GracefulUnmountScheduler in the master service, a GRACEFULLY_UNMOUNTING segment status, and updates to the HTTP, Python, and C++ APIs. Feedback identifies several critical issues, including a potential data race in the master service due to insufficient locking, a race condition in the scheduler that could cause a crash during shutdown, and performance concerns regarding inefficient record removal and the holding of mutexes during synchronous RPC calls.
```cpp
        const UUID& client_id,
        uint64_t grace_period_ms)
        -> tl::expected<void, ErrorCode> {
        std::shared_lock<std::shared_mutex> shared_lock(snapshot_mutex_);
```
The GracefulUnmountSegment method modifies the state of segment_manager_ (via PrepareGracefulUnmountSegment), which typically requires an exclusive lock on snapshot_mutex_ to ensure consistency and prevent data races with other metadata operations or snapshots. Using std::shared_lock here is likely insufficient if snapshot_mutex_ is the primary synchronization primitive for the service's metadata.
Suggested change:

```diff
- std::shared_lock<std::shared_mutex> shared_lock(snapshot_mutex_);
+ std::unique_lock<std::shared_mutex> lock(snapshot_mutex_);
```
```cpp
if (!timer_running_.load()) {
    timer_running_.store(true);
    timer_thread_ = std::thread([this]() { this->TimerLoop(); });
}
```
There is a potential race condition between Schedule and Stop. Stop sets timer_running_ to false and releases the mutex before joining the thread. If Schedule is called during this window, it will see timer_running_ as false and attempt to start a new thread, assigning it to timer_thread_. Since the old thread is still joinable, this assignment will trigger std::terminate. A stopping_ flag should be used to prevent restarting the timer during or after shutdown.
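A sketch of the `stopping_` flag fix suggested above, with simplified hypothetical names (`TimerScheduler`, `TimerLoop`): the flag is checked under the same mutex that guards the thread handle, so `Schedule` can never assign over a still-joinable thread while `Stop` is shutting down.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

class TimerScheduler {
  public:
    ~TimerScheduler() { Stop(); }

    // Returns false if the scheduler is stopping or stopped.
    bool Schedule() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (stopping_) return false;  // never (re)start during/after shutdown
        if (!timer_running_) {
            timer_running_ = true;
            timer_thread_ = std::thread([this] { TimerLoop(); });
        }
        return true;
    }

    void Stop() {
        std::thread to_join;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            stopping_ = true;
            timer_running_ = false;
            cv_.notify_all();
            // Take ownership of the handle under the lock so Schedule
            // can never observe a joinable thread it might overwrite.
            to_join = std::move(timer_thread_);
        }
        if (to_join.joinable()) to_join.join();  // join outside the lock
    }

  private:
    void TimerLoop() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (timer_running_) {
            cv_.wait_for(lock, std::chrono::milliseconds(10));
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::thread timer_thread_;
    bool timer_running_ = false;
    bool stopping_ = false;
};
```

Note that `Stop` moves the thread handle out while holding the mutex but joins after releasing it, so `TimerLoop` (which needs the mutex to exit) cannot deadlock against the join.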
```cpp
void MasterService::GracefulUnmountScheduler::RemoveClientRecords(
    const UUID& client_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<Record> remaining;
    remaining.reserve(queue_.size());
    while (!queue_.empty()) {
        auto rec = queue_.top();
        queue_.pop();
        if (rec.client_id != client_id) {
            remaining.push_back(rec);
        }
    }
    for (auto& rec : remaining) {
        queue_.push(rec);
    }
}
```
The current implementation of RemoveClientRecords is inefficient ($O(N \log N)$) as it involves popping all elements from the priority queue and pushing them back. Since this is called in a loop within ClientMonitorFunc for every expired client, it could become a performance bottleneck if there are many segments in the graceful unmount queue. Consider checking client validity within the TimerLoop instead, or using a more suitable data structure like std::set or a sorted std::list if frequent removals are expected.
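A sketch of the `std::set` alternative mentioned above (the `Record` fields and `GracefulQueue` name are illustrative, not from the PR): an ordered set keyed by deadline still gives `priority_queue`-style earliest-deadline access, while per-client removal becomes a single erase pass with no re-heapify and no temporary vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>

struct Record {
    uint64_t deadline_ms;
    std::string segment_id;
    std::string client_id;
    // Order by deadline; segment_id breaks ties so records stay unique.
    bool operator<(const Record& other) const {
        if (deadline_ms != other.deadline_ms)
            return deadline_ms < other.deadline_ms;
        return segment_id < other.segment_id;
    }
};

class GracefulQueue {
  public:
    void Push(Record rec) { records_.insert(std::move(rec)); }

    // Earliest-deadline record, like priority_queue::top().
    const Record& Top() const { return *records_.begin(); }
    void PopTop() { records_.erase(records_.begin()); }

    // Erase every record owned by one client; the rest are untouched.
    std::size_t RemoveClientRecords(const std::string& client_id) {
        std::size_t removed = 0;
        for (auto it = records_.begin(); it != records_.end();) {
            if (it->client_id == client_id) {
                it = records_.erase(it);
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

    std::size_t Size() const { return records_.size(); }

  private:
    std::set<Record> records_;
};
```

This is still a linear scan per expired client; if that matters, a secondary map from `client_id` to set iterators would make removal O(k log N) for k records.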
```cpp
auto result = client_->UnmountSegmentById(
    id, grace_period_ms, std::move(cleanup_callback));
```
Calling UnmountSegmentById (which performs a synchronous RPC) while holding the mounted_segment_records_mutex_ can lead to performance issues, as it blocks all other mount/unmount operations for the duration of the network call. It is recommended to collect the necessary information under the lock and then perform the RPCs outside of the critical section.
```cpp
auto result = client_->UnmountSegmentById(
    id, grace_period_ms, std::move(cleanup_callback));
```
Calling UnmountSegmentById (which performs a synchronous RPC) while holding the allocated_segment_records_mutex_ can lead to performance issues, as it blocks all other allocation/unmount operations for the duration of the network call. It is recommended to collect the necessary information under the lock and then perform the RPCs outside of the critical section.
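A self-contained sketch of the collect-then-call pattern both comments recommend. The function signature and the `rpc` callback are illustrative stand-ins for `client_->UnmountSegmentById`; the point is only the shape: copy what you need under the lock, release it, then make the blocking calls.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Drain (id, grace_period_ms) records and unmount each one. The records
// mutex is held only long enough to snapshot the list; the blocking RPC
// runs with no lock held, so concurrent mount/unmount calls never stall
// behind network latency. Returns the ids whose RPC failed.
std::vector<std::string> UnmountAll(
    std::mutex& records_mutex,
    std::vector<std::pair<std::string, uint64_t>>& records,
    const std::function<bool(const std::string&, uint64_t)>& rpc) {
    std::vector<std::pair<std::string, uint64_t>> to_unmount;
    {
        std::lock_guard<std::mutex> lock(records_mutex);
        to_unmount.swap(records);  // copy/move the needed info under the lock
    }
    std::vector<std::string> failed;
    for (const auto& [id, grace_ms] : to_unmount) {
        if (!rpc(id, grace_ms)) {  // synchronous call, lock not held
            failed.push_back(id);
        }
    }
    return failed;
}
```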
```cpp
void Client::StartGracefulUnmountTimer(const UUID& segment_id,
                                       uint64_t grace_period_ms) {
    auto delay =
        std::chrono::milliseconds(grace_period_ms) + std::chrono::seconds(10);
    task_thread_pool_.enqueue([this, segment_id, delay]() {
        if (this->WaitForGracefulUnmountDelay(delay)) {
            return;
        }
        this->OnGracefulUnmountTimer(segment_id, /*retry_left=*/3);
    });
}
```
[use-after-free] The lambda captures this and is enqueued into task_thread_pool_, but ~Client() only sets graceful_unmount_timer_stopping_ = true and notifies — it does not wait for the pool task to finish. If the Client is destroyed while a timer task is still pending or sleeping in the pool, the task will invoke OnGracefulUnmountTimer on a dangling this pointer.
Consider using shared_from_this() / weak_ptr, or draining pending tasks in the destructor before proceeding with member teardown.
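A sketch of the `weak_ptr` option (assuming, hypothetically, that `Client` could inherit `std::enable_shared_from_this`; the `timer_fired` member is illustrative). The enqueued task captures a weak pointer and upgrades it before touching any member, so a task that outlives the `Client` degrades to a no-op instead of dereferencing a dangling `this`.

```cpp
#include <cassert>
#include <functional>
#include <memory>

class Client : public std::enable_shared_from_this<Client> {
  public:
    int timer_fired = 0;

    // Build the task that would be handed to the thread pool.
    // Captures weak_ptr<Client>, never a raw `this`.
    std::function<void()> MakeTimerTask() {
        std::weak_ptr<Client> weak_self = weak_from_this();
        return [weak_self]() {
            if (auto self = weak_self.lock()) {
                self->timer_fired++;  // safe: Client kept alive for the call
            }
            // else: Client already destroyed; skip silently.
        };
    }
};
```

This only works if `Client` instances are owned by `shared_ptr`; when they are stack- or unique_ptr-owned, draining the pool in the destructor (as the reply below describes) is the alternative.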
Thanks for pointing at this path. The exact UAF concern is mitigated by our ThreadPool semantics: stop() wakes workers and joins them after draining already queued tasks, and Client calls task_thread_pool_.stop() before member destruction. So queued timer lambdas should not outlive the Client object. However, the destructor ordering still has a real cleanup issue: it clears graceful_unmount_cleanup_callbacks_ without invoking them. I will adjust the destructor to stop/drain graceful timer work before clearing graceful state, then move and invoke pending cleanup callbacks so local mmap/allocated records are released deterministically.
```cpp
    std::lock_guard<std::mutex> lock(mounted_segments_mutex_);
    mounted_segments_.clear();
    gracefully_unmounting_segments_.clear();
    graceful_unmount_cleanup_callbacks_.clear();
}
{
    std::lock_guard<std::mutex> lock(graceful_unmount_timer_mutex_);
    graceful_unmount_timer_stopping_ = true;
```
[resource-leak] The destructor clears graceful_unmount_cleanup_callbacks_ without invoking the stored callbacks. Each callback holds a ReleaseMountedSegmentRecord call that does munmap. If the Client is destroyed while graceful unmounts are still in progress, the mmap regions and file descriptors are leaked. Invoke each callback before clearing the map.
Fixed in f1f0ccd. Client::~Client now signals graceful timer shutdown and drains the task thread pool before clearing graceful state. It also moves pending graceful cleanup callbacks out of the map and invokes them, so local mmap/allocated records are released instead of being dropped.
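The move-then-invoke shape of that destructor fix can be sketched as follows (the free function and the simplified `std::map` of callbacks are stand-ins for the `Client` members): ownership of the callbacks is taken under the lock, then each one runs after the lock is released, so every stored munmap/release action executes exactly once instead of being dropped.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

void DrainCleanupCallbacks(
    std::mutex& mounted_segments_mutex,
    std::map<std::string, std::function<void()>>& cleanup_callbacks) {
    std::map<std::string, std::function<void()>> pending;
    {
        std::lock_guard<std::mutex> lock(mounted_segments_mutex);
        pending.swap(cleanup_callbacks);  // take ownership; map left empty
    }
    // Invoke outside the lock: callbacks may do heavy work (munmap, etc.)
    for (auto& [segment_id, cb] : pending) {
        if (cb) cb();  // e.g. ReleaseMountedSegmentRecord -> munmap
    }
}
```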
```cpp
for (auto& rec : expired) {
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto generation_it = client_generations_.find(rec.client_id);
        if (generation_it != client_generations_.end() &&
            generation_it->second != rec.client_generation) {
            continue;
        }
    }
    if (service_) {
        service_->UnmountSegment(rec.segment_id, rec.client_id);
    }
}
```
[race-condition] TOCTOU window between generation check and UnmountSegment call
TOCTOU (Time-of-Check to Time-of-Use): A race condition where a state is checked at one time but used at a later time, allowing the state to change in between.
There is a TOCTOU window between the generation check (under lock) and the UnmountSegment call (after unlock). If RemoveClientRecords increments the generation in that window, the stale record will be executed instead of being skipped. If UnmountSegment is not idempotent in all edge cases, this could cause incorrect behavior.
Fixed in f1f0ccd. I removed the generation-based lazy cancellation and restored direct removal of queued records under the scheduler mutex. This avoids the check/use window between generation validation and UnmountSegment execution.
```cpp
    }
}
if (service_) {
    service_->UnmountSegment(rec.segment_id, rec.client_id);
```
Nits: Records are popped from the queue before UnmountSegment is called. If the RPC fails (network timeout, segment already cleaned up by another path), the record is silently dropped with no retry and no log message. Add at least a LOG(WARNING) on failure so operators can diagnose stuck segments.
Fixed in f1f0ccd. The scheduler now logs a warning with segment_id, client_id, and error code when the final UnmountSegment call fails after a graceful timer expires.
Thanks for this PR. The more critical issues have been raised in my earlier comments. From my perspective, there are still areas where performance could be optimized (e.g., exponential backoff for retries). However, I think it would be appropriate to merge this PR first and follow up with a dedicated performance optimization PR later.
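For the follow-up, the exponential backoff suggested here could be as small as a delay helper like the one below (base and cap values are illustrative, not from the PR): double the delay on each failed attempt, clamped to a maximum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
uint64_t BackoffDelayMs(int attempt, uint64_t base_ms = 100,
                        uint64_t max_ms = 10000) {
    uint64_t delay = base_ms;
    for (int i = 0; i < attempt && delay < max_ms; ++i) {
        delay *= 2;  // exponential growth until the cap is reached
    }
    return std::min(delay, max_ms);
}
```

The retry loop would then sleep `BackoffDelayMs(attempt)` between `UnmountSegment` attempts instead of retrying at a fixed interval.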
```cpp
    return ErrorCode::OK;
}

ErrorCode ScopedSegmentAccess::PrepareGracefulUnmountSegment(
```
PrepareGracefulUnmountSegment destroys the allocator at t=0, which invalidates the segment's object replicas immediately.
PrepareGracefulUnmountSegment (run at t=0, when the graceful-unmount request arrives) does:
- segment_manager_->allocator_manager_.removeAllocator(segment.name, allocator);
- mounted_segment.buf_allocator.reset();
Those are the only two shared_ptrs to this BufferAllocator (one in allocator_manager_, one in mounted_segment.buf_allocator). Replicas only hold a weak_ptr to it (allocator.h:79; replica.h:292-295 — has_invalid_mem_handle() → !allocator_.expired()).
So as soon as this runs, the allocator's refcount hits zero and it's destroyed, and every replica on this segment immediately becomes has_invalid_mem_handle() == true — at t=0, not when the grace period expires.
Consequences:
- ClientMonitorFunc calls ClearInvalidHandles() on every iteration, which EraseReplicas for these invalid handles; if an object's only replica was here, the whole metadata entry is erased too.
- Even before that sweep runs, a GetReplicaList for that key calls CleanupStaleHandles itself, wipes the replica on the spot, and returns object-not-found.
In other words: a Get that reaches the master during the grace period will, for any object whose data lives only on this segment, fail almost immediately (OBJECT_NOT_FOUND) — it does not survive until the end of the grace period. What the grace period actually preserves is only:
(1) the segment metadata (QuerySegmentStatus still finds it)
(2) the client-side TransferEngine registration of that memory (kept until grace_period + ~40s, when unregisterLocalMemory + munmap run).
That doesn't match the PR description ("This lets existing readers continue using the segment during the grace window"): it only helps a reader that has already obtained a buffer descriptor from the master and is about to issue / is issuing a one-sided RDMA read (descriptor in hand, memory still registered → the read works). It does not protect a reader that issues a new Get during the grace period.
Suggestion: during the graceful phase, don't destroy the allocator — just remove it from the "allocatable" set (keep the mounted_segment.buf_allocator shared_ptr, only call allocator_manager_.removeAllocator(...)), so replicas on the segment stay isAllocatorValid() for the duration. Then buf_allocator.reset() only at CommitUnmountSegment when the grace period expires. That way GetReplicaList keeps returning this segment's descriptors during the grace window, which is what "object stays readable during the grace period" should mean.
There's also a test gap: GracefulUnmountSegment_PreventAllocation only does a fresh PutStart/PutEnd after the graceful unmount — it never does a GetReplicaList for the key that already lived on segment1, so "is an existing object still readable during the grace period" isn't covered at all. Worth adding a case: mount → put object to seg → graceful unmount(seg) → GetReplicaList(key) within the grace window should still succeed and point at that seg.
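The suggested fix boils down to separating "stop allocating" from "release the memory". A self-contained sketch (with `BufferAllocator`, `MountedSegment`, and `SegmentState` as simplified stand-ins for the real types) showing how replica `weak_ptr`s stay valid through the grace window:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct BufferAllocator {};  // stand-in for the real allocator

struct MountedSegment {
    std::shared_ptr<BufferAllocator> buf_allocator;  // owning reference
};

struct SegmentState {
    // The "allocatable" set the allocator_manager_ consults for PutStart.
    std::map<std::string, std::shared_ptr<BufferAllocator>> allocatable;
    MountedSegment mounted;
};

// t=0: stop new allocations, but keep replicas readable. Only the
// allocatable-set reference is dropped; the owning shared_ptr in the
// mounted-segment record stays alive, so replica weak_ptrs do not expire.
void PrepareGracefulUnmount(SegmentState& s, const std::string& name) {
    s.allocatable.erase(name);
    // NOTE: deliberately do NOT reset s.mounted.buf_allocator here.
}

// Grace period expired: now actually release the allocator. Replica
// weak_ptrs expire at this point, not at t=0.
void CommitUnmount(SegmentState& s) {
    s.mounted.buf_allocator.reset();
}
```

The assertions below mirror the test case proposed above: a replica's `weak_ptr` survives `PrepareGracefulUnmount` (object still readable in the grace window) and expires only at `CommitUnmount`.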
Fixed in f1f0ccd. PrepareGracefulUnmountSegment now only removes the allocator from the allocatable set and keeps mounted_segment.buf_allocator alive during the grace window. The allocator is released during the final unmount path. I also extended GracefulUnmountSegment_PreventAllocation to verify that an existing object on the graceful segment remains readable via GetReplicaList during the grace period, while new allocations avoid that segment.
```cpp
// Notify graceful unmount scheduler to drop pending records
// for expired clients. The actual unmount is handled below.
for (auto& cid : expired_clients) {
    graceful_unmount_scheduler_.RemoveClientRecords(cid);
}
```
I checked the current branch and this specific state leak does not reproduce with the current PrepareUnmountSegment() implementation. It only rejects UNMOUNTING and does not reject GRACEFULLY_UNMOUNTING, so the client-expiry path can still prepare and commit cleanup for a segment already in graceful unmount. The snippet in the comment looks like it reflects an older version of the status check. I agree this state should remain covered by tests, and I can add/extend a test if needed, but the current code path should not permanently leak the segment for this reason.
LujhCoconut left a comment
Thanks for this PR. This proposal makes sense to me, and the correctness of the current implementation LGTM.
Description
This PR adds graceful segment unmount support to Mooncake Store.
The change allows callers to request a grace period when unmounting mounted segments, so the master first marks the segment as GRACEFULLY_UNMOUNTING, removes it from allocation, and only completes the unmount after the grace period expires. This lets existing readers continue using the segment during the grace window while preventing new allocations from targeting it.

The feature is exposed through both C++/pybind APIs and the Python HTTP service:

- `unmount_segment(segment_ids, grace_period_seconds=0)` for mounted shm/file-backed segments.
- `unmount_and_free_segment(segment_ids, grace_period_seconds=0)` for allocate-backed segments.
- `/api/unmount_shm` with `grace_period_seconds`.
- `/api/unmount` with `grace_period_seconds`.

The implementation also hardens cleanup behavior by polling graceful unmount completion by `segment_id` rather than reusable segment names, so local memory registrations and backing resources are released correctly even if a client remounts another segment with the same name during the cleanup window.

Module

- mooncake-transfer-engine
- mooncake-store
- mooncake-ep
- mooncake-integration
- mooncake-p2p-store
- mooncake-wheel
- mooncake-pg
- mooncake-rl
How Has This Been Tested?
The following tests/checks were run locally and passed:
Checklist

- Ran `./scripts/code_format.sh` before submitting.