[Store] Introduce buffer pool for zero copy interfaces#2095
zxpdemonio wants to merge 10 commits into
Add a native Python registered buffer pool for reusable zero-copy scratch buffers, with lease lifetime checks and docs/tests for the public API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove nonessential helper methods from the native Python surface so the buffer pool PR exposes only the production lease lifecycle API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a pytest entrypoint so the focused native buffer pool test can be run directly as a script as well as through pytest discovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces a RegisteredBufferPool to manage a bounded set of registered scratch buffers for zero-copy operations, reducing the overhead of repeated memory registration. The implementation includes C++ classes for the pool and its leases, exposed to Python via pybind11, along with comprehensive documentation and unit tests. Feedback suggests releasing the Python GIL during the prewarm method to improve concurrency and ensuring that buffer allocations strictly respect the specified alignment parameter by using appropriate allocation functions like posix_memalign instead of default character arrays.
With this new API, compared to the existing non-zero-copy interface, it seems that the number of memory copies is still the same, since the user still needs to copy the data out from the buffer.
Apply project clang-format to the native registered buffer pool binding code so CI format checks pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the minimal-copy RegisteredBufferPool usage by consuming the returned memoryview directly and copying only when data must outlive the lease. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ykwd Good point. If the caller immediately converts the memoryview to bytes, then this usage does introduce the same final user-space copy as the non-zero-copy get path. The intended benefit of RegisteredBufferPool is different: it provides reusable registered destination buffers for get_into/get_into_ranges so repeated zero-copy reads do not need repeated allocate/register/unregister cycles. Callers that can consume the returned memoryview directly, or wrap it with np.frombuffer / another buffer-protocol consumer, avoid the extra copy. A copy is only needed when the data must outlive the lease. I updated the documentation example to show direct memoryview consumption first and mention bytes(view) only as the outliving-lease case.
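The difference between consuming the view and copying it can be illustrated with plain Python. This is a minimal sketch that assumes a `bytearray` as a stand-in for a leased buffer; it does not touch Mooncake, and the names `scratch`, `view`, and `snapshot` are hypothetical.

```python
# Stand-in for a leased registered buffer. A real lease would expose
# registered memory through the buffer protocol instead of a bytearray.
scratch = bytearray(16)
scratch[:5] = b"hello"  # pretend a zero-copy read wrote 5 bytes
n = 5

view = memoryview(scratch)[:n]  # slicing a memoryview does not copy
snapshot = bytes(view)          # explicit copy, safe to keep after the lease

# A later read reuses the same scratch memory (same length, so no resize).
scratch[:5] = b"WORLD"

view_now = view.tobytes()  # the view tracks the underlying memory
```

Here `view` follows whatever is written into the scratch memory next, which is why data that must outlive the lease needs the `bytes(view)` copy.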
Release the GIL while prewarming registered buffers and allocate pool regions with the configured alignment so RDMA scratch buffers respect caller alignment requirements. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose RegisteredBufferPool via mooncake.buffer_pool.BufferPool, update the native buffer pool test to use the new import path, and move the Python API docs to prefer the new helper entrypoint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the registered buffer pool binding implementation into a dedicated buffer_pool.cpp translation unit so the store module keeps only the module wiring in store_py.cpp. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove buffer-pool-only headers left behind in store_py.cpp now that the implementation lives in buffer_pool.cpp and the file only keeps module wiring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the explanation. I think I understand it now. This feels like a valuable piece of work, and we’ll start the review process.
Motivation
Mooncake's zero-copy read APIs require destination buffers to be registered before get_into()/get_into_ranges()/batch_get_into() can read into them. Higher-level users repeatedly need temporary registered receive buffers for rollout/data transfer paths.
Registering and unregistering temporary buffers for every read adds unnecessary overhead and makes it harder for Python callers to safely manage registered-memory lifetimes. This PR adds a native reusable registered buffer pool to mooncake.store so callers can reuse bounded registered scratch memory with a small production API.
Scenario
A typical user reads remote objects into pre-registered memory:
The pool owns registered regions, reuses released buffers by size class, and unregisters them when the pool is closed. This is intended for repeated zero-copy reads in rollout transfer workloads where registered memory should be reused rather than recreated for every object.
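The ownership model described above can be sketched in pure Python. This is an illustrative toy, not the native implementation: `ToyBufferPool` and its counters are hypothetical, a `bytearray` allocation stands in for allocate-and-register, and oversized requests are simply rejected here.

```python
from contextlib import contextmanager


class ToyBufferPool:
    """Illustrative model only; not the RegisteredBufferPool implementation."""

    def __init__(self, size_classes):
        self.size_classes = sorted(size_classes)
        self.free = {c: [] for c in self.size_classes}  # idle buffers per class
        self.registered = 0    # stand-in for register calls made
        self.unregistered = 0  # stand-in for unregister calls made
        self.closed = False

    def _class_for(self, size):
        # Smallest size class that fits the request.
        for c in self.size_classes:
            if size <= c:
                return c
        raise ValueError("request exceeds the largest class in this toy model")

    @contextmanager
    def buffer(self, size):
        if self.closed:
            raise RuntimeError("pool is closed")
        cls = self._class_for(size)
        if self.free[cls]:
            buf = self.free[cls].pop()  # reuse: no new registration
        else:
            buf = bytearray(cls)        # stand-in for allocate + register
            self.registered += 1
        try:
            yield memoryview(buf)[:size]
        finally:
            self.free[cls].append(buf)  # return the region to its class

    def close(self):
        # Stand-in for unregistering every region the pool still owns.
        for bufs in self.free.values():
            self.unregistered += len(bufs)
            bufs.clear()
        self.closed = True


pool = ToyBufferPool([64 * 1024, 1024 * 1024])
for _ in range(3):
    with pool.buffer(100 * 1024) as view:
        view[:2] = b"ok"
registrations = pool.registered  # three leases, one registration
pool.close()
```

Three sequential leases of the same size trigger a single stand-in registration, which is the effect the pool is meant to have on repeated reads.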
Description
This PR adds a native RegisteredBufferPool binding in mooncake.store.
Public API:
The implementation is in C++/pybind inside mooncake-integration/store/store_py.cpp and reuses the existing Python store wrapper registration primitives:
The pool supports:
Nonessential helper APIs were intentionally not exposed to keep the PR surface small. The public API is limited to the lease lifecycle needed by zero-copy reads.
Basic Usage
from mooncake.store import RegisteredBufferPool

# `store` is an already-initialized Mooncake store client.
pool = RegisteredBufferPool(
    store,
    max_bytes=256 * 1024 * 1024,
    min_size_class=64 * 1024,
    max_size_class=16 * 1024 * 1024,
    alignment=8 * 1024 * 1024,
)

with pool.buffer(1024 * 1024) as lease:
    # Read into the leased registered buffer.
    n = store.get_into("my_key", lease.ptr, lease.size)
    # Copy out only because `data` must outlive the lease.
    data = bytes(lease.buffer[:n])

pool.close()
Robustness and Safety
The native implementation handles registered-memory lifetime explicitly:
Performance
The pool avoids repeated register/unregister calls on hot read paths by reusing registered buffers. Released reusable regions are stored by size class, while oversized requests are supported without being retained in the reusable pool.
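How a request might map to a size class can be sketched as follows. The rounding rule is an assumption: this PR text does not say how classes are spaced, so the sketch uses power-of-two multiples of `min_size_class`. The function `size_class_for` is hypothetical and not part of the public API.

```python
def size_class_for(size, min_size_class, max_size_class):
    """Return the reusable size class for a request, or None if oversized.

    Assumes power-of-two classes starting at min_size_class, which the
    PR text does not confirm.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if size > max_size_class:
        return None  # oversized: served, but not retained for reuse
    cls = min_size_class
    while cls < size:
        cls *= 2
    return min(cls, max_size_class)


KIB = 1024
MIB = 1024 * KIB

small = size_class_for(1, 64 * KIB, 16 * MIB)               # rounds up to the minimum class
exact = size_class_for(1 * MIB, 64 * KIB, 16 * MIB)         # already on a class boundary
odd = size_class_for(1 * MIB + 1, 64 * KIB, 16 * MIB)       # rounds up to the next class
oversized = size_class_for(32 * MIB, 64 * KIB, 16 * MIB)    # above max_size_class
```

Under this assumption a request just over 1 MiB occupies a 2 MiB region, so the choice of `min_size_class` and `max_size_class` trades internal fragmentation against reuse rate.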
The implementation keeps the hot path small:
Module
mooncake.store
Type of Change
Checklist