
[Store] Introduce buffer pool for zero copy interfaces #2095

Open
zxpdemonio wants to merge 10 commits into kvcache-ai:main from openanolis:cruz/buffer-pool

@zxpdemonio
Collaborator

@zxpdemonio zxpdemonio commented May 14, 2026

Motivation

Mooncake's zero-copy read APIs require destination buffers to be registered before get_into() / get_into_ranges() / batch_get_into() can read into them. Higher-level users repeatedly need temporary registered receive buffers for rollout/data transfer paths.

Registering and unregistering temporary buffers for every read adds unnecessary overhead and makes it harder for Python callers to safely manage registered-memory lifetimes. This PR adds a native reusable registered buffer pool to mooncake.store so callers can reuse bounded registered scratch memory with a small production API.

Scenario

A typical user reads remote objects into pre-registered memory:

with pool.buffer(1024 * 1024) as lease:
    n = store.get_into("my_key", lease.ptr, lease.size)
    view = lease.buffer[:n]
    # Consume view directly, or wrap it with np.frombuffer(view, dtype=...).
    # Copy only if the data must outlive the lease: data = bytes(view)

The pool owns registered regions, reuses released buffers by size class, and unregisters them when the pool is closed. This is intended for repeated zero-copy reads in rollout transfer workloads where registered memory should be reused rather than recreated for every object.

Description

This PR adds a native RegisteredBufferPool binding in mooncake.store.

Public API:

  • RegisteredBufferPool.acquire(size, block=None, timeout=None)
  • RegisteredBufferPool.buffer(size, block=None, timeout=None)
  • RegisteredBufferPool.prewarm(size, count)
  • RegisteredBufferPool.close()
  • RegisteredBufferLease.ptr
  • RegisteredBufferLease.size
  • RegisteredBufferLease.buffer
  • RegisteredBufferLease.release()
  • RegisteredBufferLease context manager support

The implementation is in C++/pybind inside mooncake-integration/store/store_py.cpp and reuses the existing Python store wrapper registration primitives:

  • register_buffer
  • unregister_buffer
  • BufferHandle

The pool supports:

  • bounded total registered memory via max_bytes
  • configurable min_size_class, max_size_class, and alignment
  • optional max_regions
  • optional prewarm
  • blocking acquire
  • nonblocking acquire through acquire(size, block=False)
  • timeout acquire through acquire(size, timeout=...)
  • arbitrary requested sizes, including oversize buffers that are allocated but not reused
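The three acquire modes listed above can be illustrated with a pure-Python toy model. This is only a sketch of the semantics, not the native implementation: `ToyPool`, its exception types (`BlockingIOError`, `TimeoutError`), and the `block=True` default are assumptions made for illustration, and the toy hands out plain `bytearray` objects instead of registered memory.

```python
import threading
import time


class ToyPool:
    """Toy stand-in for a bounded buffer pool (hypothetical, not Mooncake code)."""

    def __init__(self, max_bytes):
        self._max_bytes = max_bytes
        self._used = 0
        self._cond = threading.Condition()

    def acquire(self, size, block=True, timeout=None):
        """Reserve `size` bytes and return a bytearray, or raise on failure."""
        if size > self._max_bytes:
            raise ValueError("request exceeds pool capacity")
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._cond:
            # Re-check capacity after every wakeup, so a late or spurious
            # notification is never reported as a timeout by mistake.
            while self._used + size > self._max_bytes:
                if not block:
                    raise BlockingIOError("pool exhausted")
                remaining = None
                if deadline is not None:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        raise TimeoutError("timed out waiting for a buffer")
                self._cond.wait(remaining)
            self._used += size
        return bytearray(size)

    def release(self, buf):
        """Return the buffer's capacity to the pool and wake waiting callers."""
        with self._cond:
            self._used -= len(buf)
            self._cond.notify_all()
```

In this toy, `acquire(size)` waits until capacity is free, `acquire(size, block=False)` fails immediately when the pool is full, and `acquire(size, timeout=...)` gives up once the deadline passes.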

Nonessential helper APIs were intentionally not exposed to keep the PR surface small. The public API is limited to the lease lifecycle needed by zero-copy reads.

Basic Usage

from mooncake.store import RegisteredBufferPool

pool = RegisteredBufferPool(
    store,
    max_bytes=256 * 1024 * 1024,
    min_size_class=64 * 1024,
    max_size_class=16 * 1024 * 1024,
    alignment=8 * 1024 * 1024,
)

with pool.buffer(1024 * 1024) as lease:
    n = store.get_into("my_key", lease.ptr, lease.size)
    data = bytes(lease.buffer[:n])

pool.close()

Robustness and Safety

The native implementation handles registered-memory lifetime explicitly:

  • leases keep the pool alive while checked out
  • released buffers are returned to the pool or unregistered when oversized/closing
  • close() rejects active leases
  • lease destructors return buffers if users forget to release
  • exported Python memoryviews keep the lease alive
  • release() rejects while exported views still exist
  • blocking acquire releases the Python GIL while waiting
  • register/unregister calls are performed without holding the Python GIL
  • allocation/register failure paths roll back accounting
  • size alignment checks guard overflow
  • condition-variable wait paths retry acquisition to avoid false timeout races
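The rule that release() rejects while exported views still exist has a close analogue in plain Python: a `bytearray` refuses to be resized while any `memoryview` of it is alive. The sketch below uses that behavior to model the guard. `ToyLease` and its `on_release` callback are hypothetical names for illustration, and the native binding may track exports differently.

```python
class ToyLease:
    """Toy lease over a bytearray (hypothetical, not the Mooncake binding)."""

    def __init__(self, size, on_release):
        self._storage = bytearray(size)
        self._on_release = on_release
        self._released = False

    @property
    def buffer(self):
        # Every memoryview handed out counts as an export of the storage.
        return memoryview(self._storage)

    def release(self):
        if self._released:
            return
        size = len(self._storage)
        # Clearing resizes the bytearray, which raises BufferError while
        # any exported memoryview is still alive. That makes it usable
        # as an "are there active views?" check in this toy.
        self._storage.clear()
        self._released = True
        self._on_release(size)


returned = []
lease = ToyLease(16, returned.append)
view = lease.buffer[:4]

try:
    lease.release()          # rejected: `view` still references the storage
except BufferError:
    print("release rejected while a view is exported")

view.release()               # drop the export explicitly
lease.release()              # now succeeds and reports the size back
```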

Performance

The pool avoids repeated register/unregister calls on hot read paths by reusing registered buffers. Released reusable regions are stored by size class, while oversized requests are supported without being retained in the reusable pool.
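As a rough illustration of size-class selection, the sketch below rounds a request up to an alignment and maps it to a reusable class, returning `None` for oversize requests that would be allocated but not retained. The power-of-two spacing between `min_size_class` and `max_size_class` is an assumption, since the description does not say how classes are spaced. `align_up` and `size_class_for` are hypothetical helper names.

```python
def align_up(size, alignment):
    """Round `size` up to the next multiple of `alignment`."""
    if size <= 0 or alignment <= 0:
        raise ValueError("size and alignment must be positive")
    return (size + alignment - 1) // alignment * alignment


def size_class_for(size, min_size_class, max_size_class):
    """Return the reusable class for `size`, or None if it is oversize.

    Assumption: classes double from min_size_class up to max_size_class.
    """
    if size > max_size_class:
        return None  # oversize: served, but not kept for reuse
    size_class = min_size_class
    while size_class < size:
        size_class *= 2
    return min(size_class, max_size_class)


KIB = 1024
MIB = 1024 * KIB
print(size_class_for(100 * KIB, 64 * KIB, 16 * MIB) // KIB)  # 128
print(size_class_for(17 * MIB, 64 * KIB, 16 * MIB))          # None
```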

The implementation keeps the hot path small:

  • reuse path takes a free region and returns a lease
  • allocation path reserves capacity before registering
  • transfer grouping or higher-level transfer policy is intentionally not part of this pool
  • Python-visible API is minimal to avoid extra review and maintenance surface
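The "reserve capacity before registering" step, together with the rollback on failure mentioned under Robustness and Safety, can be sketched as follows. This is a toy model under stated assumptions: `ToyAccounting` and `register_fn` are hypothetical names, and `register_fn` merely stands in for the real registration call.

```python
import threading


class ToyAccounting:
    """Toy capacity accounting (hypothetical, not the native implementation)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.reserved = 0
        self._lock = threading.Lock()

    def allocate(self, size, register_fn):
        # Step 1: reserve capacity under the lock, so concurrent callers
        # cannot collectively exceed max_bytes.
        with self._lock:
            if self.reserved + size > self.max_bytes:
                raise MemoryError("pool capacity exceeded")
            self.reserved += size
        # Step 2: allocate and register outside the lock, since
        # registration may be slow.
        try:
            buf = bytearray(size)
            register_fn(buf)
        except Exception:
            # Step 3: roll back the reservation if anything failed.
            with self._lock:
                self.reserved -= size
            raise
        return buf
```

Reserving first means a failed registration never leaves the byte count inflated, and the lock is not held during the slow call.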

Module

mooncake.store

Type of Change

  • New feature
  • Python API extension backed by native C++ implementation
  • Documentation update
  • Test coverage update

Checklist

  • Native implementation added in mooncake.store
  • Reuses existing Mooncake buffer registration primitives
  • Public API minimized to the required lease lifecycle
  • Documentation added
  • Focused test coverage added
  • Test file can run via pytest discovery and direct script execution
  • Build and tests pass

zxpdemonio and others added 3 commits May 13, 2026 23:50
Add a native Python registered buffer pool for reusable zero-copy scratch buffers, with lease lifetime checks and docs/tests for the public API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove nonessential helper methods from the native Python surface so the buffer pool PR exposes only the production lease lifecycle API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a pytest entrypoint so the focused native buffer pool test can be run directly as a script as well as through pytest discovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a RegisteredBufferPool to manage a bounded set of registered scratch buffers for zero-copy operations, reducing the overhead of repeated memory registration. The implementation includes C++ classes for the pool and its leases, exposed to Python via pybind11, along with comprehensive documentation and unit tests. Feedback suggests releasing the Python GIL during the prewarm method to improve concurrency and ensuring that buffer allocations strictly respect the specified alignment parameter by using appropriate allocation functions like posix_memalign instead of default character arrays.

Comment thread mooncake-integration/store/store_py.cpp Outdated
Comment thread mooncake-integration/store/store_py.cpp Outdated
@ykwd
Collaborator

ykwd commented May 14, 2026

With this new API, compared to the existing non-zero-copy interface, it seems that the number of memory copies is still the same, since the user still needs to copy the data out from the buffer.

zxpdemonio and others added 2 commits May 14, 2026 15:10
Apply project clang-format to the native registered buffer pool binding code so CI format checks pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the minimal-copy RegisteredBufferPool usage by consuming the returned memoryview directly and copying only when data must outlive the lease.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@zxpdemonio
Collaborator Author

> With this new API, compared to the existing non-zero-copy interface, it seems that the number of memory copies is still the same, since the user still needs to copy the data out from the buffer.

@ykwd Good point. If the caller immediately converts the memoryview to bytes, then this usage does introduce the same final user-space copy as the non-zero-copy get path.

The intended benefit of RegisteredBufferPool is different: it provides reusable registered destination buffers for get_into/get_into_ranges so repeated zero-copy reads do not need repeated allocate/register/unregister cycles. Callers that can consume the returned memoryview directly, or wrap it with np.frombuffer / another buffer-protocol consumer, avoid the extra copy. A copy is only needed when the data must outlive the lease.

I updated the documentation example to show direct memoryview consumption first and mention bytes(view) only as the outliving-lease case.
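The difference between consuming the view directly and copying with bytes(view) can be shown with the standard library alone. In this sketch a plain `bytearray` stands in for the lease's registered memory, and the payload of four native unsigned ints is made up for illustration.

```python
import struct

size = struct.calcsize("4I")
scratch = bytearray(size)                        # stand-in for lease memory
struct.pack_into("4I", scratch, 0, 1, 2, 3, 4)   # pretend a read filled it

view = memoryview(scratch)[:size]   # no copy: shares scratch's memory
ints = view.cast("I")               # still no copy: reinterpret as uints
snapshot = bytes(view)              # the one explicit copy

# Simulate the buffer being reused for the next read.
struct.pack_into("I", scratch, 0, 99)

first_from_view = ints[0]                               # sees the new data
first_from_copy = struct.unpack_from("I", snapshot)[0]  # keeps the old data
```

The view tracks whatever is currently in the buffer, which is why a copy is only needed when the data must outlive the lease.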

Release the GIL while prewarming registered buffers and allocate pool regions with the configured alignment so RDMA scratch buffers respect caller alignment requirements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented May 14, 2026

Codecov Report

❌ Patch coverage is 0% with 354 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| mooncake-integration/store/buffer_pool.cpp | 0.00% | 354 Missing ⚠️ |

@zxpdemonio zxpdemonio changed the title from "Cruz/buffer pool" to "[Store] Introduce buffer pool for zero copy interfaces" May 14, 2026
zxpdemonio and others added 4 commits May 14, 2026 21:31
Expose RegisteredBufferPool via mooncake.buffer_pool.BufferPool, update the native buffer pool test to use the new import path, and move the Python API docs to prefer the new helper entrypoint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the registered buffer pool binding implementation into a dedicated
buffer_pool.cpp translation unit so the store module keeps only the
module wiring in store_py.cpp.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove buffer-pool-only headers left behind in store_py.cpp now that the
implementation lives in buffer_pool.cpp and the file only keeps module wiring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ykwd
Collaborator

ykwd commented May 15, 2026

Thanks for the explanation. I think I understand it now. This feels like a valuable piece of work, and we’ll start the review process.

3 participants