[Store] Introduce buffer pool for zero copy interfaces by zxpdemonio · Pull Request #2095 · kvcache-ai/Mooncake

zxpdemonio · 2026-05-14T06:03:08Z

Motivation

Mooncake's zero-copy read APIs require destination buffers to be registered before get_into() / get_into_ranges() / batch_get_into() can read into them. Higher-level users repeatedly need temporary registered receive buffers for rollout/data transfer paths.

Registering and unregistering temporary buffers for every read adds unnecessary overhead and makes it harder for Python callers to safely manage registered-memory lifetimes. This PR adds a native reusable registered buffer pool to mooncake.store so callers can reuse bounded registered scratch memory with a small production API.

Scenario

A typical user reads remote objects into pre-registered memory:

with pool.buffer(1024 * 1024) as lease:
    n = store.get_into("my_key", lease.ptr, lease.size)
    view = lease.buffer[:n]
    # Consume view directly, or wrap it with np.frombuffer(view, dtype=...).
    # Copy only if the data must outlive the lease: data = bytes(view)

The pool owns registered regions, reuses released buffers by size class, and unregisters them when the pool is closed. This is intended for repeated zero-copy reads in rollout transfer workloads where registered memory should be reused rather than recreated for every object.
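To illustrate reuse by size class, here is a minimal sketch in plain Python. It is not the PR's C++ implementation: `ToyPool` and `size_class_for` are hypothetical names, power-of-two rounding is an assumption about how classes are chosen, and a `bytearray` stands in for a registered region.

```python
from collections import defaultdict


def size_class_for(size, min_class, max_class):
    """Round size up to a size class; None means the request is oversize."""
    if size <= 0:
        raise ValueError("size must be positive")
    if size > max_class:
        return None
    cls = min_class
    while cls < size:
        cls = min(cls * 2, max_class)  # assumed power-of-two growth, capped
    return cls


class ToyPool:
    """Hypothetical stand-in: bytearrays play the role of registered regions."""

    def __init__(self, min_class=64 * 1024, max_class=16 * 1024 * 1024):
        self.min_class = min_class
        self.max_class = max_class
        self.free = defaultdict(list)  # size class -> idle buffers
        self.registrations = 0         # counts simulated register calls

    def acquire(self, size):
        cls = size_class_for(size, self.min_class, self.max_class)
        if cls is not None and self.free[cls]:
            return self.free[cls].pop()  # reuse path: no new registration
        self.registrations += 1          # allocation path
        return bytearray(cls if cls is not None else size)

    def release(self, buf):
        if len(buf) <= self.max_class:
            self.free[len(buf)].append(buf)  # retained for later reuse
        # Oversize buffers are dropped here (the real pool would unregister).


pool = ToyPool()
first = pool.acquire(100 * 1024)   # rounded up to the 128 KiB class
pool.release(first)
second = pool.acquire(120 * 1024)  # same class, so the same buffer comes back
```

In this sketch the second acquire is served from the free list, so only one simulated registration happens for two reads.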

Description

This PR adds a native RegisteredBufferPool binding in mooncake.store.

Public API:

RegisteredBufferPool.acquire(size, block=None, timeout=None)
RegisteredBufferPool.buffer(size, block=None, timeout=None)
RegisteredBufferPool.prewarm(size, count)
RegisteredBufferPool.close()
RegisteredBufferLease.ptr
RegisteredBufferLease.size
RegisteredBufferLease.buffer
RegisteredBufferLease.release()
RegisteredBufferLease context manager support

The implementation is in C++/pybind inside mooncake-integration/store/store_py.cpp and reuses the existing Python store wrapper registration primitives:

register_buffer
unregister_buffer
BufferHandle

The pool supports:

bounded total registered memory via max_bytes
configurable min_size_class, max_size_class, and alignment
optional max_regions
optional prewarm
blocking acquire
nonblocking acquire through acquire(size, block=False)
timeout acquire through acquire(size, timeout=...)
arbitrary requested sizes, including oversize buffers that are allocated but not reused
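The three acquire modes listed above can be sketched with a condition variable over a byte budget. This is an illustration only: `BoundedBudget` is a hypothetical class, and the exception types (`BlockingIOError`, `TimeoutError`) are assumptions, since the PR description does not say what the binding raises.

```python
import threading
import time


class BoundedBudget:
    """Hypothetical byte budget modelling max_bytes; not the real binding."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.in_use = 0
        self.cond = threading.Condition()

    def acquire(self, size, block=True, timeout=None):
        if size > self.max_bytes:
            raise ValueError("request exceeds max_bytes")
        deadline = None if timeout is None else time.monotonic() + timeout
        with self.cond:
            # Re-check capacity after every wakeup so a late or spurious
            # wakeup is never mistaken for a timeout.
            while self.in_use + size > self.max_bytes:
                if not block:
                    raise BlockingIOError("pool exhausted")
                remaining = None
                if deadline is not None:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        raise TimeoutError("timed out waiting for capacity")
                self.cond.wait(remaining)
            self.in_use += size
            return size

    def release(self, size):
        with self.cond:
            self.in_use -= size
            self.cond.notify_all()


budget = BoundedBudget(max_bytes=1024)
budget.acquire(1024)  # budget is now full

try:
    budget.acquire(1, block=False)
    nonblocking_failed = False
except BlockingIOError:
    nonblocking_failed = True

# Another thread frees capacity shortly; the timed acquire then succeeds.
threading.Timer(0.05, budget.release, args=(1024,)).start()
got = budget.acquire(512, timeout=5.0)
```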

Nonessential helper APIs were intentionally not exposed to keep the PR surface small. The public API is limited to the lease lifecycle needed by zero-copy reads.

Basic Usage

from mooncake.store import RegisteredBufferPool

pool = RegisteredBufferPool(
store,
max_bytes=256 * 1024 * 1024,
min_size_class=64 * 1024,
max_size_class=16 * 1024 * 1024,
alignment=8 * 1024 * 1024,
)

with pool.buffer(1024 * 1024) as lease:
n = store.get_into("my_key", lease.ptr, lease.size)
data = bytes(lease.buffer[:n])

pool.close()

Robustness and Safety

The native implementation handles registered-memory lifetime explicitly:

leases keep the pool alive while checked out
released buffers are returned to the pool, or unregistered if they are oversized or the pool is closing
close() fails while any lease is still checked out
lease destructors return buffers if users forget to release
exported Python memoryviews keep the lease alive
release() is rejected while exported views still exist
blocking acquire releases the Python GIL while waiting
register/unregister calls are performed without holding the Python GIL
allocation/register failure paths roll back accounting
size alignment checks guard overflow
condition-variable wait paths retry acquisition to avoid false timeout races
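The rule that release() is refused while exported views exist has a close analogue in plain Python: a `bytearray` cannot be resized while a `memoryview` of it is alive. The sketch below uses that behavior to model the rule. `ToyLease` is a hypothetical class, and how the native lease actually tracks exports is not described in this PR text.

```python
class ToyLease:
    """Hypothetical lease; a bytearray stands in for a registered region."""

    def __init__(self, size):
        self._buf = bytearray(size)
        self.released = False

    @property
    def buffer(self):
        if self.released:
            raise ValueError("lease already released")
        return memoryview(self._buf)  # each access exports a new view

    def release(self):
        if self.released:
            return
        # bytearray raises BufferError on resize while views are exported,
        # which models "release() is rejected while views still exist".
        self._buf.clear()
        self.released = True


lease = ToyLease(16)
view = lease.buffer

try:
    lease.release()
    rejected = False
except BufferError:
    rejected = True

view.release()   # drop the exported view first
lease.release()  # now the release goes through
```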

Performance

The pool avoids repeated register/unregister calls on hot read paths by reusing registered buffers. Released reusable regions are stored by size class, while oversized requests are supported without being retained in the reusable pool.

The implementation keeps the hot path small:

reuse path takes a free region and returns a lease
allocation path reserves capacity before registering
transfer grouping or higher-level transfer policy is intentionally not part of this pool
Python-visible API is minimal to avoid extra review and maintenance surface
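A minimal sketch of "reserve capacity before registering", together with the rollback mentioned under Robustness and Safety. `Accounting` and the `register` callback are hypothetical; the sketch assumes the reservation is taken under a lock and the slow registration runs outside it, which the PR text suggests but does not spell out.

```python
import threading


class Accounting:
    """Hypothetical byte accounting for a bounded pool."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.reserved = 0
        self.lock = threading.Lock()

    def allocate(self, size, register):
        with self.lock:
            if self.reserved + size > self.max_bytes:
                raise MemoryError("pool budget exceeded")
            self.reserved += size  # reserve before the slow step
        try:
            buf = bytearray(size)
            register(buf)          # stand-in for buffer registration
        except Exception:
            with self.lock:
                self.reserved -= size  # roll back on failure
            raise
        return buf


def failing_register(buf):
    raise RuntimeError("registration failed")


acct = Accounting(max_bytes=1024)
try:
    acct.allocate(512, failing_register)
except RuntimeError:
    pass
after_failure = acct.reserved  # back to 0 after rollback

buf = acct.allocate(512, lambda b: None)  # successful registration
```

Reserving first means concurrent callers in this model cannot overshoot the budget while another registration is still in flight.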

Module

mooncake.store

Type of Change

New feature
Python API extension backed by native C++ implementation
Documentation update
Test coverage update

Checklist

Native implementation added in mooncake.store
Reuses existing Mooncake buffer registration primitives
Public API minimized to the required lease lifecycle
Documentation added
Focused test coverage added
Test file can run via pytest discovery and direct script execution
Build and tests pass

gemini-code-assist

Code Review

This pull request introduces a RegisteredBufferPool to manage a bounded set of registered scratch buffers for zero-copy operations, reducing the overhead of repeated memory registration. The implementation includes C++ classes for the pool and its leases, exposed to Python via pybind11, along with comprehensive documentation and unit tests. Feedback suggests releasing the Python GIL during the prewarm method to improve concurrency and ensuring that buffer allocations strictly respect the specified alignment parameter by using appropriate allocation functions like posix_memalign instead of default character arrays.
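For readers unfamiliar with the alignment point, the sketch below shows one way to get an aligned region from Python. It uses the over-allocate-and-offset technique rather than posix_memalign, and `aligned_view` is a hypothetical helper, not part of this PR.

```python
import ctypes


def aligned_view(size, alignment):
    """Return (memoryview, address) where address is a multiple of alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    if size <= 0:
        raise ValueError("size must be positive")
    raw = bytearray(size + alignment)  # over-allocate by one alignment unit
    base = ctypes.addressof(ctypes.c_char.from_buffer(raw))
    offset = (-base) % alignment       # bytes to skip to reach alignment
    view = memoryview(raw)[offset:offset + size]  # view keeps raw alive
    return view, base + offset


view, addr = aligned_view(1000, 4096)
```

A plain character array gives no such guarantee, which is the concern the review raises.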

ykwd · 2026-05-14T06:41:21Z

With this new API, compared to the existing non-zero-copy interface, it seems that the number of memory copies is still the same, since the user still needs to copy the data out from the buffer.

zxpdemonio · 2026-05-14T07:34:42Z

> With this new API, compared to the existing non-zero-copy interface, it seems that the number of memory copies is still the same, since the user still needs to copy the data out from the buffer.

@ykwd Good point. If the caller immediately converts the memoryview to bytes, then this usage does introduce the same final user-space copy as the non-zero-copy get path.

The intended benefit of RegisteredBufferPool is different: it provides reusable registered destination buffers for get_into/get_into_ranges so repeated zero-copy reads do not need repeated allocate/register/unregister cycles. Callers that can consume the returned memoryview directly, or wrap it with np.frombuffer / another buffer-protocol consumer, avoid the extra copy. A copy is only needed when the data must outlive the lease.

I updated the documentation example to show direct memoryview consumption first and mention bytes(view) only as the outliving-lease case.
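The difference between consuming the view and copying it can be shown with standard-library types alone. In this sketch a `bytearray` stands in for the lease's memory and `struct.pack_into` stands in for the bytes that get_into() would write; both are assumptions made for illustration.

```python
import struct

scratch = bytearray(32)                       # stands in for lease memory
struct.pack_into("4I", scratch, 0, 1, 2, 3, 4)  # pretend a read wrote 16 bytes
n = 16

view = memoryview(scratch)[:n]  # no copy: shares memory with scratch
values = view.cast("I")         # typed view over the same bytes, still no copy

copied = bytes(view)            # explicit copy: independent of the buffer

# Simulate the pool reusing the buffer for another read.
scratch[0:4] = struct.pack("I", 99)

first_via_view = values[0]                         # sees the new data
first_via_copy = struct.unpack("4I", copied)[0]    # still the old data
```

The view tracks whatever is in the buffer, so it is only valid while the lease is held; the copy is what to keep when the data must outlive the lease.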

codecov-commenter · 2026-05-14T08:15:25Z

Codecov Report

❌ Patch coverage is 0% with 354 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
mooncake-integration/store/buffer_pool.cpp	0.00%	354 Missing ⚠️

Expose RegisteredBufferPool via mooncake.buffer_pool.BufferPool, update the native buffer pool test to use the new import path, and move the Python API docs to prefer the new helper entrypoint. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ykwd · 2026-05-15T08:56:28Z

Thanks for the explanation. I think I understand it now. This feels like a valuable piece of work, and we’ll start the review process.

zxpdemonio and others added 3 commits May 13, 2026 23:50

[Store] Add native registered buffer pool

539ba5e

Add a native Python registered buffer pool for reusable zero-copy scratch buffers, with lease lifetime checks and docs/tests for the public API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] Minimize registered buffer pool API

9e0061e

Remove nonessential helper methods from the native Python surface so the buffer pool PR exposes only the production lease lifecycle API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] Allow native buffer pool test direct execution

6b45d80

Add a pytest entrypoint so the focused native buffer pool test can be run directly as a script as well as through pytest discovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zxpdemonio requested review from ShangmingCai, stmatengss and ykwd as code owners May 14, 2026 06:03

github-actions Bot added run-ci Installation labels May 14, 2026

gemini-code-assist Bot reviewed May 14, 2026

zxpdemonio and others added 2 commits May 14, 2026 15:10

[Store] Format registered buffer pool bindings

e9d6da6

Apply project clang-format to the native registered buffer pool binding code so CI format checks pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Docs] Clarify zero-copy buffer pool usage

178e4b6

Document the minimal-copy RegisteredBufferPool usage by consuming the returned memoryview directly and copying only when data must outlive the lease. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] Align registered buffer pool allocations

d2e0f83

Release the GIL while prewarming registered buffers and allocate pool regions with the configured alignment so RDMA scratch buffers respect caller alignment requirements. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zxpdemonio changed the title ~~Cruz/buffer pool~~ [Store] Introduce buffer pool for zero copy interfaces May 14, 2026

zxpdemonio and others added 4 commits May 14, 2026 21:31

[Store] Split buffer pool bindings out of store_py.cpp

cd65f82

Move the registered buffer pool binding implementation into a dedicated buffer_pool.cpp translation unit so the store module keeps only the module wiring in store_py.cpp. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] Drop stale buffer pool includes from store_py.cpp

f2b7032

Remove buffer-pool-only headers left behind in store_py.cpp now that the implementation lives in buffer_pool.cpp and the file only keeps module wiring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] Fix buffer pool formatting

090e532

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zxpdemonio wants to merge 10 commits into kvcache-ai:main from openanolis:cruz/buffer-pool