
Commit 3432b28

cuda.core: convert peer_accessible_by to a live MutableSet view (#2018)

* cuda.core: convert peer_accessible_by to a live MutableSet view

DeviceMemoryResource.peer_accessible_by previously returned a sorted
tuple[int, ...] backed by a Python-level cache, which was prone to
divergence from driver state across multiple wrappers around the same
memory pool. The setter accepted Device | int and emitted a single
batched cuMemPoolSetAccess covering the diff against the cache.

This commit replaces the property with a live driver-backed view:

- Adds PeerAccessibleBySetProxy in _memory/_peer_access_utils.py, a
  collections.abc.MutableSet whose reads call cuMemPoolGetAccess and
  whose writes call cuMemPoolSetAccess (sketched below). Iteration
  yields Device objects; add, discard, and __contains__ accept either a
  Device or a device-ordinal int. The proxy is constructed fresh on
  every property access, so there is nothing to cache or pickle.

- Drops the _peer_accessible_by cache field (and its initializations in
  __cinit__, _DMR_init, and from_allocation_handle), eliminating the
  owned/non-owned read split. All pools now share the same code path
  and always query the driver.

- All bulk operations on the proxy (update, |=, &=, -=, ^=, clear, pop)
  issue exactly one cuMemPoolSetAccess call. Peer-access transitions
  can take seconds per pool because every existing memory mapping is
  updated, so coalescing into a single driver call lets the toolkit
  handle the mappings in parallel. The property setter
  (mr.peer_accessible_by = [...]) preserves its original single-call
  behavior via the same shared planner path.

- Single-element add validates can_access_peer through
  plan_peer_access_update, matching the existing setter contract.

This is a breaking change captured in the v1.0.0 release notes. Callers
comparing against tuples must update to set comparisons
(mr.peer_accessible_by == {Device(0)}). Existing tests are migrated;
new tests for set-interface conformance are intentionally deferred to a
follow-up.

Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: move peer-access internals into a single _peer_access_utils.pyx

The previous commit left DeviceMemoryResource carrying three
pass-through def methods (_query_peer_access_ids,
_peer_access_includes, _apply_peer_access_diff) whose only purpose was
to give the pure-Python proxy in _peer_access_utils.py a way to call
cdef helpers in _device_memory_resource.pyx. These methods served no
public role and cluttered the class API.

Promote _peer_access_utils.py to a Cython module so the proxy and the
driver-touching helpers can live together:

- Convert _peer_access_utils.py to _peer_access_utils.pyx; the module
  now cimports cydriver and DeviceMemoryResource from the .pxd, and
  uses nogil and direct CUmemAccessDesc packing identically to before.

- Move _DMR_query_peer_access_ids, _DMR_peer_access_includes,
  _DMR_apply_peer_access_diff, and _DMR_replace_peer_accessible_by from
  _device_memory_resource.pyx into the new module as cdef helpers (plus
  a cpdef replace_peer_accessible_by entry point used by the property
  setter).

- Drop the three pass-through def methods from DeviceMemoryResource.
  The class is left with the property getter and setter only;
  everything else is module-level in _peer_access_utils.

- The proxy now calls the module-level cdef helpers directly instead of
  routing through methods on mr.

No behavior change. The public surface (PeerAccessibleBySetProxy,
plan_peer_access_update, normalize_peer_access_targets, PeerAccessPlan)
is preserved at the same import paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
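
The live-view pattern described above can be sketched in plain Python.
This is a minimal illustration, not the cuda.core implementation:
_FakePool stands in for the driver-side pool state behind
cuMemPoolGetAccess/cuMemPoolSetAccess, and LiveSetView plays the role
of PeerAccessibleBySetProxy. Reads always re-query the backing store,
and a bulk write collapses the whole diff into one batched call:

    from collections.abc import MutableSet


    class _FakePool:
        """Stand-in for the driver-side pool state (illustrative only)."""

        def __init__(self):
            self.peer_ids = set()        # device ordinals with READWRITE access
            self.set_access_calls = 0    # batched "driver" calls issued

        def query(self):
            return sorted(self.peer_ids)

        def apply(self, to_add, to_remove):
            # One call covers the whole diff, like a single cuMemPoolSetAccess.
            self.set_access_calls += 1
            self.peer_ids |= set(to_add)
            self.peer_ids -= set(to_remove)


    class LiveSetView(MutableSet):
        """Live view: every read re-queries the pool; writes are batched."""

        def __init__(self, pool):
            self._pool = pool

        def __iter__(self):
            return iter(self._pool.query())   # fresh query, never cached

        def __len__(self):
            return len(self._pool.query())

        def __contains__(self, dev_id):
            return dev_id in self._pool.query()

        def add(self, dev_id):
            if dev_id not in self:
                self._pool.apply([dev_id], [])

        def discard(self, dev_id):
            if dev_id in self:
                self._pool.apply([], [dev_id])

        def update(self, other):
            # Bulk op: coalesce the whole diff into a single apply() call.
            missing = set(other) - set(self)
            if missing:
                self._pool.apply(sorted(missing), [])


    pool = _FakePool()
    view = LiveSetView(pool)
    view.update({1, 2, 3})
    assert set(view) == {1, 2, 3} and pool.set_access_calls == 1  # one batch

Because nothing is cached on the view, two views over the same pool can
never disagree, which is exactly the divergence problem the tuple cache
had.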
* cuda.core: tighten peer-access query loop and 1.0.0 release note

Refactor _query_peer_access_ids so the entire driver loop runs inside a
single nogil block instead of acquiring and releasing the GIL once per
device. The flag query now uses a cached as_cu(mr._h_pool) handle and
fills a libcpp.vector[int]; because range(total) ascends, the result is
already sorted and the trailing sorted() call is dropped.

Also tighten the peer_accessible_by entry in 1.0.0-notes.rst: the
breaking-change blurb only needs to state the type/element change, so
remove the implementation-flavored details about input acceptance and
batched cuMemPoolSetAccess calls.

Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: cover PeerAccessibleBySetProxy interface, batching, and edge cases

Existing peer-access tests covered the integration path well (real
copies across peers, the full transition matrix, shared-pool
consistency) but only touched ``in``, ``==``, and the property setter
on the new set proxy. After the v1.0.0 break that surfaced ~25
``MutableSet`` methods, nothing was pinning the type-coercion contract,
the owner-filtering behavior, the ``KeyError``/value-error paths, or
the "one ``cuMemPoolSetAccess`` per bulk op" performance invariant.

Add the following coverage in ``test_memory_peer_access.py``:

- A ``MutableSet`` conformance test using a relaxed
  ``assert_mutable_set_interface`` mode that admits subjects holding at
  most one insertable element. CI maxes out at two GPUs (one peer), so
  the multi-element protocol pass cannot run there. The new
  ``support_multi_insert=False`` path takes one insertable item plus
  two non-member sentinels and exercises every ``MutableSet`` method
  (``add``/``discard``/``remove``/``pop``/``clear``/``update``,
  comparisons, ``isdisjoint``, subset/superset, binary and in-place
  operators, ``__iter__``/``__len__``/``__repr__``).

- ``Device``/``int`` interchangeability on
  ``add``/``discard``/``__contains__``.

- The owner-device filtering contract on every write (a silent no-op).

- Error paths: ``add(out_of_range)`` and ``add(non_coercible)`` raise,
  while the lenient ``discard``/``__contains__`` paths swallow the same
  inputs; ``remove(non_member)`` raises ``KeyError``.

- "Live driver view" semantics: a proxy obtained before another wrapper
  modifies the pool reflects the change with no refresh step.

- ``__iter__`` ordering is ascending by ``device_id`` and elements are
  ``Device`` instances; ``__repr__`` includes the class name and tracks
  live contents; the getter returns the documented proxy type.

- A batching spy that monkeypatches the module-level
  ``_apply_peer_access_diff`` and asserts that every bulk op
  (``|=``/``&=``/``-=``/``^=``/``update``/``difference_update``/
  ``clear``) and the property setter issue at most one driver call, and
  zero when the diff is empty.

To make the spy possible, ``_apply_peer_access_diff`` is now a
Python-visible ``def`` wrapper around a renamed
``_apply_peer_access_diff_cython`` ``cdef inline``. The proxy and the
property setter still call ``_apply_peer_access_diff`` by bare name,
which Cython resolves through the module's globals at runtime, so a
``monkeypatch.setattr(_peer_access_utils, "_apply_peer_access_diff",
...)`` intercepts them. The extra Python-level dispatch is negligible
next to ``cuMemPoolSetAccess`` itself.

Co-authored-by: Cursor <cursoragent@cursor.com>
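
The monkeypatch mechanism this coverage commit relies on is ordinary
Python name resolution and can be demonstrated self-containedly. In
this sketch the helper and its caller are stand-ins defined in the test
module itself; the real pair lives in ``_peer_access_utils.pyx`` and
has a different signature:

    import sys


    def _apply_peer_access_diff(to_add, to_remove):
        """Stand-in for the Python-visible ``def`` wrapper around the cdef helper."""
        return "driver work"


    def bulk_update(items):
        # The caller uses the bare name, so the lookup goes through module
        # globals at call time -- a monkeypatched attribute is therefore seen.
        return _apply_peer_access_diff(items, ())


    def test_spy_counts_helper_calls(monkeypatch):
        calls = []
        monkeypatch.setattr(
            sys.modules[__name__],
            "_apply_peer_access_diff",
            lambda to_add, to_remove: calls.append((tuple(to_add), to_remove)),
        )
        bulk_update((1, 2))
        assert calls == [((1, 2), ())]  # the bulk op hit the helper exactly once

A ``cdef inline`` call, by contrast, compiles to a direct C call and
bypasses this lookup entirely, which is why the wrapper had to become a
``def``.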
* cuda.core: filter empty-delta calls in peer-access batching spy

Augmented assignment on the ``peer_accessible_by`` property
(``dmr.peer_accessible_by |= {...}``) is two trips through the
proxy/setter pair, not one: Python fetches the proxy, the proxy mutates
itself in place via ``__ior__``, and Python then assigns the
(already-mutated) proxy back through the setter. That trailing setter
call computes the diff against current driver state, finds it empty,
and short-circuits inside the ``cdef inline`` before issuing any
``cuMemPoolSetAccess`` work. The *driver-level* contract ("one batched
call per bulk op") therefore still holds, but the wrapper is invoked
twice, which the spy was counting. (A plain-Python demonstration of
this round trip follows the commit list below.) Also, the fixture's
``dmr.peer_accessible_by = []`` reset on an already-empty pool is
itself an empty-delta wrapper call.

Filter the recorded calls down to those with non-empty deltas (the ones
that translate to real driver work) and switch the bulk-ops test to use
a locally bound proxy, so augmented assignment goes through ``__ior__``
once with no extra setter invocation. The setter test stays on
``dmr.peer_accessible_by = ...`` because that is the public API
contract under test there.

Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: spy on the driver call directly to assert peer-access batching

Move the actual ``cuMemPoolSetAccess`` invocation (descriptor-array
build plus driver call) into a thin Python-visible ``def
_set_pool_access`` in ``_peer_access_utils.pyx``.
``_apply_peer_access_diff`` now does only the empty-diff short-circuit
and delegates the work to ``_set_pool_access``, which Cython resolves
through the module globals at runtime so tests can intercept it via
``monkeypatch.setattr``.

Replace the previous internal-wrapper spy with a driver-call spy that
counts every real ``cuMemPoolSetAccess`` invocation. Earlier no-op
layers (e.g. the augmented-assignment-on-property pattern that writes
an already-mutated proxy back through the setter) short-circuit before
reaching ``_set_pool_access``, so the recorded count is exactly the
number of driver calls. The empty-delta filter and the local-binding
workaround in the bulk-ops test are gone; we now also assert that
``dmr.peer_accessible_by |= {...}`` directly on the property is still
exactly one driver call.

Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: split MutableSet conformance helpers by capacity

Replace the ``support_multi_insert`` flag and ``non_members`` keyword
with two purpose-built helpers:

- ``assert_mutable_set_interface(subject, items)`` keeps the original
  signature and contract: at least five distinct insertable items,
  exercised against a reference set in the standard way. The graph
  ``AdjacencySetProxy`` test continues to use this unchanged.

- ``assert_single_member_mutable_set_interface(subject, member,
  non_member)`` is a focused pass for proxies whose backing store
  admits at most one insertable element at a time (here, the
  peer-access view on a 2-GPU box). It threads a single member and one
  non-member sentinel through every ``MutableSet`` method.

The two helpers share small private utilities (empty-state checks,
``__repr__`` shape) but keep their public surfaces small and linear. A
capacity-one proxy is a meaningfully different contract from a general
mutable set; naming that explicitly in the API reads better than a flag
and avoids forcing call sites to plumb sentinels through.

Co-authored-by: Cursor <cursoragent@cursor.com>
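
The two-trip behavior dissected in the filter-empty-delta commit is
plain Python semantics for augmented assignment on a property,
independent of Cython or CUDA. A self-contained demonstration (Holder
and _View are illustrative names, not cuda.core types) records the
sequence of events:

    class Holder:
        """Owns a hidden store; 'view' mimics the peer_accessible_by property."""

        def __init__(self):
            self.events = []
            self._store = set()

        @property
        def view(self):
            self.events.append("get")   # Python fetches the proxy first...
            return _View(self)

        @view.setter
        def view(self, value):
            self.events.append("set")   # ...then writes the result back
            self._store = set(value)


    class _View(set):
        """Mutates the backing store in place, like the set proxy's __ior__."""

        def __init__(self, holder):
            super().__init__(holder._store)
            self._holder = holder

        def __ior__(self, other):
            self._holder.events.append("ior")
            self._holder._store |= set(other)
            self.update(other)
            return self                 # handed straight to the setter


    h = Holder()
    h.view |= {1}    # sugar for: h.view = h.view.__ior__({1})
    assert h.events == ["get", "ior", "set"]  # two property trips, one mutation
    assert h._store == {1}

The trailing "set" is the empty-delta setter call the spy was counting;
binding the proxy to a local variable first skips the property
machinery, leaving only the "ior" step.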
* Address review feedback on peer_accessible_by proxy

- Optimize add/discard to check the single target device (one driver
  call) instead of scanning all devices via _query_peer_access_ids
  (sketched below)
- Add a test for an out-of-range int in __contains__ (returns False)
- Fix a stale comment referencing the removed support_multi_insert flag
- Add a #2018 link to the release notes entry

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
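
A sketch of the add/discard optimization from the first review item,
under illustrative names (_Pool, includes, and includes_by_scan are not
cuda.core APIs): checking one device's access flag costs a single
query, while building the full member set first costs one query per
device.

    class _Pool:
        """Stand-in for the driver: per-device access flags plus a query counter."""

        def __init__(self, num_devices):
            self.flags = [False] * num_devices
            self.queries = 0

        def get_access(self, dev_id):   # one cuMemPoolGetAccess-like call
            self.queries += 1
            return self.flags[dev_id]


    def includes(pool, dev_id):
        # Optimized path: interrogate only the target device.
        return pool.get_access(dev_id)


    def includes_by_scan(pool, dev_id):
        # Pre-review path: enumerate every device to build the member set first.
        members = {d for d in range(len(pool.flags)) if pool.get_access(d)}
        return dev_id in members


    pool = _Pool(num_devices=8)
    assert includes(pool, 3) is False and pool.queries == 1          # one round-trip
    assert includes_by_scan(pool, 3) is False and pool.queries == 9  # one per device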
1 parent 50c19d0 commit 3432b28

14 files changed

Lines changed: 913 additions & 198 deletions

cuda_core/cuda/core/_layout.pyx

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # SPDX-License-Identifier: Apache-2.0

cuda_core/cuda/core/_memory/_device_memory_resource.pxd

Lines changed: 0 additions & 1 deletion
@@ -9,7 +9,6 @@ from cuda.core._memory._ipc cimport IPCDataForMR
 cdef class DeviceMemoryResource(_MemPool):
     cdef:
         int _dev_id
-        object _peer_accessible_by


 cpdef DMR_mempool_get_access(DeviceMemoryResource, int)

cuda_core/cuda/core/_memory/_device_memory_resource.pyx

Lines changed: 10 additions & 97 deletions
@@ -18,14 +18,12 @@ from cuda.core._utils.cuda_utils cimport (
     check_or_create_options,
     HANDLE_RETURN,
 )
-from cpython.mem cimport PyMem_Malloc, PyMem_Free
-
 from dataclasses import dataclass
 import multiprocessing
 import platform  # no-cython-lint
 import uuid

-from cuda.core._memory._peer_access_utils import plan_peer_access_update
+from cuda.core._memory._peer_access_utils import PeerAccessibleBySetProxy, replace_peer_accessible_by
 from cuda.core._utils.cuda_utils import check_multiprocessing_start_method

 __all__ = ['DeviceMemoryResource', 'DeviceMemoryResourceOptions']
@@ -131,7 +129,6 @@ cdef class DeviceMemoryResource(_MemPool):

     def __cinit__(self, *args, **kwargs):
         self._dev_id = cydriver.CU_DEVICE_INVALID
-        self._peer_accessible_by = None

     def __init__(self, device_id: Device | int, options=None):
         _DMR_init(self, device_id, options)
@@ -191,7 +188,6 @@ cdef class DeviceMemoryResource(_MemPool):
             _ipc.MP_from_allocation_handle(cls, alloc_handle))
         from .._device import Device
         mr._dev_id = Device(device_id).device_id
-        mr._peer_accessible_by = ()
         return mr

     @property
@@ -217,30 +213,23 @@ cdef class DeviceMemoryResource(_MemPool):
         pool. Access can be modified at any time and affects all allocations
         from this memory pool.

-        Returns a tuple of sorted device IDs that currently have peer access to
-        allocations from this memory pool.
-
-        When setting, accepts a sequence of :obj:`~_device.Device` objects or device IDs.
-        Setting to an empty sequence revokes all peer access.
-
-        For non-owned pools (the default or current device pool), the state
-        is always queried from the driver to reflect changes made by other
-        wrappers or direct driver calls.
+        Returns a set-like proxy of :obj:`~_device.Device` objects that manages
+        peer access. Inputs are accepted as either :obj:`~_device.Device`
+        objects or device-ordinal :class:`int` values.

         Examples
         --------
         >>> dmr = DeviceMemoryResource(0)
-        >>> dmr.peer_accessible_by = [1]  # Grant access to device 1
-        >>> assert dmr.peer_accessible_by == (1,)
-        >>> dmr.peer_accessible_by = []  # Revoke access
+        >>> dmr.peer_accessible_by = {1}  # grant access to device 1
+        >>> assert 1 in dmr.peer_accessible_by
+        >>> dmr.peer_accessible_by.add(2)  # update access to include device 2
+        >>> dmr.peer_accessible_by = []  # revoke peer access
         """
-        if not self._mempool_owned:
-            _DMR_query_peer_access(self)
-        return self._peer_accessible_by
+        return PeerAccessibleBySetProxy(self)

     @peer_accessible_by.setter
     def peer_accessible_by(self, devices):
-        _DMR_set_peer_accessible_by(self, devices)
+        replace_peer_accessible_by(self, devices)

     @property
     def is_device_accessible(self) -> bool:
@@ -253,81 +242,6 @@ cdef class DeviceMemoryResource(_MemPool):
         return False


-cdef inline _DMR_query_peer_access(DeviceMemoryResource self):
-    """Query the driver for the actual peer access state of this pool."""
-    cdef int total
-    cdef cydriver.CUmemAccess_flags flags
-    cdef cydriver.CUmemLocation location
-    cdef list peers = []
-
-    with nogil:
-        HANDLE_RETURN(cydriver.cuDeviceGetCount(&total))
-
-    location.type = cydriver.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
-    for dev_id in range(total):
-        if dev_id == self._dev_id:
-            continue
-        location.id = dev_id
-        with nogil:
-            HANDLE_RETURN(cydriver.cuMemPoolGetAccess(&flags, as_cu(self._h_pool), &location))
-        if flags == cydriver.CUmemAccess_flags.CU_MEM_ACCESS_FLAGS_PROT_READWRITE:
-            peers.append(dev_id)
-
-    self._peer_accessible_by = tuple(sorted(peers))
-
-
-cdef inline _DMR_set_peer_accessible_by(DeviceMemoryResource self, devices):
-    from .._device import Device
-
-    this_dev = Device(self._dev_id)
-    cdef object resolve_device_id = lambda dev: Device(dev).device_id
-    cdef object plan
-    cdef tuple target_ids
-    cdef tuple to_add
-    cdef tuple to_rm
-    if not self._mempool_owned:
-        _DMR_query_peer_access(self)
-    plan = plan_peer_access_update(
-        owner_device_id=self._dev_id,
-        current_peer_ids=self._peer_accessible_by,
-        requested_devices=devices,
-        resolve_device_id=resolve_device_id,
-        can_access_peer=this_dev.can_access_peer,
-    )
-    target_ids = plan.target_ids
-    to_add = plan.to_add
-    to_rm = plan.to_remove
-    cdef size_t count = len(to_add) + len(to_rm)
-    cdef cydriver.CUmemAccessDesc* access_desc = NULL
-    cdef size_t i = 0
-
-    if count > 0:
-        access_desc = <cydriver.CUmemAccessDesc*>PyMem_Malloc(count * sizeof(cydriver.CUmemAccessDesc))
-        if access_desc == NULL:
-            raise MemoryError("Failed to allocate memory for access descriptors")
-
-    try:
-        for dev_id in to_add:
-            access_desc[i].flags = cydriver.CUmemAccess_flags.CU_MEM_ACCESS_FLAGS_PROT_READWRITE
-            access_desc[i].location.type = cydriver.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
-            access_desc[i].location.id = dev_id
-            i += 1
-
-        for dev_id in to_rm:
-            access_desc[i].flags = cydriver.CUmemAccess_flags.CU_MEM_ACCESS_FLAGS_PROT_NONE
-            access_desc[i].location.type = cydriver.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
-            access_desc[i].location.id = dev_id
-            i += 1
-
-        with nogil:
-            HANDLE_RETURN(cydriver.cuMemPoolSetAccess(as_cu(self._h_pool), access_desc, count))
-    finally:
-        if access_desc != NULL:
-            PyMem_Free(access_desc)
-
-    self._peer_accessible_by = tuple(target_ids)
-
-
 cdef inline _DMR_init(DeviceMemoryResource self, device_id, options):
     from .._device import Device
     cdef int dev_id = Device(device_id).device_id
@@ -351,7 +265,6 @@ cdef inline _DMR_init(DeviceMemoryResource self, device_id, options):
         self._mempool_owned = False
         MP_raise_release_threshold(self)
     else:
-        self._peer_accessible_by = ()
         MP_init_create_pool(
             self,
             cydriver.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE,

cuda_core/cuda/core/_memory/_ipc.pyx

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # SPDX-License-Identifier: Apache-2.0

cuda_core/cuda/core/_memory/_peer_access_utils.py

Lines changed: 0 additions & 59 deletions
This file was deleted.
