Commit 3432b28
cuda.core: convert peer_accessible_by to a live MutableSet view (#2018)
* cuda.core: convert peer_accessible_by to a live MutableSet view
DeviceMemoryResource.peer_accessible_by previously returned a sorted
tuple[int, ...] backed by a Python-level cache, which was prone to
divergence from driver state across multiple wrappers around the same
memory pool. The setter accepted Device | int and emitted a single
batched cuMemPoolSetAccess covering the diff against the cache.
This commit replaces the property with a live driver-backed view:
- Adds PeerAccessibleBySetProxy in _memory/_peer_access_utils.py, a
collections.abc.MutableSet whose reads call cuMemPoolGetAccess and
whose writes call cuMemPoolSetAccess. Iteration yields Device
objects; add, discard, and __contains__ accept either a Device or a
device-ordinal int. The proxy is constructed fresh on every property
access, so there is nothing to cache or pickle.
- Drops the _peer_accessible_by cache field (and its initializations
in __cinit__, _DMR_init, and from_allocation_handle), eliminating
the owned/non-owned read split. All pools now share the same code
path and always query the driver.
- All bulk operations on the proxy (update, |=, &=, -=, ^=, clear,
pop) issue exactly one cuMemPoolSetAccess call. Peer-access
transitions can take seconds per pool because every existing memory
mapping is updated, so coalescing into a single driver call lets the
toolkit handle the mappings in parallel. The property setter
(mr.peer_accessible_by = [...]) preserves its original single-call
behavior via the same shared planner path.
- Single-element add validates can_access_peer through
plan_peer_access_update, matching the existing setter contract.
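A minimal sketch of the live-view idea, under stated assumptions: `FakePool` and `LivePeerView` are hypothetical stand-ins (not the real `PeerAccessibleBySetProxy`), and a plain Python set plays the role of driver state. It shows reads going to the backing store on every call and bulk writes collapsing to one batched write.

```python
from collections.abc import MutableSet


class FakePool:
    """Hypothetical stand-in for a driver memory pool; counts batched writes."""

    def __init__(self, owner, device_count):
        self.owner = owner
        self.device_count = device_count
        self.access = set()        # peer ordinals with access enabled
        self.set_access_calls = 0  # number of batched "driver" writes

    def get_access(self, dev):
        return dev in self.access

    def set_access(self, to_add, to_remove):
        self.set_access_calls += 1
        self.access = (self.access | set(to_add)) - set(to_remove)


class LivePeerView(MutableSet):
    """Hypothetical live view: keeps no cache, only a reference to the pool."""

    def __init__(self, pool):
        self._pool = pool

    def _current(self):
        p = self._pool
        return {d for d in range(p.device_count) if d != p.owner and p.get_access(d)}

    def _apply(self, target):
        # Diff the requested state against the live state; write at most once.
        current = self._current()
        target = {d for d in target if d != self._pool.owner}
        to_add, to_remove = target - current, current - target
        if to_add or to_remove:
            self._pool.set_access(to_add, to_remove)

    def __contains__(self, dev):
        return dev in self._current()

    def __iter__(self):
        return iter(sorted(self._current()))

    def __len__(self):
        return len(self._current())

    def add(self, dev):
        self._apply(self._current() | {dev})

    def discard(self, dev):
        self._apply(self._current() - {dev})

    def __ior__(self, other):
        # Override the element-by-element mixin so the update is one write.
        self._apply(self._current() | set(other))
        return self

    def clear(self):
        self._apply(set())


pool = FakePool(owner=0, device_count=4)
view_a = LivePeerView(pool)
view_b = LivePeerView(pool)
view_a |= {1, 2, 3}   # one batched write
view_b.discard(2)     # second write, made through a different view
view_a.add(0)         # owner device is filtered out: no write
```

Both views observe the same state because neither holds a copy of it, which is the property the cached tuple could not guarantee.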
This is a breaking change captured in the v1.0.0 release notes.
Callers comparing against tuples must update to set comparisons
(mr.peer_accessible_by == {Device(0)}). Existing tests are migrated;
new tests for set-interface conformance are intentionally deferred to
a follow-up.
Co-authored-by: Cursor <cursoragent@cursor.com>
* cuda.core: move peer-access internals into a single _peer_access_utils.pyx
The previous commit left DeviceMemoryResource carrying three pass-through
def methods (_query_peer_access_ids, _peer_access_includes,
_apply_peer_access_diff) whose only purpose was to give the pure-Python
proxy in _peer_access_utils.py a way to call cdef helpers in
_device_memory_resource.pyx. These methods served no public role and
cluttered the class API.
Promote _peer_access_utils.py to a Cython module so the proxy and the
driver-touching helpers can live together:
- Convert _peer_access_utils.py to _peer_access_utils.pyx. cimports
cydriver and DeviceMemoryResource from the .pxd; uses nogil and direct
CUmemAccessDesc packing identically to before.
- Move _DMR_query_peer_access_ids, _DMR_peer_access_includes,
_DMR_apply_peer_access_diff, and _DMR_replace_peer_accessible_by from
_device_memory_resource.pyx into the new module as cdef helpers (and a
cpdef replace_peer_accessible_by entry point used by the property
setter).
- Drop the three pass-through def methods from DeviceMemoryResource. The
class is left with the property getter and setter only; everything else
is module-level in _peer_access_utils.
- The proxy now calls the module-level cdef helpers directly instead of
routing through methods on mr.
No behavior change. The public surface (PeerAccessibleBySetProxy,
plan_peer_access_update, normalize_peer_access_targets, PeerAccessPlan)
is preserved at the same import paths.
Co-authored-by: Cursor <cursoragent@cursor.com>
* cuda.core: tighten peer-access query loop and 1.0.0 release note
Refactor _query_peer_access_ids so the entire driver loop runs inside a
single nogil block instead of acquiring/releasing the GIL once per
device. The flag query now uses a cached as_cu(mr._h_pool) handle and
fills a libcpp.vector[int]; because range(total) ascends, the result is
already sorted and the trailing sorted() call is dropped.
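The ordering argument can be illustrated in plain Python. This is only a sketch: `query_peer_access_ids` here is a hypothetical pure-Python analogue that takes a list of per-device flags instead of calling the driver.

```python
def query_peer_access_ids(flags):
    """Collect the ordinals whose access flag is set, scanning in ascending order."""
    ids = []
    for dev in range(len(flags)):
        if flags[dev]:
            ids.append(dev)
    # No sorted() needed: range() ascends, so ids is appended in sorted order.
    return ids


peer_ids = query_peer_access_ids([False, True, False, True])
```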
Also tighten the peer_accessible_by entry in 1.0.0-notes.rst: the
breaking-change blurb only needs to state the type/element change, so
remove the implementation-flavored details about input acceptance and
batched cuMemPoolSetAccess calls.
Co-authored-by: Cursor <cursoragent@cursor.com>
* cuda.core: cover PeerAccessibleBySetProxy interface, batching, and edge cases
Existing peer-access tests covered the integration path well (real
copies across peers, the full transition matrix, shared-pool consistency)
but only touched ``in``, ``==``, and the property setter on the new set
proxy. After the v1.0.0 break that surfaced ~25 ``MutableSet`` methods,
nothing was pinning the type-coercion contract, the owner-filtering
behavior, the ``KeyError``/value-error paths, or the "one
``cuMemPoolSetAccess`` per bulk op" performance invariant.
Add the following coverage in ``test_memory_peer_access.py``:
- A ``MutableSet`` conformance test using a relaxed
``assert_mutable_set_interface`` mode that admits subjects holding at
most one insertable element. CI maxes at two GPUs (one peer), so the
multi-element protocol pass cannot run there. The new
``support_multi_insert=False`` path takes one insertable item plus two
non-member sentinels and exercises every ``MutableSet`` method
(``add``/``discard``/``remove``/``pop``/``clear``/``update``,
comparisons, ``isdisjoint``, subset/superset, binary and in-place
operators, ``__iter__``/``__len__``/``__repr__``).
- ``Device``/``int`` interchangeability on ``add``/``discard``/``__contains__``.
- The owner-device filtering contract on every write (silent no-op).
- Error paths: ``add(out_of_range)`` and ``add(non_coercible)`` raise
while the lenient ``discard``/``__contains__`` paths swallow the same
inputs; ``remove(non_member)`` raises ``KeyError``.
- "Live driver view" semantics: a proxy obtained before another wrapper
modifies the pool reflects the change with no refresh step.
- ``__iter__`` ordering is ascending by ``device_id`` and elements are
``Device`` instances; ``__repr__`` includes the class name and tracks
live contents; the getter returns the documented proxy type.
- A batching spy that monkeypatches the module-level
``_apply_peer_access_diff`` and asserts that every bulk op
(``|=``/``&=``/``-=``/``^=``/``update``/``difference_update``/
``clear``) and the property setter issues at most one driver call,
zero when the diff is empty.
To make the spy possible, ``_apply_peer_access_diff`` is now a
Python-visible ``def`` wrapper around a renamed
``_apply_peer_access_diff_cython`` ``cdef inline``. The proxy and the
property setter still call ``_apply_peer_access_diff`` by bare name,
which Cython resolves through the module's globals at runtime, so a
``monkeypatch.setattr(_peer_access_utils, "_apply_peer_access_diff", ...)``
intercepts them. The extra Python-level dispatch is negligible next to
``cuMemPoolSetAccess`` itself.
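The global-lookup mechanism the spy relies on can be shown with a throwaway pure-Python module. This is a sketch under assumptions: the module name and the `_apply_diff`/`bulk_update` functions are hypothetical, and a plain `setattr` stands in for `monkeypatch.setattr` (which does the same thing and undoes it at teardown).

```python
import types

source = '''
def _apply_diff(to_add, to_remove):
    return "real"

def bulk_update(to_add, to_remove):
    # Bare-name call: resolved through the module's globals on every call.
    return _apply_diff(to_add, to_remove)
'''

# Build a module object whose functions share one globals dict.
utils = types.ModuleType("fake_peer_access_utils")
exec(source, utils.__dict__)

calls = []


def spy(to_add, to_remove):
    calls.append((tuple(to_add), tuple(to_remove)))
    return "spied"


before = utils.bulk_update([1], [])
setattr(utils, "_apply_diff", spy)  # rebinding the module attribute
after = utils.bulk_update([1], [])  # the caller now reaches the spy
```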
Co-authored-by: Cursor <cursoragent@cursor.com>
* cuda.core: filter empty-delta calls in peer-access batching spy
Augmented assignment on the ``peer_accessible_by`` property
(``dmr.peer_accessible_by |= {...}``) is two trips through the
proxy/setter pair, not one: Python fetches the proxy, the proxy mutates
itself in place via ``__ior__``, and Python then assigns the
(already-mutated) proxy back through the setter. That trailing setter
call computes the diff against current driver state, finds it empty,
and short-circuits inside the ``cdef inline`` before issuing any
``cuMemPoolSetAccess`` work — so the *driver-level* contract ("one
batched call per bulk op") still holds, but the wrapper is invoked
twice, which the spy was counting.
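The getter, ``__ior__``, setter sequence is ordinary Python semantics and can be reproduced without any GPU. In this sketch, ``Resource`` and ``Proxy`` are hypothetical stand-ins that log each step and count real state changes.

```python
class Proxy:
    """Hypothetical in-place-mutating view over a Resource's state."""

    def __init__(self, owner):
        self._owner = owner

    def __iter__(self):
        return iter(self._owner.state)

    def __ior__(self, other):
        self._owner.log.append("__ior__")
        new_state = self._owner.state | set(other)
        if new_state != self._owner.state:
            self._owner.state = new_state
            self._owner.writes += 1
        return self


class Resource:
    def __init__(self):
        self.state = set()
        self.log = []
        self.writes = 0  # counts real state changes only

    @property
    def peers(self):
        self.log.append("getter")
        return Proxy(self)

    @peers.setter
    def peers(self, value):
        self.log.append("setter")
        target = set(value)
        if target != self.state:  # empty diff: nothing to write
            self.state = target
            self.writes += 1


r = Resource()
r.peers |= {1, 2}          # getter, then __ior__, then setter
log_on_property = list(r.log)

p = r.peers                # bind the proxy locally first
p |= {3}                   # __ior__ only; no trailing setter call
log_after_local = r.log[len(log_on_property):]
```

The trailing setter call receives the already-mutated proxy, so its diff is empty and it performs no write, matching the behavior described above.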
Also, the fixture's ``dmr.peer_accessible_by = []`` reset on an already
empty pool is itself an empty-delta wrapper call.
Filter the recorded calls down to those with non-empty deltas (the ones
that translate to real driver work) and switch the bulk-ops test to use
a locally bound proxy so augmented assignment goes through ``__ior__``
once with no extra setter invocation. The setter test stays on
``dmr.peer_accessible_by = ...`` because that is the public API
contract under test there.
Co-authored-by: Cursor <cursoragent@cursor.com>
* cuda.core: spy on the driver call directly to assert peer-access batching
Move the actual ``cuMemPoolSetAccess`` invocation (descriptor-array
build + driver call) into a thin Python-visible ``def _set_pool_access``
in ``_peer_access_utils.pyx``. ``_apply_peer_access_diff`` now does only
the empty-diff short-circuit and delegates the work to
``_set_pool_access``, which Cython resolves through the module globals
at runtime so tests can intercept it via ``monkeypatch.setattr``.
Replace the previous internal-wrapper spy with a driver-call spy that
counts every real ``cuMemPoolSetAccess`` invocation. Earlier no-op
layers (e.g. the augmented-assignment-on-property pattern that writes
an already-mutated proxy back through the setter) short-circuit before
reaching ``_set_pool_access``, so the recorded count is exactly the
number of driver calls. The empty-delta filter and the local-binding
workaround in the bulk-ops test are gone; we now also assert that
``dmr.peer_accessible_by |= {...}`` directly on the property is still
exactly one driver call.
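A sketch of why spying on the lowest layer gives an exact count. ``PoolAccess`` is a hypothetical class standing in for the module-level helpers, and ``unittest.mock.patch.object`` stands in for ``monkeypatch.setattr``; the real code patches a module global rather than an instance attribute.

```python
from unittest import mock


class PoolAccess:
    """Hypothetical stand-in for the layered peer-access helpers."""

    def __init__(self):
        self.enabled = set()

    def set_pool_access(self, to_add, to_remove):
        # Lowest layer: the only place a real driver call would happen.
        self.enabled = (self.enabled | set(to_add)) - set(to_remove)

    def apply_diff(self, to_add, to_remove):
        if not to_add and not to_remove:
            return  # empty diff short-circuits before the driver layer
        self.set_pool_access(to_add, to_remove)

    def replace(self, target):
        target = set(target)
        self.apply_diff(target - self.enabled, self.enabled - target)


pool = PoolAccess()
with mock.patch.object(pool, "set_pool_access", wraps=pool.set_pool_access) as spy:
    pool.replace({1, 2})  # real change: reaches the driver layer
    pool.replace({1, 2})  # same target: short-circuits
    pool.replace([])      # removes both peers: reaches the driver layer
    pool.replace([])      # already empty: short-circuits
    driver_call_count = spy.call_count
```

Because no-op requests never reach ``set_pool_access``, the spy needs no filtering of empty deltas.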
Co-authored-by: Cursor <cursoragent@cursor.com>
* cuda.core: split MutableSet conformance helpers by capacity
Replace the ``support_multi_insert`` flag and ``non_members`` keyword
with two purpose-built helpers:
- ``assert_mutable_set_interface(subject, items)`` keeps the original
signature and contract: at least five distinct insertable items,
exercised against a reference set in the standard way. The graph
``AdjacencySetProxy`` test continues to use this unchanged.
- ``assert_single_member_mutable_set_interface(subject, member,
non_member)`` is a focused pass for proxies whose backing store admits
at most one insertable element at a time (here, the peer-access view
on a 2-GPU box). It threads a single member and one non-member
sentinel through every ``MutableSet`` method.
The two helpers share small private utilities (empty-state checks,
``__repr__`` shape) but keep their public surfaces small and linear.
A capacity-one proxy is a meaningfully different contract from a
general mutable set; naming that explicitly in the API reads better
than a flag and avoids forcing call sites to plumb sentinels through.
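What a capacity-one pass might look like, as a rough sketch only: ``check_single_member_set`` is a hypothetical function written for illustration, not the body of ``assert_single_member_mutable_set_interface``, and it covers a subset of the protocol.

```python
def check_single_member_set(subject, member, non_member):
    """Hypothetical capacity-one pass over part of the MutableSet protocol.

    Expects an empty subject that can hold `member` but never `non_member`.
    """
    assert len(subject) == 0 and member not in subject
    subject.add(member)
    assert member in subject and non_member not in subject
    assert len(subject) == 1 and list(subject) == [member]
    assert subject == {member} and subject != {non_member}
    assert subject <= {member, non_member}
    assert subject.isdisjoint({non_member})

    subject.discard(non_member)  # lenient: an absent element is ignored
    try:
        subject.remove(non_member)  # strict: an absent element raises
    except KeyError:
        pass
    else:
        raise AssertionError("remove() of a non-member should raise KeyError")

    assert subject.pop() == member and len(subject) == 0

    # In-place operators, each touching at most the single member.
    subject |= {member}
    assert subject == {member}
    subject -= {member}
    assert len(subject) == 0
    subject ^= {member}
    assert subject == {member}
    subject &= {non_member}
    assert len(subject) == 0
    subject.clear()
    return True


# Sanity check against the built-in set as a reference implementation.
result = check_single_member_set(set(), 1, 2)
```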
Co-authored-by: Cursor <cursoragent@cursor.com>
* Address review feedback on peer_accessible_by proxy
- Optimize add/discard to check the single target device (1 driver
call) instead of scanning all devices via _query_peer_access_ids
- Add test for out-of-range int in __contains__ (returns False)
- Fix stale comment referencing removed support_multi_insert flag
- Add #2018 link to release notes entry
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 50c19d0 commit 3432b28
14 files changed
Lines changed: 913 additions & 198 deletions
File tree
- cuda_core
  - cuda/core
    - _memory
  - docs/source
    - release
  - tests
    - helpers
    - memory_ipc