
Commit 5befc7d

rparolin and claude authored
Add managed-memory advise, prefetch, and discard-prefetch free functions (#1775)
* wip * wip * fixing ci compiler errors * skipping tests that aren't supported * cu12 support * Moving to function from Buffer class methods to free standing functions in the cuda.core.managed_memory namespace * precommit format * iterating on implementation * Simplify managed-memory helpers: remove long-form aliases, cache lookups, fix docs - Remove duplicate long-form "cu_mem_advise_*" string aliases from _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly - Replace 4 boolean allow_* params in _normalize_managed_location with a single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES - Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag, discard_prefetch support, and advice enum-to-alias reverse map - Collapse hasattr+getattr to single getattr in _managed_location_enum - Move _require_managed_discard_prefetch_support to top of discard_prefetch for fail-fast behavior - Fix docs build: reset Sphinx module scope after managed_memory section in api.rst so subsequent sections resolve under cuda.core - Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(test): reset _V2_BINDINGS cache so legacy-signature tests take the legacy path The _V2_BINDINGS cache in _buffer.pyx persists across tests, so monkeypatching get_binding_version alone is insufficient when earlier tests have already populated the cache with the v2 value. Promote _V2_BINDINGS from cdef int to a Python-level variable so tests can monkeypatch it directly via monkeypatch.setattr, and reset it to -1 in both legacy-signature tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(test): require concurrent_managed_access for advise tests that hit real hardware These three tests call cuMemAdvise on real CUDA devices and verify memory range attributes. On devices without concurrent_managed_access (e.g. 
Windows/WDDM), set_read_mostly silently no-ops and set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the stricter _skip_if_managed_location_ops_unsupported guard, matching the pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: validate managed buffer before checking discard_prefetch bindings support Reorder checks in discard_prefetch so _normalize_managed_target_range runs before _require_managed_discard_prefetch_support. This ensures non-managed buffers raise ValueError before the RuntimeError for missing cuMemDiscardAndPrefetchBatchAsync support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor: extract managed memory ops into dedicated _managed_memory_ops module Move advise, prefetch, and discard_prefetch functions and their helpers out of _buffer.pyx into a new _managed_memory_ops Cython module to improve separation of concerns. Expose _init_mem_attrs and _query_memory_attrs as non-inline cdef functions in _buffer.pxd so the new module can reuse them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * pre-commit fix * Removing blank file * wip * fix(cuda.core): update binding_version import after upstream merge Upstream renamed get_binding_version → binding_version and moved it from cuda.core._utils.cuda_utils to cuda.core._utils.version. Update the managed-memory ops module to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * revert: drop managed_memory shim in cuda.core.experimental The cuda.core.experimental namespace is being deprecated and should not gain new submodules. Per review feedback, the managed_memory module should only be reachable via cuda.core.managed_memory, not via the experimental compatibility shim. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cuda.core): add Location dataclass for managed memory Frozen dataclass with classmethod constructors for the four CUmemLocationType kinds (device, host, host_numa, host_numa_current). Validates id constraints in __post_init__. Re-exported from cuda.core.managed_memory. This will replace the location=/location_type= kwargs in the upcoming unified 1..N managed-memory ops API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cuda.core): add _coerce_location helper Centralizes back-compat coercion for managed-memory Location inputs: - Location → passthrough - Device → Location.device(device_id) - int >= 0 → Location.device(int) - int == -1 → Location.host() - None → None when allow_none=True, else ValueError Will be used by the unified 1..N managed-memory ops API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(cuda.core): update monkeypatch target after binding_version rename The legacy-bindings monkeypatch tests still referenced get_binding_version, which was renamed to binding_version in cf2f20d. Update both occurrences. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(cuda.core): tighten memory-attr query Address review feedback on _buffer.pyx: - Restore `inline` on `_init_mem_attrs` and `_query_memory_attrs`. - Set `out.is_managed = (is_managed != 0)` once outside the if/elif, rather than per-branch (driver leaves the attribute zero for non-managed pointers, so all three branches converged on the same value anyway). - Add a TODO noting that HMM/ATS-enabled sysmem should also report `is_managed=True`; the CU_POINTER_ATTRIBUTE_IS_MANAGED query does not capture that yet. The Cython modernization of _managed_memory_ops.pyx (cimport cydriver, IF/ELSE for the 12/13 ABI split) is folded into Tasks 5-8 where the public API is being rewritten anyway; doing it here would mean rewriting the same call sites twice. 
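The Location dataclass and the _coerce_location rules listed earlier in this message can be sketched in plain Python. This is a minimal illustration, not the cuda.core implementation: FakeDevice, the lowercase coerce_location name, and the "id >= 0" rule for device and host_numa are assumptions made for the example; only the four kinds, the classmethod constructors, the host/host_numa_current "no id" check, and the coercion table come from the commit text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical stand-in for the Location dataclass described above."""

    kind: str
    id: int | None = None

    def __post_init__(self) -> None:
        if self.kind in ("device", "host_numa"):
            # Assumption: these two kinds need a non-negative integer id.
            if not isinstance(self.id, int) or self.id < 0:
                raise ValueError(f"{self.kind} location requires id >= 0")
        elif self.kind in ("host", "host_numa_current") and self.id is not None:
            raise ValueError(f"{self.kind} location must not carry an id")
        elif self.kind not in ("host", "host_numa_current"):
            raise ValueError(f"unknown location kind: {self.kind!r}")

    @classmethod
    def device(cls, device_id: int) -> Location:
        return cls("device", device_id)

    @classmethod
    def host(cls) -> Location:
        return cls("host")

    @classmethod
    def host_numa(cls, numa_id: int) -> Location:
        return cls("host_numa", numa_id)

    @classmethod
    def host_numa_current(cls) -> Location:
        return cls("host_numa_current")


@dataclass(frozen=True)
class FakeDevice:
    """Hypothetical stand-in for a Device object exposing device_id."""

    device_id: int


def coerce_location(value, *, allow_none: bool = False):
    """Apply the coercion table from the commit message (sketch only)."""
    if isinstance(value, Location):
        return value  # Location -> passthrough
    if value is None:
        if allow_none:
            return None
        raise ValueError("location must not be None")
    if isinstance(value, int) and not isinstance(value, bool):
        if value >= 0:
            return Location.device(value)  # int >= 0 -> device ordinal
        if value == -1:
            return Location.host()  # int == -1 -> host
        raise ValueError(f"invalid integer location: {value}")
    device_id = getattr(value, "device_id", None)
    if isinstance(device_id, int):
        return Location.device(device_id)  # Device -> Location.device(device_id)
    raise TypeError(f"cannot interpret {value!r} as a location")
```

Later commits in this PR replace Location with Device and Host and drop the int shorthand, so this sketch reflects only the intermediate design.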
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cuda.core): unified 1..N managed_memory.prefetch with cydriver Rewrite prefetch() with the unified single-or-batched signature targeted by issue #1333: - prefetch(targets, location, *, options=None, stream) - targets accepts a single Buffer or a sequence of Buffers - location accepts a Location dataclass, Device, int (-1 = host), or a sequence broadcasting to per-buffer locations - length mismatch raises ValueError; empty targets raises ValueError - options is reserved for future per-call flags and must be None - stream moved to the end, kept keyword-only Internals: switch from Python-level driver.cuMemPrefetchAsync to Cython-level cydriver.cuMemPrefetchAsync via cimport cydriver, with HANDLE_RETURN. Replace the runtime _V2_BINDINGS check with compile-time IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE per the codebase precedent in _managed_memory_resource.pyx, _memory_pool.pyx, _tensor_map.pyx. N>1 dispatches to cydriver.cuMemPrefetchBatchAsync (CUDA 13 only); on CUDA 12 builds, batched prefetch raises NotImplementedError. Single-range prefetch continues to work on both CUDA 12 and 13 builds. The location_type= keyword is removed; callers express location kind via the Location dataclass added in 20d036e. The advise() and discard_prefetch() functions still use the legacy _normalize_managed_location helper and Python-level driver calls; they will be migrated in their own tasks. Also drops test_managed_memory_prefetch_uses_legacy_bindings_signature, which monkeypatched the Python-level driver.cuMemPrefetchAsync — no longer applicable since the prefetch path uses cydriver. The corresponding advise legacy-bindings test stays for now (advise still uses Python driver). Closes Andy-Jost's review comment that the existing API is "non-Pythonic" by making it Pythonic in a different direction (typed Location dataclass) while preserving the free-function shape pending Leo's tie-break on ManagedBuffer subclass. 
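The single-or-batched argument handling described for prefetch() (one Buffer or a sequence, one location or a per-buffer sequence, ValueError on empty targets or length mismatch) might look roughly like the sketch below. The helper names and the use of plain Python objects in place of Buffer are assumptions for illustration; the real helpers are Cython and are not reproduced here.

```python
def coerce_targets(targets):
    """Return a non-empty list of targets from one item or a list/tuple."""
    items = list(targets) if isinstance(targets, (list, tuple)) else [targets]
    if not items:
        raise ValueError("targets must not be empty")
    return items


def broadcast_locations(location, count):
    """Return one location per target, broadcasting a single value."""
    if isinstance(location, (list, tuple)):
        locations = list(location)
        if len(locations) != count:
            raise ValueError(
                f"got {len(locations)} locations for {count} targets"
            )
        return locations
    # A single location applies to every target.
    return [location] * count


def plan_prefetch(targets, location):
    """Pair each target with its location (hypothetical helper)."""
    items = coerce_targets(targets)
    return list(zip(items, broadcast_locations(location, len(items))))
```

For example, plan_prefetch(["a", "b"], "dev0") pairs both targets with "dev0", while plan_prefetch(["a", "b"], ["dev0"]) raises ValueError.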
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cuda.core): add managed_memory.discard Adds a new discard(targets, *, options=None, stream) free function that wraps cuMemDiscardBatchAsync. Accepts a single Buffer or a sequence; N>=1 dispatches to the batched driver entry point. Requires a CUDA 13 build of cuda.core (NotImplementedError on CUDA 12 builds). Closes the second of three batched managed-memory operations from #1333: P1: cudaMemDiscardBatchAsync <- this commit P1: cudaMemPrefetchBatchAsync <- 818f5d2 P1: cudaMemDiscardAndPrefetchBatchAsync <- next commit Re-exported from cuda.core.managed_memory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cuda.core): unified 1..N managed_memory.discard_prefetch with cydriver Rewrite discard_prefetch() with the unified single-or-batched signature: discard_prefetch(targets, location, *, options=None, stream) - targets accepts a single Buffer or a sequence of Buffers - location accepts a Location, Device, int, or per-buffer sequence - length mismatch / empty targets raise ValueError - options must be None (reserved) - stream moved to end, kept keyword-only Internals: switch from Python-level driver.cuMemDiscardAndPrefetchBatchAsync to Cython-level cydriver.cuMemDiscardAndPrefetchBatchAsync. The runtime discard-prefetch availability check is replaced by compile-time IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE; on CUDA 12 builds the call raises NotImplementedError. The location_type= keyword is removed; use Location dataclass instead. Closes the third managed-memory batched op from #1333. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cuda.core): unified 1..N managed_memory.advise + drop legacy apparatus Rewrite advise() with the unified single-or-batched signature: advise(targets, advice, location=None, *, options=None) - targets accepts a single Buffer or a sequence - advice still accepts string aliases or driver.CUmem_advise enum values - location accepts Location dataclass, Device, int, None, or per-buffer sequence (None permitted only for set_read_mostly, unset_read_mostly, unset_preferred_location) - Per-advice allowed-kind validation ported to operate on Location.kind (matches CUDA driver constraints from existing tables) - options reserved for future per-call flags - For N>1, loops cydriver.cuMemAdvise per buffer (no batched advise API exists in CUDA) Internals: switch to cydriver.cuMemAdvise (Cython-level); use compile-time IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE for the 12/13 ABI split. Drop the legacy apparatus that all four functions previously shared: - _normalize_managed_location (returned Python driver.CUmemLocation) - _make_managed_location, _managed_location_enum - _managed_location_uses_v2_bindings + _V2_BINDINGS lazy cache - _managed_location_to_legacy_device + _LEGACY_LOC_DEVICE/HOST cache - _require_managed_discard_prefetch_support - Unused module-level constants (_HOST_NUMA_CURRENT_ID, _SINGLE_RANGE_COUNT, _MANAGED_OPERATION_FLAGS, etc.) Also drop test_managed_memory_advise_uses_legacy_bindings_signature and the _LEGACY_BINDINGS_VERSION constant; the runtime version switch is gone, replaced by compile-time IF/ELSE that the test could not exercise. The CUDA 12 vs CUDA 13 paths are now covered by the build-matrix CI job. Closes Task 8 (advise) and Task 9 (legacy-bindings test cleanup) from docs/superpowers/plans/2026-04-27-managed-memory-ops-batched.md. 
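As a rough illustration of the advise() argument rules above, the sketch below resolves a short alias to its CUDA enum name and enforces the "location may be None only for set_read_mostly, unset_read_mostly, unset_preferred_location" rule. The CU_MEM_ADVISE_* names are the CUDA driver's enum members; the "set_accessed_by" and "unset_accessed_by" alias spellings and the function name are assumptions, and the per-kind validation tables are omitted because the commit message does not list them.

```python
# Forward table: short alias -> CUDA driver enum member name.
ADVICE_ALIASES = {
    "set_read_mostly": "CU_MEM_ADVISE_SET_READ_MOSTLY",
    "unset_read_mostly": "CU_MEM_ADVISE_UNSET_READ_MOSTLY",
    "set_preferred_location": "CU_MEM_ADVISE_SET_PREFERRED_LOCATION",
    "unset_preferred_location": "CU_MEM_ADVISE_UNSET_PREFERRED_LOCATION",
    "set_accessed_by": "CU_MEM_ADVISE_SET_ACCESSED_BY",  # alias spelling assumed
    "unset_accessed_by": "CU_MEM_ADVISE_UNSET_ACCESSED_BY",  # alias spelling assumed
}

# Advice values for which the commit message permits location=None.
LOCATION_OPTIONAL = frozenset(
    {"set_read_mostly", "unset_read_mostly", "unset_preferred_location"}
)


def resolve_advice(advice: str, location=None) -> str:
    """Validate (advice, location) and return the enum member name."""
    if advice not in ADVICE_ALIASES:
        raise ValueError(f"unknown advice: {advice!r}")
    if location is None and advice not in LOCATION_OPTIONAL:
        raise ValueError(f"{advice} requires a location")
    return ADVICE_ALIASES[advice]
```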
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(cuda.core): use Buffer.is_managed property in managed_memory ops _require_managed_buffer was poking at Buffer._mem_attrs.is_managed directly via _init_mem_attrs(). PR #1924 added the public Buffer.is_managed property which falls back to MemoryResource.is_managed when the pointer attribute query does not advertise managed memory (the case for pool- allocated managed memory). Switch _require_managed_buffer to the public property. This also fixes a latent bug where pool-allocated managed buffers were being rejected by the managed_memory ops despite Buffer.is_managed correctly reporting True. Drops the no-longer-needed cimport of _init_mem_attrs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(cuda.core): document Location, discard, and 1..N managed_memory ops api.rst: add Location and discard to the managed_memory autosummary. 1.0.0-notes.rst: replace the placeholder bullet with a description of the unified 1..N API, the Location dataclass, and the dispatch to batched driver entry points on cuda.bindings 12.8+. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(cuda.core): drop narrative comments and tighten _coerce_location docstring Per /simplify review, remove WHAT-only comments that just restate the function signature in front of _coerce_buffer_targets and _broadcast_locations. Tighten the _coerce_location docstring to lead with the conversion intent rather than restate the type annotation. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(cuda.core): satisfy pre-commit hooks - ruff auto-applied: * Drop unused `_managed_memory_ops` test import (no longer needed after the legacy-bindings monkeypatch test was deleted) * Drop "Location" string-quoted forward refs in _managed_location.py (file already uses `from __future__ import annotations`) * Reformat string concatenations and add blank-line-after-import spacing - cython-lint auto-applied: * Drop unused libc.stdint cimport of `uintptr_t` * Drop unused `Location` Python import (only used in docstrings) * Drop unused `n` local in `discard()` * Move `cpython.mem cimport` of PyMem_Free / PyMem_Malloc inside the `IF CUDA_CORE_BUILD_MAJOR >= 13:` block where the symbols are actually used; cython-lint cannot see across compile-time branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(cuda.core): move managed_memory ops to cuda.core.utils Per Leo's review request (#1775 (comment)), fold the managed-memory free functions and the Location dataclass into cuda.core.utils rather than maintaining a dedicated cuda.core.managed_memory namespace. - Re-export Location, advise, prefetch, discard, discard_prefetch from cuda.core.utils. - Delete cuda.core.managed_memory module. - Update cuda.core.__init__ to drop the managed_memory submodule import. - Update tests to import from cuda.core.utils. - Update api.rst: drop the dedicated Managed memory section; add the managed-memory entries to the Utility functions section. - Update 1.0.0-notes.rst accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(cuda.core): use __all__ in utils instead of per-import noqa Replace seven `# noqa: F401` comments with a single `__all__` block listing the public re-exports. 
Cleaner intent signal — these are deliberate facade exports, not accidental imports — and matches the existing __all__ convention used in cuda.core.system, _legacy.py, and typing.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(cuda.core): collapse nested if in Location.__post_init__ (SIM102) ruff SIM102 flagged the host/host_numa_current branch: elif self.kind in ("host", "host_numa_current"): if self.id is not None: raise ValueError(...) into a single condition with `and`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(cuda.core): share one DummyUnifiedMemoryResource per batched test Both batched-advise tests previously created a throwaway DummyUnifiedMemoryResource per allocation inside a list comprehension: bufs = [DummyUnifiedMemoryResource(device).allocate(size) for _ in range(2)] The Buffer holds a reference to the throwaway MR via mr=self, so the MR should stay alive — but on CUDA 12.9.1 CI test_batched_same_advice fails with bufs[0] showing ptr=0x0 size=0 (the post-close state). On CUDA 13 the same pattern works. Switch to one MR shared across both allocations. This is cleaner anyway and removes the throwaway-per-iteration pattern as a possible source of the cu12 issue. If the failure persists, we'll know the MR lifetime wasn't the cause. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(cuda.core): query all buffers before closing in test_batched_same_advice On CUDA 12, freeing one managed allocation appears to clear the read-mostly advice on neighboring ranges. The original test interleaved query-then-close inside one loop, so the second iteration would query bufs[1] *after* bufs[0] had been freed and observe a cleared advice flag — causing assert 0 == 1. Move the queries into a list comprehension that runs before any close, then close all buffers, then assert. Decouples the verification from the deallocation order. 
CUDA 13 was unaffected because its managed-memory bookkeeping does not exhibit the cross-range invalidation on free. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * review(cuda.core): address PR #1775 feedback - Drop defensive cuInit retry in _query_memory_attrs (Andy): we don't auto-init CUDA elsewhere; let HANDLE_RETURN propagate the error. - Use checked Cython cast `<Buffer?>t` in _coerce_buffer_targets (Leo) in place of the manual isinstance loop. - Introduce *Options dataclasses (AdviseOptions, PrefetchOptions, DiscardOptions, DiscardPrefetchOptions) per cuda.core convention (Leo). Functions accept None or the matching dataclass; tests updated to match the new error message. * test(cuda.core): split managed-memory ops tests into tests/memory/ Move the managed-memory advise/prefetch/discard/discard_prefetch tests (plus their TestLocation/TestLocationCoerce/TestPrefetch/TestDiscard/ TestDiscardPrefetch/TestAdvise classes and skip helpers) from test_memory.py into tests/memory/test_managed_ops.py per Andy's nit. Promote DummyDeviceMemoryResource to helpers.buffers so both files can import it; the remaining DummyHost/DummyPinned/NullMemoryResource stay in test_memory.py since they're only used there. Broader memory-tests reorg ("and siblings": buffer/managed_resource/ pinned/vmm) tracked as a follow-up cleanup PR to keep this diff focused. * test(cuda.core): fix options regex for AdviseOptions ("an" vs "a") The advise() error message reads "must be an AdviseOptions instance or None" (vowel triggers "an"), but the regex matched only "must be a ". Relax to "must be an?" so all three op tests pass. * chore(cuda.core): drop unused utils import + trailing blank lines Pre-commit cleanup after splitting managed-memory ops tests out of test_memory.py: the `cuda.core.utils` import is no longer used here, and ruff trimmed trailing blank lines. 
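The "an" vs "a" regex fix above can be reproduced in isolation. Only the AdviseOptions message is quoted in the commit text; the other three messages below are assumed to follow the same template with the article "a".

```python
import re

MESSAGES = [
    "options must be an AdviseOptions instance or None",  # quoted in the commit
    "options must be a PrefetchOptions instance or None",  # assumed wording
    "options must be a DiscardOptions instance or None",  # assumed wording
    "options must be a DiscardPrefetchOptions instance or None",  # assumed wording
]

# The original pattern required "a" followed by a space, so "an" never matched.
STRICT = re.compile(r"must be a ")
# "an?" makes the trailing "n" optional and accepts both articles.
RELAXED = re.compile(r"must be an? ")

strict_hits = [bool(STRICT.search(m)) for m in MESSAGES]
relaxed_hits = [bool(RELAXED.search(m)) for m in MESSAGES]
```

With these inputs the strict pattern misses only the AdviseOptions message, while the relaxed pattern matches all four.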
* feat(cuda.core): add ManagedBuffer subclass + Host location Land Andy's ManagedBuffer + Device/Host design (review #3976251223, #3164213789). The free-function shape introduced earlier in this PR is preserved; ManagedBuffer methods delegate into it, so existing call sites keep working. ManagedBuffer - Subclass of Buffer returned by ManagedMemoryResource.allocate, also constructable from an external pointer via ManagedBuffer.from_handle. - Property-style advice API: - read_mostly (bool, driver-backed get/set) - preferred_location (Device | Host | None, get/set; None unsets) - accessed_by (live AccessedBySet view: __contains__/__iter__/len query the driver, add()/discard() issue advice; setter diffs and advises only the deltas) - Instance methods prefetch / discard / discard_prefetch delegate to the matching cuda.core.utils functions. Host - New top-level class symmetric to Device. Host(), Host(numa_id=N), Host.numa_current(). Replaces Location.host()/host_numa()/etc. Location -> Device|Host|int - Drop the public Location dataclass and its classmethod constructors. - _coerce_location now accepts Device | Host | int | None and produces an internal _LocSpec record; advise/prefetch/discard/discard_prefetch signatures and docstrings updated accordingly. - int still accepted for ergonomic compatibility (-1 = host, >=0 = device ordinal). Plumbing - Buffer_from_deviceptr_handle takes an optional `cls` parameter so the pool allocator can materialize Buffer subclasses; _MP_allocate threads the same parameter through; ManagedMemoryResource.allocate passes ManagedBuffer. Tests - TestHost replaces TestLocation; TestLocationCoerce adapted to the new coerce signature. New TestManagedBuffer covers from_handle, isinstance(allocate(), ManagedBuffer), read_mostly/preferred_location/ accessed_by roundtrips, and instance methods. 
Property tests use external (cuMemAllocManaged) backing wrapped via from_handle, since some driver/device combinations decline cuMemAdvise on pool-allocated managed memory. - Use cuDeviceGetCount in AccessedBySet._query so the read path doesn't pull in NVML. Docs - 1.0.0 notes describe Host, ManagedBuffer, the property API, and the Device/Host location inputs. api.rst lists Host, ManagedBuffer, and the *Options dataclasses; Location is removed. * chore(cuda.core): simplify ManagedBuffer per /simplify review - Buffer.from_handle is now a classmethod that dispatches via cls._init, so subclasses inherit it: ManagedBuffer.from_handle(...) returns a ManagedBuffer with no override needed. Drop ManagedBuffer.from_handle. - Hoist `advise / prefetch / discard / discard_prefetch` imports from per-method lazy imports to module-level (no circular import: they live in cuda.core._memory._managed_memory_ops, not cuda.core.utils). - Cache the CUmem_advise and CUmem_range_attribute enum lookups at module level and pass enum constants directly to advise() instead of re-resolving from string aliases on every property write. - Extract _query_accessed_by as a module-level helper; AccessedBySet delegates and the accessed_by setter calls it directly instead of constructing a throwaway view. * ci: re-trigger CI (transient cuInit INVALID_DEVICE on l4 runner) * refactor(cuda.core): use libcpp.vector for batched-op C arrays (R14) Per Andy's review nit (PR #1775, _managed_memory_ops.pyx:207), replace the manual PyMem_Malloc / PyMem_Free pattern in the three batch helpers (_do_batch_discard, _do_batch_prefetch, _do_batch_discard_prefetch) with libcpp.vector. RAII handles cleanup, eliminating the manual try/finally and removing a leak window if _to_cumemlocation raised mid-fill. Matches the precedent used in _program.pyx, _linker.pyx, _kernel_arg_handler.pyx, _graph_node.pyx, and others. Net change: 53 insertions, 85 deletions. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cuda.core): restore CUDA_ERROR_NOT_INITIALIZED auto-init in _query_memory_attrs (R4) Per Leo's review on PR #1775 (_buffer.pyx:455), restore the auto-init retry that was removed in 10de998. cuPointerGetAttributes is the first driver call _query_memory_attrs makes, and a NOT_INITIALIZED result here would otherwise propagate out of every is_managed / is_host_accessible / is_device_accessible query before the user has called any other Device API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(cuda.core): make Host a plain class instead of a dataclass (R1) Per Leo's review on PR #1775 (_host.py:9), drop the @dataclass(frozen=True) in favor of a hand-written class with property accessors. Matches Leo's original sketch from the 2026-04-28 drive-by comment and aligns with how Device is structured in this codebase. Behavior preserved: Host(), Host(numa_id=N), and Host.numa_current() all work identically. __eq__, __hash__, and immutability are hand-rolled rather than dataclass-generated. is_numa_current is no longer an __init__ kwarg; it is internal state settable only via the Host.numa_current() classmethod. Two existing TestHost cases updated: - test_numa_current_with_id_rejected → test_numa_current_only_via_classmethod - test_frozen → test_immutable (AttributeError instead of FrozenInstanceError) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cuda.core)!: drop int location shorthand from managed-memory ops (R6, R8) Per Leo's review on PR #1775 (_managed_buffer.py:165) and Andy's parallel question (line 144), drop the `int` shorthand for prefetch/discard_prefetch/advise locations. The previous design accepted `Device | Host | int` where `int >= 0` meant a device ordinal and `-1` magically meant host. With first-class `Device` and `Host`, the int form was redundant and the `-1 → Host` magic was surprising. 
Public API change: prefetch(buf, Device(0), stream=...) # was: prefetch(buf, 0, stream=...) prefetch(buf, Host(), stream=...) # was: prefetch(buf, -1, stream=...) This also resolves an inconsistency: ManagedBuffer.preferred_location already accepted only Device | Host | None, but prefetch() and discard_prefetch() accepted int. Now uniformly Device | Host. Pre-1.0 breaking change. Anyone using the int shorthand should switch to the explicit Device(N) / Host() form. Files touched: - _managed_location.py: drop the int branch from _coerce_location; TypeError now reads "Device, Host, or None" - _managed_buffer.py: type signatures `Device | Host | int` → `Device | Host` - _managed_memory_ops.pyx: docstring updates (3 occurrences) - tests/memory/test_managed_ops.py: replace int call sites with Host()/Device(N); collapse three int-branch tests into one test_int_rejected - 1.0.0-notes.rst: drop the "int values are also accepted" sentence Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(cuda.core): add AccessedBySet to api_private.rst (R5) Per Andy's review on PR #1775 (_managed_buffer.py:52), document `AccessedBySet` in the private API reference. It is returned by `ManagedBuffer.accessed_by` but not directly instantiable by users — matches the existing `_memory._ipc.*` entries in the same section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(cuda.core): note the legacy NUMA round-trip limitation on preferred_location (R2, R7) Per Leo's questions on PR #1775 (_host.py:26 and _managed_buffer.py:140): R2 (Host numa_id): the dataclass surface is intentional. Three forms already cover the use cases — Host() / Host(numa_id=N) / Host.numa_current(). Auto-inferring numa_id at Host() construction would conflict with the "generic host" semantic. R7 (preferred_location getter): the underlying limitation is real but upstream-blocked. 
The legacy CU_MEM_RANGE_ATTRIBUTE_PREFERRED_LOCATION returns only a single int (device id, -1 host, -2 none) — no NUMA. CUDA 13 added _PREFERRED_LOCATION_TYPE / _ID for full round-trip, and they are exposed in cydriver, but cuda.bindings' _HelperCUmem_range_attribute does not yet recognize them — calling driver.cuMemRangeGetAttribute with the new attributes raises "Unsupported attribute". Once cuda.bindings adds them, this getter can query the v2 attributes and return Host(numa_id=N). Add a docstring note documenting the limitation so users aren't surprised by the lossy round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(cuda.core): use collections.abc.Sequence for input checks (R12, R13) Per Andy's review on PR #1775 (_managed_memory_ops.pyx:102 and :118), replace `isinstance(x, (list, tuple))` with `isinstance(x, Sequence)` in `_coerce_buffer_targets` and `_broadcast_locations`. Matches the existing precedent in `cuda.core._utils.cuda_utils.is_sequence()`. The widened input set also accepts `str`, but neither `Buffer` nor `Location` is stringly-typed, so a `str` input still raises — just with a different message (Buffer cast error or Location TypeError from `_coerce_location`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(cuda.core): narrow Buffer.from_handle to Buffer-only (R3) Per Leo's review on PR #1775 (_buffer.pyx:135), make Buffer.from_handle a @staticmethod that always returns Buffer. Subclass-aware construction stays available via the private @classmethod Buffer._init, which is what Leo asked for ("use a private method for handling subclasses for now"). ManagedBuffer gains its own @classmethod from_handle that wraps cls._init, so user-facing call sites like ManagedBuffer.from_handle(ptr, size, owner=plain) continue to work unchanged. The narrowly-scoped subclass factory is on the subclass itself, not bolted onto Buffer's public surface. 
This addresses R3's spirit: cuda.core's public APIs no longer advertise generic subclass-construction support that conflicts with the broader subclassing story tracked in #750 / #1989. No test changes; behavior preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(cuda.core): single API surface per operation (R9, R10, R11) Per Leo's R11 ("if we prefer methods, don't expose free functions"): each managed-memory operation now has exactly one public surface, chosen by whether it acts on one buffer or many. Single buffer (instance methods + properties on ManagedBuffer): - buf.read_mostly = True - buf.preferred_location = Device(0) - buf.accessed_by.add(Device(1)) - buf.prefetch(Device(0), stream=stream) - buf.discard(stream=stream) - buf.discard_prefetch(Device(0), stream=stream) Multiple buffers (free functions in cuda.core.utils, CUDA 13+ only): - utils.prefetch_batch(buffers, locations, stream=stream) - utils.discard_batch(buffers, stream=stream) - utils.discard_prefetch_batch(buffers, locations, stream=stream) Removed: - cuda.core.utils.advise / prefetch / discard / discard_prefetch (single-buffer surfaces — replaced by ManagedBuffer methods/properties) - cuda.core._memory._managed_memory_options module and its four empty AdviseOptions / PrefetchOptions / DiscardOptions / DiscardPrefetchOptions dataclasses (R9 from Leo, R10 from Andy: empty placeholders that didn't carry information) - options=None parameter from every public surface - The single-buffer fast path inside the now-batched-only free functions; they always hit cuMem*BatchAsync now Internals: - Public def advise() deleted; _advise_one (cdef) is the new internal single-buffer entry point used by ManagedBuffer property setters. - Three new Python-level wrappers _do_single_prefetch_py / _do_single_discard_py / _do_single_discard_prefetch_py used by ManagedBuffer instance methods. These call the cdef _do_single_* helpers with the right Cython types after stream coercion. 
- _coerce_buffer_targets renamed to _coerce_batch_buffers; rejects a single Buffer with a TypeError pointing at the ManagedBuffer method. Tests: - TestPrefetch / TestDiscard / TestDiscardPrefetch / TestAdvise rewritten as TestPrefetchBatch / TestDiscardBatch / TestDiscardPrefetchBatch (batched-only, since single-buffer is covered by ManagedBuffer's TestManagedBuffer class) - Single-buffer external-allocation tests use ManagedBuffer.from_handle(plain.handle, plain.size, owner=plain) to wrap a DummyUnifiedMemoryResource buffer - options-related tests deleted (no options surface to test) - enum-value advise test deleted (property setters are typed; the string-alias / enum-value internal API isn't user-visible) Release notes updated. Closes R9, R10, R11. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(cuda.core): build advise reverse-lookup eagerly at module load (N4) Per Leo's review on PR #1775 (_managed_memory_ops.pyx:23), drop the lazy-init plumbing for the enum→alias reverse lookup table. The forward table _MANAGED_ADVICE_ALIASES has six entries; building the inverse at module load via a dict comprehension is the same data without the mutable-global pattern, the `if None` check, or the `global` declaration inside the function body. Forward lookup table (_MANAGED_ADVICE_ALIASES) is preserved as the source of truth — explicit alias→CUDA-name mapping, grep-friendly, no implicit naming-convention coupling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(cuda.core): factor shared body of _do_batch_{prefetch,discard_prefetch} (N2) Per Leo's review on PR #1775 (_managed_memory_ops.pyx:425), the two batched-with-locations helpers were byte-for-byte identical except for the driver function being called. 
Both:
- declare the same four std::vectors (ptrs, sizes, loc_arr, loc_indices)
- resize and fill them in the same loop
- release the GIL and call cuMem{Prefetch,DiscardAndPrefetch}BatchAsync with the same argument shape

Introduce a function-pointer typedef _BatchPrefetchFn (the two driver calls share signature), parameterize the shared body as _do_batch_prefetch_op, and have the two callers pass the appropriate driver function. Both the typedef and the helper live inside the IF CUDA_CORE_BUILD_MAJOR >= 13 block since they reference cu13-only types.

Net: -28 lines duplication, +25 for the shared helper. No behavior change; tests unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cuda.core): reuse production _get_int_attr in managed-memory tests (N6)

Per Leo's review on PR #1775 (test_managed_ops.py:28), the test file's _get_mem_range_attr / _get_int_mem_range_attr / the local _MEM_RANGE_ATTRIBUTE_VALUE_SIZE constant are functionally identical to the production _get_int_attr in _managed_buffer.py. Drop the duplicates and import the production helper.

14 call sites updated. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cuda.core): cu12 fallback for prefetch_batch (N3)

Per Leo's review on PR #1775 (_managed_memory_ops.pyx:228), raising NotImplementedError on cu12 forces users to write their own loop. The CUDA driver semantics for cuMemPrefetchBatchAsync are equivalent to per-range cuMemPrefetchAsync calls, just more efficient when batched at the driver level.

On cu12 builds (where cuMemPrefetchBatchAsync is not exposed), fall back to a Python-level loop calling cuMemPrefetchAsync per buffer. The single-range path (_do_single_prefetch) already works on cu12 via the IF/ELSE split inside it.
Note this fallback applies only to prefetch_batch: discard_batch and discard_prefetch_batch keep the cu12 NotImplementedError because the driver has no single-range cuMemDiscard{,AndPrefetch}Async to fall back to.

Test skips for cuMemPrefetchBatchAsync unavailability dropped from TestPrefetchBatch.test_same_location and test_per_buffer_location; the fallback path now runs on cu12 builds too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cuda.core): cover AccessedBySet read methods (N7)

Per Leo's review on PR #1775 (test_managed_ops.py:1), add a test for the read side of AccessedBySet: __iter__, __len__, __eq__, __repr__. These are part of the public set-like API (alongside __contains__, add(), discard(), and the setter, which are already covered) but were untested.

The cu12 batch fallback path (Leo's other coverage point) is now exercised by TestPrefetchBatch.test_same_location and test_per_buffer_location running on cu12 CI; the cuMemPrefetchBatchAsync skip was dropped in d75a7bd when the fallback landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cuda.core): cu13 NUMA round-trip for ManagedBuffer.preferred_location (N8)

Per the self-promised reply on PR #1775's R7 thread, fulfill the Host(numa_id=N) round-trip on CUDA 13 builds. The blocker before was that cuda.bindings's Python-level cuMemRangeGetAttribute wrapper rejects the new CU_MEM_RANGE_ATTRIBUTE_PREFERRED_LOCATION_TYPE / _ID attributes via its allowlist.

The workaround: call cydriver.cuMemRangeGetAttribute directly from a new Cython helper _read_preferred_location_v2, bypassing the Python wrapper. The helper queries TYPE then ID, then decodes the (kind, id) pair into Device | Host | Host(numa_id=N) | Host.numa_current() | None.

ManagedBuffer.preferred_location getter dispatches to the v2 path on binding_version() >= (13, 0, 0); falls back to the legacy single-int attribute on cu12 (no NUMA info available).
Test:
- TestManagedBuffer.test_preferred_location_roundtrip already exercises the cu13 v2 path for Device(...) and Host() (no NUMA), which now passes through _read_preferred_location_v2.
- New test_preferred_location_roundtrip_host_numa exercises Host(numa_id=0) round-trip; skips on cu12, and also skips on cu13 hardware/drivers where set_preferred_location with HOST_NUMA is not preserved (e.g. single-NUMA test machines).

ManagedBuffer class docstring updated to reflect the cu12-only limitation note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(cuda.core): replace stale utils autosummary entries

api.rst still listed the single-buffer free functions and *Options dataclasses that were removed under R9/R11 (advise, prefetch, discard, discard_prefetch and their *Options classes). Replace with the actual cuda.core.utils exports: prefetch_batch, discard_batch, discard_prefetch_batch. Drop the now-orphan :template: dataclass.rst line.

* feat(cuda.core): make Host a singleton class

Mirror Device's singleton semantics so Host() is Host() and Host(numa_id=1) is Host(numa_id=1) hold. Host.numa_current() returns its own singleton, distinct from Host(), since it represents a thread-relative location rather than a fixed one.

Construction routes through __new__ -> _get_or_create with a double-checked dict + Lock cache keyed on (numa_id, is_numa_current). __eq__ collapses to identity (consistent with the retained __hash__). __reduce__ added so pickled Host instances round-trip back through the singleton cache instead of stranding copies.

Resolves PR #1775 review: leofang and Andy-Jost requested Host follow Device as a singleton so users can rely on `is` for identity checks.
* refactor(cuda.core): rename AccessedBySet -> AccessedBySetProxy

Align with the graph module's AdjacencySetProxy: rename the class and inherit from collections.abc.MutableSet so the full set interface (remove, pop, clear, |=, &=, -=, ^=, isdisjoint, subset/superset operators, etc.) is filled in automatically from the existing add / discard / __contains__ / __iter__ / __len__ primitives.

Add classmethod _from_iterable so binary set operators (&|^) produce plain sets rather than constructing a buffer-less proxy. Tighten add to TypeError on non-Device/Host inputs and discard / __contains__ to silently ignore them, matching MutableSet contracts. The hand-rolled __eq__ (set/frozenset comparison) is dropped: Set ABC's default implementation handles it correctly.

Resolves PR #1775 review (Andy-Jost, 2026-05-04): naming consistency with AdjacencySetProxy and full MutableSet conformance.

* fix(cuda.core): silence ruff lints on Host singleton

- Annotate _instances / _instances_lock as ClassVar (RUF012).
- Sort __slots__ alphabetically (RUF023, auto-fixed by ruff).

* fix(cuda.core): reject bool as Host(numa_id=...)

bool is an int subclass, so the previous guard let Host(True) and Host(False) seed the singleton cache under the same keys as Host(1) and Host(0). Whichever call landed first won, leaving repr(Host(1)) potentially showing as Host(numa_id=True). Reject bool explicitly.

Addresses rwgk's Low finding on PR #1775.

* fix(cuda.core): hoist managed-buffer check in _advise_one

Move _require_managed_buffer to the first statement of _advise_one so a non-managed buffer is rejected before advice/location parsing, matching the order in _do_single_prefetch_py and _do_single_discard_prefetch_py. This prevents surfacing an advice-validation error when the real problem is the buffer kind.

* fix(cuda.core): clarify CUDA 12 NUMA-host error message

Rephrase the RuntimeError raised from _to_legacy_device when a caller passes Host(numa_id=...) or Host.numa_current() on a CUDA 12 build.
The new message names the unsupported APIs and points the user at Host() as the working alternative, instead of leaking the internal location_type discriminator.

* fix(cuda.core): reject Host(numa_id=...) up-front on CUDA 12

The CUDA 12 cuMemPrefetchAsync / cuMemAdvise ABI takes a plain device ordinal and cannot represent a specific host NUMA node. Previously _coerce_location accepted Host(numa_id=...) and Host.numa_current() on a CUDA 12 build and let the operation fail late inside the Cython layer with RuntimeError, which the public APIs surfaced as a confusing error from deep in the stack.

Reject NUMA-host kinds at the call boundary in _coerce_location with a TypeError that names the unsupported APIs and points at Host() as the working alternative. Update the ManagedBuffer docstring to match the new contract, and broaden two host_numa-rejection test asserts to accept either the CUDA 13 kind-allowed ValueError or the CUDA 12 boundary TypeError.

Addresses rwgk's Medium finding on PR #1775.

* fix(cuda.core): make ManagedBuffer.accessed_by setter atomic

The previous setter computed (current - target) and (target - current) and called _advise_one in two loops. set(locations) raised TypeError on unhashable elements, but only after the first diff pair had already been issued, so an invalid RHS could leave accessed_by partially mutated. Reproduce: starting from {Device(0)}, assigning {Host(numa_id=0)} on CUDA 12 raises and leaves accessed_by == set().

Validate every target up-front (per-element isinstance(Device|Host)) and only then issue the diff loops, so a bad RHS raises before any driver state changes.

Addresses rwgk's High finding on PR #1775.

* style(cuda.core): apply ruff format

Collapses multi-line string concats and conditions back to single lines under the project's line-length limit. No behavior change.
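The validate-then-apply ordering described for the accessed_by setter can be sketched in plain Python. This is a minimal illustration and not the cuda.core implementation: `Device`, `Host`, `FakeBuffer`, and the advice strings are hypothetical stand-ins, and `_advise_one` here only mutates a local set instead of calling the driver.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Device:  # hypothetical stand-in for cuda.core.Device
    device_id: int


@dataclass(frozen=True)
class Host:  # hypothetical stand-in for cuda.core.Host
    numa_id: int | None = None


class FakeBuffer:
    """Toy buffer that tracks an accessed_by set; no CUDA calls are made."""

    def __init__(self, initial=()):
        self._accessed_by = set(initial)

    def _advise_one(self, advice, location):
        # Stand-in for the per-location driver call in the real code.
        if advice == "set_accessed_by":
            self._accessed_by.add(location)
        else:
            self._accessed_by.discard(location)

    @property
    def accessed_by(self):
        return frozenset(self._accessed_by)

    @accessed_by.setter
    def accessed_by(self, locations):
        targets = list(locations)
        # Step 1: validate every element before any state changes.
        for loc in targets:
            if not isinstance(loc, (Device, Host)):
                raise TypeError(f"expected Device or Host, got {type(loc).__name__}")
        # Step 2: only now issue the two diff loops.
        target = set(targets)
        current = set(self._accessed_by)
        for loc in current - target:
            self._advise_one("unset_accessed_by", loc)
        for loc in target - current:
            self._advise_one("set_accessed_by", loc)


buf = FakeBuffer([Device(0)])
try:
    buf.accessed_by = [Device(1), "not-a-location"]  # invalid right-hand side
except TypeError:
    pass
unchanged = buf.accessed_by == {Device(0)}  # nothing was applied

buf.accessed_by = [Device(1), Host()]
updated = buf.accessed_by == {Device(1), Host()}
```

Because validation finishes before the first `_advise_one` call, a bad element leaves the buffer exactly as it was, which is the property the fix is after.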
* Skip NUMA-aware Host coerce tests on CUDA 12 builds

Host(numa_id=N) and Host.numa_current() require CUDA 13 bindings; the TestLocationCoerce passthroughs were missing the binding_version guard already used by test_preferred_location_roundtrip_host_numa.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
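The AccessedBySetProxy entry in the commit message relies on collections.abc.MutableSet filling in the rest of the set interface from five primitives, with `_from_iterable` deciding what binary operators return. A minimal sketch of that mechanism follows; `SetProxy` and its string "locations" are hypothetical, and the backing set stands in for the driver-side state the real proxy would query.

```python
from collections.abc import MutableSet


class SetProxy(MutableSet):
    """Hypothetical proxy forwarding set mutations to a backing store."""

    def __init__(self, backing):
        self._backing = backing  # stands in for driver-side state

    # The five primitives MutableSet needs; everything else is a mixin.
    def __contains__(self, item):
        # Silently ignore inputs of the wrong type.
        return isinstance(item, str) and item in self._backing

    def __iter__(self):
        return iter(list(self._backing))  # snapshot, safe under mutation

    def __len__(self):
        return len(self._backing)

    def add(self, item):
        if not isinstance(item, str):
            raise TypeError(f"expected str, got {type(item).__name__}")
        self._backing.add(item)

    def discard(self, item):
        if isinstance(item, str):
            self._backing.discard(item)

    @classmethod
    def _from_iterable(cls, iterable):
        # Binary operators (|, &, ^, -) build results through this hook;
        # returning a plain set avoids creating a proxy with no backing.
        return set(iterable)


backing = {"device:0"}
proxy = SetProxy(backing)
proxy |= {"device:1", "host"}   # mixin __ior__ calls add() per element
union = proxy | {"device:2"}    # plain set, via _from_iterable
proxy.remove("host")            # mixin remove() calls discard()

try:
    proxy.add(0)
    rejected = False
except TypeError:
    rejected = True
```

Equality against a built-in set also comes from the Set ABC, so no hand-written `__eq__` is needed in this sketch.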
1 parent 42eecda commit 5befc7d

19 files changed

Lines changed: 1589 additions & 40 deletions

cuda_core/cuda/core/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -80,6 +80,7 @@ class _PatchedProperty(metaclass=_PatchedPropMeta):
 )
 from cuda.core._event import Event, EventOptions
 from cuda.core._graphics import GraphicsResource
+from cuda.core._host import Host
 from cuda.core._launch_config import LaunchConfig
 from cuda.core._launcher import launch
 from cuda.core._linker import Linker, LinkerOptions
@@ -89,6 +90,7 @@ class _PatchedProperty(metaclass=_PatchedPropMeta):
     DeviceMemoryResourceOptions,
     GraphMemoryResource,
     LegacyPinnedMemoryResource,
+    ManagedBuffer,
     ManagedMemoryResource,
     ManagedMemoryResourceOptions,
     MemoryResource,

cuda_core/cuda/core/_host.py

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+import threading
+from typing import ClassVar
+
+
+class Host:
+    """Host (CPU) location for managed-memory operations.
+
+    Use one of the following forms:
+
+    * ``Host()`` — generic host (any NUMA node).
+    * ``Host(numa_id=N)`` — specific NUMA node ``N``.
+    * ``Host.numa_current()`` or ``Host(is_numa_current=True)`` — NUMA node
+      of the calling thread. ``numa_id`` and ``is_numa_current`` are
+      mutually exclusive.
+
+    ``Host`` is the symmetric counterpart of :class:`~cuda.core.Device`
+    for managed-memory `prefetch`, `advise`, and `discard_prefetch`
+    targets. Pass either a ``Device`` or a ``Host`` to those operations
+    and to ``ManagedBuffer.preferred_location`` / ``accessed_by``.
+
+    ``Host`` is a singleton class, mirroring :class:`~cuda.core.Device`:
+    constructor calls with the same arguments return the same instance,
+    so ``Host() is Host()`` and ``Host(numa_id=1) is Host(numa_id=1)``.
+    ``Host.numa_current()`` returns its own singleton, distinct from
+    ``Host()`` because it represents a thread-relative location rather
+    than a fixed one.
+    """
+
+    __slots__ = ("__weakref__", "_is_numa_current", "_numa_id")
+
+    # Singleton cache keyed by (numa_id, is_numa_current).
+    _instances: ClassVar[dict[tuple[int | None, bool], Host]] = {}
+    _instances_lock: ClassVar[threading.Lock] = threading.Lock()
+
+    def __new__(cls, numa_id: int | None = None, *, is_numa_current: bool = False) -> Host:
+        if is_numa_current and numa_id is not None:
+            raise ValueError("numa_id and is_numa_current are mutually exclusive")
+        if numa_id is not None and (isinstance(numa_id, bool) or not isinstance(numa_id, int) or numa_id < 0):
+            raise ValueError(f"numa_id must be a non-negative int, got {numa_id!r}")
+        return cls._get_or_create(numa_id, is_numa_current)
+
+    @classmethod
+    def _get_or_create(cls, numa_id: int | None, is_numa_current: bool) -> Host:
+        key = (numa_id, is_numa_current)
+        cache = cls._instances
+        inst = cache.get(key)
+        if inst is not None:
+            return inst
+        with cls._instances_lock:
+            inst = cache.get(key)
+            if inst is None:
+                inst = object.__new__(cls)
+                inst._numa_id = numa_id
+                inst._is_numa_current = is_numa_current
+                cache[key] = inst
+            return inst
+
+    @property
+    def numa_id(self) -> int | None:
+        return self._numa_id
+
+    @property
+    def is_numa_current(self) -> bool:
+        return self._is_numa_current
+
+    @classmethod
+    def numa_current(cls) -> Host:
+        """Construct a ``Host`` referring to the calling thread's NUMA node."""
+        return cls(is_numa_current=True)
+
+    def __eq__(self, other) -> bool:
+        if not isinstance(other, Host):
+            return NotImplemented
+        return self is other
+
+    def __hash__(self) -> int:
+        return hash((Host, self._numa_id, self._is_numa_current))
+
+    def __reduce__(self):
+        if self._is_numa_current:
+            return (_reconstruct_numa_current, ())
+        return (Host, (self._numa_id,))
+
+    def __repr__(self) -> str:
+        if self.is_numa_current:
+            return "Host.numa_current()"
+        if self.numa_id is None:
+            return "Host()"
+        return f"Host(numa_id={self.numa_id})"
+
+
+def _reconstruct_numa_current() -> Host:
+    return Host.numa_current()
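The singleton behaviour of the file above can be tried without a CUDA installation using a condensed stand-in. The sketch below is not the shipped module: it keeps only the cache, validation, and `__reduce__` logic, inlines `_get_or_create` into `__new__`, and drops the type annotations, properties, and `__repr__`.

```python
import threading


class Host:
    """Condensed stand-in mirroring the singleton logic of the diff above."""

    __slots__ = ("_is_numa_current", "_numa_id")
    _instances = {}
    _instances_lock = threading.Lock()

    def __new__(cls, numa_id=None, *, is_numa_current=False):
        if is_numa_current and numa_id is not None:
            raise ValueError("numa_id and is_numa_current are mutually exclusive")
        # bool is an int subclass, so it has to be rejected explicitly.
        if numa_id is not None and (
            isinstance(numa_id, bool) or not isinstance(numa_id, int) or numa_id < 0
        ):
            raise ValueError(f"numa_id must be a non-negative int, got {numa_id!r}")
        key = (numa_id, is_numa_current)
        inst = cls._instances.get(key)  # fast path, no lock
        if inst is None:
            with cls._instances_lock:
                inst = cls._instances.get(key)  # re-check under the lock
                if inst is None:
                    inst = object.__new__(cls)
                    inst._numa_id = numa_id
                    inst._is_numa_current = is_numa_current
                    cls._instances[key] = inst
        return inst

    @classmethod
    def numa_current(cls):
        return cls(is_numa_current=True)

    def __reduce__(self):
        if self._is_numa_current:
            return (_reconstruct_numa_current, ())
        return (Host, (self._numa_id,))


def _reconstruct_numa_current():
    return Host.numa_current()


same_generic = Host() is Host()
same_numa = Host(numa_id=1) is Host(numa_id=1)
distinct = Host() is not Host.numa_current()

# What pickle does on load: call the reconstructor with the saved args.
func, args = Host(numa_id=1).__reduce__()
roundtrip = func(*args) is Host(numa_id=1)

try:
    Host(True)
    bool_rejected = False
except ValueError:
    bool_rejected = True
```

Routing `__reduce__` through the constructor is what sends an unpickled object back to the cached instance rather than creating a second copy.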

cuda_core/cuda/core/_memory/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@
 from ._graph_memory_resource import *
 from ._ipc import *
 from ._legacy import *
+from ._managed_buffer import ManagedBuffer
 from ._managed_memory_resource import *
 from ._pinned_memory_resource import *
 from ._virtual_memory_resource import *

cuda_core/cuda/core/_memory/_buffer.pxd

Lines changed: 13 additions & 2 deletions
@@ -4,6 +4,7 @@

 from libc.stdint cimport uintptr_t

+from cuda.bindings cimport cydriver
 from cuda.core._resource_handles cimport DevicePtrHandle
 from cuda.core._stream cimport Stream

@@ -34,10 +35,20 @@ cdef class MemoryResource:
     pass


-# Helper function to create a Buffer from a DevicePtrHandle
+# Helper function to create a Buffer from a DevicePtrHandle.
+# `cls` lets callers materialize Buffer subclasses (e.g. ManagedBuffer for
+# managed-memory allocations); defaults to Buffer.
 cdef Buffer Buffer_from_deviceptr_handle(
     DevicePtrHandle h_ptr,
     size_t size,
     MemoryResource mr,
-    object ipc_descriptor = *
+    object ipc_descriptor = *,
+    type cls = *,
 )
+
+# Memory attribute query helpers (used by _managed_memory_ops)
+cdef void _init_mem_attrs(Buffer self)
+cdef int _query_memory_attrs(
+    _MemAttrs& out,
+    cydriver.CUdeviceptr ptr,
+) except -1 nogil

cuda_core/cuda/core/_memory/_buffer.pyx

Lines changed: 10 additions & 7 deletions
@@ -74,6 +74,7 @@ __all__ = ['Buffer', 'MemoryResource']



+
 cdef class Buffer:
     """Represent a handle to allocated memory.

@@ -475,12 +476,15 @@ cdef inline int _query_memory_attrs(
     ret = cydriver.cuPointerGetAttributes(3, attrs, <void**>vals, ptr)
     HANDLE_RETURN(ret)

+    # TODO: HMM/ATS-enabled sysmem should also report is_managed=True; the
+    # CU_POINTER_ATTRIBUTE_IS_MANAGED query does not capture that yet.
+    out.is_managed = is_managed != 0
+
     if memory_type == 0:
         # unregistered host pointer
         out.is_host_accessible = True
         out.is_device_accessible = False
         out.device_id = -1
-        out.is_managed = False
     elif (
         is_managed
         or memory_type == cydriver.CUmemorytype.CU_MEMORYTYPE_HOST
@@ -489,12 +493,10 @@ cdef inline int _query_memory_attrs(
         out.is_host_accessible = True
         out.is_device_accessible = True
         out.device_id = device_id
-        out.is_managed = is_managed
     elif memory_type == cydriver.CUmemorytype.CU_MEMORYTYPE_DEVICE:
         out.is_host_accessible = False
         out.is_device_accessible = True
         out.device_id = device_id
-        out.is_managed = False
     else:
         with cython.gil:
             raise ValueError(f"Unsupported memory type: {memory_type}")
@@ -572,14 +574,15 @@ cdef class MemoryResource:

 # Buffer Implementation Helpers
 # -----------------------------
-cdef inline Buffer Buffer_from_deviceptr_handle(
+cdef Buffer Buffer_from_deviceptr_handle(
     DevicePtrHandle h_ptr,
     size_t size,
     MemoryResource mr,
-    object ipc_descriptor = None
+    object ipc_descriptor = None,
+    type cls = Buffer,
 ):
-    """Create a Buffer from an existing DevicePtrHandle."""
-    cdef Buffer buf = Buffer.__new__(Buffer)
+    """Create a Buffer (or subclass instance) from an existing DevicePtrHandle."""
+    cdef Buffer buf = cls.__new__(cls)
     buf._h_ptr = h_ptr
     buf._size = size
     buf._memory_resource = mr