Commit 5befc7d
Add managed-memory advise, prefetch, and discard-prefetch free functions (#1775)
* wip
* wip
* fixing ci compiler errors
* skipping tests that aren't supported
* cu12 support
* Move from Buffer class methods to free-standing functions in the cuda.core.managed_memory namespace
* precommit format
* iterating on implementation
* Simplify managed-memory helpers: remove long-form aliases, cache lookups, fix docs
- Remove duplicate long-form "cu_mem_advise_*" string aliases from
_MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly
- Replace 4 boolean allow_* params in _normalize_managed_location with a
single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES
- Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag,
discard_prefetch support, and advice enum-to-alias reverse map
- Collapse hasattr+getattr to single getattr in _managed_location_enum
- Move _require_managed_discard_prefetch_support to top of discard_prefetch
for fail-fast behavior
- Fix docs build: reset Sphinx module scope after managed_memory section in
api.rst so subsequent sections resolve under cuda.core
- Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr
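The allowed_loctypes change above can be sketched as follows. This is a minimal illustration, not the actual helper: the table contents, the kind strings, and the name check_location_kind are assumptions made for the example.

```python
# Hypothetical table: advice alias -> location kinds it accepts.
# The real _MANAGED_ADVICE_ALLOWED_LOCTYPES contents may differ.
ALLOWED_LOCTYPES = {
    "set_read_mostly": frozenset({"none"}),
    "set_preferred_location": frozenset({"device", "host", "host_numa"}),
    "set_accessed_by": frozenset({"device", "host"}),
}


def check_location_kind(advice, kind):
    """Validate `kind` against one frozenset instead of four allow_* booleans."""
    allowed = ALLOWED_LOCTYPES[advice]
    if kind not in allowed:
        raise ValueError(
            f"{advice} does not accept location kind {kind!r}; "
            f"allowed: {sorted(allowed)}"
        )
    return kind
```

Adding a new advice value then means adding one table row rather than threading another boolean through every call site.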
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(test): reset _V2_BINDINGS cache so legacy-signature tests take the legacy path
The _V2_BINDINGS cache in _buffer.pyx persists across tests, so
monkeypatching get_binding_version alone is insufficient when earlier
tests have already populated the cache with the v2 value. Promote
_V2_BINDINGS from cdef int to a Python-level variable so tests can
monkeypatch it directly via monkeypatch.setattr, and reset it to -1
in both legacy-signature tests.
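A small sketch of why patching the version source alone is not enough once the cache is populated. The namespace `mod`, the attribute `binding_major`, and the threshold 13 are hypothetical stand-ins; only the -1 "not computed" sentinel comes from the description above.

```python
import types

# Hypothetical stand-in for a module that caches a version check.
mod = types.SimpleNamespace(_V2_BINDINGS=-1, binding_major=13)


def uses_v2_bindings():
    # -1 means "not computed yet"; any other value is reused as-is.
    if mod._V2_BINDINGS == -1:
        mod._V2_BINDINGS = int(mod.binding_major >= 13)
    return bool(mod._V2_BINDINGS)


first = uses_v2_bindings()   # populates the cache with the v2 value
mod.binding_major = 12       # patch only the version source
stale = uses_v2_bindings()   # cache still wins, so this stays True
mod._V2_BINDINGS = -1        # reset the cache, as the tests now do
fresh = uses_v2_bindings()   # recomputed from the patched version
```

In a pytest test the reset would go through monkeypatch.setattr so that it is undone automatically after the test.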
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(test): require concurrent_managed_access for advise tests that hit real hardware
These three tests call cuMemAdvise on real CUDA devices and verify
memory range attributes. On devices without concurrent_managed_access
(e.g. Windows/WDDM), set_read_mostly silently no-ops and
set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the
stricter _skip_if_managed_location_ops_unsupported guard, matching the
pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: validate managed buffer before checking discard_prefetch bindings support
Reorder checks in discard_prefetch so _normalize_managed_target_range
runs before _require_managed_discard_prefetch_support. This ensures
non-managed buffers raise ValueError before the RuntimeError for missing
cuMemDiscardAndPrefetchBatchAsync support.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: extract managed memory ops into dedicated _managed_memory_ops module
Move advise, prefetch, and discard_prefetch functions and their helpers
out of _buffer.pyx into a new _managed_memory_ops Cython module to
improve separation of concerns. Expose _init_mem_attrs and
_query_memory_attrs as non-inline cdef functions in _buffer.pxd so the
new module can reuse them.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* pre-commit fix
* Removing blank file
* wip
* fix(cuda.core): update binding_version import after upstream merge
Upstream renamed get_binding_version → binding_version and moved it from
cuda.core._utils.cuda_utils to cuda.core._utils.version. Update the
managed-memory ops module to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* revert: drop managed_memory shim in cuda.core.experimental
The cuda.core.experimental namespace is being deprecated and should not
gain new submodules. Per review feedback, the managed_memory module
should only be reachable via cuda.core.managed_memory, not via the
experimental compatibility shim.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cuda.core): add Location dataclass for managed memory
Frozen dataclass with classmethod constructors for the four CUmemLocationType
kinds (device, host, host_numa, host_numa_current). Validates id constraints
in __post_init__. Re-exported from cuda.core.managed_memory.
This will replace the location=/location_type= kwargs in the upcoming
unified 1..N managed-memory ops API.
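A minimal sketch of such a frozen dataclass, assuming the field names `kind` and `id` and the specific id rules shown; the real validation in __post_init__ may differ in detail.

```python
from dataclasses import dataclass
from typing import Optional

_KINDS = ("device", "host", "host_numa", "host_numa_current")


@dataclass(frozen=True)
class Location:
    kind: str
    id: Optional[int] = None

    def __post_init__(self):
        if self.kind not in _KINDS:
            raise ValueError(f"unknown location kind: {self.kind!r}")
        # Assumed rule: device and host_numa need an id, the others forbid one.
        needs_id = self.kind in ("device", "host_numa")
        if needs_id and (self.id is None or self.id < 0):
            raise ValueError(f"{self.kind} requires a non-negative id")
        if not needs_id and self.id is not None:
            raise ValueError(f"{self.kind} does not take an id")

    @classmethod
    def device(cls, device_id):
        return cls("device", device_id)

    @classmethod
    def host(cls):
        return cls("host")

    @classmethod
    def host_numa(cls, numa_id):
        return cls("host_numa", numa_id)

    @classmethod
    def host_numa_current(cls):
        return cls("host_numa_current")
```

Because the dataclass is frozen, instances are hashable and compare by value, so Location.device(0) == Location("device", 0).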
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cuda.core): add _coerce_location helper
Centralizes back-compat coercion for managed-memory Location inputs:
- Location → passthrough
- Device → Location.device(device_id)
- int >= 0 → Location.device(int)
- int == -1 → Location.host()
- None → None when allow_none=True, else ValueError
Will be used by the unified 1..N managed-memory ops API.
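The coercion rules listed above could look roughly like this. Location and Device here are simplified stand-ins, and the handling of bool and of ints below -1 is an assumption not stated in the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:  # simplified stand-in for the managed-memory Location
    kind: str
    id: Optional[int] = None


@dataclass(frozen=True)
class Device:  # stand-in for cuda.core.Device
    device_id: int


def coerce_location(value, *, allow_none=False):
    if isinstance(value, Location):
        return value  # passthrough
    if isinstance(value, Device):
        return Location("device", value.device_id)
    if value is None:
        if allow_none:
            return None
        raise ValueError("location is required")
    # Assumption: bool is rejected even though it subclasses int.
    if isinstance(value, int) and not isinstance(value, bool):
        if value >= 0:
            return Location("device", value)
        if value == -1:
            return Location("host")
        raise ValueError(f"invalid location id: {value}")
    raise TypeError(f"cannot interpret {type(value).__name__} as a location")
```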
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(cuda.core): update monkeypatch target after binding_version rename
The legacy-bindings monkeypatch tests still referenced get_binding_version,
which was renamed to binding_version in cf2f20d. Update both occurrences.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cuda.core): tighten memory-attr query
Address review feedback on _buffer.pyx:
- Restore `inline` on `_init_mem_attrs` and `_query_memory_attrs`.
- Set `out.is_managed = (is_managed != 0)` once outside the if/elif,
rather than per-branch (driver leaves the attribute zero for
non-managed pointers, so all three branches converged on the same
value anyway).
- Add a TODO noting that HMM/ATS-enabled sysmem should also report
`is_managed=True`; the CU_POINTER_ATTRIBUTE_IS_MANAGED query does
not capture that yet.
The Cython modernization of _managed_memory_ops.pyx (cimport cydriver,
IF/ELSE for the 12/13 ABI split) is folded into Tasks 5-8 where the
public API is being rewritten anyway; doing it here would mean
rewriting the same call sites twice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cuda.core): unified 1..N managed_memory.prefetch with cydriver
Rewrite prefetch() with the unified single-or-batched signature targeted by
issue #1333:
- prefetch(targets, location, *, options=None, stream)
- targets accepts a single Buffer or a sequence of Buffers
- location accepts a Location dataclass, Device, int (-1 = host), or a
sequence broadcasting to per-buffer locations
- length mismatch raises ValueError; empty targets raises ValueError
- options is reserved for future per-call flags and must be None
- stream moved to the end, kept keyword-only
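The target and location handling in the list above can be sketched in plain Python. The function names are hypothetical and the Cython implementation will differ; the sketch only shows the single-or-sequence normalization and the broadcast rule.

```python
def coerce_targets(targets):
    """Return a non-empty list from a single target or a list/tuple of targets."""
    seq = list(targets) if isinstance(targets, (list, tuple)) else [targets]
    if not seq:
        raise ValueError("targets must not be empty")
    return seq


def broadcast_locations(location, n):
    """Return one location per target, broadcasting a single value."""
    if isinstance(location, (list, tuple)):
        if len(location) != n:
            raise ValueError(
                f"got {len(location)} locations for {n} targets"
            )
        return list(location)
    return [location] * n


bufs = coerce_targets(["buf_a", "buf_b", "buf_c"])  # placeholder targets
locs = broadcast_locations("device0", len(bufs))    # one value, three targets
```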
Internals: switch from Python-level driver.cuMemPrefetchAsync to
Cython-level cydriver.cuMemPrefetchAsync via cimport cydriver, with
HANDLE_RETURN. Replace the runtime _V2_BINDINGS check with compile-time
IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE per the codebase precedent in
_managed_memory_resource.pyx, _memory_pool.pyx, _tensor_map.pyx.
N>1 dispatches to cydriver.cuMemPrefetchBatchAsync (CUDA 13 only); on
CUDA 12 builds, batched prefetch raises NotImplementedError. Single-range
prefetch continues to work on both CUDA 12 and 13 builds.
The location_type= keyword is removed; callers express location kind via
the Location dataclass added in 20d036e.
The advise() and discard_prefetch() functions still use the legacy
_normalize_managed_location helper and Python-level driver calls; they
will be migrated in their own tasks.
Also drops test_managed_memory_prefetch_uses_legacy_bindings_signature,
which monkeypatched the Python-level driver.cuMemPrefetchAsync — no
longer applicable since the prefetch path uses cydriver. The corresponding
advise legacy-bindings test stays for now (advise still uses Python driver).
Closes Andy-Jost's review comment that the existing API is "non-Pythonic"
by making it Pythonic in a different direction (typed Location dataclass)
while preserving the free-function shape pending Leo's tie-break on
ManagedBuffer subclass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cuda.core): add managed_memory.discard
Adds a new discard(targets, *, options=None, stream) free function that
wraps cuMemDiscardBatchAsync. Accepts a single Buffer or a sequence;
N>=1 dispatches to the batched driver entry point. Requires a CUDA 13
build of cuda.core (NotImplementedError on CUDA 12 builds).
Closes the second of three batched managed-memory operations from #1333:
P1: cudaMemDiscardBatchAsync <- this commit
P1: cudaMemPrefetchBatchAsync <- 818f5d2
P1: cudaMemDiscardAndPrefetchBatchAsync <- next commit
Re-exported from cuda.core.managed_memory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cuda.core): unified 1..N managed_memory.discard_prefetch with cydriver
Rewrite discard_prefetch() with the unified single-or-batched signature:
discard_prefetch(targets, location, *, options=None, stream)
- targets accepts a single Buffer or a sequence of Buffers
- location accepts a Location, Device, int, or per-buffer sequence
- length mismatch / empty targets raise ValueError
- options must be None (reserved)
- stream moved to end, kept keyword-only
Internals: switch from Python-level driver.cuMemDiscardAndPrefetchBatchAsync
to Cython-level cydriver.cuMemDiscardAndPrefetchBatchAsync. The runtime
discard-prefetch availability check is replaced by compile-time
IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE; on CUDA 12 builds the call raises
NotImplementedError.
The location_type= keyword is removed; use Location dataclass instead.
Closes the third managed-memory batched op from #1333.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cuda.core): unified 1..N managed_memory.advise + drop legacy apparatus
Rewrite advise() with the unified single-or-batched signature:
advise(targets, advice, location=None, *, options=None)
- targets accepts a single Buffer or a sequence
- advice still accepts string aliases or driver.CUmem_advise enum values
- location accepts Location dataclass, Device, int, None, or per-buffer
sequence (None permitted only for set_read_mostly, unset_read_mostly,
unset_preferred_location)
- Per-advice allowed-kind validation ported to operate on Location.kind
(matches CUDA driver constraints from existing tables)
- options reserved for future per-call flags
- For N>1, loops cydriver.cuMemAdvise per buffer (no batched advise API
exists in CUDA)
Internals: switch to cydriver.cuMemAdvise (Cython-level); use compile-time
IF CUDA_CORE_BUILD_MAJOR >= 13 / ELSE for the 12/13 ABI split.
Drop the legacy apparatus that all four functions previously shared:
- _normalize_managed_location (returned Python driver.CUmemLocation)
- _make_managed_location, _managed_location_enum
- _managed_location_uses_v2_bindings + _V2_BINDINGS lazy cache
- _managed_location_to_legacy_device + _LEGACY_LOC_DEVICE/HOST cache
- _require_managed_discard_prefetch_support
- Unused module-level constants (_HOST_NUMA_CURRENT_ID,
_SINGLE_RANGE_COUNT, _MANAGED_OPERATION_FLAGS, etc.)
Also drop test_managed_memory_advise_uses_legacy_bindings_signature and
the _LEGACY_BINDINGS_VERSION constant; the runtime version switch is
gone, replaced by compile-time IF/ELSE that the test could not exercise.
The CUDA 12 vs CUDA 13 paths are now covered by the build-matrix CI job.
Closes Task 8 (advise) and Task 9 (legacy-bindings test cleanup) from
docs/superpowers/plans/2026-04-27-managed-memory-ops-batched.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cuda.core): use Buffer.is_managed property in managed_memory ops
_require_managed_buffer was poking at Buffer._mem_attrs.is_managed
directly via _init_mem_attrs(). PR #1924 added the public Buffer.is_managed
property which falls back to MemoryResource.is_managed when the pointer
attribute query does not advertise managed memory (the case for pool-
allocated managed memory).
Switch _require_managed_buffer to the public property. This also fixes
a latent bug where pool-allocated managed buffers were being rejected
by the managed_memory ops despite Buffer.is_managed correctly reporting
True.
Drops the no-longer-needed cimport of _init_mem_attrs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(cuda.core): document Location, discard, and 1..N managed_memory ops
api.rst: add Location and discard to the managed_memory autosummary.
1.0.0-notes.rst: replace the placeholder bullet with a description of the
unified 1..N API, the Location dataclass, and the dispatch to batched
driver entry points on cuda.bindings 12.8+.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(cuda.core): drop narrative comments and tighten _coerce_location docstring
Per /simplify review, remove WHAT-only comments that just restate the
function signature in front of _coerce_buffer_targets and
_broadcast_locations. Tighten the _coerce_location docstring to lead
with the conversion intent rather than restate the type annotation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(cuda.core): satisfy pre-commit hooks
- ruff auto-applied:
* Drop unused `_managed_memory_ops` test import (no longer needed
after the legacy-bindings monkeypatch test was deleted)
* Drop "Location" string-quoted forward refs in
_managed_location.py (file already uses `from __future__ import
annotations`)
* Reformat string concatenations and add blank-line-after-import
spacing
- cython-lint auto-applied:
* Drop unused libc.stdint cimport of `uintptr_t`
* Drop unused `Location` Python import (only used in docstrings)
* Drop unused `n` local in `discard()`
* Move `cpython.mem cimport` of PyMem_Free / PyMem_Malloc inside
the `IF CUDA_CORE_BUILD_MAJOR >= 13:` block where the symbols
are actually used; cython-lint cannot see across compile-time
branches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cuda.core): move managed_memory ops to cuda.core.utils
Per Leo's review request (#1775),
fold the managed-memory free functions and the Location dataclass into
cuda.core.utils rather than maintaining a dedicated cuda.core.managed_memory
namespace.
- Re-export Location, advise, prefetch, discard, discard_prefetch from
cuda.core.utils.
- Delete cuda.core.managed_memory module.
- Update cuda.core.__init__ to drop the managed_memory submodule import.
- Update tests to import from cuda.core.utils.
- Update api.rst: drop the dedicated Managed memory section; add the
managed-memory entries to the Utility functions section.
- Update 1.0.0-notes.rst accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(cuda.core): use __all__ in utils instead of per-import noqa
Replace seven `# noqa: F401` comments with a single `__all__` block
listing the public re-exports. Cleaner intent signal — these are
deliberate facade exports, not accidental imports — and matches the
existing __all__ convention used in cuda.core.system, _legacy.py,
and typing.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(cuda.core): collapse nested if in Location.__post_init__ (SIM102)
ruff SIM102 flagged the nested if in the host/host_numa_current branch:
elif self.kind in ("host", "host_numa_current"):
if self.id is not None:
raise ValueError(...)
Collapse it into a single condition joined with `and`.
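For illustration, the two shapes side by side. The function names and error text are made up, and the device branch is elided; this is not the actual __post_init__ body.

```python
def validate_nested(kind, id):
    if kind == "device":
        return  # device validation elided in this sketch
    elif kind in ("host", "host_numa_current"):
        if id is not None:
            raise ValueError(f"{kind} does not take an id")


def validate_collapsed(kind, id):
    if kind == "device":
        return  # device validation elided in this sketch
    elif kind in ("host", "host_numa_current") and id is not None:
        raise ValueError(f"{kind} does not take an id")


def outcome(fn, kind, id):
    """Return 'ok' or 'error' so the two variants can be compared."""
    try:
        fn(kind, id)
        return "ok"
    except ValueError:
        return "error"
```

The two forms agree here because no later elif or else follows the collapsed branch. With a trailing else, a host location whose id is None would fall through into it after collapsing, so the rewrite is only safe when the chain ends at that branch.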
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(cuda.core): share one DummyUnifiedMemoryResource per batched test
Both batched-advise tests previously created a throwaway
DummyUnifiedMemoryResource per allocation inside a list comprehension:
bufs = [DummyUnifiedMemoryResource(device).allocate(size) for _ in range(2)]
The Buffer holds a reference to the throwaway MR via mr=self, so the MR
should stay alive — but on CUDA 12.9.1 CI test_batched_same_advice fails
with bufs[0] showing ptr=0x0 size=0 (the post-close state). On CUDA 13
the same pattern works.
Switch to one MR shared across both allocations. This is cleaner anyway
and removes the throwaway-per-iteration pattern as a possible source of
the cu12 issue. If the failure persists, we'll know the MR lifetime
wasn't the cause.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(cuda.core): query all buffers before closing in test_batched_same_advice
On CUDA 12, freeing one managed allocation appears to clear the
read-mostly advice on neighboring ranges. The original test interleaved
query-then-close inside one loop, so the second iteration would query
bufs[1] *after* bufs[0] had been freed and observe a cleared advice
flag — causing assert 0 == 1.
Move the queries into a list comprehension that runs before any close,
then close all buffers, then assert. Decouples the verification from
the deallocation order.
CUDA 13 was unaffected because its managed-memory bookkeeping does not
exhibit the cross-range invalidation on free.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* review(cuda.core): address PR #1775 feedback
- Drop defensive cuInit retry in _query_memory_attrs (Andy): we don't
auto-init CUDA elsewhere; let HANDLE_RETURN propagate the error.
- Use checked Cython cast `<Buffer?>t` in _coerce_buffer_targets (Leo)
in place of the manual isinstance loop.
- Introduce *Options dataclasses (AdviseOptions, PrefetchOptions,
DiscardOptions, DiscardPrefetchOptions) per cuda.core convention
(Leo). Functions accept None or the matching dataclass; tests updated
to match the new error message.
* test(cuda.core): split managed-memory ops tests into tests/memory/
Move the managed-memory advise/prefetch/discard/discard_prefetch tests
(plus their TestLocation/TestLocationCoerce/TestPrefetch/TestDiscard/
TestDiscardPrefetch/TestAdvise classes and skip helpers) from
test_memory.py into tests/memory/test_managed_ops.py per Andy's nit.
Promote DummyDeviceMemoryResource to helpers.buffers so both files can
import it; the remaining DummyHost/DummyPinned/NullMemoryResource stay
in test_memory.py since they're only used there.
Broader memory-tests reorg ("and siblings": buffer/managed_resource/
pinned/vmm) tracked as a follow-up cleanup PR to keep this diff focused.
* test(cuda.core): fix options regex for AdviseOptions ("an" vs "a")
The advise() error message reads "must be an AdviseOptions instance or
None" (vowel triggers "an"), but the regex matched only "must be a ".
Relax to "must be an?" so all three op tests pass.
* chore(cuda.core): drop unused utils import + trailing blank lines
Pre-commit cleanup after splitting managed-memory ops tests out of
test_memory.py: the `cuda.core.utils` import is no longer used here,
and ruff trimmed trailing blank lines.
* feat(cuda.core): add ManagedBuffer subclass + Host location
Land Andy's ManagedBuffer + Device/Host design (review #3976251223,
#3164213789). The free-function shape introduced earlier in this PR is
preserved; ManagedBuffer methods delegate into it, so existing call
sites keep working.
ManagedBuffer
- Subclass of Buffer returned by ManagedMemoryResource.allocate, also
constructable from an external pointer via ManagedBuffer.from_handle.
- Property-style advice API:
- read_mostly (bool, driver-backed get/set)
- preferred_location (Device | Host | None, get/set; None unsets)
- accessed_by (live AccessedBySet view: __contains__/__iter__/len
query the driver, add()/discard() issue advice; setter diffs and
advises only the deltas)
- Instance methods prefetch / discard / discard_prefetch delegate to
the matching cuda.core.utils functions.
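The "setter diffs and advises only the deltas" behavior of accessed_by can be sketched with plain sets. Locations are shown as ints for brevity, the advise callback is a hypothetical stand-in for the driver-backed call, and the unset-before-set ordering is an assumption.

```python
def accessed_by_deltas(current, desired):
    """Return (to_add, to_remove) between the live set and the requested set."""
    current, desired = set(current), set(desired)
    return desired - current, current - desired


def set_accessed_by(current, desired, advise):
    """Issue advice only for locations whose membership actually changes."""
    to_add, to_remove = accessed_by_deltas(current, desired)
    for loc in sorted(to_remove):
        advise("unset_accessed_by", loc)
    for loc in sorted(to_add):
        advise("set_accessed_by", loc)


calls = []
set_accessed_by({0, 1}, {1, 2}, lambda advice, loc: calls.append((advice, loc)))
```

Location 1 is in both sets, so no advice is issued for it; only 0 is unset and 2 is set.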
Host
- New top-level class symmetric to Device. Host(), Host(numa_id=N),
Host.numa_current(). Replaces Location.host()/host_numa()/etc.
Location -> Device|Host|int
- Drop the public Location dataclass and its classmethod constructors.
- _coerce_location now accepts Device | Host | int | None and produces
an internal _LocSpec record; advise/prefetch/discard/discard_prefetch
signatures and docstrings updated accordingly.
- int still accepted for ergonomic compatibility (-1 = host, >=0 =
device ordinal).
Plumbing
- Buffer_from_deviceptr_handle takes an optional `cls` parameter so the
pool allocator can materialize Buffer subclasses; _MP_allocate threads
the same parameter through; ManagedMemoryResource.allocate passes
ManagedBuffer.
Tests
- TestHost replaces TestLocation; TestLocationCoerce adapted to the new
coerce signature. New TestManagedBuffer covers from_handle,
isinstance(allocate(), ManagedBuffer), read_mostly/preferred_location/
accessed_by roundtrips, and instance methods. Property tests use
external (cuMemAllocManaged) backing wrapped via from_handle, since
some driver/device combinations decline cuMemAdvise on
pool-allocated managed memory.
- Use cuDeviceGetCount in AccessedBySet._query so the read path doesn't
pull in NVML.
Docs
- 1.0.0 notes describe Host, ManagedBuffer, the property API, and the
Device/Host location inputs. api.rst lists Host, ManagedBuffer, and
the *Options dataclasses; Location is removed.
* chore(cuda.core): simplify ManagedBuffer per /simplify review
- Buffer.from_handle is now a classmethod that dispatches via cls._init,
so subclasses inherit it: ManagedBuffer.from_handle(...) returns a
ManagedBuffer with no override needed. Drop ManagedBuffer.from_handle.
- Hoist `advise / prefetch / discard / discard_prefetch` imports from
per-method lazy imports to module-level (no circular import: they live
in cuda.core._memory._managed_memory_ops, not cuda.core.utils).
- Cache the CUmem_advise and CUmem_range_attribute enum lookups at
module level and pass enum constants directly to advise() instead of
re-resolving from string aliases on every property write.
- Extract _query_accessed_by as a module-level helper; AccessedBySet
delegates and the accessed_by setter calls it directly instead of
constructing a throwaway view.
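The first item above (from_handle as a classmethod that subclasses inherit) relies on ordinary classmethod dispatch. A minimal sketch with stand-in classes; the real code goes through cls._init and takes more arguments:

```python
class Buffer:
    def __init__(self, ptr, size):
        self.ptr = ptr
        self.size = size

    @classmethod
    def from_handle(cls, ptr, size):
        # cls is whichever class the method was called on,
        # so subclasses get instances of themselves for free.
        return cls(ptr, size)


class ManagedBuffer(Buffer):
    pass  # no from_handle override needed


plain = Buffer.from_handle(0x1000, 64)
managed = ManagedBuffer.from_handle(0x2000, 128)
```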
* ci: re-trigger CI (transient cuInit INVALID_DEVICE on l4 runner)
* refactor(cuda.core): use libcpp.vector for batched-op C arrays (R14)
Per Andy's review nit (PR #1775, _managed_memory_ops.pyx:207), replace
the manual PyMem_Malloc / PyMem_Free pattern in the three batch helpers
(_do_batch_discard, _do_batch_prefetch, _do_batch_discard_prefetch)
with libcpp.vector. RAII handles cleanup, eliminating the manual
try/finally and removing a leak window if _to_cumemlocation raised
mid-fill. Matches the precedent used in _program.pyx, _linker.pyx,
_kernel_arg_handler.pyx, _graph_node.pyx, and others.
Net change: 53 insertions, 85 deletions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cuda.core): restore CUDA_ERROR_NOT_INITIALIZED auto-init in _query_memory_attrs (R4)
Per Leo's review on PR #1775 (_buffer.pyx:455), restore the auto-init
retry that was removed in 10de998. cuPointerGetAttributes is the
first driver call _query_memory_attrs makes, and a NOT_INITIALIZED
result here would otherwise propagate out of every is_managed /
is_host_accessible / is_device_accessible query before the user has
called any other Device API.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cuda.core): make Host a plain class instead of a dataclass (R1)
Per Leo's review on PR #1775 (_host.py:9), drop the @dataclass(frozen=True)
in favor of a hand-written class with property accessors. Matches Leo's
original sketch from the 2026-04-28 drive-by comment and aligns with
how Device is structured in this codebase.
Behavior preserved: Host(), Host(numa_id=N), and Host.numa_current()
all work identically. __eq__, __hash__, and immutability are
hand-rolled rather than dataclass-generated.
is_numa_current is no longer an __init__ kwarg — it's internal state
settable only via the Host.numa_current() classmethod. Two existing
TestHost cases updated:
- test_numa_current_with_id_rejected → test_numa_current_only_via_classmethod
- test_frozen → test_immutable (AttributeError instead of FrozenInstanceError)
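A sketch of what a hand-written immutable Host might look like. The slot names, the negative-id check, and the error messages are assumptions; only the three construction forms, the property accessors, and the AttributeError on mutation come from the description above.

```python
class Host:
    __slots__ = ("_numa_id", "_is_numa_current")

    def __init__(self, numa_id=None):
        if numa_id is not None and numa_id < 0:
            raise ValueError("numa_id must be non-negative")
        # Bypass our own __setattr__ guard during construction.
        object.__setattr__(self, "_numa_id", numa_id)
        object.__setattr__(self, "_is_numa_current", False)

    @classmethod
    def numa_current(cls):
        self = cls()
        object.__setattr__(self, "_is_numa_current", True)
        return self

    @property
    def numa_id(self):
        return self._numa_id

    @property
    def is_numa_current(self):
        return self._is_numa_current

    def __setattr__(self, name, value):
        raise AttributeError(f"{type(self).__name__} is immutable")

    def _key(self):
        return (self._numa_id, self._is_numa_current)

    def __eq__(self, other):
        if not isinstance(other, Host):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash((Host, self._key()))
```

Since is_numa_current is not an __init__ parameter, Host(is_numa_current=True) fails with a TypeError and the classmethod is the only way to reach that state.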
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cuda.core)!: drop int location shorthand from managed-memory ops (R6, R8)
Per Leo's review on PR #1775 (_managed_buffer.py:165) and Andy's
parallel question (line 144), drop the `int` shorthand for
prefetch/discard_prefetch/advise locations. The previous design
accepted `Device | Host | int` where `int >= 0` meant a device ordinal
and `-1` magically meant host. With first-class `Device` and `Host`,
the int form was redundant and the `-1 → Host` magic was surprising.
Public API change:
prefetch(buf, Device(0), stream=...) # was: prefetch(buf, 0, stream=...)
prefetch(buf, Host(), stream=...) # was: prefetch(buf, -1, stream=...)
This also resolves an inconsistency: ManagedBuffer.preferred_location
already accepted only Device | Host | None, but prefetch() and
discard_prefetch() accepted int. Now uniformly Device | Host.
Pre-1.0 breaking change. Anyone using the int shorthand should switch
to the explicit Device(N) / Host() form.
Files touched:
- _managed_location.py: drop the int branch from _coerce_location;
TypeError now reads "Device, Host, or None"
- _managed_buffer.py: type signatures `Device | Host | int` → `Device | Host`
- _managed_memory_ops.pyx: docstring updates (3 occurrences)
- tests/memory/test_managed_ops.py: replace int call sites with
Host()/Device(N); collapse three int-branch tests into one
test_int_rejected
- 1.0.0-notes.rst: drop the "int values are also accepted" sentence
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(cuda.core): add AccessedBySet to api_private.rst (R5)
Per Andy's review on PR #1775 (_managed_buffer.py:52), document
`AccessedBySet` in the private API reference. It is returned by
`ManagedBuffer.accessed_by` but not directly instantiable by users —
matches the existing `_memory._ipc.*` entries in the same section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(cuda.core): note the legacy NUMA round-trip limitation on preferred_location (R2, R7)
Per Leo's questions on PR #1775 (_host.py:26 and _managed_buffer.py:140):
R2 (Host numa_id): the dataclass surface is intentional. Three forms
already cover the use cases — Host() / Host(numa_id=N) /
Host.numa_current(). Auto-inferring numa_id at Host() construction
would conflict with the "generic host" semantic.
R7 (preferred_location getter): the underlying limitation is real but
upstream-blocked. The legacy CU_MEM_RANGE_ATTRIBUTE_PREFERRED_LOCATION
returns only a single int (device id, -1 host, -2 none) — no NUMA. CUDA
13 added _PREFERRED_LOCATION_TYPE / _ID for full round-trip, and they
are exposed in cydriver, but cuda.bindings'
_HelperCUmem_range_attribute does not yet recognize them — calling
driver.cuMemRangeGetAttribute with the new attributes raises
"Unsupported attribute". Once cuda.bindings adds them, this getter can
query the v2 attributes and return Host(numa_id=N).
Add a docstring note documenting the limitation so users aren't
surprised by the lossy round-trip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cuda.core): use collections.abc.Sequence for input checks (R12, R13)
Per Andy's review on PR #1775 (_managed_memory_ops.pyx:102 and :118),
replace `isinstance(x, (list, tuple))` with `isinstance(x, Sequence)`
in `_coerce_buffer_targets` and `_broadcast_locations`. Matches the
existing precedent in `cuda.core._utils.cuda_utils.is_sequence()`.
The widened input set also accepts `str`, but neither `Buffer` nor
`Location` is stringly-typed, so a `str` input still raises — just
with a different message (Buffer cast error or Location TypeError
from `_coerce_location`).
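The widening can be checked directly with the standard library. The helper coerce_batch below is a hypothetical illustration of why a str still fails, just later, at the element check:

```python
from collections.abc import Sequence

# Which inputs count as a Sequence?
checks = {
    "list": isinstance([1, 2], Sequence),
    "tuple": isinstance((1, 2), Sequence),
    "str": isinstance("ab", Sequence),
    "set": isinstance({1, 2}, Sequence),
    "generator": isinstance((i for i in range(2)), Sequence),
}


def coerce_batch(x, elem_type):
    """Accept any Sequence, then validate each element's type."""
    if not isinstance(x, Sequence):
        raise TypeError("expected a sequence")
    for item in x:
        if not isinstance(item, elem_type):
            raise TypeError(
                f"expected {elem_type.__name__}, got {type(item).__name__}"
            )
    return list(x)
```

A str passes the Sequence test but each of its elements is itself a str, so the per-element type check rejects it.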
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cuda.core): narrow Buffer.from_handle to Buffer-only (R3)
Per Leo's review on PR #1775 (_buffer.pyx:135), make
Buffer.from_handle a @staticmethod that always returns Buffer.
Subclass-aware construction stays available via the private
@classmethod Buffer._init, which is what Leo asked for ("use a
private method for handling subclasses for now").
ManagedBuffer gains its own @classmethod from_handle that wraps
cls._init, so user-facing call sites like
ManagedBuffer.from_handle(ptr, size, owner=plain) continue to work
unchanged. The narrowly-scoped subclass factory is on the subclass
itself, not bolted onto Buffer's public surface.
This addresses R3's spirit: cuda.core's public APIs no longer
advertise generic subclass-construction support that conflicts
with the broader subclassing story tracked in #750 / #1989.
No test changes; behavior preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cuda.core): single API surface per operation (R9, R10, R11)
Per Leo's R11 ("if we prefer methods, don't expose free functions"):
each managed-memory operation now has exactly one public surface,
chosen by whether it acts on one buffer or many.
Single buffer (instance methods + properties on ManagedBuffer):
- buf.read_mostly = True
- buf.preferred_location = Device(0)
- buf.accessed_by.add(Device(1))
- buf.prefetch(Device(0), stream=stream)
- buf.discard(stream=stream)
- buf.discard_prefetch(Device(0), stream=stream)
Multiple buffers (free functions in cuda.core.utils, CUDA 13+ only):
- utils.prefetch_batch(buffers, locations, stream=stream)
- utils.discard_batch(buffers, stream=stream)
- utils.discard_prefetch_batch(buffers, locations, stream=stream)
Removed:
- cuda.core.utils.advise / prefetch / discard / discard_prefetch
(single-buffer surfaces — replaced by ManagedBuffer methods/properties)
- cuda.core._memory._managed_memory_options module and its four empty
AdviseOptions / PrefetchOptions / DiscardOptions /
DiscardPrefetchOptions dataclasses (R9 from Leo, R10 from Andy:
empty placeholders that didn't carry information)
- options=None parameter from every public surface
- The single-buffer fast path inside the now-batched-only free
functions; they always hit cuMem*BatchAsync now
Internals:
- Public def advise() deleted; _advise_one (cdef) is the new internal
single-buffer entry point used by ManagedBuffer property setters.
- Three new Python-level wrappers _do_single_prefetch_py /
_do_single_discard_py / _do_single_discard_prefetch_py used by
ManagedBuffer instance methods. These call the cdef _do_single_*
helpers with the right Cython types after stream coercion.
- _coerce_buffer_targets renamed to _coerce_batch_buffers; rejects a
single Buffer with a TypeError pointing at the ManagedBuffer method.
Tests:
- TestPrefetch / TestDiscard / TestDiscardPrefetch / TestAdvise
rewritten as TestPrefetchBatch / TestDiscardBatch /
TestDiscardPrefetchBatch (batched-only, since single-buffer is
covered by ManagedBuffer's TestManagedBuffer class)
- Single-buffer external-allocation tests use
ManagedBuffer.from_handle(plain.handle, plain.size, owner=plain)
to wrap a DummyUnifiedMemoryResource buffer
- options-related tests deleted (no options surface to test)
- enum-value advise test deleted (property setters are typed; the
string-alias / enum-value internal API isn't user-visible)
Release notes updated.
Closes R9, R10, R11.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cuda.core): build advise reverse-lookup eagerly at module load (N4)
Per Leo's review on PR #1775 (_managed_memory_ops.pyx:23), drop the
lazy-init plumbing for the enum→alias reverse lookup table. The forward
table _MANAGED_ADVICE_ALIASES has six entries; building the inverse at
module load via a dict comprehension is the same data without the
mutable-global pattern, the `if None` check, or the `global` declaration
inside the function body.
Forward lookup table (_MANAGED_ADVICE_ALIASES) is preserved as the source
of truth — explicit alias→CUDA-name mapping, grep-friendly, no implicit
naming-convention coupling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(cuda.core): factor shared body of _do_batch_{prefetch,discard_prefetch} (N2)
Per Leo's review on PR #1775 (_managed_memory_ops.pyx:425), the two
batched-with-locations helpers were byte-for-byte identical except for
the driver function being called. Both:
- declare the same four std::vectors (ptrs, sizes, loc_arr, loc_indices)
- resize and fill them in the same loop
- release the GIL and call cuMem{Prefetch,DiscardAndPrefetch}BatchAsync
with the same argument shape
Introduce a function-pointer typedef _BatchPrefetchFn (the two driver
calls share signature), parameterize the shared body as
_do_batch_prefetch_op, and have the two callers pass the appropriate
driver function. Both the typedef and the helper live inside the
IF CUDA_CORE_BUILD_MAJOR >= 13 block since they reference cu13-only
types.
Net: 28 duplicated lines removed, 25 added for the shared helper. No
behavior change; tests unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
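The N2 refactor above can be illustrated in plain Python, with a callable parameter playing the role of the Cython function-pointer typedef. This is only a sketch: the two `fake_*` functions are hypothetical stand-ins that record their arguments instead of calling the driver, and the fill loop is a guess at the shape of the shared body, not the actual implementation.

```python
# Hypothetical stand-ins for the two driver calls. They share one signature
# and only record what they were asked to do.
calls = []


def fake_prefetch_batch_async(ptrs, sizes, loc_arr, loc_indices, stream):
    calls.append(("prefetch", tuple(ptrs), tuple(sizes), tuple(loc_arr)))


def fake_discard_and_prefetch_batch_async(ptrs, sizes, loc_arr, loc_indices, stream):
    calls.append(("discard_prefetch", tuple(ptrs), tuple(sizes), tuple(loc_arr)))


def do_batch_prefetch_op(driver_fn, buffers, locations, stream=None):
    """Shared body: fill the four parallel arrays, then call driver_fn once."""
    if len(buffers) != len(locations):
        raise ValueError("need one location per buffer")
    ptrs, sizes, loc_arr, loc_indices = [], [], [], []
    for index, ((ptr, size), location) in enumerate(zip(buffers, locations)):
        ptrs.append(ptr)
        sizes.append(size)
        loc_arr.append(location)
        loc_indices.append(index)
    driver_fn(ptrs, sizes, loc_arr, loc_indices, stream)


def do_batch_prefetch(buffers, locations, stream=None):
    do_batch_prefetch_op(fake_prefetch_batch_async, buffers, locations, stream)


def do_batch_discard_prefetch(buffers, locations, stream=None):
    do_batch_prefetch_op(fake_discard_and_prefetch_batch_async, buffers, locations, stream)
```

The two public helpers shrink to one line each, and any future fix to the array-filling loop lands in a single place.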
* test(cuda.core): reuse production _get_int_attr in managed-memory tests (N6)
Per Leo's review on PR #1775 (test_managed_ops.py:28), the test
file's _get_mem_range_attr / _get_int_mem_range_attr / the local
_MEM_RANGE_ATTRIBUTE_VALUE_SIZE constant are functionally identical
to the production _get_int_attr in _managed_buffer.py. Drop the
duplicates and import the production helper.
14 call sites updated. No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cuda.core): cu12 fallback for prefetch_batch (N3)
Per Leo's review on PR #1775 (_managed_memory_ops.pyx:228), raising
NotImplementedError on cu12 forces users to write their own loop. The
CUDA driver semantics for cuMemPrefetchBatchAsync are equivalent to
per-range cuMemPrefetchAsync calls — just more efficient when batched
at the driver level.
On cu12 builds (where cuMemPrefetchBatchAsync is not exposed), fall
back to a Python-level loop calling cuMemPrefetchAsync per buffer.
The single-range path (_do_single_prefetch) already works on cu12
via the IF/ELSE split inside it.
Note this fallback applies only to prefetch_batch — discard_batch and
discard_prefetch_batch keep the cu12 NotImplementedError because the
driver has no single-range cuMemDiscard{,AndPrefetch}Async to fall
back to.
Test skips for cuMemPrefetchBatchAsync unavailability dropped from
TestPrefetchBatch.test_same_location and test_per_buffer_location;
the fallback path now runs on cu12 builds too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
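The fallback in the N3 entry above amounts to the dispatch below. It is a sketch under stated assumptions: `prefetch_single` and `prefetch_batch_driver` are hypothetical recorders standing in for cuMemPrefetchAsync and cuMemPrefetchBatchAsync, and `build_major` is passed as an argument here, whereas the real code decides at build time.

```python
# Records which stand-in "driver call" was issued, in order.
issued = []


def prefetch_single(buf, location, stream):
    """Hypothetical stand-in for one cuMemPrefetchAsync call."""
    issued.append(("single", buf, location))


def prefetch_batch_driver(bufs, locations, stream):
    """Hypothetical stand-in for one cuMemPrefetchBatchAsync call."""
    issued.append(("batch", tuple(bufs), tuple(locations)))


def prefetch_batch(bufs, locations, stream=None, *, build_major=13):
    bufs, locations = list(bufs), list(locations)
    if len(bufs) != len(locations):
        raise ValueError("need one location per buffer")
    if build_major >= 13:
        # Batched entry point available: one driver call for all ranges.
        prefetch_batch_driver(bufs, locations, stream)
        return
    # cu12 fallback: same semantics, one driver call per range.
    for buf, location in zip(bufs, locations):
        prefetch_single(buf, location, stream)
```

The same trick does not extend to discard_batch or discard_prefetch_batch, since, as the entry notes, there is no single-range driver call to loop over.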
* test(cuda.core): cover AccessedBySet read methods (N7)
Per Leo's review on PR #1775 (test_managed_ops.py:1), add a test for
the read side of AccessedBySet: __iter__, __len__, __eq__, __repr__.
These are part of the public set-like API (alongside __contains__,
add(), discard(), and the setter, which are already covered) but
were untested.
The cu12 batch fallback path (Leo's other coverage point) is now
exercised by TestPrefetchBatch.test_same_location and
test_per_buffer_location running on cu12 CI — the
cuMemPrefetchBatchAsync skip was dropped in d75a7bd when the
fallback landed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cuda.core): cu13 NUMA round-trip for ManagedBuffer.preferred_location (N8)
As promised in the reply on PR #1775's R7 thread, implement the
Host(numa_id=N) round-trip on CUDA 13 builds.
The blocker before was that cuda.bindings's Python-level
cuMemRangeGetAttribute wrapper rejects the new
CU_MEM_RANGE_ATTRIBUTE_PREFERRED_LOCATION_TYPE / _ID attributes via
its allowlist. The workaround: call cydriver.cuMemRangeGetAttribute
directly from a new Cython helper _read_preferred_location_v2,
bypassing the Python wrapper.
The helper queries TYPE then ID, then decodes the (kind, id) pair into
Device | Host | Host(numa_id=N) | Host.numa_current() | None.
ManagedBuffer.preferred_location getter dispatches to the v2 path on
binding_version() >= (13, 0, 0); falls back to the legacy single-int
attribute on cu12 (no NUMA info available).
Test:
- TestManagedBuffer.test_preferred_location_roundtrip already exercises
the cu13 v2 path for Device(...) and Host() (no NUMA), which now
passes through _read_preferred_location_v2.
- New test_preferred_location_roundtrip_host_numa exercises Host(numa_id=0)
round-trip; skips on cu12, and also skips on cu13 hardware/drivers
where set_preferred_location with HOST_NUMA is not preserved (e.g.
single-NUMA test machines).
ManagedBuffer class docstring updated to note that the missing-NUMA
limitation now applies to cu12 only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
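The decode step in the N8 entry above (TYPE and ID queried separately, then combined) might look like the sketch below. `LocType` is a local enum whose values are meant to mirror CUmemLocationType but should be treated as illustrative, and the function returns descriptive tuples rather than the real Device and Host objects, so nothing here is the actual helper.

```python
from enum import IntEnum


class LocType(IntEnum):
    """Local stand-in for the driver's location-type enum (values illustrative)."""

    INVALID = 0
    DEVICE = 1
    HOST = 2
    HOST_NUMA = 3
    HOST_NUMA_CURRENT = 4


def decode_preferred_location(kind, loc_id):
    """Turn a (kind, id) pair into a description of the preferred location."""
    kind = LocType(kind)  # raises ValueError for an unknown kind
    if kind is LocType.INVALID:
        return None  # no preferred location set
    if kind is LocType.DEVICE:
        return ("device", loc_id)
    if kind is LocType.HOST:
        return ("host", None)  # id is not meaningful for plain host
    if kind is LocType.HOST_NUMA:
        return ("host_numa", loc_id)
    return ("host_numa_current", None)
```

For example, `decode_preferred_location(3, 0)` corresponds to the Host(numa_id=0) case that the new round-trip test exercises.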
* docs(cuda.core): replace stale utils autosummary entries
api.rst still listed the single-buffer free functions and *Options
dataclasses that were removed under R9/R11 (advise, prefetch, discard,
discard_prefetch and their *Options classes). Replace with the actual
cuda.core.utils exports: prefetch_batch, discard_batch,
discard_prefetch_batch. Drop the now-orphan :template: dataclass.rst
line.
* feat(cuda.core): make Host a singleton class
Mirror Device's singleton semantics so Host() is Host() and
Host(numa_id=1) is Host(numa_id=1) hold. Host.numa_current() returns
its own singleton, distinct from Host(), since it represents a
thread-relative location rather than a fixed one.
Construction routes through __new__ -> _get_or_create with a
double-checked dict + Lock cache keyed on (numa_id, is_numa_current).
__eq__ collapses to identity (consistent with the retained __hash__).
__reduce__ added so pickled Host instances round-trip back through
the singleton cache instead of stranding copies.
Resolves PR #1775 review: leofang and Andy-Jost requested Host follow
Device as a singleton so users can rely on `is` for identity checks.
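A self-contained sketch of the singleton pattern described in the entry above: `__new__` routes through `_get_or_create`, which uses a double-checked dict plus a Lock keyed on (numa_id, is_numa_current), and `__reduce__` sends unpickling back through the cache. The attribute names and the absence of argument validation are simplifications, not the real class.

```python
import threading
from typing import ClassVar


class Host:
    __slots__ = ("_is_numa_current", "_numa_id")

    _instances: ClassVar[dict] = {}
    _instances_lock: ClassVar[threading.Lock] = threading.Lock()

    def __new__(cls, numa_id=None):
        return cls._get_or_create(numa_id, False)

    @classmethod
    def numa_current(cls):
        # Thread-relative location: its own singleton, distinct from Host().
        return cls._get_or_create(None, True)

    @classmethod
    def _get_or_create(cls, numa_id, is_numa_current):
        key = (numa_id, is_numa_current)
        inst = cls._instances.get(key)  # fast path, no lock
        if inst is None:
            with cls._instances_lock:
                inst = cls._instances.get(key)  # re-check under the lock
                if inst is None:
                    inst = object.__new__(cls)
                    inst._numa_id = numa_id
                    inst._is_numa_current = is_numa_current
                    cls._instances[key] = inst
        return inst

    def __reduce__(self):
        # Rebuild through the public constructors so copies land in the cache.
        if self._is_numa_current:
            return (Host.numa_current, ())
        return (Host, (self._numa_id,))
```

With `__eq__` and `__hash__` left at their identity-based defaults, `Host(numa_id=1) is Host(numa_id=1)` and `Host() is not Host.numa_current()` both hold.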
* refactor(cuda.core): rename AccessedBySet -> AccessedBySetProxy
Align with the graph module's AdjacencySetProxy: rename the class and
inherit from collections.abc.MutableSet so the full set interface
(remove, pop, clear, |=, &=, -=, ^=, isdisjoint, subset/superset
operators, etc.) is filled in automatically from the existing add /
discard / __contains__ / __iter__ / __len__ primitives.
Add classmethod _from_iterable so binary set operators (&, |, ^) produce
plain sets rather than constructing a buffer-less proxy. Tighten add
to TypeError on non-Device/Host inputs and discard / __contains__ to
silently ignore them, matching MutableSet contracts. The hand-rolled
__eq__ (set/frozenset comparison) is dropped: Set ABC's default
implementation handles it correctly.
Resolves PR #1775 review (Andy-Jost, 2026-05-04): naming consistency
with AdjacencySetProxy and full MutableSet conformance.
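The sketch below shows how inheriting from collections.abc.MutableSet fills in the rest of the set interface from five primitives, and how `_from_iterable` makes the binary operators return plain sets. It assumes a hypothetical `Location` dataclass in place of Device and Host and an in-memory set in place of the driver-backed state, so it illustrates the pattern rather than the real proxy.

```python
from collections.abc import MutableSet
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical stand-in for Device / Host."""

    kind: str
    ordinal: int = 0


class AccessedBySetProxy(MutableSet):
    def __init__(self, backing):
        self._backing = backing  # stand-in for the buffer's driver state

    @classmethod
    def _from_iterable(cls, iterable):
        # Results of &, |, ^, - are plain sets, not buffer-less proxies.
        return set(iterable)

    def __contains__(self, item):
        return isinstance(item, Location) and item in self._backing

    def __iter__(self):
        return iter(list(self._backing))  # snapshot, safe during mutation

    def __len__(self):
        return len(self._backing)

    def add(self, item):
        if not isinstance(item, Location):
            raise TypeError(f"expected Location, got {type(item).__name__}")
        self._backing.add(item)

    def discard(self, item):
        if isinstance(item, Location):
            self._backing.discard(item)
```

Methods such as `remove`, `pop`, `clear`, `|=`, and `==` come from the ABC mixins without any extra code.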
* fix(cuda.core): silence ruff lints on Host singleton
- Annotate _instances / _instances_lock as ClassVar (RUF012).
- Sort __slots__ alphabetically (RUF023, auto-fixed by ruff).
* fix(cuda.core): reject bool as Host(numa_id=...)
bool is an int subclass, so the previous guard let Host(True) and
Host(False) seed the singleton cache under the same keys as Host(1)
and Host(0). Whichever call landed first won, leaving repr(Host(1))
potentially showing as Host(numa_id=True). Reject bool explicitly.
Addresses rwgk's Low finding on PR #1775.
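The collision described above follows from `True == 1` and `hash(True) == hash(1)`, so both produce the same cache key. A minimal sketch of the guard, with `validate_numa_id` as a hypothetical helper name:

```python
def validate_numa_id(numa_id):
    """Accept None or a real int; reject bool even though it subclasses int."""
    if numa_id is None:
        return None
    # The bool check must come first: isinstance(True, int) is True.
    if isinstance(numa_id, bool) or not isinstance(numa_id, int):
        raise TypeError(f"numa_id must be an int or None, got {type(numa_id).__name__}")
    return numa_id


# Why the guard matters: a cache keyed on (numa_id, is_numa_current)
# cannot tell True from 1.
cache = {(1, False): "Host(numa_id=1)"}
collides = (True, False) in cache
```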
* fix(cuda.core): hoist managed-buffer check in _advise_one
Move _require_managed_buffer to the first statement of _advise_one so
a non-managed buffer is rejected before advice/location parsing,
matching the order in _do_single_prefetch_py and
_do_single_discard_prefetch_py. This prevents surfacing an
advice-validation error when the real problem is the buffer kind.
* fix(cuda.core): clarify CUDA 12 NUMA-host error message
Rephrase the RuntimeError raised from _to_legacy_device when a caller
passes Host(numa_id=...) or Host.numa_current() on a CUDA 12 build.
The new message names the unsupported APIs and points the user at
Host() as the working alternative, instead of leaking the internal
location_type discriminator.
* fix(cuda.core): reject Host(numa_id=...) up-front on CUDA 12
The CUDA 12 cuMemPrefetchAsync / cuMemAdvise ABI takes a plain device
ordinal and cannot represent a specific host NUMA node. Previously
_coerce_location accepted Host(numa_id=...) and Host.numa_current()
on a CUDA 12 build and let the operation fail late inside the Cython
layer with RuntimeError, which the public APIs surfaced as a confusing
error from deep in the stack.
Reject NUMA-host kinds at the call boundary in _coerce_location with
a TypeError that names the unsupported APIs and points at Host() as
the working alternative. Update the ManagedBuffer docstring to match
the new contract, and broaden two host_numa-rejection test asserts to
accept either the CUDA 13 kind-allowed ValueError or the CUDA 12
boundary TypeError.
Addresses rwgk's Medium finding on PR #1775.
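A rough sketch of the boundary check described above. The `Device` and `Host` dataclasses, the `build_major` parameter, and the exact error wording are all assumptions for illustration; only the idea (reject NUMA-host kinds with a TypeError before any driver work on CUDA 12) comes from the entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for the real Device class."""

    ordinal: int


@dataclass(frozen=True)
class Host:
    """Hypothetical stand-in for the real Host class."""

    numa_id: Optional[int] = None
    is_numa_current: bool = False


def coerce_location(location, build_major):
    if not isinstance(location, (Device, Host)):
        raise TypeError(f"expected Device or Host, got {type(location).__name__}")
    is_numa_host = isinstance(location, Host) and (
        location.numa_id is not None or location.is_numa_current
    )
    if is_numa_host and build_major < 13:
        # Fail at the call boundary instead of deep inside the Cython layer.
        raise TypeError(
            "Host(numa_id=...) and Host.numa_current() require a CUDA 13 build; "
            "use Host() instead"
        )
    return location
```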
* fix(cuda.core): make ManagedBuffer.accessed_by setter atomic
The previous setter computed (current - target) and (target - current)
and called _advise_one in two loops. set(locations) raised TypeError
on unhashable elements, but only after the first diff pair had already
been issued, so an invalid RHS could leave accessed_by partially
mutated. Reproduce: starting from {Device(0)}, assigning
{Host(numa_id=0)} on CUDA 12 raises and leaves accessed_by == set().
Validate every target up-front (per-element isinstance(Device|Host))
and only then issue the diff loops, so a bad RHS raises before any
driver state changes.
Addresses rwgk's High finding on PR #1775.
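The validate-then-mutate ordering described above can be sketched as follows. `ManagedBufferSketch`, its in-memory `_driver_state`, and the `Device` / `Host` dataclasses are hypothetical stand-ins; the point is only that every element is checked before the first `_advise_one` call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    ordinal: int


@dataclass(frozen=True)
class Host:
    numa_id: Optional[int] = None


class ManagedBufferSketch:
    def __init__(self, initial=()):
        self._driver_state = set(initial)  # stand-in for driver-side state

    def _advise_one(self, advice, location):
        if advice == "set_accessed_by":
            self._driver_state.add(location)
        else:
            self._driver_state.discard(location)

    @property
    def accessed_by(self):
        return set(self._driver_state)

    @accessed_by.setter
    def accessed_by(self, locations):
        targets = list(locations)
        # Phase 1: validate everything; nothing has been changed yet.
        for location in targets:
            if not isinstance(location, (Device, Host)):
                raise TypeError(f"expected Device or Host, got {type(location).__name__}")
        # Phase 2: apply the diff only after validation has passed.
        target, current = set(targets), self.accessed_by
        for location in current - target:
            self._advise_one("unset_accessed_by", location)
        for location in target - current:
            self._advise_one("set_accessed_by", location)
```

Assigning a list that contains one invalid element now raises before any advice is issued, so the previous set is left intact.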
* style(cuda.core): apply ruff format
Collapses multi-line string concats and conditions back to single lines
under the project's line-length limit. No behavior change.
* Skip NUMA-aware Host coerce tests on CUDA 12 builds
Host(numa_id=N) and Host.numa_current() require CUDA 13 bindings; the
TestLocationCoerce passthroughs were missing the binding_version guard
already used by test_preferred_location_roundtrip_host_numa.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 42eecda commit 5befc7d
19 files changed
Lines changed: 1589 additions & 40 deletions
File tree
- cuda_core
- cuda/core
- _memory
- utils
- docs/source
- release
- tests
- helpers
- memory