
Commit 50c19d0

cuda.core: require explicit stream for stream-scheduling APIs (#2020)
* cuda.core: require explicit stream for stream-scheduling APIs (#2001)

  Removes the implicit fallback to default_stream() (or NULL) on APIs that schedule work on a stream. `stream` is now a required keyword-only argument; `Stream_accept(None)` raises TypeError.

  Affected APIs:
  - MemoryResource.allocate / deallocate and overrides on DeviceMemoryResource, PinnedMemoryResource, ManagedMemoryResource, LegacyPinnedMemoryResource, GraphMemoryResource.
  - Device.allocate.
  - GraphicsResource.map.
  - KernelOccupancy.max_potential_cluster_size / max_active_clusters.
  - Graph.launch (stream was previously positional).

  Stream_accept is promoted to cpdef so the pure-Python legacy/sync resources can call it.

  Also fixes a latent bug uncovered while doing this: the C++ MR deallocation callback in Buffer's GC path was calling `mr.deallocate(ptr, size, stream)` positionally, which would fail with the new keyword-only signature for every garbage-collected DeviceMemoryResource/GraphMemoryResource buffer. Switched to `stream=stream`.

  VirtualMemoryResource is exempt because cuMemCreate / cuMemMap are synchronous and not stream-ordered; it now accepts (and validates) an optional stream instead of rejecting any non-None value. Buffer.from_ipc_descriptor is also exempt: stream there only seeds the deallocation stream stored in the handle (no work is scheduled), the same shape as Buffer.close(stream=None).

  Tests, examples, and the v1.0.0 release note are updated accordingly.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: also require explicit stream for Buffer.from_ipc_descriptor (#2001)

  Buffer.from_ipc_descriptor previously fell back to default_stream() when stream=None. That fallback is exactly the implicit-fallback pattern issue #2001 removes (the chosen stream depends on global state, not the call site), so it does not belong in the same exemption category as Buffer.close(stream=None) / GraphicsResource.unmap(stream=None), which genuinely reuse an existing stream.

  stream is now keyword-only and required. Internal validation goes through Stream_accept like the other tightened APIs. Tests and the v1.0.0 release note updated accordingly.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: align deallocate signatures and revert Graph.launch (#2001)

  - Make `deallocate` keyword-only on the synchronous resources (`LegacyPinnedMemoryResource`, `_SynchronousMemoryResource`, `VirtualMemoryResource`) so every memory-resource API obeys the kw-only rule, with `stream=None` as the default since these resources do not actually use the stream.
  - Revert `Graph.launch` to take `stream` positionally. It is the same shape as the kernel `launch(stream, config, kernel, *args)` API (already exempt in the issue) and shouldn't be the odd one out.
  - Tighten `VirtualMemoryResource.deallocate` docstring to match `allocate`.
  - Mark unused lambda args in `test_pass_object` as `_stream` to silence ARG005.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: tighten test mocks and add Stream_accept(None) test (#2001)

  Review follow-ups:
  - Tighten the test-only `MemoryResource` subclasses (`DummyDeviceMemoryResource`, `DummyHostMemoryResource`, `DummyPinnedMemoryResource`, `DummyUnifiedMemoryResource`, `TrackingMR`, `StreamCaptureMR`) to match the new public API: `allocate(self, size, *, stream)` and `deallocate(self, ptr, size, *, stream)` with no default. Previously the mocks accepted `stream=None` positionally, which let tests bypass the new explicit-stream policy.
  - Update the affected helper functions and call sites in `test_memory.py` to pass `stream=device.default_stream` explicitly. Fix the `super().deallocate(ptr, size, stream)` positional call in `test_mr_deallocate_receives_stream` to use `stream=stream`.
  - Update `helpers/buffers.py` similarly (`make_scratch_buffer`, `PatternGen`).
  - Add a direct test for the centralized `Stream_accept(None)` -> `TypeError` behavior in `test_stream.py`.
  - Tighten the release note for `Buffer.from_ipc_descriptor`: lead with the removal of the silent fallback to the default stream rather than the positional-to-keyword shift.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: fix Buffer pickle path broken by kw-only stream (#2001)

  `Buffer._reduce_helper` (the pickle/unpickle factory) previously called `Buffer.from_ipc_descriptor(mr, ipc_descriptor)` without a stream and relied on the implicit `default_stream()` fallback inside `Buffer_from_ipc_descriptor`. Making `from_ipc_descriptor`'s stream a required keyword-only argument broke this code path, causing every multiprocessing IPC test that pickles a `Buffer` (test_send_buffers, test_memory_ipc, test_event_ipc, test_serialize, test_workerpool, ...) to fail in the child process with:

      TypeError: from_ipc_descriptor() needs keyword-only argument stream

  Fix: pass `default_stream()` explicitly from `_reduce_helper`. The parent process's stream isn't portable across processes, so the pickle path cannot thread an explicit stream through. The receiver can still override the deallocation stream via `buffer.close(stream=...)`. The user-facing rule still holds: callers of `Buffer.from_ipc_descriptor` must pass an explicit stream.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: relax kw-only TypeError regex for Cython funcs (#2001)

  Cython-generated functions raise "FUNC() needs keyword-only argument stream" while pure-Python functions raise "FUNC() missing 1 required keyword-only argument: 'stream'". The new tests for `Kernel.occupancy.max_potential_cluster_size`, `Kernel.occupancy.max_active_clusters`, and `GraphicsResource.map` were matching only the CPython phrasing and failed against the Cython forms. Loosen the regex to `keyword-only argument`, which matches both.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: review fixes for #2001 (graph_update.py + _legacy.py)

  - examples/graph_update.py: use the dedicated `stream` created at the top of the example for the pinned allocation, instead of `device.default_stream`. Better model for users (Leo).
  - _memory/_legacy.py: route the user-supplied `stream` through `Stream_accept` in `LegacyPinnedMemoryResource.deallocate` and `_SynchronousMemoryResource.deallocate` so a non-`Stream` argument raises the clean `TypeError` from `Stream_accept` instead of an `AttributeError` from `.sync()` (matches the validation the matching `allocate` methods already do).

  Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: drop unused stream= kwargs from sync MR call sites (#2001)

  Synchronous memory resources (`LegacyPinnedMemoryResource`, `_SynchronousMemoryResource`, the various test mocks `DummyDeviceMR`, `DummyHostMR`, `DummyPinnedMR`, `DummyUnifiedMR`, `NullMemoryResource`, `TrackingMR`, `StreamCaptureMR`) take a stream argument purely for interface conformance with stream-ordered MRs but never use it. Forcing every caller to manufacture a stream just to discard it adds ceremony and a misleading model.

  Switch these MRs' allocate/deallocate signatures to keyword-only `stream=None` (validated via `Stream_accept` when provided), and drop the now-unused `stream=...` kwargs from ~35 call sites across examples, tests, and helpers. Also drop the `device` parameter from `buffer_initialization` and `buffer_close` test helpers (no longer needed) and remove leftover Device-setup boilerplate from the NullMemoryResource dlpack-failure tests.

  The user-facing rule is unchanged for the genuinely stream-ordered APIs (`DeviceMemoryResource`, `PinnedMemoryResource`, `ManagedMemoryResource`, `GraphMemoryResource`, `Device.allocate`, `Buffer.from_ipc_descriptor`, etc.): stream remains required and keyword-only. The release note is updated to reflect the sync-MR exemption (folding `LegacyPinnedMemoryResource` in alongside `VirtualMemoryResource`).

  Co-authored-by: Cursor <cursoragent@cursor.com>

* cuda.core: fix C++ teardown leak when buffer has no attached stream (#2001)

  Issue: the C++ ``shared_ptr`` deleter for a buffer's device-pointer handle invokes ``MemoryResource.deallocate`` via ``_mr_dealloc_callback``. The handle's deallocation stream is set separately via ``set_deallocation_stream``; if it was never set (e.g. buffers minted via ``Buffer.from_handle(ptr, size, mr=mr)`` from DLPack import, IPC import, or third-party adapters), the callback would pass ``stream=None`` to ``mr.deallocate``. After the strict-stream changes for #2001, the stream-ordered MR overrides reject ``stream=None`` via ``Stream_accept`` and raise ``TypeError``. The ``noexcept`` callback catches the exception, prints a warning to stderr, and returns -- silently **leaking** the underlying CUDA allocation (and any associated IPC handles).

  Fix: when ``h_stream`` is empty in ``_mr_dealloc_callback``, fall back to ``default_stream()`` instead of ``None``. The C++ teardown path is the unique legitimate "no-stream-context" caller (no Python frame from which to obtain a stream), so this is the one place where an implicit default-stream fallback is necessary; everywhere else the policy remains "stream is required and must be passed explicitly".

  Add ``test_mr_dealloc_callback_falls_back_to_default_stream`` covering the regression: a strict stream-ordered mock MR is used to back a ``Buffer.from_handle`` (no attached stream), and the test asserts that ``deallocate`` is invoked with the default stream rather than failing with ``TypeError`` and leaking.

  ---------

  Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 598a966 commit 50c19d0

36 files changed

Lines changed: 396 additions & 250 deletions
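The policy change in this commit can be sketched with a toy stand-in for the affected APIs. All names here (`ToyStream`, `toy_stream_accept`, `ToyMemoryResource`) are invented for illustration; the real classes and the real `Stream_accept` live in `cuda.core`:

```python
# Schematic model of the #2001 policy: `stream` becomes a required,
# keyword-only argument, and passing None is rejected up front
# (mirroring Stream_accept(None) -> TypeError in cuda.core).
class ToyStream:
    pass

def toy_stream_accept(stream):
    # Stand-in for Stream_accept: the centralized validation point.
    if stream is None:
        raise TypeError("stream must be a Stream, not None")
    if not isinstance(stream, ToyStream):
        raise TypeError(f"expected a Stream, got {type(stream).__name__}")
    return stream

class ToyMemoryResource:
    def allocate(self, size, *, stream):   # keyword-only, no default
        toy_stream_accept(stream)
        return bytearray(size)             # placeholder for a real Buffer

mr = ToyMemoryResource()
buf = mr.allocate(16, stream=ToyStream())  # OK: explicit stream

try:
    mr.allocate(16)                        # old implicit-fallback call shape
except TypeError as exc:
    print(type(exc).__name__)              # prints "TypeError"

try:
    mr.allocate(16, stream=None)           # None no longer falls back
except TypeError as exc:
    print(type(exc).__name__)              # prints "TypeError"
```

Both failure modes raise `TypeError`, so callers that previously relied on the silent default-stream fallback fail loudly instead of scheduling work on a stream chosen by global state.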

cuda_core/cuda/core/_context.pxd

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # SPDX-License-Identifier: Apache-2.0

cuda_core/cuda/core/_device.pyx

Lines changed: 6 additions & 7 deletions
@@ -1394,14 +1394,12 @@ class Device:
         cdef Context ctx = self._context
         return cyEvent._init(cyEvent, self._device_id, ctx._h_context, options, True)
 
-    def allocate(self, size, stream: Stream | GraphBuilder | None = None) -> Buffer:
+    def allocate(self, size, *, stream: Stream | GraphBuilder) -> Buffer:
         """Allocate device memory from a specified stream.
 
         Allocates device memory of `size` bytes on the specified `stream`
         using the memory resource currently associated with this Device.
 
-        Parameter `stream` is optional, using a default stream by default.
-
         Note
         ----
         Device must be initialized.
@@ -1410,9 +1408,10 @@ class Device:
         ----------
         size : int
             Number of bytes to allocate.
-        stream : :obj:`~_stream.Stream`, optional
-            The stream establishing the stream ordering semantic.
-            Default value of `None` uses default stream.
+        stream : :obj:`~_stream.Stream` | :obj:`~graph.GraphBuilder`
+            Keyword-only. The stream establishing the stream ordering semantic.
+            Must be passed explicitly; pass ``self.default_stream`` to use
+            the default stream.
 
         Returns
         -------
@@ -1421,7 +1420,7 @@ class Device:
 
         """
         self._check_context_initialized()
-        return self.memory_resource.allocate(size, stream)
+        return self.memory_resource.allocate(size, stream=stream)
 
     def sync(self):
         """Synchronize the device.

cuda_core/cuda/core/_graphics.pyx

Lines changed: 7 additions & 6 deletions
@@ -12,7 +12,7 @@ from cuda.core._resource_handles cimport (
     as_intptr,
 )
 from cuda.core._memory._buffer cimport Buffer, Buffer_from_deviceptr_handle
-from cuda.core._stream cimport Stream, Stream_accept, default_stream
+from cuda.core._stream cimport Stream, Stream_accept
 from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
 
 __all__ = ['GraphicsResource']
@@ -206,7 +206,7 @@ cdef class GraphicsResource:
             return None
         return self._mapped_buffer
 
-    def map(self, *, stream: Stream | None = None) -> Buffer:
+    def map(self, *, stream: Stream) -> Buffer:
         """Map this graphics resource for CUDA access.
 
         After mapping, a CUDA device pointer into the underlying graphics
@@ -220,9 +220,10 @@ cdef class GraphicsResource:
 
         Parameters
         ----------
-        stream : :class:`~cuda.core.Stream`, optional
-            The CUDA stream on which to perform the mapping. If ``None``,
-            the current default stream is used.
+        stream : :class:`~cuda.core.Stream`
+            Keyword-only. The CUDA stream on which to perform the mapping.
+            Must be passed explicitly; pass ``device.default_stream`` to use
+            the default stream.
 
         Returns
         -------
@@ -248,7 +249,7 @@ cdef class GraphicsResource:
         if self._get_mapped_buffer() is not None:
             raise RuntimeError("GraphicsResource is already mapped")
 
-        s_obj = default_stream() if stream is None else Stream_accept(stream)
+        s_obj = Stream_accept(stream)
         raw = as_cu(self._handle)
         cy_stream = as_cu(s_obj._h_stream)
         with nogil:

cuda_core/cuda/core/_layout.pyx

Lines changed: 1 addition & 1 deletion
@@ -460,7 +460,7 @@ cdef class _StridedLayout:
         required_size = layout.required_size_in_bytes()
         # allocate the memory on the device
         device.set_current()
-        mem = device.allocate(required_size)
+        mem = device.allocate(required_size, stream=device.default_stream)
         # create a view on the newly allocated device memory
         b_view = StridedMemoryView.from_buffer(mem, layout, a_view.dtype)
         return b_view

cuda_core/cuda/core/_memory/_buffer.pyx

Lines changed: 47 additions & 20 deletions
@@ -24,7 +24,7 @@ from cuda.core._resource_handles cimport (
 )
 from cuda.core.typing import DevicePointerType
 
-from cuda.core._stream cimport Stream, Stream_accept
+from cuda.core._stream cimport Stream, Stream_accept, default_stream
 from cuda.core._utils.cuda_utils cimport HANDLE_RETURN, _parse_fill_value
 
 import sys
@@ -49,12 +49,24 @@ cdef void _mr_dealloc_callback(
     size_t size,
     const StreamHandle& h_stream,
 ) noexcept:
-    """Called by the C++ deleter to deallocate via MemoryResource.deallocate."""
+    """Called by the C++ deleter to deallocate via MemoryResource.deallocate.
+
+    This is the C++ teardown path: there is no Python caller frame from
+    which to obtain a stream. If the device-pointer handle was created
+    without ``set_deallocation_stream`` being called (e.g. buffers minted
+    via ``Buffer.from_handle(ptr, size, mr=mr)`` from DLPack import,
+    third-party adapters, or other foreign sources), ``h_stream`` is
+    empty here. Stream-ordered MR ``deallocate`` overrides reject
+    ``stream=None`` (issue #2001), so without a fallback the destructor
+    would print a warning and leak the allocation. Fall back to the
+    legacy/per-thread default stream so the free still happens; this is
+    the unique exception to the "no implicit default-stream fallback"
+    policy because the teardown has no other source of truth.
+    """
+    cdef Stream stream
     try:
-        stream = None
-        if h_stream:
-            stream = Stream._from_handle(Stream, h_stream)
-        mr.deallocate(int(ptr), size, stream)
+        stream = Stream._from_handle(Stream, h_stream) if h_stream else default_stream()
+        mr.deallocate(int(ptr), size, stream=stream)
     except Exception as exc:
         print(f"Warning: mr.deallocate() failed during Buffer destruction: {exc}",
               file=sys.stderr)
@@ -119,7 +131,11 @@ cdef class Buffer:
 
     @staticmethod
     def _reduce_helper(mr, ipc_descriptor):
-        return Buffer.from_ipc_descriptor(mr, ipc_descriptor)
+        # The parent process's stream is not portable across processes, so the
+        # pickle path cannot thread an explicit stream through. Seed the
+        # imported buffer's deallocation with the current context's default
+        # stream; the receiver can override via buffer.close(stream).
+        return Buffer.from_ipc_descriptor(mr, ipc_descriptor, stream=default_stream())
 
     def __reduce__(self):
         # Must not serialize the parent's stream!
@@ -158,9 +174,20 @@ cdef class Buffer:
     @classmethod
     def from_ipc_descriptor(
        cls, mr: DeviceMemoryResource | PinnedMemoryResource, ipc_descriptor: IPCBufferDescriptor,
-        stream: Stream = None
+        *, stream: Stream
    ) -> Buffer:
-        """Import a buffer that was exported from another process."""
+        """Import a buffer that was exported from another process.
+
+        Parameters
+        ----------
+        mr : :obj:`~_memory.DeviceMemoryResource` | :obj:`~_memory.PinnedMemoryResource`
+            The IPC-enabled memory resource matching the exporting process.
+        ipc_descriptor : :obj:`~_memory.IPCBufferDescriptor`
+            The descriptor exported from another process.
+        stream : :obj:`~_stream.Stream`
+            Keyword-only. The stream used for asynchronous deallocation when
+            the buffer is closed or garbage collected.
+        """
         return _ipc.Buffer_from_ipc_descriptor(cls, mr, ipc_descriptor, stream)
 
     @property
@@ -215,7 +242,7 @@ cdef class Buffer:
         if self._memory_resource is None:
             raise ValueError("a destination buffer must be provided (this "
                              "buffer does not have a memory_resource)")
-        dst = self._memory_resource.allocate(src_size, s)
+        dst = self._memory_resource.allocate(src_size, stream=s)
 
         cdef size_t dst_size = dst._size
         if dst_size != src_size:
@@ -490,17 +517,17 @@ cdef class MemoryResource:
     resource's respective property.)
     """
 
-    def allocate(self, size_t size, stream: Stream | GraphBuilder | None = None) -> Buffer:
+    def allocate(self, size_t size, *, stream: Stream | GraphBuilder) -> Buffer:
         """Allocate a buffer of the requested size.
 
         Parameters
         ----------
         size : int
             The size of the buffer to allocate, in bytes.
-        stream : :obj:`~_stream.Stream` | :obj:`~graph.GraphBuilder`, optional
-            The stream on which to perform the allocation asynchronously.
-            If None, it is up to each memory resource implementation to decide
-            and document the behavior.
+        stream : :obj:`~_stream.Stream` | :obj:`~graph.GraphBuilder`
+            Keyword-only. The stream on which to perform the allocation
+            asynchronously. Must be passed explicitly; pass
+            ``device.default_stream`` to use the default stream.
 
         Returns
         -------
@@ -510,7 +537,7 @@ cdef class MemoryResource:
         """
         raise TypeError("MemoryResource.allocate must be implemented by subclasses.")
 
-    def deallocate(self, ptr: DevicePointerType, size_t size, stream: Stream | GraphBuilder | None = None):
+    def deallocate(self, ptr: DevicePointerType, size_t size, *, stream: Stream | GraphBuilder):
         """Deallocate a buffer previously allocated by this resource.
 
         Parameters
@@ -519,10 +546,10 @@ cdef class MemoryResource:
             The pointer or handle to the buffer to deallocate.
         size : int
             The size of the buffer to deallocate, in bytes.
-        stream : :obj:`~_stream.Stream` | :obj:`~graph.GraphBuilder`, optional
-            The stream on which to perform the deallocation asynchronously.
-            If None, it is up to each memory resource implementation to decide
-            and document the behavior.
+        stream : :obj:`~_stream.Stream` | :obj:`~graph.GraphBuilder`
+            Keyword-only. The stream on which to perform the deallocation
+            asynchronously. Must be passed explicitly; pass
+            ``device.default_stream`` to use the default stream.
         """
         raise TypeError("MemoryResource.deallocate must be implemented by subclasses.")
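The tightened `MemoryResource` contract (both methods take `stream` keyword-only, with no default) can be sketched with a pure-Python subclass-shaped toy. `HostScratchResource` and its dict-based bookkeeping are invented for illustration; a real subclass would call into a CUDA allocator and enqueue work on `stream`:

```python
# Sketch of a custom memory resource conforming to the new base-class
# signatures: allocate(self, size, *, stream) / deallocate(self, ptr, size, *, stream).
class HostScratchResource:
    def __init__(self):
        self._live = {}  # ptr -> (backing buffer, stream), for leak checking

    def allocate(self, size, *, stream):
        # A real stream-ordered MR would schedule the allocation on `stream`;
        # this sketch just records which stream the caller passed.
        buf = bytearray(size)
        ptr = id(buf)
        self._live[ptr] = (buf, stream)
        return ptr

    def deallocate(self, ptr, size, *, stream):
        # `stream` is required here too, matching MemoryResource.deallocate.
        del self._live[ptr]

mr = HostScratchResource()
stream = object()                 # stand-in for a cuda.core Stream
p = mr.allocate(32, stream=stream)
mr.deallocate(p, 32, stream=stream)
assert not mr._live               # everything was returned
```

Because the base class no longer supplies a `stream=None` default, a subclass that keeps the old optional-stream signature still imports fine but diverges from the documented interface; the commit's test-mock tightening exists precisely to catch that drift.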

cuda_core/cuda/core/_memory/_graph_memory_resource.pyx

Lines changed: 7 additions & 7 deletions
@@ -14,7 +14,7 @@ from cuda.core._resource_handles cimport (
     as_cu,
 )
 
-from cuda.core._stream cimport default_stream, Stream_accept, Stream
+from cuda.core._stream cimport Stream_accept, Stream
 from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
 
 from functools import cache
@@ -104,19 +104,19 @@ cdef class cyGraphMemoryResource(MemoryResource):
     def __cinit__(self, int device_id):
         self._device_id = device_id
 
-    def allocate(self, size_t size, stream: Stream | GraphBuilder | None = None) -> Buffer:
+    def allocate(self, size_t size, *, stream: Stream | GraphBuilder) -> Buffer:
         """
         Allocate a buffer of the requested size. See documentation for :obj:`~_memory.MemoryResource`.
         """
-        stream = Stream_accept(stream) if stream is not None else default_stream()
-        return GMR_allocate(self, size, <Stream> stream)
+        cdef Stream s = Stream_accept(stream)
+        return GMR_allocate(self, size, s)
 
-    def deallocate(self, ptr: "DevicePointerType", size_t size, stream: Stream | GraphBuilder | None = None):
+    def deallocate(self, ptr: "DevicePointerType", size_t size, *, stream: Stream | GraphBuilder):
        """
        Deallocate a buffer of the requested size. See documentation for :obj:`~_memory.MemoryResource`.
        """
-        stream = Stream_accept(stream) if stream is not None else default_stream()
-        return GMR_deallocate(ptr, size, <Stream> stream)
+        cdef Stream s = Stream_accept(stream)
+        return GMR_deallocate(ptr, size, s)
 
     def close(self):
         """No operation (provided for compatibility)."""

cuda_core/cuda/core/_memory/_ipc.pyx

Lines changed: 2 additions & 6 deletions
@@ -7,7 +7,7 @@ cimport cpython
 from cuda.bindings cimport cydriver
 from cuda.core._memory._buffer cimport Buffer, Buffer_from_deviceptr_handle
 from cuda.core._memory._memory_pool cimport _MemPool
-from cuda.core._stream cimport Stream
+from cuda.core._stream cimport Stream, Stream_accept
 from cuda.core._resource_handles cimport (
     DevicePtrHandle,
     create_fd_handle,
@@ -19,7 +19,6 @@ from cuda.core._resource_handles cimport (
     as_py,
 )
 
-from cuda.core._stream cimport default_stream
 from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
 from cuda.core._utils.cuda_utils import check_multiprocessing_start_method
 
@@ -171,10 +170,7 @@ cdef Buffer Buffer_from_ipc_descriptor(
     """Import a buffer that was exported from another process."""
     if not mr.is_ipc_enabled:
         raise RuntimeError("Memory resource is not IPC-enabled")
-    if stream is None:
-        # Note: match this behavior to _MemPool.allocate()
-        stream = default_stream()
-    cdef Stream s = <Stream>stream
+    cdef Stream s = Stream_accept(stream)
     cdef DevicePtrHandle h_ptr = deviceptr_import_ipc(
         mr._h_pool,
         ipc_descriptor.payload_ptr(),

cuda_core/cuda/core/_memory/_legacy.py

Lines changed: 28 additions & 15 deletions
@@ -8,6 +8,7 @@
 
 if TYPE_CHECKING:
     from cuda.core._memory._buffer import DevicePointerType
+    from cuda.core._stream import Stream
 
 from cuda.core._memory._buffer import Buffer, MemoryResource
 from cuda.core._utils.cuda_utils import (
@@ -27,33 +28,38 @@ class LegacyPinnedMemoryResource(MemoryResource):
 
     # TODO: support creating this MR with flags that are later passed to cuMemHostAlloc?
 
-    def allocate(self, size, stream=None) -> Buffer:
+    def allocate(self, size, *, stream: Stream | None = None) -> Buffer:
         """Allocate a buffer of the requested size.
 
+        ``cuMemAllocHost`` is synchronous, so this resource ignores any
+        supplied stream. The argument is accepted (and validated when
+        non-``None``) for interface conformance with stream-ordered
+        memory resources.
+
         Parameters
         ----------
         size : int
             The size of the buffer to allocate, in bytes.
         stream : Stream, optional
-            Currently ignored
+            Keyword-only. Validated when provided but otherwise unused.
 
         Returns
         -------
         Buffer
             The allocated buffer object, which is accessible on both host and device.
         """
-        if stream is None:
-            from cuda.core._stream import default_stream
+        from cuda.core._stream import Stream_accept
 
-            stream = default_stream()
+        if stream is not None:
+            Stream_accept(stream)
         if size:
             err, ptr = driver.cuMemAllocHost(size)
             raise_if_driver_error(err)
         else:
             ptr = 0
         return Buffer._init(ptr, size, self)
 
-    def deallocate(self, ptr: DevicePointerType, size, stream):
+    def deallocate(self, ptr: DevicePointerType, size, *, stream: Stream | None = None):
         """Deallocate a buffer previously allocated by this resource.
 
         Parameters
@@ -62,11 +68,14 @@ def deallocate(self, ptr: DevicePointerType, size, stream):
             The pointer or handle to the buffer to deallocate.
         size : int
             The size of the buffer to deallocate, in bytes.
-        stream : Stream
-            The stream on which to perform the deallocation synchronously.
+        stream : Stream, optional
+            Keyword-only. If provided, ``stream.sync()`` is called before the
+            host allocation is freed. ``None`` skips the sync.
         """
+        from cuda.core._stream import Stream_accept
+
         if stream is not None:
-            stream.sync()
+            Stream_accept(stream).sync()
 
         if size:
             (err,) = driver.cuMemFreeHost(ptr)
@@ -96,21 +105,25 @@ def __init__(self, device_id):
 
         self._device_id = Device(device_id).device_id
 
-    def allocate(self, size, stream=None) -> Buffer:
-        if stream is None:
-            from cuda.core._stream import default_stream
+    def allocate(self, size, *, stream: Stream | None = None) -> Buffer:
+        # cuMemAlloc is synchronous; stream is accepted (and validated)
+        # for interface conformance but not used.
+        from cuda.core._stream import Stream_accept
 
-            stream = default_stream()
+        if stream is not None:
+            Stream_accept(stream)
         if size:
             err, ptr = driver.cuMemAlloc(size)
             raise_if_driver_error(err)
         else:
             ptr = 0
         return Buffer._init(ptr, size, self)
 
-    def deallocate(self, ptr, size, stream):
+    def deallocate(self, ptr, size, *, stream: Stream | None = None):
+        from cuda.core._stream import Stream_accept
+
         if stream is not None:
-            stream.sync()
+            Stream_accept(stream).sync()
         if size:
             (err,) = driver.cuMemFree(ptr)
             raise_if_driver_error(err)
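The synchronous-resource pattern adopted in `_legacy.py` (keyword-only but *optional* `stream`, validated when provided, used only to sync before a synchronous free) can be modeled without CUDA. `FakeStream` and `_accept` are invented stand-ins for `Stream` and `Stream_accept`:

```python
# Toy model of the sync-MR deallocate shape: stream=None skips the sync,
# a valid stream is synced before the free, and a non-stream argument
# raises a clean TypeError instead of an AttributeError from .sync().
class FakeStream:
    def __init__(self):
        self.synced = False

    def sync(self):
        self.synced = True

def _accept(stream):
    # Stand-in for Stream_accept: reject non-stream arguments up front.
    if not isinstance(stream, FakeStream):
        raise TypeError("stream must be a Stream")
    return stream

def deallocate(ptr, size, *, stream=None):
    if stream is not None:
        _accept(stream).sync()  # drain pending work before freeing
    # ... the synchronous free of `ptr` (e.g. cuMemFreeHost) would go here ...

s = FakeStream()
deallocate(0xDEAD, 16, stream=s)
assert s.synced                    # sync happened before the free
deallocate(0xDEAD, 16)             # stream omitted: no sync, still legal

try:
    deallocate(0xDEAD, 16, stream="not-a-stream")
except TypeError:
    print("rejected")              # prints "rejected"
```

This mirrors why the sync resources keep `stream=None` as a default while the stream-ordered resources do not: here the stream is an optional ordering hint, not the mechanism that performs the free.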
