  Highlights
  ----------

- None.
+ - Added :class:`GraphMemoryResource` for memory allocation in CUDA graphs.
+ - Added :meth:`Buffer.fill` method for efficient memory initialization using ``cuMemsetAsync``.
+ - Enhanced :class:`StridedMemoryView` with new constructors and better external allocation support.
+ - Significant performance improvements through Cythonization of core components.


  Breaking Changes
@@ -28,10 +31,50 @@ The following APIs have been deprecated and will be removed in 0.6.0:
  New features
  ------------

+ Memory Resources
+ ~~~~~~~~~~~~~~~~
+
  - Added :class:`PinnedMemoryResource` and :class:`PinnedMemoryResourceOptions` for managing
    host-pinned memory pools with optional IPC support.
  - Added :class:`ManagedMemoryResource` and :class:`ManagedMemoryResourceOptions` for managing
    unified memory pools accessible from both host and device.
+ - Added :class:`GraphMemoryResource` and :class:`GraphMemoryResourceAttributes` for memory
+   allocation within CUDA graph capture, supporting both pooling and non-pooling modes.
+ - Added peer access control to :class:`DeviceMemoryResource` for multi-GPU IPC scenarios.
+
+ Buffer and Memory Operations
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ - Added :meth:`Buffer.fill` method for efficient memory initialization using ``cuMemsetAsync``,
+   supporting ``int``, ``bytes``, and general buffer protocol objects.
+ - :class:`Buffer` can now wrap external memory allocations with an owner object, enabling
+   integration with external memory management systems.
+ - IPC-imported buffers can now be re-exported to other processes.
+
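The ``Buffer.fill`` entry above accepts ``int``, ``bytes``, and buffer protocol objects. The sketch below shows one way such values could be reduced to a raw byte pattern before a memset call. It is an illustration only: ``normalize_fill_value`` is a hypothetical helper, not part of ``cuda.core``, and both the single-byte treatment of ``int`` and the 1-, 2-, or 4-byte pattern limit are assumptions (chosen to match the 8-, 16-, and 32-bit memset variants of the CUDA driver API).

```python
def normalize_fill_value(value):
    """Hypothetical helper: return the fill pattern as raw bytes."""
    if isinstance(value, int):
        # Assumption: an int stands for a single byte, so it must fit in 0..255.
        if not 0 <= value <= 255:
            raise OverflowError("int fill value must be in range(256)")
        return bytes([value])
    # memoryview() accepts any object that exposes the buffer protocol
    # (bytes, bytearray, array.array, ...) and raises TypeError otherwise.
    pattern = memoryview(value).tobytes()
    # Assumption: only 1-, 2-, and 4-byte patterns map onto a memset call.
    if len(pattern) not in (1, 2, 4):
        raise ValueError("fill pattern must be 1, 2, or 4 bytes long")
    return pattern


zero_byte = normalize_fill_value(0)             # single zero byte
two_bytes = normalize_fill_value(b"\xab\xcd")   # bytes pass through unchanged
four_bytes = normalize_fill_value(bytearray(4)) # any buffer object works
```

Validating the pattern width up front keeps the error on the Python side instead of surfacing later as a driver failure.
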
+ StridedMemoryView Enhancements
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ - Added :meth:`StridedMemoryView.from_buffer` classmethod for creating views from buffers
+   with explicit shape, strides, and dtype parameters.
+ - Added :meth:`StridedMemoryView.from_dlpack` and :meth:`StridedMemoryView.from_cuda_array_interface`
+   classmethods for explicit protocol selection (the ``__init__`` constructor is now deprecated).
+ - Added :attr:`StridedMemoryView.size` property to get total number of elements.
+ - Added :meth:`StridedMemoryView.view` method for creating sliced views with custom layouts.
+
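The entries above refer to shape, strides, and a total element count. The following pure-Python sketch shows how those quantities relate for a strided layout. It does not call ``cuda.core``; the helper names are hypothetical, and it assumes a row-major layout with strides counted in elements rather than bytes.

```python
import math


def element_count(shape):
    """Total number of elements described by a shape (1 for a 0-d shape)."""
    return math.prod(shape)


def contiguous_strides(shape):
    """Row-major strides, counted in elements (assumption: not bytes)."""
    strides = []
    step = 1
    # Walk from the innermost dimension outwards, accumulating extents.
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


def element_offset(index, strides):
    """Linear offset (in elements) of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 4)
strides = contiguous_strides(shape)       # (12, 4, 1)
count = element_count(shape)              # 24
last = element_offset((1, 2, 3), strides) # 23, the final element
```

The last valid index maps to ``count - 1``, which is a quick sanity check that a shape and a set of contiguous strides agree.
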
+ Program and Linker Options
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ - Added :meth:`ProgramOptions.as_bytes` and :meth:`LinkerOptions.as_bytes` public APIs for
+   converting options to backend-specific byte representations.
+ - Added ``numba_debug`` flag to :class:`ProgramOptions` for enabling Numba CUDA debug support.
+
+ Device API
+ ~~~~~~~~~~
+
+ - Updated :class:`Device` constructor to accept either a :class:`Device` instance or a device
+   ordinal (``int``). Public APIs that previously only accepted ``int`` now accept either.
+ - Added :meth:`Device.get_all_devices` classmethod as replacement for deprecated
+   ``cuda.core.experimental.system.devices``.


  New examples
  Fixes and enhancements
  ----------------------

- - Cythonized :func:`launch` and :class:`LaunchConfig` for improved performance.
+ Performance Improvements
+ ~~~~~~~~~~~~~~~~~~~~~~~~
+
+ - Cythonized :func:`launch` and :class:`LaunchConfig` for improved kernel launch performance.
+ - Cythonized :class:`DeviceMemoryResource` attributes for faster property access.
+ - Optimized :meth:`Buffer.fill` to avoid intermediate object creation by using Cython typed
+   parameters and C buffer API.
+
+ API Improvements
+ ~~~~~~~~~~~~~~~~
+
  - Python ``bool`` objects are now converted to C++ ``bool`` type when passed as kernel
-   arguments. Previously, they were converted to ``int``. This brings them inline
+   arguments. Previously, they were converted to ``int``. This brings them in line
    with ``ctypes.c_bool`` and ``numpy.bool_``.
+ - :class:`Stream` and :class:`GraphBuilder` methods now accept stream-like objects (anything
+   with a ``.handle`` attribute) via the ``IsStreamT`` protocol.
+ - Restored v0.3.x :class:`MemoryResource` behaviors and added missing memory pool attributes
+   to maintain backward compatibility.
+ - Simplified and standardized memory resource APIs across different resource types.
+ - Added warning when multiprocessing start method is set to ``'fork'``, which can cause issues
+   with CUDA contexts.
+
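The ``bool`` entry above matters because the two C types differ in size: a kernel parameter declared ``bool`` normally occupies one byte, while ``int`` typically occupies four. The sketch below shows the difference using only the standard ``ctypes`` module. It does not reproduce how ``cuda.core`` packs kernel arguments, and the variable names are illustrative.

```python
import ctypes

flag = True

# The same Python value marshalled as the two candidate C types.
as_bool = ctypes.c_bool(flag)
as_int = ctypes.c_int(flag)

bool_size = ctypes.sizeof(as_bool)  # 1 byte on mainstream platforms
int_size = ctypes.sizeof(as_int)    # typically 4 bytes

# The raw bytes that would be copied into an argument buffer.
bool_bytes = bytes(as_bool)
int_bytes = bytes(as_int)

print(f"c_bool: {bool_size} byte(s), c_int: {int_size} byte(s)")
```

If the argument buffer held the wider ``int`` representation while the kernel signature expected ``bool``, the sizes of the two sides would no longer match, which is the inconsistency the new conversion avoids.
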
+ Bug Fixes
+ ~~~~~~~~~
+
+ - Fixed exception clauses in :meth:`Buffer.fill` helper functions to properly handle errors.
+ - Fixed potential memory leaks when DLPack capsule creation is interrupted.
+ - Fixed virtual memory management (VMM) allocator on Windows platforms.
+ - Fixed NVRTC program name handling on Windows to avoid filesystem issues.
+ - Fixed NVRTC trace.json generation issue with NVRTC 12.9.
+ - Fixed test infrastructure to avoid leaving temporary files during program options testing.
+
+ Internal Changes
+ ~~~~~~~~~~~~~~~~
+
+ - ``StridedLayout`` has been privatized as ``_StridedLayout`` and is no longer part of the
+   public API. Use :meth:`StridedMemoryView.from_buffer` instead of creating layouts directly.
+ - Replaced some ``isinstance(obj, T)`` checks with ``type(obj) is T`` for performance, with
+   fallback to ``isinstance`` to maintain backward compatibility for subclasses.
+ - Standardized deallocation behavior: :class:`Buffer` objects now deallocate on the stream
+   that was used to allocate them, instead of on the default stream. Users can override the
+   deallocation stream explicitly through :meth:`Buffer.close`.
+
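The ``type(obj) is T`` entry above describes a fast path with an ``isinstance`` fallback. A minimal sketch of that pattern follows. The names ``Ordinal`` and ``is_int_like`` are hypothetical and are not taken from the ``cuda.core`` sources.

```python
class Ordinal(int):
    """Hypothetical int subclass, standing in for a user-defined type."""


def is_int_like(obj):
    # Fast path: an identity comparison on the exact type avoids the
    # more general subclass machinery for the common case.
    if type(obj) is int:
        return True
    # Fallback: keeps subclasses working exactly as before.
    return isinstance(obj, int)


exact = is_int_like(3)             # handled by the fast path
subclass = is_int_like(Ordinal(3)) # handled by the fallback
other = is_int_like("3")           # rejected by both checks
```

The fallback is what preserves behavior: without it, ``Ordinal(3)`` would be rejected even though it is a valid ``int``.
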
+ Test and Development Infrastructure
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ - Added ``--showlocals`` flag to pytest configurations for better debugging output.
+ - Added ``pytest-randomly`` to randomize test execution order and catch ordering dependencies.
+ - Added test IDs for GPU array tests to facilitate easier debugging.
+ - Improved test determinism by replacing OS sleep with GPU nanosleep kernel in event timing tests.
+ - Added coverage automation for tracking code coverage metrics.
+ - Expanded Windows test matrix to cover different driver modes (WDDM/MCDM) and GPU types.