Commit d06a702

Copilot and leofang committed
Comprehensively update release notes with all changes since 2607e0e
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
1 parent e4f6522 commit d06a702

1 file changed

Lines changed: 94 additions & 3 deletions

cuda_core/docs/source/release/0.5.0-notes.rst

@@ -10,7 +10,10 @@
1010
Highlights
1111
----------
1212

13-
None.
13+
- Added :class:`GraphMemoryResource` for memory allocation in CUDA graphs.
14+
- Added :meth:`Buffer.fill` method for efficient memory initialization using ``cuMemsetAsync``.
15+
- Enhanced :class:`StridedMemoryView` with new constructors and better external allocation support.
16+
- Significant performance improvements through Cythonization of core components.
1417

1518

1619
Breaking Changes
@@ -28,10 +31,50 @@ The following APIs have been deprecated and will be removed in 0.6.0:
2831
New features
2932
------------
3033

34+
Memory Resources
35+
~~~~~~~~~~~~~~~~
36+
3137
- Added :class:`PinnedMemoryResource` and :class:`PinnedMemoryResourceOptions` for managing
3238
host-pinned memory pools with optional IPC support.
3339
- Added :class:`ManagedMemoryResource` and :class:`ManagedMemoryResourceOptions` for managing
3440
unified memory pools accessible from both host and device.
41+
- Added :class:`GraphMemoryResource` and :class:`GraphMemoryResourceAttributes` for memory
42+
allocation within CUDA graph capture, supporting both pooling and non-pooling modes.
43+
- Added peer access control to :class:`DeviceMemoryResource` for multi-GPU IPC scenarios.
44+
45+
Buffer and Memory Operations
46+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
47+
48+
- Added :meth:`Buffer.fill` method for efficient memory initialization using ``cuMemsetAsync``,
49+
supporting ``int``, ``bytes``, and general buffer protocol objects.
50+
- :class:`Buffer` can now wrap external memory allocations with an owner object, enabling
51+
integration with external memory management systems.
52+
- IPC-imported buffers can now be re-exported to other processes.
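
To illustrate the fill-value handling described for :meth:`Buffer.fill`, here is a minimal pure-Python sketch of how an ``int``, ``bytes``, or other buffer-protocol object could be normalized into a byte pattern before a memset call. The helper name ``normalize_fill_value`` is hypothetical, and the rule that patterns must be 1, 2, or 4 bytes wide is an assumption based on the element widths of the driver's memset family; it is not confirmed by these notes.

```python
def normalize_fill_value(value):
    """Return (pattern, width_in_bytes) for a fill value.

    Hypothetical helper for illustration only; not part of cuda.core.
    """
    if isinstance(value, int):
        # Assumption: an integer is treated as a single byte.
        if not 0 <= value <= 0xFF:
            raise ValueError("integer fill values must fit in one byte")
        return bytes([value]), 1

    # Any object exposing the buffer protocol (bytes, bytearray, ...).
    pattern = memoryview(value).tobytes()
    if len(pattern) not in (1, 2, 4):
        # Assumption: only 8-, 16-, and 32-bit patterns are supported.
        raise ValueError("fill pattern must be 1, 2, or 4 bytes wide")
    return pattern, len(pattern)
```

The returned width would then select the memset variant of matching element size; the actual dispatch inside cuda.core may differ.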
53+
54+
StridedMemoryView Enhancements
55+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
56+
57+
- Added :meth:`StridedMemoryView.from_buffer` classmethod for creating views from buffers
58+
with explicit shape, strides, and dtype parameters.
59+
- Added :meth:`StridedMemoryView.from_dlpack` and :meth:`StridedMemoryView.from_cuda_array_interface`
60+
classmethods for explicit protocol selection (the ``__init__`` constructor is now deprecated).
61+
- Added :attr:`StridedMemoryView.size` property to get total number of elements.
62+
- Added :meth:`StridedMemoryView.view` method for creating sliced views with custom layouts.
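
As a rough illustration of the quantities involved when a view is described by an explicit shape and strides, the sketch below computes the total element count (what :attr:`StridedMemoryView.size` reports, per the note above) and the strides of a C-contiguous layout. Both helper names are hypothetical, and the strides are counted in elements rather than bytes, which is an assumption; these notes do not state which unit :meth:`StridedMemoryView.from_buffer` expects.

```python
import math


def element_count(shape):
    """Total number of elements for a given shape (1 for a 0-d shape)."""
    return math.prod(shape)


def c_contiguous_strides(shape):
    """Strides, in elements, of a row-major (C-contiguous) layout."""
    strides = []
    running = 1
    # Walk from the innermost dimension outward, accumulating extents.
    for extent in reversed(shape):
        strides.append(running)
        running *= extent
    return tuple(reversed(strides))
```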
63+
64+
Program and Linker Options
65+
~~~~~~~~~~~~~~~~~~~~~~~~~~~
66+
67+
- Added :meth:`ProgramOptions.as_bytes` and :meth:`LinkerOptions.as_bytes` public APIs for
68+
converting options to backend-specific byte representations.
69+
- Added ``numba_debug`` flag to :class:`ProgramOptions` for enabling Numba CUDA debug support.
70+
71+
Device API
72+
~~~~~~~~~~
73+
74+
- Updated :class:`Device` constructor to accept either a :class:`Device` instance or a device
75+
ordinal (``int``). Public APIs that previously only accepted ``int`` now accept either.
76+
- Added :meth:`Device.get_all_devices` classmethod as replacement for deprecated
77+
``cuda.core.experimental.system.devices``.
3578

3679

3780
New examples
@@ -43,7 +86,55 @@ None.
4386
Fixes and enhancements
4487
----------------------
4588

46-
- Cythonized :func:`launch` and :class:`LaunchConfig` for improved performance.
89+
Performance Improvements
90+
~~~~~~~~~~~~~~~~~~~~~~~~
91+
92+
- Cythonized :func:`launch` and :class:`LaunchConfig` for improved kernel launch performance.
93+
- Cythonized :class:`DeviceMemoryResource` attributes for faster property access.
94+
- Optimized :meth:`Buffer.fill` to avoid intermediate object creation by using Cython typed
95+
parameters and C buffer API.
96+
97+
API Improvements
98+
~~~~~~~~~~~~~~~~
99+
47100
- Python ``bool`` objects are now converted to C++ ``bool`` type when passed as kernel
48-
arguments. Previously, they were converted to ``int``. This brings them inline
101+
arguments. Previously, they were converted to ``int``. This brings them inline
49102
with ``ctypes.c_bool`` and ``numpy.bool_``.
103+
- :class:`Stream` and :class:`GraphBuilder` methods now accept stream-like objects (anything
104+
with a ``.handle`` attribute) via the ``IsStreamT`` protocol.
105+
- Restored v0.3.x :class:`MemoryResource` behaviors and added missing memory pool attributes
106+
to maintain backward compatibility.
107+
- Simplified and standardized memory resource APIs across different resource types.
108+
- Added warning when multiprocessing start method is set to ``'fork'``, which can cause issues
109+
with CUDA contexts.
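
The ``bool`` change above matters because the two C types usually differ in size, so a kernel expecting a ``bool`` parameter would otherwise receive a wider value. The standard-library sketch below only demonstrates that size difference with ``ctypes``; it does not call into cuda.core, and the variable names are illustrative.

```python
import ctypes

# The same Python value, marshalled as two different C types.
as_bool = ctypes.c_bool(True)
as_int = ctypes.c_int(True)

# On common platforms these are 1 byte and 4 bytes respectively.
bool_size = ctypes.sizeof(as_bool)
int_size = ctypes.sizeof(as_int)

# Raw bytes that would be copied into the kernel argument buffer.
bool_bytes = bytes(as_bool)
```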
110+
111+
Bug Fixes
112+
~~~~~~~~~
113+
114+
- Fixed exception clauses in :meth:`Buffer.fill` helper functions to properly handle errors.
115+
- Fixed potential memory leaks when DLPack capsule creation is interrupted.
116+
- Fixed virtual memory management (VMM) allocator on Windows platforms.
117+
- Fixed NVRTC program name handling on Windows to avoid filesystem issues.
118+
- Fixed NVRTC trace.json generation issue with NVRTC 12.9.
119+
- Fixed test infrastructure to avoid leaving temporary files during program options testing.
120+
121+
Internal Changes
122+
~~~~~~~~~~~~~~~~
123+
124+
- ``StridedLayout`` has been privatized as ``_StridedLayout`` and is no longer part of the
125+
public API. Use :meth:`StridedMemoryView.from_buffer` instead of creating layouts directly.
126+
- Replaced some ``isinstance(obj, T)`` checks with ``type(obj) is T`` for performance, with
127+
fallback to ``isinstance`` to maintain backward compatibility for subclasses.
128+
- Standardized deallocation behavior: :class:`Buffer` objects now deallocate on the stream
129+
that was used to allocate them, instead of on the default stream. Users can override the
130+
deallocation stream explicitly through :meth:`Buffer.close`.
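
The ``type(obj) is T`` fast path with an ``isinstance`` fallback mentioned above can be sketched as follows. The function name ``is_instance_fast`` and the example classes are hypothetical; the actual checks in cuda.core live in Cython code and may be structured differently.

```python
def is_instance_fast(obj, cls):
    """Exact-type check first, falling back to isinstance for subclasses."""
    if type(obj) is cls:
        # Fast path: a single identity comparison for the common case.
        return True
    # Slow path: keeps subclasses (and virtual subclasses) working.
    return isinstance(obj, cls)


class Base:
    pass


class Derived(Base):
    pass
```

Both ``is_instance_fast(Base(), Base)`` and ``is_instance_fast(Derived(), Base)`` return ``True``; only the first takes the fast path.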
131+
132+
Test and Development Infrastructure
133+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
134+
135+
- Added ``--showlocals`` flag to pytest configurations for better debugging output.
136+
- Added ``pytest-randomly`` to randomize test execution order and catch ordering dependencies.
137+
- Added test IDs for GPU array tests to facilitate easier debugging.
138+
- Improved test determinism by replacing OS sleep with GPU nanosleep kernel in event timing tests.
139+
- Added coverage automation for tracking code coverage metrics.
140+
- Expanded Windows test matrix to cover different driver modes (WDDM/MCDM) and GPU types.
