cuda.core.system: Add MIG-related APIs by mdboom · Pull Request #1916 · NVIDIA/cuda-python

mdboom · 2026-04-15T17:10:29Z

These APIs are required by dask-cuda.

rwgk · 2026-04-20T23:20:46Z

+        device, as a 5 part hexadecimal string, that augments the immutable,
+        board serial identifier.
+        """
+        # NVML UUIDs have a `GPU-` or `MIG-` prefix.  We remove that here.


Strong suggestion to move this comment into the docstring, mainly so that it appears in the online documentation.

Got it. Updated.

Minor nit: It'd be nice to be consistent; or simply remove the comment (preferred).

Lower-case here:

In the upstream NVML C++ API, the UUID includes a ``gpu-`` or ``mig-``

Upper-case here:

# NVML UUIDs have a `GPU-` or `MIG-` prefix. We remove that here.
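For context on the prefix being discussed, here is a minimal sketch of the stripping step. The helper name `strip_nvml_uuid_prefix` is hypothetical and does not come from the PR. The sketch assumes the prefix is `GPU-` or `MIG-` as in the code comment, and it matches case-insensitively only so that either spelling mentioned above is handled.

```python
def strip_nvml_uuid_prefix(uuid: str) -> str:
    """Return ``uuid`` without a leading GPU- or MIG- prefix.

    Hypothetical helper for illustration; not the code from the PR.
    """
    for prefix in ("GPU-", "MIG-"):
        # Compare case-insensitively so "gpu-" and "GPU-" are both removed.
        if uuid[: len(prefix)].upper() == prefix:
            return uuid[len(prefix):]
    # No known prefix: return the input unchanged.
    return uuid


# The UUID bodies below are made-up placeholders, not real device identifiers.
stripped_gpu = strip_nvml_uuid_prefix("GPU-1234")
stripped_mig = strip_nvml_uuid_prefix("mig-abcd")
```

Whichever casing the docstring settles on, documenting the stripped form in one place avoids the mismatch noted above.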

rwgk · 2026-04-20T23:41:36Z

Generated with Cursor GPT-5.4 Extra High Fast

I did not check these findings manually. (Recently such findings have become generally highly reliable.)

High: cuda_core/cuda/core/system/_mig.pxi:126 uses self._handle inside MigInfo.parent, but MigInfo only stores self._device. Accessing device.mig.parent will raise AttributeError instead of returning the parent device.
High: the get_device_count -> device_count rename was only half-applied. cuda_core/cuda/core/system/_mig.pxi:176 still calls self.get_device_count(), and cuda_core/tests/system/test_system_device.py:745 still calls mig.get_device_count(). On any system that exercises the MIG path, both mig.get_all_devices() and the test will fail with AttributeError.
Medium: the MIG mode logic is inverted for normal callers. system.Device.get_all_devices() returns top-level NVML devices from cuda_core/cuda/core/system/_device.pyx:296, but cuda_core/cuda/core/system/_mig.pxi:46, cuda_core/cuda/core/system/_mig.pxi:67, and cuda_core/cuda/core/system/_mig.pxi:92 gate mode/pending_mode/setter on is_mig_device. That means device.mig.mode reports False and the setter raises on the parent GPU devices that actually own MIG mode, so the new API does not expose enabled MIG mode through the normal enumeration path.
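The first two findings describe the same failure mode, which can be reproduced with a stand-in class. This is a sketch only: `MigInfoSketch` and `raises_attribute_error` are hypothetical names, and the class imitates just the attribute layout described in the findings, not the real `cuda.core.system` code.

```python
class MigInfoSketch:
    """Stand-in for the reviewed class; not the real implementation."""

    def __init__(self, device):
        self._device = device  # the only attribute this object stores

    @property
    def device_count(self):
        # After the rename, the count is a property, not a method.
        return 0

    @property
    def parent(self):
        # Mirrors finding 1: `_handle` was never set on this object.
        return self._handle

    def get_all_devices(self):
        # Mirrors finding 2: the old method name is gone after the rename.
        return list(range(self.get_device_count()))


def raises_attribute_error(fn):
    """Return True if calling ``fn`` raises AttributeError."""
    try:
        fn()
    except AttributeError:
        return True
    return False


info = MigInfoSketch(device=object())
parent_fails = raises_attribute_error(lambda: info.parent)
enumerate_fails = raises_attribute_error(info.get_all_devices)
print(parent_fails, enumerate_fails)  # prints "True True"
```

Both errors only surface when the property or method is actually called, which is why they would go unnoticed on machines that never enter the MIG path.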

mdboom · 2026-04-21T15:43:35Z

Generated with Cursor GPT-5.4 Extra High Fast

I did not check these findings manually. (Recently such findings have become generally highly reliable.)

High: cuda_core/cuda/core/system/_mig.pxi:126 uses self._handle inside MigInfo.parent, but MigInfo only stores self._device. Accessing device.mig.parent will raise AttributeError instead of returning the parent device.

High: the get_device_count -> device_count rename was only half-applied. cuda_core/cuda/core/system/_mig.pxi:176 still calls self.get_device_count(), and cuda_core/tests/system/test_system_device.py:745 still calls mig.get_device_count(). On any system that exercises the MIG path, both mig.get_all_devices() and the test will fail with AttributeError.

Medium: the MIG mode logic is inverted for normal callers. system.Device.get_all_devices() returns top-level NVML devices from cuda_core/cuda/core/system/_device.pyx:296, but cuda_core/cuda/core/system/_mig.pxi:46, cuda_core/cuda/core/system/_mig.pxi:67, and cuda_core/cuda/core/system/_mig.pxi:92 gate mode/pending_mode/setter on is_mig_device. That means device.mig.mode reports False and the setter raises on the parent GPU devices that actually own MIG mode, so the new API does not expose enabled MIG mode through the normal enumeration path.

1 and 2 are good catches. 3 is word salad, but it probably makes sense to remove the gating until we understand how the API works.


rwgk · 2026-04-21T17:26:19Z

Using different kinds of agents seems super useful, but "fighting" surely doesn't make much sense.

Maybe we should have a team discussion about how we could make better use of agents. Technically, I think it'd be most efficient to have multiple agents/reviewers actively work together on a PR (e.g., these are the fixes my agent worked out in a couple of minutes: commit 94677f6). That's a big shift away from traditional reviews, though.

Findings
• High: cuda_core/cuda/core/system/_mig.pxi:115 still calls nvml.device_get_handle_from_mig_device_handle(...), but the bindings only expose device_get_device_handle_from_mig_device_handle in cuda_bindings/cuda/bindings/nvml.pxd:412. So device.mig.parent still fails at runtime. This partially fixes my first finding, but it does not close it.
• High: cuda_core/tests/system/test_system_device.py:745 still calls mig.get_device_count(), while MigInfo now only exposes the device_count property in cuda_core/cuda/core/system/_mig.pxi:88. On any machine that actually enters that MIG branch, the test still fails. My second finding is still open in the test.
• Medium: cuda_core/tests/system/test_system_device.py:741 still guards the important MIG assertions behind if mig.is_mig_device, while system.Device.get_all_devices() in cuda_core/cuda/core/system/_device.pyx:299 enumerates top-level devices. That means the test still does not cover the normal device.mig.mode / pending_mode / device_count path for parent GPUs. The runtime logic looks improved, but the testing half of my third finding is still unresolved.

Assumptions
• Reviewed current HEAD 58b4b21bac.
• Did not run local tests.

Change context
• The branch does address part of the earlier feedback: mode and pending_mode no longer early-return on non-MIG handles, parent now uses self._device, and get_all_devices() now uses self.device_count.
• But no, I would not say all three earlier findings are addressed yet. One is still a runtime bug, one is still a stale test failure, and the last is only partially fixed because the test still misses the normal caller path.
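The test-coverage gap in the third finding can be illustrated with fake objects. This is a sketch under stated assumptions: `FakeMig`, `FakeDevice`, `gated_checks`, and `ungated_checks` are hypothetical names, and the attribute names are borrowed from the findings above rather than verified against the real API.

```python
from dataclasses import dataclass


@dataclass
class FakeMig:
    """Hypothetical stand-in for the object returned by ``device.mig``."""
    is_mig_device: bool
    mode: bool
    device_count: int


@dataclass
class FakeDevice:
    """Hypothetical stand-in for a top-level (parent) GPU device."""
    mig: FakeMig


def gated_checks(devices):
    """Count devices whose assertions run when gated on is_mig_device."""
    checked = 0
    for device in devices:
        mig = device.mig
        if mig.is_mig_device:
            assert isinstance(mig.mode, bool)
            checked += 1
    return checked


def ungated_checks(devices):
    """Count devices whose assertions run without the gate."""
    checked = 0
    for device in devices:
        mig = device.mig
        assert isinstance(mig.mode, bool)
        assert mig.device_count >= 0
        checked += 1
    return checked


# Enumeration yields parent GPUs, so is_mig_device is False for each of them.
parents = [
    FakeDevice(FakeMig(is_mig_device=False, mode=True, device_count=2)),
    FakeDevice(FakeMig(is_mig_device=False, mode=False, device_count=0)),
]
print(gated_checks(parents), ungated_checks(parents))  # prints "0 2"
```

With the gate in place, no assertion runs for parent devices, so a test of this shape passes without ever touching mode or device_count on the devices that callers normally see.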


mdboom · 2026-04-21T20:38:44Z

Ok. Should be addressed now.

github-actions · 2026-04-22T11:54:59Z

Doc Preview CI
Preview removed because the pull request was closed or merged.

mdboom added this to the cuda.core v1.0.0 milestone Apr 15, 2026

mdboom self-assigned this Apr 15, 2026

mdboom added the cuda.core Everything related to the cuda.core module label Apr 15, 2026

mdboom requested a review from cpcloud April 15, 2026 17:20

This was referenced Apr 15, 2026

Migrate from pynvml to cuda.core.system rapidsai/dask-cuda#1645

Open

EPIC: Port Rapids projects from pynvml to cuda.core.system #1915

Open

cuda.core.system: Add MIG-related APIs

f29b935

mdboom force-pushed the cuda-core-system-dask-cuda branch from b877b46 to f29b935 Compare April 20, 2026 17:21

mdboom added 3 commits April 20, 2026 13:23

Add "need"

884c0a7

Add missing file

b97924a

Make properties

32dff8e

mdboom requested a review from rparolin April 20, 2026 17:28

mdboom added 2 commits April 20, 2026 15:12

Fix test

2053c11

Fix test

4d6e3a2

rwgk reviewed Apr 20, 2026

Elaborate in the docstring

95f5c1b

mdboom requested a review from rwgk April 21, 2026 12:55

mdboom added 2 commits April 21, 2026 11:45

Merge remote-tracking branch 'upstream/main' into cuda-core-system-dask-cuda

ad7fd63

Address comments in PR

58b4b21

mdboom added 2 commits April 21, 2026 16:37

Address comments in PR

6db93f4

Merge remote-tracking branch 'upstream/main' into cuda-core-system-dask-cuda

7985112

rwgk approved these changes Apr 21, 2026

mdboom merged commit b162f64 into NVIDIA:main Apr 22, 2026
93 of 94 checks passed

mdboom merged 11 commits into NVIDIA:main from mdboom:cuda-core-system-dask-cuda
