Skip to content

cuda.core.system: Add MIG-related APIs#1916

Open
mdboom wants to merge 9 commits intoNVIDIA:mainfrom
mdboom:cuda-core-system-dask-cuda
Open

cuda.core.system: Add MIG-related APIs#1916
mdboom wants to merge 9 commits intoNVIDIA:mainfrom
mdboom:cuda-core-system-dask-cuda

Conversation

@mdboom
Copy link
Copy Markdown
Contributor

@mdboom mdboom commented Apr 15, 2026

These APIs are required by dask-cuda.

@mdboom mdboom added this to the cuda.core v1.0.0 milestone Apr 15, 2026
@mdboom mdboom self-assigned this Apr 15, 2026
@mdboom mdboom added the cuda.core Everything related to the cuda.core module label Apr 15, 2026
@mdboom mdboom requested a review from cpcloud April 15, 2026 17:20
@github-actions
Copy link
Copy Markdown

@mdboom mdboom force-pushed the cuda-core-system-dask-cuda branch from b877b46 to f29b935 Compare April 20, 2026 17:21
@mdboom mdboom requested a review from rparolin April 20, 2026 17:28
Comment thread cuda_core/cuda/core/system/_device.pyx Outdated
device, as a 5 part hexadecimal string, that augments the immutable,
board serial identifier.
"""
# NVML UUIDs have a `GPU-` or `MIG-` prefix. We remove that here.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong suggestion to move this comment into the docstring, mainly so that it appears in the online documentation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Updated.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: It'd be nice to be consistent; or simply remove the comment (preferred).

Lower-case here:

        In the upstream NVML C++ API, the UUID includes a ``gpu-`` or ``mig-``

Upper-case here:

        # NVML UUIDs have a `GPU-` or `MIG-` prefix.  We remove that here.

@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented Apr 20, 2026

Generated with Cursor GPT-5.4 Extra High Fast

I did not check these findings manually. (Recently such findings have become generally highly reliable.)


  1. High: cuda_core/cuda/core/system/_mig.pxi:126 uses self._handle inside MigInfo.parent, but MigInfo only stores self._device. Accessing device.mig.parent will raise AttributeError instead of returning the parent device.

  2. High: the get_device_count -> device_count rename was only half-applied. cuda_core/cuda/core/system/_mig.pxi:176 still calls self.get_device_count(), and cuda_core/tests/system/test_system_device.py:745 still calls mig.get_device_count(). On any system that exercises the MIG path, both mig.get_all_devices() and the test will fail with AttributeError.

  3. Medium: the MIG mode logic is inverted for normal callers. system.Device.get_all_devices() returns top-level NVML devices from cuda_core/cuda/core/system/_device.pyx:296, but cuda_core/cuda/core/system/_mig.pxi:46, cuda_core/cuda/core/system/_mig.pxi:67, and cuda_core/cuda/core/system/_mig.pxi:92 gate mode/pending_mode/setter on is_mig_device. That means device.mig.mode reports False and the setter raises on the parent GPU devices that actually own MIG mode, so the new API does not expose enabled MIG mode through the normal enumeration path.

@mdboom mdboom requested a review from rwgk April 21, 2026 12:55
@mdboom
Copy link
Copy Markdown
Contributor Author

mdboom commented Apr 21, 2026

Generated with Cursor GPT-5.4 Extra High Fast

I did not check these findings manually. (Recently such findings have become generally highly reliable.)

  1. High: cuda_core/cuda/core/system/_mig.pxi:126 uses self._handle inside MigInfo.parent, but MigInfo only stores self._device. Accessing device.mig.parent will raise AttributeError instead of returning the parent device.
  2. High: the get_device_count -> device_count rename was only half-applied. cuda_core/cuda/core/system/_mig.pxi:176 still calls self.get_device_count(), and cuda_core/tests/system/test_system_device.py:745 still calls mig.get_device_count(). On any system that exercises the MIG path, both mig.get_all_devices() and the test will fail with AttributeError.
  3. Medium: the MIG mode logic is inverted for normal callers. system.Device.get_all_devices() returns top-level NVML devices from cuda_core/cuda/core/system/_device.pyx:296, but cuda_core/cuda/core/system/_mig.pxi:46, cuda_core/cuda/core/system/_mig.pxi:67, and cuda_core/cuda/core/system/_mig.pxi:92 gate mode/pending_mode/setter on is_mig_device. That means device.mig.mode reports False and the setter raises on the parent GPU devices that actually own MIG mode, so the new API does not expose enabled MIG mode through the normal enumeration path.

1 and 2 are good catches. 3 is word salad, but it probably makes sense to remove the gating until we understand how the API works.

@rwgk
Copy link
Copy Markdown
Contributor

rwgk commented Apr 21, 2026

Using different kinds of agents seems super useful, but "fighting" surely doesn't make much sense.

Maybe we should have a team discussion how we could make better use of agents. Technically, I think it'd be most efficient to have multiple agents/reviewers actively work together on a PR (e.g. these are the fixes my agent worked out in a couple minutes: commit 94677f6). That's a big shift away from traditional reviews though.

  Findings
  • High: cuda_core/cuda/core/system/_mig.pxi:115 still calls nvml.device_get_handle_from_mig_device_handle(...), but the bindings only
    expose device_get_device_handle_from_mig_device_handle in cuda_bindings/cuda/bindings/nvml.pxd:412. So device.mig.parent still fails
     at runtime. This partially fixes my first finding, but it does not close it.
  • High: cuda_core/tests/system/test_system_device.py:745 still calls mig.get_device_count(), while MigInfo now only exposes the
    device_count property in cuda_core/cuda/core/system/_mig.pxi:88. On any machine that actually enters that MIG branch, the test still
     fails. My second finding is still open in the test.
  • Medium: cuda_core/tests/system/test_system_device.py:741 still guards the important MIG assertions behind if mig.is_mig_device,
    while system.Device.get_all_devices() in cuda_core/cuda/core/system/_device.pyx:299 enumerates top-level devices. That means the
    test still does not cover the normal device.mig.mode / pending_mode / device_count path for parent GPUs. The runtime logic looks
    improved, but the testing half of my third finding is still unresolved.

  Assumptions
  • Reviewed current HEAD 58b4b21bac.
  • Did not run local tests.

  Change context
  • The branch does address part of the earlier feedback: mode and pending_mode no longer early-return on non-MIG handles, parent now
    uses self._device, and get_all_devices() now uses self.device_count.
  • But no, I would not say all three earlier findings are addressed yet. One is still a runtime bug, one is still a stale test failure,
     and the last is only partially fixed because the test still misses the normal caller path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.core Everything related to the cuda.core module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants