
Add CUDA process checkpointing helpers#1983

Merged
kkraus14 merged 13 commits into NVIDIA:main from kkraus14:kk/issue-1343-cuda-checkpointing
May 5, 2026

Conversation

@kkraus14
Collaborator

@kkraus14 kkraus14 commented Apr 28, 2026

Summary

  • add a dedicated cuda.core.checkpoint module for CUDA process checkpointing APIs while keeping cuda.core.system focused on CUDA system/NVML capabilities
  • expose a narrow runtime API via checkpoint.Process(pid): read-only pid, state, restore_thread_id, lock, checkpoint, restore, and unlock
  • keep the checkpoint module public surface limited to Process; the state return type lives in cuda.core.typing.ProcessStateT and is rendered in the private API docs
  • map Process.state from the CUDA driver CUprocessState enumerators rather than raw integer values
  • support restore-time GPU UUID remapping using either driver CUuuid values or Device.uuid strings; migration docs and tests now describe the stricter kernel-mode-driver visibility requirement rather than user-space CUDA visibility
  • document the coordinator/target-process model, Linux permission requirements such as CAP_SYS_PTRACE, the CRIU/CPU-process-image boundary, restore-thread requirement, and persistence mode/cuInit restore requirement
  • validate checkpoint API availability lazily and cache the successful check, covering the cuda-bindings version, required binding symbols, and CUDA driver version
  • consolidate checkpoint driver call handling in one boundary that translates missing checkpoint symbols and unsupported checkpoint CUDA results into a checkpoint-specific RuntimeError
  • re-enable checkpoint driver coverage in CI by running driver-backed lifecycle tests in isolated coordinator/target subprocess scenarios with parent-side timeouts; migration tests skip when the CUDA device view is masked because the mapping cannot be proven KMD-complete
  • run checkpoint scenario and target subprocesses through python -I cuda_core/tests/test_checkpoint.py ... as a file-backed child entrypoint, avoiding manual cwd/PYTHONPATH rewriting and stringified child Python while preserving installed-wheel imports in CI
  • treat accepted-but-no-op GPU migration as a hardware/driver skip for both swap and rotation scenarios while still failing if the driver moves a context to an unexpected GPU
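The coordinator-side lifecycle summarized above can be sketched roughly as follows (a minimal sketch, not the shipped test code; `target_pid` is a placeholder for a live target process that already holds an initialized CUDA context, and the coordinator needs the documented permissions such as CAP_SYS_PTRACE):

```python
# Minimal sketch of the coordinator/target model described above.
# Assumes cuda.core with the checkpoint module is installed and the
# target process (target_pid) has an initialized CUDA context.

def checkpoint_and_restore(target_pid: int) -> None:
    # Imported lazily; availability of the checkpoint driver API is
    # validated on first use and cached, per the summary above.
    from cuda.core import checkpoint

    proc = checkpoint.Process(target_pid)
    print(proc.pid, proc.state, proc.restore_thread_id)  # read-only views

    proc.lock()        # freeze the target's CUDA work
    proc.checkpoint()  # move GPU state into a CPU-side process image
    proc.restore()     # bring GPU state back; gpu_mapping=... remaps GPUs
    proc.unlock()      # let the target resume
```

Note the state machine is strict (see the discussion below the fold): an unlock straight from the checkpointed state is rejected by the driver, so the restore must come before the unlock.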

Closes #1343

Testing

  • git commit -S pre-commit hooks: ruff, formatting, SPDX, whitespace, RST, and related checks passed
  • git diff --check
  • pixi run ruff check cuda_core/tests/test_checkpoint.py (All checks passed)
  • pixi run --manifest-path cuda_core pytest cuda_core/tests/test_checkpoint.py cuda_core/tests/test_typing_imports.py (10 passed, 6 skipped) after the file-backed subprocess entrypoint simplification
  • pixi run ruff check cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py (All checks passed)
  • pixi run --manifest-path cuda_core -e docs docs-build-latest (Sphinx build succeeded)
  • previous broader local run: pixi run --manifest-path cuda_core pytest cuda_core/tests --ignore=cuda_core/tests/cython (2798 passed, 352 skipped, 2 failed)

The two Python-suite failures in the broader run are existing local NVML/system environment failures and are not related to this checkpointing change:

  • cuda_core/tests/system/test_system_device.py::test_get_inforom_version returns an empty InfoROM board part number locally.
  • cuda_core/tests/system/test_system_system.py::test_get_process_name hits an NVML UTF-8 decode error locally.

Additional local build/test notes:

  • pixi run --manifest-path cuda_core test stops before pytest in the existing build-cython-tests pre-step because cuda_core/tests/cython/test_get_cuda_native_handle.pyx cannot find the expected cuda.bindings .pxd files in this local pixi environment.
  • pixi build from cuda_core reaches the existing native cuda-core extension build and then fails with CUDA 12.9 headers that do not declare CU_MEM_ALLOCATION_TYPE_MANAGED; this is in the existing graph/managed-memory extension build path and is not checkpoint-specific.

CI note:

  • The previous CI attempt on 8192df67 exposed two unrelated/runner issues: one Windows py3.12 build failed inside the shared mini-CTK cache setup before any cuda.core build step, and CUDA 13.x GPU test jobs were canceled after the old in-process checkpoint test hung in cuCheckpointProcessCheckpoint.
  • The current head removes the broad CI skip. Driver-backed checkpoint lifecycle tests now run through isolated subprocess coordinator/target scenarios, and the parent pytest process can kill and skip a scenario that times out instead of letting the CI job hang.
  • The CI attempt on 7a2e6830 passed the checkpoint lifecycle tests and only failed the two multi-GPU jobs. Both failed in test_rotation_migrates_context because the driver accepted the rotation mapping but reported the target context still on the original UUID; the swap migration test already observed and skipped the same no-op behavior. The current head extends that no-op skip handling to rotation while preserving the assertion for any unexpected migrated UUID.

Current Test Implementation

The checkpoint tests in cuda_core/tests/test_checkpoint.py are real driver/GPU tests, not broad mocks.

Input validation and public-symbol checks run everywhere. Driver-backed lifecycle tests create a target process that initializes a real CUDA context, then a coordinator scenario calls checkpoint.Process(target.pid) and exercises state, restore_thread_id, lock, checkpoint, restore, and unlock through the real driver. The parent pytest process enforces a timeout around each scenario so unsupported driver/hardware paths skip cleanly instead of hanging the test job.

The scenario subprocess is launched as python -I cuda_core/tests/test_checkpoint.py scenario <name>, and its target process is launched as python -I cuda_core/tests/test_checkpoint.py target <device_index>. Python isolated mode ignores PYTHON* environment variables and omits the script/current directory from import resolution, so wheel CI imports the installed package without manual cwd or PYTHONPATH surgery. The child processes still inherit non-Python CUDA environment such as driver/library paths and device visibility.
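The launch-and-timeout pattern described here can be sketched generically (a simplified sketch: the real harness launches `cuda_core/tests/test_checkpoint.py` with scenario/target arguments, while `-c` stands in for the child entrypoint below):

```python
import subprocess
import sys

def run_isolated_child(code: str, timeout_s: float = 5.0) -> str:
    """Run a child in Python isolated mode (-I) with a parent-side timeout.

    -I ignores PYTHON* environment variables and drops the script/current
    directory from sys.path, so the child imports the installed package
    rather than the source checkout. Non-Python environment (driver paths,
    device visibility) is still inherited.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; the real harness turns
        # this into a pytest skip instead of letting the CI job hang.
        return "TIMEOUT"
    return result.stdout.strip()

print(run_isolated_child("print('ok')"))  # -> ok
```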

Migration tests require at least two same-chip GPUs and an unmasked CUDA device view. They build full UUID mappings using Device.uuid strings, then exercise rotation and pair-swap migration patterns through Process.restore(gpu_mapping=...) in the isolated target process. They skip gracefully when CUDA_VISIBLE_DEVICES is set, when the local hardware lacks a same-chip GPU pair, when the driver rejects checkpoint migration, or when the driver accepts the mapping but leaves the target context on the original GPU.
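The rotation and pair-swap patterns exercised by these tests reduce to pure mapping construction over GPU UUIDs (an illustrative sketch; real keys and values come from Device.uuid strings, and these helper names are hypothetical, not part of cuda.core):

```python
def rotation_mapping(uuids: list[str]) -> dict[str, str]:
    # Each GPU restores onto the next one in the list, wrapping around.
    return {src: uuids[(i + 1) % len(uuids)] for i, src in enumerate(uuids)}

def swap_mapping(uuids: list[str]) -> dict[str, str]:
    # Adjacent pairs swap; requires an even count of same-chip GPUs.
    assert len(uuids) % 2 == 0
    out: dict[str, str] = {}
    for a, b in zip(uuids[::2], uuids[1::2]):
        out[a], out[b] = b, a
    return out

print(rotation_mapping(["u0", "u1", "u2"]))
# {'u0': 'u1', 'u1': 'u2', 'u2': 'u0'}
```

Either mapping would then be passed as Process.restore(gpu_mapping=...) by the isolated target scenario.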

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label Apr 28, 2026
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 396a2ca to 7c66b2f on April 28, 2026 at 16:28
@kkraus14
Collaborator Author

/ok to test

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

/ok to test

@kkraus14, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kkraus14
Collaborator Author

/ok to test 7c66b2f

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch 2 times, most recently from 779c697 to 82f816c on April 28, 2026 at 16:44
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kkraus14
Collaborator Author

/ok to test

@github-actions

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 82f816c to 25455d8 on April 28, 2026 at 18:22
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 25455d8 to aaf1418 on April 28, 2026 at 19:14
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from aaf1418 to d8a2031 on April 28, 2026 at 20:24
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 marked this pull request as ready for review April 29, 2026 13:59
@kkraus14 kkraus14 added the feature New feature or request label Apr 29, 2026
@kkraus14 kkraus14 added this to the cuda.core v1.0.0 milestone Apr 29, 2026
@kkraus14 kkraus14 self-assigned this Apr 29, 2026
@rparolin rparolin requested review from leofang and rparolin April 29, 2026 17:44
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 4, 2026

/ok to test

@kkraus14, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 8192df6

@leofang
Member

leofang commented May 4, 2026

Copy-pasta from my bot, with internal info redacted.

What I did

  • Discovered GPU migration doesn't work on this machine (heterogeneous GPUs + known driver bugs)
  • Searched internal resources (Confluence, gdrive, NVBugs, P4) to understand the root cause
  • Rewrote the full test suite with real GPU tests and no mocks
  • Pushed 4 commits to Keith's PR branch

Test suite structure (13 tests)

  • 5 input validation (no GPU needed): invalid pid types/values, public symbols
  • 6 lifecycle (single GPU, real driver): state transitions at every step, restore_thread_id, lock/unlock with timeouts, full checkpoint/restore cycle
  • 2 migration (≥2 same-chip GPUs): rotation and swap — exercise real driver API, skip gracefully when unsupported

Lessons learned

  1. nvidia-smi order ≠ CUDA device order. We wasted time testing the wrong GPU pair because CUDA_VISIBLE_DEVICES=0,3 selected Ada + A6000 (different chips) instead of two A6000s. Always use Device.uuid to identify GPUs, never assume nvidia-smi indices match CUDA indices.

  2. CUDA_VISIBLE_DEVICES breaks checkpoint migration. Tested locally: even a 2-pair identity mapping fails with CUDA_ERROR_INVALID_VALUE when only 2 of 4 GPUs are visible, while the same 4-pair identity mapping succeeds with all GPUs visible. The CUDA driver docs state the mapping must cover "every checkpointed GPU," and the driver records all physically attached GPUs at checkpoint time — not just the CUDA_VISIBLE_DEVICES subset.

  3. Migration requires exact chip match. Tested locally: any mapping that remaps between different architectures (e.g. Ada ↔️ A6000) is rejected with CUDA_ERROR_INVALID_VALUE. Same-chip mappings (A6000 ↔️ A6000) are accepted. The public CUDA docs say "the GPU to restore onto needs to be of the same chip type as the old GPU."

  4. Migration may be a no-op on some driver versions. Tested locally: same-chip A6000 swap is accepted (no error) but the context device UUID doesn't change — the driver silently no-ops. This is consistent with NVBug 5437334 (api_reverse_gpu_pairs fails on GA100x4) and NVBug 5544504 (customer report: "restore on a different GPU — the checkpointed process does not appear in the GPU process list").

  5. Checkpoint state machine is strict. Can't unlock from "checkpointed" state — must restore first (CUDA_ERROR_ILLEGAL_STATE). Can't corrupt GPU memory between checkpoint and restore within the same process (CUDA APIs are frozen). The overwrite-then-restore scenario requires an external coordinator (CRIU).

  6. Use cuda.core APIs in tests, not raw bindings. Device().uuid for current device, Device.uuid for mapping keys, Device.get_all_devices() for enumeration. The implementation (checkpoint.py) handles the string-to-CUuuid conversion internally.
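The coverage rule from lesson 2 (the mapping must span every checkpointed GPU, i.e. the KMD-visible set, not the CUDA_VISIBLE_DEVICES subset) can be captured in a small validation helper. This is a hypothetical pure-Python sketch, not part of cuda.core; the UUID strings are illustrative:

```python
def validate_gpu_mapping(kmd_visible_uuids: list[str],
                         gpu_mapping: dict[str, str]) -> None:
    """Reject restore mappings that do not cover every KMD-visible GPU.

    Mirrors the behavior observed locally: a partial mapping is rejected
    (CUDA_ERROR_INVALID_VALUE from the driver) even if it is an identity
    mapping over the CUDA-visible subset.
    """
    missing = set(kmd_visible_uuids) - set(gpu_mapping)
    if missing:
        raise ValueError(f"mapping misses checkpointed GPUs: {sorted(missing)}")
    unknown = set(gpu_mapping.values()) - set(kmd_visible_uuids)
    if unknown:
        raise ValueError(f"mapping targets unknown GPUs: {sorted(unknown)}")

# A full 4-GPU identity mapping passes; a 2-of-4 mapping raises,
# matching the driver errors described in lesson 2.
uuids = ["GPU-a", "GPU-b", "GPU-c", "GPU-d"]
validate_gpu_mapping(uuids, {u: u for u in uuids})  # ok
```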

@leofang
Member

leofang commented May 4, 2026

Pushed e9c03de because the real tests hang in the CI...

@leofang
Member

leofang commented May 4, 2026

/ok to test e9c03de

@leofang leofang requested review from Andy-Jost and leofang May 4, 2026 17:35
leofang and others added 2 commits May 4, 2026 16:49
cuCheckpointProcessCheckpoint hangs on CI runners (ephemeral VM +
container), causing all CUDA 13.x test jobs to time out.  Skip the
tests that call into the checkpoint driver when the CI environment
variable is set.  Input validation tests still run everywhere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from e9c03de to 8f798f4 on May 4, 2026 at 20:50
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 8f798f4

@leofang
Member

leofang commented May 4, 2026

@kkraus14 I noticed your force-push (to stringify the tests for subprocess) still includes my WAR (checking the env var CI), so the tests are still not run in the CI environment. Is that intended?

Member

@leofang leofang left a comment


Since I also made some changes, would be nice for @Andy-Jost to re-review 🙂

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

Addressed Leo’s latest review comments in signed commit 376acc7f18.

On CI: the remaining CI skip was a temporary workaround, not the intended final state. I removed the broad CI skip after moving all driver-backed lifecycle tests onto the isolated subprocess coordinator/target harness. The parent pytest process enforces timeouts and kills the scenario process group on timeout, so CI should get checkpoint coverage without repeating the previous job-level hang. Migration tests still skip when CUDA_VISIBLE_DEVICES is set or when the hardware/driver cannot provide valid same-chip migration, because the restore mapping must cover the KMD-visible GPU set.

Other updates:

  • Process.pid is now read-only via private _pid storage plus a property, with test coverage.
  • _call_driver now owns checkpoint missing-symbol and unsupported-result translation in one place.
  • Docs and migration tests now describe/use the KMD visibility rule instead of the looser CUDA-visible wording.
  • The PR description has been updated for the current implementation and validation state.

Local validation passed: ruff, git diff --check, focused pytest (10 passed, 6 skipped), and docs-build-latest.

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 376acc7

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

Follow-up CI fix pushed in signed commit 7a2e683059.

The first re-enabled CI run showed all eight checkpoint scenario tests failing in wheel jobs before reaching the driver: the subprocess imported cuda.core from the source checkout, where generated cuda.core._version is absent, instead of importing the installed wheel. The scenario harness now runs from a neutral temp cwd and removes the checkout cuda_core source root from PYTHONPATH, so wheel jobs should exercise the installed package. Local focused validation still passes: 10 passed, 6 skipped.

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

/ok to test 7a2e683

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

Triage for the two failed multi-GPU jobs on 7a2e6830:

  • Both failures were isolated to tests/test_checkpoint.py::TestCheckpointGpuMigration::test_rotation_migrates_context.
  • The lifecycle checkpoint tests passed on those runners.
  • The driver accepted the rotation mapping, but the target process stayed on the original GPU UUID. The swap migration test already observed the same accepted-but-no-op behavior and skipped it as a hardware/driver limitation.
  • I updated rotation to use the same no-op skip path while keeping the assertion for any unexpected migrated UUID.

Local validation on the new head:

  • git diff --check
  • pixi run ruff check cuda_core/tests/test_checkpoint.py
  • pixi run --manifest-path cuda_core pytest cuda_core/tests/test_checkpoint.py cuda_core/tests/test_typing_imports.py (10 passed, 6 skipped)

The new signed/verified head is 8aeb8e82e1d09661289e0a4c588af6fb5bf862fc.

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

/ok to test 8aeb8e8

Contributor

@Andy-Jost Andy-Jost left a comment


LGTM!

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

/ok to test b9fa2a1

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

@Andy-Jost @leofang PTAL: I reworked the checkpoint test subprocess launching to avoid stringifying all of the code, which should hopefully make it easier to maintain.

@kkraus14 kkraus14 merged commit 98df790 into NVIDIA:main May 5, 2026
95 checks passed
github-actions Bot pushed a commit that referenced this pull request May 6, 2026
Removed preview folders for the following PRs:
- PR #1983
- PR #2006
- PR #2024

Labels

  • cuda.core: Everything related to the cuda.core module
  • feature: New feature or request
  • P1: Medium priority - Should do

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support CUDA Checkpointing

4 participants