
Add CUDA process checkpointing helpers#1983

Merged
kkraus14 merged 13 commits into NVIDIA:main from kkraus14:kk/issue-1343-cuda-checkpointing
May 5, 2026

Conversation

@kkraus14
Collaborator

@kkraus14 kkraus14 commented Apr 28, 2026

Summary

  • add a dedicated cuda.core.checkpoint module for CUDA process checkpointing APIs while keeping cuda.core.system focused on CUDA system/NVML capabilities
  • expose a narrow runtime API via checkpoint.Process(pid): read-only pid, state, restore_thread_id, lock, checkpoint, restore, and unlock
  • keep the checkpoint module public surface limited to Process; the state return type lives in cuda.core.typing.ProcessStateT and is rendered in the private API docs
  • map Process.state from the CUDA driver CUprocessState enumerators rather than raw integer values
  • support restore-time GPU UUID remapping using either driver CUuuid values or Device.uuid strings; migration docs and tests now describe the stricter kernel-mode-driver visibility requirement rather than user-space CUDA visibility
  • document the coordinator/target-process model, Linux permission requirements such as CAP_SYS_PTRACE, the CRIU/CPU-process-image boundary, restore-thread requirement, and persistence mode/cuInit restore requirement
  • validate checkpoint API availability lazily and cache the successful check, covering the cuda-bindings version, required binding symbols, and CUDA driver version
  • consolidate checkpoint driver call handling in one boundary that translates missing checkpoint symbols and unsupported checkpoint CUDA results into a checkpoint-specific RuntimeError
  • re-enable checkpoint driver coverage in CI by running driver-backed lifecycle tests in isolated coordinator/target subprocess scenarios with parent-side timeouts; migration tests skip when the CUDA device view is masked because the mapping cannot be proven KMD-complete
  • run checkpoint scenario and target subprocesses through python -I cuda_core/tests/test_checkpoint.py ... as a file-backed child entrypoint, avoiding manual cwd/PYTHONPATH rewriting and stringified child Python while preserving installed-wheel imports in CI
  • treat accepted-but-no-op GPU migration as a hardware/driver skip for both swap and rotation scenarios while still failing if the driver moves a context to an unexpected GPU
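The coordinator-side lifecycle summarized above can be sketched roughly as follows (a minimal sketch, not the shipped test code; `target_pid` is a placeholder for a live target process that already holds an initialized CUDA context, and the coordinator needs the documented permissions such as CAP_SYS_PTRACE):

```python
# Minimal sketch of the coordinator/target model described above.
# Assumes cuda.core with the checkpoint module is installed and the
# target process (target_pid) has an initialized CUDA context.

def checkpoint_and_restore(target_pid: int) -> None:
    # Imported lazily; availability of the checkpoint driver API is
    # validated on first use and cached, per the summary above.
    from cuda.core import checkpoint

    proc = checkpoint.Process(target_pid)
    print(proc.pid, proc.state, proc.restore_thread_id)  # read-only views

    proc.lock()        # freeze the target's CUDA work
    proc.checkpoint()  # move GPU state into a CPU-side process image
    proc.restore()     # bring GPU state back; gpu_mapping=... remaps GPUs
    proc.unlock()      # let the target resume
```

Note the state machine is strict (see the discussion below the fold): an unlock straight from the checkpointed state is rejected by the driver, so the restore must come before the unlock.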

Closes #1343

Testing

  • git commit -S pre-commit hooks: ruff, formatting, SPDX, whitespace, RST, and related checks passed
  • git diff --check
  • pixi run ruff check cuda_core/tests/test_checkpoint.py (All checks passed)
  • pixi run --manifest-path cuda_core pytest cuda_core/tests/test_checkpoint.py cuda_core/tests/test_typing_imports.py (10 passed, 6 skipped) after the file-backed subprocess entrypoint simplification
  • pixi run ruff check cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py (All checks passed)
  • pixi run --manifest-path cuda_core -e docs docs-build-latest (Sphinx build succeeded)
  • previous broader local run: pixi run --manifest-path cuda_core pytest cuda_core/tests --ignore=cuda_core/tests/cython (2798 passed, 352 skipped, 2 failed)

The two Python-suite failures in the broader run are existing local NVML/system environment failures and are not related to this checkpointing change:

  • cuda_core/tests/system/test_system_device.py::test_get_inforom_version returns an empty InfoROM board part number locally.
  • cuda_core/tests/system/test_system_system.py::test_get_process_name hits an NVML UTF-8 decode error locally.

Additional local build/test notes:

  • pixi run --manifest-path cuda_core test stops before pytest in the existing build-cython-tests pre-step because cuda_core/tests/cython/test_get_cuda_native_handle.pyx cannot find the expected cuda.bindings .pxd files in this local pixi environment.
  • pixi build from cuda_core reaches the existing native cuda-core extension build and then fails with CUDA 12.9 headers that do not declare CU_MEM_ALLOCATION_TYPE_MANAGED; this is in the existing graph/managed-memory extension build path and is not checkpoint-specific.

CI note:

  • The previous CI attempt on 8192df67 exposed two unrelated/runner issues: one Windows py3.12 build failed inside the shared mini-CTK cache setup before any cuda.core build step, and CUDA 13.x GPU test jobs were canceled after the old in-process checkpoint test hung in cuCheckpointProcessCheckpoint.
  • The current head removes the broad CI skip. Driver-backed checkpoint lifecycle tests now run through isolated subprocess coordinator/target scenarios, and the parent pytest process can kill and skip a scenario that times out instead of letting the CI job hang.
  • The CI attempt on 7a2e6830 passed the checkpoint lifecycle tests and only failed the two multi-GPU jobs. Both failed in test_rotation_migrates_context because the driver accepted the rotation mapping but reported the target context still on the original UUID; the swap migration test already observed and skipped the same no-op behavior. The current head extends that no-op skip handling to rotation while preserving the assertion for any unexpected migrated UUID.

Current Test Implementation

The checkpoint tests in cuda_core/tests/test_checkpoint.py are real driver/GPU tests, not broad mocks.

Input validation and public-symbol checks run everywhere. Driver-backed lifecycle tests create a target process that initializes a real CUDA context, then a coordinator scenario calls checkpoint.Process(target.pid) and exercises state, restore_thread_id, lock, checkpoint, restore, and unlock through the real driver. The parent pytest process enforces a timeout around each scenario so unsupported driver/hardware paths skip cleanly instead of hanging the test job.

The scenario subprocess is launched as python -I cuda_core/tests/test_checkpoint.py scenario <name>, and its target process is launched as python -I cuda_core/tests/test_checkpoint.py target <device_index>. Python isolated mode ignores PYTHON* environment variables and omits the script/current directory from import resolution, so wheel CI imports the installed package without manual cwd or PYTHONPATH surgery. The child processes still inherit non-Python CUDA environment such as driver/library paths and device visibility.
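The launch-and-timeout pattern described here can be sketched generically (a simplified sketch: the real harness launches `cuda_core/tests/test_checkpoint.py` with scenario/target arguments, while `-c` stands in for the child entrypoint below):

```python
import subprocess
import sys

def run_isolated_child(code: str, timeout_s: float = 5.0) -> str:
    """Run a child in Python isolated mode (-I) with a parent-side timeout.

    -I ignores PYTHON* environment variables and drops the script/current
    directory from sys.path, so the child imports the installed package
    rather than the source checkout. Non-Python environment (driver paths,
    device visibility) is still inherited.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; the real harness turns
        # this into a pytest skip instead of letting the CI job hang.
        return "TIMEOUT"
    return result.stdout.strip()

print(run_isolated_child("print('ok')"))  # -> ok
```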

Migration tests require at least two same-chip GPUs and an unmasked CUDA device view. They build full UUID mappings using Device.uuid strings, then exercise rotation and pair-swap migration patterns through Process.restore(gpu_mapping=...) in the isolated target process. They skip gracefully when CUDA_VISIBLE_DEVICES is set, when the local hardware lacks a same-chip GPU pair, when the driver rejects checkpoint migration, or when the driver accepts the mapping but leaves the target context on the original GPU.
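The rotation and pair-swap patterns exercised by these tests reduce to pure mapping construction over GPU UUIDs (an illustrative sketch; real keys and values come from Device.uuid strings, and these helper names are hypothetical, not part of cuda.core):

```python
def rotation_mapping(uuids: list[str]) -> dict[str, str]:
    # Each GPU restores onto the next one in the list, wrapping around.
    return {src: uuids[(i + 1) % len(uuids)] for i, src in enumerate(uuids)}

def swap_mapping(uuids: list[str]) -> dict[str, str]:
    # Adjacent pairs swap; requires an even count of same-chip GPUs.
    assert len(uuids) % 2 == 0
    out: dict[str, str] = {}
    for a, b in zip(uuids[::2], uuids[1::2]):
        out[a], out[b] = b, a
    return out

print(rotation_mapping(["u0", "u1", "u2"]))
# {'u0': 'u1', 'u1': 'u2', 'u2': 'u0'}
```

Either mapping would then be passed as Process.restore(gpu_mapping=...) by the isolated target scenario.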

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label Apr 28, 2026
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 396a2ca to 7c66b2f on April 28, 2026 at 16:28
@kkraus14
Collaborator Author

/ok to test

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

/ok to test

@kkraus14, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kkraus14
Collaborator Author

/ok to test 7c66b2f

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch 2 times, most recently from 779c697 to 82f816c on April 28, 2026 at 16:44
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kkraus14
Collaborator Author

/ok to test

@github-actions

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 82f816c to 25455d8 on April 28, 2026 at 18:22
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from 25455d8 to aaf1418 on April 28, 2026 at 19:14
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from aaf1418 to d8a2031 on April 28, 2026 at 20:24
@kkraus14
Collaborator Author

/ok to test

@kkraus14 kkraus14 marked this pull request as ready for review April 29, 2026 13:59
@kkraus14 kkraus14 added the feature New feature or request label Apr 29, 2026
@kkraus14 kkraus14 added this to the cuda.core v1.0.0 milestone Apr 29, 2026
@kkraus14 kkraus14 self-assigned this Apr 29, 2026
@rparolin rparolin requested review from leofang and rparolin April 29, 2026 17:44
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 4, 2026

/ok to test

@kkraus14, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 8192df6

@leofang
Member

leofang commented May 4, 2026

Copy-pasta from my bot, with internal info redacted.

What I did

  • Discovered GPU migration doesn't work on this machine (heterogeneous GPUs + known driver bugs)
  • Searched internal resources (Confluence, gdrive, NVBugs, P4) to understand the root cause
  • Rewrote the full test suite with real GPU tests and no mocks
  • Pushed 4 commits to Keith's PR branch

Test suite structure (13 tests)

  • 5 input validation (no GPU needed): invalid pid types/values, public symbols
  • 6 lifecycle (single GPU, real driver): state transitions at every step, restore_thread_id, lock/unlock with timeouts, full checkpoint/restore cycle
  • 2 migration (≥2 same-chip GPUs): rotation and swap — exercise real driver API, skip gracefully when unsupported

Lessons learned

  1. nvidia-smi order ≠ CUDA device order. We wasted time testing the wrong GPU pair because CUDA_VISIBLE_DEVICES=0,3 selected Ada + A6000 (different chips) instead of two A6000s. Always use Device.uuid to identify GPUs, never assume nvidia-smi indices match CUDA indices.

  2. CUDA_VISIBLE_DEVICES breaks checkpoint migration. Tested locally: even a 2-pair identity mapping fails with CUDA_ERROR_INVALID_VALUE when only 2 of 4 GPUs are visible, while the same 4-pair identity mapping succeeds with all GPUs visible. The CUDA driver docs state the mapping must cover "every checkpointed GPU," and the driver records all physically attached GPUs at checkpoint time — not just the CUDA_VISIBLE_DEVICES subset.

  3. Migration requires exact chip match. Tested locally: any mapping that remaps between different architectures (e.g. Ada ↔️ A6000) is rejected with CUDA_ERROR_INVALID_VALUE. Same-chip mappings (A6000 ↔️ A6000) are accepted. The public CUDA docs say "the GPU to restore onto needs to be of the same chip type as the old GPU."

  4. Migration may be a no-op on some driver versions. Tested locally: same-chip A6000 swap is accepted (no error) but the context device UUID doesn't change — the driver silently no-ops. This is consistent with NVBug 5437334 (api_reverse_gpu_pairs fails on GA100x4) and NVBug 5544504 (customer report: "restore on a different GPU — the checkpointed process does not appear in the GPU process list").

  5. Checkpoint state machine is strict. Can't unlock from "checkpointed" state — must restore first (CUDA_ERROR_ILLEGAL_STATE). Can't corrupt GPU memory between checkpoint and restore within the same process (CUDA APIs are frozen). The overwrite-then-restore scenario requires an external coordinator (CRIU).

  6. Use cuda.core APIs in tests, not raw bindings. Device().uuid for current device, Device.uuid for mapping keys, Device.get_all_devices() for enumeration. The implementation (checkpoint.py) handles the string-to-CUuuid conversion internally.
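The coverage rule from lesson 2 (the mapping must span every checkpointed GPU, i.e. the KMD-visible set, not the CUDA_VISIBLE_DEVICES subset) can be captured in a small validation helper. This is a hypothetical pure-Python sketch, not part of cuda.core; the UUID strings are illustrative:

```python
def validate_gpu_mapping(kmd_visible_uuids: list[str],
                         gpu_mapping: dict[str, str]) -> None:
    """Reject restore mappings that do not cover every KMD-visible GPU.

    Mirrors the behavior observed locally: a partial mapping is rejected
    (CUDA_ERROR_INVALID_VALUE from the driver) even if it is an identity
    mapping over the CUDA-visible subset.
    """
    missing = set(kmd_visible_uuids) - set(gpu_mapping)
    if missing:
        raise ValueError(f"mapping misses checkpointed GPUs: {sorted(missing)}")
    unknown = set(gpu_mapping.values()) - set(kmd_visible_uuids)
    if unknown:
        raise ValueError(f"mapping targets unknown GPUs: {sorted(unknown)}")

# A full 4-GPU identity mapping passes; a 2-of-4 mapping raises,
# matching the driver errors described in lesson 2.
uuids = ["GPU-a", "GPU-b", "GPU-c", "GPU-d"]
validate_gpu_mapping(uuids, {u: u for u in uuids})  # ok
```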

@leofang
Member

leofang commented May 4, 2026

Pushed e9c03de because the real tests hang in the CI...

@leofang
Member

leofang commented May 4, 2026

/ok to test e9c03de

@leofang leofang requested review from Andy-Jost and leofang May 4, 2026 17:35
leofang and others added 2 commits May 4, 2026 16:49
cuCheckpointProcessCheckpoint hangs on CI runners (ephemeral VM +
container), causing all CUDA 13.x test jobs to time out.  Skip the
tests that call into the checkpoint driver when the CI environment
variable is set.  Input validation tests still run everywhere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kkraus14 kkraus14 force-pushed the kk/issue-1343-cuda-checkpointing branch from e9c03de to 8f798f4 on May 4, 2026 at 20:50
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 8f798f4

@leofang
Member

leofang commented May 4, 2026

@kkraus14 I noticed your force-push (to stringify the tests for subprocess) still includes my WAR (checking the env var CI), so the tests are still not run in the CI environment. Is that intended?

Member

@leofang leofang left a comment


Since I also made some changes, would be nice for @Andy-Jost to re-review 🙂

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

Addressed Leo’s latest review comments in signed commit 376acc7f18.

On CI: the remaining CI skip was a temporary workaround, not the intended final state. I removed the broad CI skip after moving all driver-backed lifecycle tests onto the isolated subprocess coordinator/target harness. The parent pytest process enforces timeouts and kills the scenario process group on timeout, so CI should get checkpoint coverage without repeating the previous job-level hang. Migration tests still skip when CUDA_VISIBLE_DEVICES is set or when the hardware/driver cannot provide valid same-chip migration, because the restore mapping must cover the KMD-visible GPU set.

Other updates:

  • Process.pid is now read-only via private _pid storage plus a property, with test coverage.
  • _call_driver now owns checkpoint missing-symbol and unsupported-result translation in one place.
  • Docs and migration tests now describe/use the KMD visibility rule instead of the looser CUDA-visible wording.
  • The PR description has been updated for the current implementation and validation state.

Local validation passed: ruff, git diff --check, focused pytest (10 passed, 6 skipped), and docs-build-latest.

@kkraus14
Collaborator Author

kkraus14 commented May 4, 2026

/ok to test 376acc7

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

Follow-up CI fix pushed in signed commit 7a2e683059.

The first re-enabled CI run showed all eight checkpoint scenario tests failing in wheel jobs before reaching the driver: the subprocess imported cuda.core from the source checkout, where generated cuda.core._version is absent, instead of importing the installed wheel. The scenario harness now runs from a neutral temp cwd and removes the checkout cuda_core source root from PYTHONPATH, so wheel jobs should exercise the installed package. Local focused validation still passes: 10 passed, 6 skipped.

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

/ok to test 7a2e683

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

Triage for the two failed multi-GPU jobs on 7a2e6830:

  • Both failures were isolated to tests/test_checkpoint.py::TestCheckpointGpuMigration::test_rotation_migrates_context.
  • The lifecycle checkpoint tests passed on those runners.
  • The driver accepted the rotation mapping, but the target process stayed on the original GPU UUID. The swap migration test already observed the same accepted-but-no-op behavior and skipped it as a hardware/driver limitation.
  • I updated rotation to use the same no-op skip path while keeping the assertion for any unexpected migrated UUID.

Local validation on the new head:

  • git diff --check
  • pixi run ruff check cuda_core/tests/test_checkpoint.py
  • pixi run --manifest-path cuda_core pytest cuda_core/tests/test_checkpoint.py cuda_core/tests/test_typing_imports.py (10 passed, 6 skipped)

The new signed/verified head is 8aeb8e82e1d09661289e0a4c588af6fb5bf862fc.

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

/ok to test 8aeb8e8

Contributor

@Andy-Jost Andy-Jost left a comment


LGTM!

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

/ok to test b9fa2a1

@kkraus14
Collaborator Author

kkraus14 commented May 5, 2026

@Andy-Jost @leofang PTAL: I reworked the checkpoint test subprocess launching to avoid stringifying all of the code, which should hopefully make it easier to maintain.

@kkraus14 kkraus14 merged commit 98df790 into NVIDIA:main May 5, 2026
95 checks passed
github-actions Bot pushed a commit that referenced this pull request May 6, 2026
Removed preview folders for the following PRs:
- PR #1983
- PR #2006
- PR #2024

Labels

  • cuda.core: Everything related to the cuda.core module
  • feature: New feature or request
  • P1: Medium priority - Should do

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support CUDA Checkpointing

4 participants