
nvproxy: implement GPU checkpoint/restore for single and multi-GPU containers#13230

Open
lokashrinav wants to merge 3 commits into google:master from lokashrinav:nvproxy-gpu-checkpoint

@lokashrinav lokashrinav commented May 21, 2026

Problem

gVisor's nvproxy proxies GPU ioctls between container processes and the host NVIDIA driver. Every save/restore method in nvproxy is panic("not implemented") — any checkpoint attempt on a container with live GPU state crashes the sentry.

Three things block checkpoint/restore:

1. Host file descriptors become stale after restore. When a container opens /dev/nvidia0, nvproxy opens the real device on the host and stores that FD. After checkpoint, the serialized FD number points at nothing — the device needs to be reopened on the new host, and every system that references that FD (the event notification layer, memory-mapped file handles) needs to be updated to use the new one.

2. GPU-backed memory regions can't be serialized. GPU device memory creates PMAs (physical memory areas) backed by frontendFDMemmapFile. Stateify's PMA serializer only handles *pgalloc.MemoryFile — it doesn't know how to save GPU device memory. These PMAs need to be dropped before save and re-created after restore.

3. Missing stateify annotations. Two nvproxy structs (miscObject, osDescMem) lack // +stateify savable, causing nil pointer panics during serialization. osDescMem also holds host memory addresses (pinnedRanges, m, len) that are invalid after restore.

For multi-GPU containers, there's an additional coordination problem: if multiple GPUs are communicating via NCCL and you try to lock them sequentially, you get dependency cycles — GPU 0 is frozen mid-transfer to GPU 1, so GPU 1 blocks waiting for that transfer and can never be locked. All GPUs must be locked atomically.

Solution

FD lifecycle (save_restore_impl.go): Replace all 6 panic methods with real implementations. afterLoadImpl reopens the host device files (via device gofer or direct Openat), re-registers with fdnotifier, and updates memmapFile. A restoreContext wrapper bridges goContext.Context to gVisor's extended context.Context (which requires Blocker). The waiter queue Entry is already in the Queue from checkpoint state — afterLoadImpl calls Init() to refresh the callback but does not call EventRegister, which would double-insert and create a linked list cycle.
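The double-insert hazard mentioned above can be illustrated with a minimal intrusive list. This is only a sketch: `entry`, `list`, `pushBack`, and `hasCycle` are hypothetical stand-ins and not gVisor's waiter types. It assumes, as is typical for intrusive lists, that appending does not check whether the node is already linked.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for an intrusive list node such as a
// waiter entry; it is not gVisor's waiter.Entry.
type entry struct {
	name       string
	next, prev *entry
}

// list is a minimal intrusive doubly linked list.
type list struct {
	head, tail *entry
}

// pushBack appends e without checking whether e is already linked.
func (l *list) pushBack(e *entry) {
	e.next = nil
	e.prev = l.tail
	if l.tail != nil {
		l.tail.next = e
	} else {
		l.head = e
	}
	l.tail = e
}

// hasCycle walks the forward links with Floyd's tortoise-and-hare check.
func (l *list) hasCycle() bool {
	slow, fast := l.head, l.head
	for fast != nil && fast.next != nil {
		slow = slow.next
		fast = fast.next.next
		if slow == fast {
			return true
		}
	}
	return false
}

func main() {
	restored := &entry{name: "fd-notify"}
	q := &list{}

	// State as it would be deserialized: the entry is already in the queue.
	q.pushBack(restored)
	fmt.Println("cycle after restore:", q.hasCycle())

	// A second registration links the entry to itself (next == self).
	q.pushBack(restored)
	fmt.Println("cycle after re-register:", q.hasCycle())
}
```

In this sketch the second `pushBack` sets the tail's `next` pointer to the node itself, so any traversal of the queue would never terminate. Refreshing only the callback avoids touching the links.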

PMA invalidation (save_restore.go): InvalidateUnsavable() now walks all PMAs and drops any not backed by *pgalloc.MemoryFile — unmaps from address space, removes RSS tracking, releases file references, removes the segment. After restore, access to these regions triggers a page fault → frontendFD.Translate() → mmap against the reopened device → PMA re-created on demand.
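The filtering step can be sketched as a type check on each PMA's backing file. Everything below is a simplified assumption: `backingFile`, `memoryFile`, `deviceFile`, `pma`, and `dropUnsavable` are hypothetical names, and the sketch only does the bookkeeping, whereas the PR text says the real code also unmaps the range, adjusts RSS tracking, and releases file references.

```go
package main

import "fmt"

// backingFile is a hypothetical stand-in for the interface implemented by
// files that back PMAs.
type backingFile interface {
	Name() string
}

// memoryFile stands in for *pgalloc.MemoryFile, which can be serialized.
type memoryFile struct{}

func (*memoryFile) Name() string { return "memory-file" }

// deviceFile stands in for a GPU-backed file such as frontendFDMemmapFile.
type deviceFile struct{}

func (*deviceFile) Name() string { return "gpu-device" }

// pma is a simplified physical memory area covering [start, end).
type pma struct {
	start, end uint64
	file       backingFile
}

// dropUnsavable keeps PMAs backed by *memoryFile and reports how many bytes
// of other PMAs were dropped.
func dropUnsavable(pmas []pma) (kept []pma, droppedBytes uint64) {
	for _, p := range pmas {
		if _, ok := p.file.(*memoryFile); ok {
			kept = append(kept, p)
			continue
		}
		droppedBytes += p.end - p.start
	}
	return kept, droppedBytes
}

func main() {
	pmas := []pma{
		{start: 0x1000, end: 0x3000, file: &memoryFile{}},
		{start: 0x8000, end: 0x9000, file: &deviceFile{}},
	}
	kept, dropped := dropUnsavable(pmas)
	fmt.Printf("kept %d PMA(s), dropped %#x bytes\n", len(kept), dropped)
}
```

Dropped ranges are not lost in this model: they are expected to be rebuilt on the first fault after restore, as described above.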

Stateify annotations (object.go): Add // +stateify savable to miscObject and osDescMem. Mark pinnedRanges, m, and len in osDescMem as state:"nosave".
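As a rough illustration of the annotation pattern, the sketch below declares a struct in the same shape. Only the field names `pinnedRanges`, `m`, and `len` and the `nosave` tag come from the description above; the field types, the `handle` field, and the `nosaveFields` helper are assumptions made so the example compiles and can be inspected with reflection.

```go
package main

import (
	"fmt"
	"reflect"
)

// osDescMemSketch mirrors the annotated struct described above. Field types
// are guesses; handle is a hypothetical field that would be saved.
//
// +stateify savable
type osDescMemSketch struct {
	pinnedRanges []uint64 `state:"nosave"`
	m            uintptr  `state:"nosave"`
	len          uint64   `state:"nosave"`
	handle       uint32
}

// nosaveFields lists the fields of a struct value tagged state:"nosave".
func nosaveFields(v interface{}) []string {
	t := reflect.TypeOf(v)
	var out []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if f.Tag.Get("state") == "nosave" {
			out = append(out, f.Name)
		}
	}
	return out
}

func main() {
	fmt.Println(nosaveFields(osDescMemSketch{}))
}
```

Fields tagged this way come back zero-valued after restore, so any code that reads them would need to re-establish the host mappings first.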

SaveRestoreExec binary (gvisor-gpu-ckpt): Calls NVIDIA's cuda-checkpoint API: cuCheckpointProcessLock → cuCheckpointProcessCheckpoint on save, cuCheckpointProcessRestore → cuCheckpointProcessUnlock on restore. Uses dlopen("libcuda.so.1") at runtime. The binary runs inside the sandbox and targets PID 1 (container init), because the cuda-checkpoint API must route through nvproxy. From the host, cuCheckpointProcessLock(sentryPID) returns CUDA_ERROR_NOT_INITIALIZED since the sentry creates GPU contexts via raw ioctls without loading libcuda.so.
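The call ordering on each path can be sketched without libcuda by hiding the four driver calls behind an interface. This is an assumption-heavy outline rather than the code in main.go: the `checkpointAPI` interface, the `traceAPI` fake, the `run` function, and the mode strings "save" and "restore" are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// checkpointAPI abstracts the four driver calls named above so the ordering
// can be exercised without a GPU. The method set is an assumption.
type checkpointAPI interface {
	Lock(pid int) error
	Checkpoint(pid int) error
	Restore(pid int) error
	Unlock(pid int) error
}

// run performs the save or restore sequence for pid. The mode strings are
// hypothetical placeholders.
func run(api checkpointAPI, mode string, pid int) error {
	switch mode {
	case "save":
		if err := api.Lock(pid); err != nil {
			return fmt.Errorf("lock: %w", err)
		}
		if err := api.Checkpoint(pid); err != nil {
			return fmt.Errorf("checkpoint: %w", err)
		}
	case "restore":
		if err := api.Restore(pid); err != nil {
			return fmt.Errorf("restore: %w", err)
		}
		if err := api.Unlock(pid); err != nil {
			return fmt.Errorf("unlock: %w", err)
		}
	default:
		return fmt.Errorf("unknown mode %q", mode)
	}
	return nil
}

// traceAPI is a fake that records the order of calls.
type traceAPI struct{ calls []string }

func (t *traceAPI) Lock(int) error       { t.calls = append(t.calls, "lock"); return nil }
func (t *traceAPI) Checkpoint(int) error { t.calls = append(t.calls, "checkpoint"); return nil }
func (t *traceAPI) Restore(int) error    { t.calls = append(t.calls, "restore"); return nil }
func (t *traceAPI) Unlock(int) error     { t.calls = append(t.calls, "unlock"); return nil }

func main() {
	api := &traceAPI{}
	if err := run(api, "save", 1); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(api.calls)
}
```

A real implementation would presumably back the interface with the dlopen-resolved symbols and decide what to do when checkpoint fails after a successful lock; this sketch simply returns the error.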

Multi-GPU: All GPUs in one container share one sentry PID. cuCheckpointProcessLock locks every GPU context owned by that PID atomically — one call, all GPUs, no sequential locking, no NCCL deadlock window.

Files changed

| File | Change |
| --- | --- |
| pkg/sentry/devices/nvproxy/save_restore_impl.go | Replace 6 panics with FD lifecycle (reopen, fdnotifier, memmapFile) |
| pkg/sentry/mm/save_restore.go | Drop non-MemoryFile PMAs in InvalidateUnsavable() |
| pkg/sentry/devices/nvproxy/object.go | Add stateify annotations, nosave host-specific fields |
| cmd/gvisor-gpu-ckpt/main.go | SaveRestoreExec entry point, MODE dispatch, targets PID 1 |
| cmd/gvisor-gpu-ckpt/cuda.go | cgo wrappers for cuCheckpointProcess* via dlopen |

Test results

Tested on Lambda Labs bare-metal A10, driver 570.148.08, gVisor with nvproxy.

gVisor checkpoint: exit 0, 604KB kernel state + 9.2MB process memory serialized.

gVisor restore: state deserialized in 31ms, tasks resumed, GPU memory page faults handled via Translate path.

cuda-checkpoint API (from inside container, targeting PID 1):

cuCheckpointProcessLock(1)       = 0
cuCheckpointProcessCheckpoint(1) = 0
cuCheckpointProcessRestore(1)    = 0
cuCheckpointProcessUnlock(1)     = 0

CUDA process survived with 228 MiB GPU memory intact.

🤖 Generated with Claude Code

google-cla Bot commented May 21, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

…ntainers

Replace panic stubs in save_restore_impl.go with real FD lifecycle:
- frontendFD.afterLoadImpl: reopen host device files (/dev/nvidia*),
  re-register with fdnotifier, update memmapFile handles
- uvmFD.afterLoadImpl: reopen /dev/nvidia-uvm, same lifecycle
- Add restoreContext bridge (goContext -> gVisor context.Context)
- Fix waiter queue corruption: skip EventRegister for entries already
  in the queue from checkpoint state (prevents infinite loop)

Drop GPU-backed PMAs in InvalidateUnsavable (save_restore.go):
- PMAs backed by frontendFDMemmapFile can't be serialized by stateify
- These are re-created lazily via Translate faults after restore

Add stateify annotations to object.go:
- miscObject and osDescMem need +stateify savable
- osDescMem host-specific fields (pinnedRanges, m, len) marked nosave

Tested on bare-metal A10 GPU (driver 570.148.08):
- Checkpoint: exit 0, 604KB state + 9.2MB pages serialized
- Restore: state loads in 31ms, tasks resume execution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lokashrinav lokashrinav force-pushed the nvproxy-gpu-checkpoint branch from c6a8f4d to 14ec89a Compare May 21, 2026 19:11
lokashrinav and others added 2 commits May 21, 2026 15:14
gvisor-gpu-ckpt is invoked by gVisor's --save-restore-exec-argv hook
to save and restore GPU hardware state via NVIDIA's cuda-checkpoint API.

On save: cuCheckpointProcessLock + cuCheckpointProcessCheckpoint
On restore: cuCheckpointProcessRestore + cuCheckpointProcessUnlock

Uses dlopen("libcuda.so.1") at runtime -- no compile-time NVIDIA
dependency. Operates on the sentry PID, which owns all GPU contexts
in the container, so one call handles single or multi-GPU atomically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cuda-checkpoint API must be called from inside the gVisor sandbox,
not from the host. The sentry creates GPU contexts via raw ioctl
forwarding without loading libcuda.so, so cuCheckpointProcessLock
from the host returns CUDA_ERROR_NOT_INITIALIZED. From inside the
sandbox, ioctls route through nvproxy and the driver resolves
contexts correctly.

Changes:
- cuda.go: add cuInit(0) call during initialization (required before
  any checkpoint API calls)
- main.go: default target PID to 1 (container init process) instead
  of os.Getppid() which returns 0 inside the sandbox PID namespace.
  Override via GVISOR_CHECKPOINT_PID env var.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
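
The PID selection described in the commit above (default to 1, override through GVISOR_CHECKPOINT_PID) might look roughly like the following. The function name `targetPID`, the injected `getenv` parameter, and the rejection of non-positive values are assumptions for illustration; only the environment variable name and the default come from the commit message.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// pidEnvVar is the override variable named in the commit message.
const pidEnvVar = "GVISOR_CHECKPOINT_PID"

// targetPID returns the PID to checkpoint: 1 (container init) by default,
// or the value of GVISOR_CHECKPOINT_PID when it is set. getenv is injected
// so the lookup can be tested without touching the process environment.
func targetPID(getenv func(string) string) (int, error) {
	v := getenv(pidEnvVar)
	if v == "" {
		return 1, nil
	}
	pid, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("%s=%q: %w", pidEnvVar, v, err)
	}
	if pid <= 0 {
		// Rejecting non-positive PIDs is an assumption of this sketch.
		return 0, fmt.Errorf("%s=%d: PID must be positive", pidEnvVar, pid)
	}
	return pid, nil
}

func main() {
	pid, err := targetPID(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("target PID:", pid)
}
```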
@lokashrinav lokashrinav changed the title from "Nvproxy gpu checkpoint" to "nvproxy: implement GPU checkpoint/restore for single and multi-GPU containers" May 21, 2026