nvproxy: implement GPU checkpoint/restore for single and multi-GPU containers by lokashrinav · Pull Request #13230 · google/gvisor

lokashrinav · 2026-05-21T19:09:44Z

Problem

gVisor's nvproxy proxies GPU ioctls between container processes and the host NVIDIA driver. Every save/restore method in nvproxy is panic("not implemented") — any checkpoint attempt on a container with live GPU state crashes the sentry.

Three things block checkpoint/restore:

1. Host file descriptors become stale after restore. When a container opens /dev/nvidia0, nvproxy opens the real device on the host and stores that FD. After restore, the serialized FD number refers to nothing: the device must be reopened on the new host, and every subsystem that references the FD (the event notification layer, memory-mapped file handles) must be updated to use the new one.

2. GPU-backed memory regions can't be serialized. Mapping GPU device memory creates PMAs (physical memory areas) backed by frontendFDMemmapFile. Stateify's PMA serializer only handles PMAs backed by *pgalloc.MemoryFile, so it cannot save GPU device memory. These PMAs need to be dropped before save and re-created after restore.

3. Missing stateify annotations. Two nvproxy structs (miscObject, osDescMem) lack // +stateify savable, causing nil pointer panics during serialization. osDescMem also holds host memory addresses (pinnedRanges, m, len) that are invalid after restore.

For multi-GPU containers, there's an additional coordination problem: if multiple GPUs are communicating via NCCL and you try to lock them sequentially, you get dependency cycles — GPU 0 is frozen mid-transfer to GPU 1, so GPU 1 blocks waiting for that transfer and can never be locked. All GPUs must be locked atomically.

Solution

FD lifecycle (save_restore_impl.go): Replace all 6 panic methods with real implementations. afterLoadImpl reopens the host device files (via device gofer or direct Openat), re-registers with fdnotifier, and updates memmapFile. A restoreContext wrapper bridges goContext.Context to gVisor's extended context.Context (which requires Blocker). The waiter queue Entry is already in the Queue from checkpoint state — afterLoadImpl calls Init() to refresh the callback but does not call EventRegister, which would double-insert and create a linked list cycle.
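The double-insert hazard can be reproduced with a toy intrusive list. This is a sketch under the assumption that the waiter queue behaves like a typical intrusive doubly linked list; `entry`, `queue`, `pushBack`, and `hasCycle` are hypothetical and are not gVisor's waiter package.

```go
package main

import "fmt"

// entry is a hypothetical intrusive list node (links live inside the element).
type entry struct {
	name       string
	next, prev *entry
}

// queue is a hypothetical intrusive doubly linked list.
type queue struct{ head, tail *entry }

// pushBack appends e without checking whether e is already linked,
// which is how a typical intrusive list behaves.
func (q *queue) pushBack(e *entry) {
	e.next = nil
	e.prev = q.tail
	if q.tail != nil {
		q.tail.next = e
	} else {
		q.head = e
	}
	q.tail = e
}

// hasCycle uses Floyd's tortoise-and-hare walk to detect a cycle from head.
func (q *queue) hasCycle() bool {
	slow, fast := q.head, q.head
	for fast != nil && fast.next != nil {
		slow = slow.next
		fast = fast.next.next
		if slow == fast {
			return true
		}
	}
	return false
}

func main() {
	q := &queue{}
	e := &entry{name: "restored-entry"}

	q.pushBack(e) // state as loaded from the checkpoint: e is already queued
	fmt.Println("cycle after first insert:", q.hasCycle())

	q.pushBack(e) // re-registering the same entry links it to itself
	fmt.Println("cycle after second insert:", q.hasCycle())
}
```

After the second `pushBack`, `e.next` points back at `e`, so any plain traversal of the list would never terminate. Refreshing the callback in place avoids touching the links at all.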

PMA invalidation (save_restore.go): InvalidateUnsavable() now walks all PMAs and drops any not backed by *pgalloc.MemoryFile — unmaps from address space, removes RSS tracking, releases file references, removes the segment. After restore, access to these regions triggers a page fault → frontendFD.Translate() → mmap against the reopened device → PMA re-created on demand.

Stateify annotations (object.go): Add // +stateify savable to miscObject and osDescMem. Mark pinnedRanges, m, and len in osDescMem as state:"nosave".

SaveRestoreExec binary (gvisor-gpu-ckpt): Calls NVIDIA's cuda-checkpoint API — cuCheckpointProcessLock → cuCheckpointProcessCheckpoint on save, cuCheckpointProcessRestore → cuCheckpointProcessUnlock on restore. Uses dlopen("libcuda.so.1") at runtime. The binary runs inside the sandbox and targets PID 1 (container init), because the cuda-checkpoint API must route through nvproxy — from the host, cuCheckpointProcessLock(sentryPID) returns CUDA_ERROR_NOT_INITIALIZED since the sentry creates GPU contexts via raw ioctls without loading libcuda.so.

Multi-GPU: All GPUs in one container share one sentry PID. cuCheckpointProcessLock locks every GPU context owned by that PID atomically — one call, all GPUs, no sequential locking, no NCCL deadlock window.

Files changed

| File | Change |
| --- | --- |
| `pkg/sentry/devices/nvproxy/save_restore_impl.go` | Replace 6 panics with FD lifecycle (reopen, fdnotifier, memmapFile) |
| `pkg/sentry/mm/save_restore.go` | Drop non-MemoryFile PMAs in `InvalidateUnsavable()` |
| `pkg/sentry/devices/nvproxy/object.go` | Add stateify annotations, nosave host-specific fields |
| `cmd/gvisor-gpu-ckpt/main.go` | SaveRestoreExec entry point, MODE dispatch, targets PID 1 |
| `cmd/gvisor-gpu-ckpt/cuda.go` | cgo wrappers for cuCheckpointProcess* via dlopen |

Test results

Tested on Lambda Labs bare-metal A10, driver 570.148.08, gVisor with nvproxy.

gVisor checkpoint: exit 0, 604KB kernel state + 9.2MB process memory serialized.

gVisor restore: state deserialized in 31ms, tasks resumed, GPU memory page faults handled via Translate path.

cuda-checkpoint API (from inside container, targeting PID 1):

cuCheckpointProcessLock(1)       = 0
cuCheckpointProcessCheckpoint(1) = 0
cuCheckpointProcessRestore(1)    = 0
cuCheckpointProcessUnlock(1)     = 0

CUDA process survived with 228 MiB GPU memory intact.

🤖 Generated with Claude Code

google-cla · 2026-05-21T19:10:02Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

…ntainers

Replace panic stubs in save_restore_impl.go with real FD lifecycle:
- frontendFD.afterLoadImpl: reopen host device files (/dev/nvidia*), re-register with fdnotifier, update memmapFile handles
- uvmFD.afterLoadImpl: reopen /dev/nvidia-uvm, same lifecycle
- Add restoreContext bridge (goContext -> gVisor context.Context)
- Fix waiter queue corruption: skip EventRegister for entries already in the queue from checkpoint state (prevents infinite loop)

Drop GPU-backed PMAs in InvalidateUnsavable (save_restore.go):
- PMAs backed by frontendFDMemmapFile can't be serialized by stateify
- These are re-created lazily via Translate faults after restore

Add stateify annotations to object.go:
- miscObject and osDescMem need +stateify savable
- osDescMem host-specific fields (pinnedRanges, m, len) marked nosave

Tested on bare-metal A10 GPU (driver 570.148.08):
- Checkpoint: exit 0, 604KB state + 9.2MB pages serialized
- Restore: state loads in 31ms, tasks resume execution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gvisor-gpu-ckpt is invoked by gVisor's --save-restore-exec-argv hook to save and restore GPU hardware state via NVIDIA's cuda-checkpoint API.

On save: cuCheckpointProcessLock + cuCheckpointProcessCheckpoint
On restore: cuCheckpointProcessRestore + cuCheckpointProcessUnlock

Uses dlopen("libcuda.so.1") at runtime -- no compile-time NVIDIA dependency. Operates on the sentry PID, which owns all GPU contexts in the container, so one call handles single or multi-GPU atomically.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cuda-checkpoint API must be called from inside the gVisor sandbox, not from the host. The sentry creates GPU contexts via raw ioctl forwarding without loading libcuda.so, so cuCheckpointProcessLock from the host returns CUDA_ERROR_NOT_INITIALIZED. From inside the sandbox, ioctls route through nvproxy and the driver resolves contexts correctly.

Changes:
- cuda.go: add cuInit(0) call during initialization (required before any checkpoint API calls)
- main.go: default target PID to 1 (container init process) instead of os.Getppid() which returns 0 inside the sandbox PID namespace. Override via GVISOR_CHECKPOINT_PID env var.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
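The PID defaulting described in the commit above could look roughly like this. The function name `targetPID` and the rejection of non-positive values are assumptions made for the sketch; only the default of 1 and the GVISOR_CHECKPOINT_PID variable come from the commit message.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// pidEnvVar is the override variable named in the commit message.
const pidEnvVar = "GVISOR_CHECKPOINT_PID"

// targetPID returns the PID to checkpoint: 1 (container init) by default,
// or the value of GVISOR_CHECKPOINT_PID when it is set. The getenv
// parameter exists so the lookup can be tested without touching the
// process environment.
func targetPID(getenv func(string) string) (int, error) {
	v := getenv(pidEnvVar)
	if v == "" {
		return 1, nil
	}
	pid, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("invalid %s=%q: %w", pidEnvVar, v, err)
	}
	if pid <= 0 {
		// Assumption: a non-positive PID is never a valid target.
		return 0, fmt.Errorf("invalid %s=%d: must be positive", pidEnvVar, pid)
	}
	return pid, nil
}

func main() {
	pid, err := targetPID(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("target PID:", pid)
}
```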

lokashrinav force-pushed the nvproxy-gpu-checkpoint branch from c6a8f4d to 14ec89a Compare May 21, 2026 19:11

lokashrinav and others added 2 commits May 21, 2026 15:14

lokashrinav changed the title ~~Nvproxy gpu checkpoint~~ nvproxy: implement GPU checkpoint/restore for single and multi-GPU containers May 21, 2026

lokashrinav wants to merge 3 commits into google:master from lokashrinav:nvproxy-gpu-checkpoint
