nvproxy: implement GPU checkpoint/restore for single and multi-GPU containers#13230
Open
lokashrinav wants to merge 3 commits into
Open
nvproxy: implement GPU checkpoint/restore for single and multi-GPU containers#13230lokashrinav wants to merge 3 commits into
lokashrinav wants to merge 3 commits into
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
…ntainers Replace panic stubs in save_restore_impl.go with real FD lifecycle: - frontendFD.afterLoadImpl: reopen host device files (/dev/nvidia*), re-register with fdnotifier, update memmapFile handles - uvmFD.afterLoadImpl: reopen /dev/nvidia-uvm, same lifecycle - Add restoreContext bridge (goContext -> gVisor context.Context) - Fix waiter queue corruption: skip EventRegister for entries already in the queue from checkpoint state (prevents infinite loop) Drop GPU-backed PMAs in InvalidateUnsavable (save_restore.go): - PMAs backed by frontendFDMemmapFile can't be serialized by stateify - These are re-created lazily via Translate faults after restore Add stateify annotations to object.go: - miscObject and osDescMem need +stateify savable - osDescMem host-specific fields (pinnedRanges, m, len) marked nosave Tested on bare-metal A10 GPU (driver 570.148.08): - Checkpoint: exit 0, 604KB state + 9.2MB pages serialized - Restore: state loads in 31ms, tasks resume execution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
c6a8f4d to
14ec89a
Compare
gvisor-gpu-ckpt is invoked by gVisor's --save-restore-exec-argv hook
to save and restore GPU hardware state via NVIDIA's cuda-checkpoint API.
On save: cuCheckpointProcessLock + cuCheckpointProcessCheckpoint
On restore: cuCheckpointProcessRestore + cuCheckpointProcessUnlock
Uses dlopen("libcuda.so.1") at runtime -- no compile-time NVIDIA
dependency. Operates on the sentry PID, which owns all GPU contexts
in the container, so one call handles single or multi-GPU atomically.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cuda-checkpoint API must be called from inside the gVisor sandbox, not from the host. The sentry creates GPU contexts via raw ioctl forwarding without loading libcuda.so, so cuCheckpointProcessLock from the host returns CUDA_ERROR_NOT_INITIALIZED. From inside the sandbox, ioctls route through nvproxy and the driver resolves contexts correctly. Changes: - cuda.go: add cuInit(0) call during initialization (required before any checkpoint API calls) - main.go: default target PID to 1 (container init process) instead of os.Getppid() which returns 0 inside the sandbox PID namespace. Override via GVISOR_CHECKPOINT_PID env var. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
gVisor's nvproxy proxies GPU ioctls between container processes and the host NVIDIA driver. Every save/restore method in nvproxy is
panic("not implemented")— any checkpoint attempt on a container with live GPU state crashes the sentry.Three things block checkpoint/restore:
1. Host file descriptors become stale after restore. When a container opens
/dev/nvidia0, nvproxy opens the real device on the host and stores that FD. After checkpoint, the serialized FD number points at nothing — the device needs to be reopened on the new host, and every system that references that FD (the event notification layer, memory-mapped file handles) needs to be updated to use the new one.2. GPU-backed memory regions can't be serialized. GPU device memory creates PMAs (physical memory areas) backed by
frontendFDMemmapFile. Stateify's PMA serializer only handles*pgalloc.MemoryFile— it doesn't know how to save GPU device memory. These PMAs need to be dropped before save and re-created after restore.3. Missing stateify annotations. Two nvproxy structs (
miscObject,osDescMem) lack// +stateify savable, causing nil pointer panics during serialization.osDescMemalso holds host memory addresses (pinnedRanges,m,len) that are invalid after restore.For multi-GPU containers, there's an additional coordination problem: if multiple GPUs are communicating via NCCL and you try to lock them sequentially, you get dependency cycles — GPU 0 is frozen mid-transfer to GPU 1, so GPU 1 blocks waiting for that transfer and can never be locked. All GPUs must be locked atomically.
Solution
FD lifecycle (
save_restore_impl.go): Replace all 6 panic methods with real implementations.afterLoadImplreopens the host device files (via device gofer or directOpenat), re-registers withfdnotifier, and updatesmemmapFile. ArestoreContextwrapper bridgesgoContext.Contextto gVisor's extendedcontext.Context(which requiresBlocker). The waiter queueEntryis already in theQueuefrom checkpoint state —afterLoadImplcallsInit()to refresh the callback but does not callEventRegister, which would double-insert and create a linked list cycle.PMA invalidation (
save_restore.go):InvalidateUnsavable()now walks all PMAs and drops any not backed by*pgalloc.MemoryFile— unmaps from address space, removes RSS tracking, releases file references, removes the segment. After restore, access to these regions triggers a page fault →frontendFD.Translate()→ mmap against the reopened device → PMA re-created on demand.Stateify annotations (
object.go): Add// +stateify savabletomiscObjectandosDescMem. MarkpinnedRanges,m, andleninosDescMemasstate:"nosave".SaveRestoreExec binary (
gvisor-gpu-ckpt): Calls NVIDIA's cuda-checkpoint API —cuCheckpointProcessLock→cuCheckpointProcessCheckpointon save,cuCheckpointProcessRestore→cuCheckpointProcessUnlockon restore. Usesdlopen("libcuda.so.1")at runtime. The binary runs inside the sandbox and targets PID 1 (container init), because the cuda-checkpoint API must route through nvproxy — from the host,cuCheckpointProcessLock(sentryPID)returnsCUDA_ERROR_NOT_INITIALIZEDsince the sentry creates GPU contexts via raw ioctls without loading libcuda.so.Multi-GPU: All GPUs in one container share one sentry PID.
cuCheckpointProcessLocklocks every GPU context owned by that PID atomically — one call, all GPUs, no sequential locking, no NCCL deadlock window.Files changed
pkg/sentry/devices/nvproxy/save_restore_impl.gopkg/sentry/mm/save_restore.goInvalidateUnsavable()pkg/sentry/devices/nvproxy/object.gocmd/gvisor-gpu-ckpt/main.gocmd/gvisor-gpu-ckpt/cuda.goTest results
Tested on Lambda Labs bare-metal A10, driver 570.148.08, gVisor with nvproxy.
gVisor checkpoint: exit 0, 604KB kernel state + 9.2MB process memory serialized.
gVisor restore: state deserialized in 31ms, tasks resumed, GPU memory page faults handled via Translate path.
cuda-checkpoint API (from inside container, targeting PID 1):
CUDA process survived with 228 MiB GPU memory intact.
🤖 Generated with Claude Code