
Add UFFD snapshot pager graduation#272

Open
sjmiller609 wants to merge 6 commits into main from hypeship/uffd-graduation


@sjmiller609 sjmiller609 commented Jun 6, 2026

Collaborator

Summary

Running UFFD-backed VMs are pinned to their snapshot memory pager for the life of the restore. This adds a way to detach a running VM from its pager after it has soaked, so the pool of active pager sessions stays bounded and old pager versions can drain to zero and exit.

Detach happens without touching the VM: a new pager endpoint POST /sessions/{id}/complete populates every outstanding page from the backing file and then unregisters userfaultfd. The guest never pauses and its network is untouched; the VM ends up running on resident memory with no pager dependency.

Why not migrate UFFD→UFFD or fall back to the file backend: the memory backend is fixed at the mmap when a VM is restored, so reaching the file backend requires a VMM restart, which drops every TCP connection. Graduation (finish the lazy load, then detach) is the only path that is non-interrupting.

What's here

  • Pager (lib/uffdpager): POST /sessions/{id}/complete + Supervisor.CompleteSessionVersion. Completion runs in the fault-loop goroutine (woken via a pipe), populates all pages (reusing the existing read/copy path), then UFFDIO_UNREGISTERs the ranges. Unregister happens only after a full populate — otherwise the kernel zero-fills still-absent pages (corruption). On any populate failure the session keeps serving faults and is not torn down.
  • Hypervisor: new Capabilities().UsesDetachableSnapshotMemoryPager (true for Firecracker) so the controller stays hypervisor-agnostic.
  • Manager: GraduateSnapshotMemoryPager performs the detach under the instance lock and clears the session binding.
  • Controller (lib/uffdgraduate): scans for running pager-backed VMs and graduates eligible ones, prioritising outdated pager versions.
  • Config (hypervisor.firecracker_uffd_graduation): enabled (default false), min_session_age (10m), max_concurrent (1), max_active_sessions (0 = time-based weaning), scan_interval (1m), completion_timeout (10m). Wired in main.go via the existing configure/start pattern (no wire regen).
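The populate-then-unregister ordering described for the pager can be sketched as follows. This is not the code in lib/uffdpager: `pagerOps`, `completeSession`, `errPageExists`, and `runScenario` are hypothetical names, and the two function fields stand in for the real UFFDIO_COPY and UFFDIO_UNREGISTER calls so the ordering can be exercised without a userfaultfd.

```go
package main

import (
	"errors"
	"fmt"
)

// pageRange is a registered guest memory range (hypothetical type).
type pageRange struct{ start, length uint64 }

// errPageExists models the "already resident" result of a copy.
var errPageExists = errors.New("page already resident")

// pagerOps is a hypothetical seam over the two kernel operations.
type pagerOps struct {
	copyPage   func(addr uint64) error     // would wrap UFFDIO_COPY
	unregister func(r pageRange) error     // would wrap UFFDIO_UNREGISTER
}

// completeSession populates every page first and unregisters only if all
// populates succeeded. On a populate error it returns early, so the ranges
// stay registered and the session can keep serving faults.
func completeSession(ops pagerOps, ranges []pageRange, pageSize uint64) error {
	for _, r := range ranges {
		for off := uint64(0); off < r.length; off += pageSize {
			err := ops.copyPage(r.start + off)
			if err != nil && !errors.Is(err, errPageExists) {
				return fmt.Errorf("populate %#x: %w", r.start+off, err)
			}
		}
	}
	for _, r := range ranges {
		if err := ops.unregister(r); err != nil {
			return fmt.Errorf("unregister %#x: %w", r.start, err)
		}
	}
	return nil
}

// runScenario completes one three-page range. A copy at failAddr fails;
// pass 0 for no failure. It returns how many unregister calls happened.
func runScenario(failAddr uint64) (int, error) {
	unregistered := 0
	ops := pagerOps{
		copyPage: func(addr uint64) error {
			if addr == failAddr {
				return errors.New("read backing file: I/O error")
			}
			if addr == 0x1000 {
				return errPageExists // already faulted in by the guest
			}
			return nil
		},
		unregister: func(pageRange) error { unregistered++; return nil },
	}
	err := completeSession(ops, []pageRange{{start: 0x1000, length: 0x3000}}, 0x1000)
	return unregistered, err
}

func main() {
	fmt.Println(runScenario(0))      // all pages populated, range unregistered
	fmt.Println(runScenario(0x2000)) // populate fails, nothing unregistered
}
```

Keeping the two kernel calls behind function values is one way to assert the invariant "unregister never runs if any populate errors" in a hermetic test.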

Behaviour

  • Disabled by default and only constructed on the uffd backend.
  • max_active_sessions == 0: every session past min_session_age is graduated (time-based weaning). max_active_sessions > 0: only enough of the oldest sessions are graduated to return to the ceiling; outdated-version sessions are always graduated after the soak.
  • A failed graduation leaves the VM untouched (still on its pager) and is retried on a later scan.
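The selection rules above can be sketched as a small pure function. This is an illustration, not the controller in lib/uffdgraduate: the `session` and `policy` types and `selectForGraduation` are hypothetical, max_concurrent pacing is left out, and the sketch assumes outdated sessions count toward returning to the ceiling, which may differ from how the actual controller does the arithmetic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// session is a hypothetical view of one active pager session.
type session struct {
	id       string
	age      time.Duration
	outdated bool // bound to an older pager version
}

// policy mirrors two of the config knobs named in the PR description.
type policy struct {
	minSessionAge     time.Duration
	maxActiveSessions int // 0 means time-based weaning
}

// selectForGraduation returns the IDs to graduate, outdated sessions first,
// then oldest first. Sessions younger than minSessionAge are never picked.
func selectForGraduation(sessions []session, p policy) []string {
	var soaked []session
	for _, s := range sessions {
		if s.age >= p.minSessionAge {
			soaked = append(soaked, s)
		}
	}
	sort.SliceStable(soaked, func(i, j int) bool {
		if soaked[i].outdated != soaked[j].outdated {
			return soaked[i].outdated
		}
		return soaked[i].age > soaked[j].age
	})

	remaining := len(sessions) // active sessions left if we graduate the picks
	var picked []string
	for _, s := range soaked {
		overCeiling := p.maxActiveSessions == 0 || remaining > p.maxActiveSessions
		if s.outdated || overCeiling {
			picked = append(picked, s.id)
			remaining--
		}
	}
	return picked
}

// pick runs the selection on a fixed four-session fleet.
func pick(maxActive int) string {
	sessions := []session{
		{"a", 30 * time.Minute, false},
		{"b", 20 * time.Minute, true},
		{"c", 15 * time.Minute, false},
		{"d", 5 * time.Minute, false}, // still soaking
	}
	p := policy{minSessionAge: 10 * time.Minute, maxActiveSessions: maxActive}
	return strings.Join(selectForGraduation(sessions, p), ",")
}

func main() {
	fmt.Println(pick(0)) // b,a,c
	fmt.Println(pick(3)) // b
	fmt.Println(pick(2)) // b,a
}
```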

Tradeoff

Graduated pages become resident anonymous memory (reclaimable only to swap, unlike clean file-backed pages), and completion reads the whole remaining image once — hence the soak + concurrency pacing.

Test plan

  • go build ./..., go vet, and unit tests pass for lib/uffdgraduate, lib/uffdpager, cmd/api/config.
  • Controller unit tests cover soak gating, concurrency, the max_active_sessions ceiling, outdated-version priority, and disabled = no-op. Config Normalize/Validate covered.
  • A unit test guards the hand-computed UFFDIO_UNREGISTER ioctl value and the wake pipe.
  • Not validated locally (needs real Firecracker + host kernel): the populate-then-unregister path on a live VM. Three assumptions to confirm before enabling in production:
    1. Firecracker tolerates the handler unregistering + closing the uffd mid-run after a full populate.
    2. Active-ballooning interaction: after unregister, a ballooned-then-reused page re-faults to zero-fill (safe only if genuinely guest-relinquished).
    3. UFFDIO_COPY is dirty-neutral on the host kernel, so the first post-graduation diff snapshot stays small (size regression risk, not correctness).
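For reference, the hand-computed UFFDIO_UNREGISTER value that the unit test guards can be derived from the generic Linux ioctl encoding. The sketch below assumes the layout used on x86-64 and arm64 (8 number bits, 8 type bits, 14 size bits, read direction encoded as 2); some other architectures encode direction and size differently. The constant names are this example's own.

```go
package main

import "fmt"

const (
	iocNRBits   = 8
	iocTypeBits = 8
	iocSizeBits = 14

	iocNRShift   = 0
	iocTypeShift = iocNRShift + iocNRBits
	iocSizeShift = iocTypeShift + iocTypeBits
	iocDirShift  = iocSizeShift + iocSizeBits

	iocRead = 2 // _IOC_READ in the generic encoding

	uffdioType         = 0xAA // UFFDIO magic from linux/userfaultfd.h
	uffdioUnregisterNR = 0x01 // _UFFDIO_UNREGISTER
	uffdioRangeSize    = 16   // struct uffdio_range { __u64 start; __u64 len; }
)

// ior mirrors the kernel's _IOR(type, nr, size) macro.
func ior(typ, nr, size uintptr) uintptr {
	return iocRead<<iocDirShift | size<<iocSizeShift | typ<<iocTypeShift | nr<<iocNRShift
}

// uffdioUnregister is _IOR(UFFDIO, _UFFDIO_UNREGISTER, struct uffdio_range).
func uffdioUnregister() uintptr {
	return ior(uffdioType, uffdioUnregisterNR, uffdioRangeSize)
}

func main() {
	fmt.Printf("%#x\n", uffdioUnregister()) // 0x8010aa01
}
```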

🤖 Generated with Claude Code


Note

High Risk
Touches live VM guest memory via userfaultfd completion on running instances; correctness depends on Firecracker tolerating mid-run unregister and full populate behavior under ballooning.

Overview
Adds UFFD graduation: a way to detach running Firecracker VMs from the snapshot memory pager without pausing or restarting the VMM, so active pager sessions stay bounded and old pager versions can drain.

The UFFD pager gains POST /sessions/{id}/complete and Supervisor.CompleteSessionVersion: the fault-loop goroutine populates all remaining pages from the backing file, then UFFDIO_UNREGISTERs ranges (populate-before-unregister to avoid zero-filled holes). Pager version bumps to 0.1.4.

Firecracker advertises UsesDetachableSnapshotMemoryPager. The instance manager implements GraduateSnapshotMemoryPager (running VMs only, clears session metadata) and UFFDGraduationTargetVersion.

A new lib/uffdgraduate controller scans on an interval, enforces soak age, concurrency, optional max_active_sessions, and prioritizes outdated pager versions; it is wired from hypervisor.firecracker_uffd_graduation (default disabled) in API main.go via providers.ProvideUFFDGraduationController. OTel metrics included.

Integration test TestFCUFFDGraduationLifecycle covers manual graduation, guest state after detach, and file-backed standby/restore afterward.

Reviewed by Cursor Bugbot for commit e344238.

sjmiller609 and others added 5 commits June 6, 2026 18:35
Detach running UFFD-backed VMs from their snapshot memory pager after a
soak period instead of leaving them pinned for the life of the restore.
A new pager /sessions/{id}/complete endpoint populates the remaining
pages from the backing file and unregisters userfaultfd, so the VM keeps
running on resident memory with no pager dependency and no pause or
network interruption. This bounds the number of active pager sessions
and lets old pager versions drain to zero and exit.

A background controller (lib/uffdgraduate) drives graduations subject to
min_session_age, max_concurrent, and an optional max_active_sessions
ceiling, prioritising sessions on outdated pager versions. Disabled by
default and only active on the uffd backend. The detach is gated behind
a new hypervisor capability so the controller stays hypervisor-agnostic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sibling of the UFFD one-shot lifecycle test that detaches a running
UFFD-backed VM from its pager and asserts the VM keeps running with its
guest memory and disk intact, new writes still work, and a later
standby/restore preserves memory. Leaves the existing test unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Overlapping the graduation test's full memory populate with the sibling
UFFD lifecycle test's VMs saturated the CI runner and timed out
guest-agent readiness. Drop t.Parallel so peak concurrent UFFD VM load
matches the pre-existing single-test profile.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread lib/instances/firecracker_uffd_graduate.go
@sjmiller609 sjmiller609 marked this pull request as ready for review June 24, 2026 14:12
Main advanced the pager to 0.1.3 independently (CLOCK cache eviction),
colliding with this branch's bump. Advance to 0.1.4 so the graduation
pager change carries a distinct version.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sjmiller609 sjmiller609 requested a review from hiroTamada June 24, 2026 14:24
@hiroTamada

Contributor

reviewed end-to-end — solid, careful work, and the populate-then-unregister core is correct by construction. a few things worth a look, one i'd treat as a fix before merge.

should fix

  • lib/instances/firecracker_uffd_graduate.go:75-76 — graduation clears only FirecrackerUFFDSessionID + FirecrackerUFFDPagerVersion, but every other UFFD transition (standby.go:224, stop.go:310, fork.go:552, snapshot.go:321, firecracker_uffd.go:130) uses clearFirecrackerUFFDRestoreState, which also clears FirecrackerDeferredSnapshotMemoryPath. a running UFFD VM has that set, and it's consumed on the next standby (standby.go:151 → materializeDeferredSnapshotMemory, firecracker.go:121), which copies that backing file into the new snapshot dir. for a graduated VM it's stale — wasteful full-image copy at best, and if the source snapshot was deleted since, the copy errors and standby fails (firecracker.go:142-148). suggest replacing the two hand-clears with clearFirecrackerUFFDRestoreState(stored).

concurrency

  • complete_linux.go:76-81 + server_sessions_linux.go:118-121 — use-after-close race on wakeW. teardown runs close() (closes wakeW) before removeSession() deletes from the map, so a concurrent /complete can fetch the session and wake() → unix.Write to a closed/recycled fd (wakeW >= 0 is never reset to -1). narrow window (fault loop exiting as a graduate fires, e.g. a VM dying mid-graduation), small blast radius, but real. fix: guard wakeW with a small mutex (set -1 under lock in close(), read+write under lock in wake()).
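a minimal sketch of the suggested guard, assuming a Unix host. `wakePipe`, `newWakePipe`, and `errWakeClosed` are hypothetical names for this example, and it uses the standard `syscall` package rather than `golang.org/x/sys/unix` to stay self-contained. for brevity close() also closes the read end, which in the real pager is presumably owned by the fault loop.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"syscall"
)

// errWakeClosed is returned when wake() is called after close().
var errWakeClosed = errors.New("wake pipe closed")

// wakePipe guards the write end of a wake pipe with a mutex so that a
// late wake() can never write to a closed or recycled descriptor.
type wakePipe struct {
	mu sync.Mutex
	r  int // read end; -1 once closed
	w  int // write end; -1 once closed
}

func newWakePipe() (*wakePipe, error) {
	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return nil, err
	}
	return &wakePipe{r: fds[0], w: fds[1]}, nil
}

// wake writes one byte under the lock, or reports that the pipe is closed.
func (p *wakePipe) wake() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.w < 0 {
		return errWakeClosed
	}
	_, err := syscall.Write(p.w, []byte{1})
	return err
}

// close closes both ends once and resets the descriptors to -1 under the
// same lock that wake() takes. Calling it again is a no-op.
func (p *wakePipe) close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.w >= 0 {
		syscall.Close(p.w)
		p.w = -1
	}
	if p.r >= 0 {
		syscall.Close(p.r)
		p.r = -1
	}
}

func main() {
	p, err := newWakePipe()
	if err != nil {
		panic(err)
	}
	fmt.Println(p.wake()) // write succeeds while the pipe is open
	p.close()
	fmt.Println(p.wake()) // reports errWakeClosed instead of touching the fd
}
```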

questions / confirm intent

  • firecracker_uffd_graduate.go:67 — on gctx timeout mid-populateAll, CompleteSessionVersion returns a context error (not ErrSessionNotFound) so the binding isn't cleared, but the server-side fault loop can still finish + tear down the session. self-heals next scan (404→clears) — is standby/health safe in that window?
  • controller.go:163-179 — overCap = len(insts) - MaxActiveSessions counts outdated sessions, which are graduated unconditionally on a separate branch. with outdated + a ceiling, non-outdated get over-selected below the ceiling (max=8, len=10, 3 outdated → 5 graduated, lands at 5). paced by MaxConcurrent so muted — intended, or should overCap net out the outdated?
  • complete_linux.go:110-140 — completion runs in the fault-loop goroutine, so during the whole-image, address-ordered populate sweep a guest thread touching an unpopulated page stalls until the sweep reaches it (correctness fine; latency can be large). "guest never pauses" holds for vCPUs but an individual fault can block — confirm soak + MaxConcurrent is the intended mitigation. minor: PR body says "reads the whole remaining image"; it re-reads the whole image (resident pages get EEXIST), so the I/O cost is full-image.
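the overCap arithmetic from the controller.go question, worked through in code. both functions are hypothetical models of the two counting choices, not the controller's implementation, and they assume every session has already passed the soak.

```go
package main

import "fmt"

func clampNonNeg(n int) int {
	if n < 0 {
		return 0
	}
	return n
}

// graduatedCountingOutdated models overCap = total - maxActive, with the
// outdated sessions graduated in addition to the over-cap picks.
func graduatedCountingOutdated(total, outdated, maxActive int) int {
	overCap := clampNonNeg(total - maxActive)
	if nonOutdated := total - outdated; overCap > nonOutdated {
		overCap = nonOutdated
	}
	return outdated + overCap
}

// graduatedNettingOutdated models the alternative: subtract the outdated
// sessions before comparing against the ceiling.
func graduatedNettingOutdated(total, outdated, maxActive int) int {
	overCap := clampNonNeg(total - outdated - maxActive)
	return outdated + overCap
}

func main() {
	total, outdated, maxActive := 10, 3, 8

	a := graduatedCountingOutdated(total, outdated, maxActive)
	fmt.Printf("counting outdated: graduate %d, land at %d\n", a, total-a) // 5, 5

	b := graduatedNettingOutdated(total, outdated, maxActive)
	fmt.Printf("netting outdated:  graduate %d, land at %d\n", b, total-b) // 3, 7
}
```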

test gaps

  • the populate-then-unregister core is exercised only by TestFCUFFDGraduationLifecycle, which skips without KVM+userfaultfd+HYPEMAN_UFFD_PAGER_BINARY — likely not run in normal CI. hermetic linux tests cover only the ioctl constant + wake-pipe drain. consider a seam over uffdCopy/uffdUnregister to assert "unregister never runs if any populate errors" without a real uffd.
  • controller_test.go — fakeStore.err is never set, so list-failure and graduate-failure/retry are untested; MaxConcurrent throttling and the overCap↔outdated interaction are untested too.
  • TestFCUFFDGraduationLifecycle:983 asserts session/version cleared but not FirecrackerDeferredSnapshotMemoryPath, and standbys while the source snapshot still exists — so it wouldn't catch the "should fix" item.

nits

  • controller.go:197-202 — a persistently-failing graduation retries every scan (1m) with no backoff; each attempt may re-read the whole image.
  • complete_linux.go:160-176 — drainUFFDEvents discards pagefault events too; safe only because populateAll covers every page — worth a one-line comment saying why.
  • providers/uffd_graduation.go:62-67 — when the manager type-assertion fails or target version is empty, the controller silently no-ops; a log.Warn would surface a misconfig.
  • firecracker_uffd_graduate.go:59-62 — empty stored version falls back to the current pager version; a resulting 404 clears the binding and could orphan a session on another version. shouldn't happen (version set at restore), just a sharp edge.
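on the no-backoff nit, one possible shape for per-instance retry pacing is sketched below. the `backoff` type, its methods, and the 1m base / 8m cap are all made up for illustration; nothing here is in the PR.

```go
package main

import (
	"fmt"
	"time"
)

// retryState remembers consecutive failures for one instance.
type retryState struct {
	failures int
	nextTry  time.Time
}

// backoff is a hypothetical exponential backoff keyed by instance ID.
type backoff struct {
	base, max time.Duration
	state     map[string]*retryState
}

func newBackoff(base, max time.Duration) *backoff {
	return &backoff{base: base, max: max, state: map[string]*retryState{}}
}

// delay doubles the base once per failure after the first, capped at max.
func (b *backoff) delay(failures int) time.Duration {
	d := b.base
	for i := 1; i < failures; i++ {
		d *= 2
		if d >= b.max {
			return b.max
		}
	}
	if d > b.max {
		return b.max
	}
	return d
}

// eligible reports whether the scan may attempt this instance now.
func (b *backoff) eligible(id string, now time.Time) bool {
	s, ok := b.state[id]
	return !ok || !now.Before(s.nextTry)
}

func (b *backoff) recordFailure(id string, now time.Time) {
	s := b.state[id]
	if s == nil {
		s = &retryState{}
		b.state[id] = s
	}
	s.failures++
	s.nextTry = now.Add(b.delay(s.failures))
}

func (b *backoff) recordSuccess(id string) { delete(b.state, id) }

// delayMinutes is a helper for demonstrating the schedule.
func delayMinutes(failures int) int {
	return int(newBackoff(time.Minute, 8*time.Minute).delay(failures) / time.Minute)
}

// skippedAfterFailure reports whether an instance that failed at t0 is
// skipped when a scan runs `after` later.
func skippedAfterFailure(failures int, after time.Duration) bool {
	b := newBackoff(time.Minute, 8*time.Minute)
	t0 := time.Unix(0, 0)
	for i := 0; i < failures; i++ {
		b.recordFailure("vm-1", t0)
	}
	return !b.eligible("vm-1", t0.Add(after))
}

func main() {
	for n := 1; n <= 5; n++ {
		fmt.Printf("failure %d: wait %dm\n", n, delayMinutes(n))
	}
	fmt.Println(skippedAfterFailure(3, 2*time.Minute)) // still backing off
}
```

with a 1m scan interval this would skip a repeatedly failing instance on most scans instead of re-reading the image every minute.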
