feat(fc): drain virtio-balloon free-page-hinting before pause#2552
feat(fc): drain virtio-balloon free-page-hinting before pause#2552ValentaTomas wants to merge 34 commits intomainfrom
Conversation
A small two-state-plus-default tracker backed by roaring bitmaps. Used by upcoming UFFD work to track page states (Missing/Faulted/Removed) and by NBD to track zero pages, replacing ad-hoc map-based trackers with O(1) range ops and cheap snapshot exports.
…state Replace the map-based pageTracker with block.StateTracker[pageState], a roaring-bitmap-backed tracker with O(1) range ops. pageState gains a third value, removed, which is wired at the type level but not yet written anywhere -- #2520 adds the REMOVE-event handler that produces it. Page indices are computed at the call site via header.BlockIdx. pageStateEntries is updated to iterate the exported bitmaps so the cross-process test harness keeps working. Inline the 3-line pageState enum into userfaultfd.go and drop the dedicated page_tracker.go now that pageTracker is gone. Convert block.StateTracker's NewStateTracker / SetRange API from panics to errors. Distinct-state validation and unsupported-state checks now return fmt.Errorf descriptors; the userfaultfd-side init propagates the constructor error through NewUserfaultfdFromFd, and the SetRange call in the worker path logs and continues since these errors only fire on programming bugs.
Production:
- UFFDIO_REGISTER_MODE_REMOVE is requested so the kernel reports
MADV_DONTNEED'd pages via UFFD_EVENT_REMOVE.
- Userfaultfd.Serve splits read events into removes + pagefaults,
drains the REMOVE batch under settleRequests.Lock (calling
pageTracker.SetRange(.., removed) with BlockIdx-computed indices),
then dispatches the pagefault batch.
- Worker dispatch switches on pageTracker.Get(idx): faulted ->
short-circuit, removed -> zero-fill (source = nil), missing ->
copy from u.src. The state read happens inside the worker under
settleRequests.RLock so a concurrent REMOVE can't slip between
the read and the install.
- faultPage gains zero-fill paths for source == nil (4K read =
DONTWAKE zero + WP + wake; 4K write = zero + wake; hugepage =
copy(EmptyHugePage)) and returns (handled, err) so the worker can
defer UFFDIO_COPY EAGAIN back into a deferredFaults queue.
- wakeupPipe + deferredFaults wake the poll loop when a worker
defers, so a deferred fault doesn't sit waiting for an unrelated
UFFD event. The received uffd fd is marked FD_CLOEXEC.
- Prefault short-circuits for faulted || removed.
Tests:
- testConfig gains removeEnabled; the parent unregisters the UFFD
region on cleanup when REMOVE is on so munmap doesn't block on
un-acked events.
- Page-state wire format exposes removed via helpers_test.go.
- operationModeRemove + executeRemove (madvise MADV_DONTNEED).
- runMatrix wraps every existing generic test in remove-off and
remove-on subtests so the no-REMOVE path (still used by
production templates) stays covered while the new path is
exercised. The matrix-level t.Parallel() is intentionally
omitted to cap peak concurrency in CI.
- remove_test.go: TestRemove, TestRemoveThenFault,
TestRemoveThenWriteGated, TestWriteThenRemoveGated. Gated tests
are //nolint:tparallel — a paused gated handler keeps a faulting
goroutine suspended in the kernel pagefault path; a STW GC pause
from a parallel test would wait forever for that goroutine to
reach a safe point.
- race_test.go: deterministic stale-source / madvise-deadlock /
faulted-short-circuit regressions, serialised, with the
FD_CLOEXEC and UFFDIO_COPY-EAGAIN fixes covered.
A worker holding settleRequests.RLock must never block readEvents, because madvise(MADV_DONTNEED) blocks the producer until userspace reads the UFFD_EVENT_REMOVE — and the producer can be the FC balloon thread that other syscalls depend on. Use a dedicated readSerial mutex (not settleRequests) to serialize serve-loop iterations with snapshot-time Export, while keeping the existing settleRequests discipline (workers RLock, REMOVE batch Lock) intact so readEvents remains lock-free relative to workers. Restores liveness for TestNoMadviseDeadlockWithInflightCopy while closing the read-vs-apply race that motivated the prior buggy commit (345f7e9, now amended).
…loon Adds the FC-side integration plumbing for free page reporting on top of the UFFD REMOVE-event handling in #2520: - template-manager proto: optional bool freePageReporting (field 17). - TemplateConfig + sandbox.Config gain a FreePageReporting bool that flows from template create → build phases (base/steps/finalize) → sandbox factory → fc.Process.Create. - fc.apiClient.enableFreePageReporting calls PUT /balloon with free_page_reporting=true after entropy setup and before VM start. - fcversion.HasFreePageReporting gates rollout to FC v1.14+. - Adds free-page-reporting LaunchDarkly feature flag. - create-build CLI: --free-page-reporting flag, defaults to enabled when FC version supports it. - smoketest: opportunistically enables FPR when the FC version under test supports it. UFFD-side changes (REMOVE handling, page tracker, race tests, fix) remain in #2520; this PR is purely the production rollout path.
…rchestrator HasFreePageReporting() was added to fcversion.Info but had zero callers in the production path. Mirror the HasHugePages() pattern: let the orchestrator derive the value from the FC version (authoritative), gated by the FreePageReportingFlag LaunchDarkly flag (default false). Also emit an env.free_page_reporting span attribute alongside the existing env.huge_pages one.
Read the Removed bitmap from PageTracker and emit it as DiffMetadata.Empty so REMOVE'd pages become uuid.Nil mappings in the snapshot header (read as zero on resume). Defensively AndNot the empty set out of dirty: settle drains make these disjoint in practice (Removed pages have no PTE, WP-async only sees present pages with WP cleared), but if the invariant ever breaks the guest's last intent for a Removed page is "free, read zero on restore" — so empty must win, not stale dirty content.
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 51ae406. Bugbot is set up for automated code reviews on this repo. Configure here. |
e8bd708 to
bf00edc
Compare
263a0d0 to
f4e3ab0
Compare
Arm free-page-hinting on the existing balloon device (always set when the balloon is installed; pure runtime toggle), and on pause do a host-initiated hint+wait so MADV_DONTNEED-reclaimed pages are settled before the snapshot. Pages reclaimed this way generate UFFD_EVENT_REMOVE, which the orchestrator already tracks (parent FPR PR), so the snapshot captures them as removed instead of zero-filled. - fc/client.go: rename enableFreePageReporting -> installBalloon; always set FreePageHinting=true; add startBalloonHinting + describeBalloonHinting helpers. - fc/process.go: track balloonInstalled; add DrainBalloon (start + poll guest_cmd >= host_cmd, with host>0 guard against transient nil/zero responses). - sandbox.go: wire featureFlags into Sandbox; call DrainBalloon from Pause behind the flag. Failures are logged but non-fatal. Gated by free-page-hinting-timeout-ms (LD int flag, ms; default 0 = disabled). resume-build gains --fph-timeout-ms for local exercise.
f4e3ab0 to
7619cc9
Compare
417ed97 to
920e8ec
Compare
…loon Adds the FC-side integration plumbing for free page reporting on top of the UFFD REMOVE-event handling in #2520: - template-manager proto: optional bool freePageReporting (field 17). - TemplateConfig + sandbox.Config gain a FreePageReporting bool that flows from template create → build phases (base/steps/finalize) → sandbox factory → fc.Process.Create. - fc.apiClient.enableFreePageReporting calls PUT /balloon with free_page_reporting=true after entropy setup and before VM start. - fcversion.HasFreePageReporting + auto-detect in TemplateCreate gate rollout to FC v1.14+. - LaunchDarkly free-page-reporting flag (default off).
…rchestrator HasFreePageReporting() was added to fcversion.Info but had zero callers in the production path. Mirror the HasHugePages() pattern: let the orchestrator derive the value from the FC version (authoritative), gated by the FreePageReportingFlag LaunchDarkly flag (default false). Also emit an env.free_page_reporting span attribute alongside the existing env.huge_pages one.
Read the Removed bitmap from PageTracker and emit it as DiffMetadata.Empty so REMOVE'd pages become uuid.Nil mappings in the snapshot header (read as zero on resume). Defensively AndNot the empty set out of dirty: settle drains make these disjoint in practice (Removed pages have no PTE, WP-async only sees present pages with WP cleared), but if the invariant ever breaks the guest's last intent for a Removed page is "free, read zero on restore" — so empty must win, not stale dirty content.
❌ 8 Tests Failed:
View the full list of 16 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: FPR conflicts with hugepages
- Added
!hugePagescondition to FPR auto-enable logic, matching the server build path's conflict prevention.
- Added
Or push these changes by commenting:
@cursor push 7c518d0d3e
Preview (7c518d0d3e)
diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -358,7 +358,8 @@
})
}
- // Default FPR on for FC v1.14+; explicit --free-page-reporting overrides.
+ // Default FPR on for FC v1.14+ unless hugepages is enabled.
+ // Firecracker rejects balloon (free-page-reporting) together with hugepages.
var fprEnabled bool
if freePageReporting != nil {
fprEnabled = *freePageReporting
@@ -366,7 +367,7 @@
versionOnly, _, _ := strings.Cut(fcVersion, "_")
supported, err := utils.IsGTEVersion(versionOnly, "v1.14.0")
if err == nil {
- fprEnabled = supported
+ fprEnabled = !hugePages && supported
}
}You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit a1b3a8f. Configure here.
- Drop FPR-related changes superseded by parent PR (create-build, smoketest, template-manager.proto, generated pb.go). - Delete unused block.StateTracker (parent PR added block.Tracker). - Trim verbose comments in fph_gates, fc/client, fc/process, sandbox.go, featureflags, sandbox_features.
…/sandbox-pause-fph
|
Waiting for the merge of #2541, but otherwise should be ready. |
|
Before enabling in prod we need to deploy the kernel fix though. |
Code Review by Qodo
1.
|
Generic name and parameterized FreePageReporting so individual balloon features (FPR today, FPH next) can be opted in independently from the caller without renaming the helper again.
…/sandbox-pause-fph # Conflicts: # packages/orchestrator/pkg/sandbox/fc/client.go # packages/orchestrator/pkg/sandbox/fc/process.go
Adds an opt-in pre-pause step that runs `sync`, `drop_caches`,
`compact_memory`, and `fstrim -av` on the live VM via envd's Process
service to shrink the memfile/rootfs diff. Each step is wrapped in
`timeout -s KILL` with its own cap, so a stuck step (most realistically
a slow `sync` on a large dirty backlog) cannot starve the rest — and a
killed step does not abort the chain (`;`-separated, not `&&`).
Pausing FC is unaffected by an in-flight guest `sync` we time out: FC
only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness —
`drop_caches` is documented non-destructive, `fstrim` consults FS
allocation metadata not pagecache, and a partial `compact_memory` is
just less-compacted.
Disabled by default — the LD flag's null default leaves every step at 0
(skipped). Missing keys, zero, negative, and wrong-type values all
collapse to "skip". The orchestrator skips the envd call entirely when
the chain is empty. The outer `Connect-Timeout-Ms` is the sum of
per-step caps plus a small slack.
Single LD flag, one rule per cohort:
- `guest-pause-reclaim` (JSON) — per-step caps in milliseconds keyed by
step name, evaluated against sandbox / team / template LD contexts so
targeting is configured in LaunchDarkly.
Example value:
```json
{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}
```
`resume-build` exposes `-reclaim` to inject the example values into the
offline LD store for local testing.
Pairs cleanly with #2553 (disable proactive compaction in the guest base
image), but is independent of it and of FPH (#2552). Split out from
#2550.
- New free-page-hinting-arm bool flag controls install-time FreePageHinting on the balloon. Independent from FPR; arming alone doesn't trigger the guest-kernel race that the existing free-page-hinting-timeout-ms flag guards against. - free-page-hinting-timeout-ms now evaluated with sandbox LD context that includes kernel-version + firecracker-version, so operators can roll out the actual drain only on guests with the kernel race fix. - Drop the kernelSupportsFreePageHinting Go-side gate (it was hardcoded to 999.0.0 anyway); kernel eligibility is now expressed in LD targeting.
Resume-from-snapshot can restore non-zero host_cmd/guest_cmd from a prior drain. The old 'host > 0 && guest >= host' check then trivially succeeds on the first describe before FC's VMM thread bumps host_cmd, returning a false-positive completion and silently no-op'ing the drain. Capture hostBefore prior to start and require a strict bump.
'arm' was jargon-y. The flag controls whether FPH is configured on the balloon at install time, so 'install' reads more clearly and pairs naturally with the runtime free-page-hinting-timeout-ms flag.


Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free. The balloon (from parent FPR PR) always arms
FreePageHinting=true; on pause we callstart_balloon_hintingand polldescribe_balloon_hintinguntilguest_cmd >= host_cmd(withhost > 0guard). Reclaimed pages emitUFFD_EVENT_REMOVE, already tracked by parent.Gated by
free-page-hinting-timeout-msLD flag (ms; default 0 = disabled). Operator opts in once the kernel has the FPH race fix. Stacked on parent FPR branch for the shared balloon-install path; split out from #2550.