feat(uffd): handle UFFD_EVENT_REMOVE; track per-page state; race-safe COPY#2520
feat(uffd): handle UFFD_EVENT_REMOVE; track per-page state; race-safe COPY#2520ValentaTomas merged 13 commits intomainfrom
Conversation
PR SummaryHigh Risk Overview Reviewed by Cursor Bugbot for commit 611d0d6. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: da912303d8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
da91230 to
efa01a5
Compare
…ernel redeliver Folds audit findings #1 and #7 into one commit since they share the same error arm in faultPage. The kernel surfaces concurrent mm churn (e.g. balloon-driven madvise(MADV_DONTNEED), mremap, fork against the same mm) through UFFDIO_COPY in two distinct ways: as an EAGAIN errno from the syscall, or — once UFFD_FEATURE_EVENT_REMOVE is enabled — through the partial-copy convention where the syscall returns 0 and cpy.copy carries either -EAGAIN or 0..pagesize. Hugetlb pages can also surface a positive short copy if a fault preempts the operation mid-page (#7). Pre-#2520 the latter path went through fmt.Errorf("UFFDIO_COPY copied N bytes...") and fell into the catch-all writeErr != nil arm — which calls onFailure() / fdExit.SignalExit(), tears the uffd serve loop down, and crashes the sandbox the moment the guest touches an unmapped page. The pre-existing errno-EAGAIN soft handler covered only the syscall errno path. Move the partial-copy classification into a small helper so both surfaces collapse onto the existing EAGAIN-returning-(false, nil) branch in faultPage. No retry budget — matches Firecracker's reference handler in src/firecracker/examples/uffd/uffd_utils.rs (Err(PartiallyCopied(n)) if n == 0 || n == -EAGAIN ⇒ return false). Add a uffd.copy_eagain span attribute for observability. Tests: unit-test classifyCopyResult directly. faultPage doesn't expose an Fd seam to mock UFFDIO_COPY without an interface refactor that would materially expand the diff; per the audit's "smallest pragmatic test" guidance the classifier covers the new branching and the existing cross-process matrix tests cover the integration path.
…onrpc over unix socket (#2519) Replace the cross-process userfaultfd test harness's pipes + signals (`SIGUSR1` shutdown, `SIGUSR2` page-state snapshot, ready/offsets/gate-cmd/gate-sync pipes) with one Unix socket carrying stdlib `net/rpc` + `net/rpc/jsonrpc`. The userfaultfd and the rpc socketpair half are passed via `ExtraFiles`. Production change: one `atomic.Pointer[func(uintptr, faultPhase)]` field on `Userfaultfd` and three nil-checked inline call sites. Test builds install the hook via `SetTestFaultHook` defined in a `_test.go` file. Stacked follow-ups: - `UFFD_EVENT_REMOVE` handling + matrix tests — #2520 - Stale-source / madvise-deadlock / faulted-short-circuit race tests — #2521 - Stale-source race fix — #2512
…loon Adds the FC-side integration plumbing for free page reporting on top of the UFFD REMOVE-event handling in #2520: - template-manager proto: optional bool freePageReporting (field 17). - TemplateConfig + sandbox.Config gain a FreePageReporting bool that flows from template create → build phases (base/steps/finalize) → sandbox factory → fc.Process.Create. - fc.apiClient.enableFreePageReporting calls PUT /balloon with free_page_reporting=true after entropy setup and before VM start. - fcversion.HasFreePageReporting gates rollout to FC v1.14+. - Adds free-page-reporting LaunchDarkly feature flag. - create-build CLI: --free-page-reporting flag, defaults to enabled when FC version supports it. - smoketest: opportunistically enables FPR when the FC version under test supports it. UFFD-side changes (REMOVE handling, page tracker, race tests, fix) remain in #2520; this PR is purely the production rollout path.
…loon Adds the FC-side integration plumbing for free page reporting on top of the UFFD REMOVE-event handling in #2520: - template-manager proto: optional bool freePageReporting (field 17). - TemplateConfig + sandbox.Config gain a FreePageReporting bool that flows from template create → build phases (base/steps/finalize) → sandbox factory → fc.Process.Create. - fc.apiClient.enableFreePageReporting calls PUT /balloon with free_page_reporting=true after entropy setup and before VM start. - fcversion.HasFreePageReporting gates rollout to FC v1.14+. - Adds free-page-reporting LaunchDarkly feature flag. - create-build CLI: --free-page-reporting flag, defaults to enabled when FC version supports it. - smoketest: opportunistically enables FPR when the FC version under test supports it. UFFD-side changes (REMOVE handling, page tracker, race tests, fix) remain in #2520; this PR is purely the production rollout path.
…loon Adds the FC-side integration plumbing for free page reporting on top of the UFFD REMOVE-event handling in #2520: - template-manager proto: optional bool freePageReporting (field 17). - TemplateConfig + sandbox.Config gain a FreePageReporting bool that flows from template create → build phases (base/steps/finalize) → sandbox factory → fc.Process.Create. - fc.apiClient.enableFreePageReporting calls PUT /balloon with free_page_reporting=true after entropy setup and before VM start. - fcversion.HasFreePageReporting gates rollout to FC v1.14+. - Adds free-page-reporting LaunchDarkly feature flag. - create-build CLI: --free-page-reporting flag, defaults to enabled when FC version supports it. - smoketest: opportunistically enables FPR when the FC version under test supports it. UFFD-side changes (REMOVE handling, page tracker, race tests, fix) remain in #2520; this PR is purely the production rollout path.
Replace the generic block.StateTracker[S] with a non-generic block.Tracker over a fixed block.State enum (NotPresent default, Dirty, Zero). The same three states cover both UFFD page management (NotPresent=missing, Dirty=faulted, Zero=removed via DONTNEED/balloon) and upcoming NBD overlay tracking (NotPresent=fall-through, Dirty=overlay-owned, Zero=explicit zero/discard).
37be905 to
dcb7de4
Compare
Production:
- UFFDIO_REGISTER_MODE_REMOVE is requested so the kernel reports
MADV_DONTNEED'd pages via UFFD_EVENT_REMOVE.
- Userfaultfd.Serve splits read events into removes + pagefaults,
drains the REMOVE batch under settleRequests.Lock (calling
pageTracker.SetRange(.., removed) with BlockIdx-computed indices),
then dispatches the pagefault batch.
- Worker dispatch switches on pageTracker.Get(idx): faulted ->
short-circuit, removed -> zero-fill (source = nil), missing ->
copy from u.src. The state read happens inside the worker under
settleRequests.RLock so a concurrent REMOVE can't slip between
the read and the install.
- faultPage gains zero-fill paths for source == nil (4K read =
DONTWAKE zero + WP + wake; 4K write = zero + wake; hugepage =
copy(EmptyHugePage)) and returns (handled, err) so the worker can
defer UFFDIO_COPY EAGAIN back into a deferredFaults queue.
- wakeupPipe + deferredFaults wake the poll loop when a worker
defers, so a deferred fault doesn't sit waiting for an unrelated
UFFD event. The received uffd fd is marked FD_CLOEXEC.
- Prefault short-circuits for faulted || removed.
Tests:
- testConfig gains removeEnabled; the parent unregisters the UFFD
region on cleanup when REMOVE is on so munmap doesn't block on
un-acked events.
- Page-state wire format exposes removed via helpers_test.go.
- operationModeRemove + executeRemove (madvise MADV_DONTNEED).
- runMatrix wraps every existing generic test in remove-off and
remove-on subtests so the no-REMOVE path (still used by
production templates) stays covered while the new path is
exercised. The matrix-level t.Parallel() is intentionally
omitted to cap peak concurrency in CI.
- remove_test.go: TestRemove, TestRemoveThenFault,
TestRemoveThenWriteGated, TestWriteThenRemoveGated. Gated tests
are //nolint:tparallel — a paused gated handler keeps a faulting
goroutine suspended in the kernel pagefault path; a STW GC pause
from a parallel test would wait forever for that goroutine to
reach a safe point.
- race_test.go: deterministic stale-source / madvise-deadlock /
faulted-short-circuit regressions, serialised, with the
FD_CLOEXEC and UFFDIO_COPY-EAGAIN fixes covered.
A worker holding settleRequests.RLock must never block readEvents, because madvise(MADV_DONTNEED) blocks the producer until userspace reads the UFFD_EVENT_REMOVE — and the producer can be the FC balloon thread that other syscalls depend on. Use a dedicated readSerial mutex (not settleRequests) to serialize serve-loop iterations with snapshot-time Export, while keeping the existing settleRequests discipline (workers RLock, REMOVE batch Lock) intact so readEvents remains lock-free relative to workers. Restores liveness for TestNoMadviseDeadlockWithInflightCopy while closing the read-vs-apply race that motivated the prior buggy commit (345f7e9, now amended).
e44d433 to
4d3c128
Compare
❌ 2 Tests Failed:
View the full list of 2 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4d3c128b0a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: uint32 truncation in REMOVE range calculation overflows for >4GB
- Changed to divide (rm.end-rm.start) by uint64(pageSize) before casting to uint32, preventing byte-length truncation for REMOVE events spanning >4GB.
Or push these changes by commenting:
@cursor push 2ebb482dd8
Preview (2ebb482dd8)
diff --git a/packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go b/packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
--- a/packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
+++ b/packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
@@ -289,7 +289,7 @@
}
startIdx := uint32(header.BlockIdx(startOff, int64(u.pageSize)))
- endIdx := startIdx + uint32(rm.end-rm.start)/uint32(u.pageSize)
+ endIdx := startIdx + uint32((rm.end-rm.start)/uint64(u.pageSize))
u.pageTracker.SetRange(startIdx, endIdx, block.Zero)
}
u.settleRequests.Unlock()You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit dee755c. Configure here.
uint32(rm.end-rm.start) truncated the byte length to 32 bits before dividing by pageSize, silently dropping pages from REMOVE events larger than 4 GiB. Divide first; uint32 block indices still address up to ~16 TiB at 4 KiB pages.
a02b428 to
89d54b1
Compare
… hush self-wake faultPage now returns a three-valued faultOutcome (Installed/Deferred/ Discarded). ESRCH maps to Discarded so the caller no longer marks a never-installed page as Dirty in the tracker. EAGAIN maps to Deferred (same behaviour as before). Panic recover returns Discarded + err so the Serve loop exits rather than looping. Serve gained a top-level defer wg.Wait() so workers can't outlive Serve on error-return paths and race with Close() on wakeupPipe / the uffd fd. Stop incrementing noDataCounter on wakeupPipe self-wakes — those are expected (a worker deferred a fault); they were polluting the uffd-no-data metric.
Adds `block.Tracker`, a non-generic state tracker over a fixed `block.State` enum with three universal values that cover both UFFD memory handling and NBD overlay tracking: - `NotPresent` (default) — fall through to the previous layer - `Dirty` — materialized in this layer (UFFD faulted page, NBD overlay-owned block) - `Zero` — known-zero without consulting the previous layer (UFFD page removed via `DONTNEED`/balloon, NBD explicit zero/discard) Backed by two roaring bitmaps (`dirty`, `zero`); `NotPresent` indices are implicit. Exposes `SetRange`, `Get`, and `Export` (returns clones). First consumer is #2520 (UFFD per-page state); also used by #2546 for NBD zero-page tracking.
…loon Adds the FC-side integration plumbing for free page reporting on top of the UFFD REMOVE-event handling in #2520: - template-manager proto: optional bool freePageReporting (field 17). - TemplateConfig + sandbox.Config gain a FreePageReporting bool that flows from template create → build phases (base/steps/finalize) → sandbox factory → fc.Process.Create. - fc.apiClient.enableFreePageReporting calls PUT /balloon with free_page_reporting=true after entropy setup and before VM start. - fcversion.HasFreePageReporting gates rollout to FC v1.14+. - Adds free-page-reporting LaunchDarkly feature flag. - create-build CLI: --free-page-reporting flag, defaults to enabled when FC version supports it. - smoketest: opportunistically enables FPR when the FC version under test supports it. UFFD-side changes (REMOVE handling, page tracker, race tests, fix) remain in #2520; this PR is purely the production rollout path.
…loon Adds the FC-side integration plumbing for free page reporting on top of the UFFD REMOVE-event handling in #2520: - template-manager proto: optional bool freePageReporting (field 17). - TemplateConfig + sandbox.Config gain a FreePageReporting bool that flows from template create → build phases (base/steps/finalize) → sandbox factory → fc.Process.Create. - fc.apiClient.enableFreePageReporting calls PUT /balloon with free_page_reporting=true after entropy setup and before VM start. - fcversion.HasFreePageReporting + auto-detect in TemplateCreate gate rollout to FC v1.14+. - LaunchDarkly free-page-reporting flag (default off).
Wires Firecracker balloon free-page-reporting through the template build into `pkg/sandbox/fc`. In the orchestrator path FPR turns on only when `fcInfo.HasFreePageReporting()` (FC v1.14+) AND the `free-page-reporting` LaunchDarkly flag are both true; `cmd/create-build` defaults to the FC-version-derived value and accepts an explicit `--free-page-reporting` override. Depends on #2545 → #2520 — without the REMOVE-event handling from #2520, FC's `madvise(MADV_DONTNEED)` on balloon deflate would race in-flight pagefault workers.


Handles
UFFD_EVENT_REMOVEso balloon-deflate /madvise(MADV_DONTNEED)transitions pages to aremovedstate instead of leaving stalefaultedmappings. Drains the REMOVE batch undersettleRequests.Lock; workers holdsettleRequests.RLockacross the state read →UFFDIO_COPY→SetRangesequence so a concurrent REMOVE can't slip between the read and the install. Soft-failsUFFDIO_COPYEAGAIN / partial copies onto adeferredFaultsqueue and wakes the poll loop via a self-pipe.Depends on #2545.