
feat(fc): drain virtio-balloon free-page-hinting before pause #2552

Open
ValentaTomas wants to merge 34 commits into main from feat/sandbox-pause-fph


@ValentaTomas
Member

@ValentaTomas ValentaTomas commented May 4, 2026

Drains virtio-balloon free-page-hinting (FPH) before pause so snapshots don't capture pages the guest already considers free. The balloon (from the parent FPR PR) always arms FreePageHinting=true; on pause we call start_balloon_hinting and poll describe_balloon_hinting until guest_cmd >= host_cmd, with a host_cmd > 0 guard against transient nil/zero responses. Reclaimed pages emit UFFD_EVENT_REMOVE, which the parent PR already tracks.

Gated by the free-page-hinting-timeout-ms LaunchDarkly (LD) flag (in ms; default 0 = disabled). Operators opt in once the kernel has the FPH race fix. Stacked on the parent FPR branch for the shared balloon-install path; split out from #2550.

ValentaTomas and others added 11 commits May 3, 2026 01:26
A small two-state-plus-default tracker backed by roaring bitmaps. Used by
upcoming UFFD work to track page states (Missing/Faulted/Removed) and by
NBD to track zero pages, replacing ad-hoc map-based trackers with O(1)
range ops and cheap snapshot exports.
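
To make the "two-state-plus-default" idea concrete, here is a minimal sketch. It is not the block.StateTracker API: it uses plain []uint64 bitsets instead of roaring bitmaps (so range updates are linear in the range length), and every name in it (tracker, setRange, get) is hypothetical.

```go
package main

import "fmt"

// state is a hypothetical three-valued page state; the zero value is the default.
type state uint8

const (
	missing state = iota // default: the page is in neither bitset
	faulted
	removed
)

// tracker keeps one bitset per non-default state.
type tracker struct {
	faulted []uint64
	removed []uint64
}

func newTracker(pages int) *tracker {
	words := (pages + 63) / 64
	return &tracker{faulted: make([]uint64, words), removed: make([]uint64, words)}
}

// setBit sets or clears bit i of b.
func setBit(b []uint64, i int, on bool) {
	mask := uint64(1) << (uint(i) % 64)
	if on {
		b[i/64] |= mask
	} else {
		b[i/64] &^= mask
	}
}

// setRange marks pages [start, end) with s, clearing any previous state.
func (t *tracker) setRange(start, end int, s state) {
	for i := start; i < end; i++ {
		setBit(t.faulted, i, s == faulted)
		setBit(t.removed, i, s == removed)
	}
}

// get returns the state of page i, falling back to the default.
func (t *tracker) get(i int) state {
	mask := uint64(1) << (uint(i) % 64)
	switch {
	case t.faulted[i/64]&mask != 0:
		return faulted
	case t.removed[i/64]&mask != 0:
		return removed
	default:
		return missing
	}
}

func main() {
	tr := newTracker(128)
	tr.setRange(0, 8, faulted)
	tr.setRange(4, 6, removed)
	fmt.Println(tr.get(0), tr.get(4), tr.get(100))
}
```

Keeping the default state implicit means an untouched page costs no storage, and exporting a snapshot amounts to copying the two bitsets.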
…state

Replace the map-based pageTracker with block.StateTracker[pageState], a
roaring-bitmap-backed tracker with O(1) range ops. pageState gains a
third value, removed, which is wired at the type level but not yet
written anywhere -- #2520 adds the REMOVE-event handler that produces
it. Page indices are computed at the call site via header.BlockIdx.
pageStateEntries is updated to iterate the exported bitmaps so the
cross-process test harness keeps working.

Inline the 3-line pageState enum into userfaultfd.go and drop the
dedicated page_tracker.go now that pageTracker is gone.

Convert block.StateTracker's NewStateTracker / SetRange API from panics
to errors. Distinct-state validation and unsupported-state checks now
return fmt.Errorf descriptors; the userfaultfd-side init propagates the
constructor error through NewUserfaultfdFromFd, and the SetRange call
in the worker path logs and continues since these errors only fire on
programming bugs.
Production:
  - UFFDIO_REGISTER_MODE_REMOVE is requested so the kernel reports
    MADV_DONTNEED'd pages via UFFD_EVENT_REMOVE.
  - Userfaultfd.Serve splits read events into removes + pagefaults,
    drains the REMOVE batch under settleRequests.Lock (calling
    pageTracker.SetRange(.., removed) with BlockIdx-computed indices),
    then dispatches the pagefault batch.
  - Worker dispatch switches on pageTracker.Get(idx): faulted ->
    short-circuit, removed -> zero-fill (source = nil), missing ->
    copy from u.src. The state read happens inside the worker under
    settleRequests.RLock so a concurrent REMOVE can't slip between
    the read and the install.
  - faultPage gains zero-fill paths for source == nil (4K read =
    DONTWAKE zero + WP + wake; 4K write = zero + wake; hugepage =
    copy(EmptyHugePage)) and returns (handled, err) so the worker can
    defer UFFDIO_COPY EAGAIN back into a deferredFaults queue.
  - wakeupPipe + deferredFaults wake the poll loop when a worker
    defers, so a deferred fault doesn't sit waiting for an unrelated
    UFFD event. The received uffd fd is marked FD_CLOEXEC.
  - Prefault short-circuits for faulted || removed.
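
The worker dispatch described above can be sketched roughly as follows. This is an illustration only: resolve, pageSize, and the byte-slice source are assumptions made for the example, and the real worker performs the state read under settleRequests.RLock and installs pages through UFFD ioctls, which this sketch leaves out.

```go
package main

import "fmt"

type pageState int

const (
	missing pageState = iota
	faulted
	removed
)

// pageSize is assumed to be 4 KiB for this sketch.
const pageSize = 4096

// resolve returns the bytes to install for a faulting page. The boolean is
// false when the page is already present and nothing needs installing.
func resolve(st pageState, src []byte, idx int) ([]byte, bool) {
	switch st {
	case faulted:
		return nil, false // already installed: short-circuit
	case removed:
		return make([]byte, pageSize), true // guest freed it: zero-fill
	default:
		off := idx * pageSize
		return src[off : off+pageSize], true // missing: copy from the source
	}
}

func main() {
	src := make([]byte, 2*pageSize)
	src[pageSize] = 7

	page, install := resolve(missing, src, 1)
	fmt.Println(install, page[0])

	page, install = resolve(removed, src, 1)
	fmt.Println(install, page[0])

	_, install = resolve(faulted, src, 1)
	fmt.Println(install)
}
```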

Tests:
  - testConfig gains removeEnabled; the parent unregisters the UFFD
    region on cleanup when REMOVE is on so munmap doesn't block on
    un-acked events.
  - Page-state wire format exposes removed via helpers_test.go.
  - operationModeRemove + executeRemove (madvise MADV_DONTNEED).
  - runMatrix wraps every existing generic test in remove-off and
    remove-on subtests so the no-REMOVE path (still used by
    production templates) stays covered while the new path is
    exercised. The matrix-level t.Parallel() is intentionally
    omitted to cap peak concurrency in CI.
  - remove_test.go: TestRemove, TestRemoveThenFault,
    TestRemoveThenWriteGated, TestWriteThenRemoveGated. Gated tests
    are //nolint:tparallel — a paused gated handler keeps a faulting
    goroutine suspended in the kernel pagefault path; a STW GC pause
    from a parallel test would wait forever for that goroutine to
    reach a safe point.
  - race_test.go: deterministic stale-source / madvise-deadlock /
    faulted-short-circuit regressions, serialised, with the
    FD_CLOEXEC and UFFDIO_COPY-EAGAIN fixes covered.
A worker holding settleRequests.RLock must never block readEvents,
because madvise(MADV_DONTNEED) blocks the producer until userspace
reads the UFFD_EVENT_REMOVE — and the producer can be the FC balloon
thread that other syscalls depend on. Use a dedicated readSerial
mutex (not settleRequests) to serialize serve-loop iterations with
snapshot-time Export, while keeping the existing settleRequests
discipline (workers RLock, REMOVE batch Lock) intact so readEvents
remains lock-free relative to workers.

Restores liveness for TestNoMadviseDeadlockWithInflightCopy while
closing the read-vs-apply race that motivated the prior buggy commit
(345f7e9, now amended).
…loon

Adds the FC-side integration plumbing for free page reporting on top of
the UFFD REMOVE-event handling in #2520:

- template-manager proto: optional bool freePageReporting (field 17).
- TemplateConfig + sandbox.Config gain a FreePageReporting bool that
  flows from template create → build phases (base/steps/finalize) →
  sandbox factory → fc.Process.Create.
- fc.apiClient.enableFreePageReporting calls PUT /balloon with
  free_page_reporting=true after entropy setup and before VM start.
- fcversion.HasFreePageReporting gates rollout to FC v1.14+.
- Adds free-page-reporting LaunchDarkly feature flag.
- create-build CLI: --free-page-reporting flag, defaults to enabled
  when FC version supports it.
- smoketest: opportunistically enables FPR when the FC version under
  test supports it.

UFFD-side changes (REMOVE handling, page tracker, race tests, fix)
remain in #2520; this PR is purely the production rollout path.
…rchestrator

HasFreePageReporting() was added to fcversion.Info but had zero callers
in the production path. Mirror the HasHugePages() pattern: let the
orchestrator derive the value from the FC version (authoritative),
gated by the FreePageReportingFlag LaunchDarkly flag (default false).

Also emit an env.free_page_reporting span attribute alongside the
existing env.huge_pages one.
Read the Removed bitmap from PageTracker and emit it as DiffMetadata.Empty
so REMOVE'd pages become uuid.Nil mappings in the snapshot header (read as
zero on resume). Defensively AndNot the empty set out of dirty: settle drains
make these disjoint in practice (Removed pages have no PTE, WP-async only
sees present pages with WP cleared), but if the invariant ever breaks the
guest's last intent for a Removed page is "free, read zero on restore" — so
empty must win, not stale dirty content.
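
The defensive AndNot amounts to clearing every Removed bit from the dirty set. A minimal sketch over plain []uint64 bitsets (the commit operates on roaring bitmaps; subtract is a hypothetical helper):

```go
package main

import "fmt"

// subtract clears every bit of empty from dirty in place (dirty AND NOT empty).
func subtract(dirty, empty []uint64) {
	for i := range dirty {
		if i < len(empty) {
			dirty[i] &^= empty[i]
		}
	}
}

func main() {
	dirty := []uint64{0b1110} // pages 1, 2, 3 are dirty
	empty := []uint64{0b0110} // pages 1, 2 were removed by the guest
	subtract(dirty, empty)
	fmt.Printf("%04b\n", dirty[0]) // prints 1000: only page 3 stays dirty
}
```

When the two sets are already disjoint, as the commit message expects in practice, the operation changes nothing.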
@cursor

cursor Bot commented May 4, 2026

PR Summary

Medium Risk
Changes the VM pause/snapshot path to run and wait on Firecracker free-page-hinting, which could increase pause latency or fail in unexpected FC/guest configurations despite best-effort handling. Adds new LaunchDarkly context targeting and flag overrides that could be misconfigured and silently disable/enable the behavior.

Overview
Adds a pre-pause virtio-balloon free-page-hinting drain (start + poll) gated by a new free-page-hinting-timeout-ms int flag and kernel/Firecracker-version LaunchDarkly context; this can extend pause time, and the polling loop’s time.After usage may add avoidable timer churn under load. The drain treats Firecracker 400s as “not configured” and returns nil, so misconfiguration (balloon not installed or API mismatch) can silently skip the intended reclaim. Balloon install now optionally enables free-page-hinting (plus new FC-version gating), and resume-build can override the timeout flag at runtime, which may have surprising global side effects in tests or multi-run processes.

Reviewed by Cursor Bugbot for commit 51ae406. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch 3 times, most recently from 263a0d0 to f4e3ab0 Compare May 4, 2026 00:51
Arm free-page-hinting on the existing balloon device (always set when
the balloon is installed; pure runtime toggle), and on pause do a
host-initiated hint+wait so MADV_DONTNEED-reclaimed pages are settled
before the snapshot. Pages reclaimed this way generate UFFD_EVENT_REMOVE,
which the orchestrator already tracks (parent FPR PR), so the snapshot
captures them as removed instead of zero-filled.

- fc/client.go: rename enableFreePageReporting -> installBalloon;
  always set FreePageHinting=true; add startBalloonHinting +
  describeBalloonHinting helpers.
- fc/process.go: track balloonInstalled; add DrainBalloon (start +
  poll guest_cmd >= host_cmd, with host>0 guard against transient
  nil/zero responses).
- sandbox.go: wire featureFlags into Sandbox; call DrainBalloon from
  Pause behind the flag. Failures are logged but non-fatal.

Gated by free-page-hinting-timeout-ms (LD int flag, ms; default 0 =
disabled). resume-build gains --fph-timeout-ms for local exercise.
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch from f4e3ab0 to 7619cc9 Compare May 4, 2026 00:55
@ValentaTomas ValentaTomas force-pushed the feat/uffd-fc-free-page-reporting-integration branch from 417ed97 to 920e8ec Compare May 5, 2026 08:10
…loon

Adds the FC-side integration plumbing for free page reporting on top of
the UFFD REMOVE-event handling in #2520:

- template-manager proto: optional bool freePageReporting (field 17).
- TemplateConfig + sandbox.Config gain a FreePageReporting bool that
  flows from template create → build phases (base/steps/finalize) →
  sandbox factory → fc.Process.Create.
- fc.apiClient.enableFreePageReporting calls PUT /balloon with
  free_page_reporting=true after entropy setup and before VM start.
- fcversion.HasFreePageReporting + auto-detect in TemplateCreate gate
  rollout to FC v1.14+.
- LaunchDarkly free-page-reporting flag (default off).
@cla-bot cla-bot Bot added the cla-signed label May 6, 2026
@linear-code

linear-code Bot commented May 6, 2026

@codecov

codecov Bot commented May 6, 2026

❌ 8 Tests Failed:

Tests completed: 2605 | Failed: 8 | Passed: 2597 | Skipped: 7
View the full list of 16 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 72.99% (Passed 74 times, Failed 200 times)

Stack Traces | 219s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (218.70s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 73.41% (Passed 71 times, Failed 196 times)

Stack Traces | 6.58s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1363}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
Executing command curl in sandbox i7guecv6pfgtd9sfc266h
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1364}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
Executing command curl in sandbox i7guecv6pfgtd9sfc266h
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1365}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Fri, 08 May 2026 23:58:51 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox i7guecv6pfgtd9sfc266h
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (6.58s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 53.12% (Passed 120 times, Failed 136 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 58.64% (Passed 67 times, Failed 95 times)

Stack Traces | 7.8s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1256}}
Executing command python in sandbox iuh5b4dd97tudt6cknobf
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (7.80s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 53.42% (Passed 68 times, Failed 78 times)

Stack Traces | 7.49s run time
=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
Executing command python in sandbox icn4olsjqbokfrvgg007f
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1256}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (7.49s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::

Flake rate in main: 50.00% (Passed 68 times, Failed 68 times)

Stack Traces | 7.26s run time
=== RUN   TestBindLocalhost/bind_::
=== PAUSE TestBindLocalhost/bind_::
=== CONT  TestBindLocalhost/bind_::
Executing command python in sandbox i0nn8cwccscwhqghonn1t
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1256}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::
        	Messages:   	Unexpected status code 502 for bind address ::
--- FAIL: TestBindLocalhost/bind_:: (7.26s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 60.59% (Passed 67 times, Failed 103 times)

Stack Traces | 8.62s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox i7tsf814tvovf6794zqzo
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1257}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (8.62s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 60.59% (Passed 67 times, Failed 103 times)

Stack Traces | 8.06s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox ipcz08p64yvmoo544wu4z
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1256}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (8.06s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir

Flake rate in main: 46.88% (Passed 85 times, Failed 75 times)

Stack Traces | 0.67s run time
=== RUN   TestListDir
=== PAUSE TestListDir
=== CONT  TestListDir
Executing command findmnt in sandbox ixtkk5mvx4homhc5a80q0 (user: root)
--- FAIL: TestListDir (0.67s)
Executing command python in sandbox iqviy6wabok9f9530txz5
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_0_lists_only_root_directory

Flake rate in main: 51.80% (Passed 67 times, Failed 72 times)

Stack Traces | 0.03s run time
=== RUN   TestListDir/depth_0_lists_only_root_directory
=== PAUSE TestListDir/depth_0_lists_only_root_directory
=== CONT  TestListDir/depth_0_lists_only_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_0_lists_only_root_directory
--- FAIL: TestListDir/depth_0_lists_only_root_directory (0.03s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_1_lists_root_directory

Flake rate in main: 51.80% (Passed 67 times, Failed 72 times)

Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_1_lists_root_directory
=== PAUSE TestListDir/depth_1_lists_root_directory
=== CONT  TestListDir/depth_1_lists_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_1_lists_root_directory
--- FAIL: TestListDir/depth_1_lists_root_directory (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)

Flake rate in main: 51.80% (Passed 67 times, Failed 72 times)

Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== PAUSE TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== CONT  TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
--- FAIL: TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory) (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_3_lists_all_directories_and_files

Flake rate in main: 51.80% (Passed 67 times, Failed 72 times)

Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_3_lists_all_directories_and_files
=== PAUSE TestListDir/depth_3_lists_all_directories_and_files
=== CONT  TestListDir/depth_3_lists_all_directories_and_files
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_3_lists_all_directories_and_files
--- FAIL: TestListDir/depth_3_lists_all_directories_and_files (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 61.31% (Passed 77 times, Failed 122 times)

Stack Traces | 86.9s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (86.92s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 63.39% (Passed 67 times, Failed 116 times)

Stack Traces | 31.1s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1264}}
Executing command bash in sandbox i5s61f2l7mdl6x14ql8hs (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 187 MB\nFree memory before tmpfs mount: 797 MB\nMemory to use in integrity test (80% of free, min 64MB): 637 MB\n"}}
Executing command bash in sandbox i9zjeas2tpcqbnuk3ze0f (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"637+0 records in\n637+0 records out\n667942912 bytes (668 MB, 637 MiB) copied, 3.38733 s, 197 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=637\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.35\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.39\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2708\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page faults: 343\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 19\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 829 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i7bhx7a4dg7mymijq4awr
Executing command bash in sandbox i7bhx7a4dg7mymijq4awr (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1280}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"eb30d455a7a147d3ae92a46e37736f3dc7aa53c1ed995249018194bbbdceac89\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox i7bhx7a4dg7mymijq4awr
Executing command bash in sandbox i7bhx7a4dg7mymijq4awr (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1283}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox i7bhx7a4dg7mymijq4awr: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (31.10s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestSandboxAutoResumeViaProxy

Flake rate in main: 48.87% (Passed 68 times, Failed 65 times)

Stack Traces | 19.3s run time
=== RUN   TestSandboxAutoResumeViaProxy
=== PAUSE TestSandboxAutoResumeViaProxy
=== CONT  TestSandboxAutoResumeViaProxy
Executing command apt-get in sandbox iub8r0196f0dfa8u6wkam (user: root)
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"izpi26yydnrq5gn7h7urr","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"izpi26yydnrq5gn7h7urr","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:116: 
        	Error Trace:	.../tests/proxies/auto_resume_test.go:116
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestSandboxAutoResumeViaProxy
--- FAIL: TestSandboxAutoResumeViaProxy (19.31s)



@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: FPR conflicts with hugepages
    • Added !hugePages condition to FPR auto-enable logic, matching the server build path's conflict prevention.


Or push these changes by commenting:

@cursor push 7c518d0d3e
Preview (7c518d0d3e)
diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -358,7 +358,8 @@
 		})
 	}
 
-	// Default FPR on for FC v1.14+; explicit --free-page-reporting overrides.
+	// Default FPR on for FC v1.14+ unless hugepages is enabled.
+	// Firecracker rejects balloon (free-page-reporting) together with hugepages.
 	var fprEnabled bool
 	if freePageReporting != nil {
 		fprEnabled = *freePageReporting
@@ -366,7 +367,7 @@
 		versionOnly, _, _ := strings.Cut(fcVersion, "_")
 		supported, err := utils.IsGTEVersion(versionOnly, "v1.14.0")
 		if err == nil {
-			fprEnabled = supported
+			fprEnabled = !hugePages && supported
 		}
 	}


Reviewed by Cursor Bugbot for commit a1b3a8f. Configure here.

Comment thread packages/orchestrator/cmd/create-build/main.go Outdated
- Drop FPR-related changes superseded by parent PR (create-build,
  smoketest, template-manager.proto, generated pb.go).
- Delete unused block.StateTracker (parent PR added block.Tracker).
- Trim verbose comments in fph_gates, fc/client, fc/process,
  sandbox.go, featureflags, sandbox_features.
@ValentaTomas
Member Author

Waiting for the merge of #2541, but otherwise should be ready.

@ValentaTomas ValentaTomas marked this pull request as ready for review May 7, 2026 06:28
@ValentaTomas
Member Author

Before enabling in prod we need to deploy the kernel fix though.

@qodo-code-review

qodo-code-review Bot commented May 7, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. FPH kernel gate disables ✓ Resolved 🐞
Description
MinFreePageHintingKernelVersion is set to 999.0.0, so kernelSupportsFreePageHinting() will never
enable FreePageHinting for normal guest kernels and installBalloon() will always configure the
balloon with hinting disabled. With hinting disabled, DrainBalloon() will consistently no-op as “not
configured”, so enabling free-page-hinting-timeout-ms won’t actually drain anything before pause.
Code

packages/orchestrator/pkg/sandbox/fc/fph_gates.go[R10-18]

+// MinFreePageHintingKernelVersion is the minimum guest kernel version that
+// contains the FPH/MADV_DONTNEED race fix. Bump once the fixed kernel ships.
+const MinFreePageHintingKernelVersion = "999.0.0"
+
+func kernelSupportsFreePageHinting(kernelVersion string) bool {
+	v := strings.TrimPrefix(kernelVersion, "vmlinux-")
+	ok, _ := utils.IsGTEVersion(v, MinFreePageHintingKernelVersion)
+
+	return ok
Evidence
The kernel gate compares the guest kernel version against 999.0.0, which will fail for real kernel
versions (e.g. the repo default vmlinux-6.1.158), causing freePageHinting to be false when
configuring the balloon. Firecracker’s API reports 400 when hinting wasn’t enabled at device
configuration time; DrainBalloon treats that specific 400 as “not configured” and returns nil,
making the pre-pause drain ineffective.

packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
packages/orchestrator/pkg/sandbox/fc/process.go[446-454]
packages/shared/pkg/featureflags/flags.go[244-247]
packages/shared/pkg/fc/client/operations/start_balloon_hinting_responses.go[110-114]
packages/orchestrator/pkg/sandbox/fc/process.go[734-740]
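
To illustrate why the gate never opens, here is a small sketch of the comparison. isGTE is a hypothetical stand-in for the repo's utils.IsGTEVersion (whose exact parsing rules are not shown in this thread); it compares dotted numeric versions component by component.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isGTE reports whether version >= minimum for dotted numeric versions.
// It is a stand-in written for this example, not the repo helper.
func isGTE(version, minimum string) (bool, error) {
	v := strings.Split(version, ".")
	m := strings.Split(minimum, ".")
	for i := 0; i < len(v) || i < len(m); i++ {
		var a, b int
		var err error
		if i < len(v) {
			if a, err = strconv.Atoi(v[i]); err != nil {
				return false, err
			}
		}
		if i < len(m) {
			if b, err = strconv.Atoi(m[i]); err != nil {
				return false, err
			}
		}
		if a != b {
			return a > b, nil
		}
	}
	return true, nil
}

// kernelSupports mirrors the shape of the gate quoted above: strip the
// "vmlinux-" prefix, compare, and treat a parse error as unsupported.
func kernelSupports(kernelVersion, minimum string) bool {
	ok, _ := isGTE(strings.TrimPrefix(kernelVersion, "vmlinux-"), minimum)
	return ok
}

func main() {
	fmt.Println(kernelSupports("vmlinux-6.1.158", "999.0.0")) // false
	fmt.Println(kernelSupports("vmlinux-6.1.158", "6.1.0"))   // true
}
```

With the placeholder minimum of 999.0.0, every real guest kernel fails the check; the second call shows the gate opening once the minimum is set to an actual release (6.1.0 is only an example value).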

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Free-page-hinting is effectively impossible to enable because `MinFreePageHintingKernelVersion` is hardcoded to `999.0.0`, making `kernelSupportsFreePageHinting()` always return false for real kernel versions; this causes the balloon to be configured without hinting and makes `DrainBalloon()` a no-op.

### Issue Context
The pre-pause drain is guarded by a timeout feature flag, but the balloon hinting capability is separately gated by the kernel version check; with the current constant, the drain cannot ever perform useful work.

### Fix Focus Areas
- packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
- packages/orchestrator/pkg/sandbox/fc/process.go[446-454]




Remediation recommended

2. FPH override no-op online 🐞
Description
resume-build’s -fph-timeout-ms calls featureflags.NewIntFlag(), which only updates the offline test
datasource, not a live LaunchDarkly environment. When LAUNCH_DARKLY_API_KEY is set,
NewClientWithLogLevel uses a real LaunchDarkly client and the override is ignored, so the CLI flag
does not do what its help text claims.
Code

packages/orchestrator/cmd/resume-build/main.go[R76-82]

+	fphTimeoutMs := flag.Int("fph-timeout-ms", 0, "override free-page-hinting-timeout-ms LD flag (0 = use LD default)")
+
	flag.Parse()

+	if *fphTimeoutMs > 0 {
+		featureflags.NewIntFlag("free-page-hinting-timeout-ms", *fphTimeoutMs)
+	}
Evidence
The CLI override is implemented by calling NewIntFlag(), which mutates the in-process ldtestdata
(offline) store. The featureflags client switches to a real LaunchDarkly client whenever
LAUNCH_DARKLY_API_KEY is set, so changes to the offline store won’t affect evaluation in that mode.

packages/orchestrator/cmd/resume-build/main.go[76-82]
packages/shared/pkg/featureflags/flags.go[147-152]
packages/shared/pkg/featureflags/client.go[19-23]
packages/shared/pkg/featureflags/client.go[71-86]
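
One way to make the override deterministic is to resolve it before consulting any flag client at all. The sketch below assumes a hypothetical `intFlagSource` interface and `resolveTimeoutMs` helper. Neither exists in the repository, and the real `featureflags` client API may look different:

```go
package main

import "fmt"

// intFlagSource is a hypothetical stand-in for whichever client evaluates
// the flag (offline test data or a live LaunchDarkly client).
type intFlagSource interface {
	IntFlag(name string, fallback int) int
}

// staticSource is a trivial in-memory source used only for this example.
type staticSource map[string]int

func (s staticSource) IntFlag(name string, fallback int) int {
	if v, ok := s[name]; ok {
		return v
	}
	return fallback
}

// resolveTimeoutMs prefers an explicit CLI override and only falls back to
// the flag source when no override was given (0 means "use the flag").
func resolveTimeoutMs(cliOverrideMs int, src intFlagSource) int {
	if cliOverrideMs > 0 {
		return cliOverrideMs
	}
	return src.IntFlag("free-page-hinting-timeout-ms", 0)
}

func main() {
	live := staticSource{"free-page-hinting-timeout-ms": 500}

	fmt.Println(resolveTimeoutMs(2000, live))        // 2000: the override wins
	fmt.Println(resolveTimeoutMs(0, live))           // 500: the flag value is used
	fmt.Println(resolveTimeoutMs(0, staticSource{})) // 0: drain stays disabled
}
```

Because the override is applied outside the flag client, the result is the same whether the client is backed by offline test data or a live environment.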

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`-fph-timeout-ms` currently only affects the offline LaunchDarkly test datasource; when a real LaunchDarkly client is in use, the override is ignored.

### Issue Context
The flag help text says it “overrides free-page-hinting-timeout-ms LD flag”, so it should deterministically control the drain timeout in resume-build regardless of whether LaunchDarkly is configured.

### Fix Focus Areas
- packages/orchestrator/cmd/resume-build/main.go[76-82]
- packages/shared/pkg/featureflags/flags.go[147-152]
- packages/shared/pkg/featureflags/client.go[19-23]
- packages/shared/pkg/featureflags/client.go[71-86]




Comment thread packages/orchestrator/pkg/sandbox/fc/fph_gates.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
Generic name and parameterized FreePageReporting so individual balloon
features (FPR today, FPH next) can be opted in independently from the
caller without renaming the helper again.
…/sandbox-pause-fph

# Conflicts:
#	packages/orchestrator/pkg/sandbox/fc/client.go
#	packages/orchestrator/pkg/sandbox/fc/process.go
ValentaTomas added a commit that referenced this pull request May 8, 2026
Adds an opt-in pre-pause step that runs `sync`, `drop_caches`,
`compact_memory`, and `fstrim -av` on the live VM via envd's Process
service to shrink the memfile/rootfs diff. Each step is wrapped in
`timeout -s KILL` with its own cap, so a stuck step (most realistically
a slow `sync` on a large dirty backlog) cannot starve the rest — and a
killed step does not abort the chain (`;`-separated, not `&&`).

Pausing FC is unaffected by an in-flight guest `sync` we time out: FC
only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness —
`drop_caches` is documented non-destructive, `fstrim` consults FS
allocation metadata not pagecache, and a partial `compact_memory` is
just less-compacted.

Disabled by default — the LD flag's null default leaves every step at 0
(skipped). Missing keys, zero, negative, and wrong-type values all
collapse to "skip". The orchestrator skips the envd call entirely when
the chain is empty. The outer `Connect-Timeout-Ms` is the sum of
per-step caps plus a small slack.

Single LD flag, one rule per cohort:

- `guest-pause-reclaim` (JSON) — per-step caps in milliseconds keyed by
step name, evaluated against sandbox / team / template LD contexts so
targeting is configured in LaunchDarkly.

Example value:

```json
{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}
```

`resume-build` exposes `-reclaim` to inject the example values into the
offline LD store for local testing.

Pairs cleanly with #2553 (disable proactive compaction in the guest base
image), but is independent of it and of FPH (#2552). Split out from
#2550.
ValentaTomas and others added 3 commits May 8, 2026 01:29
- New free-page-hinting-arm bool flag controls install-time FreePageHinting
  on the balloon. Independent from FPR; arming alone doesn't trigger the
  guest-kernel race that the existing free-page-hinting-timeout-ms flag
  guards against.
- free-page-hinting-timeout-ms now evaluated with sandbox LD context that
  includes kernel-version + firecracker-version, so operators can roll out
  the actual drain only on guests with the kernel race fix.
- Drop the kernelSupportsFreePageHinting Go-side gate (it was hardcoded to
  999.0.0 anyway); kernel eligibility is now expressed in LD targeting.
Resume-from-snapshot can restore non-zero host_cmd/guest_cmd from a prior
drain. The old 'host > 0 && guest >= host' check then trivially succeeds
on the first describe before FC's VMM thread bumps host_cmd, returning a
false-positive completion and silently no-op'ing the drain. Capture
hostBefore prior to start and require a strict bump.
@ValentaTomas ValentaTomas requested review from bchalios and kalyazin and removed request for dobrac and jakubno May 8, 2026 08:48
'arm' was jargon-y. The flag controls whether FPH is configured on the
balloon at install time, so 'install' reads more clearly and pairs
naturally with the runtime free-page-hinting-timeout-ms flag.
Base automatically changed from feat/uffd-fc-free-page-reporting-integration to main May 8, 2026 23:42