Skip to content

feat(sandbox): pre-pause guest reclaim via envd#2551

Merged
ValentaTomas merged 20 commits intomainfrom
feat/sandbox-pause-reclaim
May 8, 2026
Merged

feat(sandbox): pre-pause guest reclaim via envd#2551
ValentaTomas merged 20 commits intomainfrom
feat/sandbox-pause-reclaim

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented May 4, 2026

Adds an opt-in pre-pause step that runs sync, drop_caches, compact_memory, and fstrim -av on the live VM via envd's Process service to shrink the memfile/rootfs diff. Each step is wrapped in timeout -s KILL with its own cap, so a stuck step (most realistically a slow sync on a large dirty backlog) cannot starve the rest — and a killed step does not abort the chain (;-separated, not &&).

Pausing FC is unaffected by an in-flight guest sync we time out: FC only drains in-flight virtio I/O before completing the pause; any unflushed dirty pages stay in the memfile snapshot and converge on resume. Per-step timeouts trade reclaim payoff, never correctness — drop_caches is documented non-destructive, fstrim consults FS allocation metadata not pagecache, and a partial compact_memory is just less-compacted.

Disabled by default — the LD flag's null default leaves every step at 0 (skipped). Missing keys, zero, negative, and wrong-type values all collapse to "skip". The orchestrator skips the envd call entirely when the chain is empty. The outer Connect-Timeout-Ms is the sum of per-step caps plus a small slack.

Single LD flag, one rule per cohort:

  • guest-pause-reclaim (JSON) — per-step caps in milliseconds keyed by step name, evaluated against sandbox / team / template LD contexts so targeting is configured in LaunchDarkly.

Example value:

{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}

resume-build exposes -reclaim to inject the example values into the offline LD store for local testing.

Pairs cleanly with #2553 (disable proactive compaction in the guest base image), but is independent of it and of FPH (#2552). Split out from #2550.

Run sync, drop_caches, compact_memory, and fstrim -av on the live VM
through envd's Process service immediately before pause to shrink the
memfile/rootfs diff snapshot. Composed as a single bash chain with
';' separators so each step is best-effort, the orchestrator owns the
deadline via Connect-Timeout-Ms, and all failures are non-fatal.

Gated by reclaim-on-pause-timeout-ms (LD int flag, ms; default 0 =
disabled). resume-build gains a matching --reclaim-timeout-ms override
for local exercise.
@cursor
Copy link
Copy Markdown

cursor Bot commented May 4, 2026

PR Summary

Medium Risk
Touches the sandbox pause/snapshot path and adds guest-side commands executed via envd, which could increase pause latency or fail unexpectedly depending on guest tooling and envd behavior.

Overview
This can panic if Pause() is called on a Sandbox constructed without featureFlags (e.g., in tests/helpers) because the reclaim path assumes a non-nil client. The reclaim chain assumes guest availability of timeout, fstrim, and /proc/sys/vm/*; when missing or blocked it will consistently fail (logged only), potentially masking real regressions while still adding pre-pause overhead. resume-build command execution now relies on Connect-Timeout-Ms behavior for kill semantics; if envd ignores it, long-running commands may no longer be bounded as intended.

Reviewed by Cursor Bugbot for commit 1b451c2. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
Wraps each reclaim step (sync, drop_caches, compact_memory, fstrim) in
its own `timeout -s KILL`. A stuck step (most realistically a slow sync
on a large dirty backlog) cannot starve the rest, so compact_memory —
the diff-critical step — always runs as long as its cap is > 0.

Per-step ceilings are runtime-configurable via four new IntFlags:
- reclaim-sync-timeout-ms (default 500)
- reclaim-drop-caches-timeout-ms (default 200)
- reclaim-compact-memory-timeout-ms (default 1000)
- reclaim-fstrim-timeout-ms (default 500)

Setting any per-step cap to 0 skips that step. The outer
reclaim-on-pause-timeout-ms remains the master enable + Connect-Timeout-Ms
cap.

Pausing FC is unaffected by an in-flight guest sync that we time out:
FC only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness.
Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
…uffix and join

Three fixes triggered by Cursor Bugbot review of the previous commit and
a follow-up question on the master flag:

1. Drop reclaim-on-pause-timeout-ms. Per-step caps (defaulting to 0)
   already encode "disabled by default": when every cap is 0 the script
   is empty and bestEffortReclaim short-circuits without calling envd.
   The outer Connect-Timeout-Ms is now derived from the sum of per-step
   caps + 500ms slack.

2. `timeout` accepts s/m/h/d (or fractional seconds), not `ms`. Format
   each cap as `%.3f` seconds (e.g. 500ms → 0.500). Without this, every
   step would silently fail with "invalid time interval".

3. Join parts with `; ` (not a single space) and append one trailing
   `true`. With space-joining, bash parsed `; true timeout ...` as
   `true` swallowing subsequent steps as args, so only `sync` ever ran.

4. Add `--foreground` to `timeout`. Without it, the SIGKILL doesn't
   reliably reach a stuck child when run from a non-interactive bash
   invoked by envd's Process service (verified empirically with
   `sh -c "sleep 5"` running its full 5s despite a 0.5s timeout).

resume-build CLI: replace --reclaim-timeout-ms with a `--reclaim` bool
that flips the per-step caps to sane local-test defaults (500/200/1000/500).
The previous refactor commit referenced Sandbox.StartEnvdProcess from
reclaim.go and resume-build/main.go but the helper file itself was
never tracked, breaking the orchestrator build (typecheck) on CI.
Comment thread packages/orchestrator/pkg/sandbox/envd_process.go Outdated
ValentaTomas added a commit that referenced this pull request May 4, 2026
Adds `vm.compaction_proactiveness=0` to the base template's
`/etc/sysctl.conf` so kcompactd no longer runs background page
migrations in the guest.

With 2 MiB host-side hugepage backing of guest RAM, every migration
dirties a destination hugepage from the host UFFD's perspective and
lands in the next memfile diff — with no snapshot-aligned benefit. The
pre-pause `compact_memory` write (#2551) does the work deterministically
right before we capture state.

Existing templates inherit the change on rebuild.
Redirect each step's stdout+stderr to /dev/null inside the guest so envd
doesn't ship per-command output back to the orchestrator. Capture
per-step exit codes into `rc` and `exit $rc` instead of the trailing
`true`, so the script's overall status reflects whether any step failed.
On the orchestrator side, inspect the End event's exit code and log a
warning only when envd errors or the script reports non-zero.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 4, 2026

❌ 8 Tests Failed:

Tests completed Failed Passed Skipped
2594 8 2586 7
View the full list of 10 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 67.66% (Passed 54 times, Failed 113 times)

Stack Traces | 2.98s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (2.98s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestEgressFirewallDomainPersistsAfterResume

Flake rate in main: 50.48% (Passed 52 times, Failed 53 times)

Stack Traces | 7.48s run time
=== RUN   TestEgressFirewallDomainPersistsAfterResume
=== PAUSE TestEgressFirewallDomainPersistsAfterResume
=== CONT  TestEgressFirewallDomainPersistsAfterResume
    sandbox_network_out_test.go:731: Command [curl] output: event:{start:{pid:1316}}
    sandbox_network_out_test.go:731: Command [curl] output: event:{end:{exit_code:28 exited:true status:"exit status 28" error:"exit status 28"}}
    sandbox_network_out_test.go:731: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:67
        	            				.../api/sandboxes/sandbox_network_out_test.go:731
        	Error:      	Received unexpected error:
        	            	command curl in sandbox i7el9c0r0oudzffl73jv1 failed with exit code 28
        	Test:       	TestEgressFirewallDomainPersistsAfterResume
        	Messages:   	Expected curl to allowed domain (google.com) to succeed before pause
--- FAIL: TestEgressFirewallDomainPersistsAfterResume (7.48s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 73.27% (Passed 58 times, Failed 159 times)

Stack Traces | 195s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (195.46s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 73.46% (Passed 56 times, Failed 155 times)

Stack Traces | 1.74s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox i4pba69yyr9i6fklxotn1
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1359}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox i4pba69yyr9i6fklxotn1
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1360}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox igu91v4mmatu02wka971j
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1361}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Fri, 08 May 2026 08:18:24 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox i4pba69yyr9i6fklxotn1
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (1.74s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 53.16% (Passed 89 times, Failed 101 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 58.73% (Passed 52 times, Failed 74 times)

Stack Traces | 13.2s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1258}}
Executing command python in sandbox ikjlb612atrxan7r3psza
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (13.18s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 60.00% (Passed 52 times, Failed 78 times)

Stack Traces | 11.3s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1258}}
Executing command python in sandbox iyfwqb73gkoyhu66viksy
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (11.30s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 60.00% (Passed 52 times, Failed 78 times)

Stack Traces | 9.43s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox it1c50vztegaowp4k6oag
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1258}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (9.43s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 61.01% (Passed 62 times, Failed 97 times)

Stack Traces | 83.3s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (83.33s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 63.64% (Passed 52 times, Failed 91 times)

Stack Traces | 28s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1250}}
Executing command bash in sandbox iv8umbnodw5cnft1j4v9c (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 185 MB\nFree memory before tmpfs mount: 800 MB\nMemory to use in integrity test (80% of free, min 64MB): 640 MB\n"}}
Executing command bash in sandbox iv8umbnodw5cnft1j4v9c (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"640+0 records in\n640+0 records out\n671088640 bytes (671 MB, 640 MiB) copied, 3.3277 s, 202 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=640\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.30\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.33\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2684\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page faults: 343\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 23\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox iw2z4fdsh514u83z2883w
Executing command bash in sandbox iw2z4fdsh514u83z2883w (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1266}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"62c47ff908fbc0bddcf8c4579ee73e52f5df3dd09f4fd1b3c59c93775d649a64\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox iw2z4fdsh514u83z2883w
Executing command bash in sandbox iw2z4fdsh514u83z2883w (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1270}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox iw2z4fdsh514u83z2883w: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (28.00s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

Drop the StartEnvdProcess helper (avoiding the -l login-shell regression
in resume-build's runCommandInSandbox) and inline the envd Process.Start
construction in reclaim.go. Drop `timeout --foreground`: by default
`timeout` puts the wrapped sh in its own pgroup so SIGKILL reaches all
descendants — `--foreground` would target only the direct child.
@cla-bot cla-bot Bot added the cla-signed label May 6, 2026
@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 6, 2026

@ValentaTomas ValentaTomas marked this pull request as ready for review May 6, 2026 20:34
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Code review skipped — your organization has reached its monthly code review spending cap.

An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.

Once the cap resets or is raised, push a new commit or reopen this pull request to trigger a review.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40a1e24405

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/cmd/resume-build/main.go Outdated
@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented May 6, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. -reclaim flag silently no-ops in production ✓ Resolved 🐞
Description
When LAUNCH_DARKLY_API_KEY is set, featureflags.NewClientWithLogLevel connects to the real
LaunchDarkly backend; the four featureflags.NewIntFlag(...) calls inside the if *reclaim block
only update the package-level launchDarklyOfflineStore (the test fixture), which the live client
never reads. The reclaim chain stays disabled despite the flag being passed.
Code

packages/orchestrator/cmd/resume-build/main.go[R80-85]

+	if *reclaim {
+		featureflags.NewIntFlag("reclaim-sync-timeout-ms", 500)
+		featureflags.NewIntFlag("reclaim-drop-caches-timeout-ms", 200)
+		featureflags.NewIntFlag("reclaim-compact-memory-timeout-ms", 1000)
+		featureflags.NewIntFlag("reclaim-fstrim-timeout-ms", 500)
+	}
Evidence
NewIntFlag writes exclusively to launchDarklyOfflineStore (flags.go:141-145). NewClientWithLogLevel
bypasses that store and creates a real LD client whenever launchDarklyApiKey != ""
(client.go:71-85). The resume-build binary creates its client at line 1019 before the reclaim block
runs, but even if order were reversed the offline store is never consulted by the live client.

packages/shared/pkg/featureflags/flags.go[141-145]
packages/shared/pkg/featureflags/client.go[71-85]
packages/orchestrator/cmd/resume-build/main.go[1019-1019]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The `-reclaim` flag calls `featureflags.NewIntFlag(...)` at runtime to set per-step timeout values, but `NewIntFlag` only mutates the package-level `launchDarklyOfflineStore` (the ldtestdata fixture). When the binary runs with `LAUNCH_DARKLY_API_KEY` set, the `featureflags.Client` is backed by the real LaunchDarkly SDK and never reads from that store, so the overrides are silently ignored and the reclaim chain never executes.

## Issue Context
The four flags (`reclaim-sync-timeout-ms`, `reclaim-drop-caches-timeout-ms`, `reclaim-compact-memory-timeout-ms`, `reclaim-fstrim-timeout-ms`) already have their global `IntFlag` variables declared in `flags.go` with default 0. The `resume-build` binary is a developer/operator CLI tool, not a long-running service, so a simple mechanism is preferred.

## Fix Focus Areas
- packages/orchestrator/cmd/resume-build/main.go[80-85] — replace the `NewIntFlag` calls with a mechanism that actually reaches the `featureflags.Client` used by the sandbox factory, e.g. expose a `NewClientWithDatasource` path that pre-seeds the offline store before the client is created, or accept the timeout values as plain integer flags and pass them directly to `buildReclaimScript` without going through LaunchDarkly at all.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread packages/orchestrator/cmd/resume-build/main.go
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Reclaim requires Bash
    • Changed the outer shell from /bin/bash to /bin/sh to support minimal images like Alpine that only provide BusyBox with /bin/sh.

Create PR

Or push these changes by commenting:

@cursor push dc39246efe
Preview (dc39246efe)
diff --git a/packages/orchestrator/pkg/sandbox/reclaim.go b/packages/orchestrator/pkg/sandbox/reclaim.go
--- a/packages/orchestrator/pkg/sandbox/reclaim.go
+++ b/packages/orchestrator/pkg/sandbox/reclaim.go
@@ -83,7 +83,7 @@
 	pc := processconnect.NewProcessClient(&http.Client{Transport: sandboxHttpClient.Transport}, addr)
 
 	req := connect.NewRequest(&process.StartRequest{
-		Process: &process.ProcessConfig{Cmd: "/bin/bash", Args: []string{"-c", script}},
+		Process: &process.ProcessConfig{Cmd: "/bin/sh", Args: []string{"-c", script}},
 	})
 	req.Header().Set("Connect-Timeout-Ms", strconv.FormatInt(int64(timeout/time.Millisecond), 10))
 	if s.Config.Envd.AccessToken != nil {

You can send follow-ups to the cloud agent here.

Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
Prevents gofmt from re-aligning HostStatsSamplingInterval's trailing
comment column when the longer ReclaimCompactMemoryTimeoutMs name was
inserted next to it, keeping the diff scoped to the new feature.
Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
…-build

Caller passes the bash args (e.g. ["-l","-c",cmd] or ["-c",script]) so
login-shell is a per-call decision rather than baked into the helper.
Removes the duplicated envd Process.Start boilerplate from reclaim.go
and runCommandInSandbox.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Reclaim requires Bash
    • Changed /bin/bash to /bin/sh in StartEnvdBash to support minimal guest images that only have sh, since reclaim commands are already shell-compatible.

Create PR

Or push these changes by commenting:

@cursor push 91ed118737
Preview (91ed118737)
diff --git a/packages/orchestrator/pkg/sandbox/envd_process.go b/packages/orchestrator/pkg/sandbox/envd_process.go
--- a/packages/orchestrator/pkg/sandbox/envd_process.go
+++ b/packages/orchestrator/pkg/sandbox/envd_process.go
@@ -16,7 +16,7 @@
 )
 
 // StartEnvdBash opens a streaming Process.Start call against this
-// sandbox's envd, running `/bin/bash` with the given args as `user`.
+// sandbox's envd, running `/bin/sh` with the given args as `user`.
 // Caller chooses login-shell vs. plain (e.g. []{"-l","-c",cmd} vs.
 // []{"-c",script}). When timeout > 0 it sets `Connect-Timeout-Ms` so
 // envd kills the process at the deadline. Auth/user headers are wired
@@ -31,7 +31,7 @@
 	pc := processconnect.NewProcessClient(&http.Client{Transport: sandboxHttpClient.Transport}, addr)
 
 	req := connect.NewRequest(&process.StartRequest{
-		Process: &process.ProcessConfig{Cmd: "/bin/bash", Args: bashArgs},
+		Process: &process.ProcessConfig{Cmd: "/bin/sh", Args: bashArgs},
 	})
 	if timeout > 0 {
 		req.Header().Set("Connect-Timeout-Ms", strconv.FormatInt(timeout.Milliseconds(), 10))

You can send follow-ups to the cloud agent here.

Comment thread packages/orchestrator/pkg/sandbox/envd_process.go Outdated
- Add featureflags.DurationFlag (string-backed, parsed via time.ParseDuration)
  so LaunchDarkly values like "500ms" are self-describing instead of opaque ints.
- Convert reclaim per-step timeouts (sync, drop_caches, compact_memory, fstrim)
  from IntFlag to DurationFlag.
- Rename StartEnvdBash -> StartEnvdShell and switch to /bin/sh so reclaim and
  resume-build commands work on minimal guests without bash.
- Add NewOfflineClient and use it in resume-build so per-run flag overrides
  (e.g. from -reclaim) take effect even when LAUNCH_DARKLY_API_KEY is set.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Sub-millisecond durations format as zero disabling timeout
    • Added check to round up any positive duration below 1ms to 0.001 seconds before formatting, preventing GNU timeout from treating it as 0 (no timeout).

Create PR

Or push these changes by commenting:

@cursor push 080e581151
Preview (080e581151)
diff --git a/packages/orchestrator/pkg/sandbox/reclaim.go b/packages/orchestrator/pkg/sandbox/reclaim.go
--- a/packages/orchestrator/pkg/sandbox/reclaim.go
+++ b/packages/orchestrator/pkg/sandbox/reclaim.go
@@ -46,7 +46,12 @@
 		// `timeout` accepts fractional seconds (s/m/h/d), not `ms`. Output
 		// is dropped; non-zero status is captured into `rc` so the final
 		// exit code surfaces failures without short-circuiting later steps.
-		parts = append(parts, fmt.Sprintf("timeout -s KILL %.3f sh -c %q >/dev/null 2>&1 || rc=$?", d.Seconds(), st.cmd))
+		// Ensure sub-millisecond durations round up to 0.001 so timeout != 0.
+		sec := d.Seconds()
+		if sec < 0.001 && sec > 0 {
+			sec = 0.001
+		}
+		parts = append(parts, fmt.Sprintf("timeout -s KILL %.3f sh -c %q >/dev/null 2>&1 || rc=$?", sec, st.cmd))
 		sum += d
 	}
 	if len(parts) == 0 {

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 801a37f. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/reclaim.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 801a37f79e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/pkg/sandbox/envd_process.go Outdated
Replace per-step DurationFlags with a single JSON flag (reclaim-config)
that holds all per-step caps and is evaluated against
sandbox/team/template LD contexts, so reclaim can be enabled per cohort
in LaunchDarkly with one rule. Default is null = all disabled. Sub-ms
guard preserved.

Drop the now-unused DurationFlag type.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e23ee082f8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/cmd/resume-build/main.go Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 76f3b04017

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/cmd/resume-build/main.go Outdated
… compact_memory

fstrim pulls fs metadata into the page cache and dirties the superblock
(last-trim timestamps). Running it first lets the subsequent sync +
drop_caches flush and evict everything fstrim touched in the same pass,
and compact_memory then runs against the minimal RSS so the snapshot has
long contiguous zero runs that compress better.
Previously the shared http.Client capped the stream at 10m. After the
StartEnvdShell refactor, the call passed timeout=0 (no Connect-Timeout-Ms
and no client-level timeout), so a stuck guest command or a stream that
never emits End would block the CLI indefinitely in -cmd, -cmd-pause and
-cmd-signal-pause modes.
@qodo-code-review
Copy link
Copy Markdown

CI Feedback 🧐

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: integration-tests / integration_tests

Failed stage: Aggregate matrix result [❌]

Failed test name: ""

Failure summary:

The workflow failed because the matrix job aggregate result was cancelled, not success.
- A
post-matrix check step ran if [[ "cancelled" != "success" ]]; then ... exit 1 (lines 31-35).
- This
caused the step to print matrix result: cancelled (line 39) and exit with code 1 (line 40), failing
the job.

Relevant error logs:
1:  ##[group]Runner Image Provisioner
2:  Hosted Compute Agent
...

25:  Prepare all required actions
26:  Uses: e2b-dev/infra/.github/workflows/integration_tests.yml@refs/pull/2551/merge (8f118dfe0b0b8f0fd2a1333d440b7d35d9270a79)
27:  ##[group] Inputs
28:  publish: true
29:  ##[endgroup]
30:  Complete job name: integration-tests / integration_tests
31:  ##[group]Run if [[ "cancelled" != "success" ]]; then
32:  �[36;1mif [[ "cancelled" != "success" ]]; then�[0m
33:  �[36;1m  echo "matrix result: cancelled"�[0m
34:  �[36;1m  exit 1�[0m
35:  �[36;1mfi�[0m
36:  �[36;1mecho "all matrix shards succeeded"�[0m
37:  shell: /usr/bin/bash -e {0}
38:  ##[endgroup]
39:  matrix result: cancelled
40:  ##[error]Process completed with exit code 1.
41:  Cleaning up orphan processes

@ValentaTomas ValentaTomas enabled auto-merge (squash) May 8, 2026 08:07
@ValentaTomas ValentaTomas merged commit a9f9629 into main May 8, 2026
50 checks passed
@ValentaTomas ValentaTomas deleted the feat/sandbox-pause-reclaim branch May 8, 2026 08:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants