feat(envd): give envd realtime IO priority, reset for user processes #2681

Merged
ValentaTomas merged 5 commits into main from feat/envd-io-priority
May 19, 2026
Conversation

@ValentaTomas

@ValentaTomas ValentaTomas commented May 17, 2026

Member

Runs envd with IOSchedulingClass=realtime and IOWeight=10000, lowers io.weight on the user/PTY/socat sub-cgroups, and resets ioprio for user-spawned processes via the existing wrapper.

Adds IO scheduling priority for envd so disk I/O from user processes
cannot starve envd's own I/O during pause/resume storms. Mirrors the
existing CPU priority setup (Nice=-20 + CPUWeight on envd.service,
lower cpu.weight on user/PTY/socat sub-cgroups, reset to defaults for
spawned user processes).

- envd.service: IOSchedulingClass=realtime, IOSchedulingPriority=4, IOWeight=10000
- user/ptys/socats sub-cgroups: lower io.weight (10/50/50 respectively, vs the default of 100)
- user-process wrapper resets ioprio to best-effort/4 via ionice(1) the
  same way it already resets nice, so user-spawned grandchildren cannot
  inherit envd's realtime IO class
- tolerate cgroup properties whose controller isn't enabled in
  subtree_control (ENOENT), so io.weight on hosts without io delegation
  degrades gracefully instead of failing the cgroup manager
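
For readers unfamiliar with how an IO class and level such as "realtime/4" or "best-effort/4" are represented, the following is a minimal sketch of the Linux ioprio encoding (class in the upper bits, level in the low 13 bits). It is illustrative only: the PR itself sets priorities through systemd and ionice(1), and the helper name `ioprioValue` is an assumption made for this example.

```go
package main

import "fmt"

// I/O scheduling classes as defined by the Linux ioprio interface.
const (
	ioprioClassNone = 0
	ioprioClassRT   = 1 // realtime
	ioprioClassBE   = 2 // best-effort
	ioprioClassIdle = 3

	// The class is stored in the bits above the 13-bit data field.
	ioprioClassShift = 13
)

// ioprioValue is a hypothetical helper that packs a class and a priority
// level (0-7) the same way the kernel's IOPRIO_PRIO_VALUE macro does.
func ioprioValue(class, level int) int {
	return class<<ioprioClassShift | level
}

func main() {
	fmt.Println("envd, realtime/4:", ioprioValue(ioprioClassRT, 4))
	fmt.Println("user processes, best-effort/4:", ioprioValue(ioprioClassBE, 4))
}
```

Because a forked child inherits its parent's ioprio, resetting to best-effort/4 in the wrapper is what keeps user-spawned grandchildren out of the realtime class.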
@cla-bot cla-bot Bot added the cla-signed label May 17, 2026
@cursor

cursor Bot commented May 17, 2026


PR Summary

Medium Risk
Changes process and service-level IO priority/cgroup settings, which can impact system performance and may fail on hosts lacking the required privileges or controllers.

Overview
Hard-codes /usr/bin/ionice into the user-process wrapper; this can fail if the binary/path is missing and will change runtime behavior for all spawned processes.

Adds io.weight cgroup writes but silently skips ENOENT/EPERM and logs to stderr, which can mask misconfiguration (e.g., missing IO controller/subtree control) while still reporting startup success.

Sets systemd IOSchedulingClass=realtime/high IOWeight for envd, which may be rejected without the right capabilities/limits and can starve other IO if it takes effect.

Reviewed by Cursor Bugbot for commit 0eb4606. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov

codecov Bot commented May 17, 2026


❌ 8 Tests Failed:

Tests completed: 2622 | Failed: 8 | Passed: 2614 | Skipped: 5
View the full list of 13 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestSandboxMetrics

Flake rate in main: 56.61% (Passed 223 times, Failed 291 times)

Stack Traces | 4.89s run time
=== RUN   TestSandboxMetrics
=== PAUSE TestSandboxMetrics
=== CONT  TestSandboxMetrics
    sandbox_metrics_test.go:47: 
        	Error Trace:	.../api/metrics/sandbox_metrics_test.go:47
        	Error:      	Should NOT be empty, but was 0
        	Test:       	TestSandboxMetrics
--- FAIL: TestSandboxMetrics (4.89s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 71.70% (Passed 221 times, Failed 560 times)

Stack Traces | 0.84s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (0.84s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestInternetAccessResumedSbx

Flake rate in main: 54.66% (Passed 224 times, Failed 270 times)

Stack Traces | 0s run time
=== RUN   TestInternetAccessResumedSbx
=== PAUSE TestInternetAccessResumedSbx
=== CONT  TestInternetAccessResumedSbx
--- FAIL: TestInternetAccessResumedSbx (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestInternetAccessResumedSbx/deny_internet_access

Flake rate in main: 54.66% (Passed 224 times, Failed 270 times)

Stack Traces | 3.01s run time
=== RUN   TestInternetAccessResumedSbx/deny_internet_access
=== PAUSE TestInternetAccessResumedSbx/deny_internet_access
=== CONT  TestInternetAccessResumedSbx/deny_internet_access
Executing command curl in sandbox imgv0w1us5onov6u72a22
    sandbox_internet_test.go:92: Command [curl] output: event:{start:{pid:1261}}
    sandbox_internet_test.go:97: 
        	Error Trace:	.../api/sandboxes/sandbox_internet_test.go:97
        	Error:      	"failed to execute command curl in sandbox ix400xqeltrev96coes3z: invalid_argument: protocol error: incomplete envelope: unexpected EOF" does not contain "failed with exit code"
        	Test:       	TestInternetAccessResumedSbx/deny_internet_access
        	Messages:   	Expected connection failure message
--- FAIL: TestInternetAccessResumedSbx/deny_internet_access (3.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.85% (Passed 231 times, Failed 767 times)

Stack Traces | 218s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (218.35s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.39% (Passed 222 times, Failed 760 times)

Stack Traces | 6.85s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox i7nhd3w1s4edorfeil6z2
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1350}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox i7nhd3w1s4edorfeil6z2
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1351}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1352}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Mon, 18 May 2026 00:06:20 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox i7nhd3w1s4edorfeil6z2
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (6.85s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildENV

Flake rate in main: 59.78% (Passed 220 times, Failed 327 times)

Stack Traces | 0s run time
=== RUN   TestTemplateBuildENV
=== PAUSE TestTemplateBuildENV
=== CONT  TestTemplateBuildENV
--- FAIL: TestTemplateBuildENV (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildENV/ENV_with_multiline_value

Flake rate in main: 60.34% (Passed 213 times, Failed 324 times)

Stack Traces | 7.08s run time
=== RUN   TestTemplateBuildENV/ENV_with_multiline_value
=== PAUSE TestTemplateBuildENV/ENV_with_multiline_value
=== CONT  TestTemplateBuildENV/ENV_with_multiline_value
    build_template_test.go:134: test-ubuntu-env-multiline: [info] Building template dn152sa1wl351uv0szq4/d16e3dd2-87a3-4032-988b-2a6c66d5be4f
    build_template_test.go:134: test-ubuntu-env-multiline: [info] CACHED [base] FROM ubuntu:22.04 [ffd709f131f42dfab282de47a91dd2c139e900c1c11fc574b49b517a05ef0a32]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] CACHED [base] DEFAULT USER user [90bdd4afa342293c931373351bf578872dec9179214ba3e8bf9edba311466213]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] [builder 1/2] ENV MULTILINE line1
        line2
        line3 [e93da3f3765f20eb6407c336b9e4e0b9321d994ec5f6cb547743a2a4070eed23]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] [builder 2/2] RUN [[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1 [477610d61cdf858776262d3331809539bcbcf16f706aac18515a57337bae1786]
    build_template_test.go:134: test-ubuntu-env-multiline: [error] Build failed: failed to run command '[[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1': exit status 1
    build_template_test.go:374: Build failed: {<nil> failed to run command '[[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1': exit status 1 0xc00033fa30}
--- FAIL: TestTemplateBuildENV/ENV_with_multiline_value (7.08s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 56.36% (Passed 391 times, Failed 505 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 63.07% (Passed 219 times, Failed 374 times)

Stack Traces | 7.44s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1268}}
Executing command python in sandbox idav092kmcee6w2llds43
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (7.44s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 64.33% (Passed 219 times, Failed 395 times)

Stack Traces | 7.24s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox ijas5rp3tzag7gxjvxzbd
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1268}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
Executing command python in sandbox ilca9hnddedyn7y6kewwj
--- FAIL: TestBindLocalhost/bind_localhost (7.24s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 66.17% (Passed 229 times, Failed 448 times)

Stack Traces | 80.3s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (80.31s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 66.87% (Passed 219 times, Failed 442 times)

Stack Traces | 45s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1253}}
Executing command bash in sandbox i2e3xb6hjgszbu91kjmfp (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory before tmpfs mount: 184 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Free memory before tmpfs mount: 800 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Memory to use in integrity test (80% of free, min 64MB): 640 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"640+0 records in\n640+0 records out\n671088640 bytes (671 MB, 640 MiB) copied, 3.14451 s, 213 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=640\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.10\n\tPercent of CPU this job got: 98%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.14\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2612\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page faults: 345\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 11\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Exit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i9qlcritks1uhzvw5hxhe
Executing command bash in sandbox i9qlcritks1uhzvw5hxhe (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1269}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"f92cd6e4cb7b68486ddb4cab6c0764c1c2cbc45dcd0c9d7d2f3ac1cbdb6f03b8\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox i9qlcritks1uhzvw5hxhe
Executing command bash in sandbox i9qlcritks1uhzvw5hxhe (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1273}}
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{data:{stdout:"f92cd6e4cb7b68486ddb4cab6c0764c1c2cbc45dcd0c9d7d2f3ac1cbdb6f03b8\n"}}
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:99: Command [bash] completed successfully in sandbox i9qlcritks1uhzvw5hxhe
Executing command bash in sandbox i9qlcritks1uhzvw5hxhe (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1276}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox i9qlcritks1uhzvw5hxhe: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (44.98s)


@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

The hardcoded absolute paths for /usr/bin/ionice and /usr/bin/nice in the wrapper script introduce a dependency on a specific filesystem layout and will cause process spawning to fail in environments where ionice is missing. Using relative paths and providing a fallback ensures user processes can start even if the IO priority reset fails or the binary is located elsewhere.

Comment thread packages/envd/internal/services/process/handler/handler.go

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: ENOENT check is dead code due to os.WriteFile O_CREATE
    • Replaced os.WriteFile with os.OpenFile(O_WRONLY) to correctly receive ENOENT when cgroup property files don't exist, enabling graceful degradation when controllers aren't delegated.

Create PR

Or push these changes by commenting:

@cursor push fbfba8084a
Preview (fbfba8084a)
diff --git a/packages/envd/internal/services/cgroups/cgroup2.go b/packages/envd/internal/services/cgroups/cgroup2.go
--- a/packages/envd/internal/services/cgroups/cgroup2.go
+++ b/packages/envd/internal/services/cgroups/cgroup2.go
@@ -111,7 +111,10 @@
 
 	var errs []error
 	for name, value := range properties {
-		if err := os.WriteFile(filepath.Join(fullPath, name), []byte(value), 0o644); err != nil {
+		propPath := filepath.Join(fullPath, name)
+		// Open without O_CREATE to get ENOENT when controller isn't enabled
+		f, err := os.OpenFile(propPath, os.O_WRONLY, 0)
+		if err != nil {
 			// Tolerate properties whose controller isn't enabled in
 			// cgroup.subtree_control (file doesn't exist). Other errors are fatal.
 			if errors.Is(err, os.ErrNotExist) {
@@ -119,7 +122,15 @@
 				continue
 			}
 			errs = append(errs, fmt.Errorf("failed to write cgroup property %q: %w", name, err))
+			continue
 		}
+		_, writeErr := f.Write([]byte(value))
+		closeErr := f.Close()
+		if writeErr != nil {
+			errs = append(errs, fmt.Errorf("failed to write cgroup property %q: %w", name, writeErr))
+		} else if closeErr != nil {
+			errs = append(errs, fmt.Errorf("failed to close cgroup property %q: %w", name, closeErr))
+		}
 	}
 	if len(errs) > 0 {
 		return -1, errors.Join(errs...)


Comment thread packages/envd/internal/services/cgroups/cgroup2.go

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: io.weight values use wrong format, need "default" prefix
    • Added 'default' prefix to all io.weight values in main.go to comply with cgroup v2 format and prevent EINVAL errors with BFQ scheduler.

Create PR

Or push these changes by commenting:

@cursor push 6a176846ad
Preview (6a176846ad)
diff --git a/packages/envd/main.go b/packages/envd/main.go
--- a/packages/envd/main.go
+++ b/packages/envd/main.go
@@ -243,13 +243,13 @@
 	opts := []cgroups.Cgroup2ManagerOption{
 		cgroups.WithCgroup2ProcessType(cgroups.ProcessTypePTY, "ptys", map[string]string{
 			"cpu.weight":  "200",
-			"io.weight":   "50",
+			"io.weight":   "default 50",
 			"memory.high": fmt.Sprintf("%d", memoryHigh),
 			"memory.max":  fmt.Sprintf("%d", memoryMax),
 		}),
 		cgroups.WithCgroup2ProcessType(cgroups.ProcessTypeSocat, "socats", map[string]string{
 			"cpu.weight": "150",
-			"io.weight":  "50",
+			"io.weight":  "default 50",
 			"memory.min": fmt.Sprintf("%d", 5*megabyte),
 			"memory.low": fmt.Sprintf("%d", 8*megabyte),
 		}),
@@ -257,7 +257,7 @@
 			"memory.high": fmt.Sprintf("%d", memoryHigh),
 			"memory.max":  fmt.Sprintf("%d", memoryMax),
 			"cpu.weight":  "50",
-			"io.weight":   "10",
+			"io.weight":   "default 10",
 		}),
 	}
 	if cgroupRoot != "" {
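
If the `default <weight>` form is adopted, building the string in one place avoids typos across the three sub-cgroups. The sketch below is an assumption-laden illustration, not code from the PR: `defaultIOWeight` is a hypothetical helper, and the only facts it relies on are the documented cgroup v2 `io.weight` range of 1 to 10000 and the `default` key. Whether a bare number is also accepted depends on the kernel and IO scheduler in use.

```go
package main

import "fmt"

// defaultIOWeight is a hypothetical helper that renders the
// "default <weight>" form of a cgroup v2 io.weight value and rejects
// weights outside the documented range of 1 to 10000.
func defaultIOWeight(weight int) (string, error) {
	if weight < 1 || weight > 10000 {
		return "", fmt.Errorf("io.weight %d out of range [1, 10000]", weight)
	}
	return fmt.Sprintf("default %d", weight), nil
}

func main() {
	// 10 and 50 are the weights used in this PR; 0 shows the range check.
	for _, w := range []int{10, 50, 0} {
		v, err := defaultIOWeight(w)
		fmt.Printf("%d -> %q (err: %v)\n", w, v, err)
	}
}
```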


Reviewed by Cursor Bugbot for commit 5301533. Configure here.

Comment thread packages/envd/main.go Outdated
@ValentaTomas ValentaTomas marked this pull request as ready for review May 17, 2026 23:53

@arkamar arkamar left a comment

Member


LGTM

@ValentaTomas ValentaTomas merged commit f4bd1b2 into main May 19, 2026
56 checks passed
@ValentaTomas ValentaTomas deleted the feat/envd-io-priority branch May 19, 2026 00:20
3 participants