Skip to content

perf(sandbox): keep envd logging out of journald#2675

Merged
ValentaTomas merged 1 commit into
mainfrom
perf/envd-journal-suppression
May 16, 2026
Merged

perf(sandbox): keep envd logging out of journald#2675
ValentaTomas merged 1 commit into
mainfrom
perf/envd-journal-suppression

Conversation

@ValentaTomas

Copy link
Copy Markdown
Member

envd's zerolog stdout writer is now gated behind a new `-verbose` flag (default off), so production envd inside FC no longer writes anything to stdout — journald stays clean of per-request debug events. The HTTP exporter still ships full debug to the orchestrator regardless.

A journald drop-in caps the rest of the in-VM journal at `Storage=persistent` / `SystemMaxUse=8M` / `MaxLevelStore=warning` so other systemd services can't grow it without bound either.

`handler.go`'s "error reading from pty/stdout/stderr" messages were on raw stderr but are about envd-handled user processes, not envd internals; they now flow through the zerolog logger.

Split out of #2674 (journal-side half).

@cla-bot cla-bot Bot added the cla-signed label May 16, 2026
@cursor

cursor Bot commented May 16, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Logging delivery semantics change: envd may now emit no logs at all in some modes (e.g., non-FC without -verbose), and exporter failures no longer fall back to stdout, which can hinder debugging and incident response.

Overview
This change reroutes envd logging so Firecracker runs no longer write debug logs to stdout by default, and removes the exporter’s previous stdout fallback on JSON/enqueue/send errors, which can silently drop logs when the HTTP path fails. It also adds a new -verbose flag and updates the docker debug run to pass it; note that running outside Firecracker without -verbose will now produce no logs. A journald drop-in is added to keep the VM journal bounded, and process pipe read errors are now logged via zerolog instead of raw stderr.

Reviewed by Cursor Bugbot for commit 0b94771. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov

codecov Bot commented May 16, 2026

Copy link
Copy Markdown

❌ 9 Tests Failed:

Tests completed Failed Passed Skipped
2619 9 2610 7
View the full list of 15 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 70.94% (Passed 204 times, Failed 498 times)

Stack Traces | 0.84s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (0.84s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.57% (Passed 212 times, Failed 693 times)

Stack Traces | 178s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (178.37s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.02% (Passed 205 times, Failed 687 times)

Stack Traces | 3.8s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox io98oc22icgkas9lfor2m
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1360}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
Executing command curl in sandbox io98oc22icgkas9lfor2m
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1361}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
Executing command ls in sandbox ioyxcofsl6o6ln18ev4w9 (user: root)
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1362}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Sat, 16 May 2026 06:43:56 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox io98oc22icgkas9lfor2m
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (3.80s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 56.06% (Passed 359 times, Failed 458 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 62.64% (Passed 201 times, Failed 337 times)

Stack Traces | 11.2s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1264}}
Executing command python in sandbox ieywiw2oz571wciofsbk9
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (11.16s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 58.23% (Passed 203 times, Failed 283 times)

Stack Traces | 8.29s run time
=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1264}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (8.29s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::

Flake rate in main: 56.44% (Passed 203 times, Failed 263 times)

Stack Traces | 8.11s run time
=== RUN   TestBindLocalhost/bind_::
=== PAUSE TestBindLocalhost/bind_::
=== CONT  TestBindLocalhost/bind_::
Executing command python in sandbox iu9opkaelsgn5r44uem2o
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1264}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::
        	Messages:   	Unexpected status code 502 for bind address ::
--- FAIL: TestBindLocalhost/bind_:: (8.11s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 63.93% (Passed 202 times, Failed 358 times)

Stack Traces | 8.46s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1264}}
Executing command python in sandbox i3z7pus5cb7t677acf6op
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (8.46s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 63.86% (Passed 202 times, Failed 357 times)

Stack Traces | 10.1s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox it1pvlzibn9ep2s8aqaaj
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1264}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
Executing command python in sandbox iklqlltplebq0tbcj8y84
--- FAIL: TestBindLocalhost/bind_localhost (10.05s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir

Flake rate in main: 53.49% (Passed 253 times, Failed 291 times)

Stack Traces | 4.49s run time
=== RUN   TestListDir
=== PAUSE TestListDir
=== CONT  TestListDir
--- FAIL: TestListDir (4.49s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_0_lists_only_root_directory

Flake rate in main: 57.20% (Passed 202 times, Failed 270 times)

Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_0_lists_only_root_directory
=== PAUSE TestListDir/depth_0_lists_only_root_directory
=== CONT  TestListDir/depth_0_lists_only_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_0_lists_only_root_directory
--- FAIL: TestListDir/depth_0_lists_only_root_directory (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_1_lists_root_directory

Flake rate in main: 57.20% (Passed 202 times, Failed 270 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_1_lists_root_directory
=== PAUSE TestListDir/depth_1_lists_root_directory
=== CONT  TestListDir/depth_1_lists_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_1_lists_root_directory
--- FAIL: TestListDir/depth_1_lists_root_directory (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)

Flake rate in main: 57.20% (Passed 202 times, Failed 270 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== PAUSE TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== CONT  TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
--- FAIL: TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory) (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_3_lists_all_directories_and_files

Flake rate in main: 57.20% (Passed 202 times, Failed 270 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_3_lists_all_directories_and_files
=== PAUSE TestListDir/depth_3_lists_all_directories_and_files
=== CONT  TestListDir/depth_3_lists_all_directories_and_files
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_3_lists_all_directories_and_files
--- FAIL: TestListDir/depth_3_lists_all_directories_and_files (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestSandboxAutoResumeViaProxy

Flake rate in main: 54.63% (Passed 206 times, Failed 248 times)

Stack Traces | 23.4s run time
=== RUN   TestSandboxAutoResumeViaProxy
=== PAUSE TestSandboxAutoResumeViaProxy
=== CONT  TestSandboxAutoResumeViaProxy
Executing command ls in sandbox ife71udb7qqxu4k53jrjc
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"i8v56cqfzcgatv5j81vor","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"i8v56cqfzcgatv5j81vor","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"i8v56cqfzcgatv5j81vor","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"i8v56cqfzcgatv5j81vor","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:116: 
        	Error Trace:	.../tests/proxies/auto_resume_test.go:116
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestSandboxAutoResumeViaProxy
--- FAIL: TestSandboxAutoResumeViaProxy (23.42s)
Executing command cat in sandbox iwwju28pb01vpuk2867ip (user: root)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The implementation in NewLogger disables stdout logging by default when running in local development mode (isNotFC), requiring the verbose flag to be explicitly set. This contradicts the expected behavior where the -isnotfc flag should automatically enable stdout printing. The condition should be updated to enable stdout if either isNotFC or verbose is true.

Comment thread packages/envd/internal/logs/logger.go

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Verbose guard added inside unreachable dead code branch
    • Removed the unreachable isNotFC branch in HTTPExporter.start() since the HTTP exporter is only created when !isNotFC, making w.isNotFC always false.
  • ✅ Fixed: Local dev mode silently discards all log output
    • Added stdout as a default exporter when isNotFC=true so local development mode always logs to stdout for visibility.

Create PR

Or push these changes by commenting:

@cursor push 963a6a035c
Preview (963a6a035c)
diff --git a/packages/envd/internal/logs/exporter/exporter.go b/packages/envd/internal/logs/exporter/exporter.go
--- a/packages/envd/internal/logs/exporter/exporter.go
+++ b/packages/envd/internal/logs/exporter/exporter.go
@@ -109,16 +109,6 @@
 			continue
 		}
 
-		if w.isNotFC {
-			if w.verbose {
-				for _, log := range logs {
-					fmt.Fprintf(os.Stdout, "%v", string(log))
-				}
-			}
-
-			continue
-		}
-
 		for _, logLine := range logs {
 			w.mmdsLock.RLock()
 			logLineWithOpts, err := w.mmdsOpts.AddOptsToJSON(logLine)

diff --git a/packages/envd/internal/logs/logger.go b/packages/envd/internal/logs/logger.go
--- a/packages/envd/internal/logs/logger.go
+++ b/packages/envd/internal/logs/logger.go
@@ -24,7 +24,8 @@
 	// Stdout is opt-in via -verbose. Inside FC stdout flows into journald and
 	// dirties guest pages on every snapshot, so we keep it off by default and
 	// rely on the HTTP exporter to ship debug logs to the orchestrator.
-	if verbose {
+	// In local dev mode (isNotFC), always log to stdout for visibility.
+	if verbose || isNotFC {
 		exporters = append(exporters, os.Stdout)
 	}

You can send follow-ups to the cloud agent here.

Comment thread packages/envd/internal/logs/exporter/exporter.go
Comment thread packages/envd/internal/logs/logger.go

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue. You can view the agent here.

Reviewed by Cursor Bugbot for commit d50a5ecdb06dc5881254c0e7bac1faea7e5e8f3f. Configure here.

Comment thread packages/orchestrator/pkg/template/build/core/rootfs/files/journald.conf.tpl Outdated
@ValentaTomas

Copy link
Copy Markdown
Member Author

Fixed in latest push: stdout is now added when !isFC || verbose, so -isnotfc keeps logging to stdout as before.

@ValentaTomas ValentaTomas force-pushed the perf/envd-journal-suppression branch from f0e4ad3 to ee3303b Compare May 16, 2026 06:27
@ValentaTomas ValentaTomas marked this pull request as ready for review May 16, 2026 06:29
envd's zerolog stdout writer is now gated behind a new -verbose flag
(default off), so production envd inside FC no longer writes anything
to stdout — journald stays clean of per-request debug events. The HTTP
exporter still ships full debug to the orchestrator regardless.

A journald drop-in caps the rest of the in-VM journal at
Storage=persistent / SystemMaxUse=8M / MaxLevelStore=warning so other
systemd services can't grow it without bound either.

handler.go's "error reading from pty/stdout/stderr" messages were on
raw stderr but are about envd-handled user processes, not envd
internals; they now flow through the zerolog logger.
@ValentaTomas ValentaTomas force-pushed the perf/envd-journal-suppression branch from ee3303b to 0b94771 Compare May 16, 2026 06:31
@ValentaTomas ValentaTomas enabled auto-merge (squash) May 16, 2026 06:32
@ValentaTomas ValentaTomas merged commit f6943ca into main May 16, 2026
53 checks passed
@ValentaTomas ValentaTomas deleted the perf/envd-journal-suppression branch May 16, 2026 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants