perf(sandbox): reduce memory dirtying from journald and envd logging#2674

Closed
ValentaTomas wants to merge 7 commits into main from perf/reduce-sandbox-memory-dirtying

@ValentaTomas ValentaTomas commented May 16, 2026

Member

Reduces guest memory and rootfs dirtying caused by journald and envd logging during pause-resume cycles.

envd no longer writes to its stdout by default. A new -verbose flag opts in for local development; in FC production envd's stdout stays empty so it doesn't dirty journald pages. The HTTP exporter still ships the full debug stream to the orchestrator. Only envd's stderr (Go runtime panics, log.Fatalf, bootstrap and MMDS-poll warnings) reaches journald — which is where we actually want envd's lifecycle/fatal output to land.

The MMDS poll (50ms ticker) and the HTTP exporter's per-line "send failed" paths used to write to stderr per iteration / per log line, which 1:1-amplified into journald during outages. Both now log only the first failure of each kind. The exporter additionally caps its in-memory queue at 10k entries with drop-oldest semantics so a wedged collector can't grow it without bound. The remaining stderr writes from those two paths are tagged with the syslog <4> (WARNING) prefix so journald classifies them correctly instead of as errors.
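
A minimal Go sketch of the first-failure-only warning described above. The `onceWarner` type is a hypothetical helper for illustration, not envd's actual code; it assumes one instance per failure kind and writes a single `<4>`-prefixed line, suppressing every later call.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync/atomic"
)

// onceWarner emits a warning the first time Warn is called and stays silent
// afterwards. Hypothetical helper; use one instance per failure kind.
type onceWarner struct {
	out  io.Writer // defaults to os.Stderr when nil
	done atomic.Bool
}

// Warn writes msg with the "<4>" (WARNING) priority prefix and reports
// whether the line was actually written.
func (w *onceWarner) Warn(msg string) bool {
	// Only the first caller flips the flag; everyone else returns early.
	if !w.done.CompareAndSwap(false, true) {
		return false
	}
	out := w.out
	if out == nil {
		out = os.Stderr
	}
	fmt.Fprintln(out, "<4>"+msg)
	return true
}

func main() {
	mmdsWarn := &onceWarner{}
	// Simulate a poll loop that fails on every tick: only one line is logged.
	for i := 0; i < 3; i++ {
		mmdsWarn.Warn("mmds poll failed: connection refused")
	}
}
```

The compare-and-swap keeps the check safe if several goroutines hit the same failure at once.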

The handler.go "error reading from pty/stdout/stderr" messages were on stderr but are about envd-handled user processes, not envd internals; they now flow through the zerolog logger.

A journald drop-in caps the rest of the journal (entries from other systemd services) with Storage=persistent / SystemMaxUse=8M / MaxLevelStore=warning, so the journal can't grow without bound and its writes land on the rootfs (small, ext4-inline_data-friendly) rather than dirtying RAM via the default tmpfs path.

@cla-bot cla-bot Bot added the cla-signed label May 16, 2026

cursor Bot commented May 16, 2026

PR Summary

Medium Risk
Changes logging defaults and adds backoff/drop behavior for MMDS polling and log exporting, which could hide repeated failures or drop logs during outages if mis-tuned. Impacts observability and boot-time diagnostics but does not touch auth or user data paths.

Overview
Reduces journald/guest-memory churn from sandbox logging by making envd stdout output opt-in via a new -verbose flag, adding exponential backoff and first-error-only warning emission to both MMDS polling and the HTTP log exporter, and bounding the exporter’s in-memory queue with drop-oldest semantics during prolonged collector outages; additionally routes process pipe read errors through zerolog instead of raw stderr, adds a journald drop-in that persists and caps the journal while storing only warning and above, and bumps envd version to 0.5.24.

Reviewed by Cursor Bugbot for commit 203b9e9. Bugbot is set up for automated code reviews on this repo. Configure here.

codecov Bot commented May 16, 2026

❌ 8 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 2619 | 8 | 2611 | 7 |
View the full list of 12 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestSandboxMetrics

Flake rate in main: 55.91% (Passed 205 times, Failed 260 times)

Stack Traces | 5.07s run time
=== RUN   TestSandboxMetrics
=== PAUSE TestSandboxMetrics
=== CONT  TestSandboxMetrics
    sandbox_metrics_test.go:47: 
        	Error Trace:	.../api/metrics/sandbox_metrics_test.go:47
        	Error:      	Should NOT be empty, but was 0
        	Test:       	TestSandboxMetrics
--- FAIL: TestSandboxMetrics (5.07s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 70.94% (Passed 204 times, Failed 498 times)

Stack Traces | 1.54s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (1.54s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.57% (Passed 212 times, Failed 693 times)

Stack Traces | 36.2s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (36.25s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.02% (Passed 205 times, Failed 687 times)

Stack Traces | 4.62s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox i2fla9z5fyiugkk1shf78
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1364}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
Executing command curl in sandbox i2fla9z5fyiugkk1shf78
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1365}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1366}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Sat, 16 May 2026 01:47:58 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox i2fla9z5fyiugkk1shf78
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (4.62s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildENV

Flake rate in main: 59.68% (Passed 202 times, Failed 299 times)

Stack Traces | 0s run time
=== RUN   TestTemplateBuildENV
=== PAUSE TestTemplateBuildENV
=== CONT  TestTemplateBuildENV
--- FAIL: TestTemplateBuildENV (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildENV/ENV_with_multiline_value

Flake rate in main: 60.29% (Passed 195 times, Failed 296 times)

Stack Traces | 23.4s run time
=== RUN   TestTemplateBuildENV/ENV_with_multiline_value
=== PAUSE TestTemplateBuildENV/ENV_with_multiline_value
=== CONT  TestTemplateBuildENV/ENV_with_multiline_value
    build_template_test.go:134: test-ubuntu-env-multiline: [info] Building template a40hpns2ma8egolhthsn/e7bde81d-de26-447e-90fb-37f9313c551d
    build_template_test.go:134: test-ubuntu-env-multiline: [info] CACHED [base] FROM ubuntu:22.04 [ffd709f131f42dfab282de47a91dd2c139e900c1c11fc574b49b517a05ef0a32]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] CACHED [base] DEFAULT USER user [90bdd4afa342293c931373351bf578872dec9179214ba3e8bf9edba311466213]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] [builder 1/2] ENV MULTILINE line1
        line2
        line3 [e93da3f3765f20eb6407c336b9e4e0b9321d994ec5f6cb547743a2a4070eed23]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] [builder 2/2] RUN [[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1 [477610d61cdf858776262d3331809539bcbcf16f706aac18515a57337bae1786]
    build_template_test.go:134: test-ubuntu-env-multiline: [error] Build failed: failed to run command '[[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1': exit status 1
    build_template_test.go:374: Build failed: {<nil> failed to run command '[[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1': exit status 1 0xc0003f4640}
--- FAIL: TestTemplateBuildENV/ENV_with_multiline_value (23.41s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 56.06% (Passed 359 times, Failed 458 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 62.64% (Passed 201 times, Failed 337 times)

Stack Traces | 12.6s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1263}}
Executing command python in sandbox ilrzkz8710fkfnq119yzm
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (12.59s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 63.93% (Passed 202 times, Failed 358 times)

Stack Traces | 8.81s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox i1mtr3rp5t8y1s4nv3u6m
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1263}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (8.81s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 63.86% (Passed 202 times, Failed 357 times)

Stack Traces | 8.71s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1263}}
Executing command python in sandbox i2vo5gyl3ktn32gcdda9m
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (8.71s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 66.03% (Passed 212 times, Failed 412 times)

Stack Traces | 75.5s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (75.51s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 66.78% (Passed 202 times, Failed 406 times)

Stack Traces | 39.6s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1264}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 184 MB\nFree memory before tmpfs mount: 800 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Memory to use in integrity test (80% of free, min 64MB): 640 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"640+0 records in\n640+0 records out\n671088640 bytes (671 MB, 640 MiB) copied, 3.94591 s, 170 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"C"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"o"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"a"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"d"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"b"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"i"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"g"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"imed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=640\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.89\n\tPercent of CPU this job got: 98%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.95\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kby"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"tes): 0\n\tMax"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"imum r"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"esident "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"set size (kbytes): 2688\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 345\n\tVoluntary context switches: 4\n\tInvoluntary context switches: 16\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox igoporperjk6vpgww8k5f
Executing command bash in sandbox igoporperjk6vpgww8k5f (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1280}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"aed358a3ba643cca95ba917f157328923a2017832e373397361b63434ea18b50\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox igoporperjk6vpgww8k5f
Executing command bash in sandbox igoporperjk6vpgww8k5f (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1283}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox igoporperjk6vpgww8k5f: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (39.59s)

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

The MaxLevelStore=warning setting in journald.conf.d/e2b.conf causes journald to discard logs whose syslog priority value is above 4 (that is, less severe than warning), including the default info (6) priority for service stdout and stderr. This will result in the loss of envd logs intended for preservation. MaxLevelStore should be set to info to ensure these logs are captured correctly.

Storage=persistent
SystemMaxUse=8M
SystemMaxFileSize=2M
MaxLevelStore=warning

high

The MaxLevelStore=warning setting causes journald to discard all logs whose syslog priority value is above 4 (less severe than warning), including the default info (6) priority assigned by systemd to service stdout and stderr. Consequently, all envd logs, including the Warn and Error levels intended for preservation, will be dropped. Consider setting this to info and relying on envd's internal filtering to manage volume, or ensuring envd outputs syslog-compatible priority prefixes.

MaxLevelStore=info
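
The filtering concern above can be modelled in a few lines of Go. This is a simplified sketch of the rule, not journald's code: it assumes a line may start with a `<N>` priority prefix (N from 0 to 7), that unprefixed service output gets the default info (6) priority, and that a message is kept only when its priority value is at or below the MaxLevelStore threshold. The function names are hypothetical.

```go
package main

import "fmt"

const (
	prioWarning = 4 // syslog WARNING
	prioInfo    = 6 // syslog INFO, the default for service stdout/stderr
)

// parsePriority returns the priority from a leading "<N>" prefix and the
// remaining message, or defaultPrio and the whole line if no prefix exists.
func parsePriority(line string, defaultPrio int) (int, string) {
	if len(line) >= 3 && line[0] == '<' && line[2] == '>' &&
		line[1] >= '0' && line[1] <= '7' {
		return int(line[1] - '0'), line[3:]
	}
	return defaultPrio, line
}

// stored reports whether a message passes a MaxLevelStore-style threshold.
// Lower values are more severe, so "at or below" means "at least as severe".
func stored(prio, maxLevelStore int) bool {
	return prio <= maxLevelStore
}

func main() {
	lines := []string{
		"panic: runtime error",     // no prefix: treated as info
		"<4>mmds poll failed",      // explicit warning
		"<3>failed to bind socket", // explicit error
	}
	for _, line := range lines {
		prio, msg := parsePriority(line, prioInfo)
		fmt.Printf("%q prio=%d stored=%v\n", msg, prio, stored(prio, prioWarning))
	}
}
```

Under this model an unprefixed panic line is dropped by a warning threshold, which is the case the review comment is pointing at.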

Route envd stdout to /dev/null and keep only stderr in journald via the
envd systemd service, so envd panics/fatal errors are still inspectable
but per-request debug events no longer dirty guest memory and rootfs
pages on every snapshot. Full debug logs continue to ship through the
HTTP exporter to the orchestrator.

Bound the rest of journald with a drop-in: persistent (rootfs-backed)
storage capped at 8M with warning-only filtering, so other systemd
services can't grow the journal without bound either.
@ValentaTomas ValentaTomas force-pushed the perf/reduce-sandbox-memory-dirtying branch from c419f4b to de6a6f1 Compare May 16, 2026 00:49

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Stderr panics silently dropped by journald level filter
    • Added SyslogLevel=warning to envd.service to ensure stderr messages receive priority 4, allowing them to pass the MaxLevelStore=warning filter in journald.conf.

Or push these changes by commenting:

@cursor push f2185fdc4a
Preview (f2185fdc4a)
diff --git a/packages/orchestrator/pkg/template/build/core/rootfs/files/envd.service.tpl b/packages/orchestrator/pkg/template/build/core/rootfs/files/envd.service.tpl
--- a/packages/orchestrator/pkg/template/build/core/rootfs/files/envd.service.tpl
+++ b/packages/orchestrator/pkg/template/build/core/rootfs/files/envd.service.tpl
@@ -16,6 +16,7 @@
 # stderr (envd panics/fatal errors) reaches journald.
 StandardOutput=null
 StandardError=journal
+SyslogLevel=warning
 Environment=GOTRACEBACK=all
 LimitCORE=infinity
 ExecStartPre=/bin/sh -c 'mountpoint -q /etc/ssl/certs || (mkdir -p /run/e2b/certs && mount --bind /run/e2b/certs /etc/ssl/certs) && ([ -s /etc/ssl/certs/ca-certificates.crt ] || update-ca-certificates)'

These messages are about envd-handled user processes, not envd internals,
so they should not land in journald via stderr. They now flow through the
zerolog logger to envd stdout (discarded by systemd) and to the HTTP
exporter, keeping only true envd panics/fatal errors on stderr.
@ValentaTomas ValentaTomas marked this pull request as ready for review May 16, 2026 00:58
@ValentaTomas

Member Author

Closing in favor of reopening the original #2423 on the same branch.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 585e9cc34d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Group=root
# Discard envd stdout (debug logs still ship via the HTTP exporter); only
# stderr (envd panics/fatal errors) reaches journald.
StandardOutput=null

P1 Badge Keep a local log sink when exporter delivery fails

Setting StandardOutput=null drops every envd log line written to stdout, but envd’s logger always writes to stdout (packages/envd/internal/logs/logger.go) and the HTTP exporter’s failure path also falls back to stdout (packages/envd/internal/logs/exporter/exporter.go). In any sandbox where MMDS/log-collector data is missing or log POSTs fail, this change removes the last local copy of operational logs, so process/debug output becomes unrecoverable during the exact outage scenarios where it is needed.

…p log queue

Persistently-failing MMDS polling (50ms ticker) and a wedged log
collector both used to write to stderr per iteration / per log line,
which 1:1-amplified into journald and defeated the snapshot-dirtying
fix from this PR. Both now log only the first failure of each kind.

Also cap the in-memory log queue at 10k entries with drop-oldest under
back-pressure, so a hung HTTP exporter cannot grow it without bound.

These are recoverable / expected-during-boot conditions, not envd
crashes, so prefix them with <4> so journald classifies them as
warning. Still stored (MaxLevelStore=warning) but no longer surface as
errors in journalctl, and we drop the log package dependency in the
exporter.

envd now skips writing to stdout unless -verbose is passed. In FC mode
this removes the need for the systemd StandardOutput=null drop-in (so
default systemd routing is restored), and at the same time keeps
production journald free of envd debug traffic. Devs running envd
locally pass -verbose to see the full debug stream; the HTTP exporter
still ships everything to the orchestrator regardless.
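
A hedged Go sketch of the opt-in stdout behaviour this commit message describes, using only the standard library. `selectWriters` is a hypothetical name, and `io.Discard` stands in for the HTTP exporter; the real code wires zerolog writers instead.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// selectWriters returns the log sinks to use. The exporter is always
// included; stdout is added only when verbose is set.
func selectWriters(verbose bool, httpExporter io.Writer) []io.Writer {
	writers := []io.Writer{httpExporter}
	if verbose {
		writers = append(writers, os.Stdout)
	}
	return writers
}

func main() {
	verbose := flag.Bool("verbose", false, "also write logs to stdout")
	flag.Parse()

	// io.Discard is a placeholder for the HTTP exporter in this sketch.
	out := io.MultiWriter(selectWriters(*verbose, io.Discard)...)
	fmt.Fprintln(out, "debug: request handled")
}
```

Run without flags, the program prints nothing; with `-verbose` the same line also reaches stdout.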

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f48305f7a5

Comment on lines +21 to +23
if !isNotFC {
exporters = append(exporters, exporter.NewHTTPLogsExporter(ctx, isNotFC, mmdsChan))
}

P2 Badge Keep a default log sink in non-Firecracker mode

When envd runs with -isnotfc (for example via packages/envd/Makefile start-docker), this change leaves exporters empty unless -verbose is also set, because the HTTP exporter is now FC-only and stdout is opt-in. With no writer, zerolog drops all envd logs, so routine failures in local/non-FC runs become invisible and debugging regresses from previous behavior where stdout logging was always enabled in non-FC mode.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Local dev mode silently discards all log output
    • Added 'isNotFC' condition to stdout check so local dev mode always enables stdout logging while FC mode keeps it opt-in via -verbose flag.

Or push these changes by commenting:

@cursor push 65f9a4fffc
Preview (65f9a4fffc)
diff --git a/packages/envd/internal/logs/logger.go b/packages/envd/internal/logs/logger.go
--- a/packages/envd/internal/logs/logger.go
+++ b/packages/envd/internal/logs/logger.go
@@ -21,10 +21,11 @@
 	if !isNotFC {
 		exporters = append(exporters, exporter.NewHTTPLogsExporter(ctx, isNotFC, mmdsChan))
 	}
-	// Stdout is opt-in via -verbose. Inside FC stdout flows into journald and
-	// dirties guest pages on every snapshot, so we keep it off by default and
-	// rely on the HTTP exporter to ship debug logs to the orchestrator.
-	if verbose {
+	// Stdout is opt-in via -verbose inside FC mode. Inside FC, stdout flows into
+	// journald and dirties guest pages on every snapshot, so we keep it off by
+	// default and rely on the HTTP exporter. In local dev mode, stdout is always
+	// enabled since there's no page-dirtying concern.
+	if verbose || isNotFC {
 		exporters = append(exporters, os.Stdout)
 	}

Reviewed by Cursor Bugbot for commit f48305f. Configure here.

Comment thread packages/envd/internal/logs/logger.go
Both the MMDS poll (50ms ticker) and the HTTP exporter (one send per
log line) could hammer broken endpoints indefinitely. Even with the
log suppression in the previous commit, the underlying HTTP requests
kept firing on every iteration / every log line, wasting CPU and
waiting out ExporterTimeout (10s) on each failed attempt.

MMDS now backs off 50ms → 100ms → 200ms → ... up to 1s. The exporter
backs off 1s → 2s → ... up to 5min between send attempts after a
failure, and drops queued logs during the cooldown window.
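
The doubling schedule above can be sketched as a single Go function. This is an illustration under stated assumptions, not the PR's implementation: `nextBackoff` and the constant names are hypothetical, and the caller is assumed to reset the delay to zero after a successful attempt.

```go
package main

import (
	"fmt"
	"time"
)

// Bounds taken from the commit message; the names are illustrative.
const (
	mmdsMin     = 50 * time.Millisecond
	mmdsMax     = time.Second
	exporterMin = time.Second
	exporterMax = 5 * time.Minute
)

// nextBackoff returns the delay to wait after another failure: min for the
// first failure, then double the current delay, never exceeding max.
func nextBackoff(cur, min, max time.Duration) time.Duration {
	if cur < min {
		return min
	}
	next := cur * 2
	if next > max {
		return max
	}
	return next
}

func main() {
	// Print the MMDS schedule for a run of consecutive failures.
	delay := time.Duration(0)
	for i := 0; i < 7; i++ {
		delay = nextBackoff(delay, mmdsMin, mmdsMax)
		fmt.Println(delay)
	}
}
```

The same function covers the exporter by passing `exporterMin` and `exporterMax`.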

Also gate the exporter's printLog/non-FC stdout dumps behind -verbose
so failed-send fallback writes don't leak into journald either.

Replace the entry-count cap (10k) with a 4 MiB byte cap and track
buffered size explicitly. Drop-oldest is preserved so that after a long
buffering window (e.g. before MMDS is up on boot) the orchestrator
receives the most recent logs first. Nil-out dropped slots so the GC
can reclaim them instead of being pinned by the slice's backing array.
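
A minimal Go sketch of a byte-capped, drop-oldest buffer in the spirit of this commit message. The `logBuffer` type and its field names are hypothetical; only the 4 MiB figure comes from the text. Note that this sketch does not guard against a single entry larger than the cap.

```go
package main

import "fmt"

const maxBufferedBytes = 4 << 20 // 4 MiB, as stated in the commit message

// logBuffer keeps the most recent entries within a byte budget.
type logBuffer struct {
	entries       [][]byte
	bufferedBytes int
	maxBytes      int
}

// add appends entry, evicting the oldest entries first until it fits or the
// queue is empty.
func (b *logBuffer) add(entry []byte) {
	for len(b.entries) > 0 && b.bufferedBytes+len(entry) > b.maxBytes {
		b.bufferedBytes -= len(b.entries[0])
		// Clear the slot so the dropped entry is not kept alive by the
		// slice's backing array.
		b.entries[0] = nil
		b.entries = b.entries[1:]
	}
	b.entries = append(b.entries, entry)
	b.bufferedBytes += len(entry)
}

func main() {
	buf := &logBuffer{maxBytes: maxBufferedBytes}
	for i := 0; i < 5; i++ {
		buf.add(make([]byte, 1<<20)) // five 1 MiB entries against a 4 MiB cap
	}
	fmt.Println(len(buf.entries), buf.bufferedBytes)
}
```

Tracking the byte total alongside the slice avoids re-summing entry sizes on every insert.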

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 203b9e9940

Comment on lines +236 to 237
w.bufferedBytes += len(logs)
w.logs = append(w.logs, logs)

P2 Badge Reject or truncate oversized log entries before buffering

addLogs only evicts existing entries while len(w.logs) > 0, so if a single incoming record is larger than maxBufferedBytes and the queue is empty, it is still appended and bufferedBytes jumps past the configured cap. That breaks the new bounded-memory guarantee and allows one unusually large log event to exceed the intended 4 MiB limit (with repeated large events causing repeated over-cap allocations).
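
One way to close the gap described above is to reject an entry that could never fit, before any eviction happens. The Go sketch below illustrates that option only; `boundedQueue` and its fields are hypothetical names, and truncating the entry instead of rejecting it would be an equally valid choice.

```go
package main

import "fmt"

// boundedQueue is a drop-oldest queue with a hard byte limit.
type boundedQueue struct {
	entries [][]byte
	size    int
	limit   int
}

// add returns false and leaves the queue untouched when entry alone is
// larger than the limit; otherwise it evicts oldest entries until it fits.
func (q *boundedQueue) add(entry []byte) bool {
	if len(entry) > q.limit {
		return false // would exceed the cap even in an empty queue
	}
	for len(q.entries) > 0 && q.size+len(entry) > q.limit {
		q.size -= len(q.entries[0])
		q.entries[0] = nil
		q.entries = q.entries[1:]
	}
	q.entries = append(q.entries, entry)
	q.size += len(entry)
	return true
}

func main() {
	q := &boundedQueue{limit: 8}
	fmt.Println(q.add([]byte("0123456789"))) // larger than the limit
	fmt.Println(q.add([]byte("abcd")))
	fmt.Println(q.size)
}
```

With the early return, the byte total can never exceed the limit, so the bounded-memory guarantee holds for any input size.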

@ValentaTomas

Member Author

Closing in favor of the split into #2675 (keep envd logging out of journald) and #2676 (bound and back off the logs exporter and MMDS poll).

ValentaTomas added a commit that referenced this pull request May 16, 2026
envd's zerolog stdout writer is now gated behind a new `-verbose` flag
(default off), so production envd inside FC no longer writes anything to
stdout — journald stays clean of per-request debug events. The HTTP
exporter still ships full debug to the orchestrator regardless.

A journald drop-in caps the rest of the in-VM journal at
`Storage=persistent` / `SystemMaxUse=8M` / `MaxLevelStore=warning`
so other systemd services can't grow it without bound either.

`handler.go`'s "error reading from pty/stdout/stderr" messages were on
raw stderr but are about envd-handled user processes, not envd
internals; they now flow through the zerolog logger.

Split out of #2674 (journal-side half).
@ValentaTomas ValentaTomas deleted the perf/reduce-sandbox-memory-dirtying branch May 18, 2026 01:05