perf(sandbox): reduce memory dirtying from journald and envd logging#2674
perf(sandbox): reduce memory dirtying from journald and envd logging#2674ValentaTomas wants to merge 7 commits into
Conversation
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 203b9e9. Bugbot is set up for automated code reviews on this repo. Configure here. |
❌ 8 Tests Failed:
View the full list of 12 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Code Review
The MaxLevelStore=warning setting in journald.conf.d/e2b.conf causes journald to discard logs with priority higher than 4, including the default info priority for service stdout and stderr. This will result in the loss of envd logs intended for preservation. MaxLevelStore should be set to info to ensure these logs are captured correctly.
| Storage=persistent | ||
| SystemMaxUse=8M | ||
| SystemMaxFileSize=2M | ||
| MaxLevelStore=warning |
There was a problem hiding this comment.
The MaxLevelStore=warning setting causes journald to discard all logs with a priority higher than 4, including the default info (6) priority assigned by systemd to service stdout and stderr. Consequently, all envd logs—including the Warn and Error levels intended for preservation—will be dropped. Consider setting this to info and relying on envd's internal filtering to manage volume, or ensuring envd outputs syslog-compatible priority prefixes.
MaxLevelStore=info
Route envd stdout to /dev/null and keep only stderr in journald via the envd systemd service, so envd panics/fatal errors are still inspectable but per-request debug events no longer dirty guest memory and rootfs pages on every snapshot. Full debug logs continue to ship through the HTTP exporter to the orchestrator. Bound the rest of journald with a drop-in: persistent (rootfs-backed) storage capped at 8M with warning-only filtering, so other systemd services can't grow the journal without bound either.
c419f4b to
de6a6f1
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Stderr panics silently dropped by journald level filter
- Added SyslogLevel=warning to envd.service to ensure stderr messages receive priority 4, allowing them to pass the MaxLevelStore=warning filter in journald.conf.
Or push these changes by commenting:
@cursor push f2185fdc4a
Preview (f2185fdc4a)
diff --git a/packages/orchestrator/pkg/template/build/core/rootfs/files/envd.service.tpl b/packages/orchestrator/pkg/template/build/core/rootfs/files/envd.service.tpl
--- a/packages/orchestrator/pkg/template/build/core/rootfs/files/envd.service.tpl
+++ b/packages/orchestrator/pkg/template/build/core/rootfs/files/envd.service.tpl
@@ -16,6 +16,7 @@
# stderr (envd panics/fatal errors) reaches journald.
StandardOutput=null
StandardError=journal
+SyslogLevel=warning
Environment=GOTRACEBACK=all
LimitCORE=infinity
ExecStartPre=/bin/sh -c 'mountpoint -q /etc/ssl/certs || (mkdir -p /run/e2b/certs && mount --bind /run/e2b/certs /etc/ssl/certs) && ([ -s /etc/ssl/certs/ca-certificates.crt ] || update-ca-certificates)'You can send follow-ups to the cloud agent here.
These messages are about envd-handled user processes, not envd internals, so they should not land in journald via stderr. They now flow through the zerolog logger to envd stdout (discarded by systemd) and to the HTTP exporter, keeping only true envd panics/fatal errors on stderr.
|
Closing in favor of reopening the original #2423 on the same branch. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 585e9cc34d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| Group=root | ||
| # Discard envd stdout (debug logs still ship via the HTTP exporter); only | ||
| # stderr (envd panics/fatal errors) reaches journald. | ||
| StandardOutput=null |
There was a problem hiding this comment.
Keep a local log sink when exporter delivery fails
Setting StandardOutput=null drops every envd log line written to stdout, but envd’s logger always writes to stdout (packages/envd/internal/logs/logger.go) and the HTTP exporter’s failure path also falls back to stdout (packages/envd/internal/logs/exporter/exporter.go). In any sandbox where MMDS/log-collector data is missing or log POSTs fail, this change removes the last local copy of operational logs, so process/debug output becomes unrecoverable during the exact outage scenarios where it is needed.
Useful? React with 👍 / 👎.
…p log queue Persistently-failing MMDS polling (50ms ticker) and a wedged log collector both used to write to stderr per iteration / per log line, which 1:1-amplified into journald and defeated the snapshot-dirtying fix from this PR. Both now log only the first failure of each kind. Also cap the in-memory log queue at 10k entries with drop-oldest under back-pressure, so a hung HTTP exporter cannot grow it without bound.
These are recoverable / expected-during-boot conditions, not envd crashes, so prefix them with <4> so journald classifies them as warning. Still stored (MaxLevelStore=warning) but no longer surface as errors in journalctl, and we drop the log package dependency in the exporter.
envd now skips writing to stdout unless -verbose is passed. In FC mode this removes the need for the systemd StandardOutput=null drop-in (so default systemd routing is restored), and at the same time keeps production journald free of envd debug traffic. Devs running envd locally pass -verbose to see the full debug stream; the HTTP exporter still ships everything to the orchestrator regardless.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f48305f7a5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !isNotFC { | ||
| exporters = append(exporters, exporter.NewHTTPLogsExporter(ctx, isNotFC, mmdsChan)) | ||
| } |
There was a problem hiding this comment.
Keep a default log sink in non-Firecracker mode
When envd runs with -isnotfc (for example via packages/envd/Makefile start-docker), this change leaves exporters empty unless -verbose is also set, because the HTTP exporter is now FC-only and stdout is opt-in. With no writer, zerolog drops all envd logs, so routine failures in local/non-FC runs become invisible and debugging regresses from previous behavior where stdout logging was always enabled in non-FC mode.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Local dev mode silently discards all log output
- Added 'isNotFC' condition to stdout check so local dev mode always enables stdout logging while FC mode keeps it opt-in via -verbose flag.
Or push these changes by commenting:
@cursor push 65f9a4fffc
Preview (65f9a4fffc)
diff --git a/packages/envd/internal/logs/logger.go b/packages/envd/internal/logs/logger.go
--- a/packages/envd/internal/logs/logger.go
+++ b/packages/envd/internal/logs/logger.go
@@ -21,10 +21,11 @@
if !isNotFC {
exporters = append(exporters, exporter.NewHTTPLogsExporter(ctx, isNotFC, mmdsChan))
}
- // Stdout is opt-in via -verbose. Inside FC stdout flows into journald and
- // dirties guest pages on every snapshot, so we keep it off by default and
- // rely on the HTTP exporter to ship debug logs to the orchestrator.
- if verbose {
+ // Stdout is opt-in via -verbose inside FC mode. Inside FC, stdout flows into
+ // journald and dirties guest pages on every snapshot, so we keep it off by
+ // default and rely on the HTTP exporter. In local dev mode, stdout is always
+ // enabled since there's no page-dirtying concern.
+ if verbose || isNotFC {
exporters = append(exporters, os.Stdout)
}You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit f48305f. Configure here.
Both the MMDS poll (50ms ticker) and the HTTP exporter (one send per log line) could hammer broken endpoints indefinitely. Even with the log suppression in the previous commit, the underlying HTTP requests kept firing on every iteration / every log line, wasting CPU and ExporterTimeout (10s). MMDS now backs off 50ms → 100ms → 200ms → ... up to 1s. The exporter backs off 1s → 2s → ... up to 5min between send attempts after a failure, and drops queued logs during the cooldown window. Also gate the exporter's printLog/non-FC stdout dumps behind -verbose so failed-send fallback writes don't leak into journald either.
Replace the entry-count cap (10k) with a 4 MiB byte cap and track buffered size explicitly. Drop-oldest is preserved so that after a long buffering window (e.g. before MMDS is up on boot) the orchestrator receives the most recent logs first. Nil-out dropped slots so the GC can reclaim them instead of being pinned by the slice's backing array.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 203b9e9940
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| w.bufferedBytes += len(logs) | ||
| w.logs = append(w.logs, logs) |
There was a problem hiding this comment.
Reject or truncate oversized log entries before buffering
addLogs only evicts existing entries while len(w.logs) > 0, so if a single incoming record is larger than maxBufferedBytes and the queue is empty, it is still appended and bufferedBytes jumps past the configured cap. That breaks the new bounded-memory guarantee and allows one unusually large log event to exceed the intended 4 MiB limit (with repeated large events causing repeated over-cap allocations).
Useful? React with 👍 / 👎.
envd's zerolog stdout writer is now gated behind a new \`-verbose\` flag (default off), so production envd inside FC no longer writes anything to stdout — journald stays clean of per-request debug events. The HTTP exporter still ships full debug to the orchestrator regardless. A journald drop-in caps the rest of the in-VM journal at \`Storage=persistent\` / \`SystemMaxUse=8M\` / \`MaxLevelStore=warning\` so other systemd services can't grow it without bound either. \`handler.go\`'s "error reading from pty/stdout/stderr" messages were on raw stderr but are about envd-handled user processes, not envd internals; they now flow through the zerolog logger. Split out of #2674 (journal-side half).


Reduces guest memory and rootfs dirtying caused by journald and envd logging during pause-resume cycles.
envd no longer writes to its stdout by default. A new
-verboseflag opts in for local development; in FC production envd's stdout stays empty so it doesn't dirty journald pages. The HTTP exporter still ships the full debug stream to the orchestrator. Only envd's stderr (Go runtime panics,log.Fatalf, bootstrap and MMDS-poll warnings) reaches journald — which is where we actually want envd's lifecycle/fatal output to land.The MMDS poll (50ms ticker) and the HTTP exporter's per-line "send failed" paths used to write to stderr per iteration / per log line, which 1:1-amplified into journald during outages. Both now log only the first failure of each kind. The exporter additionally caps its in-memory queue at 10k entries with drop-oldest semantics so a wedged collector can't grow it without bound. The remaining stderr writes from those two paths are tagged with the syslog
<4>(WARNING) prefix so journald classifies them correctly instead of as errors.The
handler.go"error reading from pty/stdout/stderr" messages were on stderr but are about envd-handled user processes, not envd internals; they now flow through the zerolog logger.A journald drop-in caps the remaining (other systemd services) journal at
Storage=persistent/SystemMaxUse=8M/MaxLevelStore=warningso the journal can't grow without bound and the writes land on the rootfs (small, ext4-inline_data-friendly) rather than dirtying RAM via the default tmpfs path.