fix(envd): suppress repeat MMDS poll failures #2678

Merged
ValentaTomas merged 1 commit into main from perf/envd-mmds-log-suppression
May 16, 2026
Conversation

@ValentaTomas

Member

The MMDS poll runs on a 50ms ticker and wrote one stderr line per failed iteration, so during outages (e.g. before MMDS is up at boot) every failed poll became a journald entry. Log only the first failure of each kind and tag the stderr writes as syslog WARNING.

Split out of #2676 (the exporter-hardening half).

@cla-bot cla-bot Bot added the cla-signed label May 16, 2026
@cursor

cursor Bot commented May 16, 2026


PR Summary

Medium Risk
Changes MMDS polling error reporting, which can reduce observability during boot/outage scenarios if MMDS never becomes available. Functional behavior is otherwise unchanged, but debugging relies on the final cancellation log.

Overview
MMDS polling no longer emits a stderr line on every 50ms failure; it now only records the most recent failure and logs a single syslog-warning (<4>) message when the polling context is cancelled. This may hide ongoing MMDS unavailability for up to the full timeout and only reports the last error, not the first or the most frequent failure mode.

Reviewed by Cursor Bugbot for commit aa0449f.

@codecov

codecov Bot commented May 16, 2026


❌ 6 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 2622 | 6 | 2616 | 5 |
View the full list of 10 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 71.05% (Passed 209 times, Failed 513 times)

Stack Traces | 0.4s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (0.40s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestEgressFirewallAllowDomainAndIP

Flake rate in main: 54.91% (Passed 211 times, Failed 257 times)

Stack Traces | 4.15s run time
=== RUN   TestEgressFirewallAllowDomainAndIP
=== PAUSE TestEgressFirewallAllowDomainAndIP
=== CONT  TestEgressFirewallAllowDomainAndIP
Executing command curl in sandbox iaaqqegy70m5evguwkvbw
    sandbox_network_out_test.go:673: Command [curl] output: event:{start:{pid:1321}}
Executing command curl in sandbox iy32neowh8mqrg9e8k0zv
    sandbox_network_out_test.go:673: Command [curl] output: event:{data:{stdout:"HTTP/2 301 \r\nlocation: https://www.google.com/\r\ncontent-type: text/html; charset=UTF-8\r\ncontent-security-policy-report-only: object-src 'none';base-uri 'self';script-src 'nonce-BzodCVWFnZMq3zraEwfR_A' 'strict-dynamic' 'report-sample' 'unsafe-eval' 'unsafe-inline' https: http:;report-uri https://csp.withgoogle..../csp/gws/other-hp\r\ndate: Sat, 16 May 2026 07:13:14 GMT\r\nexpires: Mon, 15 Jun 2026 07:13:14 GMT\r\ncache-control: public, max-age=2592000\r\nserver: gws\r\ncontent-length: 220\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_out_test.go:673: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_out_test.go:673: Command [curl] completed successfully in sandbox ik3oz1dcvqpskd003upp1
    sandbox_network_out_test.go:674: Command [curl] output: event:{start:{pid:1323}}
    sandbox_network_out_test.go:674: Command [curl] output: event:{data:{stdout:"HTTP/2 301 \r\ndate: Sat, 16 May 2026 07:13:14 GMT\r\nlocation: https://one.one.one.one/\r\nserver: cloudflare\r\ncf-ray: 9fc89421692f8b33-DFW\r\n\r\n"}}
    sandbox_network_out_test.go:674: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_out_test.go:674: Command [curl] completed successfully in sandbox ik3oz1dcvqpskd003upp1
Executing command curl in sandbox iy32neowh8mqrg9e8k0zv
    sandbox_network_out_test.go:675: Command [curl] output: event:{start:{pid:1324}}
    sandbox_network_out_test.go:675: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_out_test.go:676: Command [curl] output: event:{start:{pid:1326}}
    sandbox_network_out_test.go:676: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:75
        	            				.../api/sandboxes/sandbox_network_out_test.go:676
        	Error:      	"failed to execute command curl in sandbox ik3oz1dcvqpskd003upp1: invalid_argument: protocol error: incomplete envelope: unexpected EOF" does not contain "failed with exit code"
        	Test:       	TestEgressFirewallAllowDomainAndIP
        	Messages:   	Expected connection failure message
--- FAIL: TestEgressFirewallAllowDomainAndIP (4.15s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.65% (Passed 219 times, Failed 719 times)

Stack Traces | 195s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (194.86s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.22% (Passed 210 times, Failed 712 times)

Stack Traces | 4.71s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox ivbdimqjpew4c6h0mvjjp
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1363}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ivbdimqjpew4c6h0mvjjp
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1364}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1365}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Sat, 16 May 2026 07:13:30 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ivbdimqjpew4c6h0mvjjp
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (4.71s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 56.31% (Passed 367 times, Failed 473 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 62.84% (Passed 207 times, Failed 350 times)

Stack Traces | 7.19s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1257}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (7.19s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 64.19% (Passed 207 times, Failed 371 times)

Stack Traces | 8.84s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox ipsyxk4ibj8twqte1yn50
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1257}}
Executing command python in sandbox ia6auefrlyqaevdlayaq2
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (8.84s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 64.06% (Passed 207 times, Failed 369 times)

Stack Traces | 9.91s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox i2qunispp347jrj8jl8sk
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1258}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (9.91s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 66.15% (Passed 217 times, Failed 424 times)

Stack Traces | 81.1s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (81.08s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 66.88% (Passed 207 times, Failed 418 times)

Stack Traces | 28.9s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1253}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 184 MB\nFree memory before tmpfs mount: 800 MB\nMemory to use in integrity test (80% of free, min 64MB): 640 MB\n"}}
Executing command bash in sandbox i5cumrp3v4kfd6ohn1w3a (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"640+0 records in\n640+0 records out\n671088640 bytes (671 MB, 640 MiB) copied, 3.19313 s, 210 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=640\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.18\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.19\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2632\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page faults: 346\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 28\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Socket messages"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" sent: 0\n\tSocket messages received: 0\n\tSignals de"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"livered:"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" 0\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Pag"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e si"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ze (byte"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox iuekprdvavs6wltvicatj
Executing command bash in sandbox iuekprdvavs6wltvicatj (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1269}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"494d5806bb13f1ddf57cb626d4a11a2f02cc782b77c474ff395662934e4a25ba\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox iuekprdvavs6wltvicatj
Executing command bash in sandbox iuekprdvavs6wltvicatj (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1274}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox iuekprdvavs6wltvicatj: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (28.90s)


@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

The changes implement error suppression in the MMDS polling loop by using boolean flags to log token and option retrieval errors only once, and remove the log message upon context cancellation. I have no feedback to provide.

@ValentaTomas ValentaTomas force-pushed the perf/envd-mmds-log-suppression branch from 4e064a9 to f71c31b Compare May 16, 2026 06:57
The MMDS poll runs on a 50ms ticker and was writing one stderr line per
failed iteration, which 1:1-amplified into journald during outages
(e.g. before MMDS is up at boot). Log only the first failure of each
kind and tag stderr writes as syslog WARNING (<4>). Drop the
context-cancelled stderr line: shutdown noise.
@ValentaTomas ValentaTomas force-pushed the perf/envd-mmds-log-suppression branch from f71c31b to aa0449f Compare May 16, 2026 06:58
@ValentaTomas ValentaTomas marked this pull request as ready for review May 16, 2026 06:59
@ValentaTomas ValentaTomas enabled auto-merge (squash) May 16, 2026 06:59
Comment thread packages/envd/internal/host/mmds.go
Comment thread packages/envd/internal/host/mmds.go
@ValentaTomas ValentaTomas merged commit 73d691a into main May 16, 2026
55 checks passed
@ValentaTomas ValentaTomas deleted the perf/envd-mmds-log-suppression branch May 16, 2026 07:22
3 participants