Skip to content

obs(envd): log initLock wait + per-stage SetData durations#2691

Closed
ValentaTomas wants to merge 2 commits into
mainfrom
obs/envd-init-timing
Closed

obs(envd): log initLock wait + per-stage SetData durations#2691
ValentaTomas wants to merge 2 commits into
mainfrom
obs/envd-init-timing

Conversation

@ValentaTomas

@ValentaTomas ValentaTomas commented May 17, 2026

Copy link
Copy Markdown
Member

Log init_lock_wait and set_data durations in /init plus per-stage CA/NFS timings.

Today an /init failure surfaces as \"envd silent\" on the orchestrator
without any envd-side signal about which stage of SetData was holding
initLock. Each invocation now emits one info log with init_lock_wait
and set_data durations, plus debug-level per-stage durations for the
CA-bundle install and the NFS setup so the next investigation can tell
\"envd was queued behind a slow predecessor\" from \"envd's NFS unmount
took 9 seconds\" without grepping kernel traces.
@cla-bot cla-bot Bot added the cla-signed label May 17, 2026
@cursor

cursor Bot commented May 17, 2026

Copy link
Copy Markdown

PR Summary

Low Risk
Low risk because changes are limited to additional logging/timing around existing /init and SetData flows plus a version bump. Main risk is increased log volume or slight overhead on hot paths.

Overview
This adds duration logging to /init to record init_lock_wait and total set_data time, and logs per-stage timings for CA bundle installation and NFS setup.

It also bumps the envd version from 0.5.23 to 0.5.24.

Reviewed by Cursor Bugbot for commit 03586e6. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov

codecov Bot commented May 17, 2026

Copy link
Copy Markdown

❌ 10 Tests Failed:

Tests completed Failed Passed Skipped
2622 10 2612 5
View the full list of 16 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 71.70% (Passed 221 times, Failed 560 times)

Stack Traces | 1.42s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (1.42s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestEgressFirewallPrivateIPRangesAlwaysBlocked

Flake rate in main: 55.04% (Passed 223 times, Failed 273 times)

Stack Traces | 150s run time
=== RUN   TestEgressFirewallPrivateIPRangesAlwaysBlocked
=== PAUSE TestEgressFirewallPrivateIPRangesAlwaysBlocked
=== CONT  TestEgressFirewallPrivateIPRangesAlwaysBlocked
--- FAIL: TestEgressFirewallPrivateIPRangesAlwaysBlocked (149.68s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestEgressFirewallPrivateIPRangesAlwaysBlocked/link_local_169.254.0.0/16

Flake rate in main: 55.56% (Passed 216 times, Failed 270 times)

Stack Traces | 1.12s run time
=== RUN   TestEgressFirewallPrivateIPRangesAlwaysBlocked/link_local_169.254.0.0/16
=== PAUSE TestEgressFirewallPrivateIPRangesAlwaysBlocked/link_local_169.254.0.0/16
=== CONT  TestEgressFirewallPrivateIPRangesAlwaysBlocked/link_local_169.254.0.0/16
Executing command curl in sandbox iurq6sa39vvf8j8jae8pw
    sandbox_network_out_test.go:463: Command [curl] output: event:{start:{pid:1316}}
    sandbox_network_out_test.go:463: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:75
        	            				.../api/sandboxes/sandbox_network_out_test.go:463
        	Error:      	"failed to execute command curl in sandbox iurq6sa39vvf8j8jae8pw: invalid_argument: protocol error: incomplete envelope: unexpected EOF" does not contain "failed with exit code"
        	Test:       	TestEgressFirewallPrivateIPRangesAlwaysBlocked/link_local_169.254.0.0/16
        	Messages:   	Expected connection failure message
--- FAIL: TestEgressFirewallPrivateIPRangesAlwaysBlocked/link_local_169.254.0.0/16 (1.12s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.85% (Passed 231 times, Failed 767 times)

Stack Traces | 29.5s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (29.48s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.39% (Passed 222 times, Failed 760 times)

Stack Traces | 2.67s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox ivtvvfrtnd49dha7gm9j1
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1362}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ivtvvfrtnd49dha7gm9j1
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1363}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ivtvvfrtnd49dha7gm9j1
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1364}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Sun, 17 May 2026 10:52:57 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ivtvvfrtnd49dha7gm9j1
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (2.67s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 56.36% (Passed 391 times, Failed 505 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 64.45% (Passed 219 times, Failed 397 times)

Stack Traces | 8.42s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command /bin/bash in sandbox iz6mwtx0tvv59jiqodq71
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1255}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (8.42s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 64.33% (Passed 219 times, Failed 395 times)

Stack Traces | 6.63s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox iql4305rg1m77l9gh8yok
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1254}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (6.63s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir

Flake rate in main: 53.85% (Passed 276 times, Failed 322 times)

Stack Traces | 0.34s run time
=== RUN   TestListDir
=== PAUSE TestListDir
=== CONT  TestListDir
--- FAIL: TestListDir (0.34s)
Executing command python in sandbox ih0sr0ac09igd1w8ryn14
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_0_lists_only_root_directory

Flake rate in main: 57.88% (Passed 219 times, Failed 301 times)

Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_0_lists_only_root_directory
=== PAUSE TestListDir/depth_0_lists_only_root_directory
=== CONT  TestListDir/depth_0_lists_only_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_0_lists_only_root_directory
--- FAIL: TestListDir/depth_0_lists_only_root_directory (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_1_lists_root_directory

Flake rate in main: 57.88% (Passed 219 times, Failed 301 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_1_lists_root_directory
=== PAUSE TestListDir/depth_1_lists_root_directory
=== CONT  TestListDir/depth_1_lists_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_1_lists_root_directory
--- FAIL: TestListDir/depth_1_lists_root_directory (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)

Flake rate in main: 57.88% (Passed 219 times, Failed 301 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== PAUSE TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== CONT  TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
--- FAIL: TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory) (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_3_lists_all_directories_and_files

Flake rate in main: 57.88% (Passed 219 times, Failed 301 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_3_lists_all_directories_and_files
=== PAUSE TestListDir/depth_3_lists_all_directories_and_files
=== CONT  TestListDir/depth_3_lists_all_directories_and_files
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_3_lists_all_directories_and_files
--- FAIL: TestListDir/depth_3_lists_all_directories_and_files (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 66.17% (Passed 229 times, Failed 448 times)

Stack Traces | 79.1s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (79.14s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 66.87% (Passed 219 times, Failed 442 times)

Stack Traces | 30s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1253}}
Executing command bash in sandbox io29q4s3p5tzgk0q28van (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 183 MB\nFree memory before tmpfs mount: 801 MB\nMemory to use in integrity test (80% of free, min 64MB): 640 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"640+0 records in\n640+0 records out\n671088640 bytes (671 MB, 640 MiB) copied, 3.66661 s, 183 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"C"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=640\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.62\n\tPercent of CPU this job got: 98%\n\tElap"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"sed (wall clock) ti"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"me (h:mm:s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s or m:s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s): 0:03.67\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tAverage shared t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ext siz"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e (kby"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"tes): 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\n\tAverage u"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"nshared "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"data si"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ze (kbytes): 0\n\tAver"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"age stack size "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"(kbytes): 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\n\tAvera"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ge total size ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"kbytes"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"): 0\n\tMax"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"imum re"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"sident "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"set si"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ze (kby"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"tes): 2628\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Average"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" resident set si"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ze (kbytes): 0\n\tMajor (requiring I/O) page faults:"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" 2\n\tMinor ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"reclaim"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ing a fr"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ame) p"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"age fau"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"lts: 347\n\tVo"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"luntary"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" contex"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"t swit"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ches: 3\n\tI"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"nvolunt"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ary context swit"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ches: "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"16\n\tSwaps: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tFile "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"system"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" i"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"nputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 830 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox iezgdnd5vnrw6vw6gv1fp
Executing command bash in sandbox iezgdnd5vnrw6vw6gv1fp (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1269}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"988ce5dab26c955234c816e5403af33a83a601416fad6dae934c81f86c7f977e\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox iezgdnd5vnrw6vw6gv1fp
Executing command bash in sandbox iezgdnd5vnrw6vw6gv1fp (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1273}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox iezgdnd5vnrw6vw6gv1fp: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (30.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestSandboxAutoResumeViaProxy

Flake rate in main: 55.40% (Passed 223 times, Failed 277 times)

Stack Traces | 23.4s run time
=== RUN   TestSandboxAutoResumeViaProxy
=== PAUSE TestSandboxAutoResumeViaProxy
=== CONT  TestSandboxAutoResumeViaProxy
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"ib8hyxolu4hg0wd85clyg","message":"The sandbox is running but port is not open","port":8000,"code":502}
Executing command ls in sandbox iilh3wk96zoh7xsnhn5bt
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"ib8hyxolu4hg0wd85clyg","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"ib8hyxolu4hg0wd85clyg","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"ib8hyxolu4hg0wd85clyg","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:116: 
        	Error Trace:	.../tests/proxies/auto_resume_test.go:116
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestSandboxAutoResumeViaProxy
--- FAIL: TestSandboxAutoResumeViaProxy (23.39s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Info log emitted before error check misleads on failures
    • Moved the '/init handled' info log to after the error check so it only fires on successful SetData calls, preventing misleading success logs when auth or other errors occur.

Create PR

Or push these changes by commenting:

@cursor push 02a19689a6
Preview (02a19689a6)
diff --git a/packages/envd/internal/api/init.go b/packages/envd/internal/api/init.go
--- a/packages/envd/internal/api/init.go
+++ b/packages/envd/internal/api/init.go
@@ -138,10 +138,6 @@
 			setDataStart := time.Now()
 			err = a.SetData(ctx, logger, initRequest)
 			setDataDur := time.Since(setDataStart)
-			logger.Info().
-				Dur("init_lock_wait", lockWait).
-				Dur("set_data", setDataDur).
-				Msg("/init handled")
 			if err != nil {
 				switch {
 				case errors.Is(err, ErrAccessTokenMismatch), errors.Is(err, ErrAccessTokenResetNotAuthorized):
@@ -154,6 +150,10 @@
 
 				return
 			}
+			logger.Info().
+				Dur("init_lock_wait", lockWait).
+				Dur("set_data", setDataDur).
+				Msg("/init handled")
 		} else {
 			logger.Debug().Dur("init_lock_wait", lockWait).Msg("/init skipped (older timestamp)")
 		}

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit b3531d7. Configure here.

Comment thread packages/envd/internal/api/init.go Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The debug log for CA bundle installation is emitted before the error check, which can result in misleading 'done' messages if the operation fails; this log should be moved after the error check to ensure accurate observability. Similarly, the 'setupNFS done' log message is recorded regardless of the function's success and should be relocated after the error handling block so that the log correctly indicates successful completion.

Comment thread packages/envd/internal/api/init.go
Comment thread packages/envd/internal/api/init.go
@ValentaTomas

Copy link
Copy Markdown
Member Author

Superseded by #2700 (move envd into a dedicated network namespace), which addresses the root cause.

@ValentaTomas

Copy link
Copy Markdown
Member Author

Closing — pure observability, unlikely to drive any concrete fix given we already have the data we needed from the Bitfrost incident. Can revive if a future incident needs initLock/SetData stage timing again.

@ValentaTomas ValentaTomas deleted the obs/envd-init-timing branch May 18, 2026 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants