obs(envd): log initLock wait + per-stage SetData durations#2691
obs(envd): log initLock wait + per-stage SetData durations#2691ValentaTomas wants to merge 2 commits into
Conversation
Today an /init failure surfaces as \"envd silent\" on the orchestrator without any envd-side signal about which stage of SetData was holding initLock. Each invocation now emits one info log with init_lock_wait and set_data durations, plus debug-level per-stage durations for the CA-bundle install and the NFS setup so the next investigation can tell \"envd was queued behind a slow predecessor\" from \"envd's NFS unmount took 9 seconds\" without grepping kernel traces.
PR SummaryLow Risk Overview It also bumps the Reviewed by Cursor Bugbot for commit 03586e6. Bugbot is set up for automated code reviews on this repo. Configure here. |
❌ 10 Tests Failed:
View the full list of 16 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Info log emitted before error check misleads on failures
- Moved the '/init handled' info log to after the error check so it only fires on successful SetData calls, preventing misleading success logs when auth or other errors occur.
Or push these changes by commenting:
@cursor push 02a19689a6
Preview (02a19689a6)
diff --git a/packages/envd/internal/api/init.go b/packages/envd/internal/api/init.go
--- a/packages/envd/internal/api/init.go
+++ b/packages/envd/internal/api/init.go
@@ -138,10 +138,6 @@
setDataStart := time.Now()
err = a.SetData(ctx, logger, initRequest)
setDataDur := time.Since(setDataStart)
- logger.Info().
- Dur("init_lock_wait", lockWait).
- Dur("set_data", setDataDur).
- Msg("/init handled")
if err != nil {
switch {
case errors.Is(err, ErrAccessTokenMismatch), errors.Is(err, ErrAccessTokenResetNotAuthorized):
@@ -154,6 +150,10 @@
return
}
+ logger.Info().
+ Dur("init_lock_wait", lockWait).
+ Dur("set_data", setDataDur).
+ Msg("/init handled")
} else {
logger.Debug().Dur("init_lock_wait", lockWait).Msg("/init skipped (older timestamp)")
}You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit b3531d7. Configure here.
There was a problem hiding this comment.
Code Review
The debug log for CA bundle installation is emitted before the error check, which can result in misleading 'done' messages if the operation fails; this log should be moved after the error check to ensure accurate observability. Similarly, the 'setupNFS done' log message is recorded regardless of the function's success and should be relocated after the error handling block so that the log correctly indicates successful completion.
|
Superseded by #2700 (move envd into a dedicated network namespace), which addresses the root cause. |
|
Closing — pure observability, unlikely to drive any concrete fix given we already have the data we needed from the Bitfrost incident. Can revive if a future incident needs initLock/SetData stage timing again. |


Log
init_lock_waitandset_datadurations in/initplus per-stage CA/NFS timings.