feat(orchestrator): classify envd-init by exit type #3139
Add an exit_type attribute (success/timeout/canceled/other) to the orchestrator.sandbox.envd.init.calls and orchestrator.sandbox.envd.init.duration meters so init outcomes can be told apart instead of only knowing success vs failure. The success bool is kept for backward compatibility until consumers move to exit_type. WaitForEnvd signals its timeout via context cancellation, so a dedicated ErrWaitForEnvdTimeout cause is used to distinguish a real timeout from a generic cancellation. Intermediate retried attempts that preceded an eventual success are tagged as other.
PR Summary: Low Risk. Overview: WaitForEnvd cancels with dedicated sentinels. Reviewed by Cursor Bugbot for commit d9f8d64.
Code Review
Appending directly to the shared attributes slice inside callAttributes can lead to an overwrite of metric attributes if the capacity of the attributes slice is larger than its length. To prevent this correctness issue, the base slice should be copied before appending the new attributes.
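The aliasing hazard described in this review can be reproduced in isolation. The following is a minimal sketch, not the PR's code: `attr`, `unsafeCallAttributes`, and `safeCallAttributes` are hypothetical stand-ins for the real attribute type and the `callAttributes` helper.

```go
package main

import "fmt"

// attr is a hypothetical stand-in for a metric attribute (key/value pair).
type attr struct {
	Key, Value string
}

// unsafeCallAttributes appends directly to base. If cap(base) > len(base),
// the new element is written into base's backing array, so successive calls
// reuse (and overwrite) the same slot.
func unsafeCallAttributes(base []attr, exitType string) []attr {
	return append(base, attr{Key: "exit_type", Value: exitType})
}

// safeCallAttributes copies base first, so every call gets its own backing array.
func safeCallAttributes(base []attr, exitType string) []attr {
	out := make([]attr, len(base), len(base)+1)
	copy(out, base)
	return append(out, attr{Key: "exit_type", Value: exitType})
}

func main() {
	// Length 1, capacity 4: the spare capacity is what triggers the aliasing.
	base := make([]attr, 1, 4)
	base[0] = attr{Key: "sandbox", Value: "demo"}

	a := unsafeCallAttributes(base, "success")
	b := unsafeCallAttributes(base, "timeout")
	fmt.Println(a[1].Value, b[1].Value) // timeout timeout

	c := safeCallAttributes(base, "success")
	d := safeCallAttributes(base, "timeout")
	fmt.Println(c[1].Value, d[1].Value) // success timeout
}
```

The overwrite only appears when the base slice has spare capacity, which is why the bug can stay hidden until the slice is built differently.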
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bdcd0f624c
WaitForEnvd cancels its context when the Firecracker process exits, but WithCancelCause forces ctx.Err() to context.Canceled, so a real FC crash (OOM-kill, panic, segfault) was misclassified as exit_type=canceled. Introduce an ErrFcProcessExited cancel cause and classify it as other so FC deaths are no longer conflated with caller-initiated cancellations.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit de17270.
Retried attempts that preceded a successful init were tagged exit_type=other, overloading the residual terminal-failure bucket. Introduce a dedicated transient value so a successful init episode reports its count-1 retries distinctly, keeping other for unclassified terminal failures only.
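As a rough illustration of the tagging described here, a successful episode with `count` attempts would report `count-1` attempts as `transient` followed by one `success`. The helper name `successfulEpisodeExitTypes` is hypothetical. The sketch covers only the successful case, since the commit message does not say how retries before a terminal failure are tagged.

```go
package main

import "fmt"

// successfulEpisodeExitTypes returns the exit_type value for each attempt of
// an init episode that eventually succeeded. count is the total number of
// attempts, including the final successful one.
func successfulEpisodeExitTypes(count int) []string {
	if count <= 0 {
		return nil
	}
	types := make([]string, 0, count)
	for i := 0; i < count-1; i++ {
		// Every attempt before the last one was retried.
		types = append(types, "transient")
	}
	return append(types, "success")
}

func main() {
	fmt.Println(successfulEpisodeExitTypes(1)) // [success]
	fmt.Println(successfulEpisodeExitTypes(3)) // [transient transient success]
}
```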

Add an exit_type attribute (success/timeout/canceled/transient/other) to the
orchestrator.sandbox.envd.init.calls and orchestrator.sandbox.envd.init.duration
meters so init outcomes can be told apart instead of only knowing success vs
failure. The success bool is kept for backward compatibility until consumers
move to exit_type.
WaitForEnvd signals both its timeout and a Firecracker process exit via context
cancellation, so dedicated ErrWaitForEnvdTimeout and ErrFcProcessExited causes
are used to distinguish a real timeout and a Firecracker crash (OOM-kill, panic,
segfault) from a generic cancellation. Firecracker exits are classified as other
rather than canceled.
Intermediate retried attempts that preceded an eventual success are tagged as
transient.