Skip to content

feat(orchestrator): classify envd-init by exit type#3139

Open
arkamar wants to merge 4 commits into
mainfrom
feat/envd-init-failure-type
Open

feat(orchestrator): classify envd-init by exit type#3139
arkamar wants to merge 4 commits into
mainfrom
feat/envd-init-failure-type

Conversation

@arkamar

@arkamar arkamar commented Jun 29, 2026

Copy link
Copy Markdown
Member

Add an exit_type attribute (success/timeout/canceled/transient/other) to the
orchestrator.sandbox.envd.init.calls and orchestrator.sandbox.envd.init.duration
meters so init outcomes can be told apart instead of only knowing success vs
failure. The success bool is kept for backward compatibility until consumers
move to exit_type.

WaitForEnvd signals both its timeout and a Firecracker process exit via context
cancellation, so dedicated ErrWaitForEnvdTimeout and ErrFcProcessExited causes
are used to distinguish a real timeout and a Firecracker crash (OOM-kill, panic,
segfault) from a generic cancellation. Firecracker exits are classified as other
rather than canceled.

Intermediate retried attempts that preceded an eventual success are tagged as
transient.

arkamar added 2 commits June 29, 2026 13:58
Add an exit_type attribute (success/timeout/canceled/other) to the
orchestrator.sandbox.envd.init.calls and orchestrator.sandbox.envd.init.duration
meters so init outcomes can be told apart instead of only knowing success vs
failure. The success bool is kept for backward compatibility until consumers
move to exit_type.

WaitForEnvd signals its timeout via context cancellation, so a dedicated
ErrWaitForEnvdTimeout cause is used to distinguish a real timeout from a
generic cancellation. Intermediate retried attempts that preceded an
eventual success are tagged as other.
@cursor

cursor Bot commented Jun 29, 2026

Copy link
Copy Markdown

PR Summary

Low Risk
Observability-only changes on the sandbox startup path; init behavior is unchanged aside from how cancel causes are labeled for metrics.

Overview
Envd init and WaitForEnvd metrics now carry an exit_type label (success, timeout, canceled, other, and transient for retried attempts before a final success) alongside the existing success bool for backward compatibility.

WaitForEnvd cancels with dedicated sentinels ErrWaitForEnvdTimeout and ErrFcProcessExited so timeouts and Firecracker exits are not lumped in with generic cancellation; classifyEnvdInitExit maps those (including wrapped context.Canceled + cause) into the new labels on orchestrator.sandbox.envd.init.calls and orchestrator.sandbox.envd.init.duration. Unit tests cover the classifier.

Reviewed by Cursor Bugbot for commit d9f8d64. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 35.29412% with 22 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
packages/orchestrator/pkg/sandbox/envd.go 40.00% 18 Missing ⚠️
packages/orchestrator/pkg/sandbox/sandbox.go 0.00% 4 Missing ⚠️

📢 Thoughts on this report? Let us know!

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Appending directly to the shared attributes slice inside callAttributes can lead to an overwrite of metric attributes if the capacity of the attributes slice is larger than its length. To prevent this correctness issue, the base slice should be copied before appending the new attributes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/orchestrator/pkg/sandbox/envd.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bdcd0f624c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread packages/orchestrator/pkg/sandbox/envd.go
Comment thread packages/orchestrator/pkg/sandbox/sandbox.go
WaitForEnvd cancels its context when the Firecracker process exits, but
WithCancelCause forces ctx.Err() to context.Canceled, so a real FC crash
(OOM-kill, panic, segfault) was misclassified as exit_type=canceled.
Introduce an ErrFcProcessExited cancel cause and classify it as other so
FC deaths are no longer conflated with caller-initiated cancellations.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Want fixes drafted automatically? Bugbot Autofix can create code changes for findings. A team admin can enable Autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit de17270. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/envd.go
Retried attempts that preceded a successful init were tagged exit_type=other,
overloading the residual terminal-failure bucket. Introduce a dedicated
transient value so a successful init episode reports its count-1 retries
distinctly, keeping other for unclassified terminal failures only.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant