feat(orchestrator): add sandbox duration histogram#3097
Conversation
PR SummaryLow Risk Overview When the sandbox lifecycle goroutine finishes waiting on the VM, it emits Reviewed by Cursor Bugbot for commit 30d5899. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Code Review
If the sandbox start time is zero, calculating the duration using time.Since will result in an extremely large value that will corrupt the histogram metrics with a massive outlier. A check should be added to ensure the start time is not zero before recording the metric.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
❌ 1 Tests Failed:
View the top 3 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
LGTM - straightforward additive telemetry following the established histogram pattern.
Extended reasoning...
Overview
Adds a new orchestrator.sandbox.duration Int64 histogram that records total sandbox lifetime (from sbx.GetStartedAt() to Server.Delete) tagged with kill_reason. Three small changes: declare the histogram type/desc/unit in packages/shared/pkg/telemetry/meters.go, register it on the Server struct in main.go, and .Record() it in Delete next to the existing recordSandboxKill call.
Security risks
None. Purely additive observability — no auth, crypto, network, or data-handling code is touched. The kill_reason attribute is already bounded and reused from recordSandboxKill, so no new cardinality risk.
Level of scrutiny
Low. This mirrors the existing OrchestratorSandboxCreateDurationName pattern exactly (declaration, registration, record-with-attributes). The recording site is colocated with sibling kill-time telemetry so it stays in sync, and time.Since(sbx.GetStartedAt()).Milliseconds() matches the execution_time already computed in getSandboxExecutionData.
Other factors
The bug hunting system found nothing. The histogram is registered with the same GetHistogram helper used by every other ms-unit histogram in the file. Unit ("ms") and description map entries are present and consistent.
| // the lifecycle observer records it as paused rather than crashed. The | ||
| // execution-duration sample is emitted there, not here. | ||
| sbx.SetEndReason(endReasonPause) | ||
|
|
There was a problem hiding this comment.
Failed pause labeled as pause
Medium Severity
SetEndReason(endReasonPause) runs before snapshotAndCacheSandbox. If snapshotting fails, the handler still returns an error but the deferred stopSandboxAsync tears the sandbox down, and setupSandboxLifecycle records orchestrator.sandbox.execution.duration with end_reason=pause even though the pause RPC failed and no paused event was published.
Reviewed by Cursor Bugbot for commit 0177b3c. Configure here.
0177b3c to
bc04752
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc04752183
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // old sandbox as a checkpoint hand-off so its impending stop is recorded as | ||
| // checkpointing rather than a crash. On checkpoint failure we return before | ||
| // this point, so the old sandbox's abnormal stop is recorded as a crash. | ||
| sbx.SetStopReason(sandbox.StopReasonCheckpointing) |
There was a problem hiding this comment.
Mark checkpoint handoffs before snapshot stops them
When checkpointing succeeds, snapshotAndCacheSandbox has already called sbx.Pause, and Sandbox.Pause passes s.Close into the rootfs export; that close stops the old VM and can let setupSandboxLifecycle record the duration before this line executes. Because GetStopReason defaults to crashed until this assignment, successful checkpoint handoffs can be emitted as stop_reason=crashed rather than checkpointing; set the reason before the snapshot path can close the sandbox, or record it from the close path.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 4bf246a. Configure here.
The sync checkpoint upload-failure path tore down the resumed sandbox via stopSandboxAsync without calling SetStopReason, so its lifecycle recorded stop_reason=crashed and logged a bogus crash. Mark it killed since this is an orchestrator-initiated teardown.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 30d5899948
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| sbxlogger.I(sbx).Error(ctx, "failed to wait for sandbox, cleaning up", zap.Error(waitErr)) | ||
| } | ||
|
|
||
| s.recordExecutionDuration(ctx, sbx, sbx.GetStopReason(ctx)) |
There was a problem hiding this comment.
Record execution duration at pause time
When a sandbox is paused or checkpointed, sbx.Pause has already paused the VM, but this lifecycle path does not sample the new histogram until after stopSandboxAsync runs and sbx.Wait returns. In the checkpoint flow the stop is deferred before snapshotting and, with synchronous checkpoint uploads, the handler can also wait in res.upload.Run before that defer fires, so stop_reason=checkpointing samples include snapshot/resume/upload time after the old VM stopped executing. Record the duration when the VM is paused, or pass that end timestamp into the lifecycle recorder, so the new execution-duration metric is not inflated for pause/checkpoint paths.
Useful? React with 👍 / 👎.


Record total sandbox lifetime (create->kill) as the
orchestrator.sandbox.durationhistogram in Server.Delete, tagged with kill_reason.