Skip to content

feat(orchestrator): add sandbox duration histogram#3097

Draft
jakubno wants to merge 4 commits into
mainfrom
add-sandbox-duration-histogram
Draft

feat(orchestrator): add sandbox duration histogram#3097
jakubno wants to merge 4 commits into
mainfrom
add-sandbox-duration-histogram

Conversation

@jakubno

@jakubno jakubno commented Jun 26, 2026

Copy link
Copy Markdown
Member

Record total sandbox lifetime (create->kill) as the orchestrator.sandbox.duration histogram in Server.Delete, tagged with kill_reason.

@cursor

cursor Bot commented Jun 26, 2026

Copy link
Copy Markdown

PR Summary

Low Risk
Observability-only changes on sandbox teardown; no auth, data, or routing behavior changes. Minor edge case if GetStopReason runs without a Firecracker process when no reason was set.

Overview
This adds a StopReason on sandbox metadata (first write wins) and sets it on explicit teardown paths—kill, pause, checkpoint, and orchestrator kill after a failed checkpoint upload—so unexpected Firecracker exits can be classified as crashed or unknown when nothing was recorded.

When the sandbox lifecycle goroutine finishes waiting on the VM, it emits orchestrator.sandbox.execution.duration (milliseconds from GetStartedAt() until cleanup) with a stop_reason attribute, separate from the existing create-duration and kill counters.

Reviewed by Cursor Bugbot for commit 30d5899. Bugbot is set up for automated code reviews on this repo. Configure here.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

If the sandbox start time is zero, calculating the duration using time.Since will result in an extremely large value that will corrupt the histogram metrics with a massive outlier. A check should be added to ensure the start time is not zero before recording the metric.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated
@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
3100 1 3099 6
View the top 3 failed test(s) by shortest run time
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_1_lists_root_directory
Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_1_lists_root_directory
=== PAUSE TestListDir/depth_1_lists_root_directory
=== CONT  TestListDir/depth_1_lists_root_directory
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_1_lists_root_directory
--- FAIL: TestListDir/depth_1_lists_root_directory (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_3_lists_all_directories_and_files
Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_3_lists_all_directories_and_files
=== PAUSE TestListDir/depth_3_lists_all_directories_and_files
=== CONT  TestListDir/depth_3_lists_all_directories_and_files
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_3_lists_all_directories_and_files
--- FAIL: TestListDir/depth_3_lists_all_directories_and_files (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_0_lists_only_root_directory
Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_0_lists_only_root_directory
=== PAUSE TestListDir/depth_0_lists_only_root_directory
=== CONT  TestListDir/depth_0_lists_only_root_directory
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_0_lists_only_root_directory
--- FAIL: TestListDir/depth_0_lists_only_root_directory (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== PAUSE TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== CONT  TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
--- FAIL: TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory) (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir
Stack Traces | 1.29s run time
=== RUN   TestListDir
=== PAUSE TestListDir
=== CONT  TestListDir
--- FAIL: TestListDir (1.29s)
Executing command tar in sandbox idj5oy8qh7obmzu7y3luv (user: root)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildCOPY
Stack Traces | 164s run time
=== RUN   TestTemplateBuildCOPY
=== PAUSE TestTemplateBuildCOPY
=== CONT  TestTemplateBuildCOPY
    build_template_test.go:133: test-ubuntu-copy: [info] Building template aceiyvbs7nhue53upn5i/34e4efba-8598-4403-a8bc-fb71cb083068
    build_template_test.go:133: test-ubuntu-copy: [info] [base] FROM ubuntu:24.04 [a9b3ad91e89daabce8685d67ae12aacb4a128e5aea703dc57457c0ef17079b25]
    build_template_test.go:133: test-ubuntu-copy: [info] Base Docker image size: 30 MB
    build_template_test.go:133: test-ubuntu-copy: [info] Creating file system and pulling Docker image
    build_template_test.go:133: test-ubuntu-copy: [info] Uncompressing layer sha256:cb259a83ac3dd9fea0b394df41df2b298adf0df938fef5999475af18a751c257 30 MB
    build_template_test.go:133: test-ubuntu-copy: [info] Uncompressing layer sha256:7870c5cc1a974e3562cec3b145f85d4fac1a8740c18944c60027c8f367f91705 13 MB
    build_template_test.go:133: test-ubuntu-copy: [info] Uncompressing layer sha256:8c4b1b28875140ed3abacaf16ad0d696f6bef912f52d2148f261a23e3349465b 168 B
    build_template_test.go:133: test-ubuntu-copy: [info] Layers extracted
    build_template_test.go:133: test-ubuntu-copy: [info] Root filesystem structure: bin, boot, dev, etc, home, lib, lib64, media, mnt, opt, proc, root, run, sbin, srv, sys, tmp, usr, var
    build_template_test.go:133: test-ubuntu-copy: [info] Provisioning sandbox template
    build_template_test.go:133: test-ubuntu-copy: [info] Provisioning was successful, cleaning up
    build_template_test.go:133: test-ubuntu-copy: [info] Sandbox template provisioned
    build_template_test.go:133: test-ubuntu-copy: [info] [base] DEFAULT USER user [f9aaa150221ef19e492ba626b8d9200ab9434ed5f63d02341f9e3be29c4b8760]
    build_template_test.go:133: test-ubuntu-copy: [info] [builder 1/2] COPY . /app/ [2449098e8b0a6a85d9aaa0edf0136f8bc69e9d618dac9611e42bba7c698fd629]
    build_template_test.go:133: test-ubuntu-copy: [info] [builder 2/2] RUN cat /app/hello.txt | grep 'Hello from COPY!' [2f22b685841af46c090bf872491df49e98da0615c4d2c99657645a92a5b4ee67]
    build_template_test.go:133: test-ubuntu-copy: [info] [builder 2/2] [stdout]: Hello from COPY!
    build_template_test.go:133: test-ubuntu-copy: [info] [finalize] Finalizing template build [d8abedde856dad9179d9e99f9778f17106da49a5286a30633b898c2e893b0482]
    build_template_test.go:133: test-ubuntu-copy: [error] Build failed: build was cancelled
    build_template_test.go:1156: Build failed: {<nil> build was cancelled <nil>}
--- FAIL: TestTemplateBuildCOPY (164.32s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - straightforward additive telemetry following the established histogram pattern.

Extended reasoning...

Overview

Adds a new orchestrator.sandbox.duration Int64 histogram that records total sandbox lifetime (from sbx.GetStartedAt() to Server.Delete) tagged with kill_reason. Three small changes: declare the histogram type/desc/unit in packages/shared/pkg/telemetry/meters.go, register it on the Server struct in main.go, and .Record() it in Delete next to the existing recordSandboxKill call.

Security risks

None. Purely additive observability — no auth, crypto, network, or data-handling code is touched. The kill_reason attribute is already bounded and reused from recordSandboxKill, so no new cardinality risk.

Level of scrutiny

Low. This mirrors the existing OrchestratorSandboxCreateDurationName pattern exactly (declaration, registration, record-with-attributes). The recording site is colocated with sibling kill-time telemetry so it stays in sync, and time.Since(sbx.GetStartedAt()).Milliseconds() matches the execution_time already computed in getSandboxExecutionData.

Other factors

The bug hunting system found nothing. The histogram is registered with the same GetHistogram helper used by every other ms-unit histogram in the file. Unit ("ms") and description map entries are present and consistent.

Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated
// the lifecycle observer records it as paused rather than crashed. The
// execution-duration sample is emitted there, not here.
sbx.SetEndReason(endReasonPause)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed pause labeled as pause

Medium Severity

SetEndReason(endReasonPause) runs before snapshotAndCacheSandbox. If snapshotting fails, the handler still returns an error but the deferred stopSandboxAsync tears the sandbox down, and setupSandboxLifecycle records orchestrator.sandbox.execution.duration with end_reason=pause even though the pause RPC failed and no paused event was published.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 0177b3c. Configure here.

@jakubno jakubno force-pushed the add-sandbox-duration-histogram branch from 0177b3c to bc04752 Compare June 26, 2026 10:09
Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc04752183

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// old sandbox as a checkpoint hand-off so its impending stop is recorded as
// checkpointing rather than a crash. On checkpoint failure we return before
// this point, so the old sandbox's abnormal stop is recorded as a crash.
sbx.SetStopReason(sandbox.StopReasonCheckpointing)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Mark checkpoint handoffs before snapshot stops them

When checkpointing succeeds, snapshotAndCacheSandbox has already called sbx.Pause, and Sandbox.Pause passes s.Close into the rootfs export; that close stops the old VM and can let setupSandboxLifecycle record the duration before this line executes. Because GetStopReason defaults to crashed until this assignment, successful checkpoint handoffs can be emitted as stop_reason=crashed rather than checkpointing; set the reason before the snapshot path can close the sandbox, or record it from the close path.

Useful? React with 👍 / 👎.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 4bf246a. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/sandbox.go
The sync checkpoint upload-failure path tore down the resumed sandbox via
stopSandboxAsync without calling SetStopReason, so its lifecycle recorded
stop_reason=crashed and logged a bogus crash. Mark it killed since this is
an orchestrator-initiated teardown.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30d5899948

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

sbxlogger.I(sbx).Error(ctx, "failed to wait for sandbox, cleaning up", zap.Error(waitErr))
}

s.recordExecutionDuration(ctx, sbx, sbx.GetStopReason(ctx))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Record execution duration at pause time

When a sandbox is paused or checkpointed, sbx.Pause has already paused the VM, but this lifecycle path does not sample the new histogram until after stopSandboxAsync runs and sbx.Wait returns. In the checkpoint flow the stop is deferred before snapshotting and, with synchronous checkpoint uploads, the handler can also wait in res.upload.Run before that defer fires, so stop_reason=checkpointing samples include snapshot/resume/upload time after the old VM stopped executing. Record the duration when the VM is paused, or pass that end timestamp into the lifecycle recorder, so the new execution-duration metric is not inflated for pause/checkpoint paths.

Useful? React with 👍 / 👎.

@jakubno jakubno marked this pull request as draft June 30, 2026 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant