feat(orchestrator): add sandbox duration histogram by jakubno · Pull Request #3097 · e2b-dev/infra

jakubno · 2026-06-26T06:40:43Z

Record total sandbox lifetime (create->kill) as the orchestrator.sandbox.duration histogram in Server.Delete, tagged with kill_reason.

cursor · 2026-06-26T06:40:50Z

PR Summary

Low Risk
Observability-only changes on sandbox teardown; no auth, data, or routing behavior changes. Minor edge case if GetStopReason runs without a Firecracker process when no reason was set.

Overview
This adds a StopReason on sandbox metadata (first write wins) and sets it on explicit teardown paths—kill, pause, checkpoint, and orchestrator kill after a failed checkpoint upload—so unexpected Firecracker exits can be classified as crashed or unknown when nothing was recorded.

When the sandbox lifecycle goroutine finishes waiting on the VM, it emits orchestrator.sandbox.execution.duration (milliseconds from GetStartedAt() until cleanup) with a stop_reason attribute, separate from the existing create-duration and kill counters.

^{Reviewed by Cursor Bugbot for commit 30d5899. Bugbot is set up for automated code reviews on this repo. Configure here.}

gemini-code-assist

Code Review

If the sandbox start time is zero, calculating the duration using time.Since will result in an extremely large value that will corrupt the histogram metrics with a massive outlier. A check should be added to ensure the start time is not zero before recording the metric.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

codecov · 2026-06-26T06:41:54Z

❌ 1 Tests Failed:

Tests completed	Failed	Passed	Skipped
3100	1	3099	6

View the top 3 failed test(s) by shortest run time

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_1_lists_root_directory

Stack Traces | 0.01s run time

=== RUN   TestListDir/depth_1_lists_root_directory
=== PAUSE TestListDir/depth_1_lists_root_directory
=== CONT  TestListDir/depth_1_lists_root_directory
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_1_lists_root_directory
--- FAIL: TestListDir/depth_1_lists_root_directory (0.01s)

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_3_lists_all_directories_and_files

Stack Traces | 0.01s run time

=== RUN   TestListDir/depth_3_lists_all_directories_and_files
=== PAUSE TestListDir/depth_3_lists_all_directories_and_files
=== CONT  TestListDir/depth_3_lists_all_directories_and_files
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_3_lists_all_directories_and_files
--- FAIL: TestListDir/depth_3_lists_all_directories_and_files (0.01s)

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_0_lists_only_root_directory

Stack Traces | 0.02s run time

=== RUN   TestListDir/depth_0_lists_only_root_directory
=== PAUSE TestListDir/depth_0_lists_only_root_directory
=== CONT  TestListDir/depth_0_lists_only_root_directory
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_0_lists_only_root_directory
--- FAIL: TestListDir/depth_0_lists_only_root_directory (0.02s)

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)

Stack Traces | 0.02s run time

=== RUN   TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== PAUSE TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== CONT  TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
--- FAIL: TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory) (0.02s)

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir

Stack Traces | 1.29s run time

=== RUN   TestListDir
=== PAUSE TestListDir
=== CONT  TestListDir
--- FAIL: TestListDir (1.29s)
Executing command tar in sandbox idj5oy8qh7obmzu7y3luv (user: root)

github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildCOPY

Stack Traces | 164s run time

=== RUN   TestTemplateBuildCOPY
=== PAUSE TestTemplateBuildCOPY
=== CONT  TestTemplateBuildCOPY
    build_template_test.go:133: test-ubuntu-copy: [info] Building template aceiyvbs7nhue53upn5i/34e4efba-8598-4403-a8bc-fb71cb083068
    build_template_test.go:133: test-ubuntu-copy: [info] [base] FROM ubuntu:24.04 [a9b3ad91e89daabce8685d67ae12aacb4a128e5aea703dc57457c0ef17079b25]
    build_template_test.go:133: test-ubuntu-copy: [info] Base Docker image size: 30 MB
    build_template_test.go:133: test-ubuntu-copy: [info] Creating file system and pulling Docker image
    build_template_test.go:133: test-ubuntu-copy: [info] Uncompressing layer sha256:cb259a83ac3dd9fea0b394df41df2b298adf0df938fef5999475af18a751c257 30 MB
    build_template_test.go:133: test-ubuntu-copy: [info] Uncompressing layer sha256:7870c5cc1a974e3562cec3b145f85d4fac1a8740c18944c60027c8f367f91705 13 MB
    build_template_test.go:133: test-ubuntu-copy: [info] Uncompressing layer sha256:8c4b1b28875140ed3abacaf16ad0d696f6bef912f52d2148f261a23e3349465b 168 B
    build_template_test.go:133: test-ubuntu-copy: [info] Layers extracted
    build_template_test.go:133: test-ubuntu-copy: [info] Root filesystem structure: bin, boot, dev, etc, home, lib, lib64, media, mnt, opt, proc, root, run, sbin, srv, sys, tmp, usr, var
    build_template_test.go:133: test-ubuntu-copy: [info] Provisioning sandbox template
    build_template_test.go:133: test-ubuntu-copy: [info] Provisioning was successful, cleaning up
    build_template_test.go:133: test-ubuntu-copy: [info] Sandbox template provisioned
    build_template_test.go:133: test-ubuntu-copy: [info] [base] DEFAULT USER user [f9aaa150221ef19e492ba626b8d9200ab9434ed5f63d02341f9e3be29c4b8760]
    build_template_test.go:133: test-ubuntu-copy: [info] [builder 1/2] COPY . /app/ [2449098e8b0a6a85d9aaa0edf0136f8bc69e9d618dac9611e42bba7c698fd629]
    build_template_test.go:133: test-ubuntu-copy: [info] [builder 2/2] RUN cat /app/hello.txt | grep 'Hello from COPY!' [2f22b685841af46c090bf872491df49e98da0615c4d2c99657645a92a5b4ee67]
    build_template_test.go:133: test-ubuntu-copy: [info] [builder 2/2] [stdout]: Hello from COPY!
    build_template_test.go:133: test-ubuntu-copy: [info] [finalize] Finalizing template build [d8abedde856dad9179d9e99f9778f17106da49a5286a30633b898c2e893b0482]
    build_template_test.go:133: test-ubuntu-copy: [error] Build failed: build was cancelled
    build_template_test.go:1156: Build failed: {<nil> build was cancelled <nil>}
--- FAIL: TestTemplateBuildCOPY (164.32s)

To view more test analytics, go to the Test Analytics Dashboard
_{📋 Got 3 mins? Take this short survey to help us improve Test Analytics.}

claude

LGTM - straightforward additive telemetry following the established histogram pattern.

Extended reasoning...

Overview

Adds a new orchestrator.sandbox.duration Int64 histogram that records total sandbox lifetime (from sbx.GetStartedAt() to Server.Delete) tagged with kill_reason. Three small changes: declare the histogram type/desc/unit in packages/shared/pkg/telemetry/meters.go, register it on the Server struct in main.go, and .Record() it in Delete next to the existing recordSandboxKill call.

Security risks

None. Purely additive observability — no auth, crypto, network, or data-handling code is touched. The kill_reason attribute is already bounded and reused from recordSandboxKill, so no new cardinality risk.

Level of scrutiny

Low. This mirrors the existing OrchestratorSandboxCreateDurationName pattern exactly (declaration, registration, record-with-attributes). The recording site is colocated with sibling kill-time telemetry so it stays in sync, and time.Since(sbx.GetStartedAt()).Milliseconds() matches the execution_time already computed in getSandboxExecutionData.

Other factors

The bug hunting system found nothing. The histogram is registered with the same GetHistogram helper used by every other ms-unit histogram in the file. Unit ("ms") and description map entries are present and consistent.

cursor · 2026-06-26T08:55:42Z

+	// the lifecycle observer records it as paused rather than crashed. The
+	// execution-duration sample is emitted there, not here.
+	sbx.SetEndReason(endReasonPause)
+


Failed pause labeled as pause

Medium Severity

SetEndReason(endReasonPause) runs before snapshotAndCacheSandbox. If snapshotting fails, the handler still returns an error but the deferred stopSandboxAsync tears the sandbox down, and setupSandboxLifecycle records orchestrator.sandbox.execution.duration with end_reason=pause even though the pause RPC failed and no paused event was published.

^{Reviewed by Cursor Bugbot for commit 0177b3c. Configure here.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc04752183

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-26T10:16:27Z

+	// old sandbox as a checkpoint hand-off so its impending stop is recorded as
+	// checkpointing rather than a crash. On checkpoint failure we return before
+	// this point, so the old sandbox's abnormal stop is recorded as a crash.
+	sbx.SetStopReason(sandbox.StopReasonCheckpointing)


Mark checkpoint handoffs before snapshot stops them

When checkpointing succeeds, snapshotAndCacheSandbox has already called sbx.Pause, and Sandbox.Pause passes s.Close into the rootfs export; that close stops the old VM and can let setupSandboxLifecycle record the duration before this line executes. Because GetStopReason defaults to crashed until this assignment, successful checkpoint handoffs can be emitted as stop_reason=crashed rather than checkpointing; set the reason before the snapshot path can close the sandbox, or record it from the close path.

Useful? React with 👍 / 👎.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 4bf246a. Configure here.}

The sync checkpoint upload-failure path tore down the resumed sandbox via stopSandboxAsync without calling SetStopReason, so its lifecycle recorded stop_reason=crashed and logged a bogus crash. Mark it killed since this is an orchestrator-initiated teardown.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30d5899948

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-30T10:10:29Z

 			sbxlogger.I(sbx).Error(ctx, "failed to wait for sandbox, cleaning up", zap.Error(waitErr))
 		}

+		s.recordExecutionDuration(ctx, sbx, sbx.GetStopReason(ctx))


Record execution duration at pause time

When a sandbox is paused or checkpointed, sbx.Pause has already paused the VM, but this lifecycle path does not sample the new histogram until after stopSandboxAsync runs and sbx.Wait returns. In the checkpoint flow the stop is deferred before snapshotting and, with synchronous checkpoint uploads, the handler can also wait in res.upload.Run before that defer fires, so stop_reason=checkpointing samples include snapshot/resume/upload time after the old VM stopped executing. Record the duration when the VM is paused, or pass that end timestamp into the lifecycle recorder, so the new execution-duration metric is not inflated for pause/checkpoint paths.

Useful? React with 👍 / 👎.

jakubno requested review from ValentaTomas and dobrac as code owners June 26, 2026 06:40

cla-bot Bot added the cla-signed label Jun 26, 2026

gemini-code-assist Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated

claude Bot reviewed Jun 26, 2026

View reviewed changes

cursor Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated

cursor Bot reviewed Jun 26, 2026

View reviewed changes

chore(orch): add sandbox duration

bc04752

jakubno force-pushed the add-sandbox-duration-histogram branch from 0177b3c to bc04752 Compare June 26, 2026 10:09

cursor Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated

chatgpt-codex-connector Bot reviewed Jun 26, 2026

View reviewed changes

jakubno added 2 commits June 26, 2026 14:20

fix: mark checkpointing at correct moment

18916e7

chore: handle crashed reason

4bf246a

cursor Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/sandbox/sandbox.go

chatgpt-codex-connector Bot reviewed Jun 30, 2026

View reviewed changes

jakubno marked this pull request as draft June 30, 2026 10:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(orchestrator): add sandbox duration histogram#3097

feat(orchestrator): add sandbox duration histogram#3097
jakubno wants to merge 4 commits into
mainfrom
add-sandbox-duration-histogram

jakubno commented Jun 26, 2026

Uh oh!

cursor Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

codecov Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

claude Bot left a comment

Uh oh!

Uh oh!

cursor Bot Jun 26, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 26, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jakubno commented Jun 26, 2026

Uh oh!

cursor Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

codecov Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

❌ 1 Tests Failed:

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Overview

Security risks

Level of scrutiny

Other factors

Uh oh!

Uh oh!

cursor Bot Jun 26, 2026

Choose a reason for hiding this comment

Failed pause labeled as pause

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented Jun 26, 2026 •

edited

Loading

codecov Bot commented Jun 26, 2026 •

edited

Loading