Skip to content

feat(orch): make resume-build -gdb work on real nodes + add copy-build -gdb#3108

Open
kalyazin wants to merge 5 commits into
mainfrom
kalyazin/gdb-portable-debugging
Open

feat(orch): make resume-build -gdb work on real nodes + add copy-build -gdb#3108
kalyazin wants to merge 5 commits into
mainfrom
kalyazin/gdb-portable-debugging

Conversation

@kalyazin

Copy link
Copy Markdown
Contributor

Why

resume-build -gdb (guest-kernel debugging, added in #3040) didn't actually run on a real orchestrator node: it staged the gdb-enabled Firecracker by renaming the prod binary in place, but on a node the FC-versions directory is a read-only mount (gcsfuse), so it failed with read-only file system. It also located its fc-debug.gdb macro library relative to the source tree, so a binary copied to a node without its source failed with macro library not found. And bridging a snapshot from another environment for debugging meant hand-copying the FC + kernel debug artifacts.

This makes the -gdb path portable, adds a one-command artifact bridge, and trims a debug-artifact fetch path that never fired in practice.

What

  1. fix(orch): make resume-build -gdb portable to read-only mounts and standalone binaries

    • Stage the gdb-enabled Firecracker into a writable temp dir and point the factory's FirecrackerVersionsDir at it before the factory captures the config, instead of overwriting the prod binary in the read-only versions dir. The original dir is preserved only for resolving the published firecracker-debug; the real dir is never written. Removes the old backup/restore (and its interrupted-run recovery).
    • Embed fc-debug.gdb with go:embed so the binary is self-contained (a colocated copy still wins for local iteration).
    • Backstop: verify the binary about to launch actually contains the gdb feature (FIRECRACKER_GDB_SOCKET), so a stale/wrong firecracker-debug fails with a clear message instead of an opaque "gdb socket never bound".
  2. feat(orch): add copy-build -gdb to stage FC + kernel debug artifacts

    • copy-build -gdb reads the build's FC/kernel versions from metadata.json and ensures the runtime + debug artifacts (firecracker, firecracker-debug, vmlinux.bin, vmlinux.debug) exist at the destination, deriving the versions/kernels buckets from the template bucket. CRC-compares like the snapshot copy: skips identical content, otherwise copies — replacing a divergent/stale artifact. Requires gs:// source/dest.
  3. refactor(orch): drop the unused gdb debug-artifact URL auto-fetch

    • The by-version HTTP fetch never fired in practice (the artifacts aren't published at the default URL, and the per-cluster bucket layout doesn't match the URL scheme); artifacts always resolve locally or via copy-build -gdb. Removes the fetch code/env-var and renames resolveOrFetchresolveLocal.
  4. docs(orch): correct the gdb runbook artifact resolution

    • Replaces the inaccurate "auto-fetched by version" claim with how resolution actually works (local, next to the snapshot's FC/kernel; staged by copy-build -gdb; or explicit -gdb-fc/-gdb-symbols).

Testing

  • End-to-end on a dev node against a real snapshot, with the FC-versions directory mounted read-only: Firecracker bound the gdb socket and launched from the temp staging dir (never the read-only mount); DWARF source-level symbols, vCPU state, and page-fault profiling all worked.
  • Verified the backstop both ways: a non-gdb binary fails with the clear "is not gdb-enabled" error; a real gdb binary passes and runs.
  • copy-build -gdb exercised both the skip-identical and the copy/replace paths.
  • go build, go test, golangci-lint clean; each commit builds on its own.

Scope / risk

Touches only the resume-build and copy-build developer side-tools — no production service code path. The read-only-mount override is gated on -gdb and confined to that run (only the FC-versions dir; the kernel dir is untouched).

🤖 Generated with Claude Code

@cla-bot cla-bot Bot added the cla-signed label Jun 26, 2026
@cursor

cursor Bot commented Jun 26, 2026

Copy link
Copy Markdown

PR Summary

Low Risk
Changes are limited to developer CLI tools (resume-build, copy-build) and gdb docs; production orchestrator paths are untouched aside from the -gdb-gated temp FC versions redirect during a manual debug run.

Overview
This PR fixes resume-build -gdb on orchestrator nodes where Firecracker versions live on read-only storage: instead of swapping the prod binary in place, it redirects the factory to a writable temp versions dir, copies the gdb-enabled Firecracker there, and still resolves firecracker-debug from the original mount. Debug macros are embedded in the binary, gdb connect gets a longer remote timeout, and launch fails fast if the staged binary is not gdb-built-including a hbreak tweak in the fault macro. copy-build -gdb reads kernel/FC versions from copied metadata.json and conditionally copies FC/kernel runtime and debug objects between sibling GCS buckets using CRC checks. Artifact resolution is local-only (overrides or paths next to the snapshot); the previous by-version HTTP download and E2B_GDB_ARTIFACTS_URL path is removed, and the gdb runbook is updated to match.

Reviewed by Cursor Bugbot for commit ee3db93. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

❌ 11 Tests Failed:

Tests completed Failed Passed Skipped
3097 11 3086 6
View the top 3 failed test(s) by shortest run time
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildRUN
Stack Traces | 0s run time
=== RUN   TestTemplateBuildRUN
=== PAUSE TestTemplateBuildRUN
=== CONT  TestTemplateBuildRUN
--- FAIL: TestTemplateBuildRUN (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxFilesystemPauseResumeIntegrity
Stack Traces | 0s run time
=== RUN   TestSandboxFilesystemPauseResumeIntegrity
=== PAUSE TestSandboxFilesystemPauseResumeIntegrity
=== CONT  TestSandboxFilesystemPauseResumeIntegrity
--- FAIL: TestSandboxFilesystemPauseResumeIntegrity (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_1_lists_root_directory
Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_1_lists_root_directory
=== PAUSE TestListDir/depth_1_lists_root_directory
=== CONT  TestListDir/depth_1_lists_root_directory
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_1_lists_root_directory
--- FAIL: TestListDir/depth_1_lists_root_directory (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== PAUSE TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== CONT  TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
--- FAIL: TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory) (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_3_lists_all_directories_and_files
Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_3_lists_all_directories_and_files
=== PAUSE TestListDir/depth_3_lists_all_directories_and_files
=== CONT  TestListDir/depth_3_lists_all_directories_and_files
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_3_lists_all_directories_and_files
--- FAIL: TestListDir/depth_3_lists_all_directories_and_files (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_0_lists_only_root_directory
Stack Traces | 0.02s run time
=== RUN   TestListDir/depth_0_lists_only_root_directory
=== PAUSE TestListDir/depth_0_lists_only_root_directory
=== CONT  TestListDir/depth_0_lists_only_root_directory
    filesystem_test.go:96: 
        	Error Trace:	.../tests/envd/filesystem_test.go:96
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_0_lists_only_root_directory
--- FAIL: TestListDir/depth_0_lists_only_root_directory (0.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir
Stack Traces | 0.36s run time
=== RUN   TestListDir
=== PAUSE TestListDir
=== CONT  TestListDir
--- FAIL: TestListDir (0.36s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause
Stack Traces | 17.2s run time
=== RUN   TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause
=== PAUSE TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause
=== CONT  TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{start:{pid:1115}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] completed successfully in sandbox ist4an3c4qutjxi9tuc1c
Executing command bash in sandbox ist4an3c4qutjxi9tuc1c (user: root)
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{start:{pid:1120}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{data:{stdout:"68733113af49d9f3d9c4f7360642109bb8ffd6ffbcaf1bc79b4ad27f6a0eb629\n"}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] completed successfully in sandbox ist4an3c4qutjxi9tuc1c
Executing command bash in sandbox ist4an3c4qutjxi9tuc1c (user: root)
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{start:{pid:1126}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{data:{stdout:"261095424\n"}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] completed successfully in sandbox ist4an3c4qutjxi9tuc1c
Executing command bash in sandbox itp3q0ak53vmy0tzflqok (user: root)
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{start:{pid:1130}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{data:{stdout:"68733113af49d9f3d9c4f7360642109bb8ffd6ffbcaf1bc79b4ad27f6a0eb629\n"}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    filesystem_pause_resume_integrity_test.go:43: Command [bash] completed successfully in sandbox ist4an3c4qutjxi9tuc1c
Executing command bash in sandbox ist4an3c4qutjxi9tuc1c (user: root)
    filesystem_pause_resume_integrity_test.go:43: Command [bash] output: event:{start:{pid:1136}}
    filesystem_pause_resume_integrity_test.go:43: 
        	Error Trace:	.../tests/orchestrator/filesystem_pause_resume_integrity_test.go:114
        	            				.../tests/orchestrator/filesystem_pause_resume_integrity_test.go:140
        	            				.../tests/orchestrator/filesystem_pause_resume_integrity_test.go:43
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox ist4an3c4qutjxi9tuc1c: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause
--- FAIL: TestSandboxFilesystemPauseResumeIntegrity/scattered_write_hash_survives_pause (17.20s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildRUN/Single_RUN_command
Stack Traces | 170s run time
=== RUN   TestTemplateBuildRUN/Single_RUN_command
=== PAUSE TestTemplateBuildRUN/Single_RUN_command
=== CONT  TestTemplateBuildRUN/Single_RUN_command
    build_template_test.go:133: test-ubuntu-run: [info] Building template k2ib11x237u5ra6rlb1w/d8718b72-dfd3-4ae1-97b9-754909c24835
    build_template_test.go:133: test-ubuntu-run: [info] [base] FROM ubuntu:22.04 [f9f564014e009a9561a82bf8c84f9314242971e833fb019936654ecba452f184]
    build_template_test.go:133: test-ubuntu-run: [info] Base Docker image size: 30 MB
    build_template_test.go:133: test-ubuntu-run: [info] Creating file system and pulling Docker image
    build_template_test.go:133: test-ubuntu-run: [info] Uncompressing layer sha256:40d16f30db405106ef8074779bdf41f012465c2a785bbeaa2eab9f2081099b47 30 MB
    build_template_test.go:133: test-ubuntu-run: [info] Uncompressing layer sha256:94bbaa35bd91d49373d6e08d328f63f1829c139a5eb120a51aa4737f2a941836 13 MB
    build_template_test.go:133: test-ubuntu-run: [info] Uncompressing layer sha256:8c4b1b28875140ed3abacaf16ad0d696f6bef912f52d2148f261a23e3349465b 168 B
    build_template_test.go:133: test-ubuntu-run: [info] Layers extracted
    build_template_test.go:133: test-ubuntu-run: [info] Root filesystem structure: bin, boot, dev, etc, home, lib, lib32, lib64, libx32, media, mnt, opt, proc, root, run, sbin, srv, sys, tmp, usr, var
    build_template_test.go:133: test-ubuntu-run: [info] Provisioning sandbox template
    build_template_test.go:133: test-ubuntu-run: [info] Provisioning was successful, cleaning up
    build_template_test.go:133: test-ubuntu-run: [info] Sandbox template provisioned
    build_template_test.go:133: test-ubuntu-run: [info] [base] DEFAULT USER user [49e586c2171254c6bc4a09e84eedac32dbcf113a158c24248129af2f49cbed74]
    build_template_test.go:133: test-ubuntu-run: [info] [builder 1/1] RUN echo 'Hello, World!' [c72b4f813c2a16b0fc1a1c5da7b1365a304cbac516b22dc304a71f70aae48ac0]
    build_template_test.go:133: test-ubuntu-run: [info] [builder 1/1] [stdout]: Hello, World!
    build_template_test.go:133: test-ubuntu-run: [info] [finalize] Finalizing template build [92c524e30533398ebb41ce04c2596130f0cdecc9aa328e28fdb16a1b11f61d62]
    build_template_test.go:133: test-ubuntu-run: [error] Build failed: build was cancelled
    build_template_test.go:166: Build failed: {<nil> build was cancelled <nil>}
--- FAIL: TestTemplateBuildRUN/Single_RUN_command (169.54s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestDeleteTemplateFromAnotherTeamAPIKey
Stack Traces | 170s run time
=== RUN   TestDeleteTemplateFromAnotherTeamAPIKey
=== PAUSE TestDeleteTemplateFromAnotherTeamAPIKey
=== CONT  TestDeleteTemplateFromAnotherTeamAPIKey
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Building template rr9of5awzocodolt7he9/93a0f922-344c-4a87-94a5-2896d17409ea
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] [base] FROM ubuntu:22.04 [f9f564014e009a9561a82bf8c84f9314242971e833fb019936654ecba452f184]
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Base Docker image size: 30 MB
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Creating file system and pulling Docker image
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Uncompressing layer sha256:40d16f30db405106ef8074779bdf41f012465c2a785bbeaa2eab9f2081099b47 30 MB
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Uncompressing layer sha256:94bbaa35bd91d49373d6e08d328f63f1829c139a5eb120a51aa4737f2a941836 13 MB
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Uncompressing layer sha256:8c4b1b28875140ed3abacaf16ad0d696f6bef912f52d2148f261a23e3349465b 168 B
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Layers extracted
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Root filesystem structure: bin, boot, dev, etc, home, lib, lib32, lib64, libx32, media, mnt, opt, proc, root, run, sbin, srv, sys, tmp, usr, var
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Provisioning sandbox template
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Provisioning was successful, cleaning up
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] Sandbox template provisioned
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] [base] DEFAULT USER user [49e586c2171254c6bc4a09e84eedac32dbcf113a158c24248129af2f49cbed74]
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] [builder 1/1] RUN echo 'Hello, World!' [c72b4f813c2a16b0fc1a1c5da7b1365a304cbac516b22dc304a71f70aae48ac0]
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] [builder 1/1] [stdout]: Hello, World!
    build_template_test.go:133: test-to-delete-another-team-api-key: [info] [finalize] Finalizing template build [92c524e30533398ebb41ce04c2596130f0cdecc9aa328e28fdb16a1b11f61d62]
    build_template_test.go:133: test-to-delete-another-team-api-key: [error] Build failed: build was cancelled
    delete_template_test.go:51: Build failed: {<nil> build was cancelled <nil>}
--- FAIL: TestDeleteTemplateFromAnotherTeamAPIKey (170.22s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildCOPY
Stack Traces | 177s run time
=== RUN   TestTemplateBuildCOPY
=== PAUSE TestTemplateBuildCOPY
=== CONT  TestTemplateBuildCOPY
    build_template_test.go:133: test-ubuntu-copy: [info] Building template hqm4klhzkxmtlo0z0oun/bc37cc07-f955-47f9-8ebb-bb148d9dcfe8
    build_template_test.go:133: test-ubuntu-copy: [info] [base] FROM ubuntu:24.04 [a9b3ad91e89daabce8685d67ae12aacb4a128e5aea703dc57457c0ef17079b25]
    build_template_test.go:133: test-ubuntu-copy: [info] Base Docker image size: 30 MB
    build_template_test.go:133: test-ubuntu-copy: [info] Creating file system and pulling Docker image
    build_template_test.go:133: test-ubuntu-copy: [info] Uncompressing layer sha256:cb259a83ac3dd9fea0b394df41df2b298adf0df938fef5999475af18a751c257 30 MB
    build_template_test.go:133: test-ubuntu-copy: [info] Uncompressing layer sha256:94bbaa35bd91d49373d6e08d328f63f1829c139a5eb120a51aa4737f2a941836 13 MB
    build_template_test.go:133: test-ubuntu-copy: [info] Uncompressing layer sha256:8c4b1b28875140ed3abacaf16ad0d696f6bef912f52d2148f261a23e3349465b 168 B
    build_template_test.go:133: test-ubuntu-copy: [info] Layers extracted
    build_template_test.go:133: test-ubuntu-copy: [info] Root filesystem structure: bin, boot, dev, etc, home, lib, lib64, media, mnt, opt, proc, root, run, sbin, srv, sys, tmp, usr, var
    build_template_test.go:133: test-ubuntu-copy: [info] Provisioning sandbox template
    build_template_test.go:133: test-ubuntu-copy: [info] Provisioning was successful, cleaning up
    build_template_test.go:133: test-ubuntu-copy: [info] Sandbox template provisioned
    build_template_test.go:133: test-ubuntu-copy: [info] [base] DEFAULT USER user [f9aaa150221ef19e492ba626b8d9200ab9434ed5f63d02341f9e3be29c4b8760]
    build_template_test.go:133: test-ubuntu-copy: [info] [builder 1/2] COPY . /app/ [2449098e8b0a6a85d9aaa0edf0136f8bc69e9d618dac9611e42bba7c698fd629]
    build_template_test.go:133: test-ubuntu-copy: [info] [builder 2/2] RUN cat /app/hello.txt | grep 'Hello from COPY!' [2f22b685841af46c090bf872491df49e98da0615c4d2c99657645a92a5b4ee67]
    build_template_test.go:133: test-ubuntu-copy: [info] [builder 2/2] [stdout]: Hello from COPY!
    build_template_test.go:133: test-ubuntu-copy: [info] [finalize] Finalizing template build [d8abedde856dad9179d9e99f9778f17106da49a5286a30633b898c2e893b0482]
    build_template_test.go:133: test-ubuntu-copy: [error] Build failed: build was cancelled
    build_template_test.go:1156: Build failed: {<nil> build was cancelled <nil>}
--- FAIL: TestTemplateBuildCOPY (176.63s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The temporary file created by macroLibPath to hold the embedded fc-debug.gdb macros is never deleted, leading to a resource leak on every run. Since the init script is already a temporary file that is properly cleaned up on exit, consider inlining the embedded macros directly into the init script instead of materializing a separate temporary file.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 280 to 291
f, err := os.CreateTemp("", "fc-debug-*.gdb")
if err != nil {
return "", err
}
if _, err := f.WriteString(fcDebugMacros); err != nil {
f.Close()

return "", errors.New("fc-debug.gdb macro library not found (next to the binary or under cmd/resume-build/)")
return "", err
}

return f.Name(), f.Close()
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The temporary file created by macroLibPath to hold the embedded fc-debug.gdb macros is never deleted, leading to a resource leak on every run. Since the init script is already a temporary file that is properly cleaned up on exit, consider inlining the embedded macros directly into the init script instead of materializing a separate temporary file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in a8c331c. The macro temp file is gone: macroLibContent() returns the macros (a colocated copy if present, else the embedded copy) and writeInitScript now inlines them into the init script instead of source-ing a separate file. The init script is the only temp file and is already removed on exit (defer os.Remove(initScript) in gdbMode), so nothing leaks.

Comment thread packages/orchestrator/cmd/resume-build/gdb.go Outdated
@kalyazin kalyazin marked this pull request as ready for review June 26, 2026 17:04
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — dev-only tooling change (resume-build/copy-build -gdb), well scoped and clearly described; inline comments cover UX nits.

Extended reasoning...

Overview

This PR touches only the resume-build and copy-build developer side-tools, gated behind the -gdb flag. It makes resume-build -gdb portable to read-only gcsfuse mounts by staging the gdb-enabled Firecracker into a writable temp FirecrackerVersionsDir (instead of the old in-place rename of the prod binary), embeds the fc-debug.gdb macro library via go:embed so the binary is standalone, adds a backstop that verifies the binary actually contains the FIRECRACKER_GDB_SOCKET literal before launch, drops the unused HTTP auto-fetch path (renaming resolveOrFetchresolveLocal), adds copy-build -gdb to bridge FC/kernel runtime+debug artifacts between gs:// buckets with CRC-skip, and rewrites the gdb runbook to match the new resolution model.

Security risks

None material. The change forces noEgress = true whenever -gdb is on (gdb.go run-path), which is a defensive improvement: the runbook’s “always -no-egress” rule now becomes mechanical. No auth/crypto/permissions code paths are touched, no new network endpoints, no new privileged operations. copy-build -gdb only operates between gs:// buckets the operator already has credentials for, and deriveArtifactBuckets constrains the destination to siblings of the template bucket.

Level of scrutiny

Low. The PR description is explicit ("Touches only the resume-build and copy-build developer side-tools — no production service code path"), and a code read confirms: nothing in the orchestrator service, sandbox factory hot path, or shared packages is modified beyond the additive gdbOrigVersionsDir field on the local runner struct. The in-process redirect of config.BuilderConfig.FirecrackerVersionsDir is annotated with an explicit invariant and confined to the gdb branch of resume-build’s run(). End-to-end testing on a dev node with a read-only mount is described, and go build/go test/lint are clean.

Other factors

The author has already addressed the two non-trivial reviewer findings on this PR — gemini’s macro-temp-file leak and cursor’s legacy-flat-layout miss — in commit a8c331c, with explicit fixes (inlined macros into the init script; new archOrLegacyArtifact helper that probes arch-first then legacy). The only outstanding bug-hunter findings on this run are documentation/UX nits (runbook step 1 wording, and the -gdb gs:// validation running after the copy) — minor enough to ship as inline comments rather than block. Codecov shows low patch coverage, but that’s expected for an interactive CLI debug tool and not a meaningful signal here.

Comment on lines +27 to +29
1. **Copy the build chain** to the dev node's local storage (`copy-build`), as for
`resume-prod-snapshot`.
`resume-prod-snapshot`. Add `-gdb` to also stage the debug artifacts
(`firecracker-debug`, `vmlinux.debug`) next to the FC/kernel so they resolve locally.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Step 1 of the gdb-debugging runbook instructs operators to add -gdb to the copy-build invocation that copies to the dev node's local storage, but copy-build -gdb hard-fails with log.Fatal("-gdb requires gs:// -from and -to") when either endpoint is not gs://. An operator following the runbook literally will eat a multi-GB copy before hitting the fatal. Clarify that -gdb only operates between gs:// buckets and is a separate bridging step from the local copy.

Extended reasoning...

The mismatch. The runbook's Steps section (gdb-debugging.md:27-29) reads:

Copy the build chain to the dev node's local storage (copy-build), as for resume-prod-snapshot. Add -gdb to also stage the debug artifacts (firecracker-debug, vmlinux.debug) next to the FC/kernel so they resolve locally.

The natural reading is: append -gdb to the same copy-build that copies to local storage, e.g. copy-build -from gs://… -to .local-build -gdb. But copy-build/main.go:499-501 hard-rejects that combination:

if !strings.HasPrefix(*from, "gs://") || !strings.HasPrefix(*to, "gs://") {
    log.Fatal("-gdb requires gs:// -from and -to")
}

And deriveArtifactBuckets only knows how to derive sibling -fc-versions / -fc-kernels buckets from a *-fc-templates GCS bucket — there is no concept of a local destination for staged debug artifacts.

Why the doc reads this way. "Local storage" has a specific meaning in this codebase: it is the non-gs:// filesystem path branch (see e.g. resume-build/main.go: "Saving snapshot to local storage" prints only when -storage is not gs://, and LOCAL_TEMPLATE_STORAGE_BASE_PATH gates the local provider). And copy-build's main copy step canonically supports local-to-local, local-to-GCS, and GCS-to-GCS. So when step 1 says "to the dev node's local storage", an operator reasonably reads it as a local filesystem path.

Why the Prerequisites blurb doesn't save it. The Prerequisites section mentions "copy-build -gdb stages them when bridging a snapshot". But "bridging" is never defined as "gs://→gs://" anywhere in the doc, and there is no example showing the actual gs://→gs:// invocation. A reader who scans step 1 and types the command will hit the fatal.

Step-by-step proof.

  1. Operator reads step 1: "Copy the build chain to the dev node's local storage (copy-build)… Add -gdb to also stage the debug artifacts."
  2. Operator runs: copy-build -build <id> -from gs://prod-fc-templates -to .local-build -gdb.
  3. The main snapshot copy runs successfully (gigabytes of data over many seconds/minutes).
  4. After the copy, main() reaches the if *gdb { … } block and executes if !strings.HasPrefix(*from, "gs://") || !strings.HasPrefix(*to, "gs://") { log.Fatal("-gdb requires gs:// -from and -to") }. Since -to is .local-build, the fatal fires.
  5. The operator has wasted the time and bandwidth of a multi-GB copy and is left with the build copied locally but no debug artifacts.

Impact. Documentation-only — affects only the resume-build/copy-build developer side-tool runbook, not production code paths. But the runbook is the canonical guide for a newly-introduced workflow, and step 1 as written is unachievable.

Fix. Reword step 1 to make explicit that -gdb is a separate gs://→gs:// bridge step. Something like:

  1. Copy the build chain to the dev node's local storage (copy-build), as for resume-prod-snapshot.

    When bridging a snapshot from another environment, first run copy-build -gdb between the two gs:// buckets (-from gs://… -to gs://…) to stage the debug artifacts next to the source FC/kernel in the destination GCS environment; -gdb requires gs:// for both -from and -to and does not write to local paths. The subsequent local copy-build then picks them up alongside the regular artifacts.

Alternatively, teach copyGdbArtifacts to stage to a local destination too, but the doc fix is the least-risk path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the runbook in 633248e. Step 1 now describes -gdb as a separate gs://gs:// bridge (with an explicit example), distinct from the local snapshot copy — it stages the debug artifacts into the destination environment's -fc-versions/-fc-kernels buckets so the node's mounts serve them, and requires gs:// for both endpoints. (Not teaching -gdb a local destination: on a node the FC/kernel dirs are read-only mounts, so the artifacts must land in the bucket.)

Comment on lines +499 to +501
if *gdb {
if !strings.HasPrefix(*from, "gs://") || !strings.HasPrefix(*to, "gs://") {
log.Fatal("-gdb requires gs:// -from and -to")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The -gdb gs:// validation only fires after the full build copy has already completed (errgroup.Wait at ~line 485), so copy-build -gdb -from <local> -to gs://... (or any other non-gs:// combination) wastes minutes / multi-GB of bandwidth before log.Fatal with -gdb requires gs:// -from and -to. Hoist the strings.HasPrefix check up next to flag.Parse(), alongside the existing -envd-version is required when -team is set validation at line 320, so it fails fast.

Extended reasoning...

What the bug is. copy-build's new -gdb flag has a precondition that both -from and -to must be gs:// locations (because copyGdbArtifacts derives the FC-versions / kernels buckets from each side via deriveArtifactBuckets, which itself errors on non-gs:// inputs). That precondition is checked at line ~500:

if *gdb {
    if !strings.HasPrefix(*from, "gs://") || !strings.HasPrefix(*to, "gs://") {
        log.Fatal("-gdb requires gs:// -from and -to")
    }
    ...
}

But this lives inside the if *teamID != "" || *gdb block at line ~491, which only runs after errgroup.Wait() (line 485) — i.e. after the entire parallel filesToCopy loop (lines ~411-487) has already copied every memfile data ref, rootfs data ref, snapfile, header, and metadata to the destination.

The code path that triggers it. Run copy-build -gdb -from /local/path -to gs://bucket (or any combination where either side is not gs://). The tool will:

  1. Build the file list from memfile/rootfs headers and metadata (lines ~336-403).
  2. Enter the errgroup loop and copy all of them in parallel with up to 20 workers (lines ~411-487) — for a real snapshot, this is the multi-GB part and is the only step that takes meaningful wall-clock.
  3. Print Build '...' copied to '...' (line ~490).
  4. Read metadata.json via readTemplateVersions (line ~494).
  5. Enter if *gdb and finally hit the prefix check at line ~500, calling log.Fatalnow the user learns their flags were wrong.

Why existing code doesn't prevent it. The pattern for early validation already exists in this file: flag.Parse() at line 318 is immediately followed at line 320 by

if *teamID != "" && *envdVersion == "" {
    log.Fatal("-envd-version is required when -team is set")
}

The -gdb flag should have been validated the same way, but instead the check was inlined where the value is used, after the expensive work.

Impact. Wasted developer time and bandwidth — copying a real template's data refs can take many minutes and many GB. No data corruption: the copy itself succeeds; the failure is on an unrelated precondition that could have been caught instantly. Worth noting that copy-build is a developer side-tool (not a production code path), so this is a UX wart, not a correctness bug.

How to fix it. Move the strings.HasPrefix check to right after flag.Parse(), alongside the existing -envd-version validation:

flag.Parse()

if *teamID != "" && *envdVersion == "" {
    log.Fatal("-envd-version is required when -team is set")
}
if *gdb && (!strings.HasPrefix(*from, "gs://") || !strings.HasPrefix(*to, "gs://")) {
    log.Fatal("-gdb requires gs:// -from and -to")
}

Then the redundant check inside the if *gdb block at line ~500 can be removed.

Step-by-step proof. Concrete example: copy-build -build <id> -gdb -from /tmp/local-build -to gs://my-bucket-fc-templates. Walking through main():

  1. flag.Parse() (line 318) parses -gdb=true, -from=/tmp/local-build, -to=gs://my-bucket-fc-templates.
  2. Line 320: *teamID == "", so the envd-version check is skipped. No -gdb check exists here.
  3. Lines 336-403: filesToCopy is built from headers — this reads from /tmp/local-build (which works, since the local path exists).
  4. Line ~411-487: the errgroup copies each file from /tmp/local-build to gs://my-bucket-fc-templates via gcloudCopy. This is the slow part: for any real snapshot, several GB of memfile/rootfs data refs are uploaded. Wall-clock: minutes.
  5. Line 485: errgroup.Wait() returns nil; line 490 prints success.
  6. Line 491: *teamID != "" || *gdb → true (because *gdb).
  7. Line 494: readTemplateVersions reads metadata.json from the destination.
  8. Line ~499: if *gdb → true.
  9. Line ~500: !strings.HasPrefix(*from, "gs://") → true (/tmp/local-build doesn't have the prefix).
  10. log.Fatal("-gdb requires gs:// -from and -to")after the full upload.

If the check were hoisted to step 2, the tool would fatal in under a millisecond instead of after a multi-GB upload.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 633248e. Hoisted the gs:// precondition next to flag.Parse() (beside the -envd-version check), so copy-build -gdb with a non-gs:// -from/-to now fails instantly instead of after the snapshot copy; the redundant post-copy check is removed. Verified: -gdb … -to /tmp/local errors immediately, before any copy.

Comment thread packages/orchestrator/cmd/copy-build/main.go
@kalyazin kalyazin force-pushed the kalyazin/gdb-portable-debugging branch 3 times, most recently from 2395403 to 084b2f3 Compare June 30, 2026 13:45

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 084b2f3. Configure here.

Comment thread packages/orchestrator/cmd/copy-build/main.go
kalyazin and others added 5 commits June 30, 2026 15:07
…andalone binaries

Two issues stopped resume-build -gdb from running on a real cluster node:

  - It staged the gdb-enabled Firecracker by renaming the prod binary aside and
    copying the debug binary over it at the resolved version path. On a node
    that path is a read-only gcsfuse mount, so the rename failed with
    "read-only file system". The launch resolves the FC binary from the
    factory's FirecrackerVersionsDir, so before creating the factory in gdb mode
    we redirect that dir to a writable temp dir and stage the gdb binary there;
    the original (read-only) dir is preserved only for resolving the published
    firecracker-debug. The prod binary in the real versions dir is never
    touched, and the backup/restore (with its interrupted-run recovery) is gone.
    The kernel dir is unchanged.

  - It located fc-debug.gdb on disk (next to the binary or under
    cmd/resume-build/), so a binary scp'd to a node without its source failed
    with "macro library not found". Embed it with go:embed and inline it into
    the init script (which is removed on exit), so the binary is self-contained
    and nothing leaks; a colocated copy still wins so macros stay editable
    without a rebuild.

The FirecrackerVersionsDir redirect is correct only because the factory is the
sole, last-created consumer of that dir; document that invariant, and as a
backstop verify the binary about to launch actually contains the gdb feature
(FIRECRACKER_GDB_SOCKET), so a stale/wrong firecracker-debug fails with a clear
error instead of an opaque "gdb socket never bound".

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
resume-build -gdb needs the gdb-enabled Firecracker (firecracker-debug) and the
guest kernel symbols (vmlinux.debug) alongside the snapshot's runtime FC and
kernel. Bridging a snapshot to another environment with copy-build previously
left those to be copied by hand.

With -gdb, copy-build reads the build's kernel/FC versions from metadata.json
(already parsed for -team) and ensures the runtime + debug artifacts exist at
the destination. It CRC32C-compares source and destination (like the snapshot
copy): skip only when the destination already holds identical content, otherwise
copy, replacing a divergent/stale artifact rather than trusting it. Presence is
tracked explicitly (a Destination.exists flag set from the object's Attrs), not
inferred from CRC == 0, so an artifact whose CRC32C is genuinely zero is neither
mistaken for missing nor falsely matched against an absent destination. The
versions/kernels buckets are derived from the template bucket location
(<prefix>-fc-templates -> -fc-versions / -fc-kernels), since the build records
no environment. Required artifacts must exist at the source; the optional
firecracker-debug.debug (FC's own symbols) is skipped if absent. Requires
gs:// -from/-to, validated up front (next to flag.Parse) so a wrong invocation
fails before the snapshot copy rather than after it.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
resume-build -gdb resolved firecracker-debug / vmlinux.debug in three tiers:
-gdb-fc/-gdb-symbols override, then a local copy next to the snapshot's FC /
kernel, then a by-version HTTP fetch from e2b-prod-public-builds (overridable
via E2B_GDB_ARTIFACTS_URL). The fetch tier never fires in practice: the debug
artifacts are published to the per-cluster fc-versions/fc-kernels buckets (a
different path layout), not e2b-prod-public-builds, so the default URL 404s and
no readable bucket matches the firecrackers/<ver> + kernels/<ver> scheme. On a
node the artifacts are in those buckets (gcsfuse-mounted), so the local tier
always wins; copy-build -gdb also stages them locally.

Remove the fetch tier: drop debugArtifactsBaseURL, download(), the URL
construction and the E2B_GDB_ARTIFACTS_URL env, and rename resolveOrFetch to
resolveLocal (override or local copy, else error). Update the gdb-fc/gdb-symbols
flag help and the test accordingly.

Resolve the debug artifacts with an explicit arch-first-then-legacy preference
(archOrLegacyArtifact) instead of deriving the directory from FirecrackerPath /
HostKernelPath, which on un-migrated nodes can resolve the prod binary at the
legacy flat layout and miss the arch-prefixed firecracker-debug / vmlinux.debug
that releases and copy-build -gdb publish.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
The runbook claimed firecracker-debug / vmlinux.debug are fetched automatically
by version from e2b-prod-public-builds in the common case. They are not fetched
over the network; they are resolved locally next to the snapshot's FC / kernel
(where the fc-versions/fc-kernels releases publish them and copy-build -gdb
stages them). Describe local resolution, drop the E2B_GDB_ARTIFACTS_URL
guidance, and clarify that copy-build -gdb is a separate gs://->gs:// bridge step
(staging the debug artifacts into the destination buckets), distinct from the
local snapshot copy.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
fc-faults set a software breakpoint on handle_mm_fault, which writes an
int3 into guest kernel text — a guest-visible memory change the gdb
runbook's "Observer effect" section says to avoid. The FC stub advertises
hwbreak+, and fc-faults only ever sets this single breakpoint, so use a
hardware breakpoint instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
@kalyazin kalyazin force-pushed the kalyazin/gdb-portable-debugging branch from 084b2f3 to ee3db93 Compare June 30, 2026 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant