feat(orch): make resume-build -gdb work on real nodes + add copy-build -gdb#3108
feat(orch): make resume-build -gdb work on real nodes + add copy-build -gdb#3108kalyazin wants to merge 5 commits into
Conversation
PR SummaryLow Risk Overview Reviewed by Cursor Bugbot for commit ee3db93. Bugbot is set up for automated code reviews on this repo. Configure here. |
❌ 11 Tests Failed:
View the top 3 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Code Review
The temporary file created by macroLibPath to hold the embedded fc-debug.gdb macros is never deleted, leading to a resource leak on every run. Since the init script is already a temporary file that is properly cleaned up on exit, consider inlining the embedded macros directly into the init script instead of materializing a separate temporary file.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| f, err := os.CreateTemp("", "fc-debug-*.gdb") | ||
| if err != nil { | ||
| return "", err | ||
| } | ||
| if _, err := f.WriteString(fcDebugMacros); err != nil { | ||
| f.Close() | ||
|
|
||
| return "", errors.New("fc-debug.gdb macro library not found (next to the binary or under cmd/resume-build/)") | ||
| return "", err | ||
| } | ||
|
|
||
| return f.Name(), f.Close() | ||
| } |
There was a problem hiding this comment.
The temporary file created by macroLibPath to hold the embedded fc-debug.gdb macros is never deleted, leading to a resource leak on every run. Since the init script is already a temporary file that is properly cleaned up on exit, consider inlining the embedded macros directly into the init script instead of materializing a separate temporary file.
There was a problem hiding this comment.
Good catch — fixed in a8c331c. The macro temp file is gone: macroLibContent() returns the macros (a colocated copy if present, else the embedded copy) and writeInitScript now inlines them into the init script instead of source-ing a separate file. The init script is the only temp file and is already removed on exit (defer os.Remove(initScript) in gdbMode), so nothing leaks.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
LGTM — dev-only tooling change (resume-build/copy-build -gdb), well scoped and clearly described; inline comments cover UX nits.
Extended reasoning...
Overview
This PR touches only the resume-build and copy-build developer side-tools, gated behind the -gdb flag. It makes resume-build -gdb portable to read-only gcsfuse mounts by staging the gdb-enabled Firecracker into a writable temp FirecrackerVersionsDir (instead of the old in-place rename of the prod binary), embeds the fc-debug.gdb macro library via go:embed so the binary is standalone, adds a backstop that verifies the binary actually contains the FIRECRACKER_GDB_SOCKET literal before launch, drops the unused HTTP auto-fetch path (renaming resolveOrFetch → resolveLocal), adds copy-build -gdb to bridge FC/kernel runtime+debug artifacts between gs:// buckets with CRC-skip, and rewrites the gdb runbook to match the new resolution model.
Security risks
None material. The change forces noEgress = true whenever -gdb is on (gdb.go run-path), which is a defensive improvement: the runbook’s “always -no-egress” rule now becomes mechanical. No auth/crypto/permissions code paths are touched, no new network endpoints, no new privileged operations. copy-build -gdb only operates between gs:// buckets the operator already has credentials for, and deriveArtifactBuckets constrains the destination to siblings of the template bucket.
Level of scrutiny
Low. The PR description is explicit ("Touches only the resume-build and copy-build developer side-tools — no production service code path"), and a code read confirms: nothing in the orchestrator service, sandbox factory hot path, or shared packages is modified beyond the additive gdbOrigVersionsDir field on the local runner struct. The in-process redirect of config.BuilderConfig.FirecrackerVersionsDir is annotated with an explicit invariant and confined to the gdb branch of resume-build’s run(). End-to-end testing on a dev node with a read-only mount is described, and go build/go test/lint are clean.
Other factors
The author has already addressed the two non-trivial reviewer findings on this PR — gemini’s macro-temp-file leak and cursor’s legacy-flat-layout miss — in commit a8c331c, with explicit fixes (inlined macros into the init script; new archOrLegacyArtifact helper that probes arch-first then legacy). The only outstanding bug-hunter findings on this run are documentation/UX nits (runbook step 1 wording, and the -gdb gs:// validation running after the copy) — minor enough to ship as inline comments rather than block. Codecov shows low patch coverage, but that’s expected for an interactive CLI debug tool and not a meaningful signal here.
| 1. **Copy the build chain** to the dev node's local storage (`copy-build`), as for | ||
| `resume-prod-snapshot`. | ||
| `resume-prod-snapshot`. Add `-gdb` to also stage the debug artifacts | ||
| (`firecracker-debug`, `vmlinux.debug`) next to the FC/kernel so they resolve locally. |
There was a problem hiding this comment.
🟡 Step 1 of the gdb-debugging runbook instructs operators to add -gdb to the copy-build invocation that copies to the dev node's local storage, but copy-build -gdb hard-fails with log.Fatal("-gdb requires gs:// -from and -to") when either endpoint is not gs://. An operator following the runbook literally will eat a multi-GB copy before hitting the fatal. Clarify that -gdb only operates between gs:// buckets and is a separate bridging step from the local copy.
Extended reasoning...
The mismatch. The runbook's Steps section (gdb-debugging.md:27-29) reads:
Copy the build chain to the dev node's local storage (
copy-build), as forresume-prod-snapshot. Add-gdbto also stage the debug artifacts (firecracker-debug,vmlinux.debug) next to the FC/kernel so they resolve locally.
The natural reading is: append -gdb to the same copy-build that copies to local storage, e.g. copy-build -from gs://… -to .local-build -gdb. But copy-build/main.go:499-501 hard-rejects that combination:
if !strings.HasPrefix(*from, "gs://") || !strings.HasPrefix(*to, "gs://") {
log.Fatal("-gdb requires gs:// -from and -to")
}And deriveArtifactBuckets only knows how to derive sibling -fc-versions / -fc-kernels buckets from a *-fc-templates GCS bucket — there is no concept of a local destination for staged debug artifacts.
Why the doc reads this way. "Local storage" has a specific meaning in this codebase: it is the non-gs:// filesystem path branch (see e.g. resume-build/main.go: "Saving snapshot to local storage" prints only when -storage is not gs://, and LOCAL_TEMPLATE_STORAGE_BASE_PATH gates the local provider). And copy-build's main copy step canonically supports local-to-local, local-to-GCS, and GCS-to-GCS. So when step 1 says "to the dev node's local storage", an operator reasonably reads it as a local filesystem path.
Why the Prerequisites blurb doesn't save it. The Prerequisites section mentions "copy-build -gdb stages them when bridging a snapshot". But "bridging" is never defined as "gs://→gs://" anywhere in the doc, and there is no example showing the actual gs://→gs:// invocation. A reader who scans step 1 and types the command will hit the fatal.
Step-by-step proof.
- Operator reads step 1: "Copy the build chain to the dev node's local storage (
copy-build)… Add-gdbto also stage the debug artifacts." - Operator runs:
copy-build -build <id> -from gs://prod-fc-templates -to .local-build -gdb. - The main snapshot copy runs successfully (gigabytes of data over many seconds/minutes).
- After the copy,
main()reaches theif *gdb { … }block and executesif !strings.HasPrefix(*from, "gs://") || !strings.HasPrefix(*to, "gs://") { log.Fatal("-gdb requires gs:// -from and -to") }. Since-tois.local-build, the fatal fires. - The operator has wasted the time and bandwidth of a multi-GB copy and is left with the build copied locally but no debug artifacts.
Impact. Documentation-only — affects only the resume-build/copy-build developer side-tool runbook, not production code paths. But the runbook is the canonical guide for a newly-introduced workflow, and step 1 as written is unachievable.
Fix. Reword step 1 to make explicit that -gdb is a separate gs://→gs:// bridge step. Something like:
Copy the build chain to the dev node's local storage (
copy-build), as forresume-prod-snapshot.When bridging a snapshot from another environment, first run
copy-build -gdbbetween the twogs://buckets (-from gs://… -to gs://…) to stage the debug artifacts next to the source FC/kernel in the destination GCS environment;-gdbrequiresgs://for both-fromand-toand does not write to local paths. The subsequent localcopy-buildthen picks them up alongside the regular artifacts.
Alternatively, teach copyGdbArtifacts to stage to a local destination too, but the doc fix is the least-risk path.
There was a problem hiding this comment.
Fixed the runbook in 633248e. Step 1 now describes -gdb as a separate gs://→gs:// bridge (with an explicit example), distinct from the local snapshot copy — it stages the debug artifacts into the destination environment's -fc-versions/-fc-kernels buckets so the node's mounts serve them, and requires gs:// for both endpoints. (Not teaching -gdb a local destination: on a node the FC/kernel dirs are read-only mounts, so the artifacts must land in the bucket.)
| if *gdb { | ||
| if !strings.HasPrefix(*from, "gs://") || !strings.HasPrefix(*to, "gs://") { | ||
| log.Fatal("-gdb requires gs:// -from and -to") |
There was a problem hiding this comment.
🟡 The -gdb gs:// validation only fires after the full build copy has already completed (errgroup.Wait at ~line 485), so copy-build -gdb -from <local> -to gs://... (or any other non-gs:// combination) wastes minutes / multi-GB of bandwidth before log.Fatal with -gdb requires gs:// -from and -to. Hoist the strings.HasPrefix check up next to flag.Parse(), alongside the existing -envd-version is required when -team is set validation at line 320, so it fails fast.
Extended reasoning...
What the bug is. copy-build's new -gdb flag has a precondition that both -from and -to must be gs:// locations (because copyGdbArtifacts derives the FC-versions / kernels buckets from each side via deriveArtifactBuckets, which itself errors on non-gs:// inputs). That precondition is checked at line ~500:
if *gdb {
if !strings.HasPrefix(*from, "gs://") || !strings.HasPrefix(*to, "gs://") {
log.Fatal("-gdb requires gs:// -from and -to")
}
...
}But this lives inside the if *teamID != "" || *gdb block at line ~491, which only runs after errgroup.Wait() (line 485) — i.e. after the entire parallel filesToCopy loop (lines ~411-487) has already copied every memfile data ref, rootfs data ref, snapfile, header, and metadata to the destination.
The code path that triggers it. Run copy-build -gdb -from /local/path -to gs://bucket (or any combination where either side is not gs://). The tool will:
- Build the file list from memfile/rootfs headers and metadata (lines ~336-403).
- Enter the
errgrouploop and copy all of them in parallel with up to 20 workers (lines ~411-487) — for a real snapshot, this is the multi-GB part and is the only step that takes meaningful wall-clock. - Print
Build '...' copied to '...'(line ~490). - Read
metadata.jsonviareadTemplateVersions(line ~494). - Enter
if *gdband finally hit the prefix check at line ~500, callinglog.Fatal— now the user learns their flags were wrong.
Why existing code doesn't prevent it. The pattern for early validation already exists in this file: flag.Parse() at line 318 is immediately followed at line 320 by
if *teamID != "" && *envdVersion == "" {
log.Fatal("-envd-version is required when -team is set")
}The -gdb flag should have been validated the same way, but instead the check was inlined where the value is used, after the expensive work.
Impact. Wasted developer time and bandwidth — copying a real template's data refs can take many minutes and many GB. No data corruption: the copy itself succeeds; the failure is on an unrelated precondition that could have been caught instantly. Worth noting that copy-build is a developer side-tool (not a production code path), so this is a UX wart, not a correctness bug.
How to fix it. Move the strings.HasPrefix check to right after flag.Parse(), alongside the existing -envd-version validation:
flag.Parse()
if *teamID != "" && *envdVersion == "" {
log.Fatal("-envd-version is required when -team is set")
}
if *gdb && (!strings.HasPrefix(*from, "gs://") || !strings.HasPrefix(*to, "gs://")) {
log.Fatal("-gdb requires gs:// -from and -to")
}Then the redundant check inside the if *gdb block at line ~500 can be removed.
Step-by-step proof. Concrete example: copy-build -build <id> -gdb -from /tmp/local-build -to gs://my-bucket-fc-templates. Walking through main():
flag.Parse()(line 318) parses-gdb=true,-from=/tmp/local-build,-to=gs://my-bucket-fc-templates.- Line 320:
*teamID == "", so the envd-version check is skipped. No-gdbcheck exists here. - Lines 336-403:
filesToCopyis built from headers — this reads from/tmp/local-build(which works, since the local path exists). - Line ~411-487: the errgroup copies each file from
/tmp/local-buildtogs://my-bucket-fc-templatesviagcloudCopy. This is the slow part: for any real snapshot, several GB of memfile/rootfs data refs are uploaded. Wall-clock: minutes. - Line 485:
errgroup.Wait()returns nil; line 490 prints success. - Line 491:
*teamID != "" || *gdb→ true (because*gdb). - Line 494:
readTemplateVersionsreadsmetadata.jsonfrom the destination. - Line ~499:
if *gdb→ true. - Line ~500:
!strings.HasPrefix(*from, "gs://")→ true (/tmp/local-builddoesn't have the prefix). log.Fatal("-gdb requires gs:// -from and -to")— after the full upload.
If the check were hoisted to step 2, the tool would fatal in under a millisecond instead of after a multi-GB upload.
There was a problem hiding this comment.
Fixed in 633248e. Hoisted the gs:// precondition next to flag.Parse() (beside the -envd-version check), so copy-build -gdb with a non-gs:// -from/-to now fails instantly instead of after the snapshot copy; the redundant post-copy check is removed. Verified: -gdb … -to /tmp/local errors immediately, before any copy.
2395403 to
084b2f3
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 084b2f3. Configure here.
…andalone binaries
Two issues stopped resume-build -gdb from running on a real cluster node:
- It staged the gdb-enabled Firecracker by renaming the prod binary aside and
copying the debug binary over it at the resolved version path. On a node
that path is a read-only gcsfuse mount, so the rename failed with
"read-only file system". The launch resolves the FC binary from the
factory's FirecrackerVersionsDir, so before creating the factory in gdb mode
we redirect that dir to a writable temp dir and stage the gdb binary there;
the original (read-only) dir is preserved only for resolving the published
firecracker-debug. The prod binary in the real versions dir is never
touched, and the backup/restore (with its interrupted-run recovery) is gone.
The kernel dir is unchanged.
- It located fc-debug.gdb on disk (next to the binary or under
cmd/resume-build/), so a binary scp'd to a node without its source failed
with "macro library not found". Embed it with go:embed and inline it into
the init script (which is removed on exit), so the binary is self-contained
and nothing leaks; a colocated copy still wins so macros stay editable
without a rebuild.
The FirecrackerVersionsDir redirect is correct only because the factory is the
sole, last-created consumer of that dir; document that invariant, and as a
backstop verify the binary about to launch actually contains the gdb feature
(FIRECRACKER_GDB_SOCKET), so a stale/wrong firecracker-debug fails with a clear
error instead of an opaque "gdb socket never bound".
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
resume-build -gdb needs the gdb-enabled Firecracker (firecracker-debug) and the guest kernel symbols (vmlinux.debug) alongside the snapshot's runtime FC and kernel. Bridging a snapshot to another environment with copy-build previously left those to be copied by hand. With -gdb, copy-build reads the build's kernel/FC versions from metadata.json (already parsed for -team) and ensures the runtime + debug artifacts exist at the destination. It CRC32C-compares source and destination (like the snapshot copy): skip only when the destination already holds identical content, otherwise copy, replacing a divergent/stale artifact rather than trusting it. Presence is tracked explicitly (a Destination.exists flag set from the object's Attrs), not inferred from CRC == 0, so an artifact whose CRC32C is genuinely zero is neither mistaken for missing nor falsely matched against an absent destination. The versions/kernels buckets are derived from the template bucket location (<prefix>-fc-templates -> -fc-versions / -fc-kernels), since the build records no environment. Required artifacts must exist at the source; the optional firecracker-debug.debug (FC's own symbols) is skipped if absent. Requires gs:// -from/-to, validated up front (next to flag.Parse) so a wrong invocation fails before the snapshot copy rather than after it. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
resume-build -gdb resolved firecracker-debug / vmlinux.debug in three tiers: -gdb-fc/-gdb-symbols override, then a local copy next to the snapshot's FC / kernel, then a by-version HTTP fetch from e2b-prod-public-builds (overridable via E2B_GDB_ARTIFACTS_URL). The fetch tier never fires in practice: the debug artifacts are published to the per-cluster fc-versions/fc-kernels buckets (a different path layout), not e2b-prod-public-builds, so the default URL 404s and no readable bucket matches the firecrackers/<ver> + kernels/<ver> scheme. On a node the artifacts are in those buckets (gcsfuse-mounted), so the local tier always wins; copy-build -gdb also stages them locally. Remove the fetch tier: drop debugArtifactsBaseURL, download(), the URL construction and the E2B_GDB_ARTIFACTS_URL env, and rename resolveOrFetch to resolveLocal (override or local copy, else error). Update the gdb-fc/gdb-symbols flag help and the test accordingly. Resolve the debug artifacts with an explicit arch-first-then-legacy preference (archOrLegacyArtifact) instead of deriving the directory from FirecrackerPath / HostKernelPath, which on un-migrated nodes can resolve the prod binary at the legacy flat layout and miss the arch-prefixed firecracker-debug / vmlinux.debug that releases and copy-build -gdb publish. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
The runbook claimed firecracker-debug / vmlinux.debug are fetched automatically by version from e2b-prod-public-builds in the common case. They are not fetched over the network; they are resolved locally next to the snapshot's FC / kernel (where the fc-versions/fc-kernels releases publish them and copy-build -gdb stages them). Describe local resolution, drop the E2B_GDB_ARTIFACTS_URL guidance, and clarify that copy-build -gdb is a separate gs://->gs:// bridge step (staging the debug artifacts into the destination buckets), distinct from the local snapshot copy. Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
fc-faults set a software breakpoint on handle_mm_fault, which writes an int3 into guest kernel text — a guest-visible memory change the gdb runbook's "Observer effect" section says to avoid. The FC stub advertises hwbreak+, and fc-faults only ever sets this single breakpoint, so use a hardware breakpoint instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
084b2f3 to
ee3db93
Compare

Why
resume-build -gdb(guest-kernel debugging, added in #3040) didn't actually run on a real orchestrator node: it staged the gdb-enabled Firecracker by renaming the prod binary in place, but on a node the FC-versions directory is a read-only mount (gcsfuse), so it failed withread-only file system. It also located itsfc-debug.gdbmacro library relative to the source tree, so a binary copied to a node without its source failed withmacro library not found. And bridging a snapshot from another environment for debugging meant hand-copying the FC + kernel debug artifacts.This makes the
-gdbpath portable, adds a one-command artifact bridge, and trims a debug-artifact fetch path that never fired in practice.What
fix(orch): make resume-build -gdb portable to read-only mounts and standalone binariesFirecrackerVersionsDirat it before the factory captures the config, instead of overwriting the prod binary in the read-only versions dir. The original dir is preserved only for resolving the publishedfirecracker-debug; the real dir is never written. Removes the old backup/restore (and its interrupted-run recovery).fc-debug.gdbwithgo:embedso the binary is self-contained (a colocated copy still wins for local iteration).FIRECRACKER_GDB_SOCKET), so a stale/wrongfirecracker-debugfails with a clear message instead of an opaque "gdb socket never bound".feat(orch): add copy-build -gdb to stage FC + kernel debug artifactscopy-build -gdbreads the build's FC/kernel versions frommetadata.jsonand ensures the runtime + debug artifacts (firecracker,firecracker-debug,vmlinux.bin,vmlinux.debug) exist at the destination, deriving the versions/kernels buckets from the template bucket. CRC-compares like the snapshot copy: skips identical content, otherwise copies — replacing a divergent/stale artifact. Requiresgs://source/dest.refactor(orch): drop the unused gdb debug-artifact URL auto-fetchcopy-build -gdb. Removes the fetch code/env-var and renamesresolveOrFetch→resolveLocal.docs(orch): correct the gdb runbook artifact resolutioncopy-build -gdb; or explicit-gdb-fc/-gdb-symbols).Testing
copy-build -gdbexercised both the skip-identical and the copy/replace paths.go build,go test,golangci-lintclean; each commit builds on its own.Scope / risk
Touches only the
resume-buildandcopy-builddeveloper side-tools — no production service code path. The read-only-mount override is gated on-gdband confined to that run (only the FC-versions dir; the kernel dir is untouched).🤖 Generated with Claude Code