feat(gdb): debug guest kernel of a restored microVM #19
memory_info.rs, pagemap.rs and meminfo.rs (added by the guest-memory introspection API work) ship without the SPDX/Apache-2.0 header the license style check (integration_tests/style/test_licenses.py) requires. Prepend the standard two-line Amazon/Apache-2.0 header. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
test_balloon_wait_on_ack.py and test_drive_virtio.py are not black-formatted, so the python style check fails. Reformat them with `black --config tests/pyproject.toml`; no test-logic change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
docs/api_requests/block-write-zeroes.md and docs/ballooning.md are not mdformat-clean, failing the markdown style check. Reformat with mdformat; no content change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Several fork files predating this branch are not rustfmt-clean under tests/fmt.toml, failing the rust style check. Run `cargo fmt`; a mechanical reformat only, no logic change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Under `--features gdb`, clippy's mismatched_lifetime_syntaxes fires on five FirecrackerTarget trait methods that take `&mut self` and return a gdbstub `*Ops` type whose lifetime is elided in the path: the borrow is visible on the receiver but hidden in the return type. Spell the lifetime as `<'_, ...>` so the two syntaxes match. No behavior change; makes `cargo clippy --features gdb` clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
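The lifetime fix above can be illustrated with a minimal sketch. `Target` and `RegOps` are hypothetical stand-ins, not the gdbstub or FirecrackerTarget types; only the shape of the signature matters here.

```rust
// Hypothetical stand-ins for the target and a borrowing `*Ops`-style view.
struct Target {
    regs: [u64; 4],
}

// The view holds a mutable borrow of the target's registers.
struct RegOps<'a> {
    regs: &'a mut [u64; 4],
}

impl Target {
    // Writing the return type as plain `RegOps` would hide the borrow that is
    // visible on `&mut self`; `RegOps<'_>` spells the elided lifetime so both
    // sides of the signature use matching syntax.
    fn reg_ops(&mut self) -> RegOps<'_> {
        RegOps { regs: &mut self.regs }
    }
}

impl RegOps<'_> {
    // Write one register through the borrowed view.
    fn write(&mut self, idx: usize, val: u64) {
        self.regs[idx] = val;
    }
}
```

The generated code is identical either way; the `'_` only makes the borrow explicit to the reader and to the lint.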
PR Summary
Medium Risk
Overview
All-stop multi-vCPU behavior: snapshot-editor prints per-MSR index/data from vcpu state (KASLR slide from
Minor: copyright headers, doc/markdown wrapping, import/format churn in persist modules and tests.
Reviewed by Cursor Bugbot for commit 9eeafcd.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 47dfd92.
Upstream wires gdb only into the boot path; restored microVMs never started the gdb server. Accept a gdb_socket_path restore-time override on the load-snapshot request (alongside network_overrides and clock_realtime) and wire attach_debug_info + gdb_thread into build_microvm_from_snapshot (x86_64), arming the entry breakpoint at the restored vCPU RIP so gdb takes control at the resume point.

Carrying the socket on LoadSnapshotParams keeps it a pure restore-time knob: no machine-config update is needed before the load (which would forbid the snapshot load), and there is no boot-time value to preserve across restore. persist sets the restored machine config's gdb_socket_path from the load param, which the snapshot builder reads.

Also add resolve_gdb_socket_path() with a FIRECRACKER_GDB_SOCKET env fallback, so launchers that cannot set the load request can still enable gdb.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
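A rough sketch of how the env fallback could be resolved. The signature of `resolve_gdb_socket_path`, the `resolve_with` helper, the precedence (explicit load param wins) and the handling of an empty variable are all assumptions for illustration, not taken from the PR.

```rust
use std::env;
use std::path::PathBuf;

// Environment variable named in the commit message.
const GDB_SOCKET_ENV: &str = "FIRECRACKER_GDB_SOCKET";

// Hypothetical pure helper: prefer the explicit restore-time value, else fall
// back to the environment value. An empty variable is treated as unset
// (an assumption, not confirmed by the PR).
fn resolve_with(explicit: Option<PathBuf>, env_value: Option<String>) -> Option<PathBuf> {
    explicit.or_else(|| env_value.filter(|v| !v.is_empty()).map(PathBuf::from))
}

// Assumed shape of the resolver: read the process environment and delegate.
fn resolve_gdb_socket_path(explicit: Option<PathBuf>) -> Option<PathBuf> {
    resolve_with(explicit, env::var(GDB_SOCKET_ENV).ok())
}
```

Splitting out the pure helper keeps the precedence rule testable without mutating the process environment.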
When one vCPU stopped at a debug event the others kept running, so querying a running vCPU (info threads, per-vCPU backtraces) blocked indefinitely. Pause the sibling vCPUs on every stop (initial entry stop, breakpoint stops, and Ctrl-C), like QEMU's all-stop, reusing the existing per-vCPU pause (which kicks a running or halted vCPU out of KVM_RUN). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
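The all-stop rule described above can be modelled in a few lines. This is a toy state model with hypothetical names (`Vcpus`, `on_stop`); the real code reuses the existing per-vCPU pause that kicks a vCPU out of KVM_RUN, which is only represented by a comment here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum VcpuState {
    Running,
    Paused,
}

// Hypothetical bookkeeping: one state slot per vCPU, indexed by vCPU id.
struct Vcpus {
    state: Vec<VcpuState>,
}

impl Vcpus {
    // Called on every stop (entry stop, breakpoint, Ctrl-C). Pauses every
    // sibling that is still running and returns the ids that had to be kicked.
    fn on_stop(&mut self, stopped: usize) -> Vec<usize> {
        let mut kicked = Vec::new();
        for (id, s) in self.state.iter_mut().enumerate() {
            if id != stopped && *s == VcpuState::Running {
                // The real implementation would request the per-vCPU pause here.
                *s = VcpuState::Paused;
                kicked.push(id);
            }
        }
        self.state[stopped] = VcpuState::Paused;
        kicked
    }
}
```

With every vCPU paused after a stop, a query against any thread is answered from a stationary vCPU instead of blocking on a running one.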
When more than one vcpu hits a breakpoint while the VM runs, each sends
a debug event and parks itself in the paused emulation state. The gdb
event loop reports the first and force-pauses the rest, but their
already-queued debug events are never consumed. On the next resume those
stale events remain, so a following `wait_for_stop_reason` dequeues one
and processes it against a vcpu that has since resumed: it marks a
running vcpu as paused, desyncing the pause/resume handshake until the
vcpu threads exit and the event channel disconnects — surfacing as a
fatal `GdbQueueError` ("Remote connection closed" on the client) under a
sustained multi-vcpu breakpoint storm.
Drain the debug-event queue at the start of `resume_all_vcpus`. Every
vcpu is paused there, so none can emit an event and anything queued is
provably stale; dropping it is safe and keeps `vcpu_state` in sync with
the vcpus.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
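A minimal sketch of the drain, assuming the debug events travel over a `std::sync::mpsc` channel. The channel type and the helper name `drain_stale_events` are assumptions; the PR only states that the queue is drained at the start of `resume_all_vcpus`.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

// Drop every event that is already queued, without blocking, and report how
// many were discarded. Intended to run while all vCPUs are paused, so nothing
// new can arrive during the drain.
fn drain_stale_events<T>(rx: &Receiver<T>) -> usize {
    let mut dropped = 0;
    loop {
        match rx.try_recv() {
            Ok(_) => dropped += 1,
            // Empty queue or disconnected senders: nothing left to drain.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    dropped
}
```

Using `try_recv` rather than `recv` matters: the drain must return as soon as the queue is empty instead of waiting for the next event.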
The derived Debug of a vcpu's saved_msrs shows only the kvm_msrs headers, not the entries (a FAM array). Print each saved MSR's index and data so tooling can read values from a snapshot — e.g. MSR_LSTAR (entry_SYSCALL_64), used to recover the KASLR image slide of a restored guest for source-level debugging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
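As an illustration of per-entry printing, the sketch below formats each MSR as an index/data pair. `MsrEntry` is a hypothetical stand-in for the KVM entry type, and the output format is invented for the example; the actual snapshot-editor output may differ. `0xc0000082` is the architectural index of MSR_LSTAR.

```rust
// Hypothetical stand-in for one saved MSR entry (index + value).
struct MsrEntry {
    index: u32,
    data: u64,
}

// Architectural index of IA32_LSTAR, the 64-bit SYSCALL entry point.
const MSR_LSTAR: u32 = 0xc000_0082;

// Render one line per entry instead of relying on a derived Debug that only
// shows the header of the flexible array.
fn format_msrs(entries: &[MsrEntry]) -> Vec<String> {
    entries
        .iter()
        .map(|e| format!("msr index={:#x} data={:#x}", e.index, e.data))
        .collect()
}

// Look up a single MSR value by index, e.g. MSR_LSTAR.
fn find_msr(entries: &[MsrEntry], index: u32) -> Option<u64> {
    entries.iter().find(|e| e.index == index).map(|e| e.data)
}
```

Tooling can then pick out the `MSR_LSTAR` value from the printed lines or via a lookup like `find_msr`.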
Two tests on the production kernel built with DWARF (KASLR on), passing the gdb socket as a load-snapshot restore-time override. Both recover the KASLR image slide from the snapshot (MSR_LSTAR via snapshot-editor vs the link-time entry_SYSCALL_64) and attach gdb to the restored guest:

- test_gdb_restore: multi-vCPU, restore file- and UFFD-backed (4K and 2M hugetlb), hit a breakpoint, print kernel structures/memory, and enumerate both vCPUs (info threads) with a per-vCPU backtrace.
- test_gdb_restore_fault_attribution: attribute guest page faults to process+VMA (comm/pid/addr/VMA) by breaking handle_mm_fault on the restored multi-vCPU VM under a sustained fault storm. Doubles as a regression test for the stale debug-event drain on resume (two vCPUs hammering the breakpoint).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
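The slide recovery the tests rely on is plain arithmetic: the runtime MSR_LSTAR value minus the link-time address of entry_SYSCALL_64. A sketch with hypothetical helper names (the tests themselves are Python and may compute this differently):

```rust
// Slide = runtime address of entry_SYSCALL_64 (read from MSR_LSTAR in the
// snapshot) minus its link-time address from the kernel's symbol table.
// Returns None if the runtime value is below the link-time one, which would
// indicate mismatched inputs.
fn kaslr_slide(runtime_lstar: u64, link_time_entry: u64) -> Option<u64> {
    runtime_lstar.checked_sub(link_time_entry)
}

// Apply the slide to any other link-time symbol address.
fn relocate(link_time_addr: u64, slide: u64) -> u64 {
    link_time_addr.wrapping_add(slide)
}
```

The resulting slide is what a debugger needs in order to load the DWARF symbols at their randomized addresses; how it is passed to gdb is outside this sketch.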
47dfd92 to 9eeafcd

Why
Upstream wires gdb only into the boot path, so a snapshot-resumed microVM can't be debugged, and the stub isn't all-stop, so multi-vCPU inspection hangs. We need to gdb the guest kernel of a resumed snapshot (investigate resume behavior / slow envd-init) with full source-level symbols, KASLR on.

What (all gdb code behind `--features gdb`; default/prod build untouched)
- `gdb_socket_path` restore-time override on the load-snapshot request; `build_microvm_from_snapshot` wires `attach_debug_info` + `gdb_thread` at the restored RIP. Env fallback `FIRECRACKER_GDB_SOCKET`.
- All-stop: sibling vCPUs are paused on every stop, so `info threads`/per-vCPU backtraces work.
- Stale debug events are drained on resume; left queued, they dropped the gdb connection (`GdbQueueError`) under a breakpoint storm.
- snapshot-editor prints each saved MSR's index and data, so the KASLR slide can be recovered from the snapshot (`MSR_LSTAR`).
- Tests: recover the slide, hit a breakpoint, read kernel structs, enumerate vCPUs, and attribute guest page faults to process+VMA.