diff --git a/docs/api_requests/block-write-zeroes.md b/docs/api_requests/block-write-zeroes.md index 202b9cb122e..0aeee0cb2ec 100644 --- a/docs/api_requests/block-write-zeroes.md +++ b/docs/api_requests/block-write-zeroes.md @@ -9,8 +9,8 @@ and journals), filesystem snapshots, encrypted-volume initial wipe, and ## How it works For all non-read-only block devices, Firecracker automatically advertises the -`VIRTIO_BLK_F_WRITE_ZEROES` feature to the guest driver. No API configuration -is required — write-zeroes support is always-on for writable drives. +`VIRTIO_BLK_F_WRITE_ZEROES` feature to the guest driver. No API configuration is +required — write-zeroes support is always-on for writable drives. Each `VIRTIO_BLK_T_WRITE_ZEROES` request carries a 16-byte segment with a `flags` field. Bit 0 (`VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP`) tells the device @@ -20,15 +20,15 @@ advertises `write_zeroes_may_unmap=1`, so guests are free to set this flag. Firecracker translates the guest's UNMAP bit into a `fallocate(2)` mode on the backing file: -| UNMAP | fallocate mode | Effect | -|-------|---------------------------------------------|---------------------------------------| -| 0 | `FALLOC_FL_ZERO_RANGE \| FALLOC_FL_KEEP_SIZE` | zeros in place, no deallocation | -| 1 | `FALLOC_FL_PUNCH_HOLE \| FALLOC_FL_KEEP_SIZE` | zeros + deallocate (sparse holes) | +| UNMAP | fallocate mode | Effect | +| ----- | --------------------------------------------- | --------------------------------- | +| 0 | `FALLOC_FL_ZERO_RANGE \| FALLOC_FL_KEEP_SIZE` | zeros in place, no deallocation | +| 1 | `FALLOC_FL_PUNCH_HOLE \| FALLOC_FL_KEEP_SIZE` | zeros + deallocate (sparse holes) | -The virtio spec requires that when UNMAP is clear the device MUST NOT -deallocate sectors (so `ZERO_RANGE` is mandatory for that path); when UNMAP -is set, the device MAY deallocate, and `PUNCH_HOLE` reads as zeros on every -filesystem that supports it. 
+The virtio spec requires that when UNMAP is clear the device MUST NOT deallocate +sectors (so `ZERO_RANGE` is mandatory for that path); when UNMAP is set, the +device MAY deallocate, and `PUNCH_HOLE` reads as zeros on every filesystem that +supports it. ## Host requirements @@ -36,8 +36,8 @@ The backing file must reside on a filesystem that supports the corresponding `fallocate` mode: - `FALLOC_FL_PUNCH_HOLE` (UNMAP=1) is widely supported: ext4, xfs, btrfs, tmpfs. -- `FALLOC_FL_ZERO_RANGE` (UNMAP=0) is supported on ext4, xfs, btrfs; on tmpfs - it requires Linux 6.8+. Other filesystems may not support it. +- `FALLOC_FL_ZERO_RANGE` (UNMAP=0) is supported on ext4, xfs, btrfs; on tmpfs it + requires Linux 6.8+. Other filesystems may not support it. If `fallocate` returns `EOPNOTSUPP` for either mode, Firecracker logs a one-time warning and replies with `VIRTIO_BLK_S_UNSUPP`. The Linux virtio-blk driver @@ -48,14 +48,14 @@ requests with `VIRTIO_BLK_S_UNSUPP` for the rest of the device's lifetime — no additional `fallocate` calls are made. The EOPNOTSUPP cache is shared across UNMAP=0 and UNMAP=1 paths: a single -fallback flag disables both. This is conservative — a filesystem that -supports `PUNCH_HOLE` but not `ZERO_RANGE` will see UNMAP=1 requests rejected -once an UNMAP=0 request fails — but it matches the discard fallback design -and avoids subtle host-side state. +fallback flag disables both. This is conservative — a filesystem that supports +`PUNCH_HOLE` but not `ZERO_RANGE` will see UNMAP=1 requests rejected once an +UNMAP=0 request fails — but it matches the discard fallback design and avoids +subtle host-side state. ## Limitations - Write-zeroes is only available for non-read-only block devices. - At most one segment per request is supported (`max_write_zeroes_seg = 1`). -- Only bit 0 (UNMAP) of the segment flags is allowed; non-zero reserved bits - are rejected with an I/O error. 
+- Only bit 0 (UNMAP) of the segment flags is allowed; non-zero reserved bits are + rejected with an I/O error. diff --git a/docs/ballooning.md b/docs/ballooning.md index 6f7cf5daf69..d70a3898290 100644 --- a/docs/ballooning.md +++ b/docs/ballooning.md @@ -468,22 +468,21 @@ your scenario. #### `VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK` Whenever `free_page_hinting` is enabled, Firecracker also advertises -`VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK` (bit 6). When negotiated, the guest -driver waits for the device to signal-used each hint buffer before -pushing the corresponding page onto its internal free list — closing -the data-loss race described in the warning above without any host-side -protocol change. - -The bit only takes effect on guests whose kernel carries the supporting -patch (Jack Thomson's `virtio_balloon: Support wait on ACK for hinting`, -not yet upstream as of this writing). On unsupported guests the driver -self-clears the bit during `validate`, so the advertise is ignored and -hinting falls back to the unsynchronised behaviour. There is no separate -configuration knob — opting into `free_page_hinting` is sufficient. - -Note that the per-buffer round trip introduces extra wait time per hint -cycle on supported guests; the safety/perf trade-off is intentional and -documented at the kernel-patch level. +`VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK` (bit 6). When negotiated, the guest driver +waits for the device to signal-used each hint buffer before pushing the +corresponding page onto its internal free list — closing the data-loss race +described in the warning above without any host-side protocol change. + +The bit only takes effect on guests whose kernel carries the supporting patch +(Jack Thomson's `virtio_balloon: Support wait on ACK for hinting`, not yet +upstream as of this writing). On unsupported guests the driver self-clears the +bit during `validate`, so the advertise is ignored and hinting falls back to the +unsynchronised behaviour. 
There is no separate configuration knob — opting into +`free_page_hinting` is sufficient. + +Note that the per-buffer round trip introduces extra wait time per hint cycle on +supported guests; the safety/perf trade-off is intentional and documented at the +kernel-patch level. ## Balloon Caveats diff --git a/src/firecracker/src/api_server/request/memory_info.rs b/src/firecracker/src/api_server/request/memory_info.rs index 2d8e55a420e..2be58bea137 100644 --- a/src/firecracker/src/api_server/request/memory_info.rs +++ b/src/firecracker/src/api_server/request/memory_info.rs @@ -1,3 +1,6 @@ +// Copyright 2026 Amazon.com, Inc. or its affiliates. All Rights Reserved. +// SPDX-License-Identifier: Apache-2.0 + use micro_http::Method; use vmm::rpc_interface::VmmAction; diff --git a/src/firecracker/src/api_server/request/snapshot.rs b/src/firecracker/src/api_server/request/snapshot.rs index 1fc9d2a6b06..ce38020a657 100644 --- a/src/firecracker/src/api_server/request/snapshot.rs +++ b/src/firecracker/src/api_server/request/snapshot.rs @@ -119,6 +119,8 @@ fn parse_put_snapshot_load(body: &Body) -> Result { resume_vm: snapshot_config.resume_vm, network_overrides: snapshot_config.network_overrides, clock_realtime: snapshot_config.clock_realtime, + #[cfg(feature = "gdb")] + gdb_socket_path: snapshot_config.gdb_socket_path, }; // Construct the `ParsedRequest` object. 
@@ -198,6 +200,8 @@ mod tests { resume_vm: false, network_overrides: vec![], clock_realtime: false, + #[cfg(feature = "gdb")] + gdb_socket_path: None, }; let mut parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap(); assert!( @@ -230,6 +234,8 @@ mod tests { resume_vm: false, network_overrides: vec![], clock_realtime: false, + #[cfg(feature = "gdb")] + gdb_socket_path: None, }; let mut parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap(); assert!( @@ -262,6 +268,8 @@ mod tests { resume_vm: true, network_overrides: vec![], clock_realtime: false, + #[cfg(feature = "gdb")] + gdb_socket_path: None, }; let mut parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap(); assert!( @@ -303,6 +311,8 @@ mod tests { host_dev_name: String::from("vmtap2"), }], clock_realtime: false, + #[cfg(feature = "gdb")] + gdb_socket_path: None, }; let mut parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap(); assert!( @@ -332,6 +342,8 @@ mod tests { resume_vm: true, network_overrides: vec![], clock_realtime: false, + #[cfg(feature = "gdb")] + gdb_socket_path: None, }; let parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap(); assert_eq!( @@ -435,9 +447,16 @@ mod tests { resume_vm: false, network_overrides: vec![], clock_realtime: false, + #[cfg(feature = "gdb")] + gdb_socket_path: None, }; let mut parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap(); - assert!(parsed_request.parsing_info().take_deprecation_message().is_none()); + assert!( + parsed_request + .parsing_info() + .take_deprecation_message() + .is_none() + ); assert_eq!( vmm_action_from_request(parsed_request), VmmAction::LoadSnapshot(expected_config) diff --git a/src/firecracker/swagger/firecracker.yaml b/src/firecracker/swagger/firecracker.yaml index 6fab5b3e94d..10fbaf37301 100644 --- a/src/firecracker/swagger/firecracker.yaml +++ b/src/firecracker/swagger/firecracker.yaml @@ -1758,6 +1758,13 @@ 
definitions: elapsed since the snapshot was taken. When false (default), kvmclock resumes from where it was at snapshot time. This option may be extended to other clock sources and CPU architectures in the future." + gdb_socket_path: + type: string + description: + "Only available when Firecracker is built with the `gdb` feature. When set, + start the GDB server on this unix socket for the restored guest, for + source-level debugging of the guest kernel. Debug builds only; not for + production." TokenBucket: diff --git a/src/snapshot-editor/src/info.rs b/src/snapshot-editor/src/info.rs index 97d06a3e5f9..e10cc6cee1c 100644 --- a/src/snapshot-editor/src/info.rs +++ b/src/snapshot-editor/src/info.rs @@ -66,6 +66,13 @@ fn info_vcpu_states(snapshot: &Snapshot) -> Result<(), InfoVmState for (i, state) in snapshot.data.vcpu_states.iter().enumerate() { println!("vcpu {i}:"); println!("{state:#?}"); + // The derived Debug of `saved_msrs` only shows the kvm_msrs headers, not + // the entries (a FAM array). Print index/data so tooling can read MSR + // values (e.g. LSTAR, to recover the KASLR slide from a snapshot). 
+ #[cfg(target_arch = "x86_64")] + for entry in state.saved_msrs.iter().flat_map(|m| m.as_slice()) { + println!(" msr index={:#x} data={:#x}", entry.index, entry.data); + } } Ok(()) } diff --git a/src/vmm/src/arch/aarch64/gic/mod.rs b/src/vmm/src/arch/aarch64/gic/mod.rs index 0fe0aa899b3..3b6da7e4e2c 100644 --- a/src/vmm/src/arch/aarch64/gic/mod.rs +++ b/src/vmm/src/arch/aarch64/gic/mod.rs @@ -7,9 +7,9 @@ mod regs; use gicv2::GICv2; use gicv3::GICv3; +pub use gicv3::regs::its_regs::ItsRegisterState; use kvm_ioctls::{DeviceFd, VmFd}; pub use regs::{GicRegState, GicState, GicVcpuState, VgicSysRegsState}; -pub use gicv3::regs::its_regs::ItsRegisterState; use super::layout; diff --git a/src/vmm/src/builder.rs b/src/vmm/src/builder.rs index 15be948861d..d6ed48089b5 100644 --- a/src/vmm/src/builder.rs +++ b/src/vmm/src/builder.rs @@ -133,6 +133,18 @@ impl std::convert::From for StartMicrovmError { } } +/// Resolves the GDB unix socket path. An explicit `machine-config.gdb_socket_path` +/// takes precedence; otherwise fall back to the `FIRECRACKER_GDB_SOCKET` environment +/// variable. The env fallback lets tooling that launches Firecracker (e.g. the e2b +/// orchestrator / resume-build, which inherit the environment) enable GDB without +/// setting machine-config. +#[cfg(feature = "gdb")] +fn resolve_gdb_socket_path(configured: &Option) -> Option { + configured + .clone() + .or_else(|| std::env::var("FIRECRACKER_GDB_SOCKET").ok()) +} + /// Builds and starts a microVM based on the current Firecracker VmResources configuration. /// /// The built microVM and all the created vCPUs start off in the paused state. 
@@ -343,9 +355,16 @@ pub fn build_microvm_for_boot( .map_err(VmmError::VcpuStart)?; #[cfg(feature = "gdb")] - if let Some(gdb_socket_path) = &vm_resources.machine_config.gdb_socket_path { - gdb::gdb_thread(vmm.clone(), gdb_rx, entry_point.entry_addr, gdb_socket_path) - .map_err(StartMicrovmError::GdbServer)?; + if let Some(gdb_socket_path) = + resolve_gdb_socket_path(&vm_resources.machine_config.gdb_socket_path) + { + gdb::gdb_thread( + vmm.clone(), + gdb_rx, + entry_point.entry_addr, + &gdb_socket_path, + ) + .map_err(StartMicrovmError::GdbServer)?; } else { debug!("No GDB socket provided not starting gdb server."); } @@ -528,6 +547,31 @@ pub fn build_microvm_from_snapshot( page_size: vm_resources.machine_config.huge_pages.page_size(), }; + // GDB debug support for restored microVMs (x86_64 only). Mirror the boot + // path: attach the debug-event channel to every restored vCPU before they + // start, then start the GDB server thread once the vCPUs are running. The + // server arms a hardware breakpoint at the restored instruction pointer so + // GDB takes control at the resume point on the first continue. + // + // Only wire the channel up when a GDB socket is actually configured: with no + // socket, no server thread drains the receiver, so a vCPU debug event would + // `send` on a dropped receiver and panic. Gating the attach keeps the channel + // paired with its consumer (and leaves the vCPUs' gdb_event as None otherwise). + #[cfg(all(feature = "gdb", target_arch = "x86_64"))] + let gdb_socket_path = + resolve_gdb_socket_path(&vm_resources.machine_config.gdb_socket_path); + + #[cfg(all(feature = "gdb", target_arch = "x86_64"))] + let gdb_rx = if gdb_socket_path.is_some() { + let (gdb_tx, gdb_rx) = mpsc::channel(); + vcpus + .iter_mut() + .for_each(|vcpu| vcpu.attach_debug_info(gdb_tx.clone())); + Some(gdb_rx) + } else { + None + }; + // Move vcpus to their own threads and start their state machine in the 'Paused' state. 
vmm.start_vcpus( vcpus, @@ -540,6 +584,17 @@ pub fn build_microvm_from_snapshot( let vmm = Arc::new(Mutex::new(vmm)); event_manager.add_subscriber(vmm.clone()); + #[cfg(all(feature = "gdb", target_arch = "x86_64"))] + if let Some(gdb_socket_path) = gdb_socket_path { + // On restore the vCPUs resume at their saved RIP; arm the entry + // breakpoint there so GDB stops at the resume point. + let entry_addr = GuestAddress(microvm_state.vcpu_states[0].regs.rip); + gdb::gdb_thread(vmm.clone(), gdb_rx.unwrap(), entry_addr, &gdb_socket_path) + .map_err(StartMicrovmError::GdbServer)?; + } else { + debug!("No GDB socket provided not starting gdb server."); + } + // Load seccomp filters for the VMM thread. // Keep this as the last step of the building process. crate::seccomp::apply_filter( diff --git a/src/vmm/src/devices/virtio/balloon/device.rs b/src/vmm/src/devices/virtio/balloon/device.rs index db6da845919..ce182d22bd8 100644 --- a/src/vmm/src/devices/virtio/balloon/device.rs +++ b/src/vmm/src/devices/virtio/balloon/device.rs @@ -21,12 +21,12 @@ use super::{ MIB_TO_4K_PAGES, STATS_INDEX, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, VIRTIO_BALLOON_F_FREE_PAGE_HINTING, VIRTIO_BALLOON_F_FREE_PAGE_REPORTING, VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK, VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_PFN_SHIFT, - VIRTIO_BALLOON_S_ALLOC_STALL, - VIRTIO_BALLOON_S_ASYNC_RECLAIM, VIRTIO_BALLOON_S_ASYNC_SCAN, VIRTIO_BALLOON_S_AVAIL, - VIRTIO_BALLOON_S_CACHES, VIRTIO_BALLOON_S_DIRECT_RECLAIM, VIRTIO_BALLOON_S_DIRECT_SCAN, - VIRTIO_BALLOON_S_HTLB_PGALLOC, VIRTIO_BALLOON_S_HTLB_PGFAIL, VIRTIO_BALLOON_S_MAJFLT, - VIRTIO_BALLOON_S_MEMFREE, VIRTIO_BALLOON_S_MEMTOT, VIRTIO_BALLOON_S_MINFLT, - VIRTIO_BALLOON_S_OOM_KILL, VIRTIO_BALLOON_S_SWAP_IN, VIRTIO_BALLOON_S_SWAP_OUT, + VIRTIO_BALLOON_S_ALLOC_STALL, VIRTIO_BALLOON_S_ASYNC_RECLAIM, VIRTIO_BALLOON_S_ASYNC_SCAN, + VIRTIO_BALLOON_S_AVAIL, VIRTIO_BALLOON_S_CACHES, VIRTIO_BALLOON_S_DIRECT_RECLAIM, + VIRTIO_BALLOON_S_DIRECT_SCAN, VIRTIO_BALLOON_S_HTLB_PGALLOC, 
VIRTIO_BALLOON_S_HTLB_PGFAIL, + VIRTIO_BALLOON_S_MAJFLT, VIRTIO_BALLOON_S_MEMFREE, VIRTIO_BALLOON_S_MEMTOT, + VIRTIO_BALLOON_S_MINFLT, VIRTIO_BALLOON_S_OOM_KILL, VIRTIO_BALLOON_S_SWAP_IN, + VIRTIO_BALLOON_S_SWAP_OUT, }; use crate::devices::virtio::balloon::BalloonError; use crate::devices::virtio::device::ActiveState; diff --git a/src/vmm/src/devices/virtio/block/virtio/device.rs b/src/vmm/src/devices/virtio/block/virtio/device.rs index e8b3c1bd428..258caa75217 100644 --- a/src/vmm/src/devices/virtio/block/virtio/device.rs +++ b/src/vmm/src/devices/virtio/block/virtio/device.rs @@ -197,7 +197,8 @@ pub struct ConfigSpace { pub max_write_zeroes_seg: u32, // offset 52 pub write_zeroes_may_unmap: u8, // offset 56 pub(crate) _unused1: [u8; 3], // offset 57 (spec field — virtio_blk_config.unused1) - pub(crate) _pad: [u8; 4], // offset 60 (Rust alignment padding to 64; spec ends at 60) + pub(crate) _pad: [u8; 4], /* offset 60 (Rust alignment padding to 64; spec ends + * at 60) */ } const _: () = assert!(std::mem::size_of::() == 64); // Compile-time guards against accidental layout drift. The byte offsets here diff --git a/src/vmm/src/devices/virtio/block/virtio/io/sync_io.rs b/src/vmm/src/devices/virtio/block/virtio/io/sync_io.rs index 28d5fe14026..faf555b9526 100644 --- a/src/vmm/src/devices/virtio/block/virtio/io/sync_io.rs +++ b/src/vmm/src/devices/virtio/block/virtio/io/sync_io.rs @@ -101,12 +101,7 @@ impl SyncFileEngine { } } - pub fn write_zeroes( - &mut self, - offset: u64, - len: u64, - unmap: bool, - ) -> Result<(), SyncIoError> { + pub fn write_zeroes(&mut self, offset: u64, len: u64, unmap: bool) -> Result<(), SyncIoError> { // UNMAP=1 reuses PUNCH_HOLE (the spec lets the device deallocate); // UNMAP=0 must zero in place without deallocating, so use ZERO_RANGE. 
let mode = if unmap { diff --git a/src/vmm/src/gdb/event_loop.rs b/src/vmm/src/gdb/event_loop.rs index 13d1f438a47..0f72d9d4eca 100644 --- a/src/vmm/src/gdb/event_loop.rs +++ b/src/vmm/src/gdb/event_loop.rs @@ -24,7 +24,7 @@ pub fn event_loop( gdb_event_receiver: Receiver, entry_addr: GuestAddress, ) { - let target = FirecrackerTarget::new(vmm, gdb_event_receiver, entry_addr); + let mut target = FirecrackerTarget::new(vmm, gdb_event_receiver, entry_addr); let connection: Box> = { Box::new(connection) }; let debugger = GdbStub::new(connection); @@ -34,6 +34,13 @@ pub fn event_loop( .recv() .expect("Error getting initial gdb event"); + // All-stop: the initial breakpoint only stops the triggering vCPU; halt the + // others too so the whole VM is stopped when gdb attaches (this initial stop is + // consumed here rather than in `wait_for_stop_reason`). + target + .pause_all_vcpus() + .expect("Error pausing vcpus on initial stop"); + gdb_event_loop_thread(debugger, target); } @@ -85,6 +92,13 @@ impl run_blocking::BlockingEventLoop for GdbBlockingEventLoop { continue; }; + // All-stop: halt the still-running sibling vCPUs so GDB sees a + // fully-stopped VM. Without this, querying a running vCPU (e.g. + // `info threads`) blocks indefinitely. + target + .pause_all_vcpus() + .map_err(WaitForStopReasonError::Target)?; + trace!("Returned stop reason to gdb: {stop_response:?}"); return Ok(run_blocking::Event::TargetStopped(stop_response)); } @@ -112,7 +126,9 @@ impl run_blocking::BlockingEventLoop for GdbBlockingEventLoop { // notify the target that a ctrl-c interrupt has occurred. let main_core = vcpuid_to_tid(0)?; - target.pause_vcpu(main_core)?; + // All-stop: pause every vCPU, not just the main one, so the whole VM is + // halted while GDB inspects it. 
+ target.pause_all_vcpus()?; target.set_paused_vcpu(main_core); let exit_reason = MultiThreadStopReason::SignalWithThread { diff --git a/src/vmm/src/gdb/target.rs b/src/vmm/src/gdb/target.rs index af3df20d25a..81c26c287e3 100644 --- a/src/vmm/src/gdb/target.rs +++ b/src/vmm/src/gdb/target.rs @@ -219,6 +219,15 @@ impl FirecrackerTarget { /// Resumes execution of all paused Vcpus, update them with current kvm debug info /// and resumes fn resume_all_vcpus(&mut self) -> Result<(), GdbTargetError> { + // Every vcpu is paused at this point (all-stop), so it is blocked in its + // emulation loop and cannot emit a debug event. Any event still queued is + // therefore stale: a sibling that also hit the breakpoint before we stopped + // the VM. Drain these now — if left queued, the next `wait_for_stop_reason` + // would process a stale event against a vcpu that has since resumed, marking + // a running vcpu as paused and desyncing the pause/resume handshake until the + // vcpu threads exit (surfacing as a fatal GdbQueueError). + while self.gdb_event.try_recv().is_ok() {} + for idx in 0..self.vcpu_state.len() { self.update_vcpu_kvm_debug(idx, &self.hw_breakpoints)?; } @@ -233,6 +242,23 @@ impl FirecrackerTarget { Ok(()) } + /// Pauses every vcpu that is still running so the whole VM is stopped while GDB + /// is in control (all-stop semantics, as QEMU does via `pause_all_vcpus`). The + /// vcpu that triggered the stop is already paused; this halts its still-running + /// siblings so they can be enumerated and inspected (`info threads`, `thread N`, + /// per-vCPU backtraces). `send_event` kicks a running or halted vcpu out of + /// `KVM_RUN`, so this completes even for idle siblings. 
+ pub fn pause_all_vcpus(&mut self) -> Result<(), GdbTargetError> { + for cpu_id in 0..self.vcpu_state.len() { + if !self.vcpu_state[cpu_id].paused { + let tid = vcpuid_to_tid(cpu_id)?; + self.pause_vcpu(tid)?; + } + } + + Ok(()) + } + /// Resets all Vcpus to their base state fn reset_all_vcpu_states(&mut self) { for value in self.vcpu_state.iter_mut() { @@ -344,12 +370,12 @@ impl Target for FirecrackerTarget { type Arch = GdbArch; #[inline(always)] - fn base_ops(&mut self) -> BaseOps { + fn base_ops(&mut self) -> BaseOps<'_, Self::Arch, Self::Error> { BaseOps::MultiThread(self) } #[inline(always)] - fn support_breakpoints(&mut self) -> Option> { + fn support_breakpoints(&mut self) -> Option> { Some(self) } @@ -471,7 +497,7 @@ impl MultiThreadBase for FirecrackerTarget { } #[inline(always)] - fn support_resume(&mut self) -> Option> { + fn support_resume(&mut self) -> Option> { Some(self) } @@ -526,12 +552,12 @@ impl MultiThreadSingleStep for FirecrackerTarget { impl Breakpoints for FirecrackerTarget { #[inline(always)] - fn support_hw_breakpoint(&mut self) -> Option> { + fn support_hw_breakpoint(&mut self) -> Option> { Some(self) } #[inline(always)] - fn support_sw_breakpoint(&mut self) -> Option> { + fn support_sw_breakpoint(&mut self) -> Option> { Some(self) } } diff --git a/src/vmm/src/persist/mod.rs b/src/vmm/src/persist/mod.rs index 0f0c47d1780..92a8d0229fe 100644 --- a/src/vmm/src/persist/mod.rs +++ b/src/vmm/src/persist/mod.rs @@ -394,8 +394,11 @@ pub fn restore_from_snapshot( cpu_template: Some(microvm_state.vm_info.cpu_template), track_dirty_pages: Some(track_dirty_pages), huge_pages: Some(microvm_state.vm_info.huge_pages), + // GDB socket is a restore-time override carried on the load request, + // applied here so the restore-path gdb server (which reads + // machine_config.gdb_socket_path) starts. 
#[cfg(feature = "gdb")] - gdb_socket_path: None, + gdb_socket_path: params.gdb_socket_path.clone(), }) .map_err(BuildMicrovmFromSnapshotError::VmUpdateConfig)?; @@ -702,8 +705,7 @@ mod tests { use crate::vmm_config::balloon::BalloonDeviceConfig; use crate::vmm_config::net::NetworkInterfaceConfig; use crate::vmm_config::vsock::tests::default_config; - use crate::vstate::memory::create_memfd; - use crate::vstate::memory::{GuestMemoryRegionState, GuestRegionType}; + use crate::vstate::memory::{GuestMemoryRegionState, GuestRegionType, create_memfd}; fn default_vmm_with_devices() -> Vmm { let mut event_manager = EventManager::new().expect("Cannot create EventManager"); diff --git a/src/vmm/src/persist/v1_10/aarch64.rs b/src/vmm/src/persist/v1_10/aarch64.rs index c85896a0b32..09425f1b530 100644 --- a/src/vmm/src/persist/v1_10/aarch64.rs +++ b/src/vmm/src/persist/v1_10/aarch64.rs @@ -3,12 +3,8 @@ use serde::{Deserialize, Serialize}; -use crate::cpu_config::templates::KvmCapability; use super::MMIODeviceInfo; - -// Types that are identical across all versions — canonical definitions in v1_14. -pub use crate::persist::v1_14::DeviceType; - +use crate::cpu_config::templates::KvmCapability; // Types that are identical in v1.10 and v1.12 — canonical definitions in v1_12. pub use crate::persist::v1_12::{ // aarch64 GicState is identical in v1.10 and v1.12 (gains its_state in v1.14) @@ -16,6 +12,8 @@ pub use crate::persist::v1_12::{ // aarch64 VcpuState is identical in v1.10 and v1.12 (gains pvtime_ipa in v1.14) VcpuState, }; +// Types that are identical across all versions — canonical definitions in v1_14. 
+pub use crate::persist::v1_14::DeviceType; // ─────────────────────────────────────────────────────────────────── // aarch64 legacy device info (v1.10 layout: uses v1.10 MMIODeviceInfo with irqs: Vec) diff --git a/src/vmm/src/persist/v1_10/mod.rs b/src/vmm/src/persist/v1_10/mod.rs index f95ce37bdca..1c9e3c51204 100644 --- a/src/vmm/src/persist/v1_10/mod.rs +++ b/src/vmm/src/persist/v1_10/mod.rs @@ -28,12 +28,6 @@ pub(crate) mod aarch64; #[cfg(target_arch = "aarch64")] pub use aarch64::*; -// ─────────────────────────────────────────────────────────────────── -// Types identical to v1.12 — imported from that module (canonical source) -// ─────────────────────────────────────────────────────────────────── - -use crate::persist::VmInfo; - pub use super::v1_12::{ // ACPI device manager state (used in MicrovmState defined below) ACPIDeviceManagerState, @@ -48,6 +42,10 @@ pub use super::v1_12::{ NetState, VsockState, }; +// ─────────────────────────────────────────────────────────────────── +// Types identical to v1.12 — imported from that module (canonical source) +// ─────────────────────────────────────────────────────────────────── +use crate::persist::VmInfo; // ─────────────────────────────────────────────────────────────────── // MMIO device info (v1.10 uses `irqs: Vec`, changed to `irq: Option` in v1.11) diff --git a/src/vmm/src/persist/v1_12/aarch64.rs b/src/vmm/src/persist/v1_12/aarch64.rs index f57079ea6ae..3d6f1d589cb 100644 --- a/src/vmm/src/persist/v1_12/aarch64.rs +++ b/src/vmm/src/persist/v1_12/aarch64.rs @@ -5,16 +5,15 @@ use kvm_bindings::{kvm_mp_state, kvm_vcpu_init}; use serde::{Deserialize, Serialize}; use super::{GuestMemoryState, MMIODeviceInfo}; - // Types that are canonical in v1_14 and unchanged through all versions pub use crate::persist::v1_14::{ + // Register vector with custom serde + Aarch64RegisterVec, // Legacy device type enum DeviceType, // GIC helper types (GicState itself changed — its_state added — so redefined in v1_14) 
GicRegState, GicVcpuState, - // Register vector with custom serde - Aarch64RegisterVec, }; // ─────────────────────────────────────────────────────────────────── diff --git a/src/vmm/src/persist/v1_12/x86_64.rs b/src/vmm/src/persist/v1_12/x86_64.rs index 912bb10b7ab..c8a1eaa1bb9 100644 --- a/src/vmm/src/persist/v1_12/x86_64.rs +++ b/src/vmm/src/persist/v1_12/x86_64.rs @@ -4,9 +4,9 @@ use kvm_bindings::{kvm_clock_data, kvm_irqchip, kvm_pit_state2}; use serde::{Deserialize, Serialize}; -use crate::{arch::VcpuState, persist::v1_14::x86_64::xsave_from_v1_10}; - use super::{GuestMemoryState, v1_10}; +use crate::arch::VcpuState; +use crate::persist::v1_14::x86_64::xsave_from_v1_10; // ─────────────────────────────────────────────────────────────────── // Changed in v1.12: memory moved into VmState; kvm_cap_modifiers → KvmState diff --git a/src/vmm/src/persist/v1_14/aarch64.rs b/src/vmm/src/persist/v1_14/aarch64.rs index 8ac93332eb9..690034248b1 100644 --- a/src/vmm/src/persist/v1_14/aarch64.rs +++ b/src/vmm/src/persist/v1_14/aarch64.rs @@ -3,20 +3,20 @@ use serde::{Deserialize, Serialize}; -use super::{ACPIDeviceManagerState, ConvertError, GuestMemoryState, MMIODeviceInfo, - ResourceAllocator, irq_to_gsi}; -use crate::devices::acpi::vmgenid::VMGenIDState; -use crate::persist::v1_12; - +use super::{ + ACPIDeviceManagerState, ConvertError, GuestMemoryState, MMIODeviceInfo, ResourceAllocator, + irq_to_gsi, +}; // ─────────────────────────────────────────────────────────────────── // Re-export runtime types — v1.14 snapshot format matches the runtime format. // These are used by v1.12 (and v1.10 via v1.12) as canonical type definitions. 
// ─────────────────────────────────────────────────────────────────── - pub use crate::arch::aarch64::gic::{GicRegState, GicState, GicVcpuState}; pub use crate::arch::aarch64::regs::Aarch64RegisterVec; pub use crate::arch::aarch64::vcpu::VcpuState; pub use crate::arch::aarch64::vm::VmState; +use crate::devices::acpi::vmgenid::VMGenIDState; +use crate::persist::v1_12; // ─────────────────────────────────────────────────────────────────── // StaticCpuTemplate — aarch64-specific snapshot enum (same in v1.10, v1.12, v1.14) diff --git a/src/vmm/src/persist/v1_14/mod.rs b/src/vmm/src/persist/v1_14/mod.rs index eb780bbe6f5..980c26903b6 100644 --- a/src/vmm/src/persist/v1_14/mod.rs +++ b/src/vmm/src/persist/v1_14/mod.rs @@ -18,8 +18,8 @@ //! - aarch64 `VcpuState`: gains `pvtime_ipa` //! - `GuestMemoryRegionState`: gains `region_type` and `plugged` //! - `ACPIDeviceManagerState`: vmgenid now mandatory, adds vmclock (x86_64) -//! - New types: `ConnectedDeviceState`, `DevicesState`, `ResourceAllocator`, -//! `PmemState`, `VirtioMemState`, `MmdsState`, `GuestRegionType`, etc. +//! - New types: `ConnectedDeviceState`, `DevicesState`, `ResourceAllocator`, `PmemState`, +//! `VirtioMemState`, `MmdsState`, `GuestRegionType`, etc. 
use vm_allocator::{AddressAllocator, AllocPolicy, IdAllocator}; @@ -31,21 +31,21 @@ pub(crate) mod aarch64; #[cfg(target_arch = "aarch64")] pub use aarch64::*; +#[cfg(target_arch = "x86_64")] +use crate::arch::VmState; use crate::arch::{ FIRST_ADDR_PAST_64BITS_MMIO, GSI_LEGACY_END, GSI_LEGACY_START, GSI_MSI_END, GSI_MSI_START, MEM_32BIT_DEVICES_SIZE, MEM_32BIT_DEVICES_START, MEM_64BIT_DEVICES_SIZE, MEM_64BIT_DEVICES_START, PAST_64BITS_MMIO_SIZE, SYSTEM_MEM_SIZE, SYSTEM_MEM_START, }; -#[cfg(target_arch = "x86_64")] -use crate::arch::VmState; use crate::device_manager::DevicesState; use crate::device_manager::mmio::MMIODeviceInfo; use crate::device_manager::pci_mngr::PciDevicesState; +#[cfg(target_arch = "aarch64")] +use crate::device_manager::persist::ConnectedLegacyState; use crate::device_manager::persist::{ ACPIDeviceManagerState, DeviceStates, MmdsState, VirtioDeviceState as ConnectedDeviceState, }; -#[cfg(target_arch = "aarch64")] -use crate::device_manager::persist::ConnectedLegacyState; use crate::devices::acpi::vmgenid::VMGENID_MEM_SIZE; use crate::devices::virtio::balloon::device::HintingState; use crate::devices::virtio::balloon::persist::{BalloonState, BalloonStatsState}; @@ -89,16 +89,17 @@ pub(crate) fn irq_to_gsi(irq: u32) -> u32 { impl VirtioDeviceState { /// Convert v1.12 VirtioDeviceState → v1.14 VirtioDeviceState. /// - /// With v1.14, the `interrupt_status` moves from [`VirtioDeviceState`] to [`MmioTransportState`]. - /// That's why we don't use `From` here, so we can return - /// `interrupt_status` separately. + /// With v1.14, the `interrupt_status` moves from [`VirtioDeviceState`] to + /// [`MmioTransportState`]. That's why we don't use `From` here, + /// so we can return `interrupt_status` separately. 
pub(crate) fn from(old_state: v1_12::VirtioDeviceState) -> (Self, u32) { let interrupt_status = old_state.interrupt_status; let new_state = VirtioDeviceState { device_type: old_state.device_type, avail_features: old_state.avail_features, acked_features: old_state.acked_features, - queues: old_state.queues, // QueueState is the same type (re-exported v1_10 → v1_12 → v1_14) + queues: old_state.queues, /* QueueState is the same type (re-exported v1_10 → v1_12 → + * v1_14) */ activated: old_state.activated, }; (new_state, interrupt_status) diff --git a/src/vmm/src/persist/v1_14/x86_64.rs b/src/vmm/src/persist/v1_14/x86_64.rs index d772c78016e..60956a9bce2 100644 --- a/src/vmm/src/persist/v1_14/x86_64.rs +++ b/src/vmm/src/persist/v1_14/x86_64.rs @@ -1,22 +1,17 @@ // Copyright 2025 Amazon.com, Inc. or its affiliates. All Rights Reserved. // SPDX-License-Identifier: Apache-2.0 +pub use kvm_bindings::Xsave; use kvm_bindings::kvm_xsave; use vm_allocator::AllocPolicy; -use super::v1_12; +use super::{ACPIDeviceManagerState, GuestMemoryState, ResourceAllocator, v1_12}; +use crate::arch::VmState; use crate::devices::acpi::generated::vmclock_abi::{ VMCLOCK_COUNTER_INVALID, VMCLOCK_MAGIC, VMCLOCK_STATUS_UNKNOWN, vmclock_abi, }; -use crate::{ - arch::VmState, - devices::acpi::vmclock::{VMCLOCK_SIZE, VmClockState}, - persist::v1_14::ConvertError, -}; - -use super::{ACPIDeviceManagerState, GuestMemoryState, ResourceAllocator}; - -pub use kvm_bindings::Xsave; +use crate::devices::acpi::vmclock::{VMCLOCK_SIZE, VmClockState}; +use crate::persist::v1_14::ConvertError; // ─────────────────────────────────────────────────────────────────── // ACPI device state impl (x86_64: allocates vmclock) diff --git a/src/vmm/src/rpc_interface.rs b/src/vmm/src/rpc_interface.rs index a3fe8f7421f..dd7edd30f8b 100644 --- a/src/vmm/src/rpc_interface.rs +++ b/src/vmm/src/rpc_interface.rs @@ -1443,6 +1443,8 @@ mod tests { resume_vm: false, network_overrides: vec![], clock_realtime: false, + #[cfg(feature = 
"gdb")] + gdb_socket_path: None, }, ))); check_unsupported(runtime_request(VmmAction::SetEntropyDevice( diff --git a/src/vmm/src/utils/mod.rs b/src/vmm/src/utils/mod.rs index 4179be93fec..78049915798 100644 --- a/src/vmm/src/utils/mod.rs +++ b/src/vmm/src/utils/mod.rs @@ -5,12 +5,12 @@ pub mod byte_order; /// Module with network related helpers pub mod net; +/// Module with pagemap utilities +pub mod pagemap; /// Module with external libc functions pub mod signal; /// Module with state machine pub mod sm; -/// Module with pagemap utilities -pub mod pagemap; use std::fs::{File, OpenOptions}; use std::num::Wrapping; diff --git a/src/vmm/src/utils/pagemap.rs b/src/vmm/src/utils/pagemap.rs index fff9e1f5cb2..b9dcac14c03 100644 --- a/src/vmm/src/utils/pagemap.rs +++ b/src/vmm/src/utils/pagemap.rs @@ -1,3 +1,6 @@ +// Copyright 2026 Amazon.com, Inc. or its affiliates. All Rights Reserved. +// SPDX-License-Identifier: Apache-2.0 + //! Utilities for reading /proc/self/pagemap to track dirty pages. #![allow(clippy::cast_possible_wrap)] diff --git a/src/vmm/src/vmm_config/machine_config.rs b/src/vmm/src/vmm_config/machine_config.rs index a87e61eff9c..e337a5a9dcd 100644 --- a/src/vmm/src/vmm_config/machine_config.rs +++ b/src/vmm/src/vmm_config/machine_config.rs @@ -335,5 +335,4 @@ mod tests { assert!(deserialized.cpu_template.is_none()); } - } diff --git a/src/vmm/src/vmm_config/meminfo.rs b/src/vmm/src/vmm_config/meminfo.rs index 693ece6b4d4..7db58fc1344 100644 --- a/src/vmm/src/vmm_config/meminfo.rs +++ b/src/vmm/src/vmm_config/meminfo.rs @@ -1,3 +1,6 @@ +// Copyright 2026 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
+// SPDX-License-Identifier: Apache-2.0 + use serde::Serialize; use crate::persist::GuestRegionUffdMapping; diff --git a/src/vmm/src/vmm_config/snapshot.rs b/src/vmm/src/vmm_config/snapshot.rs index 4f73c8f90af..0a53b943a35 100644 --- a/src/vmm/src/vmm_config/snapshot.rs +++ b/src/vmm/src/vmm_config/snapshot.rs @@ -76,6 +76,10 @@ pub struct LoadSnapshotParams { /// advancing kvmclock by the wall-clock time elapsed since the snapshot was taken. When false /// (default), kvmclock resumes from where it was at snapshot time. pub clock_realtime: bool, + /// [gdb] When set, start the GDB server on this unix socket for the restored + /// guest. A restore-time override (not configured via machine-config). + #[cfg(feature = "gdb")] + pub gdb_socket_path: Option, } /// Stores the configuration for loading a snapshot that is provided by the user. @@ -108,6 +112,10 @@ pub struct LoadSnapshotConfig { /// [x86_64 only] When set to true, passes `KVM_CLOCK_REALTIME` to `KVM_SET_CLOCK` on restore. #[serde(default)] pub clock_realtime: bool, + /// [gdb] Unix socket path for the GDB server (debug builds only). + #[cfg(feature = "gdb")] + #[serde(default)] + pub gdb_socket_path: Option, } /// Stores the configuration used for managing snapshot memory. 
diff --git a/src/vmm/tests/integration_tests.rs b/src/vmm/tests/integration_tests.rs index c114ebbf411..333b5e05957 100644 --- a/src/vmm/tests/integration_tests.rs +++ b/src/vmm/tests/integration_tests.rs @@ -304,6 +304,8 @@ fn verify_load_snapshot(snapshot_file: TempFile, memory_file: TempFile) { resume_vm: true, network_overrides: vec![], clock_realtime: false, + #[cfg(feature = "gdb")] + gdb_socket_path: None, })) .unwrap(); @@ -390,6 +392,8 @@ fn verify_load_snap_disallowed_after_boot_resources(res: VmmAction, res_name: &s resume_vm: false, network_overrides: vec![], clock_realtime: false, + #[cfg(feature = "gdb")] + gdb_socket_path: None, }); let err = preboot_api_controller.handle_preboot_request(req); assert!( diff --git a/tests/framework/microvm.py b/tests/framework/microvm.py index 7ba32305187..bfcea8faa1f 100644 --- a/tests/framework/microvm.py +++ b/tests/framework/microvm.py @@ -1079,6 +1079,7 @@ def restore_from_snapshot( clock_realtime: bool = False, *, uffd_handler_name: str = None, + gdb_socket_path: str = None, ): """Restore a snapshot""" @@ -1136,6 +1137,12 @@ def restore_from_snapshot( if clock_realtime: optional_kwargs["clock_realtime"] = clock_realtime + # Restore-time GDB: start the gdb server on this socket. The guest is then + # held at the entry breakpoint, so the usual post-resume SSH check is skipped. 
+ if gdb_socket_path is not None: + optional_kwargs["gdb_socket_path"] = gdb_socket_path + self.gdb_socket = gdb_socket_path + self.api.snapshot_load.put( mem_backend=mem_backend, snapshot_path=str(jailed_vmstate), @@ -1144,7 +1151,7 @@ def restore_from_snapshot( **optional_kwargs, ) # This is not a "wait for boot", but rather a "VM still works after restoration" - if jailed_snapshot.net_ifaces and resume: + if jailed_snapshot.net_ifaces and resume and gdb_socket_path is None: self.wait_for_ssh_up() return jailed_snapshot diff --git a/tests/integration_tests/functional/test_balloon_wait_on_ack.py b/tests/integration_tests/functional/test_balloon_wait_on_ack.py index 77cb3ebdf4e..ee95e377c68 100644 --- a/tests/integration_tests/functional/test_balloon_wait_on_ack.py +++ b/tests/integration_tests/functional/test_balloon_wait_on_ack.py @@ -66,9 +66,9 @@ def test_fph_wait_on_ack_negotiated(uvm_plain_6_1): features = _read_balloon_features(vm) # Format: LSB-first '0'/'1' string. - assert features[VIRTIO_BALLOON_F_FREE_PAGE_HINT] == "1", ( - f"FREE_PAGE_HINT (bit 3) not negotiated; features={features!r}" - ) + assert ( + features[VIRTIO_BALLOON_F_FREE_PAGE_HINT] == "1" + ), f"FREE_PAGE_HINT (bit 3) not negotiated; features={features!r}" assert features[VIRTIO_BALLOON_F_HINT_WAIT_ON_ACK] == "1", ( f"HINT_WAIT_ON_ACK (bit 6) not negotiated; features={features!r}. " "The guest kernel likely lacks the wait-on-ACK patch — did you " diff --git a/tests/integration_tests/functional/test_drive_virtio.py b/tests/integration_tests/functional/test_drive_virtio.py index 25d8f2cd421..daa00c712a1 100644 --- a/tests/integration_tests/functional/test_drive_virtio.py +++ b/tests/integration_tests/functional/test_drive_virtio.py @@ -419,7 +419,9 @@ def test_discard(uvm_plain_any, microvm_factory, io_engine): # Disk is still mounted in the restored guest; write+trim again. 
_fill_and_trim(vm.ssh) st = os.stat(fs.path) - assert st.st_blocks * 512 < st.st_size, "backing file has no holes after trim post-restore" + assert ( + st.st_blocks * 512 < st.st_size + ), "backing file has no holes after trim post-restore" metrics = vm.flush_metrics() assert metrics["block"]["discard_count"] > 0 @@ -451,12 +453,10 @@ def _exercise_write_zeroes(ssh): """Write random data, issue blkdiscard -z, verify zeros on /dev/vdb.""" # Sysfs check: the kernel populates write_zeroes_max_bytes from the # negotiated feature; a non-zero value proves the feature is advertised. - _, stdout, _ = ssh.check_output( - "cat /sys/block/vdb/queue/write_zeroes_max_bytes" - ) - assert int(stdout.strip()) > 0, ( - f"Expected non-zero write_zeroes_max_bytes, got: {stdout.strip()}" - ) + _, stdout, _ = ssh.check_output("cat /sys/block/vdb/queue/write_zeroes_max_bytes") + assert ( + int(stdout.strip()) > 0 + ), f"Expected non-zero write_zeroes_max_bytes, got: {stdout.strip()}" # Write random non-zero data so we can tell zeroing apart from # "the device was already zero". ssh.check_output("dd if=/dev/urandom of=/dev/vdb bs=1M count=1 conv=fsync") @@ -525,8 +525,6 @@ def test_write_zeroes_not_advertised_for_read_only(uvm_plain_any, io_engine): _, stdout, _ = vm.ssh.check_output( "cat /sys/block/vdb/queue/write_zeroes_max_bytes" ) - assert stdout.strip() == "0", ( - f"Expected write_zeroes_max_bytes=0 for read-only device, got: {stdout.strip()}" - ) - - + assert ( + stdout.strip() == "0" + ), f"Expected write_zeroes_max_bytes=0 for read-only device, got: {stdout.strip()}" diff --git a/tests/integration_tests/functional/test_gdb_restore.py b/tests/integration_tests/functional/test_gdb_restore.py new file mode 100644 index 00000000000..253d608060d --- /dev/null +++ b/tests/integration_tests/functional/test_gdb_restore.py @@ -0,0 +1,317 @@ +# Copyright 2026 Amazon.com, Inc. or its affiliates. All Rights Reserved. 
+# SPDX-License-Identifier: Apache-2.0 +"""GDB debugging of a microVM *restored from a snapshot* (the e2b resume path). + +Upstream Firecracker only wires GDB into the fresh-boot path; this exercises the +restore-path wiring added to `build_microvm_from_snapshot`. It boots a multi-vCPU +VM on the production kernel built with DWARF (KASLR *on*, as in prod), snapshots +it, restores into a new VM (file/UFFD-backed, 4K/2M hugetlb), recovers the KASLR +image slide *from the snapshot itself*, attaches GDB, and checks that we can set a +breakpoint and print kernel structures/memory across multiple vCPUs. + +KASLR slide recovery: the kernel image is slid by a single offset, so +`slide = MSR_LSTAR - &entry_SYSCALL_64`, where `MSR_LSTAR` is read from the +snapshot's saved vcpu MSRs (via `snapshot-editor info-vmstate vcpu-states`) and +`&entry_SYSCALL_64` is the link-time address from the vmlinux symbols. Applied with +`add-symbol-file -o `. This mirrors how resume-build recovers the +slide in prod. +""" + +import base64 +import platform +import re +import subprocess +import tempfile +import time +from pathlib import Path + +import pytest + +import host_tools.cargo_build +from framework.microvm import HugePagesConfig, MicroVMFactory + +# Production kernel (6.1.158) built with DWARF, KASLR on — same config as prod, +# only debug info added. Placed here by the test setup. +KERNEL = Path(__file__).parents[3] / "build/img/x86_64/vmlinux-6.1.158-dwarf" + +GDB_TIMEOUT = 40 + + +def _recover_slide(snapshot_editor, vmstate_path, vmlinux): + """Recover the KASLR image slide from the snapshot. Uses MSR_LSTAR (the syscall + entry, i.e. entry_SYSCALL_64 — kernel text, slid with the image) minus the + link-time address of entry_SYSCALL_64. 
(IDTR/GDTR are mapped in the fixed + cpu_entry_area, not slid with the image, so they can't be used.)""" + out = subprocess.check_output( + [ + str(snapshot_editor), + "info-vmstate", + "vcpu-states", + "--vmstate-path", + str(vmstate_path), + ], + text=True, + ) + m = re.search(r"msr index=0xc0000082 data=0x([0-9a-fA-F]+)", out) # MSR_LSTAR + assert m, f"MSR_LSTAR not found in vmstate dump:\n{out[-2000:]}" + lstar = int(m.group(1), 16) + + link = subprocess.check_output( + f"readelf -sW {vmlinux} | awk '$NF==\"entry_SYSCALL_64\"{{print $2; exit}}'", + shell=True, + text=True, + ).strip() + assert link, "entry_SYSCALL_64 symbol not found in vmlinux" + return lstar - int(link, 16) + + +def _spawn_gdb(gdb_socket, out_path, commands): + """Drive gdb in batch mode against FC's gdbstub, writing all output to + `out_path`. No symbol file on the command line — symbols are loaded in-script + with the recovered slide. Polls for the socket (created inside FC's restore + path, which then blocks for the connection).""" + with tempfile.NamedTemporaryFile( + mode="w", suffix=".gdb", delete=False, prefix="fc_gdb_restore_" + ) as f: + f.write(commands) + gdb_script = f.name + + return subprocess.Popen( + f""" + until [ -S {gdb_socket} ]; do sleep 0.2; done; + exec gdb -q -batch -x {gdb_script} > {out_path} 2>&1 + """, + shell=True, + ) + + +def _prelude(slide, gdb_socket): + """gdb commands to load slid symbols and connect.""" + return f""" + set pagination off + set confirm off + add-symbol-file {KERNEL} -o {slide} + target remote {gdb_socket} + """ + + +# Hugetlbfs guest memory is anonymous MAP_HUGETLB, which the File restore backend +# can't mmap — so the 2M case uses UFFD (also the production backing). 
+@pytest.mark.parametrize( + "use_uffd,huge_pages", + [ + (False, HugePagesConfig.NONE), + (True, HugePagesConfig.NONE), + (True, HugePagesConfig.HUGETLBFS_2MB), + ], + ids=["file-4k", "uffd-4k", "uffd-2M"], +) +@pytest.mark.skipif( + platform.machine() != "x86_64", reason="restore-path GDB wiring is x86_64-only" +) +def test_gdb_restore(use_uffd, huge_pages, rootfs): + """Restore a snapshot under GDB and debug the (KASLR-on) guest kernel.""" + bin_dir = host_tools.cargo_build.build_gdb() + if use_uffd: + host_tools.cargo_build.cargo( + "build", + f"--example uffd_on_demand_handler --features gdb " + f"--target {host_tools.cargo_build.DEFAULT_TARGET}", + env={"CARGO_TARGET_DIR": str(bin_dir.parents[1])}, + ) + vmfcty = MicroVMFactory(bin_dir) + + base = vmfcty.build(KERNEL, rootfs) + base.memory_monitor = None + base.spawn() + base.basic_config(vcpu_count=2, mem_size_mib=512, huge_pages=huge_pages) + base.add_net_iface() + base.start() + base.wait_for_ssh_up() + snapshot = base.snapshot_full() + slide = _recover_slide(bin_dir / "snapshot-editor", snapshot.vmstate, KERNEL) + base.kill() + + uvm = vmfcty.build() + uvm.memory_monitor = None + uvm.spawn(validate_api=False) + gdb_socket = Path(uvm.jailer.chroot_path(), "gdb.socket") + gdb_out = Path(uvm.path) / "gdb_out.txt" + + gdb_commands = ( + _prelude(slide, gdb_socket) + + """ + echo \\n=== STRUCT ===\\n + print sizeof(struct task_struct) + print init_task.pid + print init_task.comm + echo \\n=== MEMORY ===\\n + x/2xg &init_task + echo \\n=== THREADS ===\\n + info threads + echo \\n=== THREAD2-BT ===\\n + thread 2 + bt + echo \\n=== BREAKPOINT ===\\n + thread 1 + break do_idle + continue + bt + echo \\n=== DONE ===\\n + kill + """ + ) + gdb_proc = _spawn_gdb(gdb_socket, gdb_out, gdb_commands) + + uffd_handler_name = "on_demand" if use_uffd else None + uvm.restore_from_snapshot( + snapshot, + resume=True, + uffd_handler_name=uffd_handler_name, + gdb_socket_path="gdb.socket", + ) + + timed_out = False + try: + 
gdb_proc.wait(timeout=GDB_TIMEOUT) + except subprocess.TimeoutExpired: + timed_out = True + gdb_proc.kill() + + out = gdb_out.read_text() if gdb_out.exists() else "(no gdb output captured)" + diag = f"\nslide={slide:#x} timed_out={timed_out}\n--- gdb output ---\n{out}" + + assert not timed_out, f"gdb did not finish in {GDB_TIMEOUT}s:{diag}" + assert "=== DONE ===" in out, f"gdb script did not run to completion:{diag}" + assert "swapper" in out, f"init_task.comm (swapper) not read:{diag}" + assert "$1 = " in out, f"sizeof(struct task_struct) not resolved:{diag}" + assert ( + "Breakpoint 1, " in out and "do_idle" in out + ), f"breakpoint on do_idle not hit:{diag}" + assert out.count("Vcpu ID:") >= 2, f"both vCPUs not enumerated by gdb:{diag}" + assert ( + "#0 " in out.split("=== THREAD2-BT ===", 1)[-1] + ), f"per-vCPU backtrace of vCPU 1 not resolved:{diag}" + + uvm.kill() + + +# A guest workload that continuously page-faults: repeatedly mmap an anonymous +# region and write every page, attributed to comm "python3". Throttled so it +# faults steadily without starving sshd. +_FAULTER_PY = b"""import mmap, time +ms = [] +while True: + m = mmap.mmap(-1, 4 * 1024 * 1024) + m.write(b"x" * (4 * 1024 * 1024)) + ms.append(m) + if len(ms) > 4: + ms.pop(0) + time.sleep(0.05) +""" + + +@pytest.mark.parametrize( + "huge_pages", + [HugePagesConfig.NONE, HugePagesConfig.HUGETLBFS_2MB], + ids=["4k", "2M"], +) +@pytest.mark.skipif( + platform.machine() != "x86_64", reason="restore-path GDB wiring is x86_64-only" +) +def test_gdb_restore_fault_attribution(huge_pages, rootfs): + """Useful application: attribute guest page faults during restore to the + responsible process and VMA — invisible to host/UFFD telemetry. 
Breaks + handle_mm_fault on the restored (KASLR-on) VM and reads, per fault, the + faulting process (vma->vm_mm->owner) + VMA + address from the SysV args.""" + bin_dir = host_tools.cargo_build.build_gdb() + host_tools.cargo_build.cargo( + "build", + f"--example uffd_on_demand_handler --features gdb " + f"--target {host_tools.cargo_build.DEFAULT_TARGET}", + env={"CARGO_TARGET_DIR": str(bin_dir.parents[1])}, + ) + vmfcty = MicroVMFactory(bin_dir) + + # Two vCPUs on purpose: both hammer handle_mm_fault, so the gdb event loop has to + # coalesce concurrent breakpoint hits and drain the stale debug events of the + # force-paused siblings on each resume. This is the regression test for that drain + # — without it the pause/resume handshake desyncs under the fault storm and the + # connection drops. + base = vmfcty.build(KERNEL, rootfs) + base.memory_monitor = None + base.spawn() + base.basic_config(vcpu_count=2, mem_size_mib=512, huge_pages=huge_pages) + base.add_net_iface() + base.start() + base.wait_for_ssh_up() + + b64 = base64.b64encode(_FAULTER_PY).decode() + base.ssh.check_output(f"echo {b64} | base64 -d > /tmp/faulter.py") + base.ssh.check_output("nohup python3 /tmp/faulter.py >/dev/null 2>&1 vm_mm + if $mm != 0 + set $task = $mm->owner + if $task != 0 + printf "FAULT comm=%s pid=%d addr=0x%lx vma=0x%lx-0x%lx flags=0x%lx\\n", $task->comm, $task->pid, $rsi, $vma->vm_start, $vma->vm_end, $vma->vm_flags + end + end + set $i = $i + 1 + end + echo \\n=== DONE ===\\n + kill + """ + ) + gdb_proc = _spawn_gdb(gdb_socket, gdb_out, gdb_commands) + uvm.restore_from_snapshot( + snapshot, + resume=True, + uffd_handler_name="on_demand", + gdb_socket_path="gdb.socket", + ) + + timed_out = False + try: + gdb_proc.wait(timeout=120) + except subprocess.TimeoutExpired: + timed_out = True + gdb_proc.kill() + + out = gdb_out.read_text() if gdb_out.exists() else "(no gdb output captured)" + diag = f"\nslide={slide:#x} timed_out={timed_out}\n--- gdb output ---\n{out}" + + assert not 
timed_out, f"gdb did not finish in 120s:{diag}" + assert "=== DONE ===" in out, f"gdb script did not run to completion:{diag}" + + faults = [ln for ln in out.splitlines() if ln.startswith("FAULT comm=")] + print("\nGuest faults attributed during restore (sample):") + print("\n".join(faults[:8])) + assert len(faults) >= 10, f"too few faults captured ({len(faults)}):{diag}" + assert any( + "comm=python3" in ln for ln in faults + ), f"workload process not attributed:{diag}" + vmas = re.findall(r"vma=0x([0-9a-f]+)-0x([0-9a-f]+)", out) + assert vmas and all( + int(s, 16) < int(e, 16) for s, e in vmas + ), f"no valid VMA ranges captured:{diag}" + + uvm.kill()
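
Note on the slide recovery used by `_recover_slide` in `test_gdb_restore.py`: the arithmetic is `slide = MSR_LSTAR - &entry_SYSCALL_64`, which can be tried in isolation without a snapshot or a vmlinux. The following is a minimal sketch, not the code in this change. It assumes the dump line format that the test's regex matches (`msr index=0xc0000082 data=0x...`). The function name `recover_slide` and both sample addresses are hypothetical, picked only to make the subtraction visible.

```python
import re

MSR_LSTAR = 0xC0000082  # x86-64 MSR that holds the SYSCALL entry point


def recover_slide(vmstate_dump: str, link_addr: int) -> int:
    """Return the runtime MSR_LSTAR value minus the link-time symbol address.

    `vmstate_dump` is text in the format the test's regex expects;
    `link_addr` is the link-time address of entry_SYSCALL_64.
    """
    pattern = rf"msr index={MSR_LSTAR:#x} data=0x([0-9a-fA-F]+)"
    m = re.search(pattern, vmstate_dump)
    if m is None:
        raise ValueError("MSR_LSTAR not found in vmstate dump")
    return int(m.group(1), 16) - link_addr


# Hypothetical inputs: not taken from a real snapshot or kernel image.
sample_dump = "msr index=0xc0000082 data=0xffffffff9c200080\n"
sample_link_addr = 0xFFFFFFFF81E00080

slide = recover_slide(sample_dump, sample_link_addr)
print(f"slide={slide:#x}")
```

The resulting offset is what the test passes to gdb via `add-symbol-file {KERNEL} -o {slide}`. Raising `ValueError` here stands in for the test's `assert m`, so a missing MSR fails loudly instead of yielding a bogus slide.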