Commit f313f81

support firecracker as hypervisor backend (#16)
* feat: add --fc persistent root flag for Firecracker backend selection

  Add --fc flag to select Firecracker as hypervisor backend. Validates mutual exclusion with --windows and rejects cloudimg (UEFI boot) since Firecracker only supports direct kernel boot. InitHypervisor dispatches based on config; FC returns stub error until the backend is implemented.

* feat: add hypervisor/firecracker package skeleton

  Create Firecracker backend package with Config (path helpers), main Firecracker struct (constructor, Inspect, List, Watchable), and helper utilities (toVM, path functions). Wire up InitHypervisor to create FC backend when --fc is set. Lifecycle methods are stubs pending implementation.

* feat: implement Firecracker Create + Start lifecycle

  Add FC REST API client (pre-boot config model), Create (COW disk + device-path cmdline), and Start (launch process → REST API config sequence → InstanceStart). FC references disks by /dev/vdX path since it lacks virtio serial support. Update overlay.sh init script to resolve both device paths and serial names.

* feat: implement Firecracker Stop and Delete

  Add Stop (SendCtrlAltDel → SIGTERM → SIGKILL) and Delete (stop-if-running → cleanup dirs → remove DB record) for the Firecracker backend. Follows the same patterns as CH.

* feat: implement Firecracker Console with PTY + Unix socket relay

  FC binds serial to process stdin/stdout. Create PTY pair at launch: slave → FC stdin/stdout, master → background relay process. The relay (self-exec with env var detection) listens on console.sock and bridges connections to the PTY master. Auto-exits when FC dies. Console() connects to console.sock, consistent with CH backend.
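The Stop escalation described above (SendCtrlAltDel → SIGTERM → SIGKILL) can be sketched as a generic ladder of steps. This is only an illustration, not Cocoon's code: `stopStep`, `escalateStop`, and `demoStop` are hypothetical names, and the "process" is a plain boolean rather than a real Firecracker instance.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stopStep is one rung of the escalation ladder (hypothetical type).
type stopStep struct {
	name string
	send func() error  // deliver the request or signal
	wait time.Duration // how long to wait for exit before escalating
}

// escalateStop tries each step in order and returns the name of the step
// after which exited() first reported true.
func escalateStop(ctx context.Context, steps []stopStep, exited func() bool, poll time.Duration) (string, error) {
	for _, s := range steps {
		if err := s.send(); err != nil {
			continue // a failed request just means we escalate sooner
		}
		deadline := time.Now().Add(s.wait)
		for {
			if exited() {
				return s.name, nil
			}
			if time.Now().After(deadline) {
				break // give up on this step, move to the next one
			}
			select {
			case <-ctx.Done():
				return "", ctx.Err()
			case <-time.After(poll):
			}
		}
	}
	return "", errors.New("process still running after all stop steps")
}

// demoStop simulates a guest that ignores Ctrl+Alt+Del but exits on SIGTERM.
func demoStop() (string, error) {
	alive := true
	steps := []stopStep{
		{name: "SendCtrlAltDel", send: func() error { return nil }, wait: 20 * time.Millisecond},
		{name: "SIGTERM", send: func() error { alive = false; return nil }, wait: 20 * time.Millisecond},
		{name: "SIGKILL", send: func() error { alive = false; return nil }, wait: 20 * time.Millisecond},
	}
	return escalateStop(context.Background(), steps, func() bool { return !alive }, time.Millisecond)
}

func main() {
	name, err := demoStop()
	fmt.Println(name, err)
}
```

In this shape, a force stop would simply pass a ladder without the first step, and a caller-supplied timeout would become the `wait` of the graceful step.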
* feat: implement Firecracker Snapshot, Clone, and Restore

  Add full snapshot lifecycle for FC backend:
  - Snapshot: pause → PUT /snapshot/create (vmstate+mem) → reflink COW → resume
  - Clone: extract → launch new FC → PUT /snapshot/load → reconfigure drives/NICs → resume
  - Restore: kill running → extract → new FC → snapshot/load → reconfigure → resume
  - Direct: hardlink mem, reflink COW, copy vmstate for local snapshots

  FC snapshot/load does not preserve drive/NIC config, so drives and networks are re-attached after load. Implements hypervisor.Direct interface for reflink-optimized local snapshot operations.

* feat: add Firecracker detection and install to doctor/check.sh

  Add FC_VERSION variable (v1.12.0), firecracker binary detection in check_binary, and auto-install from GitHub releases in --upgrade mode.

* docs: update README with Firecracker backend documentation

  Add --fc flag to global flags, Firecracker section with feature comparison matrix, limitations, OCI image compatibility notes. Update requirements, doctor, VM lifecycle, and shutdown behavior sections to reflect dual-backend support.
* fix: FC launch issues found during e2e testing

  - Pre-create FC log file (FC requires O_WRONLY|O_APPEND, no O_CREATE)
  - Use underscores in drive/iface IDs (FC rejects hyphens)
  - Add vmlinux extraction from vmlinuz (FC needs uncompressed ELF kernel)
  - Support zstd and gzip compressed kernels via CLI decompressor
  - Fix FC download URL in doctor/check.sh (tarball format)

* fix: address code review findings from /simplify

  - Guard boot pointer nil dereference in prepareOCI
  - Fix relayBidirectional goroutine leak: buffer 2, close conn, wait
  - Optimize ensureVmlinux: check ELF magic (4 bytes) and cache before reading full vmlinuz into memory
  - Extract magic strings to constants (driveIDFmt, ifaceIDFmt, cowFileName, FC action types, VM state strings)
  - Deep-copy SnapshotIDs map in toVM to prevent shared DB mutation
  - Return real error from decompressZstd when output is empty

* refactor: extract shared Backend struct and helpers to hypervisor/

  Extract ~650 lines of duplicated code from CH and FC backends into shared hypervisor/ layer:
  - Backend struct with BackendConfig interface: provides Inspect, List, ToVM, ResolveRef(s), LoadRecord, WithRunningVM, UpdateStates, MarkError, ReserveVM, RollbackCreate, ForEachVM, AbortLaunch
  - shared.go: EnterNetns, WaitForSocket, ExtractBlobIDs, BuildIPParams, PrefixToNetmask, CopyFile, RemoveVMDirs, CleanupRuntimeFiles, BlobHexFromPath, SocketPath, ConsoleSockPath
  - config.HypervisorType enum + switch-case in InitHypervisor
  - FC version updated to v1.15.0

* fix: handle CopyFile writable file close error

* fix: address P1/P2 review findings from PR #16

  P1 GC: Implement RegisterGC for FC backend; protects blob IDs referenced by FC VMs from garbage collection, mirroring CH's GC module.

  P1 Clone paths: Save cocoon.json metadata (StorageConfigs + BootConfig) in snapshot tar. Create temporary symlinks from source drive paths to clone paths before snapshot/load so FC finds drives at expected locations.
  Symlinks are cleaned up after load + reconfigure.

  P2 Rebuild: Replace fragile rebuildFromSnapshot (searched live VM records) with self-contained metadata from cocoon.json. Clones no longer depend on the source VM or any sibling VM existing in the DB.

  P2 Console relay: Add 3s timeout on second goroutine wait after client disconnect to prevent blocking the accept loop when PTY read is stuck.

* fix: GC registers both backends, doctor optional FC, debug rejects --fc

  P1: GC now registers ALL hypervisor backends (CH + FC) via InitAllHypervisors, protecting blobs from both backends on mixed-backend hosts regardless of --fc flag.

  P2: doctor/check.sh treats firecracker as optional; warns instead of failing when not installed, since it's only needed for --fc.

  P3: vm debug rejects --fc with a clear error since it only generates Cloud Hypervisor launch commands.

* fix: clone always redirects source COW, snapshot stores portable kernel path

  P1: createDriveRedirects now unconditionally redirects the source COW path to the clone's copy. When the source VM is still running, its cow.raw is renamed to a temporary backup, a symlink is placed, and after snapshot/load the backup is restored. This prevents FC from reopening the live source VM's disk state.

  P2: saveSnapshotMeta stores the portable vmlinuz path instead of the host-local vmlinux cache. cloneAfterExtract runs ensureVmlinux on the clone host to (re)create vmlinux from vmlinuz, making FC snapshots fully portable across hosts.

* fix: abort clone on redirect failure, store portable relative paths

  P2 redirect: createDriveRedirects now returns error. On symlink failure after backup rename, the backup is immediately restored and all prior redirects are cleaned up, preventing source VM disk corruption from a half-installed redirect.

  P2 portable paths: snapshot metadata (cocoon.json) now stores paths relative to root_dir using filepath.Rel. loadSnapshotMeta resolves them against the local host's root_dir.
  Snapshots exported from one host can be imported on another with a different Cocoon directory layout, as long as the same OCI image has been pulled.

* fix: persist hypervisor type in snapshots, serialize COW redirects

  P1: SnapshotConfig now carries a Hypervisor field ("cloud-hypervisor" or "firecracker") set during Snapshot(). Clone validates that the snapshot's backend matches the active backend before proceeding, with a clear error suggesting the correct flag.

  P2: COW redirect during clone is now serialized via a per-source-COW flock (.clone.lock). Concurrent snapshot/restore/clone operations on the source VM block until the redirect is cleaned up, preventing them from following the temporary symlink to the wrong disk.

* fix: include source COW path in snapshot metadata

  saveSnapshotMeta now stores ALL drive entries (RO layers + RW COW), not just RO entries. Without the source COW path, createDriveRedirects had no old→new mapping to redirect, so snapshot/load would reopen the live source cow.raw (if source VM exists) or fail (if deleted).

* fix: vmstate-aware redirects, COW lock in snapshot/restore, lock dir creation

  P1: acquireCOWLock (via lockCOWPath) now creates the parent directory before locking, fixing ENOENT when source VM has been deleted.

  P2: snapshotMeta stores SourceRootDir. vmstatePaths() reconstructs the original absolute paths baked into FC's vmstate binary. createDriveRedirects uses vmstate paths as symlink targets, so cross-host clones redirect at the correct (source host) paths.

  P2: COW flock is now taken in Snapshot and Restore too (via shared lockCOWPath helper), not just Clone. Concurrent snapshot/restore operations on the source VM are serialized with clone redirects.

* fix: Codex review - GC fail-fast, atomic vmlinux, zstd dep, relay redesign

  P1: InitAllHypervisors now returns error instead of silently skipping failed backends. GC aborts if any hypervisor can't be loaded, preventing blob deletion when pinning data is incomplete.
  P2: ensureVmlinux writes to a temp file and renames atomically, preventing concurrent readers from observing a truncated kernel cache.

  P2: Added zstd to doctor/check.sh binary checks; required by FC's kernel decompression but was previously an undeclared dependency.

  P2: Redesigned console relay to use a single persistent PTY reader goroutine with broadcaster pattern. Each session subscribes/unsubscribes via setSink(). No per-session read goroutines on the PTY master, eliminating stale goroutine data theft after disconnect.

* fix: Codex review round 2 - vmstate paths, optional zstd, stop flags

  P1: vmstatePaths() now reconstructs from raw relative paths saved before local resolution, so cross-host clones correctly redirect at source-host paths even when root_dir differs.

  P2: zstd treated as optional in doctor/check.sh (like firecracker), warns instead of failing on CH-only hosts.

  P3: FC Stop now honors --force (skip SendCtrlAltDel, immediate kill) and --timeout (wait for guest response before escalating). Added gracefulStop with SendCtrlAltDel → poll → forceTerminate pattern.

* fix: Codex review round 3 - snapshot Hypervisor field in export, devPath >26

  P2: snapshotRecordToConfig now copies the Hypervisor field so export/import preserves the backend tag. Clone validation works correctly after a round-trip.

  P2: devPath handles >26 drives with Linux-style multi-letter naming (vda..vdz, vdaa..vdaz, ...) for OCI images with deep layer stacks.

* fix: Codex review round 4 - FC CPU/memory override correctness, zstd install

  P1: FC clone/restore now clamp CPU/memory to snapshot's original values since FC cannot PATCH machine-config after snapshot/load. Snapshot metadata stores CPU/Memory for clone to use. Prevents metadata from advertising overrides FC didn't actually apply.

  P2: doctor --upgrade now installs zstd via apt-get/yum when missing, so fresh FC setups don't silently break on zstd-compressed kernels.
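The multi-letter device naming mentioned in review round 3 above (vda..vdz, vdaa..vdaz, ...) is a bijective base-26 encoding of the drive index. The sketch below shows one way to compute it; the function name `devPath` comes from the commit message, but this body is an assumption and not necessarily how Cocoon implements it.

```go
package main

import "fmt"

// devPath maps a zero-based drive index to a virtio-blk device path using the
// Linux naming sequence: vda..vdz, vdaa..vdzz, vdaaa, ...
// Sketch only: the real signature and error handling are not shown in the commit.
func devPath(index int) string {
	if index < 0 {
		return "" // no device for a negative index
	}
	n := index + 1 // switch to 1-based for bijective base-26
	var suffix []byte
	for n > 0 {
		n-- // shift so that 0..25 maps to 'a'..'z' at every position
		suffix = append([]byte{byte('a' + n%26)}, suffix...)
		n /= 26
	}
	return "/dev/vd" + string(suffix)
}

func main() {
	for _, i := range []int{0, 25, 26, 51, 52} {
		fmt.Println(i, devPath(i))
	}
}
```

With this encoding, indexes 0, 25, 26, 51, and 52 map to `/dev/vda`, `/dev/vdz`, `/dev/vdaa`, `/dev/vdaz`, and `/dev/vdba`.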
* fix: Codex round 5 - set clone VM ID, scope redirects to same-host

  P2: Set VM.ID in synthetic VMRecord for clone launchProcess so FC gets a valid --id flag instead of empty string.

  P2: Drive redirects now only apply for same-host clones (where SourceRootDir matches local rootDir). Cross-host clones skip redirects entirely; they require the same rootDir layout, and creating symlinks under a foreign path tree would be incorrect.

* fix: Codex round 6 - cross-host redirects, reject CPU/mem overrides, keep PTY

  P1: Always create drive redirects from vmstate paths → local paths, including cross-host clones. COW flock only on same-host (where source VM may be running). Cross-host redirects are safe since no live VM owns those paths on the target host.

  P2: FC clone/restore now reject --cpu/--memory overrides with a clear error instead of silently clamping, since FC cannot PATCH machine-config after snapshot/load.

  P2: Keep PTY master open (intentional fd leak) when console relay fails, preventing the slave-side hangup that would crash FC's serial console output during boot.

* fix: Codex round 7 - validate CPU/memory overrides before destructive ops

  Move FC CPU/memory override rejection to before any destructive operations. Clone validates against snapshot metadata before launch. Restore validates against current VM record before killing the running VM (via validateRestoreOverrides helper). Prevents downtime from unsupported override requests.

* remove claude

* fix: Codex round 8 - validate snapshot paths, stable COW lock inode

  P1: loadSnapshotMeta now validates all resolved paths stay within Cocoon's rootDir via validateManagedPath. Prevents path traversal from tampered cocoon.json in imported snapshot archives that could rewrite arbitrary host files through drive redirect symlinks.

  P2: lockCOWPath no longer removes the lock file after unlock.
  flock synchronizes on the inode; removing the file under contention lets a new caller create a different inode and acquire it immediately, defeating serialization. The lock file is small and harmless to keep.

* fix: validate snapshot paths against all Cocoon-managed dirs

  Path validation now accepts rootDir, runDir, and logDir as valid managed directories. COW disks live under runDir which may be outside rootDir (e.g., /var/lib/cocoon/run vs /var/lib/cocoon). The previous rootDir-only check rejected valid COW paths on installations with a custom run_dir.

* fix: validate source_root_dir and raw paths, skip RW COW validation

  P1: Validate SourceRootDir is absolute (or empty), and all raw relative paths in cocoon.json have no ".." traversal components. This prevents tampered archives from using vmstate redirect paths to create symlinks outside Cocoon-managed directories.

  P2: Skip local managed-path validation for RW COW entries since they are source-host-specific and always replaced by rebuildCloneStorage. Only RO layer paths (actually used locally) are validated against the destination host's managed roots.

* fix: validate vmstate redirect targets, allow RW COW with custom run_dir

  P1: vmstate redirect paths (from vmstatePaths()) are now validated against SourceRootDir before createDriveRedirects operates on them. Prevents tampered archives from targeting arbitrary host files via drive redirects. Removed validateNoTraversal which was too broad.

  P2: Removed traversal check on raw relative paths that rejected legitimate ".." segments from custom run_dir layouts. RW COW paths are source-host-specific and skip local managed-root validation (already in place). Only vmstate targets and local RO/boot paths are validated.

* fix: validate vmstate RO paths against local roots, skip RW and cross-host

  P1: Replaced SourceRootDir-based validation with local managed-root validation for same-host vmstate RO paths.
  SourceRootDir is untrusted from imported archives and no longer used as a security boundary. Cross-host RO paths are already validated during loadSnapshotMeta.

  P2: validateVMStateROPaths skips RW COW entries entirely; they are source-host-specific and always replaced by rebuildCloneStorage. Custom run_dir layouts where COW is outside rootDir now work.

* fix: cross-host vmstate validation, DirectRestore overrides and COW lock

  P1: validateVMStateROPaths now validates RO paths for both same-host and cross-host clones against local managed roots. Cross-host RO blob paths should exist locally if the image was pulled. RW COW entries remain exempt (source-host-specific, always replaced).

  P2: DirectRestore now calls validateRestoreOverrides before killing the running VM, matching the streamed Restore path.

  P2: DirectRestore now takes the COW lock via lockCOWPath to serialize with concurrent clone redirect operations, matching Restore.

* refactor: simplify FC snapshot to absolute paths, remove cross-layout complexity

  Remove all cross-host path translation machinery that caused 5+ rounds of Codex review findings:
  - Removed: SourceRootDir, managedRoots, validateManagedPath, resolveAndValidateBootPaths, validateVMStateROPaths, vmstatePaths, rawRelPaths, makeRelative, relative path serialization
  - snapshotMeta now stores absolute paths directly
  - FC snapshots require same directory layout across hosts (documented)
  - COW redirect logic retained for same-host clone (simple, correct)
  - COW flock retained for snapshot/restore/clone serialization
  - Net deletion: ~192 lines

  Document FC snapshot portability requirements in KNOWN_ISSUES.md and README limitations section.

* fix: validate imported FC snapshot metadata paths against managed dirs

  loadSnapshotMeta now takes rootDir and runDir params and validates all storage/boot paths are under Cocoon-managed directories. Rejects tampered snapshot archives that reference arbitrary host files.
  Simple prefix check, no cross-layout complexity.

* refactor: /simplify findings - clean up FC code quality

  - forceTerminate: remove unused hc/vmID params, simplify call sites
  - api.go: filepath.Join instead of string concat for snapshot paths
  - backend.go: move constants to top of file per convention
  - relay.go: guard against invalid fcPid (exit early if <= 0)
  - helper.go: extract pidFileName constant, use in config.go
  - start.go: close leaked PTY master in fcCmd.Wait goroutine when relay fails, preventing permanent fd leak on retry

* refactor: extract shared helpers to hypervisor/ layer

  Move exact duplicates from CH and FC into shared hypervisor/ package:
  - BatchMarkStarted → Backend method (was batchMarkStarted on each)
  - CleanStalePlaceholders → Backend method (was on each for GC)
  - VerifyBaseFiles → shared.go (CH version is superset, works for both)
  - CowSerial → backend.go constant (was in both create.go)
  - CreatingStateGCGrace → backend.go constant (was in both gc.go)

* feat: FC balloon support, debug command, capability docs update

  - Enable balloon on FC VMs (PUT /balloon with 25% memory, deflate_on_oom, free_page_reporting); matches CH behavior, fixes incorrect "No balloon" in docs
  - Debug command now supports --fc: outputs FC launch command + full REST API curl sequence (machine-config, boot-source, drives, balloon, start)
  - Fix CH comparison: CH supports CPU/memory override on clone/restore
  - Add TODO for FC PR #5774 (drive_overrides) in clone symlink redirect
  - Update KNOWN_ISSUES: PR #5774 tracking, virtio-blk serial explanation
  - Update README feature matrix: balloon=Y, add CPU/memory override row

* refactor: use t.Context() instead of context.Background() in utils tests

* refactor: reuse memMiB in balloon calc, remove stale nolint:unparam

* clean up useless

* fix: console relay socket deleted on listener Close

  SetUnlinkOnClose(false) before closing the Go listener so the socket file persists on disk for the relay child process.
  Without this, net.UnixListener.Close() removes the socket file, making console.sock disappear before the relay starts accepting.

* chore: remove test binary, add to gitignore

* feat: FC networking, console fix, snapshot/clone/restore fully working

  Network:
  - Add SingleQueueNet flag to VMConfig for FC single-queue TAPs
  - CNI creates TAPs with IFF_NO_PI when SingleQueueNet is set (FC requires it)
  - Set SingleQueueNet in both createVM and prepareClone paths

  Console:
  - Fix SetUnlinkOnClose(false) so console.sock persists for relay

  Snapshot/Clone:
  - Use FC network_overrides (v1.14+) during snapshot/load to provide clone's TAP devices, avoiding TAP flag mismatch
  - Skip drive reconfiguration after snapshot/load (FC opens drives via fd during load, fds survive symlink cleanup)
  - Remove unused reconfigureDrives function

  Restore:
  - Skip drive reconfiguration (same VM, paths unchanged)
  - Pass nil network_overrides (same TAP)

  COW lock:
  - Rewrite lockCOWPath to withCOWPathLocked closure form
  - Update all callers (snapshot, clone, restore, direct)

  All e2e tests pass: FC create/start/network/console/snapshot/clone/restore/stop/delete + CH smoke test (no regression).

* fix: mark VM error state on clone restore failure

* fix: prepareClone ctx param order, stale MAC re-read, FirstBooted omitempty

  - prepareClone: move ctx before cmd per Go convention
  - create_linux.go: re-read link after LinkSetHardwareAddr to get the actual MAC (link.Attrs() is stale after override)
  - types/vm.go: add omitempty to FirstBooted for consistent JSON
  - debug.go: normalize nolint comment alignment

* docs: clarify SingleQueueNet as generic TAP flag, not FC-specific

* refactor: remove SingleQueueNet, decide TAP queues at cmd layer

  Remove SingleQueueNet from VMConfig; FC queue decision stays at the cmd layer via tapQueues parameter to initNetwork. The network layer uses vmCfg.CPU for TAP queues, which initNetwork temporarily overrides to 1 for FC.
  Also add IFF_NO_PI to all TAPs unconditionally; both CH and FC open TAPs with IFF_NO_PI, so the flag must always be set at creation time for TUNSETIFF to succeed.

* refactor: unify InitHypervisor and InitAllHypervisors via constructor map

* fix: reject extra NICs on FC clone, use vmlinux in debug output

  P2: FC clone now rejects --nics > snapshot NIC count since FC can't hot-add NICs after snapshot/load (only network_overrides for existing).

  P3: Debug command runs EnsureVmlinux to resolve vmlinuz → vmlinux before printing the FC boot-source curl, so the output is runnable. Export EnsureVmlinux for use by cmd/vm/debug.go.

* docs: document FC clone guest MAC limitation in KNOWN_ISSUES

* rename utils from shared

* fix: remove duplicate MarkError in clone launch failure path

* refactor: auto-detect hypervisor backend, --fc only for create/run/debug

  Add Hypervisor field to types.VM so each VM carries its backend identity. Move --fc from root PersistentFlags to create/run/debug subcommands only. Commands like list/inspect/console/stop/rm now auto-detect the backend by querying all registered backends; no --fc needed for existing VMs. Clone infers the backend from the snapshot's Hypervisor field. Snapshot save and list --vm auto-detect from the VM ref. Status merges watchers from all backends via fan-in channel.

* fix: reject FC clone resource overrides early, add MAC fix hints

  Validate --cpu/--memory/--nics overrides at cmd layer before creating network and VM dirs, avoiding late failure and unnecessary rollback. Add MAC change instructions to FC clone post-clone hints since FC vmstate bakes in the source VM's guest MAC.

* refactor: use config.HypervisorFirecracker constant instead of string literal

* fix: add reboot=k to FC kernel cmdline to fix guest reboot/stop hang

  FC has no ACPI PM on x86; the only shutdown/reboot signal path is the i8042 keyboard controller reset.
  Without reboot=k, guest reboot hangs (FC doesn't recognize the signal) and SendCtrlAltDel-based vm stop times out after 30s before falling back to SIGTERM.

* fix: self-deadlock in GC Collect - use lock-free DB access

  GC orchestrator holds the module's flock for the entire cycle. Collect called LoadRecord which called DB.With → locker.Lock on the same flock, causing self-deadlock since flock is not re-entrant. Replace LoadRecord (lock-acquiring) with DB.ReadRaw (lock-free) in both FC and CH GC Collect. This is safe because the GC orchestrator already holds the lock, preventing concurrent DB mutations.

* fix: replace IP=dhcp with IP=off in initramfs to fix boot and network issues

  IP=dhcp caused three problems:
  1. --nics 0 VMs hung forever (dhcpcd retries every 120s with no interface)
  2. DHCP network VMs had leases persisted as static configs by systemd-network-generator, breaking DHCP semantics on reboot
  3. Source VMs and cloned VMs had inconsistent network behavior

  IP=off tells initramfs to skip networking entirely. Kernel ip= parameters (when present for static IP networks) override this setting and still trigger ipconfig. DHCP networks rely on systemd-networkd via the existing 20-wired.network (DHCP=yes) fallback, or cocoon-network's MAC-based DHCP config generation.

  Fixes #17

* fix: skip configure_networking in initramfs when no kernel ip= param

  configure_networking probes for devices and waits for udev even when IP=off, adding ~180s delay on VMs with no NICs. Only call it when a kernel ip= parameter is present on the cmdline.
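The GC self-deadlock described above can be reproduced in miniature. The sketch below assumes the nested call opens its own descriptor for the lock file (the commit message does not say how the locker is implemented); under that assumption, the second exclusive `flock` in the same process is refused while the first is held. It uses a non-blocking attempt so the demo returns instead of hanging. `secondLockWouldBlock` is a hypothetical helper, and the code is Linux/Unix-only.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// secondLockWouldBlock opens the same lock file twice in one process and
// reports whether a second exclusive flock is refused while the first is held.
func secondLockWouldBlock() (bool, error) {
	dir, err := os.MkdirTemp("", "flock-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "gc.lock")

	first, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return false, err
	}
	defer first.Close()
	second, err := os.OpenFile(path, os.O_RDWR, 0o600)
	if err != nil {
		return false, err
	}
	defer second.Close()

	// Outer holder: stands in for the GC orchestrator taking the lock.
	if err := syscall.Flock(int(first.Fd()), syscall.LOCK_EX); err != nil {
		return false, err
	}
	defer syscall.Flock(int(first.Fd()), syscall.LOCK_UN)

	// Nested caller: a blocking LOCK_EX here would wait forever on ourselves,
	// so LOCK_NB is used to observe the conflict without hanging.
	err = syscall.Flock(int(second.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
	if errors.Is(err, syscall.EWOULDBLOCK) {
		return true, nil
	}
	if err != nil {
		return false, err
	}
	syscall.Flock(int(second.Fd()), syscall.LOCK_UN)
	return false, nil
}

func main() {
	blocked, err := secondLockWouldBlock()
	fmt.Println("second lock refused:", blocked, "err:", err)
}
```

This is why a read path that runs while the lock is already held has to skip lock acquisition, as the fix above does with a lock-free read.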
* docs: update README and KNOWN_ISSUES for --fc auto-detect and initramfs fixes

  - Move --fc from Global Flags to VM Flags (only create/run/debug)
  - Update FC examples to show auto-detect for list/console/stop/clone
  - Fix debug command description
  - Add initramfs IP=off note to DHCP networking section

* minor refactor to fix leaked goroutine

* fix: Android overlay.sh support for FC /dev/vdX paths, code cleanup

  - Add /dev/vdX direct path branch to Android overlay.sh resolve_disk() so FC VMs can find disks (FC has no virtio serial support)
  - Skip configure_networking unless kernel ip= param is present
  - Extract GC Collect to shared Backend.GCCollect() (was duplicated)
  - Fix goroutine leak in mergeWatchChannels (missing ctx.Done check)

* feat: add DHCP fallback to Android network.sh via busybox udhcpc

* fix: persist DHCP gateway for Android netd policy table sync

* fix: use ndc to register network with netd for Android DHCP routing

* fix: destroy stale netd network before creating to avoid ndc conflict

* fix: guard Android network.sh against repeated netd trigger

* fix: unify Android network.sh to use ip route for both static and DHCP

  Remove ndc dependency: ndc network interface add causes netd to take over eth0 and clear existing routes from the main table. Instead:
  - Static IP: kernel ip= routes already in main table, copy to policy tables
  - DHCP: udhcpc obtains lease and configures main table, then same copy logic

  Both paths use ip route replace into legacy_system/legacy_network/local_network policy tables. Add /proc/1/cmdline fallback for SELinux-restricted /proc/cmdline. Add guard file to prevent repeated execution on netd restart.

* fix: use /data/local/tmp instead of /tmp for Android SELinux compatibility

* revert: remove Android DHCP support, static IP only

  Android netd blocks external route modifications after boot (RTNETLINK: Network is unreachable). ipconfigstore cannot read gateway without a pre-existing default route, creating a deadlock.
  DHCP requires routes to exist before netd starts, which is only possible with kernel ip=. Revert to clean static IP-only network.sh. DHCP support requires redroid-level changes (EthernetService/ConnectivityService integration).

* fix: Android DHCP via EthernetService default DHCP mode

  Delete ipconfig.txt (broken STATIC config from ipconfigstore) when no kernel ip= is present. EthernetService defaults to DHCP mode when no ipconfig.txt exists, using Android's built-in DhcpClient through the standard ConnectivityService → netd path. This correctly populates all policy routing tables without manual ndc or ip route commands.

  Static IP path unchanged: ipconfigstore writes correct STATIC config, network.sh copies routes to policy tables as safety net.
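Two entries in this message touch the watcher fan-in: Status merges watchers from all backends into one channel, and a later fix adds the missing ctx.Done check to mergeWatchChannels. A minimal sketch of that pattern follows. `mergeEvents` and the demo helpers are hypothetical stand-ins, and string events replace whatever event type Cocoon actually uses.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// mergeEvents fans several source channels into one output channel.
// Every forwarder watches ctx.Done() on both the receive and the send side,
// so no goroutine stays blocked after the consumer cancels.
func mergeEvents[T any](ctx context.Context, sources ...<-chan T) <-chan T {
	out := make(chan T)
	var wg sync.WaitGroup
	for _, src := range sources {
		wg.Add(1)
		go func(src <-chan T) {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case v, ok := <-src:
					if !ok {
						return // this source is finished
					}
					select {
					case out <- v:
					case <-ctx.Done():
						return // consumer went away mid-send
					}
				}
			}
		}(src)
	}
	go func() {
		wg.Wait()
		close(out) // closed once every forwarder has returned
	}()
	return out
}

// demoMerge counts events merged from two already-closed sources.
func demoMerge() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	ch := make(chan string, 2)
	fc := make(chan string, 1)
	ch <- "ch: vm1 started"
	ch <- "ch: vm2 stopped"
	fc <- "fc: vm3 started"
	close(ch)
	close(fc)
	n := 0
	for range mergeEvents(ctx, ch, fc) {
		n++
	}
	return n
}

// demoCancelClosesOutput shows that cancellation alone releases a forwarder
// whose source never sends and never closes.
func demoCancelClosesOutput() bool {
	ctx, cancel := context.WithCancel(context.Background())
	idle := make(chan string)
	out := mergeEvents(ctx, idle)
	cancel()
	_, ok := <-out
	return !ok
}

func main() {
	fmt.Println(demoMerge(), demoCancelClosesOutput())
}
```

Without the inner `ctx.Done()` case, a forwarder holding a value would block on `out <- v` forever once the reader stops, which is the kind of leak the fix above addresses.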
1 parent 015a5d0 commit f313f81

65 files changed

Lines changed: 3552 additions & 814 deletions

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -47,4 +47,7 @@ TODO*

 tmp/*
 .cache/*
-.tmp/*
+.tmp/*
+
+.claude/*
+cocoon-test

KNOWN_ISSUES.md

Lines changed: 31 additions & 1 deletion
@@ -69,7 +69,9 @@ This applies to **all CNI plugins** where the upstream network provides DHCP (br
 }
 ```

-Cocoon detects when CNI returns no IP allocation and automatically configures the guest for DHCP — cloudimg VMs get `DHCP=ipv4` in their Netplan config, and OCI VMs get DHCP systemd-networkd units generated by the initramfs.
+Cocoon detects when CNI returns no IP allocation and automatically configures the guest for DHCP — cloudimg VMs get `DHCP=ipv4` in their Netplan config, and OCI VMs get DHCP systemd-networkd units generated by the initramfs `cocoon-network` script.
+
+Note: the OCI initramfs uses `IP=off` to prevent the initramfs from running its own DHCP client during boot. DHCP is handled entirely by systemd-networkd after switch_root. The `configure_networking` function is only called when a kernel `ip=` parameter is present (static IP from CNI).

 ## Windows VM requires Cloud Hypervisor v50.2

@@ -125,3 +127,31 @@ The Windows image's `autounattend.xml` includes defensive power-button configura
 ## Installing patched binaries for Windows

 See [`os-image/windows/`](os-image/windows/) for download and installation instructions.
+
+
+## Firecracker snapshot portability
+
+Firecracker snapshots store absolute host paths in the vmstate binary (Rust serde format, not patchable). This means:
+
+- **Same-host clone/restore**: works without restrictions
+- **Cross-host export/import**: requires the target host to use **identical `root_dir` and `run_dir`** (default: `/var/lib/cocoon` and `/var/lib/cocoon/run`) and have the **same OCI image pulled**
+- **CPU/memory overrides**: not supported on clone/restore — Firecracker cannot change machine config after snapshot/load; `--cpu` and `--memory` flags are rejected if they differ from the snapshot values
+- **Drive path redirect**: Cocoon uses a temporary symlink to redirect the source COW path to the clone's COW during `snapshot/load`. This requires a COW flock to serialize with concurrent operations
+
+This is a fundamental Firecracker design limitation. Cloud Hypervisor snapshots do not have this restriction because CH stores device config in a patchable JSON format (`config.json`).
+
+**Upstream fix in progress**: Firecracker [PR #5774](https://github.com/firecracker-microvm/firecracker/pull/5774) adds `drive_overrides` to the `PUT /snapshot/load` API, which would eliminate the symlink redirect and make FC snapshots natively portable. Track this PR for future simplification.
+
+## Firecracker virtio-blk serial numbers
+
+Firecracker does not support virtio-blk serial numbers. Cocoon's OCI init script (`overlay.sh`) uses device paths (`/dev/vdX`) instead of serial names to identify disks when booting under Firecracker. OCI images built from `os-image/ubuntu/overlay.sh` (v0.3+) support both formats automatically. Older images must be rebuilt to work with `--fc`.
+
+## Firecracker clone guest MAC address
+
+Firecracker does not support overriding the guest MAC address during snapshot/load. Cloned FC VMs retain the source VM's guest MAC (baked into the vmstate binary). In Cocoon's TC redirect architecture, each VM runs in an isolated network namespace, so MAC identity is not visible to other VMs or the host bridge — **no MAC conflict occurs in practice**.
+
+On CNI plugins with strict per-veth MAC enforcement (Cilium eBPF, Calico eBPF), the guest MAC vs veth MAC mismatch could theoretically cause packet drops. This has not been observed in testing with the standard bridge CNI.
+
+**Upstream status**: FC's `NetworkOverride` struct only has `iface_id` and `host_dev_name` — no `guest_mac` field. Adding it would follow the existing `VsockOverride` pattern. No issue or PR exists yet.
+
+**Workaround**: If MAC matching is required, run `ip link set dev ethX address <new-mac>` inside the guest after clone (the post-clone hints print the expected MAC values).

README.md

Lines changed: 62 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Cocoon
22

3-
Lightweight MicroVM engine built on [Cloud Hypervisor](https://github.com/cloud-hypervisor/cloud-hypervisor).
3+
Lightweight MicroVM engine with dual hypervisor backends: [Cloud Hypervisor](https://github.com/cloud-hypervisor/cloud-hypervisor) (default) and [Firecracker](https://github.com/firecracker-microvm/firecracker).
44

55
## Features
66

@@ -24,7 +24,8 @@ Lightweight MicroVM engine built on [Cloud Hypervisor](https://github.com/cloud-
2424
- **Docker-like CLI**`create`, `run`, `start`, `stop`, `list`, `inspect`, `console`, `rm`, `debug`, `clone`, `status`
2525
- **Structured logging** — configurable log level (`--log-level`), log rotation (max size / age / backups)
2626
- **Debug command** — `cocoon vm debug` generates a copy-pasteable `cloud-hypervisor` command for manual debugging
27-
- **Zero-daemon architecture** — one Cloud Hypervisor process per VM, no long-running daemon
27+
- **Firecracker backend** — `--fc` flag selects Firecracker for OCI images: ~125ms boot, <5 MiB overhead, minimal attack surface (no UEFI, no qcow2, no Windows)
28+
- **Zero-daemon architecture** — one hypervisor process per VM, no long-running daemon
2829
- **Garbage collection** — modular lock-safe GC with cross-module snapshot resolution; protects blobs referenced by running VMs and snapshots
2930
- **Doctor script** — pre-flight environment check and one-command dependency installation
3031

@@ -33,8 +34,9 @@ Lightweight MicroVM engine built on [Cloud Hypervisor](https://github.com/cloud-
3334
- Linux with KVM (x86_64 or aarch64)
3435
- Root access (sudo)
3536
- [Cloud Hypervisor](https://github.com/cloud-hypervisor/cloud-hypervisor) v51.0+ (for Windows VMs, use our [CH fork](https://github.com/cocoonstack/cloud-hypervisor/tree/dev) and [firmware fork](https://github.com/cocoonstack/rust-hypervisor-firmware/tree/dev) for full compatibility — see [KNOWN_ISSUES.md](KNOWN_ISSUES.md))
37+
- [Firecracker](https://github.com/firecracker-microvm/firecracker) v1.12+ (optional, for `--fc` backend)
3638
- `qemu-img` (from qemu-utils, for cloud images)
37-
- UEFI firmware (`CLOUDHV.fd`, for cloud images)
39+
- UEFI firmware (`CLOUDHV.fd`, for cloud images, not needed with `--fc`)
3840
- CNI plugins (`bridge`, `host-local`, `loopback`)
3941
- Go 1.25+ (build only)
4042

@@ -85,6 +87,7 @@ cocoon-check --upgrade
8587

8688
The `--upgrade` flag downloads and installs:
8789
- Cloud Hypervisor + ch-remote (static binaries)
90+
- Firecracker (static binary)
8891
- CLOUDHV.fd firmware (rust-hypervisor-firmware)
8992
- CNI plugins (bridge, host-local, loopback, etc.)
9093

@@ -136,7 +139,7 @@ cocoon
136139
│ ├── rm [flags] VM [VM...] Delete VM(s) (--force to stop first)
137140
│ ├── restore [flags] VM SNAP Restore a running VM to a snapshot
138141
│ ├── status [VM...] Watch VM status in real time
139-
│ └── debug [flags] IMAGE Generate CH launch command (dry run)
142+
│ └── debug [flags] IMAGE Generate hypervisor launch command (dry run)
140143
├── snapshot
141144
│ ├── save [flags] VM Create a snapshot from a running VM
142145
│ ├── list (alias: ls) List all snapshots
@@ -169,6 +172,7 @@ Applies to `cocoon vm create`, `cocoon vm run`, and `cocoon vm debug`:
169172

170173
| Flag | Default | Description |
171174
| ----------- | ---------------- | --------------------------------------------- |
175+
| `--fc` | `false` | Use Firecracker backend (OCI images only) |
172176
| `--name` | `cocoon-<image>` | VM name |
173177
| `--cpu` | `2` | Boot CPUs |
174178
| `--memory` | `1G` | Memory size (e.g., 512M, 2G) |
@@ -356,22 +360,71 @@ cocoon image import win11-25h2 windows-11-25h2.qcow2
356360

357361
For more details, see the [Cloud Hypervisor Windows documentation](https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/windows.md).
358362

363+
## Firecracker Backend
364+
365+
Cocoon supports [Firecracker](https://github.com/firecracker-microvm/firecracker) as an alternative hypervisor for workloads that prioritize boot speed and resource density.
366+
367+
```bash
368+
# Run with Firecracker (--fc only needed for create/run/debug)
369+
cocoon vm run --fc --name fast-vm ghcr.io/cocoonstack/cocoon/ubuntu:24.04
370+
371+
# Other commands auto-detect the backend — no --fc needed
372+
cocoon vm list # shows both CH and FC VMs
373+
cocoon vm console fast-vm
374+
cocoon vm stop fast-vm
375+
376+
# Clone infers backend from the snapshot
377+
cocoon snapshot save fast-vm --name my-snap
378+
cocoon vm clone my-snap --name clone-vm
379+
```
380+
381+
### Feature Comparison
382+
383+
| Feature | Cloud Hypervisor | Firecracker |
384+
|---------|:---:|:---:|
385+
| OCI images (direct boot) | Y | Y |
386+
| Cloud images (UEFI boot) | Y | N |
387+
| Windows guests | Y | N |
388+
| Snapshot / Clone / Restore | Y | Y |
389+
| CPU/memory override on clone/restore | Y | N |
390+
| Multi-queue networking | Y | N |
391+
| Memory balloon | Y | Y |
392+
| qcow2 storage | Y | N |
393+
| Interactive console | Y | Y |
394+
| HugePages | Y | Y |
395+
| Boot time | ~200-500ms | ~125ms |
396+
| Memory overhead | ~10-20 MiB/VM | <5 MiB/VM |
397+
398+
### Limitations
399+
400+
- **OCI images only**: `--fc` is mutually exclusive with `--windows` and rejects cloudimg (UEFI boot) images
401+
- **Raw disks only**: Firecracker uses raw virtio-blk without serial support; disks are referenced by device path (`/dev/vdX`)
402+
- **Single-queue networking**: `NetworkConfig.NumQueues` is ignored
403+
- **No CPU/memory override on clone/restore**: Firecracker cannot change machine config after snapshot/load
404+
- **Snapshot portability requires same directory layout**: FC snapshots store absolute paths in the vmstate binary (not patchable); cross-host export/import requires the target host to use the same `root_dir`/`run_dir` and have the same OCI image pulled
405+
- **Console via PTY relay**: a background relay process bridges FC's serial (stdin/stdout) to `console.sock`
406+
407+
### OCI Image Compatibility
408+
409+
OCI images must include a `resolve_disk()` init script that supports device paths (e.g., `/dev/vda`) in addition to virtio serial names. Images built from `os-image/ubuntu/overlay.sh` (v0.3+) support both formats automatically.
410+
359411
## VM Lifecycle
360412

361413
| State | Description |
362414
| ---------- | -------------------------------------------------------- |
363415
| `creating` | DB placeholder written, disks being prepared |
364-
| `created` | Registered, cloud-hypervisor process not yet started |
365-
| `running` | Cloud-hypervisor process alive, guest is up |
366-
| `stopped` | Cloud-hypervisor process exited cleanly |
416+
| `created` | Registered, hypervisor process not yet started |
417+
| `running` | Hypervisor process alive, guest is up |
418+
| `stopped` | Hypervisor process exited cleanly |
367419
| `error` | Start or stop failed |
368420

369421
### Shutdown Behavior
370422

371423
- **UEFI VMs (cloudimg)**: ACPI power-button → poll for graceful exit → timeout (default 30s, configurable via `stop_timeout_seconds` in config or `--timeout` flag) → SIGTERM → 5s → SIGKILL
372424
- **Windows VMs**: ACPI power-button works with our [firmware fork](https://github.com/cocoonstack/rust-hypervisor-firmware/tree/dev) (~13.5s shutdown); with upstream firmware, use `ssh shutdown /s /t 0` before stopping, or `--force` to skip the ACPI timeout (see [KNOWN_ISSUES.md](KNOWN_ISSUES.md))
373-
- **Direct-boot VMs (OCI)**: `vm.shutdown` API → SIGTERM → 5s → SIGKILL (no ACPI support)
374-
- **Force stop** (`--force`): skip ACPI, immediate `vm.shutdown` → SIGTERM → SIGKILL
425+
- **Direct-boot VMs (CH, OCI)**: `vm.shutdown` API → SIGTERM → 5s → SIGKILL (no ACPI support)
426+
- **Firecracker VMs**: `SendCtrlAltDel` → SIGTERM → 5s → SIGKILL
427+
- **Force stop** (`--force`): skip ACPI, immediate SIGTERM → SIGKILL
375428
- PID ownership is verified before sending signals to prevent killing unrelated processes
376429

377430
### Stop Flags

cmd/core/helpers.go

Lines changed: 80 additions & 6 deletions
@@ -15,6 +15,7 @@ import (
1515
"github.com/cocoonstack/cocoon/config"
1616
"github.com/cocoonstack/cocoon/hypervisor"
1717
"github.com/cocoonstack/cocoon/hypervisor/cloudhypervisor"
18+
"github.com/cocoonstack/cocoon/hypervisor/firecracker"
1819
imagebackend "github.com/cocoonstack/cocoon/images"
1920
"github.com/cocoonstack/cocoon/images/cloudimg"
2021
"github.com/cocoonstack/cocoon/images/oci"
@@ -26,6 +27,12 @@ import (
2627
"github.com/cocoonstack/cocoon/utils"
2728
)
2829

30+
// hypervisorConstructors maps backend type to its constructor.
31+
var hypervisorConstructors = map[config.HypervisorType]func(*config.Config) (hypervisor.Hypervisor, error){
32+
config.HypervisorCH: func(c *config.Config) (hypervisor.Hypervisor, error) { return cloudhypervisor.New(c) },
33+
config.HypervisorFirecracker: func(c *config.Config) (hypervisor.Hypervisor, error) { return firecracker.New(c) },
34+
}
35+
2936
// BaseHandler provides shared config access for all command handlers.
3037
type BaseHandler struct {
3138
ConfProvider func() *config.Config
@@ -71,11 +78,11 @@ func InitBackends(ctx context.Context, conf *config.Config) ([]imagebackend.Imag
7178
if err != nil {
7279
return nil, nil, err
7380
}
74-
ch, err := cloudhypervisor.New(conf)
81+
hyper, err := InitHypervisor(conf)
7582
if err != nil {
76-
return nil, nil, fmt.Errorf("init hypervisor: %w", err)
83+
return nil, nil, err
7784
}
78-
return backends, ch, nil
85+
return backends, hyper, nil
7986
}
8087

8188
// InitImageBackends initializes only image backends (no hypervisor needed).
@@ -100,13 +107,80 @@ func InitImageBackendsForPull(ctx context.Context, conf *config.Config) (*oci.OC
100107
return ociStore, cloudimgStore, nil
101108
}
102109

103-
// InitHypervisor initializes only the hypervisor.
110+
// InitHypervisor initializes the selected hypervisor backend.
104111
func InitHypervisor(conf *config.Config) (hypervisor.Hypervisor, error) {
105-
ch, err := cloudhypervisor.New(conf)
112+
ctor, ok := hypervisorConstructors[conf.Hypervisor()]
113+
if !ok {
114+
return nil, fmt.Errorf("unknown hypervisor type: %s", conf.Hypervisor())
115+
}
116+
h, err := ctor(conf)
106117
if err != nil {
107118
return nil, fmt.Errorf("init hypervisor: %w", err)
108119
}
109-
return ch, nil
120+
return h, nil
121+
}
122+
123+
// InitAllHypervisors initializes all registered backends for GC.
124+
// Returns error if any backend fails — GC must not proceed without
125+
// full blob pinning or it risks deleting referenced layers.
126+
func InitAllHypervisors(conf *config.Config) ([]hypervisor.Hypervisor, error) {
127+
result := make([]hypervisor.Hypervisor, 0, len(hypervisorConstructors))
128+
for typ, ctor := range hypervisorConstructors {
129+
h, err := ctor(conf)
130+
if err != nil {
131+
return nil, fmt.Errorf("init %s for GC: %w", typ, err)
132+
}
133+
result = append(result, h)
134+
}
135+
return result, nil
136+
}
137+
138+
// FindHypervisor returns the backend that owns the given VM ref.
139+
// Tries all registered backends; returns ErrNotFound if no backend has it.
140+
func FindHypervisor(ctx context.Context, conf *config.Config, ref string) (hypervisor.Hypervisor, error) {
141+
hypers, err := InitAllHypervisors(conf)
142+
if err != nil {
143+
return nil, err
144+
}
145+
for _, h := range hypers {
146+
if _, resolveErr := h.Inspect(ctx, ref); resolveErr == nil {
147+
return h, nil
148+
}
149+
}
150+
return nil, fmt.Errorf("VM %q: %w", ref, hypervisor.ErrNotFound)
151+
}
152+
153+
// ListAllVMs returns VMs from all registered backends, merged.
154+
func ListAllVMs(ctx context.Context, hypers []hypervisor.Hypervisor) ([]*types.VM, error) {
155+
var all []*types.VM
156+
for _, h := range hypers {
157+
vms, listErr := h.List(ctx)
158+
if listErr != nil {
159+
continue
160+
}
161+
all = append(all, vms...)
162+
}
163+
return all, nil
164+
}
165+
166+
// RouteRefs groups VM refs by their owning backend.
167+
// Returns a map from hypervisor to refs it owns, or error if any ref is unresolvable.
168+
func RouteRefs(ctx context.Context, hypers []hypervisor.Hypervisor, refs []string) (map[hypervisor.Hypervisor][]string, error) {
169+
result := map[hypervisor.Hypervisor][]string{}
170+
for _, ref := range refs {
171+
found := false
172+
for _, h := range hypers {
173+
if _, resolveErr := h.Inspect(ctx, ref); resolveErr == nil {
174+
result[h] = append(result[h], ref)
175+
found = true
176+
break
177+
}
178+
}
179+
if !found {
180+
return nil, fmt.Errorf("VM %q: %w", ref, hypervisor.ErrNotFound)
181+
}
182+
}
183+
return result, nil
110184
}
111185

112186
// InitNetwork creates the CNI network provider.

cmd/others/handler.go

Lines changed: 9 additions & 2 deletions
@@ -20,7 +20,7 @@ func (h Handler) GC(cmd *cobra.Command, _ []string) error {
2020
if err != nil {
2121
return err
2222
}
23-
backends, hyper, err := cmdcore.InitBackends(ctx, conf)
23+
backends, err := cmdcore.InitImageBackends(ctx, conf)
2424
if err != nil {
2525
return err
2626
}
@@ -37,7 +37,14 @@ func (h Handler) GC(cmd *cobra.Command, _ []string) error {
3737
for _, b := range backends {
3838
b.RegisterGC(o)
3939
}
40-
hyper.RegisterGC(o)
40+
// Register ALL hypervisor backends so GC protects blobs from both CH and FC VMs.
41+
hypers, hyperErr := cmdcore.InitAllHypervisors(conf)
42+
if hyperErr != nil {
43+
return hyperErr
44+
}
45+
for _, hyper := range hypers {
46+
hyper.RegisterGC(o)
47+
}
4148
netProvider.RegisterGC(o)
4249
snapBackend.RegisterGC(o)
4350
if err := o.Run(ctx); err != nil {

cmd/root.go

Lines changed: 3 additions & 2 deletions
@@ -59,6 +59,7 @@ var (
5959
viper.SetDefault("run_dir", "/var/lib/cocoon/run")
6060
viper.SetDefault("log_dir", "/var/log/cocoon")
6161
viper.SetDefault("ch_binary", "cloud-hypervisor")
62+
viper.SetDefault("fc_binary", "firecracker")
6263
viper.SetDefault("cni_conf_dir", "/etc/cni/net.d")
6364
viper.SetDefault("cni_bin_dir", "/opt/cni/bin")
6465
viper.SetDefault("dns", "8.8.8.8,1.1.1.1")
@@ -83,8 +84,8 @@ var (
8384
)
8485

8586
// Execute is the main entry point called from main.go.
86-
func Execute() error {
87-
ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
87+
func Execute(ctx context.Context) error {
88+
ctx, cancel := signal.NotifyContext(ctx, syscall.SIGINT, syscall.SIGTERM)
8889
defer cancel()
8990
return rootCmd.ExecuteContext(ctx)
9091
}

cmd/snapshot/handler.go

Lines changed: 4 additions & 5 deletions
@@ -30,16 +30,15 @@ func (h Handler) Save(cmd *cobra.Command, args []string) error {
3030
}
3131
logger := log.WithFunc("cmd.snapshot.save")
3232

33-
hyper, err := cmdcore.InitHypervisor(conf)
33+
vmRef := args[0]
34+
hyper, err := cmdcore.FindHypervisor(ctx, conf, vmRef)
3435
if err != nil {
35-
return err
36+
return fmt.Errorf("find VM %s: %w", vmRef, err)
3637
}
3738
snapBackend, err := cmdcore.InitSnapshot(conf)
3839
if err != nil {
3940
return err
4041
}
41-
42-
vmRef := args[0]
4342
name, _ := cmd.Flags().GetString("name")
4443
description, _ := cmd.Flags().GetString("description")
4544

@@ -95,7 +94,7 @@ func (h Handler) List(cmd *cobra.Command, _ []string) error {
9594
vmRef, _ := cmd.Flags().GetString("vm")
9695
var filterIDs map[string]struct{}
9796
if vmRef != "" {
98-
hyper, hyperErr := cmdcore.InitHypervisor(conf)
97+
hyper, hyperErr := cmdcore.FindHypervisor(ctx, conf, vmRef)
9998
if hyperErr != nil {
10099
return hyperErr
101100
}

cmd/vm/commands.go

Lines changed: 2 additions & 1 deletion
@@ -112,7 +112,7 @@ func Command(h Actions) *cobra.Command {
112112

113113
debugCmd := &cobra.Command{
114114
Use: "debug [flags] IMAGE",
115-
Short: "Generate cloud-hypervisor launch command (dry run)",
115+
Short: "Generate hypervisor launch command (dry run)",
116116
Args: cobra.ExactArgs(1),
117117
RunE: h.Debug,
118118
}
@@ -149,6 +149,7 @@ func Command(h Actions) *cobra.Command {
149149
}
150150

151151
func addVMFlags(cmd *cobra.Command) {
152+
cmd.Flags().Bool("fc", false, "use Firecracker backend instead of Cloud Hypervisor (OCI images only)")
152153
cmd.Flags().String("name", "", "VM name")
153154
cmd.Flags().Int("cpu", 2, "boot CPUs") //nolint:mnd
154155
cmd.Flags().String("memory", "1G", "memory size") //nolint:mnd
