|
| 1 | +# Snapshot File Format |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Save a `Snapshot` to disk and load it back via zero-copy file |
| 6 | +mapping, so that a `MultiUseSandbox` can be created directly from a |
| 7 | +file without re-parsing the guest ELF or re-running guest init code. |
| 8 | + |
| 9 | +- **Linux**: `mmap(MAP_PRIVATE)` at page-aligned offset - zero copy, |
| 10 | + demand-paged by the kernel. |
| 11 | +- **Windows**: `CreateFileMappingA(PAGE_READONLY)` + |
| 12 | + `MapViewOfFile(FILE_MAP_READ)` - zero copy, demand-paged by the OS. |
| 13 | + |
| 14 | +Works on Linux and Windows. Only the default feature set is supported; |
| 15 | +the `nanvix-unstable`, `crashdump`, and `gdb` features are not yet handled. |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## File Format |
| 20 | + |
| 21 | +The file uses a versioned header with two independent version checks: |
| 22 | + |
| 23 | +- **Format version** (`FormatVersion` enum): controls the byte layout |
| 24 | + of the header itself. A format version mismatch may be convertible |
| 25 | + by re-serializing the header. |
| 26 | +- **ABI version** (`SNAPSHOT_ABI_VERSION` constant): covers the |
| 27 | + contents and interpretation of the memory blob. An ABI mismatch |
| 28 | + means the snapshot must be regenerated from the guest binary. |
| 29 | + |
| 30 | +``` |
| 31 | +Offset Size Field |
| 32 | +------ ------- -------------------------------------------------- |
| 33 | +0 4 Magic bytes: "HLS\0" |
| 34 | +4 4 Format version (u32 LE: 1 = V1) |
| 35 | +8 4 Architecture tag (u32 LE: 1 = x86_64, 2 = aarch64) |
| 36 | +12 4 ABI version (u32 LE: must match SNAPSHOT_ABI_VERSION) |
| 37 | +16 32 Content hash (blake3, over memory blob only) |
| 38 | +48 8 stack_top_gva (u64 LE) |
| 39 | +56 8 Entrypoint tag (u64 LE: 0 = Initialise, 1 = Call) |
| 40 | +64 8 Entrypoint address (u64 LE) |
| 41 | +72 8 input_data_size (u64 LE) |
| 42 | +80 8 output_data_size (u64 LE) |
| 43 | +88 8 heap_size (u64 LE) |
| 44 | +96 8 code_size (u64 LE) |
| 45 | +104 8 init_data_size (u64 LE) |
| 46 | +112 8 init_data_permissions (u64 LE: 0 = None, else bits) |
| 47 | +120 8 scratch_size (u64 LE) |
| 48 | +128 8 snapshot_size (u64 LE) |
| 49 | +136 8 pt_size (u64 LE: 0 = None) |
| 50 | +144 8 memory_size (u64 LE) - byte length of memory blob |
| 51 | + Derivable from layout fields today, but stored for |
| 52 | + forward compat (e.g. compression). |
| 53 | +152 8 memory_offset (u64 LE) - byte offset from file start |
| 54 | + Always SNAPSHOT_HEADER_SIZE today, but stored so a |
| 55 | + future format can relocate the blob without breaking. |
| 56 | +160 8 has_sregs (u64 LE: 1 = present, 0 = absent) |
| 57 | +168 8 hypervisor_tag (u64 LE: 1 = KVM, 2 = MSHV, 3 = WHP) |
| 58 | +176 952 sregs fields (all widened to u64 LE, see below) |
| 59 | +1128 2968 Zero padding to 4096-byte boundary |
| 60 | +4096 * Memory blob (page-aligned, uncompressed, mmap target) |
| 61 | +*+4096 4096 Trailing zero padding (guard page backing for Windows) |
| 62 | +``` |
| 63 | + |
| 64 | +Total header before padding: 1128 bytes, well within the 4096-byte |
| 65 | +page. |
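
The offsets in the table can be cross-checked with a little arithmetic. The following is a minimal sketch, not the code in `snapshot.rs`: the numeric values come from the table above, while the constant names and the `read_u64_le` helper are hypothetical and exist only for illustration.

```rust
// Values taken from the layout table; the constant names are illustrative only.
const PAGE_SIZE: usize = 4096;
const SREGS_OFFSET: usize = 176;
// 8 segment regs x 13 fields + 2 table regs x 2 fields
// + 7 control regs + 4 interrupt bitmap words.
const SREGS_U64_COUNT: usize = 8 * 13 + 2 * 2 + 7 + 4; // 119
// First byte after the last header field.
const HEADER_FIELDS_END: usize = SREGS_OFFSET + SREGS_U64_COUNT * 8; // 1128
// Zero padding needed to reach the page boundary where the blob starts.
const HEADER_PADDING: usize = PAGE_SIZE - HEADER_FIELDS_END; // 2968
// Offset of the `memory_offset` field itself.
const MEMORY_OFFSET_FIELD: usize = 152;

/// Hypothetical helper: read a little-endian u64 at `offset`, or return
/// `None` if the buffer is too short to hold it.
fn read_u64_le(header: &[u8], offset: usize) -> Option<u64> {
    let end = offset.checked_add(8)?;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(header.get(offset..end)?);
    Some(u64::from_le_bytes(bytes))
}
```

Reading `memory_offset` this way, rather than hard-coding 4096, matches the stated intent that a future format may relocate the blob.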
| 66 | + |
| 67 | +The trailing PAGE_SIZE padding exists because Windows read-only file |
| 68 | +mappings cannot extend beyond the file's actual size. |
| 69 | +`ReadonlySharedMemory::from_file_windows` maps the entire file and |
| 70 | +uses `VirtualProtect(PAGE_NOACCESS)` on both the first page (header) |
| 71 | +and last page (trailing padding) as guard pages. Linux ignores this |
| 72 | +padding - its guard pages come from an anonymous mmap reservation. |
| 73 | + |
| 74 | +### Layout fields |
| 75 | + |
| 76 | +The 9 layout fields (offsets 72-136) are the primary inputs to |
| 77 | +`SandboxMemoryLayout::new()`. On load, a `SandboxConfiguration` is |
| 78 | +reconstructed from `input_data_size`, `output_data_size`, `heap_size`, |
| 79 | +and `scratch_size`; the remaining fields (`code_size`, |
| 80 | +`init_data_size`, `init_data_permissions`) are passed directly. |
| 81 | +`snapshot_size` and `pt_size` are set after construction. |
| 82 | + |
| 83 | +### Hypervisor tag |
| 84 | + |
| 85 | +Segment register hidden-cache fields (`unusable`, `type_`, |
| 86 | +`granularity`, `db`) differ between KVM, MSHV, and WHP for the same |
| 87 | +architectural state. Restoring sregs captured on one hypervisor into |
| 88 | +another may be rejected or produce subtly wrong behavior. The |
| 89 | +`hypervisor_tag` field ensures snapshots are only loaded on the same |
| 90 | +hypervisor that created them. See "Cross-hypervisor snapshot |
| 91 | +portability" under Future Work for how this restriction could be |
| 92 | +relaxed. |
| 93 | + |
| 94 | +### Special registers (sregs) |
| 95 | + |
| 96 | +The vCPU special registers are persisted because the guest init |
| 97 | +code sets up a GDT, IDT, TSS, and segment descriptors that differ |
| 98 | +from `standard_64bit_defaults`. Without the captured sregs, the guest |
| 99 | +triple-faults on dispatch. Specifically, the guest init sets: |
| 100 | + |
| 101 | +- cs/ds/es/fs/gs/ss with proper selectors, limits, and granularity |
| 102 | +- GDT and IDT base/limit pointing into guest high memory |
| 103 | +- TSS (task register) with a valid base, selector, and limit |
| 104 | +- LDT marked as unusable |
| 105 | + |
| 106 | +All fields are widened to u64 LE: 8 segment regs x 13 fields + 2 table |
| 107 | +regs x 2 fields + 7 control regs + 4 interrupt bitmap words = 119 u64s |
| 108 | +(952 bytes). The block is always written but ignored on load when `has_sregs = 0`. |
| 109 | + |
| 110 | +### What is NOT persisted |
| 111 | + |
| 112 | +| Field | Reason | |
| 113 | +|---|---| |
| 114 | +| `sandbox_id` | Process-local counter; fresh ID assigned on load | |
| 115 | +| `LoadInfo` | Debug-only; reconstructible from ELF if needed | |
| 116 | +| `regions` | Always empty after snapshot (absorbed into memory) | |
| 117 | +| Runtime config | Defaults used at load time | |
| 118 | +| Host function defs | Deferred to a follow-up PR | |
| 119 | + |
| 120 | +### What IS persisted |
| 121 | + |
| 122 | +The memory blob contains **only the snapshot region**: guest code, |
| 123 | +PEB, heap, init data, and page tables (`ReadonlySharedMemory`). |
| 124 | + |
| 125 | +The **scratch region** is recreated fresh on load via |
| 126 | +`ExclusiveSharedMemory::new()`, then initialized by |
| 127 | +`update_scratch_bookkeeping()` (copies page tables from snapshot to |
| 128 | +scratch, writes I/O buffer metadata). |
| 129 | + |
| 130 | +--- |
| 131 | + |
| 132 | +## Saving and Loading |
| 133 | + |
| 134 | +### `Snapshot::to_file(&self, path)` / `Snapshot::from_file(path)` |
| 135 | + |
| 136 | +Manual binary serialization via `SnapshotPreamble` + `SnapshotHeaderV1` |
| 137 | +structs with `write_to` / `read_from` methods, followed by the raw |
| 138 | +memory blob and trailing padding. `from_file` maps the memory blob |
| 139 | +via `ReadonlySharedMemory::from_file(&file, offset, len)`. |
| 140 | +`from_file_unchecked` skips the blake3 hash verification for trusted |
| 141 | +environments. |
| 142 | + |
| 143 | +On load, the header is validated in order: magic, format version, |
| 144 | +architecture, ABI version, hypervisor tag. Any mismatch produces a |
| 145 | +descriptive error. |
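
The validation order described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in `snapshot.rs`: the field offsets and the magic come from the layout table, but `validate_header`, the helper functions, the error strings, and the `EXPECTED_ABI` value are hypothetical placeholders.

```rust
// Magic and offsets follow the layout table. EXPECTED_ABI is a placeholder
// standing in for SNAPSHOT_ABI_VERSION, whose real value is not given here.
const MAGIC: [u8; 4] = *b"HLS\0";
const FORMAT_V1: u32 = 1;
const EXPECTED_ABI: u32 = 1;

fn u32_at(h: &[u8], off: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&h[off..off + 4]);
    u32::from_le_bytes(b)
}

fn u64_at(h: &[u8], off: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&h[off..off + 8]);
    u64::from_le_bytes(b)
}

/// Runs the checks in the documented order; the first failure is reported.
fn validate_header(h: &[u8], host_arch: u32, host_hv: u64) -> Result<(), String> {
    // Need at least the bytes up to and including hypervisor_tag (offset 168).
    if h.len() < 176 {
        return Err("file too short for a snapshot header".to_string());
    }
    if h[..4] != MAGIC[..] {
        return Err("bad magic bytes: not a snapshot file".to_string());
    }
    let format = u32_at(h, 4);
    if format != FORMAT_V1 {
        return Err(format!("format version {} unsupported (may be convertible)", format));
    }
    let arch = u32_at(h, 8);
    if arch != host_arch {
        return Err(format!("architecture tag {} does not match host {}", arch, host_arch));
    }
    let abi = u32_at(h, 12);
    if abi != EXPECTED_ABI {
        return Err(format!("ABI version {} != {}; snapshot must be regenerated", abi, EXPECTED_ABI));
    }
    let hv = u64_at(h, 168);
    if hv != host_hv {
        return Err(format!("hypervisor tag {} does not match host {}", hv, host_hv));
    }
    Ok(())
}
```

Checking the cheap, format-defining fields first means a file that is not a snapshot at all is rejected before any later field is interpreted.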
| 146 | + |
| 147 | +### `ReadonlySharedMemory::from_file(file, offset, len)` |
| 148 | + |
| 149 | +Cross-platform entry point that dispatches to platform-specific |
| 150 | +implementations: |
| 151 | + |
| 152 | +- **Linux** (`from_file_linux`): Reserves an anonymous `PROT_NONE` |
| 153 | + region (including guard pages), then maps the file content over |
| 154 | + the usable portion with `MAP_FIXED` + `MAP_PRIVATE` and `PROT_READ | PROT_WRITE`. |
| 155 | + KVM/MSHV need writable host mappings for CoW page fault handling. |
| 156 | + `HostMapping::Drop` calls `munmap` on the full region. |
| 157 | + |
| 158 | +- **Windows** (`from_file_windows`): `CreateFileMappingA(PAGE_READONLY)` |
| 159 | + + `MapViewOfFile(FILE_MAP_READ)` covering the full file (header + |
| 160 | + blob + trailing padding). The header becomes the leading guard page |
| 161 | + and the trailing padding becomes the trailing guard page, both via |
| 162 | + `VirtualProtect(PAGE_NOACCESS)`. The `HostMapping` carries the file |
| 163 | + mapping handle for the surrogate process. `HostMapping::Drop` calls |
| 164 | + `UnmapViewOfFile` + `CloseHandle`. |
| 165 | + |
| 166 | +Both paths produce a `HostMapping` with the standard layout: |
| 167 | +`ptr` = start of first guard page, `size` = guard + usable + guard. |
| 168 | +`base_ptr() = ptr + PAGE_SIZE`, `mem_size() = size - 2*PAGE_SIZE`. |
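
The guard-page arithmetic can be checked in isolation. The sketch below uses a stand-in `MappingSketch` type (hypothetical, not the real `HostMapping`) that stores addresses as plain integers, so nothing is actually mapped.

```rust
const PAGE_SIZE: usize = 4096;

/// Stand-in for `HostMapping`: one guard page on each side of the usable region.
struct MappingSketch {
    ptr: usize,  // address of the leading guard page
    size: usize, // guard + usable + guard, in bytes
}

impl MappingSketch {
    /// First usable byte, just past the leading guard page.
    fn base_ptr(&self) -> usize {
        self.ptr + PAGE_SIZE
    }

    /// Usable length, excluding both guard pages.
    fn mem_size(&self) -> usize {
        self.size - 2 * PAGE_SIZE
    }
}
```

For a snapshot file with an 8192-byte blob, the Windows mapping would cover 4096 + 8192 + 4096 bytes, so `mem_size()` yields 8192 and `base_ptr()` points at the blob.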
| 169 | + |
| 170 | +### `MultiUseSandbox::from_snapshot(snapshot: Arc<Snapshot>)` |
| 171 | + |
| 172 | +Creates a sandbox bypassing `UninitializedSandbox` and `evolve()`: |
| 173 | + |
| 174 | +1. Create default `FunctionRegistry` |
| 175 | +2. Build `SandboxConfiguration` from snapshot layout fields |
| 176 | +3. `SandboxMemoryManager::from_snapshot()` - clones the |
| 177 | + `ReadonlySharedMemory`, creates fresh scratch |
| 178 | +4. `mgr.build()` - splits into host/guest views, runs |
| 179 | + `update_scratch_bookkeeping()` |
| 180 | +5. `setup_signal_handlers()` (Linux only - VCPU interrupt signaling) |
| 181 | +6. `set_up_hypervisor_partition()` - creates VM (KVM/MSHV on Linux, |
| 182 | + WHP on Windows), maps slot 0 (snapshot) and slot 1 (scratch) |
| 183 | +7. `vm.initialise()` - runs guest init if `NextAction::Initialise`, |
| 184 | + no-op if `NextAction::Call` |
| 185 | +8. For post-init snapshots, `vm.apply_sregs()` applies captured |
| 186 | + sregs (sets sregs + pending TLB flush, no redundant GPR/debug/FPU |
| 187 | + resets) |
| 188 | +9. Returns `MultiUseSandbox` |
| 189 | + |
| 190 | +Host functions are not yet supported when loading from snapshot. |
| 191 | +A `SnapshotLoader` builder with `.with_host_function()` is planned |
| 192 | +as future work. |
| 193 | + |
| 194 | +### Supporting changes |
| 195 | + |
| 196 | +- `SandboxMemoryLayout` simplified to 9 `pub(crate)` fields with |
| 197 | + computed `#[inline]` offset methods; `new()` takes |
| 198 | + `SandboxConfiguration`, `code_size`, `init_data_size`, |
| 199 | + `init_data_permissions` |
| 200 | +- `HyperlightPEB::write_to()` and `GuestMemoryRegion::write_to()` |
| 201 | + added to `hyperlight_common` |
| 202 | +- `HyperlightVm::apply_sregs()` added to `hyperlight_vm/x86_64.rs` |
| 203 | + for efficient sreg restore without redundant register resets |
| 204 | + |
| 205 | +--- |
| 206 | + |
| 207 | +## Files |
| 208 | + |
| 209 | +| File | Purpose | |
| 210 | +|---|---| |
| 211 | +| `src/hyperlight_host/src/sandbox/snapshot.rs` | File format types, `to_file`, `from_file`, `from_file_unchecked`, sregs serialization, `HypervisorTag`, 10 tests | |
| 212 | +| `src/hyperlight_host/src/sandbox/initialized_multi_use.rs` | `MultiUseSandbox::from_snapshot(Arc<Snapshot>)` (cross-platform) | |
| 213 | +| `src/hyperlight_host/src/mem/shared_mem.rs` | `ReadonlySharedMemory::from_file()` (cross-platform dispatch to `from_file_linux` / `from_file_windows`) | |
| 214 | +| `src/hyperlight_host/src/mem/memory_region.rs` | `SurrogateMapping` routing for `Snapshot` regions | |
| 215 | +| `src/hyperlight_host/src/mem/layout.rs` | Simplified to 9 fields, computed offset methods, `write_peb()` uses `HyperlightPEB::write_to()` | |
| 216 | +| `src/hyperlight_common/src/mem.rs` | `HyperlightPEB::write_to()`, `GuestMemoryRegion::write_to()` | |
| 217 | +| `src/hyperlight_host/src/hypervisor/hyperlight_vm/x86_64.rs` | `apply_sregs()` method | |
| 218 | +| `src/hyperlight_host/benches/benchmarks.rs` | `snapshot_files` benchmark group | |
| 219 | + |
| 220 | +--- |
| 221 | + |
| 222 | +## Tests |
| 223 | + |
| 224 | +All in `snapshot_file_tests` module inside `snapshot.rs`: |
| 225 | + |
| 226 | +1. `from_snapshot_in_memory` - pre-init snapshot (Initialise entrypoint) |
| 227 | +2. `from_snapshot_post_init_in_memory` - post-init snapshot (Call |
| 228 | + entrypoint) |
| 229 | +3. `round_trip_save_load_call` - save post-init snapshot, load from |
| 230 | + file, create sandbox, call guest function |
| 231 | +4. `hash_verification_detects_corruption` - corrupt memory blob byte, |
| 232 | + verify load fails |
| 233 | +5. `arch_mismatch_rejected` - modify arch tag, verify load fails |
| 234 | +6. `format_version_mismatch_rejected` - modify version, verify load |
| 235 | + fails with "convertible" hint |
| 236 | +7. `abi_version_mismatch_rejected` - modify ABI version, verify load |
| 237 | + fails with "regenerated" hint |
| 238 | +8. `restore_from_loaded_snapshot` - load, mutate, snapshot, mutate, |
| 239 | + restore, verify |
| 240 | +9. `multiple_sandboxes_from_same_file` - two sandboxes from same file, |
| 241 | + verify independence |
| 242 | +10. `snapshot_then_save_round_trip` - load, mutate, save, load again, |
| 243 | + verify mutated state persisted |
| 244 | + |
| 245 | +--- |
| 246 | + |
| 247 | +## Benchmarks |
| 248 | + |
| 249 | +Benchmark group `snapshot_files` with 5 benchmarks per size (default, |
| 250 | +small/8MB, medium/64MB, large/256MB): |
| 251 | + |
| 252 | +- `save_snapshot` - `snapshot.to_file()` |
| 253 | +- `load_snapshot` - `Snapshot::from_file()` (mmap + hash verify) |
| 254 | +- `cold_start_via_evolve` - `new()` + `evolve()` + `call("Echo")` |
| 255 | +- `cold_start_via_snapshot` - `from_file()` + `from_snapshot()` |
| 256 | + + `call("Echo")` |
| 257 | +- `cold_start_via_snapshot_unchecked` - same with `from_file_unchecked()` |
| 258 | + |
| 259 | +--- |
| 260 | + |
| 261 | +## Results (Linux/KVM) |
| 262 | + |
| 263 | +All three paths measure end-to-end wall-clock time from zero state to |
| 264 | +a completed guest function call (`Echo("hello\n") -> "hello\n"`). |
| 265 | +Each path includes creating the VM, mapping memory, and dispatching |
| 266 | +one guest call. |
| 267 | + |
| 268 | +- **evolve path**: parse ELF, build page tables, create VM, run guest |
| 269 | + init code, call guest function |
| 270 | +- **snapshot path (verified)**: open file, read header, mmap memory |
| 271 | + blob from file at page-aligned offset, hash-verify entire blob, |
| 272 | + create VM from snapshot, call guest function |
| 273 | +- **snapshot path (unverified)**: same but skip hash verification |
| 274 | + |
| 275 | +| Heap size | evolve path | snapshot (verified) | snapshot (unverified) | Speedup (unverified vs evolve) | |
| 276 | +|---|---|---|---|---| |
| 277 | +| 128 KB (default) | 3.09 ms | 2.32 ms | 2.24 ms | 1.4x | |
| 278 | +| 8 MB | 7.29 ms | 4.91 ms | 2.39 ms | 3.1x | |
| 279 | +| 64 MB | 24.1 ms | 22.3 ms | 2.74 ms | 8.8x | |
| 280 | +| 256 MB | 78.9 ms | 57.3 ms | 2.64 ms | 30x | |
| 281 | + |
| 282 | +The unverified snapshot path is roughly constant (2.2-2.7 ms) |
| 283 | +regardless of snapshot size because the mmap is lazy - pages are only |
| 284 | +faulted in as the guest touches them. Hash verification dominates for |
| 285 | +larger snapshots since it touches the entire memory blob. |
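
The speedup column is the evolve time divided by the unverified snapshot time. A small sketch that recomputes it from the table rows (the `speedup` function and `ROWS` constant are names made up for this example):

```rust
/// Speedup of the snapshot path over the evolve path (both in milliseconds).
fn speedup(evolve_ms: f64, snapshot_ms: f64) -> f64 {
    evolve_ms / snapshot_ms
}

/// (heap size, evolve ms, unverified snapshot ms), copied from the table above.
const ROWS: [(&str, f64, f64); 4] = [
    ("128 KB", 3.09, 2.24),
    ("8 MB", 7.29, 2.39),
    ("64 MB", 24.1, 2.74),
    ("256 MB", 78.9, 2.64),
];

/// Formats one line per row, e.g. for printing next to the benchmark output.
fn speedup_report() -> Vec<String> {
    ROWS.iter()
        .map(|(label, evolve, snap)| format!("{}: {:.1}x", label, speedup(*evolve, *snap)))
        .collect()
}
```

The ratio grows with heap size because the evolve time scales with the amount of memory to set up, while the unverified snapshot time stays nearly flat.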
| 286 | + |
| 287 | +--- |
| 288 | + |
| 289 | +## Future Work |
| 290 | + |
| 291 | +- **`SnapshotLoader` builder**: Replace `from_snapshot(snapshot)` |
| 292 | + with a builder that takes `.with_host_function()`, |
| 293 | + `.with_interrupt_retry_delay()`, validates host functions at |
| 294 | + `build()`. |
| 295 | +- **Host function defs in file format**: Serialize function signatures |
| 296 | + into the snapshot file, validate on load |
| 297 | +- **Typed error variants**: `SnapshotVersionMismatch`, etc. |
| 298 | +- **Feature-gate support**: `nanvix-unstable`, `crashdump`, `gdb` cfgs |
| 299 | +- **Single-mmap loading**: mmap the entire snapshot file once and parse |
| 300 | + the header from the mapped bytes instead of `read()` + separate mmap. |
| 301 | + Requires refactoring `HostMapping` guard page assumptions. Saves ~1 us |
| 302 | + per load (negligible vs ~3 ms total), but simplifies the I/O path. |
| 303 | +- **Fuzz target**: Fuzz `from_file` with arbitrary bytes |
| 304 | +- **CLI tool**: `hl snap bake?` |
| 305 | +- **CoW overlay layers** |
| 306 | +- **Cross-hypervisor snapshot portability**: The `hypervisor_tag` |
| 307 | + rejects cross-hypervisor loads because segment register hidden-cache |
| 308 | + fields differ between KVM, MSHV, and WHP. Could potentially be |
| 309 | + relaxed in the future (needs sregs normalization and maybe more). |
| 310 | +- **Huge page support**: The 4 KB header is sufficient for transparent |
| 311 | + huge pages via `madvise(MADV_HUGEPAGE)`. Explicit `MAP_HUGETLB` |
| 312 | + would require a 2 MB-aligned blob offset; the `memory_offset` field |
| 313 | + already supports this without a format version bump. |
| 314 | +- **OCI distribution** |
| 315 | +- **Malicious header hardening**: The header is currently trusted after |
| 316 | + magic/version/arch/ABI/hypervisor validation. A crafted snapshot |
| 317 | + file could supply out-of-range layout fields (e.g. huge heap_size, |
| 318 | + memory_size larger than the file, overlapping regions) that cause |
| 319 | + excessive allocation, out-of-bounds access, or other misbehavior. |
| 320 | + The blake3 hash covers the memory blob but not the header itself. |
| 321 | + Consider: validating header fields against sane bounds, hashing the |
| 322 | + full header, and fuzzing `from_file` with arbitrary bytes. |
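
One possible shape for the bounds validation mentioned in the last item is sketched below. It is not part of the current implementation: `check_blob_bounds` and its error strings are hypothetical, and the assumption that the file must hold one trailing padding page after the blob is taken from the layout table.

```rust
const PAGE_SIZE: u64 = 4096;

/// Hypothetical pre-mmap check of two header fields against the actual
/// file length. Uses checked arithmetic so that huge values cannot wrap.
fn check_blob_bounds(memory_offset: u64, memory_size: u64, file_len: u64) -> Result<(), String> {
    // The blob must start on a page boundary, after the header page.
    if memory_offset < PAGE_SIZE || memory_offset % PAGE_SIZE != 0 {
        return Err(format!("memory_offset {} is not a valid page-aligned offset", memory_offset));
    }
    if memory_size == 0 {
        return Err("memory_size is zero".to_string());
    }
    // End of blob plus the trailing padding page must fit in the file.
    let required = memory_offset
        .checked_add(memory_size)
        .and_then(|end| end.checked_add(PAGE_SIZE))
        .ok_or_else(|| "memory_offset + memory_size overflows".to_string())?;
    if required > file_len {
        return Err(format!("file is {} bytes but header requires {}", file_len, required));
    }
    Ok(())
}
```

A check like this would run after the version checks and before any mapping is created, so a crafted `memory_size` is rejected without touching memory.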