Commit 202cc70

feat(sandbox): persist snapshots to disk + zero-copy load
Squash of hyperlight-dev#1373 by Ludvig Liljenberg onto current upstream main. Ports his three-commit series (layout refactor, design doc, persistence) as a single commit on this branch so we can iterate it without touching his fork.

Highlights:

- Snapshot::to_file(path): write a sandbox snapshot to disk (header + page-aligned blob + CoW bitmap + guard-page padding)
- Snapshot::from_file(path): mmap it back with zero copy
- MultiUseSandbox::from_snapshot(): instantiate a sandbox directly from a persisted snapshot, bypassing ELF parsing and guest init
- ReadonlySharedMemory::from_file: the shared-memory primitive under both of the above, with Linux (mmap(MAP_PRIVATE)) and Windows (CreateFileMappingA + MapViewOfFile) zero-copy paths

See docs/snapshot-file-implementation-plan.md for the wire format.

The Windows code path currently maps the file as read-only shared (PAGE_READONLY / FILE_MAP_READ) rather than true copy-on-write (PAGE_WRITECOPY / FILE_MAP_COPY). That works for the boot path on WHP because guest writes go through the surrogate's own mapping, but breaks the contract for anything that writes directly through the host view. A follow-up commit on this branch switches it to true CoW so the API matches the Linux semantics end-to-end.

Based-on: hyperlight-dev#1373
Authored-by: Ludvig Liljenberg <ludfjig@users.noreply.github.com>
Signed-off-by: danbugs <danilochiarlone@gmail.com>
1 parent 2fca7ae commit 202cc70

13 files changed

Lines changed: 2056 additions & 312 deletions

Lines changed: 322 additions & 0 deletions
@@ -0,0 +1,322 @@
# Snapshot File Format

## Overview

Save a `Snapshot` to disk and load it back via zero-copy file
mapping, so that a `MultiUseSandbox` can be created directly from a
file without re-parsing the guest ELF or re-running guest init code.

- **Linux**: `mmap(MAP_PRIVATE)` at page-aligned offset - zero copy,
  demand-paged by the kernel.
- **Windows**: `CreateFileMappingA(PAGE_READONLY)` +
  `MapViewOfFile(FILE_MAP_READ)` - zero copy, demand-paged by the OS.

Cross-platform (Linux + Windows). Default feature flags only
(`nanvix-unstable`, `crashdump`, `gdb` not handled).

---

## File Format

The file uses a versioned header with two independent version checks:

- **Format version** (`FormatVersion` enum): controls the byte layout
  of the header itself. A format version mismatch may be convertible
  by re-serializing the header.
- **ABI version** (`SNAPSHOT_ABI_VERSION` constant): covers the
  contents and interpretation of the memory blob. An ABI mismatch
  means the snapshot must be regenerated from the guest binary.

```
Offset  Size    Field
------  ------- --------------------------------------------------
0       4       Magic bytes: "HLS\0"
4       4       Format version (u32 LE: 1 = V1)
8       4       Architecture tag (u32 LE: 1 = x86_64, 2 = aarch64)
12      4       ABI version (u32 LE: must match SNAPSHOT_ABI_VERSION)
16      32      Content hash (blake3, over memory blob only)
48      8       stack_top_gva (u64 LE)
56      8       Entrypoint tag (u64 LE: 0 = Initialise, 1 = Call)
64      8       Entrypoint address (u64 LE)
72      8       input_data_size (u64 LE)
80      8       output_data_size (u64 LE)
88      8       heap_size (u64 LE)
96      8       code_size (u64 LE)
104     8       init_data_size (u64 LE)
112     8       init_data_permissions (u64 LE: 0 = None, else bits)
120     8       scratch_size (u64 LE)
128     8       snapshot_size (u64 LE)
136     8       pt_size (u64 LE: 0 = None)
144     8       memory_size (u64 LE) - byte length of memory blob
                Derivable from layout fields today, but stored for
                forward compat (e.g. compression).
152     8       memory_offset (u64 LE) - byte offset from file start
                Always SNAPSHOT_HEADER_SIZE today, but stored so a
                future format can relocate the blob without breaking.
160     8       has_sregs (u64 LE: 1 = present, 0 = absent)
168     8       hypervisor_tag (u64 LE: 1 = KVM, 2 = MSHV, 3 = WHP)
176     952     sregs fields (all widened to u64 LE, see below)
1128    2968    Zero padding to 4096-byte boundary
4096    *       Memory blob (page-aligned, uncompressed, mmap target)
*+4096  4096    Trailing zero padding (guard page backing for Windows)
```

Total header before padding: 1128 bytes (176 + 952), well within the
4096-byte page.
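
As an illustration of the fixed offsets in the table, the following is a
minimal sketch that pulls a few fields out of a raw header page and derives
the expected file length (blob offset + blob + one trailing padding page).
`HeaderSummary`, `summarize_header`, and `expected_file_len` are hypothetical
names for this sketch only; the real code uses `SnapshotPreamble` and
`SnapshotHeaderV1`, whose definitions are not shown here.

```rust
/// Page size used by the layout table (header page and trailing padding page).
const PAGE_SIZE: u64 = 4096;
/// Magic bytes stored at offset 0.
const MAGIC: &[u8; 4] = b"HLS\0";

/// Hypothetical summary of a few header fields; not the crate's header type.
#[derive(Debug, PartialEq)]
struct HeaderSummary {
    format_version: u32,
    arch_tag: u32,
    abi_version: u32,
    memory_size: u64,
    memory_offset: u64,
}

/// Read a little-endian u32 at `off`, or `None` if the buffer is too short.
fn read_u32_le(buf: &[u8], off: usize) -> Option<u32> {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(buf.get(off..off.checked_add(4)?)?);
    Some(u32::from_le_bytes(raw))
}

/// Read a little-endian u64 at `off`, or `None` if the buffer is too short.
fn read_u64_le(buf: &[u8], off: usize) -> Option<u64> {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(buf.get(off..off.checked_add(8)?)?);
    Some(u64::from_le_bytes(raw))
}

/// Check the magic, then read fields at the offsets listed in the table.
fn summarize_header(buf: &[u8]) -> Option<HeaderSummary> {
    if buf.get(..4)? != &MAGIC[..] {
        return None;
    }
    Some(HeaderSummary {
        format_version: read_u32_le(buf, 4)?,
        arch_tag: read_u32_le(buf, 8)?,
        abi_version: read_u32_le(buf, 12)?,
        memory_size: read_u64_le(buf, 144)?,
        memory_offset: read_u64_le(buf, 152)?,
    })
}

/// Expected total file length; `None` if the sum overflows.
fn expected_file_len(h: &HeaderSummary) -> Option<u64> {
    h.memory_offset
        .checked_add(h.memory_size)?
        .checked_add(PAGE_SIZE)
}
```

The helpers return `Option` rather than panicking so that a truncated or
foreign file is rejected instead of causing an out-of-bounds slice.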

The trailing PAGE_SIZE padding exists because Windows read-only file
mappings cannot extend beyond the file's actual size.
`ReadonlySharedMemory::from_file_windows` maps the entire file and
uses `VirtualProtect(PAGE_NOACCESS)` on both the first page (header)
and last page (trailing padding) as guard pages. Linux ignores this
padding - its guard pages come from an anonymous mmap reservation.

### Layout fields

The 9 layout fields (offsets 72-136) are the primary inputs to
`SandboxMemoryLayout::new()`. On load, a `SandboxConfiguration` is
reconstructed from `input_data_size`, `output_data_size`, `heap_size`,
and `scratch_size`; the remaining fields (`code_size`,
`init_data_size`, `init_data_permissions`) are passed directly.
`snapshot_size` and `pt_size` are set after construction.

### Hypervisor tag

Segment register hidden-cache fields (`unusable`, `type_`,
`granularity`, `db`) differ between KVM, MSHV, and WHP for the same
architectural state. Restoring sregs captured on one hypervisor into
another may be rejected or produce subtly wrong behavior. The
`hypervisor_tag` field ensures snapshots are only loaded on the same
hypervisor that created them. See "Cross-hypervisor snapshot
portability" under Future Work for how this restriction could be
relaxed.

### Special registers (sregs)

The vCPU special registers are persisted because the guest init
code sets up a GDT, IDT, TSS, and segment descriptors that differ
from `standard_64bit_defaults`. Without the captured sregs, the guest
triple-faults on dispatch. Specifically, the guest init sets:

- cs/ds/es/fs/gs/ss with proper selectors, limits, and granularity
- GDT and IDT base/limit pointing into guest high memory
- TSS (task register) with a valid base, selector, and limit
- LDT marked as unusable

All fields widened to u64 LE: 8 segment regs x 13 fields + 2 table
regs x 2 fields + 7 control regs + 4 interrupt bitmap = 119 u64s
(952 bytes). Always written; ignored on load when `has_sregs = 0`.
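
The size arithmetic above can be checked mechanically. The sketch below
recomputes the sregs block size from the counts given in the text and
derives where the header ends and how much zero padding is needed to reach
the 4096-byte boundary. All constant names are made up for this sketch; only
the numbers come from this document.

```rust
// Counts taken from the paragraph above; names are illustrative only.
const SEGMENT_REGS: usize = 8;
const FIELDS_PER_SEGMENT_REG: usize = 13;
const TABLE_REGS: usize = 2;
const FIELDS_PER_TABLE_REG: usize = 2;
const CONTROL_REGS: usize = 7;
const INTERRUPT_BITMAP_WORDS: usize = 4;

/// Offset of the sregs block in the header (from the layout table).
const SREGS_OFFSET: usize = 176;
/// Size of the header page.
const HEADER_PAGE: usize = 4096;

/// Number of u64 values in the serialized sregs block.
const SREGS_U64_COUNT: usize = SEGMENT_REGS * FIELDS_PER_SEGMENT_REG
    + TABLE_REGS * FIELDS_PER_TABLE_REG
    + CONTROL_REGS
    + INTERRUPT_BITMAP_WORDS;
/// Byte size of the sregs block (every field widened to 8 bytes).
const SREGS_BYTES: usize = SREGS_U64_COUNT * 8;
/// First byte after the last header field.
const HEADER_END: usize = SREGS_OFFSET + SREGS_BYTES;
/// Zero padding required to reach the page boundary.
const HEADER_PADDING: usize = HEADER_PAGE - HEADER_END;

/// Returns (u64 count, sregs bytes, header end, header padding).
fn sregs_layout() -> (usize, usize, usize, usize) {
    (SREGS_U64_COUNT, SREGS_BYTES, HEADER_END, HEADER_PADDING)
}
```

With these counts the block is 119 u64s (952 bytes), the header ends at byte
1128, and 2968 bytes of padding bring it to 4096.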

### What is NOT persisted

| Field | Reason |
|---|---|
| `sandbox_id` | Process-local counter; fresh ID assigned on load |
| `LoadInfo` | Debug-only; reconstructible from ELF if needed |
| `regions` | Always empty after snapshot (absorbed into memory) |
| Runtime config | Defaults used at load time |
| Host function defs | Deferred to a follow-up PR |

### What IS persisted

The memory blob contains **only the snapshot region**: guest code,
PEB, heap, init data, and page tables (`ReadonlySharedMemory`).

The **scratch region** is recreated fresh on load via
`ExclusiveSharedMemory::new()`, then initialized by
`update_scratch_bookkeeping()` (copies page tables from snapshot to
scratch, writes I/O buffer metadata).

---
131+
132+
## Saving and Loading
133+
134+
### `Snapshot::to_file(&self, path)` / `Snapshot::from_file(path)`
135+
136+
Manual binary serialization via `SnapshotPreamble` + `SnapshotHeaderV1`
137+
structs with `write_to` / `read_from` methods, followed by the raw
138+
memory blob and trailing padding. `from_file` maps the memory blob
139+
via `ReadonlySharedMemory::from_file(&file, offset, len)`.
140+
`from_file_unchecked` skips the blake3 hash verification for trusted
141+
environments.
142+
143+
On load, the header is validated in order: magic, format version,
144+
architecture, ABI version, hypervisor tag. Any mismatch produces a
145+
descriptive error.
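
The validation order can be sketched as a chain of early returns, so that
the first mismatching field decides which error the caller sees. The
`HeaderError` enum, the `Fields` struct, and the message wording are
assumptions made for this sketch (typed error variants are listed under
Future Work, so the real code reports errors differently); only the check
order and the "convertible" / "regenerated" hints come from this document.

```rust
use std::fmt;

/// Hypothetical error type; one variant per validated field.
#[derive(Debug, PartialEq)]
enum HeaderError {
    BadMagic,
    FormatVersion { found: u32, expected: u32 },
    Architecture { found: u32, expected: u32 },
    AbiVersion { found: u32, expected: u32 },
    Hypervisor { found: u64, expected: u64 },
}

impl fmt::Display for HeaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HeaderError::BadMagic => write!(f, "not a snapshot file (bad magic)"),
            HeaderError::FormatVersion { found, expected } => write!(
                f,
                "format version {} != {}; the header may be convertible by re-serializing it",
                found, expected
            ),
            HeaderError::Architecture { found, expected } => {
                write!(f, "architecture tag {} != {}", found, expected)
            }
            HeaderError::AbiVersion { found, expected } => write!(
                f,
                "ABI version {} != {}; the snapshot must be regenerated from the guest binary",
                found, expected
            ),
            HeaderError::Hypervisor { found, expected } => {
                write!(f, "hypervisor tag {} != {}", found, expected)
            }
        }
    }
}

/// The header fields that are validated on load (hypothetical grouping).
struct Fields {
    magic: [u8; 4],
    format_version: u32,
    arch_tag: u32,
    abi_version: u32,
    hypervisor_tag: u64,
}

/// Validate in the documented order; the first mismatch wins.
fn validate(found: &Fields, expected: &Fields) -> Result<(), HeaderError> {
    if found.magic != expected.magic {
        return Err(HeaderError::BadMagic);
    }
    if found.format_version != expected.format_version {
        return Err(HeaderError::FormatVersion {
            found: found.format_version,
            expected: expected.format_version,
        });
    }
    if found.arch_tag != expected.arch_tag {
        return Err(HeaderError::Architecture {
            found: found.arch_tag,
            expected: expected.arch_tag,
        });
    }
    if found.abi_version != expected.abi_version {
        return Err(HeaderError::AbiVersion {
            found: found.abi_version,
            expected: expected.abi_version,
        });
    }
    if found.hypervisor_tag != expected.hypervisor_tag {
        return Err(HeaderError::Hypervisor {
            found: found.hypervisor_tag,
            expected: expected.hypervisor_tag,
        });
    }
    Ok(())
}
```

Checking the format version before the ABI version matters because a header
written in an unknown layout cannot be trusted to hold the ABI field at the
expected offset.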
146+
147+
### `ReadonlySharedMemory::from_file(file, offset, len)`
148+
149+
Cross-platform entry point that dispatches to platform-specific
150+
implementations:
151+
152+
- **Linux** (`from_file_linux`): Allocates anonymous `PROT_NONE`
153+
region (with guard pages), then `MAP_FIXED` the file content over
154+
the usable portion with `PROT_READ | PROT_WRITE` + `MAP_PRIVATE`.
155+
KVM/MSHV need writable host mappings for CoW page fault handling.
156+
`HostMapping::Drop` calls `munmap` on the full region.
157+
158+
- **Windows** (`from_file_windows`): `CreateFileMappingA(PAGE_READONLY)`
159+
+ `MapViewOfFile(FILE_MAP_READ)` covering the full file (header +
160+
blob + trailing padding). The header becomes the leading guard page
161+
and the trailing padding becomes the trailing guard page, both via
162+
`VirtualProtect(PAGE_NOACCESS)`. The `HostMapping` carries the file
163+
mapping handle for the surrogate process. `HostMapping::Drop` calls
164+
`UnmapViewOfFile` + `CloseHandle`.
165+
166+
Both paths produce a `HostMapping` with the standard layout:
167+
`ptr` = start of first guard page, `size` = guard + usable + guard.
168+
`base_ptr() = ptr + PAGE_SIZE`, `mem_size() = size - 2*PAGE_SIZE`.
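
The guard + usable + guard arithmetic can be written down as a small sketch.
`MappingLayout` is a hypothetical stand-in that models addresses as plain
integers; it performs no mapping and is not the crate's `HostMapping`. The
alignment checks in `new` are an assumption of this sketch.

```rust
const PAGE_SIZE: usize = 4096;

/// Hypothetical model of the `ptr` / `size` pair described above.
#[derive(Debug, PartialEq)]
struct MappingLayout {
    /// Address of the first (leading) guard page.
    ptr: usize,
    /// Total size: leading guard + usable region + trailing guard.
    size: usize,
}

impl MappingLayout {
    /// Build a layout for a page-aligned `ptr` and a page-multiple usable size.
    fn new(ptr: usize, usable: usize) -> Option<Self> {
        if usable == 0 || usable % PAGE_SIZE != 0 || ptr % PAGE_SIZE != 0 {
            return None;
        }
        let size = usable.checked_add(2 * PAGE_SIZE)?;
        // Reject layouts whose end address would wrap around.
        let _end = ptr.checked_add(size)?;
        Some(Self { ptr, size })
    }

    /// Start of the usable region: one page past the leading guard.
    fn base_ptr(&self) -> usize {
        self.ptr + PAGE_SIZE
    }

    /// Usable size: total size minus the two guard pages.
    fn mem_size(&self) -> usize {
        self.size - 2 * PAGE_SIZE
    }

    /// Address of the trailing guard page.
    fn trailing_guard(&self) -> usize {
        self.base_ptr() + self.mem_size()
    }
}
```

For a two-page usable region at `ptr = 0x10000`, the usable bytes start at
`0x11000` and the trailing guard page sits at `0x13000`.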

### `MultiUseSandbox::from_snapshot(snapshot: Arc<Snapshot>)`

Creates a sandbox bypassing `UninitializedSandbox` and `evolve()`:

1. Create default `FunctionRegistry`
2. Build `SandboxConfiguration` from snapshot layout fields
3. `SandboxMemoryManager::from_snapshot()` - clones the
   `ReadonlySharedMemory`, creates fresh scratch
4. `mgr.build()` - splits into host/guest views, runs
   `update_scratch_bookkeeping()`
5. `setup_signal_handlers()` (Linux only - VCPU interrupt signaling)
6. `set_up_hypervisor_partition()` - creates VM (KVM/MSHV on Linux,
   WHP on Windows), maps slot 0 (snapshot) and slot 1 (scratch)
7. `vm.initialise()` - runs guest init if `NextAction::Initialise`,
   no-op if `NextAction::Call`
8. For post-init snapshots, `vm.apply_sregs()` applies captured
   sregs (sets sregs + pending TLB flush, no redundant GPR/debug/FPU
   resets)
9. Returns `MultiUseSandbox`

Host functions are not yet supported when loading from snapshot.
A `SnapshotLoader` builder with `.with_host_function()` is planned
as future work.

### Supporting changes

- `SandboxMemoryLayout` simplified to 9 `pub(crate)` fields with
  computed `#[inline]` offset methods; `new()` takes
  `SandboxConfiguration`, `code_size`, `init_data_size`,
  `init_data_permissions`
- `HyperlightPEB::write_to()` and `GuestMemoryRegion::write_to()`
  added to `hyperlight_common`
- `HyperlightVm::apply_sregs()` added to `hyperlight_vm/x86_64.rs`
  for efficient sreg restore without redundant register resets

---
206+
207+
## Files
208+
209+
| File | Purpose |
210+
|---|---|
211+
| `src/hyperlight_host/src/sandbox/snapshot.rs` | File format types, `to_file`, `from_file`, `from_file_unchecked`, sregs serialization, `HypervisorTag`, 10 tests |
212+
| `src/hyperlight_host/src/sandbox/initialized_multi_use.rs` | `MultiUseSandbox::from_snapshot(Arc<Snapshot>)` (cross-platform) |
213+
| `src/hyperlight_host/src/mem/shared_mem.rs` | `ReadonlySharedMemory::from_file()` (cross-platform dispatch to `from_file_linux` / `from_file_windows`) |
214+
| `src/hyperlight_host/src/mem/memory_region.rs` | `SurrogateMapping` routing for `Snapshot` regions |
215+
| `src/hyperlight_host/src/mem/layout.rs` | Simplified to 9 fields, computed offset methods, `write_peb()` uses `HyperlightPEB::write_to()` |
216+
| `src/hyperlight_common/src/mem.rs` | `HyperlightPEB::write_to()`, `GuestMemoryRegion::write_to()` |
217+
| `src/hyperlight_host/src/hypervisor/hyperlight_vm/x86_64.rs` | `apply_sregs()` method |
218+
| `src/hyperlight_host/benches/benchmarks.rs` | `snapshot_files` benchmark group |
219+
220+
---
221+
222+
## Tests
223+
224+
All in `snapshot_file_tests` module inside `snapshot.rs`:
225+
226+
1. `from_snapshot_in_memory` - pre-init snapshot (Initialise entrypoint)
227+
2. `from_snapshot_post_init_in_memory` - post-init snapshot (Call
228+
entrypoint)
229+
3. `round_trip_save_load_call` - save post-init snapshot, load from
230+
file, create sandbox, call guest function
231+
4. `hash_verification_detects_corruption` - corrupt memory blob byte,
232+
verify load fails
233+
5. `arch_mismatch_rejected` - modify arch tag, verify load fails
234+
6. `format_version_mismatch_rejected` - modify version, verify load
235+
fails with "convertible" hint
236+
7. `abi_version_mismatch_rejected` - modify ABI version, verify load
237+
fails with "regenerated" hint
238+
8. `restore_from_loaded_snapshot` - load, mutate, snapshot, mutate,
239+
restore, verify
240+
9. `multiple_sandboxes_from_same_file` - two sandboxes from same file,
241+
verify independence
242+
10. `snapshot_then_save_round_trip` - load, mutate, save, load again,
243+
verify mutated state persisted
244+
245+
---
246+
247+
## Benchmarks
248+
249+
Benchmark group `snapshot_files` with 5 benchmarks per size (default,
250+
small/8MB, medium/64MB, large/256MB):
251+
252+
- `save_snapshot` - `snapshot.to_file()`
253+
- `load_snapshot` - `Snapshot::from_file()` (mmap + hash verify)
254+
- `cold_start_via_evolve` - `new()` + `evolve()` + `call("Echo")`
255+
- `cold_start_via_snapshot` - `from_file()` + `from_snapshot()`
256+
+ `call("Echo")`
257+
- `cold_start_via_snapshot_unchecked` - same with `from_file_unchecked()`
258+
259+
---
260+
261+
## Results (Linux/KVM)
262+
263+
All three paths measure end-to-end wall-clock time from zero state to
264+
a completed guest function call (`Echo("hello\n") -> "hello\n"`).
265+
Each path includes creating the VM, mapping memory, and dispatching
266+
one guest call.
267+
268+
- **evolve path**: parse ELF, build page tables, create VM, run guest
269+
init code, call guest function
270+
- **snapshot path (verified)**: open file, read header, mmap memory
271+
blob from file at page-aligned offset, hash-verify entire blob,
272+
create VM from snapshot, call guest function
273+
- **snapshot path (unverified)**: same but skip hash verification
274+
275+
| Heap size | evolve path | snapshot (verified) | snapshot (unverified) | Speedup (unverified vs evolve) |
276+
|---|---|---|---|---|
277+
| 128 KB (default) | 3.09 ms | 2.32 ms | 2.24 ms | 1.4x |
278+
| 8 MB | 7.29 ms | 4.91 ms | 2.39 ms | 3.1x |
279+
| 64 MB | 24.1 ms | 22.3 ms | 2.74 ms | 8.8x |
280+
| 256 MB | 78.9 ms | 57.3 ms | 2.64 ms | 30x |
281+
282+
The unverified snapshot path is constant time (~3 ms) regardless of
283+
snapshot size because the mmap is lazy - pages are only faulted in as
284+
the guest touches them. Hash verification dominates for larger
285+
snapshots since it touches the entire memory blob.
286+
287+
---
288+
289+
## Future Work
290+
291+
- **`SnapshotLoader` builder**: Replace `from_snapshot(snapshot)`
292+
with a builder that takes `.with_host_function()`,
293+
`.with_interrupt_retry_delay()`, validates host functions at
294+
`build()`.
295+
- **Host function defs in file format**: Serialize function signatures
296+
into the snapshot file, validate on load
297+
- **Typed error variants**: `SnapshotVersionMismatch`, etc.
298+
- **Feature-gate support**: `nanvix-unstable`, `crashdump`, `gdb` cfgs
299+
- **Single-mmap loading**: mmap the entire snapshot file once and parse
300+
the header from the mapped bytes instead of `read()` + separate mmap.
301+
Requires refactoring `HostMapping` guard page assumptions. Saves ~1 us
302+
per load (negligible vs ~3 ms total), but simplifies the I/O path.
303+
- **Fuzz target**: Fuzz `from_file` with arbitrary bytes
304+
- **CLI tool**: `hl snap bake?`
305+
- **CoW overlay layers**
306+
- **Cross-hypervisor snapshot portability**: The `hypervisor_tag`
307+
rejects cross-hypervisor loads because segment register hidden-cache
308+
fields differ between KVM, MSHV, and WHP. Could potentially be
309+
relaxed in the future (needs sregs normalization and maybe more).
310+
- **Huge page support**: The 4 KB header is sufficient for transparent
311+
huge pages via `madvise(MADV_HUGEPAGE)`. Explicit `MAP_HUGETLB`
312+
would require a 2 MB-aligned blob offset; the `memory_offset` field
313+
already supports this without a format version bump.
314+
- **OCI distribution**
315+
- **Malicious header hardening**: The header is currently trusted after
316+
magic/version/arch/ABI/hypervisor validation. A crafted snapshot
317+
file could supply out-of-range layout fields (e.g. huge heap_size,
318+
memory_size larger than the file, overlapping regions) that cause
319+
excessive allocation, out-of-bounds access, or other misbehavior.
320+
The blake3 hash covers the memory blob but not the header itself.
321+
Consider: validating header fields against sane bounds, hashing the
322+
full header, and fuzzing `from_file` with arbitrary bytes.
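
As one possible shape for the header hardening item, the sketch below checks
`memory_offset` and `memory_size` against the real file length before any
mapping is attempted. Nothing here exists in the code base: `BoundsError`
and `check_blob_bounds` are hypothetical, and the rule that `memory_size`
must be a non-zero multiple of the page size is an assumption of this
sketch.

```rust
const PAGE_SIZE: u64 = 4096;

/// Hypothetical error type for rejected blob bounds.
#[derive(Debug, PartialEq)]
enum BoundsError {
    UnalignedOffset,
    OffsetInsideHeader,
    EmptyBlob,
    UnalignedSize,
    Overflow,
    BlobExceedsFile,
}

/// Check that the blob plus one trailing padding page fits inside the file.
fn check_blob_bounds(
    memory_offset: u64,
    memory_size: u64,
    file_len: u64,
) -> Result<(), BoundsError> {
    // The blob is an mmap target, so its offset must be page-aligned.
    if memory_offset % PAGE_SIZE != 0 {
        return Err(BoundsError::UnalignedOffset);
    }
    // The first page holds the header; the blob cannot start inside it.
    if memory_offset < PAGE_SIZE {
        return Err(BoundsError::OffsetInsideHeader);
    }
    if memory_size == 0 {
        return Err(BoundsError::EmptyBlob);
    }
    // Assumption of this sketch: the blob is a whole number of pages.
    if memory_size % PAGE_SIZE != 0 {
        return Err(BoundsError::UnalignedSize);
    }
    // Checked arithmetic so a crafted header cannot wrap the sum around.
    let blob_end = memory_offset
        .checked_add(memory_size)
        .ok_or(BoundsError::Overflow)?;
    let needed = blob_end
        .checked_add(PAGE_SIZE)
        .ok_or(BoundsError::Overflow)?;
    if needed > file_len {
        return Err(BoundsError::BlobExceedsFile);
    }
    Ok(())
}
```

A check like this only covers the blob location; the layout fields
(`heap_size`, `scratch_size`, and so on) would need their own bounds.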

src/hyperlight_common/src/mem.rs

Lines changed: 42 additions & 0 deletions
@@ -28,6 +28,23 @@ pub struct GuestMemoryRegion {
    pub ptr: u64,
}

impl GuestMemoryRegion {
    /// Size of a serialized `GuestMemoryRegion` in bytes.
    pub const SERIALIZED_SIZE: usize = core::mem::size_of::<Self>();

    /// Write this region's fields in native-endian byte order to `buf`.
    /// Returns `Ok(())` on success, or `Err` if `buf` is too small.
    pub fn write_to(&self, buf: &mut [u8]) -> Result<(), &'static str> {
        if buf.len() < Self::SERIALIZED_SIZE {
            return Err("buffer too small for GuestMemoryRegion");
        }
        let s = core::mem::size_of::<u64>();
        buf[..s].copy_from_slice(&self.size.to_ne_bytes());
        buf[s..s * 2].copy_from_slice(&self.ptr.to_ne_bytes());
        Ok(())
    }
}

/// Maximum length of a file mapping label (excluding null terminator).
pub const FILE_MAPPING_LABEL_MAX_LEN: usize = 63;

@@ -80,3 +97,28 @@ pub struct HyperlightPEB {
    #[cfg(feature = "nanvix-unstable")]
    pub file_mappings: GuestMemoryRegion,
}

impl HyperlightPEB {
    /// Write the PEB fields in native-endian byte order to `buf`.
    /// The buffer must be at least `size_of::<HyperlightPEB>()` bytes.
    /// Returns `Err` if the buffer is too small.
    pub fn write_to(&self, buf: &mut [u8]) -> Result<(), &'static str> {
        if buf.len() < core::mem::size_of::<Self>() {
            return Err("buffer too small for HyperlightPEB");
        }
        let regions = [
            &self.input_stack,
            &self.output_stack,
            &self.init_data,
            &self.guest_heap,
            #[cfg(feature = "nanvix-unstable")]
            &self.file_mappings,
        ];
        let mut offset = 0;
        for region in regions {
            region.write_to(&mut buf[offset..])?;
            offset += GuestMemoryRegion::SERIALIZED_SIZE;
        }
        Ok(())
    }
}
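
To illustrate how the two `write_to` methods compose, here is a
self-contained sketch that packs regions back to back the way
`HyperlightPEB::write_to` does and reads one back. It uses a local mirror of
`GuestMemoryRegion`; the `#[repr(C)]` attribute and the two-`u64` field
layout are assumed from the hunk above, and `read_from` and `write_regions`
are hypothetical helpers that are not part of this commit.

```rust
/// Local mirror of `GuestMemoryRegion` for this sketch (layout assumed).
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct GuestMemoryRegion {
    size: u64,
    ptr: u64,
}

impl GuestMemoryRegion {
    const SERIALIZED_SIZE: usize = std::mem::size_of::<Self>();

    /// Same logic as the method added in this commit: `size`, then `ptr`.
    fn write_to(&self, buf: &mut [u8]) -> Result<(), &'static str> {
        if buf.len() < Self::SERIALIZED_SIZE {
            return Err("buffer too small for GuestMemoryRegion");
        }
        let s = std::mem::size_of::<u64>();
        buf[..s].copy_from_slice(&self.size.to_ne_bytes());
        buf[s..s * 2].copy_from_slice(&self.ptr.to_ne_bytes());
        Ok(())
    }

    /// Hypothetical inverse of `write_to`, for illustration only.
    fn read_from(buf: &[u8]) -> Result<Self, &'static str> {
        if buf.len() < Self::SERIALIZED_SIZE {
            return Err("buffer too small for GuestMemoryRegion");
        }
        let mut size = [0u8; 8];
        let mut ptr = [0u8; 8];
        size.copy_from_slice(&buf[..8]);
        ptr.copy_from_slice(&buf[8..16]);
        Ok(Self {
            size: u64::from_ne_bytes(size),
            ptr: u64::from_ne_bytes(ptr),
        })
    }
}

/// Pack regions consecutively, one `SERIALIZED_SIZE` slot per region.
fn write_regions(regions: &[GuestMemoryRegion], buf: &mut [u8]) -> Result<(), &'static str> {
    let mut offset = 0;
    for region in regions {
        // `get_mut` avoids a panic if `offset` is already past the end.
        let dst = buf.get_mut(offset..).ok_or("offset past end of buffer")?;
        region.write_to(dst)?;
        offset += GuestMemoryRegion::SERIALIZED_SIZE;
    }
    Ok(())
}
```

Because the bytes are native-endian, a buffer written this way is only
meaningful on a machine with the same byte order as the writer.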
