Commit 585daca

Update docs to include some diagrams and current behaviors
Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
1 parent 34db0d8 commit 585daca

1 file changed: +141 additions, -49 deletions

docs/paging-development-notes.md

@@ -2,66 +2,153 @@
When running on a Type 1 hypervisor, servicing a Stage 2 translation
page fault is relatively expensive, since it requires a number of
context switches. To help alleviate this, Hyperlight uses a design in
which the guest is aware of a readonly snapshot from which it is being
run, and manages its own copy-on-write.

Because of this, there are two fundamental regions of the guest
physical address space, which are always populated: one, near the
bottom of memory (starting at GPA `0x1000`), is a
(hypervisor-enforced) readonly mapping of the base snapshot from which
this guest is being evolved. Another, at the top of memory, is simply
a large bag of blank pages: scratch memory into which this VM can
write.

```
Guest Physical Address Space (GPA)

+-------------------------------+ MAX_GPA
| Exn Stack, Bookkeeping        |
| (scratch size, allocator      |
|  state, reserved PT slot)     |
+-------------------------------+
| Free Scratch Memory           |
+-------------------------------+
| Output Data                   |
+-------------------------------+
| Input Data                    |
+-------------------------------+
|                               |
| (unmapped - no RAM            |
|  backing these addrs)         |
|                               |
+-------------------------------+
|                               |
| Snapshot (RO / CoW on write)  |
|  Guest Page Tables            |
|  Init Data                    |
|  Guest Heap                   |
|  PEB                          |
|  Guest Binary                 |
|                               |
+-------------------------------+ 0x1000
| (null guard page)             |
+-------------------------------+ 0x0000
```

## The scratch map

Whenever the guest needs to write to a page in the snapshot region, it
will need to copy it into a page in the scratch region, and change the
original virtual address to point to the new page.

```
CoW page fault flow:

BEFORE (guest writes to CoW page -> fault)

PTE for VA 0x5000:
+----------+-----+-----+
| GPA      | CoW | R/O |  Points to snapshot page
| 0x5000   | 1   | 1   |
+----------+-----+-----+
     |
     v
Snapshot region (readonly)
+--------------------+
| original content   |  GPA 0x5000
+--------------------+

AFTER (fault handler resolves)

1. Allocate a fresh page from scratch (bump allocator)
2. Copy snapshot page -> new scratch page
3. Update PTE to point to the scratch page

PTE for VA 0x5000:
+----------+-----+-----+
| GPA      | CoW | R/W |  Points to scratch page
| 0xf_ff.. | 0   | 1   |
+----------+-----+-----+
     |
     v
Scratch region (writable)
+--------------------+
| copied content     |  (new GPA in scratch)
+--------------------+

Snapshot page at GPA 0x5000 is untouched.
```

The page table entries to do this will likely need to be copied
themselves, and so a ready supply of already-mapped scratch pages to
use for replacement page tables is set up by the host. The guest keeps
a mapping of the entire scratch physical memory into virtual memory at
a fixed offset (`scratch_base_gva - scratch_base_gpa`), so that any
scratch physical address can be accessed by adding this offset.
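The resolution steps above can be sketched in Rust. This is a minimal illustration, not Hyperlight's actual fault handler: the constants are assumed values, and the copy and PTE rewrite are only described in comments.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative layout constants (assumed values, not Hyperlight's real ones):
const PAGE_SIZE: u64 = 0x1000;
const SCRATCH_BASE_GPA: u64 = 0x7f00_0000;
const SCRATCH_BASE_GVA: u64 = 0xffff_0000_0000;
// Fixed offset: any scratch GPA is mapped at `GPA + SCRATCH_GVA_OFFSET`.
const SCRATCH_GVA_OFFSET: u64 = SCRATCH_BASE_GVA - SCRATCH_BASE_GPA;

/// Sketch of CoW resolution: claim a fresh page from the bump allocator,
/// then locate it through the fixed-offset scratch map. Returns the
/// (new GPA, its GVA) pair; the actual copy and PTE rewrite are elided.
fn resolve_cow_fault(first_free: &AtomicU64) -> (u64, u64) {
    // 1. Allocate a fresh page from scratch (bump allocation).
    let new_gpa = first_free.fetch_add(PAGE_SIZE, Ordering::Relaxed);
    // 2. Find where that physical page is already mapped virtually.
    let new_gva = new_gpa + SCRATCH_GVA_OFFSET;
    // 3. (Elided) copy the snapshot page to `new_gva`, then rewrite the
    //    faulting PTE to point at `new_gpa`, clearing CoW and setting R/W.
    (new_gpa, new_gva)
}
```

Because the whole scratch region is pre-mapped at a fixed offset, step 2 is pure arithmetic; no page-table walk is needed to reach the freshly allocated page.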

The host and the guest need to agree on the location of this mapping,
so that (a) the host can create it when first setting up a blank guest
and (b) the host can ignore it when taking a snapshot (see below).

The host creates the scratch map at the top of virtual memory
(`MAX_GVA - scratch_size + 1`) and at the top of physical memory
(`MAX_GPA - scratch_size + 1`). In the future, we may add support for
a guest to request that it be moved.
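The placement rule is simple enough to state in code. The `MAX_GVA` and `MAX_GPA` values below are assumptions for illustration, not Hyperlight's actual address-space limits:

```rust
// Assumed (inclusive) address-space tops, for illustration only:
const MAX_GVA: u64 = 0xffff_ffff_ffff; // top of guest virtual memory
const MAX_GPA: u64 = 0x0fff_ffff_ffff; // top of guest physical memory

/// Placement rule from the text: the scratch map occupies the top
/// `scratch_size` bytes of both the virtual and physical address
/// spaces. Returns (scratch_base_gva, scratch_base_gpa).
fn scratch_bases(scratch_size: u64) -> (u64, u64) {
    (MAX_GVA - scratch_size + 1, MAX_GPA - scratch_size + 1)
}
```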

## The snapshot mapping

The snapshot page tables must be mapped at some virtual address so
that the guest can read and copy them during CoW operations. Today,
the host simply copies the page tables into scratch when restoring a
sandbox, and the guest works on those scratch copies directly.
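The restore step described above amounts to a page-granular copy. A minimal sketch, with the memory regions modeled as byte slices (hypothetical shapes, not the host's real memory types):

```rust
const PAGE_SIZE: usize = 0x1000;

/// Sketch: the host copies the snapshot's page-table pages into
/// scratch so the guest can mutate them directly. Returns the number
/// of page-table pages copied.
fn copy_page_tables_to_scratch(snapshot_pts: &[u8], scratch: &mut [u8]) -> usize {
    // Page tables are whole pages, and scratch must have room for them.
    assert!(snapshot_pts.len() % PAGE_SIZE == 0);
    assert!(scratch.len() >= snapshot_pts.len());
    scratch[..snapshot_pts.len()].copy_from_slice(snapshot_pts);
    snapshot_pts.len() / PAGE_SIZE
}
```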

## Top-of-scratch metadata layout

The top page of the scratch region contains structured metadata at
fixed offsets down from the top:

| Offset from top | Field                             |
|-----------------|-----------------------------------|
| `0x08`          | Scratch size (`u64`)              |
| `0x10`          | Allocator state (`u64`)           |
| `0x18`          | Reserved snapshot PT base (`u64`) |
| `0x20`          | Exception stack starts here       |

These offsets are defined as `SCRATCH_TOP_*` constants in
`hyperlight_common::layout`.
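The table translates directly into offset arithmetic. The constant names below are modeled on, but not copied from, the real `SCRATCH_TOP_*` constants in `hyperlight_common::layout`:

```rust
// Offsets down from the top of scratch, matching the table above.
// (Illustrative names; the real constants live in
// `hyperlight_common::layout`.)
const TOP_SCRATCH_SIZE: u64 = 0x08; // scratch size (u64)
const TOP_ALLOC_STATE: u64 = 0x10; // allocator state (u64)
const TOP_RESERVED_PT: u64 = 0x18; // reserved snapshot PT base (u64)
const TOP_EXN_STACK: u64 = 0x20; // exception stack starts here

/// Address of a metadata field, counting down from the (exclusive)
/// top of the scratch region.
fn field_addr(scratch_top: u64, offset_from_top: u64) -> u64 {
    scratch_top - offset_from_top
}
```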

## The physical page allocator

The host needs to be able to reset the state of the physical page
allocator when resuming from a snapshot. We use a simple bump
allocator as a physical page allocator, with no support for free,
since pages not in use will automatically be omitted from a snapshot.
The allocator state is a single `u64` tracking the address of the
first free page, located at offset `0x10` from the top of scratch
(see layout above). The guest advances it atomically via `lock xadd`.
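In Rust, the `lock xadd` step corresponds to an atomic `fetch_add`. A sketch of the allocation step (the direction of growth is an assumption here):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const PAGE_SIZE: u64 = 0x1000;

/// One bump-allocation step: atomically advance the "first free page"
/// word and return the page that was claimed. On x86-64, `fetch_add`
/// on an `AtomicU64` compiles down to the `lock xadd` mentioned above.
fn alloc_page(first_free: &AtomicU64) -> u64 {
    first_free.fetch_add(PAGE_SIZE, Ordering::Relaxed)
}
```

Resetting the allocator on restore is then just rewriting this single `u64` at its fixed location.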

## The guest exception stack

Similarly, the guest needs a stack that is always writable, in order
to be able to take exceptions to it. The exception stack begins at
offset `0x20` from the top of the scratch region (below the metadata
fields described above) and grows downward through the remainder of
the top page.

## Taking a snapshot

@@ -90,20 +177,26 @@
calls, i.e. there may be no calls in flight at the time of
snapshotting. This is not enforced, but odd things may happen if it is
violated.

I/O buffers are statically allocated at the bottom of the scratch
region:

```
+-------------------------------------------+ (top of scratch)
| Exn Stack, Bookkeeping                    |
| (scratch size, allocator state,           |
|  reserved PT base)                        |
+-------------------------------------------+
| Free Scratch Memory                       |
+-------------------------------------------+
| Output Data                               |
+-------------------------------------------+
| Input Data                                |
+-------------------------------------------+ (scratch base)
```

The minimum scratch size (`min_scratch_size()`) accounts for these
buffers plus overhead for the Task State Segment (TSS), the Interrupt
Descriptor Table (IDT), page-table CoW, a minimal non-exception
stack, and the exception stack and metadata.
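The sizing arithmetic can be sketched as below. The one-page-per-component sizes are assumptions for illustration; the real `min_scratch_size()` may weight these components differently:

```rust
const PAGE_SIZE: usize = 0x1000;

/// Illustrative lower bound on scratch size: the statically allocated
/// I/O buffers, plus one page each (an assumed size) for the TSS, the
/// IDT, page-table CoW headroom, and a minimal non-exception stack,
/// plus the top page holding the exception stack and metadata.
fn min_scratch_size(input_buf: usize, output_buf: usize) -> usize {
    let overhead_pages = 5; // TSS + IDT + PT CoW + minimal stack + top page
    input_buf + output_buf + overhead_pages * PAGE_SIZE
}
```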

## Creating a fresh guest

@@ -113,11 +206,10 @@
which simply map the segments of that ELF to the appropriate places in
virtual memory. If the ELF has segments whose virtual addresses
overlap with the scratch map, an error will be returned.

In the current startup path, the host enters the guest with `RSP`
pointing to the exception stack. Early guest init then allocates the
main stack at `MAIN_STACK_TOP_GVA`, switches to it, and continues
generic initialization.

# Architecture-specific details of virtual memory setup
