Commit 7642bee

Speed up identity, urandom, and clock_gettime
This introduces an EL1-only shim_data block holding a host-published cache: identity slots (pid/ppid/uid/euid/gid/egid/tid), urandom-eligible fd bitmap, a 4 KiB urandom ring with head/tail/lock, and a 32-bit attention bitmask. The EL1 shim assembly serves identity and urandom 1-byte reads inline without trapping to the host; the existing HVC #5 forwarder is taken only when attention is raised, when a non-urandom fd is consulted, or when the ring needs a host-side refill.

Measured at 1 M iterations under the new tests/bench-hot-syscalls.c:

  getpid/getppid/getuid/geteuid/getgid/getegid/gettid : 47 ns/op
  clock_gettime via __kernel_clock_gettime vDSO       : 3.7 ns/op
  read(/dev/urandom, 1 byte)                          : 134 ns/op
  clock_gettime via SVC fallback                      : 2056 ns/op

The vDSO clock_gettime trampoline now seeds CLOCK_{MONOTONIC,REALTIME} anchors back-to-back from a single SVC fallback, so the fast path serves either clockid after one warm-up call. The X9/ELR_EL1 gate runs before the host wall-clock samples so the anchor inherits no positive bias from the seeding round trip.

Integrity boundary around the new cache:

- The shim_data block is mapped MEM_PERM_RW_EL1_ONLY (AP[2:1]=00) by both bootstrap and execve so EL0 cannot read or store the bytes directly. /proc/self/maps reports PROT_NONE for [shim-data] to match what guest dereferences would observe.
- gva_translate_perm refuses MEM_PERM_EL1_ONLY descriptors on guest-behalf access in both the L2 block and L3 page walk paths. read(fd, shim_data_gva, n) now returns EFAULT instead of letting the host spoof the cache.
- elf_map_segments takes an explicit infra reserve range and rejects PT_PHDR copies or PT_LOAD segments whose page-aligned write extent intersects it, closing a host-side overwrite path through the ELF loader that bypassed page-table permissions.
- A new EL1 data-abort recover handler in shim.S catches strb faults inside named urandom write ranges (caused by a racing EL0 munmap or mprotect), drops the inner exception frame, releases the ring lock, and returns EFAULT to EL0.

Cred publish is bracketed so concurrent fast-path readers see a consistent snapshot. The attention word splits into ATTN_BIT_SIGTIMER (0x1), ATTN_BIT_CRED (0x2), and ATTN_BIT_TRACE (0x4). CRED_BRACKETED ORs the CRED bit, runs the setuid/setgid mutator, publishes the four cred slots, then ANDs the CRED bit off. shim_globals_attn_or uses __ATOMIC_SEQ_CST so the mutator's publish stores cannot become globally visible before the attention bit on weakly-ordered ARM64; the AND clear stays __ATOMIC_RELEASE because release pairs with the shim LDAR for the publish-then-clear order. vdso_attention_or mirrors the same ordering.

Signal and itimer paths support the lane discipline:

- attention_guest is now _Atomic so signal_init's NULL clear during the execve reset window pairs with attention_raise's acquire load on any sibling thread.
- signal_set_itimer writes expiry and interval before the release store of .active, matching the field order already used by the virt and prof setters. Consumers that ACQUIRE-load .active without holding sig_lock now never observe armed=true with stale fields.
- New signal_attention_needed() OR-reads the three guest itimer .active fields plus an unblocked-deliverable signal hint so the HVC epilogue's recompute decides accurately whether the next call may stay on the fast path.

The fd-table publication paths that feed the urandom bitmap are serialized so a pathological sibling close+reopen on the same guest fd cannot make the EL1 fast path consult a stale bit:

- fd_refresh_urandom_bitmap snapshots (type, linux_flags) AND publishes the bitmap bit inside the same fd_lock critical section.
- fd_alloc_opened_host and duplicate_guest_fd install linux_flags, dir, seals, and the urandom bit only after re-acquiring fd_lock and confirming the slot's (type, host_fd) tuple still matches the just-allocated values. On mismatch (the slot was reallocated by a sibling) the install is skipped and any cloned DIR* is closed to avoid a leak.
- The host-side urandom cache replaces its single global mutex with a per-fd lock embedded in urandom_cache_t, initialized by io_init() from syscall_init. Concurrent urandom reads on different fds no longer serialize on one mutex.
- sys_readv on /dev/urandom now triggers shim_globals_refill_urandom_ring on the slow path, matching sys_read so readv consumers do not leave the shim ring drained.
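
The CRED bracket described in the message can be sketched with C11 atomics. This is only an illustration under assumptions: the struct layout and the names cred_cache_t, cred_cache_init, cred_publish_bracketed, and fast_path_getuid are hypothetical, and the real reader is EL1 assembly using LDAR rather than C. Only the three bit values and the OR (seq_cst) / publish / AND (release) order come from the message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values taken from the commit message. */
#define ATTN_BIT_SIGTIMER 0x1u
#define ATTN_BIT_CRED     0x2u
#define ATTN_BIT_TRACE    0x4u

/* Hypothetical layout: one attention word plus the four cred slots. */
typedef struct {
    _Atomic uint32_t attention;
    _Atomic uint32_t uid, euid, gid, egid;
} cred_cache_t;

static void cred_cache_init(cred_cache_t *c)
{
    assert(c);
    atomic_init(&c->attention, 0);
    atomic_init(&c->uid, 0);
    atomic_init(&c->euid, 0);
    atomic_init(&c->gid, 0);
    atomic_init(&c->egid, 0);
}

/* Writer side: raise CRED, publish the four slots, then clear CRED. */
static void cred_publish_bracketed(cred_cache_t *c, uint32_t uid,
                                   uint32_t euid, uint32_t gid, uint32_t egid)
{
    /* seq_cst OR, mirroring the commit: the slot stores below must not
     * become visible before the attention bit. */
    atomic_fetch_or_explicit(&c->attention, ATTN_BIT_CRED,
                             memory_order_seq_cst);
    atomic_store_explicit(&c->uid, uid, memory_order_relaxed);
    atomic_store_explicit(&c->euid, euid, memory_order_relaxed);
    atomic_store_explicit(&c->gid, gid, memory_order_relaxed);
    atomic_store_explicit(&c->egid, egid, memory_order_relaxed);
    /* Release AND: a reader whose acquire load observes this cleared
     * word also observes the slot stores above. */
    atomic_fetch_and_explicit(&c->attention, ~ATTN_BIT_CRED,
                              memory_order_release);
}

/* Reader side: any attention bit forces the slow path (returns false). */
static bool fast_path_getuid(cred_cache_t *c, uint32_t *uid_out)
{
    if (atomic_load_explicit(&c->attention, memory_order_acquire) != 0)
        return false;
    *uid_out = atomic_load_explicit(&c->uid, memory_order_relaxed);
    return true;
}
```

A caller would try fast_path_getuid first and fall back to the trapping path whenever it returns false, which is the role the message assigns to the HVC #5 forwarder.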
1 parent a24fc53 commit 7642bee

41 files changed

Lines changed: 3503 additions & 302 deletions

Makefile

Lines changed: 19 additions & 0 deletions
@@ -23,6 +23,7 @@ SRCS := \
     core/elf.c \
     core/stack.c \
     core/vdso.c \
+    core/shim-globals.c \
     core/bootstrap.c \
     core/rosetta.c \
     core/sysroot.c \
@@ -160,6 +161,24 @@ $(BUILD_DIR)/test-pthread: tests/test-pthread.c | $(BUILD_DIR)
 	@echo " CROSS $< (with -lpthread)"
 	$(Q)$(CROSS_COMPILE)gcc -D_GNU_SOURCE -static -O2 -o $@ $< -lpthread

+# test-shim-cred-race spawns a pthread reader while the main thread
+# toggles setresuid; the reader spins on the identity fast path.
+$(BUILD_DIR)/test-shim-cred-race: tests/test-shim-cred-race.c | $(BUILD_DIR)
+	@echo " CROSS $< (with -lpthread)"
+	$(Q)$(CROSS_COMPILE)gcc -D_GNU_SOURCE -static -O2 -o $@ $< -lpthread
+
+# test-shim-urandom-smp spawns N pthreads racing on a shared FD_URANDOM
+# slot to exercise the shim's LDXR/STXR head-advance under contention.
+$(BUILD_DIR)/test-shim-urandom-smp: tests/test-shim-urandom-smp.c | $(BUILD_DIR)
+	@echo " CROSS $< (with -lpthread)"
+	$(Q)$(CROSS_COMPILE)gcc -D_GNU_SOURCE -static -O2 -o $@ $< -lpthread
+
+# test-shim-urandom-toctou races mprotect(PROT_NONE) against urandom
+# reads to exercise the EL1 data abort recovery path. Needs pthreads.
+$(BUILD_DIR)/test-shim-urandom-toctou: tests/test-shim-urandom-toctou.c | $(BUILD_DIR)
+	@echo " CROSS $< (with -lpthread)"
+	$(Q)$(CROSS_COMPILE)gcc -D_GNU_SOURCE -static -O2 -o $@ $< -lpthread
+
 # test-fuse-basic runs a guest daemon thread and consumer in one process
 $(BUILD_DIR)/test-fuse-basic: tests/test-fuse-basic.c | $(BUILD_DIR)
 	@echo " CROSS $< (with -lpthread)"

src/core/bootstrap.c

Lines changed: 88 additions & 27 deletions
@@ -20,6 +20,7 @@

 #include "core/bootstrap.h"
 #include "core/rosetta.h"
+#include "core/shim-globals.h"
 #include "core/stack.h"
 #include "core/startup-trace.h"
 #include "core/vdso.h"
@@ -31,6 +32,7 @@
 #include "syscall/internal.h"
 #include "syscall/path.h"
 #include "syscall/proc.h"
+#include "syscall/signal.h"

 #include "debug/log.h"

@@ -95,20 +97,25 @@ static void register_elf_segment_regions(guest_t *g,
     }
 }

-/* Publish shim, shim-data, heap, stack-guard, and stack regions to the
+/* Publish shim, shim-data, heap, stack-guard, and stack regions to
  * /proc/self/maps view, and invalidate the null page and stack-guard PTEs.
- * Shared by guest_bootstrap_prepare and guest_bootstrap_rosetta_post_reset;
- * the caller registers ELF or rosetta segments separately because those
- * differ between aarch64 and rosetta guests.
+ * Shared by guest_bootstrap_prepare and guest_bootstrap_rosetta_post_reset; the
+ * caller registers ELF or rosetta segments separately because those differ
+ * between aarch64 and rosetta guests.
  */
 static void register_runtime_regions(guest_t *g, size_t shim_bin_len)
 {
     guest_region_add(g, g->shim_base, g->shim_base + shim_bin_len,
                      LINUX_PROT_READ | LINUX_PROT_EXEC, LINUX_MAP_PRIVATE, 0,
                      "[shim]");
+    /* shim_data is mapped privileged-only (AP[2:1]=00) in the page tables; the
+     * EL1 shim has full RW but EL0 cannot read or write. Report PROT_NONE in
+     * /proc/self/maps so guest tooling treats it as inaccessible, matching what
+     * dereferencing the GVA actually does (translation fault -> EL0 SIGSEGV
+     * path).
+     */
     guest_region_add(g, g->shim_data_base, g->shim_data_base + BLOCK_2MIB,
-                     LINUX_PROT_READ | LINUX_PROT_WRITE, LINUX_MAP_PRIVATE, 0,
-                     "[shim-data]");
+                     LINUX_PROT_NONE, LINUX_MAP_PRIVATE, 0, "[shim-data]");

     if (g->brk_base < g->brk_current) {
         guest_region_add(g, g->brk_base, g->brk_current,
@@ -247,8 +254,11 @@ static bool load_interpreter(guest_t *g,
     }

     boot->interp_base = g->interp_base;
+    uint64_t infra_lo = g->interp_base - INFRA_RESERVE;
+    uint64_t infra_hi = g->interp_base;
     if (elf_map_segments(&boot->interp_info, boot->interp_resolved,
-                         g->host_base, g->guest_size, boot->interp_base) < 0) {
+                         g->host_base, g->guest_size, boot->interp_base,
+                         infra_lo, infra_hi) < 0) {
         log_error("failed to map interpreter segments");
         if (interp_host_temp)
             unlink(boot->interp_resolved);
@@ -278,20 +288,28 @@ static bool build_boot_regions(mem_region_t *regions,
      */
     if (!append_boot_region(regions, nregions, g->shim_base,
                             g->shim_base + shim_bin_len, MEM_PERM_RX) ||
+        /* shim_data is EL1-only: the guest must not directly read or write the
+         * identity cache, attention flag, urandom bitmap, or ring, any of which
+         * would let it spoof its own syscall results. The EL1 shim itself has
+         * full RW. /proc/self/maps still lists [shim-data] (region tracking is
+         * independent of EL0 access), but EL0 dereferences fault to the SIGSEGV
+         * path.
+         */
         !append_boot_region(regions, nregions, g->shim_data_base,
-                            g->shim_data_base + BLOCK_2MIB, MEM_PERM_RW) ||
+                            g->shim_data_base + BLOCK_2MIB,
+                            MEM_PERM_RW_EL1_ONLY) ||
         !append_boot_region(regions, nregions, VDSO_BASE, VDSO_BASE + VDSO_SIZE,
                             MEM_PERM_RX)) {
         return false;
     }

-    /* Rosetta guests never load the x86_64 ELF or its interpreter into
-     * guest memory; rosetta itself reads the target via fd 3 once it is
-     * running. Adding those segments to the page-table builder would emit
-     * ghost L2/L3 entries at the binary's x86_64 link address (typically
-     * 0x400000) pointing into uninitialized primary-buffer GPAs. The
-     * rosetta image's own segments are registered by rosetta_prepare's
-     * separate region append in the bootstrap caller.
+    /* Rosetta guests never load the x86_64 ELF or its interpreter into guest
+     * memory; rosetta itself reads the target via fd 3 once it is running.
+     * Adding those segments to the page-table builder would emit ghost L2/L3
+     * entries at the binary's x86_64 link address (typically 0x400000) pointing
+     * into uninitialized primary-buffer GPAs. The rosetta image's own segments
+     * are registered by rosetta_prepare's separate region append in the
+     * bootstrap caller.
      */
     if (!g->is_rosetta) {
         if (!append_elf_segment_regions(regions, nregions, &boot->elf_info,
@@ -370,12 +388,12 @@ int guest_bootstrap_prepare(guest_t *g,
               (unsigned long long) boot->elf_info.load_max,
               want_rosetta ? "x86_64-via-rosetta" : "aarch64");

-    /* Rosetta is statically linked at 0x800000000000 (128 TiB), beyond the
-     * 36 and 40-bit IPA ranges. Request 48-bit IPA up-front so the
-     * page-table builder can reach the rosetta segments. HVF clamps to its
-     * supported size; on M1 hosts the upstream hyper-linux audit confirms
-     * 48 is honoured even though the auto-detect default returns 36, so
-     * the request is non-fatal in either direction.
+    /* Rosetta is statically linked at 0x800000000000 (128 TiB), beyond the 36
+     * and 40-bit IPA ranges. Request 48-bit IPA up-front so the page-table
+     * builder can reach the rosetta segments. HVF clamps to its supported size;
+     * on M1 hosts the upstream hyper-linux audit confirms 48 is honoured even
+     * though the auto-detect default returns 36, so the request is non-fatal in
+     * either direction.
      */
     uint32_t req_ipa = want_rosetta ? 48 : 0;
     t0 = startup_trace_now_ns();
@@ -397,8 +415,8 @@ int guest_bootstrap_prepare(guest_t *g,
     if (want_rosetta) {
         /* Rosetta path: no x86_64 ELF segments are loaded into guest memory
          * (rosetta itself does that lazily once it starts running). brk and
-         * stack use the same defaults the aarch64 path falls back to when
-         * the binary sits at low VAs; the x86_64 binary's load_max would be
+         * stack use the same defaults the aarch64 path falls back to when the
+         * binary sits at low VAs; the x86_64 binary's load_max would be
          * meaningless here because nothing of it actually lives in primary
          * buffer GPA space.
          */
@@ -412,8 +430,11 @@ int guest_bootstrap_prepare(guest_t *g,
         boot->elf_load_base =
             (boot->elf_info.e_type == ET_DYN) ? PIE_LOAD_BASE : 0;
         t0 = startup_trace_now_ns();
+        uint64_t infra_lo = g->interp_base - INFRA_RESERVE;
+        uint64_t infra_hi = g->interp_base;
         if (elf_map_segments(&boot->elf_info, elf_host_path, g->host_base,
-                             g->guest_size, boot->elf_load_base) < 0) {
+                             g->guest_size, boot->elf_load_base, infra_lo,
+                             infra_hi) < 0) {
             log_error("failed to map ELF segments");
             return -1;
         }
@@ -664,9 +685,49 @@ int guest_bootstrap_create_vcpu(guest_t *g,
     HV_CHECK(hv_vcpu_set_sys_reg(vcpu, HV_SYS_REG_SP_EL0, sp_ipa));
     HV_CHECK(hv_vcpu_set_sys_reg(vcpu, HV_SYS_REG_SP_EL1, el1_sp));

-    /* CNTKCTL_EL1.EL0VCTEN | EL0PCTEN: allow EL0 to read CNTVCT_EL0 /
-     * CNTPCT_EL0. Required by the vDSO clock_gettime fast path (and is the
-     * default on native Linux), without which the guest gets 0 back from MRS.
+    /* Round-trip a sentinel through TPIDR_EL1 before installing the real
+     * value. Validates only the hv_vcpu_{set,get}_sys_reg pre-run round
+     * trip, not preservation across hv_vcpu_run -- the test-shim-identity
+     * microbench is the end-to-end check for that.
+     */
+    if (shim_globals_self_test(vcpu) < 0)
+        return -1;
+    /* TPIDR_EL1 -> shim_globals base, CONTEXTIDR_EL1 -> tid (== pid for the
+     * initial main thread). gettid fast path reads CONTEXTIDR_EL1 directly.
+     */
+    if (shim_globals_install_per_vcpu(vcpu, g, proc_get_pid()) < 0)
+        return -1;
+
+    /* Zero the shim-globals region and publish the initial identity so the very
+     * first getpid / getuid / etc. SVC #0 hits the cache instead of returning
+     * the all-zero seed. Future setuid/setgid paths refresh creds via
+     * cred_publish_after; fork-child has its own publish on the inherited
+     * identity.
+     */
+    shim_globals_init(g);
+    shim_globals_set_trace_enabled(g, verbose);
+    shim_globals_publish_pid(g, proc_get_pid(), proc_get_ppid());
+    shim_globals_publish_creds(g, proc_get_uid(), proc_get_euid(),
+                               proc_get_gid(), proc_get_egid());
+    /* Pre-fill the entropy ring so the first read(/dev/urandom) from the guest
+     * is served by the shim fast path with no cold-start HVC for refill.
+     */
+    shim_globals_refill_urandom_ring(g);
+    /* Register the singleton guest pointer so signal_queue and the itimer
+     * setters can raise the attention flag without threading g through every
+     * call site. signal_init clears this defensively; the first registration
+     * must run after both proc_init and shim_globals_init.
+     */
+    signal_set_shim_globals_guest(g);
+    /* Same singleton pattern but for the fd-table hooks that update the urandom
+     * bitmap. Must run before any FD_URANDOM-typed slot is allocated; bootstrap
+     * finishes before any guest syscall runs.
+     */
+    shim_globals_set_singleton(g);
+
+    /* CNTKCTL_EL1.EL0VCTEN | EL0PCTEN: allow EL0 to read {CNTVCT,CNTPCT}_EL0.
+     * Required by the vDSO clock_gettime fast path (and is the default on
+     * native Linux), without which the guest gets 0 back from MRS.
      */
     HV_CHECK(hv_vcpu_set_sys_reg(vcpu, HV_SYS_REG_CNTKCTL_EL1, 0x3ULL));
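
The bootstrap hunk pre-fills the entropy ring so the first 1-byte urandom read stays on the fast path. A minimal sketch of such a ring follows, assuming a spinlock-guarded head/tail pair. Only the 4 KiB size and the head/tail/lock fields come from the commit message; the type urandom_ring_t and the helpers ring_init, ring_refill, and ring_pop_byte are hypothetical, and the real consumer is EL1 assembly, not C.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4096u /* 4 KiB, as stated in the commit message */

/* Hypothetical layout; head and tail are free-running byte counters. */
typedef struct {
    atomic_flag lock; /* spinlock guarding head and tail */
    uint32_t head;    /* bytes consumed so far */
    uint32_t tail;    /* bytes produced so far */
    uint8_t bytes[RING_SIZE];
} urandom_ring_t;

static void ring_init(urandom_ring_t *r)
{
    assert(r);
    atomic_flag_clear(&r->lock);
    r->head = 0;
    r->tail = 0;
}

static void ring_lock(urandom_ring_t *r)
{
    while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
        ; /* spin until the holder releases */
}

static void ring_unlock(urandom_ring_t *r)
{
    atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Host side: copy in up to n bytes, limited by the free space. */
static size_t ring_refill(urandom_ring_t *r, const uint8_t *src, size_t n)
{
    ring_lock(r);
    size_t room = RING_SIZE - (r->tail - r->head);
    if (n > room)
        n = room;
    for (size_t i = 0; i < n; i++)
        r->bytes[(r->tail + i) % RING_SIZE] = src[i];
    r->tail += (uint32_t) n;
    ring_unlock(r);
    return n;
}

/* Fast path: pop one byte. A false return means the ring is drained and
 * the caller has to take the slow path and ask for a refill. */
static bool ring_pop_byte(urandom_ring_t *r, uint8_t *out)
{
    assert(out);
    ring_lock(r);
    bool ok = r->head != r->tail;
    if (ok)
        *out = r->bytes[r->head++ % RING_SIZE];
    ring_unlock(r);
    return ok;
}
```

Free-running counters keep the full/empty distinction simple: the ring is empty when head equals tail and full when their difference equals RING_SIZE.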

src/core/elf.c

Lines changed: 44 additions & 6 deletions
@@ -208,8 +208,16 @@ int elf_map_segments(const elf_info_t *info,
                      const char *path,
                      void *guest_base,
                      uint64_t guest_size,
-                     uint64_t load_base)
+                     uint64_t load_base,
+                     uint64_t infra_lo,
+                     uint64_t infra_hi)
 {
+    /* Half-open intersection test for [a, a+alen) and [b, b+blen). When
+     * infra_lo == infra_hi the caller opted out (early bring-up before
+     * guest_t is wired up); the host-side writes that follow still get
+     * the existing guest_size bound check.
+     */
+    bool infra_active = infra_lo < infra_hi;
     FILE *f = fopen(path, "rb");
     if (!f) {
         perror(path);
@@ -264,6 +272,17 @@ int elf_map_segments(const elf_info_t *info,
         fclose(f);
         return -1;
     }
+    if (infra_active && phdr_dest < infra_hi &&
+        phdr_dest + ph_total > infra_lo) {
+        log_error(
+            "%s: program headers at 0x%llx overlap infra reserve "
+            "[0x%llx, 0x%llx)",
+            path, (unsigned long long) phdr_dest, (unsigned long long) infra_lo,
+            (unsigned long long) infra_hi);
+        free(ph_buf);
+        fclose(f);
+        return -1;
+    }
     memcpy((uint8_t *) guest_base + phdr_dest, ph_buf, ph_total);

     /* Copy PT_LOAD contents after AT_PHDR is in place; ET_DYN segments are
@@ -308,15 +327,34 @@ int elf_map_segments(const elf_info_t *info,
             return -1;
         }

-        /* Zero the full page-aligned segment extent, not only p_memsz.
-         * Linux guarantees zero-filled tail bytes in the last mapped page,
-         * and some dynamic linkers allocate from that page tail before they
-         * request more memory. Leaving stale bytes there leaks state across
-         * execve and corrupts the new image.
+        /* The host memset zeros PAGE_ALIGN_UP(memsz) bytes, not just memsz,
+         * so the infra-overlap check has to use the same rounded extent.
+         * Without the rounding here, a segment that ends just below
+         * infra_lo passes the check and still spills up to PAGE_SIZE-1
+         * bytes of zero into the infra reserve via the page tail.
          */
         uint64_t zero_len = PAGE_ALIGN_UP(memsz);
         if (gpa + zero_len > guest_size)
             zero_len = guest_size - gpa;
+        if (infra_active && gpa < infra_hi && gpa + zero_len > infra_lo) {
+            log_error(
+                "%s: segment at 0x%llx+0x%llx (zero-extent 0x%llx) overlaps "
+                "infra reserve [0x%llx, 0x%llx)",
+                path, (unsigned long long) gpa, (unsigned long long) memsz,
+                (unsigned long long) zero_len, (unsigned long long) infra_lo,
+                (unsigned long long) infra_hi);
+            free(ph_buf);
+            fclose(f);
+            return -1;
+        }
+
+        /* Zero the full page-aligned segment extent (zero_len computed above
+         * with guest_size and infra_reserve checks). Linux guarantees
+         * zero-filled tail bytes in the last mapped page, and some dynamic
+         * linkers allocate from that page tail before they request more
+         * memory. Leaving stale bytes there leaks state across execve and
+         * corrupts the new image.
+         */
         memset((uint8_t *) guest_base + gpa, 0, zero_len);

         /* Overlay initialized bytes after zeroing so BSS and page tail remain
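
The reason the overlap check uses the rounded extent can be shown in isolation. The sketch below is a standalone illustration, not the loader's code: SKETCH_PAGE_SIZE (assumed 4 KiB), this PAGE_ALIGN_UP definition, and the helpers ranges_intersect and segment_hits_infra are all assumptions made for the example. It mirrors the clamp to guest_size and the half-open test from the hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumed page size for this sketch */
#define PAGE_ALIGN_UP(x) \
    (((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

/* Half-open test: does [a_lo, a_hi) intersect [b_lo, b_hi)? */
static bool ranges_intersect(uint64_t a_lo, uint64_t a_hi,
                             uint64_t b_lo, uint64_t b_hi)
{
    return a_lo < b_hi && a_hi > b_lo;
}

/* True when the zeroed extent of a segment would touch the reserve.
 * An empty reserve (infra_lo >= infra_hi) means the caller opted out. */
static bool segment_hits_infra(uint64_t gpa, uint64_t memsz,
                               uint64_t guest_size,
                               uint64_t infra_lo, uint64_t infra_hi)
{
    assert(gpa <= guest_size); /* precondition of the clamp below */
    if (infra_lo >= infra_hi)
        return false;
    uint64_t zero_len = PAGE_ALIGN_UP(memsz);
    if (gpa + zero_len > guest_size)
        zero_len = guest_size - gpa;
    return ranges_intersect(gpa, gpa + zero_len, infra_lo, infra_hi);
}
```

For example, a segment at 0x1800 with memsz 0x7f0 ends at 0x1ff0, just below a reserve starting at 0x2000, so a check on memsz alone passes. Its rounded zero extent is 0x1000 bytes and reaches 0x2800, which the rounded check rejects.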

src/core/elf.h

Lines changed: 8 additions & 1 deletion
@@ -109,13 +109,20 @@ int elf_load(const char *path, elf_info_t *info);
  * Also copies program headers into guest memory for AT_PHDR.
  * load_base is added to all virtual addresses (0 for ET_EXEC at link addr,
  * non-zero for ET_DYN loaded at a chosen base).
+ * infra_lo and infra_hi delimit the runtime infra reserve (page-table pool,
+ * shim text, shim_data, vDSO). Any PT_LOAD or PT_PHDR copy whose destination
+ * intersects [infra_lo, infra_hi) is rejected: those writes go through
+ * host_base directly and would otherwise bypass the EL1-only page-table
+ * protection on shim_data. Pass 0,0 only when the guest_t is not yet built.
  * Returns 0 on success, -1 on failure.
  */
 int elf_map_segments(const elf_info_t *info,
                      const char *path,
                      void *guest_base,
                      uint64_t guest_size,
-                     uint64_t load_base);
+                     uint64_t load_base,
+                     uint64_t infra_lo,
+                     uint64_t infra_hi);

 /* Resolve a PT_INTERP path against a sysroot directory.
  * Tries three strategies:
