Commit 8e3baf4

Cut dynamic-linker startup syscalls
The dynamic-linker bring-up storm was the largest remaining startup band after pull request #34. Adding a per-syscall histogram pointed at the sidecar walker as the dominant openat cost (61% of getent startup), the per-call path_translation_t memset as the second source, and the opened_fd_type fstat as a small but real per-open round-trip.

src/debug/syscall-hist.[ch]: opt-in histogram via ELFUSE_STARTUP_TRACE=syscalls (or =all alongside the existing step trace). Lock-free atomic counters per Linux syscall number, sorted by total ns descending in the dump. Records freeze on the first successful execve so steady-state traffic does not pollute the startup picture. Fork children disable the histogram explicitly because they resume from a parent snapshot, not a fresh bring-up.

src/syscall/sidecar.c: two caches. First, a per-directory absence cache keyed by (st_dev, st_ino, mtime, ctime) so the walker can skip the openat for .elfuse-sidecar-index when a recent fstat on the same dirfd already saw ENOENT. The mtime/ctime in the key closes ABA naturally and makes a cross-process index publish observable without explicit invalidation. Second, a cached sysroot dirfd handed out as fcntl(F_DUPFD_CLOEXEC, 0) so each translated absolute path saves the ~30 us open(sysroot) round-trip and the dup carries CLOEXEC across any racing posix_spawn.

src/syscall/path.c: drop the per-call zero-init of path_translation_t. The struct is ~12 KiB (24 metadata bytes plus three LINUX_PATH_MAX buffers) and the buffers are read-after-written by their respective resolvers, so nothing reads them before they are filled. The memset of all three was the dominant remaining cost after the sidecar caches.

src/core/elf.c: skip the redundant memset of the file-data range in elf_map_segments. The loader previously zeroed the full page-aligned segment extent before issuing fread; now only the BSS portion plus page padding (filesz to zero_len) is zeroed.

src/syscall/fs.c: skip the opened_fd_type fstat when neither O_PATH nor O_DIRECTORY is set. Dynamic-linker opens are overwhelmingly regular files where the type is already implied. The corner where a guest opens a directory without O_DIRECTORY and then issues getdents now returns ENOTDIR; glibc fdopendir has required O_DIRECTORY since 2009 and the test corpus does not exercise the corner.

src/core/startup-trace.h: env parsing extended to comma-separated tokens (steps, syscalls, all); legacy =1 keeps enabling steps only so existing scripts keep working.

Measurement: 30-run distributions under ELFUSE_STARTUP_TRACE=syscalls, warm cache:

  bench-hot-guard-glibc startup syscalls: 5.225 ms baseline (single sample) -> 1.33 ms p50 (p25 1.21, p75 1.55, stdev 0.45, n=30), 3.9x
  bench openat per-call: 135 us baseline -> 33.4 us p50 (p25 32.4, p75 35.8, stdev 7.1, n=30), 4.0x
  getent passwd root startup syscalls: 7.478 ms baseline -> 2.22 ms p50 (p25 2.10, p75 2.28, stdev 0.27, n=30), 3.4x
  getent openat per-call: 230 us baseline -> 52.9 us p50 (p25 51.5, p75 55.1, stdev 2.2, n=30), 4.3x

End-to-end wall-clock for getent: 14.6 ms p50 (p25 14.3, p75 15.1, stdev 1.18, n=30).

Bench guardrail steady-state: static getpid 74 ns, clock_gettime 6.7 ns, urandom1 153 ns; dynamic-glibc getpid 53 ns, clock_gettime 6.4 ns, urandom1 142 ns. All under ceilings.

The original baselines were single first-run samples; their variance band was not measured, so the speedup ratios are best-effort relative to the cited starting point.

Lazy FD_REGULAR to FD_DIR promotion in sys_getdents64 was attempted but dropped after both reviewers flagged a HIGH-severity ABA hole: a sibling close+reopen between the probe and the install could land the original directory's DIR* onto a fresh regular file's slot. The fix path (fd-slot generation counter or stat+inode comparison under fd_lock) was invasive enough that the lazy promotion did not pay for its complexity.
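The sidecar.c change is not part of the diff shown on this page, so the following is only a minimal sketch of how an absence cache keyed by (st_dev, st_ino, mtime, ctime) could work. The names dir_key_t, absence_note, and absence_known, the slot count, and the hash are all hypothetical, and the sketch is single-threaded. It does not claim to match the commit's data structure.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key: directory identity plus both timestamps, so any change
 * to the directory (including a newly published index) yields a new key.
 */
typedef struct {
    uint64_t dev, ino;
    int64_t mtime_ns, ctime_ns;
} dir_key_t;

#define ABSENCE_SLOTS 64 /* assumed size; must be a power of two */

static struct {
    dir_key_t key;
    bool used;
} absence_cache[ABSENCE_SLOTS];

static size_t absence_slot(const dir_key_t *k)
{
    /* Cheap mix of the identity fields; a collision simply overwrites. */
    uint64_t h = (k->dev * 0x9E3779B97F4A7C15ull) ^ k->ino;
    return (size_t) (h & (ABSENCE_SLOTS - 1));
}

static bool dir_key_equal(const dir_key_t *a, const dir_key_t *b)
{
    return a->dev == b->dev && a->ino == b->ino &&
           a->mtime_ns == b->mtime_ns && a->ctime_ns == b->ctime_ns;
}

/* Record that the index file was absent while the directory had this key. */
void absence_note(const dir_key_t *k)
{
    size_t i = absence_slot(k);
    absence_cache[i].key = *k;
    absence_cache[i].used = true;
}

/* True when the probe can be skipped: same directory, same timestamps. */
bool absence_known(const dir_key_t *k)
{
    size_t i = absence_slot(k);
    return absence_cache[i].used && dir_key_equal(&absence_cache[i].key, k);
}
```

In this sketch a caller would fill the key from an fstat on the dirfd, consult absence_known before probing for the index, and call absence_note after the probe returns ENOENT. Because the timestamps are part of the key, creating the index changes the directory's mtime and the stale entry stops matching without an explicit invalidation step.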
1 parent ed1811b commit 8e3baf4

11 files changed

Lines changed: 665 additions & 15 deletions

Makefile

Lines changed: 2 additions & 1 deletion
@@ -66,7 +66,8 @@ SRCS := \
     debug/gdbstub.c \
     debug/gdbstub-reg.c \
     debug/gdbstub-rsp.c \
-    debug/log.c
+    debug/log.c \
+    debug/syscall-hist.c

 SRCS := $(addprefix src/,$(SRCS))
 OBJS := $(patsubst src/%.c,$(BUILD_DIR)/%.o,$(SRCS))
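The Makefile hunk above adds debug/syscall-hist.c, whose source is not shown on this page. As a rough illustration of what the commit message describes (lock-free atomic counters per syscall number, a freeze switch, and a dump sorted by total ns descending), here is a minimal C11 sketch. Every identifier in it (hist_record, hist_freeze, hist_snapshot, HIST_MAX) is hypothetical and the table bound is an assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define HIST_MAX 512 /* assumed upper bound on syscall numbers */

typedef struct {
    _Atomic uint64_t calls;
    _Atomic uint64_t total_ns;
} hist_slot_t;

typedef struct {
    unsigned nr;
    uint64_t calls, total_ns;
} hist_row_t;

static hist_slot_t hist[HIST_MAX];
static atomic_bool hist_frozen;

/* Lock-free update: two relaxed fetch-adds, no mutex on the hot path. */
void hist_record(unsigned nr, uint64_t ns)
{
    if (nr >= HIST_MAX ||
        atomic_load_explicit(&hist_frozen, memory_order_relaxed))
        return;
    atomic_fetch_add_explicit(&hist[nr].calls, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&hist[nr].total_ns, ns, memory_order_relaxed);
}

/* Stop recording, e.g. once startup is considered finished. */
void hist_freeze(void)
{
    atomic_store_explicit(&hist_frozen, true, memory_order_relaxed);
}

static int hist_cmp_desc(const void *a, const void *b)
{
    const hist_row_t *x = a, *y = b;
    return (x->total_ns < y->total_ns) - (x->total_ns > y->total_ns);
}

/* Copy nonzero slots into rows[], sorted by total_ns descending. */
size_t hist_snapshot(hist_row_t *rows, size_t cap)
{
    size_t n = 0;
    for (unsigned nr = 0; nr < HIST_MAX && n < cap; nr++) {
        uint64_t calls =
            atomic_load_explicit(&hist[nr].calls, memory_order_relaxed);
        if (calls == 0)
            continue;
        rows[n].nr = nr;
        rows[n].calls = calls;
        rows[n].total_ns =
            atomic_load_explicit(&hist[nr].total_ns, memory_order_relaxed);
        n++;
    }
    qsort(rows, n, sizeof rows[0], hist_cmp_desc);
    return n;
}
```

Relaxed ordering is enough here because each counter is independent and the dump only needs eventually consistent totals. The two counters of one slot are not updated as a pair, so a snapshot taken concurrently with recording may see a call counted whose time has not been added yet.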

src/core/elf.c

Lines changed: 8 additions & 10 deletions
@@ -348,18 +348,16 @@ int elf_map_segments(const elf_info_t *info,
             return -1;
         }

-        /* Zero the full page-aligned segment extent (zero_len computed above
-         * with guest_size and infra_reserve checks). Linux guarantees
-         * zero-filled tail bytes in the last mapped page, and some dynamic
-         * linkers allocate from that page tail before they request more
-         * memory. Leaving stale bytes there leaks state across execve and
-         * corrupts the new image.
+        /* Zero only the tail beyond filesz: the BSS portion [filesz, memsz)
+         * plus the page-padding [memsz, zero_len) that Linux guarantees clean
+         * for dynamic linkers allocating from the last mapped page's tail.
+         * Skipping the file-data range avoids writing zeros that the fread
+         * below would immediately overwrite; for typical shared libraries that
+         * is a hundreds-of-KiB win per segment.
          */
-        memset((uint8_t *) guest_base + gpa, 0, zero_len);
+        if (zero_len > filesz)
+            memset((uint8_t *) guest_base + gpa + filesz, 0, zero_len - filesz);

-        /* Overlay initialized bytes after zeroing so BSS and page tail remain
-         * zero-filled.
-         */
         if (filesz > 0) {
             if (fseek(f, (long) ph->p_offset, SEEK_SET) != 0) {
                 log_error("%s: seek failed for segment at 0x%llx", path,
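To make the effect of the elf.c hunk concrete, here is a small standalone sketch of the same tail-only zeroing. The helper name segment_fill is hypothetical, and memcpy stands in for the loader's fseek/fread so the example needs no file. It assumes zero_len >= filesz, which the real loader is described as validating earlier.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: dst has room for zero_len bytes (the page-aligned
 * segment extent) and src supplies filesz initialized bytes.
 */
void segment_fill(uint8_t *dst, size_t zero_len,
                  const uint8_t *src, size_t filesz)
{
    /* Zero only [filesz, zero_len): the BSS portion plus page padding. */
    if (zero_len > filesz)
        memset(dst + filesz, 0, zero_len - filesz);

    /* [0, filesz) is fully overwritten, so it needs no pre-zeroing. */
    if (filesz > 0)
        memcpy(dst, src, filesz);
}
```

The result is byte-for-byte the same as zeroing the whole extent first, provided the copy fills all filesz bytes. With fread that condition is not automatic: a short read would leave stale bytes in the unread part of the file-data range, so this approach presumably relies on the loader treating a short read as a failed load.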

src/core/startup-trace.h

Lines changed: 36 additions & 1 deletion
@@ -9,6 +9,15 @@
  * static inline so each translation unit can use them without pulling in a
  * separate object; the getenv check resolves once per translation unit but
  * the resolution itself is idempotent.
+ *
+ * Accepted env values:
+ *   unset, "", "0"   -> all tracing off
+ *   "1", "steps"     -> per-step VM bring-up timings (this header)
+ *   "syscalls"       -> per-syscall histogram (debug/syscall-hist.c)
+ *   "all"            -> both, comma-separated tokens also accepted
+ * "1" is preserved as a legacy alias for "steps" so old scripts keep
+ * working. The histogram mode never enables the step tracer and vice
+ * versa, so a user can ask for one without paying for the other.
  */

 #ifndef ELFUSE_STARTUP_TRACE_H
@@ -30,10 +39,36 @@
 static pthread_once_t startup_trace_once = PTHREAD_ONCE_INIT;
 static bool startup_trace_value;

+static inline bool startup_trace_env_has(const char *env, const char *tok)
+{
+    if (!env || !env[0])
+        return false;
+    size_t toklen = strlen(tok);
+    const char *p = env;
+    while (*p) {
+        const char *comma = strchr(p, ',');
+        size_t len = comma ? (size_t) (comma - p) : strlen(p);
+        if (len == toklen && memcmp(p, tok, toklen) == 0)
+            return true;
+        if (!comma)
+            break;
+        p = comma + 1;
+    }
+    return false;
+}
+
 static inline void startup_trace_resolve(void)
 {
     const char *v = getenv("ELFUSE_STARTUP_TRACE");
-    startup_trace_value = v && v[0] && strcmp(v, "0") != 0;
+    if (!v || !v[0] || strcmp(v, "0") == 0)
+        return;
+    /* "1" is the historical knob: enable steps only. */
+    if (strcmp(v, "1") == 0) {
+        startup_trace_value = true;
+        return;
+    }
+    if (startup_trace_env_has(v, "steps") || startup_trace_env_has(v, "all"))
+        startup_trace_value = true;
 }

 static inline bool startup_trace_enabled(void)
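The header comment in this hunk lists the accepted values, but the diff only shows how the step tracer resolves them. The sketch below applies the same exact-token matching to both modes at once, following that table. trace_modes_t, env_has, and trace_modes_parse are hypothetical names for illustration. The histogram side's real resolver lives in debug/syscall-hist.c and is not shown here, so its behavior is assumed from the comment.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: one flag per tracing mode. */
typedef struct {
    bool steps;
    bool syscalls;
} trace_modes_t;

/* Exact-token match over a comma-separated list, mirroring the logic of
 * startup_trace_env_has in the diff.
 */
bool env_has(const char *env, const char *tok)
{
    if (!env || !env[0])
        return false;
    size_t toklen = strlen(tok);
    const char *p = env;
    while (*p) {
        const char *comma = strchr(p, ',');
        size_t len = comma ? (size_t) (comma - p) : strlen(p);
        if (len == toklen && memcmp(p, tok, toklen) == 0)
            return true;
        if (!comma)
            break;
        p = comma + 1;
    }
    return false;
}

/* Map an env value to modes per the header's table (assumed semantics). */
trace_modes_t trace_modes_parse(const char *v)
{
    trace_modes_t m = {false, false};
    if (!v || !v[0] || strcmp(v, "0") == 0)
        return m;
    if (strcmp(v, "1") == 0) { /* legacy alias: steps only */
        m.steps = true;
        return m;
    }
    bool all = env_has(v, "all");
    m.steps = all || env_has(v, "steps");
    m.syscalls = all || env_has(v, "syscalls");
    return m;
}
```

Comparing the token length before memcmp is what keeps a value such as "stepsx" from matching "steps". Under this sketch "steps,syscalls" and "all" are equivalent, while "syscalls" alone leaves the step tracer off.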
