This document describes the internal design of kbox: how syscalls are routed through the dispatch engine, the key subsystems that back the guest's view of the kernel, and the ABI translation layer that bridges LKL's asm-generic headers with native host structures.
For a high-level overview of the three interception tiers and the project's goals, see the top-level README.
Every intercepted syscall is dispatched to one of three dispositions:
- LKL forward (~100 handlers, of which 74 live in
DISPATCH_FORWARD_TABLEand the rest are inline cases inkbox_dispatch_requestfor legacy stat variants, identity, execve, and similar): filesystem operations (open, read, write, stat, getdents, mkdir, unlink, rename), metadata (chmod, chown, utimensat), identity (getuid, setuid, getgroups), and networking (socket, connect). In seccomp mode, the supervisor reads arguments from tracee memory viaprocess_vm_readvand writes results viaprocess_vm_writev. In trap/rewrite mode, guest memory is accessed directly viamemcpy(same address space) withsigsetjmp-based fault recovery that returns-EFAULTfor unmapped pointers. An FD-local stat cache (16 entries, round-robin) avoids repeated LKL inode lookups for fstat. - Host CONTINUE (~50 entries): scheduling (sched_yield,
sched_setscheduler), signals (rt_sigaction, kill, tgkill), memory
management (mprotect, brk, munmap, mremap, madvise), I/O
multiplexing (poll, ppoll, pselect6), threading (futex, clone,
set_tid_address, rseq), time (nanosleep, clock_nanosleep), and more.
In seccomp mode, the kernel replays the syscall. In trap/rewrite
mode, the guest-thread local fast-path
(
kbox_dispatch_try_local_fast_pathinsrc/seccomp-dispatch.c) returns CONTINUE directly without touching the service thread for about 40 entries: brk, futex, rseq, set_tid_address, set_robust_list, munmap, mremap, membarrier, madvise, wait4, waitid, exit, exit_group, rt_sigreturn, rt_sigaltstack, setitimer/getitimer, setpgid/getpgid, getsid/setsid, fork, vfork, the full sched_* family, getrlimit, getrusage, ppoll, pselect6, poll, nanosleep, clock_nanosleep, statfs, and sysinfo.mmap,epoll_*, and other syscalls that need W^X enforcement, shadow validation, or FD-table gating are excluded from the fast-path and go through full dispatch (forward_mmap,forward_fd_gated_continue). - Emulated: process identity (getpid returns 1, gettid returns 1),
uname (synthetic LKL values), getrandom (LKL
/dev/urandom), clock_gettime/gettimeofday (host clock, direct passthrough for latency).
All three tiers share the same dispatch engine (kbox_dispatch_request).
The kbox_syscall_request abstraction decouples the dispatch logic
from the notification transport: seccomp notifications, SIGSYS signal
info, and rewrite trampoline calls all produce the same request struct.
Unknown syscalls receive ENOSYS. Over 50 dangerous syscalls (mount,
reboot, init_module, bpf, ptrace, etc.) are rejected with EPERM
directly in the BPF filter before reaching the supervisor.
Maintains a mapping from guest FD numbers to LKL-internal FDs. Three
ranges back the table: low FDs (0..1023) populated by dup2/dup3 and
stdio-compatible redirection, mid FDs (1024..32767) for tracked
host-passthrough descriptors, and high FDs (32768..36863) for normal
LKL allocation, capped at KBOX_FD_TABLE_MAX=4096 entries. This split
avoids collisions between host-kernel FDs (pipes, inherited
descriptors, eventfds) and LKL-managed FDs.
The supervisor mirrors selected guest file opens into host-visible
memfds so that native mmap and other host-side loaders work without
LKL involvement. This is essential for dynamic linking: the ELF
loader maps .text and .rodata segments via mmap, which requires a
real host FD. Three flavors coexist:
- Read-only sealed shadows: when the guest opens a regular file
O_RDONLY, the supervisor copies the LKL file contents into a sealed memfd and hands the memfd number to the tracee. These are point-in-time snapshots with no write-back, capped at 256MB. The host kernel handles subsequent reads, fstats, and mmaps directly viaCONTINUE. - Writeback shadows: writable opens (
O_RDWR/O_WRONLY) get a shadow that the supervisor still backs with a memfd, but the entry is markedshadow_writeback(fd-table.h) and the supervisor flushes dirty pages back to the LKL file insync_shadow_writebackon close, fsync, and similar synchronization points (dispatch-misc.c:736,759).active_writeback_shadowsis tracked on the supervisor context so the dispatcher knows when writes are outstanding. - Path shadow cache: an 8-entry cache
(
KBOX_PATH_SHADOW_CACHE_MAXinseccomp.h) reuses memfds across repeated reads or stats of the same path, avoiding redundant LKL copies for hot read paths like libc and ld-musl. The cache is invalidated whenever the guest writes to a path that could overlap a cached entry.
Lexical normalization with 6 escape-prevention checks. Paths starting
with /proc, /sys, /dev are routed to the host kernel via
CONTINUE. Everything else goes through LKL. The normalizer handles
.. traversal, double slashes, and symlink-based escape attempts
(/proc/self/root, /proc/<pid>/cwd).
Binaries are extracted from the LKL filesystem into memfds for
fexecve. For dynamically-linked binaries, the PT_INTERP segment
names an interpreter (e.g., /lib/ld-musl-x86_64.so.1) that does not
exist on the host. The supervisor extracts the interpreter into a
second memfd and patches PT_INTERP in the main binary to
/proc/self/fd/N. The host kernel resolves this during
load_elf_binary, before close-on-exec runs.
pipe()/pipe2() create real host pipes injected into the tracee via
SECCOMP_IOCTL_NOTIF_ADDFD. No LKL involvement; the host kernel
manages fork inheritance and close semantics natively. This is why
shell pipelines work: both parent and child share real pipe FDs that
the host kernel handles.
For direct binary commands, kbox loads the guest ELF into the current
process via a userspace loader (7 modules: entry, handoff, image,
layout, launch, stack, transfer). A BPF filter traps guest-range
instruction pointers via SECCOMP_RET_TRAP, delivering SIGSYS. The
signal handler saves/restores the FS base (FSGSBASE instructions on
kernel 5.9+, arch_prctl fallback) so kbox and guest each use their own
TLS. A service thread runs the full dispatch; the handler captures the
request and spins until the result is ready, keeping heap-allocating
code out of signal context. arch_prctl(SET_FS) is intercepted to
maintain dual TLS state.
Scans executable PT_LOAD segments for syscall instructions and patches them to branch directly into dispatch trampolines, eliminating the SIGSYS signal overhead for patched sites.
On aarch64, SVC #0 (4 bytes, fixed-width) is replaced with a B
branch to a per-site trampoline allocated past the segment end. The
trampoline saves registers, loads the origin address, and calls the C
dispatch function directly on the guest thread. No signal frame, no
service thread context switch. Veneer pages with
LDR x16, [PC+8]; BR x16 indirect stubs bridge sites beyond the
±128MB B-instruction range, with slot reuse to avoid wasting a full
page per veneer. This is why aarch64 rewrite achieves ~3us stat (vs
22us in seccomp): the dispatch runs in-process with LKL serving from
the inode cache.
On x86_64, an instruction-boundary-aware length decoder
(x86-decode.c) walks true instruction boundaries, eliminating false
matches of 0F 05/0F 34 bytes inside immediates, displacements, and
SIB encodings. Only 8-byte wrapper sites (mov $NR, %eax; syscall; ret) are patched to jmp rel32 targeting a wrapper trampoline that
encodes the syscall number and origin address. Bare 2-byte syscall
instructions are not rewritten because the only same-width replacement
(call *%rax, FF D0) would jump to the syscall number in RAX rather
than a code address. Unpatched sites fall through to the SIGSYS trap
path. However, the guest-thread local fast-path
(kbox_dispatch_try_local_fast_path) handles ~40 high-frequency
syscalls (futex, brk, poll/ppoll/pselect6, munmap, mremap, madvise,
sched_yield, the rest of the sched_* family, wait4/waitid, fork/vfork,
setpgid/getpgid, statfs, sysinfo, clock_nanosleep, etc.) directly on
the guest thread without any service-thread IPC, giving trap mode a
measurable advantage over seccomp for operations surrounded by these
host-kernel calls. mmap and epoll_* are deliberately excluded
because they need W^X enforcement, shadow validation, or FD-table
gating that the fast-path cannot perform.
On riscv64, the rewrite mode is not available now.
Each site is classified as WRAPPER (simple syscall; ret pattern,
eligible for inline virtualized return: getpid=1, gettid=1, getppid=0)
or COMPLEX (result consumed internally by helpers like raise() that
feed gettid into tgkill; must use full dispatch). An origin map
validates dispatch calls against known rewrite sites and carries the
per-site classification. During re-exec (trap_userspace_exec), the
rewrite runtime is re-installed on the new binary. Multi-threaded
guests (CLONE_THREAD) are blocked in trap/rewrite mode; use
--syscall-mode=seccomp for threaded workloads.
LKL is built as ARCH=lkl, which uses asm-generic headers. On x86_64,
struct stat differs between asm-generic (128 bytes, st_mode at
offset 16) and the native layout (144 bytes, st_mode at offset 24).
Reading st_mode from an LKL-filled buffer using a host struct stat
reads st_uid instead. kbox uses struct kbox_lkl_stat matching the
asm-generic layout, with field-by-field conversion via
kbox_lkl_stat_to_host() before writing to tracee memory. Compile-time
_Static_assert checks enforce struct sizes and critical field
offsets.
seccomp args[] zero-extends 32-bit values: fd=-1 becomes
0x00000000FFFFFFFF, not 0xFFFFFFFFFFFFFFFF. All handlers extracting
signed arguments (AT_FDCWD, MAP_ANONYMOUS fd) truncate to 32 bits
before sign-extending: (long)(int)(uint32_t)args[N].
On aarch64, four O_* flags differ between the host and asm-generic:
O_DIRECTORY, O_NOFOLLOW, O_DIRECT, O_LARGEFILE. The dispatch
layer translates these bidirectionally.
The dashboard (--web flag, KBOX_HAS_WEB=1 build) reads LKL's
in-process /proc directly. Implementation notes:
- Telemetry is currently driven exclusively by the seccomp supervisor
loop.
kbox_web_record_syscallis only called fromsrc/seccomp-supervisor.c:423, andkbox_web_tick()runs from the same supervisor poll loop. Trap and rewrite modes do not yet emit per-syscall events or drive the periodic sampler -- the supervisor context is initialized (so the HTTP server stays up) but the event ring stays empty for those modes. Pin--syscall-mode=seccompif you need the full audit trail. - The telemetry sampler runs on the seccomp supervisor's poll timeout
(100ms tick), reading LKL
/proc/stat,/proc/meminfo,/proc/vmstat,/proc/loadavgviakbox_lkl_openat/kbox_lkl_read. A bounded per-tick parsing budget prevents expensive/procparsing from starving seccomp dispatch. Per-type softirq totals come from thesoftirqline in/proc/stat, not/proc/softirqs. - The HTTP server runs in a dedicated pthread with its own epoll set.
Shared state (snapshots, event ring) is protected by a single mutex.
Counter fields use
atomic_intfor cross-thread flags. - The event ring buffer holds 1024 entries split into 768 for sampled routine events (1% probabilistic sampling for high-frequency syscalls like read/write) and 256 reserved for errors and rare events (execve, clone, exit -- always captured). Events are sequence-numbered to prevent SSE duplicate delivery.
- Dispatch instrumentation in seccomp mode adds ~25ns overhead per
intercepted syscall (one
clock_gettime(CLOCK_MONOTONIC)call before and after dispatch). - All frontend assets (Chart.js 4.4.7, vanilla JS, CSS) are compiled
into the binary via
xxd -iat build time. No CDN, no npm, no runtime file I/O. The entire dashboard is self-contained in the kbox binary. - When neither
--webnor--trace-format jsonis passed, the observability subsystem is completely inert -- no threads, no sockets, no overhead.--trace-format jsoninitializes the telemetry context (so the supervisor records events to stderr) but skips the HTTP server thread and socket setup. WhenKBOX_HAS_WEBis not set at build time, the web code compiles to empty translation units.