A fast path that intercepts calls to musl's __syscall_cancel from
inside libc syscall wrappers and replaces them with a direct dispatch
into kbox's rewrite trampoline. The win is concentrated on aarch64
open+close, which is the only remaining non-near-native row in the
rewrite baseline (47.3us vs ~1.5us for the other rewritten
syscalls; see TODO.md). The reason it lags is that musl's open()
wrapper does not call the kernel directly: it goes through
__syscall_cancel, and the rewriter's per-SVC-site patcher only sees
the SVC inside __syscall_cancel itself, where the wrapper-class
contract does not hold.
The patch redirects the call site (the bl __syscall_cancel
instruction inside the wrapper function) instead of the SVC, so the
fast-path executes with the wrapper's calling convention -- not the
generic SVC trampoline's.
Static-musl's __syscall_cancel(a, b, c, d, e, f, nr) (see
musl/src/thread/__syscall_cancel.c):
- Loads
self->cancel(the calling thread's pthread_cancel state). - If a cancellation has been requested AND the thread is in
PTHREAD_CANCEL_ENABLEAND the cancel type is asynchronous, it raises the cancel and never returns from the call. - Otherwise it issues the syscall (the SVC that the SVC-site
rewriter currently patches) and returns the raw kernel result
(
-4095..-1encodes errno; everything else is the return value). - After the SVC, it re-checks the cancel state for deferred cancellation and may run the cancel handlers if a cancel arrived during the syscall.
The wrapper's calling convention on aarch64:
| Reg | Meaning |
|---|---|
x0..x5 |
syscall args a..f |
x6 |
syscall number nr |
x0 |
return value (raw kernel result) |
x7..x18 |
caller-clobbered scratch |
x19+ |
callee-saved (preserved) |
The key facts for kbox:
- errno is not set inside
__syscall_cancel; the libc wrapper (e.g.open()) inspects the return and sets errno. - The result register (
x0) and the dispatch register (x6) are exactly whatkbox_syscall_rewrite_aarch64_dispatchalready consumes via the existing cancel trampoline (kbox_syscall_rewrite_aarch64_cancel_entryinsrc/rewrite.c). - All FD bookkeeping happens inside the dispatch path; the cancel fast path is just an alternate entry to the same dispatch and inherits the existing forwarder semantics unchanged.
Cancellation-point semantics. By bypassing __syscall_cancel, the
fast path skips the pre- and post-syscall pthread cancel checks. For
a single-threaded program no cancellation can ever be pending, so
this is a no-op. For a multi-threaded program with a thread that has
been pthread_cancel'd, the bypassed open() will not act as a
cancellation point and will not raise the cancel until the next real
cancellation point.
This is the only correctness gap. errno propagation, FD bookkeeping,
register preservation, and unwind metadata are all unaffected (we
return to bl_pc + 4, exactly where the BL would have returned, with
x0 set the same way).
Two conditions must both hold at install time for a cancel-style BL
site to be promoted. The gate is stored in
kbox_rewrite_runtime::cancel_promote_allowed, computed once during
kbox_rewrite_runtime_install() and consulted from
rewrite_runtime_should_patch_site().
launch->interp_elf == NULL && launch->interp_elf_len == 0. For a
dynamic binary, libc (and therefore the clone wrapper that
pthread_create depends on) lives in an interpreter-loaded DSO that
the main-ELF scan cannot see. A dynamic program could also dlopen
a DSO that spins up threads at runtime, which is not detectable
statically at all. Promotion for dynamic binaries is unsafe, so the
gate rejects them outright. This also means the cancel-wrapper fast
path only benefits static-musl binaries today; dynamic programs stay
on the existing forwarder path.
kbox_rewrite_has_wrapper_syscalls(main_elf, ..., {clone, fork, vfork, clone3}) returns 0. Because the binary is static (condition
1), libc is part of main_elf — there is no separate interpreter
ELF to scan. Any pthread_create → libc clone wrapper compiles
down to a mov x8, #220; svc 0 site inside main_elf's text
segment, and the wrapper-number scanner catches it. Scanning
main_elf alone is therefore sufficient to cover the embedded libc
in a static build; this is the important invariant that makes the
gate sound.
Rationale: no fork-family sites in the main (= only) ELF implies the program cannot create additional threads, which implies pthread cancellation cannot be pending on any thread, which makes the cancel bypass a strict no-op.
This gate rejects multi-threaded programs that never actually call
pthread_cancel — those would also be safe to promote, but proving
it statically is unreliable. The gate costs nothing on the
bench-test target (single-threaded) and trivially preserves
correctness for everything else by leaving the existing forwarder
path in place.
-
A static program that invokes
clone/clone3viasyscall(3)(register-indirectmov x8, x_reg; svc 0) slips past the wrapper-number scanner, which only matches the literal-immediate patternmovz x8, #nr; svc 0. The gate would approve such a binary even though it can actually create threads. This pattern is very rare in practice — no test binary in the tree exercises it, and musl's ownpthread_createpath uses the immediate form — but it is a known unsoundness that would need a stronger static analysis to close. Not introduced by this series; the underlying scanner predates it. -
The shared-libc musl
__syscall_cancelcalling convention differs from the static one (nr inx0vsx6). Even if condition 1 were relaxed, the current BL-site detector would not recognize dynamic-musl call sites. Out of scope for this fast path.
Pattern (aarch64, walk the segment one 4-byte instruction at a time):
mov{z} x6, #imm16 ; syscall number into x6
... within 32 bytes ...
bl <target> ; opcode 0x94XXXXXX
The intermediate instructions are arbitrary (typically arg setup and
maybe a mov x?, x6). The 32-byte horizon is a safety bound to keep
the heuristic local; in practice the BL is one or two instructions
after the mov.
We deliberately do not match plain b (opcode 0x14XXXXXX).
A plain b would be a tail call, which has different return
semantics: after the rewrite trampoline executes, control falls
through to b_pc + 4, but a tail-call site has no meaningful "next
instruction" -- the surrounding function expects never to come back.
Restricting to BL sidesteps this entire class of misanalysis.
We also do not validate that the BL target is the actual
__syscall_cancel symbol. Doing so would require a dynsym/symtab
walk and would only work for symbol-bearing static-musl binaries.
The mov x6, #nr constraint is already a strong structural filter
(x6 is not used as an arg register by the kernel ABI on aarch64), and
the single-thread gate makes any false positive a no-op anyway.
The patch is the same B-relative-to-trampoline encoding the rewriter already uses for SVC sites:
[bl <target>] -> [b <trampoline_slot>]
The trampoline slot (AARCH64_REWRITE_SLOT_SIZE = 32 bytes) is
emitted by write_aarch64_trampoline with
wrapper_kind = SYSCALL_CANCEL, which causes it to point at
kbox_syscall_rewrite_aarch64_cancel_entry (versus the regular
kbox_syscall_rewrite_aarch64_entry for SVC sites). The cancel
entry differs from the regular entry in exactly one line: it loads
nr from saved x6 (offset +40) instead of saved x8 (offset
+56). Everything else -- register save/restore, the call into
kbox_syscall_rewrite_aarch64_dispatch, and the resume sequence --
is identical.
After dispatch, the cancel entry executes:
add x16, x19, #4 ; x19 holds origin = bl_pc, so x16 = bl_pc + 4
br x16
This resumes the wrapper at the instruction after the BL, with x0
holding the kernel result, which is exactly the state the wrapper
expects after a normal return from __syscall_cancel.
x30 (LR) is restored to whatever the BL site's caller had stored
before the call. We do not update it to bl_pc + 4 even though a
real BL would have. This is fine: AAPCS64 treats x30 as
caller-clobbered across a call, and the wrapper's prologue has
already saved the function's own return address; nothing in the
wrapper body reads x30 between the BL and the function epilogue.
If the BL is more than ±128 MiB from the trampoline page, the
existing veneer fallback in kbox_rewrite_runtime_install bridges
the gap exactly the same way it does for out-of-range SVC sites.
Unit (tests/unit/test-rewrite.c):
- Encodes a synthetic
mov x6, #57; bl ...(close) and amov x6, #56; bl ...(openat) inside an aarch64 ELF segment; assertsanalyze_segmentemits a planned site at the BL withwidth=4andoriginalmatching the BL bytes. - Asserts
kbox_rewrite_encode_patchfor the BL site emits a 4-byte B withimm26pointing at the trampoline. - Asserts that with
kbox_rewrite_has_fork_sitestrue on the same ELF,cancel_promote_allowedwould be 0 and the install path would skip the cancel-kind sites (validated throughrewrite_runtime_should_patch_site).
Integration: bench-test under --syscall-mode=rewrite on lima
(correctness) and on arm (perf delta on open+close).
Before (real Arm64, bench-test 1000, BUILD=release, from TODO.md):
| Syscall | rewrite |
|---|---|
| open+close | 47.3us |
Target: pull open+close into the same ~1.5 us tier as the other
rewritten rows. The numbers will be re-captured on the same arm
host with the same release build before/after this change and
recorded in the changelog.