@@ -151,12 +151,19 @@ process, providing natural process-level separation.
For security-critical deployments, layer additional isolation around
the runner process:

- ### seccomp (recommended)
+ ### seccomp — guest-side (built-in)
+
+ The `guest/harden` package provides a built-in seccomp BPF blocklist
+ filter for the guest VM. Enable it with `boot.WithSeccomp(true)` or
+ call `harden.ApplySeccomp()` directly. See [Seccomp BPF filter](#seccomp-bpf-filter)
+ for the full blocklist.
+
+ ### seccomp — runner-side (recommended)

Restrict the runner's syscalls to only what libkrun and gvisor-tap-vsock
- need. This is the highest-impact hardening measure. After a guest escape,
- the attacker lands in a seccomp jail that limits what syscalls they can
- make.
+ need. This is the highest-impact host-side hardening measure. After a
+ guest escape, the attacker lands in a seccomp jail that limits what
+ syscalls they can make.

```
Runner process
@@ -205,8 +212,8 @@ lifetime with no extra coordination.
## Guest Hardening

- The `guest/harden` package provides reusable kernel and capability
- hardening for microVM init processes. It is guest-side code
+ The `guest/harden` package provides reusable kernel, capability, and
+ syscall hardening for microVM init processes. It is guest-side code
(`//go:build linux`) with no CGO or krun dependencies.

### Recommended usage
@@ -217,6 +224,10 @@ Call the hardening functions in your guest init boot sequence:
2. Call `harden.KernelDefaults(logger)` to apply sysctls.
3. Perform all privileged operations (mounts, network config, chown).
4. Call `harden.DropBoundingCaps(keep...)` as the last privileged step.
+ 5. Call `harden.ApplySeccomp()` as the very last hardening step.
+
+ Alternatively, use `boot.WithSeccomp(true)` to have the boot sequence
+ apply the seccomp filter automatically after SSH starts.

### Kernel sysctls

@@ -267,6 +278,47 @@ Note: `no_new_privs` does not affect `setresuid`/`setresgid` syscalls
used by Go's `SysProcAttr.Credential` — credential switching for SSH
sessions continues to work after the bit is set.

+ ### Seccomp BPF filter
+
+ `ApplySeccomp()` installs a seccomp BPF blocklist filter on all OS
+ threads. The filter uses `SECCOMP_FILTER_FLAG_TSYNC` to synchronize
+ across threads and sets `no_new_privs`. Once applied, it is inherited
+ by all child processes and cannot be removed.
+
+ The blocklist is split into two tiers:
+
+ **Exploitation indicators** (`SECCOMP_RET_KILL_PROCESS`): syscalls that
+ legitimate workloads never call, so any attempt strongly indicates
+ active exploitation.
+
+ | Syscall | Reason |
+ |---------|--------|
+ | `io_uring_setup`, `io_uring_enter`, `io_uring_register` | Prolific source of kernel CVEs |
+ | `ptrace`, `process_vm_readv`, `process_vm_writev` | Process debugging / cross-process memory access |
+ | `kexec_load`, `kexec_file_load` | Kernel replacement |
+ | `init_module`, `finit_module`, `delete_module` | Kernel module loading and unloading |
+ | `bpf` | eBPF: kernel attack surface |
+ | `seccomp` | Prevent installing additional filters |
+
+ **Operational blocks** (`SECCOMP_RET_ERRNO`/`EPERM`): syscalls that
+ runtimes or build tools might innocuously probe; these fail gracefully
+ with `EPERM` instead of killing the process.
+
+ | Category | Syscalls |
+ |----------|----------|
+ | Filesystem manipulation | `mount`, `umount2`, `pivot_root`, `chroot`, `fsopen`, `fsconfig`, `fspick`, `move_mount`, `open_tree` |
+ | Namespace creation | `unshare`, `setns`, `clone` (with `CLONE_NEW*` flags) |
+ | Side-channel attacks | `perf_event_open`, `userfaultfd` |
+ | Kernel keyring | `add_key`, `request_key`, `keyctl` |
+ | Landlock | `landlock_create_ruleset`, `landlock_add_rule`, `landlock_restrict_self` |
+ | Dangerous socket families | `AF_NETLINK`, `AF_PACKET`, `AF_KEY`, `AF_ALG`, `AF_VSOCK` |
+
+ `AF_INET`, `AF_INET6`, and `AF_UNIX` remain allowed for normal
+ workload operation.
+
+ Note: `clone3` is NOT blocked because glibc 2.34+ uses it for
+ `fork()`. Its namespace flags live in a struct passed by pointer
+ (`arg0`), which seccomp BPF cannot dereference.
+
### Filesystem hardening

Consumers should lock down `/root/` (mode `0700`) after completing