libbpf-tools: Fix ringbuf leaks that leak prior events to userspace#5506
libbpf-tools: Fix ringbuf leaks that leak prior events to userspace#5506vdasu wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR mitigates information leaks in several libbpf-tools eBPF programs by ensuring reserved event buffers are explicitly zero-initialized before only a subset of fields are populated and the full event struct is emitted to userspace.
Changes:
- Add a shared
zero_buf()helper incompat.bpf.hto reliably zero event buffers without relying on__builtin_memsetcodegen. - Call
zero_buf()immediately afterreserve_buf()inmountsnoop,opensnoop,filelife,oomkill, andtcppktlatBPF programs. - Ensure that inactive/unused bytes in event structs (e.g., union arms) don’t retain prior record contents.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| libbpf-tools/compat.bpf.h | Adds zero_buf() helper to clear reserved event buffers safely for BPF. |
| libbpf-tools/mountsnoop.bpf.c | Zero-initializes struct event after reserving buffer to avoid union residue leaks. |
| libbpf-tools/opensnoop.bpf.c | Zero-initializes struct event after reserving buffer before populating fields. |
| libbpf-tools/filelife.bpf.c | Zero-initializes struct event after reserving buffer prior to filling output. |
| libbpf-tools/oomkill.bpf.c | Zero-initializes struct data_t after reserving buffer prior to filling output. |
| libbpf-tools/tcppktlat.bpf.c | Zero-initializes struct event after reserving buffer before populating fields. |
| * recognition, which would otherwise re-lower this back to __builtin_memset | ||
| * (which the BPF backend cannot inline for buffers larger than ~256 bytes). */ | ||
| static __noinline void zero_buf(void *p, __u64 sz) | ||
| { | ||
| for (__u64 i = 0; i < sz; i++) | ||
| *(volatile char *)((char *)p + i) = 0; |
There was a problem hiding this comment.
@ekyooo I can commit this optimization if you think it is necessary.
| eventp = reserve_buf(sizeof(*eventp)); | ||
| if (!eventp) | ||
| goto cleanup; | ||
| zero_buf(eventp, sizeof(*eventp)); |
There was a problem hiding this comment.
For mountsnoop, userspace already accesses only the active union arm via switch(e->op), so stale bytes in inactive arms are never read. Is "previously emitted record leak" the right framing for this case?
Would it be simpler to check the return value of bpf_probe_read_user_str and skip submit_buf on failure, rather than zeroing the whole buffer? Any case where a partial event is still useful?
There was a problem hiding this comment.
Stale bytes are never read in the current userspace program. However, the ringbuf does contain stale bytes from prior events that are accessible from userspace. Any other consumer of the same ringbuf (e.g., a custom loader) sees the full record submitted to userspace, not just the bytes in the active arm.
I confirmed the leak empirically by reading the ringbuf directly (i.e., not through mountsnoop.c) on Linux 6.8. When you first reserve ringbuf memory (bpf_ringbuf_reserve(size)), the kernel returns a zeroed-out chunk of memory. However, since the ringbuf is 256 KB and each event is 8768 bytes, the buffer holds about ~29 events before the write position wraps. From event 30 onwards, each write lands on cycled slots, and the inactive-arm bytes contain fully decodable previous records. Across 200 events under a mount -t tmpfs / umount loop, 11,396 bytes of non-zero residue leaked, including PIDs, command names, and mount paths like /run/user/<UID>/credentials/mnt-test.mount, which reveal which user triggered earlier mount operations.
Regarding checking the return value of bpf_probe_read_user_str, that is good practice, but it's orthogonal to this leak. When the active arm is smaller than mount (e.g., FSMOUNT writes 12 bytes, UMOUNT writes 4104 bytes), the trailing bytes of the union (8720 - active arm size) are never targeted by any helper. Even if every bpf_probe_read_user_str succeeds and we never skip submit_buf, those bytes remain unwritten.
Partial events might be useful, since they are better than silently dropping the event. However, I believe that is a separate question since it would not fix the leak.
There was a problem hiding this comment.
I confirmed the leak empirically by reading the ringbuf directly (i.e., not through mountsnoop.c) on Linux 6.8. When you first reserve ringbuf memory (bpf_ringbuf_reserve(size)), the kernel returns a zeroed-out chunk of memory. However, since the ringbuf is 256 KB and each event is 8768 bytes, the buffer holds about ~29 events before the write position wraps. From event 30 onwards, each write lands on cycled slots, and the inactive-arm bytes contain fully decodable previous records. Across 200 events under a mount -t tmpfs / umount loop, 11,396 bytes of non-zero residue leaked, including PIDs, command names, and mount paths like /run/user//credentials/mnt-test.mount, which reveal which user triggered earlier mount operations.
Regarding checking the return value of bpf_probe_read_user_str, that is good practice, but it's orthogonal to this leak. When the active arm is smaller than mount (e.g., FSMOUNT writes 12 bytes, UMOUNT writes 4104 bytes), the trailing bytes of the union (8720 - active arm size) are never targeted by any helper.
Your observation is valid, and this seems to be an expected result of mountsnoop’s design.
Can you clarify the exact issue you are trying to solve?
- Previously emitted records were already delivered to userspace. In that sense, is this really a leak in the security sense?
- When using mountsnoop.c, inactive union fields are not read, so there is no practical issue. Do you have another real consumer besides mountsnoop that is affected?
| * subprogram shared across call sites; volatile defeats LLVM's loop-idiom | ||
| * recognition, which would otherwise re-lower this back to __builtin_memset | ||
| * (which the BPF backend cannot inline for buffers larger than ~256 bytes). */ | ||
| static __noinline void zero_buf(void *p, __u64 sz) |
There was a problem hiding this comment.
Two concerns with zero_buf():
- opensnoop is a high-frequency probe and struct event is ~8 KB — per-event zeroing may add noticeable overhead?
- The variable-bounded loop requires Linux 5.3+. Kernels below that would reject the program at load time?
There was a problem hiding this comment.
Sorry, I missed this.
- Yes, I zeroing out will add some overhead. Copilot's suggestion to optimize the for loop should speed things up significantly. Would you like me to run some tests to see the ovehead
zero_bufadds? Also, another way to speed things up would be to zero out only the hole in the struct. If we do this, we have will have optimized ringbuf zero functions for each eBPF program that leaks data. - This should not be a problem for libbpf-tools since it requires CO-RE which requires kernel version 5.4+. Also, at this point in time, it looks like most libbpf-tools programs require version 5.5+ (This issue (Allow libbpf-tools to run on old kernels #4231) seems to address the backwards compatibility problem, but at this point in time, most programs are still not backward compatible).
Five libbpf-tools (
mountsnoop,opensnoop,filelife,oomkill, andtcppktlat) callreserve_buf()to obtain an event buffer, populate a subset of the event's fields, and emit the fullsizeof(*eventp)bytes viasubmit_buf(). Neither path ofreserve_buf()(bpf_ringbuf_reserve()or the per-CPU arrayheapfallback) zero-initializes the returned memory. Any bytes the source-level path leaves unwritten retain whatever the slot held previously. This can leak previously emitted record content back to userspace on subsequent events.The
mountsnoopleak was tested on Linux 6.8. Sincestruct eventcontains a tagged union with one active arm per syscall operation, and only the arm matchingargp->opis written, the inactive union bytes contained priorMOUNTrecords'src,dest,fs, anddatapaths.A natural approach to zero the memory would be to use
__builtin_memset(eventp, 0, sizeof(*eventp)). This does not seem to work reliably for larger BPF event structs: once the size exceeds LLVM's inline-store budget, LLVM may lower the memset into a libcall. BPF programs cannot link against libc, so the BPF backend has nomemsetsymbol to resolve and compilation fails. The addedzero_buf()helper avoids this by writing the byte loop explicitly in a__noinlinesubprogram, so it lands as a single BPF-to-BPF call shared across sites, and by usingvolatilewrites so LLVM's loop-idiom recognition does not re-lower it back into__builtin_memset.