Skip to content

libbpf-tools: Fix ringbuf leaks that leak prior events to userspace#5506

Open
vdasu wants to merge 1 commit into
iovisor:masterfrom
vdasu:ringbuf_leaks
Open

libbpf-tools: Fix ringbuf leaks that leak prior events to userspace#5506
vdasu wants to merge 1 commit into
iovisor:masterfrom
vdasu:ringbuf_leaks

Conversation

@vdasu
Copy link
Copy Markdown
Contributor

@vdasu vdasu commented May 1, 2026

Five libbpf-tools (mountsnoop, opensnoop, filelife, oomkill, and tcppktlat) call reserve_buf() to obtain an event buffer, populate a subset of the event's fields, and emit the full sizeof(*eventp) bytes via submit_buf(). Neither path of reserve_buf() (bpf_ringbuf_reserve() or the per-CPU array heap fallback) zero-initializes the returned memory. Any bytes the source-level path leaves unwritten retain whatever the slot held previously. This can leak previously emitted record content back to userspace on subsequent events.

The mountsnoop leak was tested on Linux 6.8. Since struct event contains a tagged union with one active arm per syscall operation, and only the arm matching argp->op is written, the inactive union bytes contained prior MOUNT records' src, dest, fs, and data paths.

A natural approach to zero the memory would be to use __builtin_memset(eventp, 0, sizeof(*eventp)). This does not seem to work reliably for larger BPF event structs: once the size exceeds LLVM's inline-store budget, LLVM may lower the memset into a libcall. BPF programs cannot link against libc, so the BPF backend has no memset symbol to resolve and compilation fails. The added zero_buf() helper avoids this by writing the byte loop explicitly in a __noinline subprogram, so it lands as a single BPF-to-BPF call shared across sites, and by using volatile writes so LLVM's loop-idiom recognition does not re-lower it back into __builtin_memset.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR mitigates information leaks in several libbpf-tools eBPF programs by ensuring reserved event buffers are explicitly zero-initialized before only a subset of fields are populated and the full event struct is emitted to userspace.

Changes:

  • Add a shared zero_buf() helper in compat.bpf.h to reliably zero event buffers without relying on __builtin_memset codegen.
  • Call zero_buf() immediately after reserve_buf() in mountsnoop, opensnoop, filelife, oomkill, and tcppktlat BPF programs.
  • Ensure that inactive/unused bytes in event structs (e.g., union arms) don’t retain prior record contents.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
libbpf-tools/compat.bpf.h Adds zero_buf() helper to clear reserved event buffers safely for BPF.
libbpf-tools/mountsnoop.bpf.c Zero-initializes struct event after reserving buffer to avoid union residue leaks.
libbpf-tools/opensnoop.bpf.c Zero-initializes struct event after reserving buffer before populating fields.
libbpf-tools/filelife.bpf.c Zero-initializes struct event after reserving buffer prior to filling output.
libbpf-tools/oomkill.bpf.c Zero-initializes struct data_t after reserving buffer prior to filling output.
libbpf-tools/tcppktlat.bpf.c Zero-initializes struct event after reserving buffer before populating fields.

Comment thread libbpf-tools/compat.bpf.h
Comment on lines +49 to +54
* recognition, which would otherwise re-lower this back to __builtin_memset
* (which the BPF backend cannot inline for buffers larger than ~256 bytes). */
static __noinline void zero_buf(void *p, __u64 sz)
{
for (__u64 i = 0; i < sz; i++)
*(volatile char *)((char *)p + i) = 0;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ekyooo I can commit this optimization if you think it is necessary.

eventp = reserve_buf(sizeof(*eventp));
if (!eventp)
goto cleanup;
zero_buf(eventp, sizeof(*eventp));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For mountsnoop, userspace already accesses only the active union arm via switch(e->op), so stale bytes in inactive arms are never read. Is "previously emitted record leak" the right framing for this case?

Would it be simpler to check the return value of bpf_probe_read_user_str and skip submit_buf on failure, rather than zeroing the whole buffer? Any case where a partial event is still useful?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale bytes are never read in the current userspace program. However, the ringbuf does contain stale bytes from prior events that are accessible from userspace. Any other consumer of the same ringbuf (e.g., a custom loader) sees the full record submitted to userspace, not just the bytes in the active arm.

I confirmed the leak empirically by reading the ringbuf directly (i.e., not through mountsnoop.c) on Linux 6.8. When you first reserve ringbuf memory (bpf_ringbuf_reserve(size)), the kernel returns a zeroed-out chunk of memory. However, since the ringbuf is 256 KB and each event is 8768 bytes, the buffer holds about ~29 events before the write position wraps. From event 30 onwards, each write lands on cycled slots, and the inactive-arm bytes contain fully decodable previous records. Across 200 events under a mount -t tmpfs / umount loop, 11,396 bytes of non-zero residue leaked, including PIDs, command names, and mount paths like /run/user/<UID>/credentials/mnt-test.mount, which reveal which user triggered earlier mount operations.

Regarding checking the return value of bpf_probe_read_user_str, that is good practice, but it's orthogonal to this leak. When the active arm is smaller than mount (e.g., FSMOUNT writes 12 bytes, UMOUNT writes 4104 bytes), the trailing bytes of the union (8720 - active arm size) are never targeted by any helper. Even if every bpf_probe_read_user_str succeeds and we never skip submit_buf, those bytes remain unwritten.

Partial events might be useful, since they are better than silently dropping the event. However, I believe that is a separate question since it would not fix the leak.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed the leak empirically by reading the ringbuf directly (i.e., not through mountsnoop.c) on Linux 6.8. When you first reserve ringbuf memory (bpf_ringbuf_reserve(size)), the kernel returns a zeroed-out chunk of memory. However, since the ringbuf is 256 KB and each event is 8768 bytes, the buffer holds about ~29 events before the write position wraps. From event 30 onwards, each write lands on cycled slots, and the inactive-arm bytes contain fully decodable previous records. Across 200 events under a mount -t tmpfs / umount loop, 11,396 bytes of non-zero residue leaked, including PIDs, command names, and mount paths like /run/user//credentials/mnt-test.mount, which reveal which user triggered earlier mount operations.
Regarding checking the return value of bpf_probe_read_user_str, that is good practice, but it's orthogonal to this leak. When the active arm is smaller than mount (e.g., FSMOUNT writes 12 bytes, UMOUNT writes 4104 bytes), the trailing bytes of the union (8720 - active arm size) are never targeted by any helper.

Your observation is valid, and this seems to be an expected result of mountsnoop’s design.
Can you clarify the exact issue you are trying to solve?

  1. Previously emitted records were already delivered to userspace. In that sense, is this really a leak in the security sense?
  2. When using mountsnoop.c, inactive union fields are not read, so there is no practical issue. Do you have another real consumer besides mountsnoop that is affected?

Comment thread libbpf-tools/compat.bpf.h
* subprogram shared across call sites; volatile defeats LLVM's loop-idiom
* recognition, which would otherwise re-lower this back to __builtin_memset
* (which the BPF backend cannot inline for buffers larger than ~256 bytes). */
static __noinline void zero_buf(void *p, __u64 sz)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two concerns with zero_buf():

  • opensnoop is a high-frequency probe and struct event is ~8 KB — per-event zeroing may add noticeable overhead?
  • The variable-bounded loop requires Linux 5.3+. Kernels below that would reject the program at load time?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed this.

  1. Yes, I zeroing out will add some overhead. Copilot's suggestion to optimize the for loop should speed things up significantly. Would you like me to run some tests to see the ovehead zero_buf adds? Also, another way to speed things up would be to zero out only the hole in the struct. If we do this, we have will have optimized ringbuf zero functions for each eBPF program that leaks data.
  2. This should not be a problem for libbpf-tools since it requires CO-RE which requires kernel version 5.4+. Also, at this point in time, it looks like most libbpf-tools programs require version 5.5+ (This issue (Allow libbpf-tools to run on old kernels #4231) seems to address the backwards compatibility problem, but at this point in time, most programs are still not backward compatible).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants