# eBPF Stack Limit Bypass: Per-CPU Array in Practice

## 1. Background

### 1.1 eBPF Stack Limit

eBPF programs run in kernel space. For safety, the kernel strictly limits eBPF program stack space to **512 bytes**.

```
┌─────────────────────────────────────┐
│       eBPF Program Stack Space      │
│                                     │
│   ┌─────────────────────────┐       │
│   │    Maximum 512 bytes    │       │
│   │                         │       │
│   │  Local variables, temp  │       │
│   │                         │       │
│   └─────────────────────────┘       │
│                                     │
│   Exceed limit → Verifier rejects   │
└─────────────────────────────────────┘
```

### 1.2 Why This Limit?

| Reason | Description |
|--------|-------------|
| **Limited kernel stack** | The kernel stack is typically 8–16 KB and must be shared with the rest of the kernel code path |
| **Prevent stack overflow** | A stack overflow could crash the kernel or open security vulnerabilities |
| **Predictability** | A fixed limit lets the verifier statically bound stack usage |

### 1.3 Real-World Problems

In practice, eBPF programs often need to handle large data structures:

```c
// This will fail to load!
SEC("tracepoint/...")
int my_prog(void *ctx) {
    struct big_event e;     // 544 bytes
    struct extra_buffer ex; // 768 bytes
    struct local_data ld;   // 256 bytes
    // verifier rejects: total ~1568 bytes exceeds the 512B limit
    return 0;
}
```

Common scenarios requiring large buffers:

- Process monitoring: store process name, path, and arguments
- Network analysis: store packet contents
- Security auditing: collect detailed context information
- File monitoring: store file paths and contents

## 2. Solution: Per-CPU Array

### 2.1 Core Idea

Store large data structures in BPF maps instead of on the stack:

```
┌─────────────────────────────────────────────────────┐
│              Traditional Way (Fails)                │
├─────────────────────────────────────────────────────┤
│  Stack allocation:                                  │
│    struct big_event e;     // 544B ─┐               │
│    struct extra_buffer ex; // 768B  ├→ 1568B > 512B │
│    struct local_data ld;   // 256B ─┘               │
│                                                     │
│  Result: Verifier rejects                           │
└─────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│                Per-CPU Array (Works)                │
├─────────────────────────────────────────────────────┤
│  Map allocation:                                    │
│    __u32 key = 0;                    // 4B ─┐       │
│    struct big_event *e = lookup(&map, &key); ├→ ~12B│
│    struct extra_buffer *ex = lookup(...);    │      │
│    struct local_data *ld = lookup(...);     ─┘      │
│                                                     │
│  Result: stack usage < 512B, Verifier passes        │
└─────────────────────────────────────────────────────┘
```

### 2.2 Why Per-CPU Array?

| Map Type | Concurrency Safety | Performance | Use Case |
|----------|--------------------|-------------|----------|
| `BPF_MAP_TYPE_ARRAY` | Needs synchronization | Medium | Shared data |
| `BPF_MAP_TYPE_PERCPU_ARRAY` | Safe by construction | High | Temp buffers |
| `BPF_MAP_TYPE_HASH` | Needs synchronization | Medium | Dynamic keys |

**Per-CPU Array advantages**:

- Each CPU gets an independent copy of the buffer
- No lock contention, no cacheline bouncing
- O(1) lookup time
- Ideal for temporary work buffers

## 3. Implementation

### 3.1 BPF Kernel Program

```c
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

// Large struct: event data (exceeds the 512B stack limit)
struct big_event {
    __u32 pid;
    __u64 timestamp;
    char comm[16];
    char data[512];  // This field pushes the struct past 512B
};

// Per-CPU Array definition
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct big_event);
} event_buffer SEC(".maps");

// Ring buffer: pass events to userspace
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tracepoint/sched/sched_process_exec")
int trace_exec(struct trace_event_raw_sched_process_exec *ctx)
{
    struct big_event *e;
    __u32 key = 0;

    // Get the buffer from the per-CPU array
    e = bpf_map_lookup_elem(&event_buffer, &key);
    if (!e)
        return 0;

    // Fill event data
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->timestamp = bpf_ktime_get_ns();
    bpf_get_current_comm(e->comm, sizeof(e->comm));

    // Fill the data field
    e->data[0] = e->pid & 0xFF;
    e->data[1] = (e->timestamp >> 8) & 0xFF;
    e->data[2] = (e->pid >> 16) & 0xFF;

    // Send to the ring buffer
    bpf_ringbuf_output(&events, e, sizeof(*e), 0);

    return 0;
}
```

### 3.2 Key Code Analysis

#### Per-CPU Array Definition

```c
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);  // Only one slot is needed
    __type(key, __u32);
    __type(value, struct big_event);
} event_buffer SEC(".maps");
```

- `max_entries = 1`: as a temporary buffer, only one element is needed
- Each CPU automatically gets an independent copy

#### Getting the Buffer

```c
__u32 key = 0;
struct big_event *e = bpf_map_lookup_elem(&event_buffer, &key);
if (!e)
    return 0;  // Must check for NULL
```

- Always use `key = 0`
- Returns a pointer to the current CPU's dedicated buffer
- **Must** check for NULL, or the verifier rejects the program

### 3.3 Userspace Program

```c
#include <stdio.h>
#include <signal.h>
#include <bpf/libbpf.h>
#include "stack_limit_bypass.skel.h"

struct big_event {
    __u32 pid;
    __u64 timestamp;
    char comm[16];
    char data[512];
};

static volatile sig_atomic_t exiting = 0;

static void sig_handler(int sig) { exiting = 1; }

static int handle_event(void *ctx, void *data, size_t data_sz)
{
    struct big_event *e = data;
    printf("[%llu] PID: %-6u | comm: %-16s\n",
           e->timestamp / 1000000, e->pid, e->comm);
    return 0;
}

int main(int argc, char **argv)
{
    struct stack_limit_bypass_bpf *skel;
    struct ring_buffer *rb = NULL;
    int err;

    signal(SIGINT, sig_handler);
    signal(SIGTERM, sig_handler);

    // Load BPF program
    skel = stack_limit_bypass_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "Failed to load BPF program\n");
        return 1;
    }

    // Attach to the tracepoint
    err = stack_limit_bypass_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF program\n");
        goto cleanup;
    }

    // Create the ring buffer
    rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
                          handle_event, NULL, NULL);
    if (!rb) {
        fprintf(stderr, "Failed to create ring buffer\n");
        err = -1;
        goto cleanup;
    }

    printf("Monitoring process exec events... (Ctrl+C to exit)\n");

    while (!exiting) {
        ring_buffer__poll(rb, 100);
    }

cleanup:
    ring_buffer__free(rb);
    stack_limit_bypass_bpf__destroy(skel);
    return err ? 1 : 0;
}
```

## 4. Build and Run

### 4.1 Normal Build

```bash
cd src/19-bypass-stack-limit
make clean && make
sudo ./stack_limit_bypass
```

Expected output:

```
========================================
Per-CPU Array Demo - Bypass eBPF 512B Stack Limit
========================================
Struct sizes:
  - big_event: 544 bytes
  - Total stack: ~1568 bytes (if using local variables)
  - eBPF limit: 512 bytes
========================================
Monitoring process exec events... (Ctrl+C to exit)

[12345] PID: 1234   | comm: bash
```

### 4.2 Trigger the Stack Limit Error (Demo)

Set `BAD_EXAMPLE_STACK` to 1 in the code, or pass a compile flag:

```bash
make clean && make EXTRA_CFLAGS="-DBAD_EXAMPLE_STACK=1"
sudo ./stack_limit_bypass
```

Expected output:

```
libbpf: prog 'trace_exec': BPF program load failed: Permission denied
libbpf: prog 'trace_exec': -- BEGIN PROG LOAD LOG --
...
Looks like the BPF stack limit of 512 bytes is exceeded.
Please move large on stack variables into BPF per-cpu array map.
...
Failed to load BPF program
```

## 5. Stack Usage Analysis

### 5.1 Struct Sizes

| Struct | Size | Description |
|--------|------|-------------|
| `big_event` | 544 bytes | pid(4) + padding(4) + timestamp(8) + comm(16) + data(512) |

### 5.2 Stack Usage Comparison

| Method | Stack Usage | Result |
|--------|-------------|--------|
| Stack allocation `struct big_event e;` | 544+ bytes | Verifier rejects |
| Per-CPU Array pointer | ~12 bytes | Verifier passes |

## 6. Preventing Compiler Optimization

When demonstrating the error case, prevent the compiler from optimizing away otherwise-unused stack variables:

### 6.1 Memory Barrier

```c
#define barrier() asm volatile("" ::: "memory")

struct big_event stack_event = {};
barrier();  // Tell the compiler memory may have changed; don't optimize the variable away
```

### 6.2 Explicitly Use the Variables

```c
// Ensure the stack variable is actually used
stack_event.data[0] = pid & 0xFF;
stack_event.data[100] = (pid >> 8) & 0xFF;
stack_event.data[200] = (pid >> 16) & 0xFF;
```

## 7. Best Practices

### 7.1 When to Use Per-CPU Array

| Scenario | Recommendation |
|----------|----------------|
| Temporary work buffers | Highly recommended |
| Event data collection | Recommended |
| Large string handling | Recommended |
| Cross-CPU sharing needed | Not suitable; use a regular Array |

### 7.2 Usage Tips

1. **Fixed key = 0**: only one slot is needed for a buffer
2. **Always check for NULL**: `bpf_map_lookup_elem` may return NULL
3. **Clear before reuse**: consider zeroing the buffer to avoid stale data
4. **Mind the size**: a single Per-CPU Array element also has size limits

### 7.3 Common Mistakes

```c
// Wrong: missing NULL check
e = bpf_map_lookup_elem(&buffer, &key);
e->pid = 123;  // Verifier rejects!

// Correct: check first
e = bpf_map_lookup_elem(&buffer, &key);
if (!e)
    return 0;
e->pid = 123;  // OK
```

## 8. Kernel Version Compatibility

| Kernel Version | Stack Limit Behavior |
|----------------|---------------------|
| All versions | Strict 512-byte limit on a program's stack |
| 4.16+ | Supports BPF-to-BPF calls, but the combined stack of the whole call chain is still capped at 512 bytes |
| Newer kernels | Better verifier diagnostics (it even suggests moving large variables into a per-CPU array map), but the basic limit remains |

The Per-CPU Array solution works on all kernel versions that support eBPF maps.

## 9. Summary

This lesson covered the eBPF 512-byte stack limit and its workaround:

1. **Problem**: eBPF program stack is limited to 512 bytes
2. **Impact**: large data structures cannot be allocated on the stack
3. **Solution**: use a Per-CPU Array as a temporary buffer
4. **Benefits**: concurrency safe, high performance, lock-free

With this technique, you can handle large data structures in eBPF programs without running into the stack limit.

## 10. References

- [BPF Design Q&A - Stack Space](https://docs.kernel.org/bpf/bpf_design_QA.html)
- [Per-CPU Variables](https://lwn.net/Articles/258238/)
- [libbpf Documentation](https://libbpf.readthedocs.io/)