Improve Python unwinding resilience #214
    // Push the frame with the code object address so the agent can try to
    // read it via /proc/pid/mem (which supports page faults unlike BPF).
    // codeobject_id=0 distinguishes this from a successful read.
Where does the reading via /proc/pid/mem happen?
    m := value
    if m.ebpfChecksum == ebpfChecksum {
    return m, nil
    if ebpfChecksum != 0 {
When/why can this checksum be 0?
When bpf_probe_read_user fails to read the Python code object.
When bpf_probe_read_user fails to read a PyCodeObject (e.g. page swapped out), push the frame with codeobject_id=0 instead of aborting the unwind. This preserves the rest of the stack trace.

On the agent side, handle ebpfChecksum=0 in getCodeObject by skipping the LRU cache (no checksum to validate against) and the staleness check (no BPF reference to compare). The agent reads the code object via process_vm_readv, which supports page faults, so it can succeed where BPF could not. Store the calculated checksum in the cache so subsequent frames with a real BPF checksum can match.
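The unwinder-side fallback described in this commit message can be modeled in plain Go. This is only a sketch of the control flow, not the eBPF code from the PR: `Frame`, `readFn`, `faultAt`, and `unwind` are hypothetical names, and the reader function merely stands in for bpf_probe_read_user.

```go
package main

import (
	"errors"
	"fmt"
)

// Frame is a hypothetical stand-in for a frame pushed by the unwinder.
type Frame struct {
	CodeObjectAddr uint64 // address of the PyCodeObject in the target process
	CodeObjectID   uint64 // 0 means the in-kernel read failed
}

var errFault = errors.New("read fault")

// readFn models bpf_probe_read_user: it yields a non-zero ID or an error.
type readFn func(addr uint64) (uint64, error)

// faultAt returns a reader that fails for exactly one address (assumed to be
// a swapped-out page) and returns an arbitrary non-zero ID otherwise.
func faultAt(bad uint64) readFn {
	return func(addr uint64) (uint64, error) {
		if addr == bad {
			return 0, errFault
		}
		return addr | 1, nil
	}
}

// unwind walks the code object addresses of a stack. With fallback enabled,
// a failed read still pushes a frame (ID 0) and the walk continues; without
// it, the walk aborts at the first failure, as before the change.
func unwind(addrs []uint64, read readFn, fallback bool) ([]Frame, error) {
	frames := make([]Frame, 0, len(addrs))
	for _, addr := range addrs {
		id, err := read(addr)
		if err != nil {
			if !fallback {
				return frames, err
			}
			id = 0 // let the agent retry the read from user space
		}
		frames = append(frames, Frame{CodeObjectAddr: addr, CodeObjectID: id})
	}
	return frames, nil
}

func main() {
	addrs := []uint64{0x1000, 0x2000, 0x3000}
	frames, err := unwind(addrs, faultAt(0x2000), true)
	fmt.Println("with fallback:", len(frames), "frames, err =", err)
	frames, err = unwind(addrs, faultAt(0x2000), false)
	fmt.Println("without fallback:", len(frames), "frames, err =", err)
}
```

Keeping the address in the frame is what makes the retry possible: the agent still knows which object to read even though the ID is 0.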
Pull request overview
This PR improves the resilience of Python stack unwinding by handling cases where eBPF fails to read PyCodeObject structures (e.g., when pages are swapped out). Instead of aborting the entire unwind, the unwinder now pushes a frame with minimal information and lets the agent retry the read using process_vm_readv, which supports page faults unlike BPF.
Changes:
- Modified eBPF unwinder to push frames with codeobject_id=0 when PyCodeObject read fails
- Updated agent to skip cache lookups and staleness checks when ebpfChecksum=0
- Agent stores calculated checksum for subsequent frame matching
Reviewed changes
Copilot reviewed 2 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| support/ebpf/python_tracer.ebpf.c | Added fallback behavior when bpf_probe_read_user fails: pushes frame with code object address and codeobject_id=0 instead of returning error |
| interpreter/python/python.go | Modified getCodeObject to skip cache and staleness checks when ebpfChecksum=0, and store calculated checksum instead of BPF-provided checksum |
    if ebpfChecksum != 0 {
        if value, ok := p.addrToCodeObject.Get(addr); ok {
            m := value
            if m.ebpfChecksum == ebpfChecksum {
                return m, nil
            }
        }
    }
The new behavior for handling ebpfChecksum=0 (when BPF fails to read the code object) lacks test coverage. Consider adding a unit test that verifies:
- When ebpfChecksum=0, the cache lookup is skipped
- The code object is successfully read via RemoteMemory.Read
- The calculated checksum is stored in the cache
- Subsequent lookups with the calculated checksum result in cache hits
This would help prevent regressions and document the expected behavior.
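As a starting point for such a test, the four checks above can be exercised against a simplified model of the lookup. Everything here is an assumption made for illustration: `fakeMemory` stands in for RemoteMemory, a plain map replaces the LRU cache, `checksum` is an FNV placeholder rather than the checksum the project computes, and the staleness check is reduced to a checksum comparison.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

// fakeMemory is a hypothetical remote-memory reader that counts its reads.
type fakeMemory struct {
	objects map[uint64][]byte
	reads   int
}

func (f *fakeMemory) Read(addr uint64) ([]byte, error) {
	f.reads++
	b, ok := f.objects[addr]
	if !ok {
		return nil, errors.New("unmapped address")
	}
	return b, nil
}

// checksum is a placeholder for the checksum shared by BPF and the agent.
func checksum(b []byte) uint32 {
	h := fnv.New32a()
	h.Write(b)
	return h.Sum32()
}

type codeObject struct {
	ebpfChecksum uint32
	raw          []byte
}

type pythonInstance struct {
	mem              *fakeMemory
	addrToCodeObject map[uint64]*codeObject // plain map instead of an LRU
}

// getCodeObject sketches the lookup: checksum 0 bypasses both the cache and
// the staleness comparison, and the calculated checksum is what gets cached.
func (p *pythonInstance) getCodeObject(addr uint64, ebpfChecksum uint32) (*codeObject, error) {
	if ebpfChecksum != 0 {
		if m, ok := p.addrToCodeObject[addr]; ok && m.ebpfChecksum == ebpfChecksum {
			return m, nil
		}
	}
	raw, err := p.mem.Read(addr)
	if err != nil {
		return nil, err
	}
	sum := checksum(raw)
	if ebpfChecksum != 0 && sum != ebpfChecksum {
		return nil, errors.New("code object changed since BPF sampled it")
	}
	m := &codeObject{ebpfChecksum: sum, raw: raw}
	p.addrToCodeObject[addr] = m
	return m, nil
}

func main() {
	mem := &fakeMemory{objects: map[uint64][]byte{0x1000: []byte("code-object-bytes")}}
	p := &pythonInstance{mem: mem, addrToCodeObject: map[uint64]*codeObject{}}

	m, err := p.getCodeObject(0x1000, 0) // BPF read failed: checksum is 0
	fmt.Println("first lookup err:", err, "reads:", mem.reads)

	_, err = p.getCodeObject(0x1000, m.ebpfChecksum) // later frame with a real checksum
	fmt.Println("second lookup err:", err, "reads:", mem.reads)
}
```

Counting reads on the fake makes cache hits observable without reaching into the cache implementation, so the same assertions should carry over to a real LRU.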
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Summary
Details
PyCodeObject Recovery
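When bpf_probe_read_user fails to read a PyCodeObject (e.g., page swapped out), the unwinder now pushes the frame with codeobject_id=0 instead of aborting the entire unwind. This preserves the rest of the stack trace.

On the agent side, getCodeObject handles ebpfChecksum=0 by:
- Skipping the LRU cache lookup (no checksum to validate against)
- Skipping the staleness check (no BPF reference to compare)
- Reading the code object via process_vm_readv, which supports page faults
- Storing the calculated checksum in the cache so subsequent frames with a real BPF checksum can match

Impact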
🤖 Generated with Claude Code