Improve Python unwinding resilience #214

Merged: gnurizen merged 6 commits into main from recover-python-frames on Feb 24, 2026

Conversation

@gnurizen (Collaborator) commented Feb 20, 2026

Summary

  • Recover Python frames when BPF fails to read PyCodeObject

Details

PyCodeObject Recovery

When bpf_probe_read_user fails to read a PyCodeObject (e.g., page swapped out), the unwinder now pushes the frame with codeobject_id=0 instead of aborting the entire unwind. This preserves the rest of the stack trace.
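The difference between aborting and pushing a placeholder frame can be modeled in a few lines. The sketch below is written in Go purely for illustration; the real unwinder is eBPF C, and the names `frame`, `readFn`, `unwindAbortOnFailure`, and `unwindWithFallback` are hypothetical. It also assumes the old code stopped at the first failed read and kept whatever it had collected so far.

```go
package main

import (
	"errors"
	"fmt"
)

// frame carries the two values discussed above: the code object address and
// an ID derived from a successful read (0 when the read failed).
type frame struct {
	codeObjectAddr uint64
	codeObjectID   uint32
}

var errReadFault = errors.New("read failed (page not resident)")

// readFn is a hypothetical stand-in for bpf_probe_read_user on a PyCodeObject.
type readFn func(addr uint64) (id uint32, err error)

// unwindAbortOnFailure models the assumed old behavior: the first failed read
// ends the unwind, so every later frame is lost.
func unwindAbortOnFailure(addrs []uint64, read readFn) ([]frame, error) {
	var frames []frame
	for _, a := range addrs {
		id, err := read(a)
		if err != nil {
			return frames, err
		}
		frames = append(frames, frame{codeObjectAddr: a, codeObjectID: id})
	}
	return frames, nil
}

// unwindWithFallback models the new behavior: a failed read still pushes the
// frame, with codeObjectID=0 marking it for a later retry by the agent.
func unwindWithFallback(addrs []uint64, read readFn) []frame {
	frames := make([]frame, 0, len(addrs))
	for _, a := range addrs {
		id, err := read(a)
		if err != nil {
			id = 0 // placeholder: the address is kept, the ID is unknown
		}
		frames = append(frames, frame{codeObjectAddr: a, codeObjectID: id})
	}
	return frames
}

func main() {
	// Simulate a stack whose middle code object sits on a swapped-out page.
	read := func(addr uint64) (uint32, error) {
		if addr == 0x2000 {
			return 0, errReadFault
		}
		return uint32(addr >> 4), nil
	}
	stack := []uint64{0x1000, 0x2000, 0x3000}

	old, err := unwindAbortOnFailure(stack, read)
	fmt.Printf("abort on failure: %d frames, err=%v\n", len(old), err)
	fmt.Printf("with fallback:    %+v\n", unwindWithFallback(stack, read))
}
```

The placeholder keeps the address, which is what allows the agent to attempt its own read later.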

On the agent side, getCodeObject handles ebpfChecksum=0 by:

  • Skipping the LRU cache (no checksum to validate against)
  • Skipping the staleness check (no BPF reference to compare)
  • Reading the code object via process_vm_readv, which supports page faults
  • Storing the calculated checksum in the cache for subsequent frame matching
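The four steps above can be sketched as follows. This is a simplified model, not the actual `getCodeObject` from the PR: a plain map stands in for the LRU cache, `readRemote` is a hypothetical stand-in for the `process_vm_readv`-backed reader, FNV-1a is a placeholder for the real checksum, and the staleness check is assumed to compare the agent-calculated checksum against the one from BPF.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// codeObject is a hypothetical stand-in for the agent's cached code object.
type codeObject struct {
	name         string
	ebpfChecksum uint32
}

// instance models only the parts of pythonInstance relevant here.
type instance struct {
	cache      map[uint64]*codeObject            // stands in for the LRU cache
	readRemote func(addr uint64) ([]byte, error) // stands in for the remote memory reader
	reads      int                               // number of remote reads performed
}

// checksum is a placeholder; the real agent derives its checksum differently.
func checksum(raw []byte) uint32 {
	h := fnv.New32a()
	h.Write(raw)
	return h.Sum32()
}

func (p *instance) getCodeObject(addr uint64, ebpfChecksum uint32) (*codeObject, error) {
	// ebpfChecksum == 0 means BPF could not read the PyCodeObject, so a cached
	// entry cannot be validated: skip the cache lookup entirely.
	if ebpfChecksum != 0 {
		if m, ok := p.cache[addr]; ok && m.ebpfChecksum == ebpfChecksum {
			return m, nil
		}
	}

	raw, err := p.readRemote(addr)
	if err != nil {
		return nil, err
	}
	p.reads++
	calculated := checksum(raw)

	// The staleness check only makes sense when BPF supplied a reference value.
	if ebpfChecksum != 0 && calculated != ebpfChecksum {
		return nil, fmt.Errorf("code object at %#x is stale", addr)
	}

	// Store the calculated checksum so later frames carrying a real BPF
	// checksum for the same object can hit the cache.
	m := &codeObject{name: string(raw), ebpfChecksum: calculated}
	p.cache[addr] = m
	return m, nil
}

func main() {
	p := &instance{
		cache:      map[uint64]*codeObject{},
		readRemote: func(addr uint64) ([]byte, error) { return []byte("my_function"), nil },
	}

	// First frame: BPF read failed, so the checksum is 0.
	first, err := p.getCodeObject(0x1000, 0)
	fmt.Println(first.name, err, "remote reads:", p.reads)

	// Later frame: BPF read succeeded and reports the same checksum.
	second, err := p.getCodeObject(0x1000, first.ebpfChecksum)
	fmt.Println(second == first, err, "remote reads:", p.reads)
}
```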

Impact

  • More robust handling of swapped-out Python code objects

🤖 Generated with Claude Code

@gnurizen force-pushed the recover-python-frames branch from a349be0 to f8fe5cf on February 20, 2026 20:07
@gnurizen changed the title from "Reduce tail calls and improve Python unwinding resilience" to "Improve Python unwinding resilience" on Feb 20, 2026
@gnurizen requested a review from umanwizard on February 20, 2026 20:32
Comment on lines +133 to +135
// Push the frame with the code object address so the agent can try to
// read it via /proc/pid/mem (which supports page faults unlike BPF).
// codeobject_id=0 distinguishes this from a successful read.
Collaborator

Where does the reading via /proc/pid/mem happen?

Collaborator Author

func (p *pythonInstance) getCodeObject(addr libpf.Address,

m := value
if m.ebpfChecksum == ebpfChecksum {
return m, nil
if ebpfChecksum != 0 {
Collaborator

When and why can this checksum be 0?

Collaborator Author

When bpf_probe_read_user fails to read the Python code object.

When bpf_probe_read_user fails to read a PyCodeObject (e.g. page
swapped out), push the frame with codeobject_id=0 instead of aborting
the unwind. This preserves the rest of the stack trace.

On the agent side, handle ebpfChecksum=0 in getCodeObject by skipping
the LRU cache (no checksum to validate against) and the staleness
check (no BPF reference to compare). The agent reads the code object
via process_vm_readv which supports page faults, so it can succeed
where BPF could not. Store the calculated checksum in the cache so
subsequent frames with a real BPF checksum can match.

Copilot AI left a comment

Pull request overview

This PR improves the resilience of Python stack unwinding by handling cases where eBPF fails to read PyCodeObject structures (e.g., when pages are swapped out). Instead of aborting the entire unwind, the unwinder now pushes a frame with minimal information and lets the agent retry the read using process_vm_readv, which supports page faults unlike BPF.

Changes:

  • Modified eBPF unwinder to push frames with codeobject_id=0 when PyCodeObject read fails
  • Updated agent to skip cache lookups and staleness checks when ebpfChecksum=0
  • Agent stores calculated checksum for subsequent frame matching

Reviewed changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| support/ebpf/python_tracer.ebpf.c | Added fallback behavior when bpf_probe_read_user fails: pushes frame with code object address and codeobject_id=0 instead of returning error |
| interpreter/python/python.go | Modified getCodeObject to skip cache and staleness checks when ebpfChecksum=0, and store calculated checksum instead of BPF-provided checksum |

Comment on lines +457 to 464
    if ebpfChecksum != 0 {
        if value, ok := p.addrToCodeObject.Get(addr); ok {
            m := value
            if m.ebpfChecksum == ebpfChecksum {
                return m, nil
            }
        }
    }

Copilot AI Feb 23, 2026

The new behavior for handling ebpfChecksum=0 (when BPF fails to read the code object) lacks test coverage. Consider adding a unit test that verifies:

  1. When ebpfChecksum=0, the cache lookup is skipped
  2. The code object is successfully read via RemoteMemory.Read
  3. The calculated checksum is stored in the cache
  4. Subsequent lookups with the calculated checksum result in cache hits

This would help prevent regressions and document the expected behavior.
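A check along these lines might look like the sketch below. It runs against a hypothetical `fakeInstance` rather than the real `pythonInstance`: the map-based cache, the `remote` map standing in for `RemoteMemory.Read`, the byte-sum checksum, and the counters are all assumptions made so the four points can be asserted in isolation. In the repository this would more naturally be a `testing.T` test against the real type.

```go
package main

import "fmt"

// entry is a hypothetical cached code object, reduced to its checksum.
type entry struct {
	checksum uint32
}

// fakeInstance is a test double with counters for cache lookups and remote reads.
type fakeInstance struct {
	cache        map[uint64]entry
	remote       map[uint64][]byte // stands in for RemoteMemory.Read
	cacheLookups int
	remoteReads  int
}

// byteSum is a placeholder checksum, chosen only because it is easy to reason about.
func byteSum(raw []byte) uint32 {
	var s uint32
	for _, b := range raw {
		s += uint32(b)
	}
	return s
}

func (f *fakeInstance) getCodeObject(addr uint64, ebpfChecksum uint32) (entry, error) {
	if ebpfChecksum != 0 {
		f.cacheLookups++
		if e, ok := f.cache[addr]; ok && e.checksum == ebpfChecksum {
			return e, nil
		}
	}
	raw, ok := f.remote[addr]
	if !ok {
		return entry{}, fmt.Errorf("no memory mapped at %#x", addr)
	}
	f.remoteReads++
	e := entry{checksum: byteSum(raw)}
	f.cache[addr] = e
	return e, nil
}

// verifyZeroChecksumPath walks through the four points listed above.
func verifyZeroChecksumPath() error {
	const addr = 0x1000
	f := &fakeInstance{
		cache:  map[uint64]entry{},
		remote: map[uint64][]byte{addr: []byte("code object bytes")},
	}

	first, err := f.getCodeObject(addr, 0)
	if err != nil {
		return err
	}
	if f.cacheLookups != 0 { // 1. cache lookup is skipped
		return fmt.Errorf("cache consulted %d times, want 0", f.cacheLookups)
	}
	if f.remoteReads != 1 { // 2. code object is read remotely
		return fmt.Errorf("remote reads = %d, want 1", f.remoteReads)
	}
	if got := f.cache[addr]; got != first || first.checksum == 0 { // 3. calculated checksum is cached
		return fmt.Errorf("cached entry %+v does not match %+v", got, first)
	}

	second, err := f.getCodeObject(addr, first.checksum)
	if err != nil {
		return err
	}
	if second != first || f.cacheLookups != 1 || f.remoteReads != 1 { // 4. later lookup hits the cache
		return fmt.Errorf("expected a cache hit, got lookups=%d reads=%d", f.cacheLookups, f.remoteReads)
	}
	return nil
}

func main() {
	if err := verifyZeroChecksumPath(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```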

@CLAassistant commented Feb 24, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.

✅ gnurizen
❌ fabled
❌ florianl

@gnurizen force-pushed the recover-python-frames branch from dca5f7a to 4a2573d on February 24, 2026 17:29
@umanwizard self-requested a review on February 24, 2026 17:33
@gnurizen merged commit 78b78f0 into main on Feb 24, 2026
22 of 32 checks passed

6 participants