Improve Python unwinding resilience by gnurizen · Pull Request #214 · parca-dev/opentelemetry-ebpf-profiler

gnurizen · 2026-02-20T19:38:19Z

Summary

Recover Python frames when BPF fails to read PyCodeObject

Details

PyCodeObject Recovery

When bpf_probe_read_user fails to read a PyCodeObject (e.g., page swapped out), the unwinder now pushes the frame with codeobject_id=0 instead of aborting the entire unwind. This preserves the rest of the stack trace.
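The change in unwinder behavior can be modeled outside of eBPF. The following is a minimal Go sketch, not the actual C code in support/ebpf/python_tracer.ebpf.c. The `frame` struct, the `readFn` type, and the `unwind` function are hypothetical names introduced for illustration, and the ID returned by a successful read is a placeholder value.

```go
package main

import (
	"errors"
	"fmt"
)

// frame is a hypothetical stand-in for the record pushed per Python frame.
type frame struct {
	codeObjectAddr uint64 // address of the PyCodeObject in the target process
	codeObjectID   uint32 // 0 means "BPF could not read the code object"
}

var errFault = errors.New("read fault")

// readFn models bpf_probe_read_user on a code object: it returns a non-zero
// ID on success, or an error if the page is not resident.
type readFn func(addr uint64) (uint32, error)

// unwind walks the given code object addresses. With recoverFrames=false it
// mimics the old behavior and aborts on the first failed read. With
// recoverFrames=true it pushes a frame with codeObjectID=0 and keeps going.
func unwind(addrs []uint64, read readFn, recoverFrames bool) ([]frame, error) {
	var out []frame
	for _, a := range addrs {
		id, err := read(a)
		if err != nil {
			if !recoverFrames {
				return out, err
			}
			id = 0 // marker: let the agent retry the read later
		}
		out = append(out, frame{codeObjectAddr: a, codeObjectID: id})
	}
	return out, nil
}

func main() {
	// Assumed scenario: the code object at 0x2000 is swapped out.
	read := func(addr uint64) (uint32, error) {
		if addr == 0x2000 {
			return 0, errFault
		}
		return uint32(addr>>4) | 1, nil // placeholder non-zero ID
	}
	addrs := []uint64{0x1000, 0x2000, 0x3000}

	old, err := unwind(addrs, read, false)
	fmt.Println("old behavior:", len(old), "frames, err:", err)

	recovered, err := unwind(addrs, read, true)
	fmt.Println("new behavior:", len(recovered), "frames, err:", err)
}
```

In this model the frames after the failed read survive, which is the point of pushing codeobject_id=0 rather than aborting.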

On the agent side, getCodeObject handles ebpfChecksum=0 by:

Skipping the LRU cache (no checksum to validate against)
Skipping the staleness check (no BPF reference to compare)
Reading the code object via process_vm_readv, which supports page faults
Storing the calculated checksum in the cache for subsequent frame matching
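The four steps above can be sketched roughly as follows. This is an illustrative approximation, not the code in interpreter/python/python.go: the `instance` and `codeObject` types, the map used in place of the LRU, the `remoteReader` callback, and the FNV-based `checksum` are all assumptions made so the example is self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

// codeObject is a hypothetical, trimmed-down cache entry.
type codeObject struct {
	ebpfChecksum uint32
	raw          []byte
}

// remoteReader stands in for the agent's remote memory access
// (process_vm_readv in the PR description).
type remoteReader func(addr uint64) ([]byte, error)

type instance struct {
	cache map[uint64]*codeObject // stand-in for the LRU cache
	read  remoteReader
}

var errStale = errors.New("code object changed since BPF read it")

// checksum is a placeholder, NOT the profiler's real checksum function.
func checksum(b []byte) uint32 {
	h := fnv.New32a()
	h.Write(b)
	if s := h.Sum32(); s != 0 {
		return s
	}
	return 1 // keep 0 reserved for "BPF read failed"
}

func (p *instance) getCodeObject(addr uint64, ebpfChecksum uint32) (*codeObject, error) {
	// Only consult the cache when BPF supplied a checksum to validate against.
	if ebpfChecksum != 0 {
		if m, ok := p.cache[addr]; ok && m.ebpfChecksum == ebpfChecksum {
			return m, nil
		}
	}
	raw, err := p.read(addr)
	if err != nil {
		return nil, err
	}
	sum := checksum(raw)
	// The staleness check needs a BPF reference value, so skip it for 0.
	if ebpfChecksum != 0 && sum != ebpfChecksum {
		return nil, errStale
	}
	// Store the calculated checksum so later frames with a real BPF
	// checksum can match this entry.
	m := &codeObject{ebpfChecksum: sum, raw: raw}
	p.cache[addr] = m
	return m, nil
}

func main() {
	mem := map[uint64][]byte{0x1000: []byte("co_code bytes")}
	p := &instance{
		cache: map[uint64]*codeObject{},
		read: func(addr uint64) ([]byte, error) {
			b, ok := mem[addr]
			if !ok {
				return nil, errors.New("unmapped address")
			}
			return b, nil
		},
	}
	first, err := p.getCodeObject(0x1000, 0) // BPF read failed
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	second, _ := p.getCodeObject(0x1000, first.ebpfChecksum)
	fmt.Println("cache hit on second lookup:", first == second)
}
```

Storing the calculated value rather than the incoming 0 is what lets a later frame, for which BPF did manage to read the object, hit the cache.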

Impact

More robust handling of swapped-out Python code objects

🤖 Generated with Claude Code

umanwizard · 2026-02-20T22:39:18Z

+    // Push the frame with the code object address so the agent can try to
+    // read it via /proc/pid/mem (which supports page faults unlike BPF).
+    // codeobject_id=0 distinguishes this from a successful read.


Where does the reading via /proc/pid/mem happen?

opentelemetry-ebpf-profiler/interpreter/python/python.go

Line 452 in d13351c

func (p *pythonInstance) getCodeObject(addr libpf.Address,

umanwizard · 2026-02-20T22:40:42Z

-		m := value
-		if m.ebpfChecksum == ebpfChecksum {
-			return m, nil
+	if ebpfChecksum != 0 {


When and why can this checksum be 0?

When the BPF read of the Python code object fails.

When bpf_probe_read_user fails to read a PyCodeObject (e.g. page swapped out), push the frame with codeobject_id=0 instead of aborting the unwind. This preserves the rest of the stack trace. On the agent side, handle ebpfChecksum=0 in getCodeObject by skipping the LRU cache (no checksum to validate against) and the staleness check (no BPF reference to compare). The agent reads the code object via process_vm_readv which supports page faults, so it can succeed where BPF could not. Store the calculated checksum in the cache so subsequent frames with a real BPF checksum can match.

Copilot

Pull request overview

This PR improves the resilience of Python stack unwinding by handling cases where eBPF fails to read PyCodeObject structures (e.g., when pages are swapped out). Instead of aborting the entire unwind, the unwinder now pushes a frame with minimal information and lets the agent retry the read using process_vm_readv, which supports page faults unlike BPF.

Changes:

Modified eBPF unwinder to push frames with codeobject_id=0 when PyCodeObject read fails
Updated agent to skip cache lookups and staleness checks when ebpfChecksum=0
Agent stores calculated checksum for subsequent frame matching

Reviewed changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated 1 comment.

File	Description
support/ebpf/python_tracer.ebpf.c	Added fallback behavior when `bpf_probe_read_user` fails: pushes frame with code object address and `codeobject_id=0` instead of returning error
interpreter/python/python.go	Modified `getCodeObject` to skip cache and staleness checks when `ebpfChecksum=0`, and store calculated checksum instead of BPF-provided checksum

Copilot · 2026-02-23T23:27:05Z

+	if ebpfChecksum != 0 {
+		if value, ok := p.addrToCodeObject.Get(addr); ok {
+			m := value
+			if m.ebpfChecksum == ebpfChecksum {
+				return m, nil
+			}
 		}
 	}


The new behavior for handling ebpfChecksum=0 (when BPF fails to read the code object) lacks test coverage. Consider adding a unit test that verifies:

When ebpfChecksum=0, the cache lookup is skipped

The code object is successfully read via RemoteMemory.Read

The calculated checksum is stored in the cache

Subsequent lookups with the calculated checksum result in cache hits

This would help prevent regressions and document the expected behavior.
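One possible shape for such a test is sketched below. It runs against a small fake rather than the real pythonInstance: `fakeInstance`, `entry`, `sum`, and `verifyZeroChecksumPath` are hypothetical names, and the checksum is a placeholder. In the repository this would more naturally be a `TestXxx(t *testing.T)` function; a plain function returning an error is used here so the sketch runs on its own.

```go
package main

import "fmt"

type entry struct{ ebpfChecksum uint32 }

// fakeInstance mimics only the parts of the agent relevant to the four checks.
type fakeInstance struct {
	cache map[uint64]entry
	mem   map[uint64][]byte // fake target-process memory
	reads int               // number of remote memory reads performed
}

// sum is a placeholder checksum, not the profiler's real one.
func sum(b []byte) uint32 {
	var s uint32 = 1
	for _, c := range b {
		s = s*31 + uint32(c)
	}
	if s == 0 {
		return 1
	}
	return s
}

func (f *fakeInstance) getCodeObject(addr uint64, ebpfChecksum uint32) (entry, error) {
	if ebpfChecksum != 0 {
		if e, ok := f.cache[addr]; ok && e.ebpfChecksum == ebpfChecksum {
			return e, nil
		}
	}
	f.reads++
	b, ok := f.mem[addr]
	if !ok {
		return entry{}, fmt.Errorf("read of %#x failed", addr)
	}
	e := entry{ebpfChecksum: sum(b)}
	if ebpfChecksum != 0 && e.ebpfChecksum != ebpfChecksum {
		return entry{}, fmt.Errorf("stale code object at %#x", addr)
	}
	f.cache[addr] = e
	return e, nil
}

// verifyZeroChecksumPath returns the first violated expectation, or nil.
func verifyZeroChecksumPath() error {
	const addr = 0x7f00
	f := &fakeInstance{
		cache: map[uint64]entry{addr: {ebpfChecksum: 0}}, // bogus entry that must be ignored
		mem:   map[uint64][]byte{addr: []byte("code")},
	}
	// 1 and 2: checksum 0 skips the cache and reads remote memory.
	got, err := f.getCodeObject(addr, 0)
	if err != nil {
		return err
	}
	if f.reads != 1 {
		return fmt.Errorf("expected 1 remote read, got %d", f.reads)
	}
	// 3: the calculated checksum is what ends up in the cache.
	want := sum([]byte("code"))
	if got.ebpfChecksum != want || f.cache[addr].ebpfChecksum != want {
		return fmt.Errorf("calculated checksum was not stored")
	}
	// 4: a later lookup with that checksum is a cache hit (no new read).
	if _, err := f.getCodeObject(addr, want); err != nil {
		return err
	}
	if f.reads != 1 {
		return fmt.Errorf("expected cache hit, but reads=%d", f.reads)
	}
	return nil
}

func main() {
	if err := verifyZeroChecksumPath(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```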

CLAassistant · 2026-02-24T16:22:54Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.

✅ gnurizen
❌ fabled
❌ florianl

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

gnurizen force-pushed the recover-python-frames branch from a349be0 to f8fe5cf on February 20, 2026 20:07

gnurizen changed the title ~~Reduce tail calls and improve Python unwinding resilience~~ Improve Python unwinding resilience Feb 20, 2026

gnurizen requested a review from umanwizard February 20, 2026 20:32

umanwizard reviewed Feb 20, 2026

gnurizen force-pushed the recover-python-frames branch from 9c9ecc6 to 42f6c5a on February 21, 2026 08:49

gnurizen requested a review from Copilot February 23, 2026 23:22

Copilot started reviewing on behalf of gnurizen February 23, 2026 23:23

Copilot AI reviewed Feb 23, 2026

gnurizen and others added 2 commits February 23, 2026 23:28

Add unit tests

1cc2e27

reduce packages installed for integration tests

956cb0d

florianl and others added 2 commits February 24, 2026 11:38

golabels: fix race condition (open-telemetry#744)

65c6e52

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

Make kernel test improvements to distro-qemu too

4a2573d

gnurizen force-pushed the recover-python-frames branch from dca5f7a to 4a2573d on February 24, 2026 17:29

umanwizard self-requested a review February 24, 2026 17:33

umanwizard approved these changes Feb 24, 2026

Cache parcagpu, hitting http egress limits

06b923f

gnurizen merged commit 78b78f0 into main Feb 24, 2026
22 of 32 checks passed

gnurizen merged 6 commits into main from recover-python-frames


6 participants
