pynat int by gnurizen · Pull Request #215 · parca-dev/opentelemetry-ebpf-profiler

gnurizen · 2026-02-21T10:53:49Z

Extract native unwinder functions into native_stack_trace.h
Combine python and native unwinder into single loop
Add TraceInterceptor to tracehandler
Symbolize CUDA traces before GPU timing fixup

Move defines (STACK_DELTA_INVALID, STACK_DELTA_STOP, NATIVE_FRAMES_PER_PROGRAM) and functions (push_native, bsearch_step, get_stack_delta_map, get_stack_delta, unwind_register_address, unwind_one_frame) from native_stack_trace.ebpf.c into native_stack_trace.h so they can be reused by other eBPF programs. Zero functional changes: stripping BTF metadata from the before/after blobs produces identical binaries, confirming no generated code changed.

Python programs, especially PyTorch programs, can exhaust the tail call limit by switching between the Python and native unwinders more than 29 times. This happens because of eval/delegation patterns where each Python frame is accompanied by a couple of native frames. To unwind these stacks successfully, fold the native unwinder into the Python unwinder so that each loop iteration can unwind either a Python frame or a native frame. Replace the separate walk_python_stack inner loop and outer transition loop with a single switch-in-loop structure using step_python and step_native helper functions. This reduces tail call usage from one per batch to one per loop budget exhaustion (PYTHON_NATIVE_LOOP_ITERS=9 iterations).

Move native unwinder map externs (exe_id_to_*_stack_deltas, stack_delta_page_to_info, unwind_info_array) out of the TESTING_COREDUMP guard in extmaps.h so python_tracer.ebpf.c can include native_stack_trace.h.

- PYTHON_NATIVE_LOOP_ITERS=9 was chosen to pass the BPF verifier on 5.4 kernels (ITERS=10 times out the verifier at >300s)
- On a failed PyCodeObject read, push a frame with the code object address so the agent can retry the read via /proc/pid/mem
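
To make the tail call arithmetic concrete, here is a rough userspace model of the switch-in-loop idea. It is a sketch, not the eBPF code from this PR: the array-based stack, the unwind_state struct, and the functions next_unwinder, unwind_program, unwind_combined, unwind_separate and fill_alternating are all hypothetical, and the step_python/step_native bodies are stand-ins that only advance a cursor. The old scheme is modelled, as a simplifying assumption, as one tail call per switch between unwinders.

```c
#include <stdbool.h>
#include <stddef.h>

/* Value stated in the PR description. */
#define PYTHON_NATIVE_LOOP_ITERS 9

/* Hypothetical model types: the real unwinders work on registers and
 * interpreter state, not on an array of frame kinds. */
enum frame_kind { FRAME_PYTHON, FRAME_NATIVE };
enum unwinder { UNWIND_PYTHON, UNWIND_NATIVE, UNWIND_DONE };

struct unwind_state {
    const enum frame_kind *frames; /* modelled stack, innermost first */
    size_t nframes;
    size_t pos;                    /* next frame to unwind */
    int tail_calls;                /* tail calls consumed so far */
};

/* Decide which unwinder handles the frame at the cursor. */
static enum unwinder next_unwinder(const struct unwind_state *s) {
    if (s->pos >= s->nframes)
        return UNWIND_DONE;
    return s->frames[s->pos] == FRAME_PYTHON ? UNWIND_PYTHON : UNWIND_NATIVE;
}

/* Stand-ins for the step_python / step_native helpers named in the PR:
 * each consumes one frame and reports who handles the next one. */
static enum unwinder step_python(struct unwind_state *s) {
    s->pos++;
    return next_unwinder(s);
}

static enum unwinder step_native(struct unwind_state *s) {
    s->pos++;
    return next_unwinder(s);
}

/* One program invocation: a single bounded loop that can unwind either
 * kind of frame on every iteration. */
static enum unwinder unwind_program(struct unwind_state *s) {
    enum unwinder u = next_unwinder(s);
    for (int i = 0; i < PYTHON_NATIVE_LOOP_ITERS; i++) {
        switch (u) {
        case UNWIND_PYTHON: u = step_python(s); break;
        case UNWIND_NATIVE: u = step_native(s); break;
        case UNWIND_DONE:   return u;
        }
    }
    return u; /* budget exhausted, or done exactly on the last iteration */
}

/* Combined scheme: a tail call is needed only when the loop budget runs
 * out before the stack is fully unwound. */
int unwind_combined(const enum frame_kind *frames, size_t n) {
    struct unwind_state s = { frames, n, 0, 0 };
    while (unwind_program(&s) != UNWIND_DONE)
        s.tail_calls++;
    return s.tail_calls;
}

/* Assumed model of the old scheme: one tail call per unwinder switch. */
int unwind_separate(const enum frame_kind *frames, size_t n) {
    int tail_calls = 0;
    for (size_t i = 1; i < n; i++)
        if (frames[i] != frames[i - 1])
            tail_calls++;
    return tail_calls;
}

/* Build a worst-case stack that alternates Python and native frames. */
void fill_alternating(enum frame_kind *frames, size_t n) {
    for (size_t i = 0; i < n; i++)
        frames[i] = (i % 2 == 0) ? FRAME_PYTHON : FRAME_NATIVE;
}
```

Under this model, a 60-frame stack that alternates on every frame costs 59 switches in the separate scheme, well past the limit of 29 mentioned above, while the combined loop needs only 6 tail calls (seven invocations of up to 9 frames each).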

Add a TraceInterceptor callback that is invoked after ConvertTrace on a cache miss. When the interceptor returns true, the trace is consumed (skipped for both caching and reporting), allowing callers like the GPU subsystem to divert specific traces for further processing. Includes tests covering consume, pass-through, mixed, and non-caching behavior.

CUDA stacks can sit as raw traces for a while, waiting for the fixer to match them with GPU timing information. During this time, pointers in the raw traces can go stale because functional programs garbage-collect activation records. Avoid this by symbolizing traces before parking them in the fixer maps. This has the nice side effect of removing some channel indirection: traces now go straight into the fixer maps, and when matched they go straight to ReportTraceEvent.

Move CUDA symbolization earlier in the pipeline: ConvertTrace now handles CUDA frames directly, and parcagpu.Start returns a TraceInterceptor instead of a filtered channel. The interceptor diverts symbolized CUDA traces into the GPU fixer post-ConvertTrace, and completed traces (with timing and kernel name) are reported directly. This eliminates the Symbolize method on the CUDA interpreter in favor of demangling in prepTrace.

gnurizen · 2026-02-23T19:45:10Z

Closing, this was just an integration branch for testing.

gnurizen added 4 commits February 21, 2026 05:32

gnurizen closed this Feb 23, 2026

gnurizen wants to merge 4 commits into main from pynat-int
