pynat int #215

Closed
gnurizen wants to merge 4 commits into main from pynat-int

@gnurizen

  • Extract native unwinder functions into native_stack_trace.h
  • Combine python and native unwinder into single loop
  • Add TraceInterceptor to tracehandler
  • Symbolize CUDA traces before GPU timing fixup

Move defines (STACK_DELTA_INVALID, STACK_DELTA_STOP,
NATIVE_FRAMES_PER_PROGRAM) and functions (push_native, bsearch_step,
get_stack_delta_map, get_stack_delta, unwind_register_address,
unwind_one_frame) from native_stack_trace.ebpf.c into
native_stack_trace.h so they can be reused by other eBPF programs.

Zero functional changes: stripping BTF metadata from the before/after
blobs produces identical binaries, confirming no generated code changed.

Python programs, especially PyTorch programs, can exhaust the tail call
limit by switching from the Python to the native unwinder more than 29
times. This happens because of eval/delegation patterns where one Python
frame is decorated with a couple of native frames.

To unwind these stacks successfully, fold the native unwinder into the
Python unwinder so that at each step either a Python or a native frame
can be unwound.

Replace the separate walk_python_stack inner loop and outer
transition loop with a single switch-in-loop structure using
step_python and step_native helper functions. This reduces
tail call usage from one per batch to one per loop budget
exhaustion (PYTHON_NATIVE_LOOP_ITERS=9 iterations).
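
The effect on tail call usage can be sketched with a small Go model. This is not the eBPF C code from the commit: the types and helper names (frameKind, unwindCombined, unwinderSwitches, decoratedStack) are hypothetical, and the old design is assumed to cost one tail call per switch between unwinders. Only the budget of 9 iterations comes from the commit message.

```go
package main

import "fmt"

// frameKind tags which unwinder must handle a frame (hypothetical model type).
type frameKind int

const (
	pythonFrame frameKind = iota
	nativeFrame
)

// loopIters mirrors PYTHON_NATIVE_LOOP_ITERS from the commit message.
const loopIters = 9

// unwindCombined models the single switch-in-loop: each program invocation
// unwinds up to loopIters frames of either kind, and a tail call is needed
// only when that budget runs out with frames still left.
func unwindCombined(stack []frameKind) (unwound, tailCalls int) {
	i := 0
	for i < len(stack) {
		for n := 0; n < loopIters && i < len(stack); n++ {
			switch stack[i] {
			case pythonFrame:
				// step_python would run here.
			case nativeFrame:
				// step_native would run here.
			}
			i++
			unwound++
		}
		if i < len(stack) {
			tailCalls++ // budget exhausted, continue in a new invocation
		}
	}
	return unwound, tailCalls
}

// unwinderSwitches counts Python<->native transitions, which is the assumed
// tail call cost of the old design with separate unwinder programs.
func unwinderSwitches(stack []frameKind) int {
	switches := 0
	for i := 1; i < len(stack); i++ {
		if stack[i] != stack[i-1] {
			switches++
		}
	}
	return switches
}

// decoratedStack builds n Python frames, each followed by k native frames.
func decoratedStack(n, k int) []frameKind {
	var s []frameKind
	for i := 0; i < n; i++ {
		s = append(s, pythonFrame)
		for j := 0; j < k; j++ {
			s = append(s, nativeFrame)
		}
	}
	return s
}

func main() {
	stack := decoratedStack(20, 2)
	unwound, tailCalls := unwindCombined(stack)
	fmt.Println("frames:", unwound, "tail calls (combined loop):", tailCalls)
	fmt.Println("unwinder switches (separate programs):", unwinderSwitches(stack))
}
```

For 20 Python frames that are each decorated with 2 native frames, the model counts 39 unwinder switches but only 6 budget exhaustions, which is why the combined loop stays under the tail call limit in this scenario.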

Move native unwinder map externs (exe_id_to_*_stack_deltas,
stack_delta_page_to_info, unwind_info_array) out of the
TESTING_COREDUMP guard in extmaps.h so python_tracer.ebpf.c
can include native_stack_trace.h.

- PYTHON_NATIVE_LOOP_ITERS=9 chosen to pass BPF verifier on
  5.4 kernels (ITERS=10 times out the verifier at >300s)
- On failed PyCodeObject read, push frame with code object
  address so the agent can try via /proc/pid/mem

Add a TraceInterceptor callback that is invoked after ConvertTrace on a
cache miss. When the interceptor returns true, the trace is consumed
(skipped for caching and reporting), allowing callers like the GPU
subsystem to divert specific traces for further processing.

Includes tests covering consume, pass-through, mixed, and
non-caching behavior.
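
A minimal sketch of that control flow, assuming simplified stand-in types: Trace, traceHandler, convertTrace, and isCUDA are hypothetical names for illustration, not the tracehandler's actual API, and the cache-hit path is assumed to report directly.

```go
package main

import "fmt"

// Trace is a hypothetical stand-in for a converted (symbolized) trace.
type Trace struct {
	Hash   string
	Frames []string
}

// TraceInterceptor returns true when it consumes the trace.
type TraceInterceptor func(t *Trace) bool

// traceHandler is a simplified model of a caching trace handler.
type traceHandler struct {
	cache       map[string]*Trace
	interceptor TraceInterceptor
	reported    []string
}

// convertTrace stands in for the real ConvertTrace step.
func convertTrace(hash string, raw []string) *Trace {
	return &Trace{Hash: hash, Frames: append([]string(nil), raw...)}
}

// handle processes one raw trace: cache hits are reported directly (an
// assumption of this model); on a miss the trace is converted and offered
// to the interceptor before it is cached and reported.
func (h *traceHandler) handle(hash string, raw []string) {
	if _, hit := h.cache[hash]; hit {
		h.reported = append(h.reported, hash)
		return
	}
	t := convertTrace(hash, raw)
	if h.interceptor != nil && h.interceptor(t) {
		return // consumed: neither cached nor reported
	}
	h.cache[hash] = t
	h.reported = append(h.reported, hash)
}

// isCUDA is a hypothetical predicate: the trace starts with a CUDA frame.
func isCUDA(t *Trace) bool {
	return len(t.Frames) > 0 && t.Frames[0] == "cuda"
}

func main() {
	consumed := 0
	h := &traceHandler{
		cache: map[string]*Trace{},
		interceptor: func(t *Trace) bool {
			if isCUDA(t) {
				consumed++
				return true
			}
			return false
		},
	}
	h.handle("a", []string{"main", "work"})
	h.handle("b", []string{"cuda", "launch"})
	h.handle("a", []string{"main", "work"})
	h.handle("b", []string{"cuda", "launch"})
	fmt.Println("reported:", h.reported, "consumed:", consumed)
}
```

Because consumed traces are never cached, the interceptor in this model sees every occurrence of a diverted trace, while pass-through traces reach it only on their first (cache-miss) occurrence.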

CUDA stacks can sit as raw traces for a while waiting for the fixer to
match them with GPU timing information. During this time, pointers in
the raw traces could grow stale due to functional programs GC'ing
activation records. Avoid this by symbolizing traces before parking
them in the fixer maps.

This has the nice side effect of removing some channel indirection:
traces now go straight into the fixer maps, and when matched they
go straight to ReportTraceEvent.

Move CUDA symbolization earlier in the pipeline: ConvertTrace now
handles CUDA frames directly, and parcagpu.Start returns a
TraceInterceptor instead of a filtered channel. The interceptor
diverts symbolized CUDA traces into the GPU fixer post-ConvertTrace,
and completed traces (with timing and kernel name) are reported
directly. This eliminates the Symbolize method on the CUDA
interpreter in favor of demangling in prepTrace.
@gnurizen

closing, this was just an integration branch for testing...

@gnurizen gnurizen closed this Feb 23, 2026