Use an intercepter interface to allow GPU trace matching AFTER symbolization#212
Merged
Conversation
2438c7b to
57eab74
Compare
57eab74 to
4cd2b0b
Compare
4cd2b0b to
42bac4a
Compare
umanwizard
approved these changes
Feb 24, 2026
There was a problem hiding this comment.
Pull request overview
This PR restructures the CUDA/GPU trace pipeline so traces are symbolized before being parked for GPU timing correlation, using a new interceptor hook after ConvertTrace to divert CUDA traces into the GPU “fixer” and report completed traces directly.
Changes:
- Add
TraceInterceptorsupport totracehandlerand update call sites/tests. - Move CUDA frame handling into
ProcessManager.ConvertTrace(no longer relying on CUDA interpreterSymbolizefor demangling). - Refactor
parcagpu+ CUDA fixer to accept symbolized traces, attach timing/kernel-name later, recompute hashes, and emit completed traces.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tracehandler/tracehandler.go | Adds interceptor hook and threads it through handler construction/start. |
| tracehandler/tracehandler_test.go | Adds interceptor behavior tests and updates Start call signature. |
| processmanager/manager.go | Handles CUDA frames directly during ConvertTrace (preserving correlation ID encoding). |
| parcagpu/parcagpu.go | Reworks timing reader and returns an interceptor that diverts CUDA traces post-ConvertTrace. |
| interpreter/gpu/cuda.go | Refactors CUDA fixer to store symbolized traces, attach kernel timing/name, recompute hash, and return completed outputs. |
| internal/controller/controller.go | Updates tracehandler.Start signature usage (currently still passes nil interceptor). |
Comments suppressed due to low confidence (1)
parcagpu/parcagpu.go:79
- This select loop calls eventReader.ReadInto() in the
defaultbranch. ReadInto blocks, so the goroutine won’t service logTicker/clearTicker ticks (or ctx cancellation) while it’s blocked waiting for events. Consider using Reader.SetDeadline / timed reads, or reading perf events in a dedicated goroutine and sending them over a channel so the outer loop can select on ctx/tickers reliably.
case <-ctx.Done():
return
default:
if err := eventReader.ReadInto(&data); err != nil {
readErrorCount.Add(1)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
536ed20 to
f2d2f6c
Compare
Collaborator
Author
|
Converting back to draft as we're looking at doing the demangling server side |
89cea35 to
05dde60
Compare
Add a TraceInterceptor callback that is invoked after ConvertTrace on cache-miss. When the interceptor returns true the trace is consumed (skipped for caching and reporting), allowing callers like the GPU subsystem to divert specific traces for further processing. Includes tests covering consume, pass-through, mixed, and non-caching behavior.
CUDA stack can sit at raw traces for awhile waiting for the fixer to match them with GPU timing information, during this time pointers in the raw traces could grow stale due to functional program GC'ing activation records. Avoid this by doing trace symbolizing before parking traces in the fixer maps. This has the nice side affect of removing some channel indirection and now traces so straight into the fixer maps and when matched they go straight to ReportTraceEvent. Move CUDA symbolization earlier in the pipeline: ConvertTrace now handles CUDA frames directly, and parcagpu.Start returns a TraceInterceptor instead of a filtered channel. The interceptor diverts symbolized CUDA traces into the GPU fixer post-ConvertTrace, and completed traces (with timing and kernel name) are reported directly. This eliminates the Symbolize method on the CUDA interpreter in favor of demangling in prepTrace.
05dde60 to
c5d6fd7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Symbolize GPU traces before kernel timing fixup
GPU samples can sit as raw traces for awhile waiting for the fixer to
match them with GPU timing information, during this time pointers in
the raw traces could grow stale due to functional program GC'ing
activation records. Avoid this by doing trace symbolizing before
parking traces in the fixer maps.
This has the nice side affect of removing some channel indirection
and now traces go straight into the fixer maps and when matched they
go straight to ReportTraceEvent.
Move CUDA symbolization earlier in the pipeline: ConvertTrace now
handles CUDA frames directly, and parcagpu.Start returns a
TraceInterceptor instead of a filtered channel. The interceptor
diverts symbolized CUDA traces into the GPU fixer post-ConvertTrace,
and completed traces (with timing and kernel name) are reported
directly. This eliminates the Symbolize method on the CUDA
interpreter in favor of demangling in prepTrace.