Use an intercepter interface to allow GPU trace matching AFTER symbolization#212

Merged
gnurizen merged 6 commits into main from cuda-sym-first on Feb 27, 2026

@gnurizen
Collaborator

@gnurizen gnurizen commented Feb 20, 2026

Symbolize GPU traces before kernel timing fixup

GPU samples can sit as raw traces for a while waiting for the fixer to
match them with GPU timing information. During this time, pointers in
the raw traces could grow stale because functional programs
garbage-collect activation records. Avoid this by symbolizing traces
before parking them in the fixer maps.

This has the nice side effect of removing some channel indirection:
traces now go straight into the fixer maps, and when matched they
go straight to ReportTraceEvent.

Move CUDA symbolization earlier in the pipeline: ConvertTrace now
handles CUDA frames directly, and parcagpu.Start returns a
TraceInterceptor instead of a filtered channel. The interceptor
diverts symbolized CUDA traces into the GPU fixer post-ConvertTrace,
and completed traces (with timing and kernel name) are reported
directly. This eliminates the Symbolize method on the CUDA
interpreter in favor of demangling in prepTrace.
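
The park-and-match step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: every type and function name here (`symbolizedTrace`, `kernelTiming`, `fixer`, `addTrace`, `addTiming`) is hypothetical, and the FNV hash only stands in for whatever hash the real fixer recomputes. It assumes traces and timing events share a correlation ID and may arrive in either order.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// symbolizedTrace is a hypothetical stand-in for a trace that has already
// been symbolized, so it holds names rather than raw pointers.
type symbolizedTrace struct {
	CorrelationID uint32
	Frames        []string
}

// kernelTiming is a hypothetical stand-in for a GPU timing event.
type kernelTiming struct {
	CorrelationID uint32
	KernelName    string
	DurationNs    uint64
}

// completedTrace carries the kernel frame, the timing, and a recomputed hash.
type completedTrace struct {
	Frames     []string
	DurationNs uint64
	Hash       uint64
}

// fixer parks whichever half arrives first, keyed by correlation ID.
type fixer struct {
	traces  map[uint32]symbolizedTrace
	timings map[uint32]kernelTiming
}

func newFixer() *fixer {
	return &fixer{
		traces:  make(map[uint32]symbolizedTrace),
		timings: make(map[uint32]kernelTiming),
	}
}

// addTrace returns a completed trace if the timing is already parked;
// otherwise it parks the trace and reports false.
func (f *fixer) addTrace(t symbolizedTrace) (completedTrace, bool) {
	if k, ok := f.timings[t.CorrelationID]; ok {
		delete(f.timings, t.CorrelationID)
		return complete(t, k), true
	}
	f.traces[t.CorrelationID] = t
	return completedTrace{}, false
}

// addTiming is the mirror image of addTrace.
func (f *fixer) addTiming(k kernelTiming) (completedTrace, bool) {
	if t, ok := f.traces[k.CorrelationID]; ok {
		delete(f.traces, k.CorrelationID)
		return complete(t, k), true
	}
	f.timings[k.CorrelationID] = k
	return completedTrace{}, false
}

// complete prepends the kernel name as the leaf frame and recomputes the
// hash, since the frame list has changed.
func complete(t symbolizedTrace, k kernelTiming) completedTrace {
	frames := append([]string{k.KernelName}, t.Frames...)
	h := fnv.New64a()
	for _, fr := range frames {
		h.Write([]byte(fr))
		h.Write([]byte{0}) // separator so frame boundaries affect the hash
	}
	return completedTrace{Frames: frames, DurationNs: k.DurationNs, Hash: h.Sum64()}
}

func main() {
	f := newFixer()
	if _, ok := f.addTrace(symbolizedTrace{CorrelationID: 7, Frames: []string{"cudaLaunchKernel", "main.run"}}); ok {
		panic("trace should be parked until its timing arrives")
	}
	done, ok := f.addTiming(kernelTiming{CorrelationID: 7, KernelName: "vecAdd", DurationNs: 1500})
	fmt.Println(ok, done.Frames, done.DurationNs)
}
```

Because the parked value holds symbol names instead of addresses, it stays valid however long the timing event takes to arrive.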

@gnurizen gnurizen changed the title cuda sym first Use an intercepter interface to allow CUDA trace matching AFTER symbolization and simplify excess channeling Feb 20, 2026
@gnurizen gnurizen changed the title Use an intercepter interface to allow CUDA trace matching AFTER symbolization and simplify excess channeling Use an intercepter interface to allow CUDA trace matching AFTER symbolization and simplify channeling Feb 20, 2026
@gnurizen gnurizen changed the title Use an intercepter interface to allow CUDA trace matching AFTER symbolization and simplify channeling Use an intercepter interface to allow CUDA trace matching AFTER symbolization and simplify channels Feb 20, 2026
@gnurizen gnurizen changed the title Use an intercepter interface to allow CUDA trace matching AFTER symbolization and simplify channels Use an intercepter interface to allow CUDA trace matching AFTER symbolization Feb 21, 2026
@gnurizen gnurizen marked this pull request as ready for review February 23, 2026 19:43
@gnurizen gnurizen changed the title Use an intercepter interface to allow CUDA trace matching AFTER symbolization Use an intercepter interface to allow GPU trace matching AFTER symbolization Feb 23, 2026

Copilot AI left a comment


Pull request overview

This PR restructures the CUDA/GPU trace pipeline so traces are symbolized before being parked for GPU timing correlation, using a new interceptor hook after ConvertTrace to divert CUDA traces into the GPU “fixer” and report completed traces directly.

Changes:

  • Add TraceInterceptor support to tracehandler and update call sites/tests.
  • Move CUDA frame handling into ProcessManager.ConvertTrace (no longer relying on CUDA interpreter Symbolize for demangling).
  • Refactor parcagpu + CUDA fixer to accept symbolized traces, attach timing/kernel-name later, recompute hashes, and emit completed traces.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tracehandler/tracehandler.go | Adds interceptor hook and threads it through handler construction/start. |
| tracehandler/tracehandler_test.go | Adds interceptor behavior tests and updates Start call signature. |
| processmanager/manager.go | Handles CUDA frames directly during ConvertTrace (preserving correlation ID encoding). |
| parcagpu/parcagpu.go | Reworks timing reader and returns an interceptor that diverts CUDA traces post-ConvertTrace. |
| interpreter/gpu/cuda.go | Refactors CUDA fixer to store symbolized traces, attach kernel timing/name, recompute hash, and return completed outputs. |
| internal/controller/controller.go | Updates tracehandler.Start signature usage (currently still passes nil interceptor). |
Comments suppressed due to low confidence (1)

parcagpu/parcagpu.go:79

  • This select loop calls eventReader.ReadInto() in the default branch. ReadInto blocks, so the goroutine won’t service logTicker/clearTicker ticks (or ctx cancellation) while it’s blocked waiting for events. Consider using Reader.SetDeadline / timed reads, or reading perf events in a dedicated goroutine and sending them over a channel so the outer loop can select on ctx/tickers reliably.
			case <-ctx.Done():
				return
			default:
				if err := eventReader.ReadInto(&data); err != nil {
					readErrorCount.Add(1)
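
One way to picture the second suggestion (a dedicated reader goroutine feeding a channel) is the sketch below. It is only an illustration under assumptions: `blockingReader`, `pump`, `loop`, and `chanReader` are hypothetical names, and the interface merely stands in for the real perf event reader, whose API is not reproduced here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// blockingReader is a hypothetical stand-in for the event reader:
// Read blocks until an event is available or the reader is closed.
type blockingReader interface {
	Read() ([]byte, error)
}

// pump performs the blocking reads in its own goroutine and forwards events
// on a channel. A goroutine blocked inside Read only exits once the
// underlying reader is closed, so the owner must still close it on shutdown.
func pump(ctx context.Context, r blockingReader) <-chan []byte {
	out := make(chan []byte)
	go func() {
		defer close(out)
		for {
			ev, err := r.Read()
			if err != nil {
				return
			}
			select {
			case out <- ev:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}

// loop can now select on cancellation, ticks, and events together, because
// none of its cases block on the reader directly.
func loop(ctx context.Context, events <-chan []byte, tick <-chan time.Time, handle func([]byte)) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-tick:
			// periodic work (logging, clearing stale entries) would go here
		case ev, ok := <-events:
			if !ok {
				return
			}
			handle(ev)
		}
	}
}

// chanReader is a fake reader used only for the demo below.
type chanReader struct{ ch chan []byte }

func (c chanReader) Read() ([]byte, error) {
	ev, ok := <-c.ch
	if !ok {
		return nil, errors.New("reader closed")
	}
	return ev, nil
}

// runDemo feeds two events through the pump and returns how many were handled.
func runDemo() int {
	src := make(chan []byte, 2)
	src <- []byte("a")
	src <- []byte("b")
	close(src)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	handled := 0
	loop(ctx, pump(ctx, chanReader{ch: src}), ticker.C, func([]byte) { handled++ })
	return handled
}

func main() {
	fmt.Println("events handled:", runDemo())
}
```

The trade-off is one extra goroutine and channel hop per event, in exchange for a loop that stays responsive to tickers and cancellation while no events are arriving.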


Comment thread parcagpu/parcagpu.go
Comment thread tracehandler/tracehandler.go Outdated
Comment thread tracehandler/tracehandler.go Outdated
Comment thread internal/controller/controller.go
Comment thread interpreter/gpu/cuda.go
Comment thread interpreter/gpu/cuda.go Outdated
Comment thread interpreter/gpu/cuda.go
@gnurizen gnurizen force-pushed the cuda-sym-first branch 2 times, most recently from 536ed20 to f2d2f6c Compare February 25, 2026 15:50
@gnurizen gnurizen marked this pull request as draft February 25, 2026 19:28
@gnurizen
Collaborator Author

Converting back to draft, as we're looking at doing the demangling server-side.

Add a TraceInterceptor callback that is invoked after ConvertTrace on a
cache miss. When the interceptor returns true, the trace is consumed
(skipped for caching and reporting), allowing callers like the GPU
subsystem to divert specific traces for further processing.

Includes tests covering consume, pass-through, mixed, and
non-caching behavior.
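
The consume and pass-through behavior described in this commit message might look something like the sketch below. The name TraceInterceptor comes from the commit; its signature here, and all the other types (`rawTrace`, `convertedTrace`, `handler`), are assumptions made for illustration rather than the actual tracehandler code.

```go
package main

import "fmt"

// rawTrace and convertedTrace are hypothetical stand-ins for the
// pre- and post-ConvertTrace representations.
type rawTrace struct {
	Hash  uint64
	IsGPU bool
}

type convertedTrace struct {
	Hash  uint64
	IsGPU bool
}

// TraceInterceptor returns true to consume a converted trace.
// (Assumed signature; the real one may differ.)
type TraceInterceptor func(t *convertedTrace) bool

type handler struct {
	cache     map[uint64]struct{}
	intercept TraceInterceptor // nil means no interception
	reported  []uint64
	converted int
}

func newHandler(ic TraceInterceptor) *handler {
	return &handler{cache: make(map[uint64]struct{}), intercept: ic}
}

// handle mirrors the described flow: on a cache hit the trace is reported
// right away; on a miss it is converted and offered to the interceptor,
// and a consumed trace is neither cached nor reported.
func (h *handler) handle(raw rawTrace) {
	if _, hit := h.cache[raw.Hash]; hit {
		h.reported = append(h.reported, raw.Hash)
		return
	}
	conv := &convertedTrace{Hash: raw.Hash, IsGPU: raw.IsGPU} // stands in for ConvertTrace
	h.converted++
	if h.intercept != nil && h.intercept(conv) {
		return
	}
	h.cache[raw.Hash] = struct{}{}
	h.reported = append(h.reported, raw.Hash)
}

func main() {
	diverted := 0
	h := newHandler(func(t *convertedTrace) bool {
		if t.IsGPU {
			diverted++ // a GPU subsystem would park the trace here
			return true
		}
		return false
	})
	h.handle(rawTrace{Hash: 1, IsGPU: true})
	h.handle(rawTrace{Hash: 1, IsGPU: true}) // converted again: consumed traces are not cached
	h.handle(rawTrace{Hash: 2})
	h.handle(rawTrace{Hash: 2}) // cache hit: reported without conversion
	fmt.Println(diverted, h.converted, h.reported)
}
```

Skipping the cache for consumed traces means every diverted trace is converted afresh, so the interceptor sees each occurrence instead of only the first.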
CUDA stacks can sit as raw traces for a while waiting for the fixer to
match them with GPU timing information. During this time, pointers in
the raw traces could grow stale because functional programs
garbage-collect activation records. Avoid this by symbolizing traces
before parking them in the fixer maps.

This has the nice side effect of removing some channel indirection:
traces now go straight into the fixer maps, and when matched they
go straight to ReportTraceEvent.

Move CUDA symbolization earlier in the pipeline: ConvertTrace now
handles CUDA frames directly, and parcagpu.Start returns a
TraceInterceptor instead of a filtered channel. The interceptor
diverts symbolized CUDA traces into the GPU fixer post-ConvertTrace,
and completed traces (with timing and kernel name) are reported
directly. This eliminates the Symbolize method on the CUDA
interpreter in favor of demangling in prepTrace.
@gnurizen gnurizen marked this pull request as ready for review February 26, 2026 20:07
@gnurizen gnurizen requested a review from umanwizard February 26, 2026 20:08
@gnurizen gnurizen merged commit ede3a25 into main Feb 27, 2026
46 checks passed

3 participants