Support new activity_batch parcagpucupti probe #217

Merged: gnurizen merged 1 commit into main from batch-probe on Mar 3, 2026


@gnurizen gnurizen commented Feb 27, 2026

To reduce BPF overhead, send up to 128 kernel launch timing activities
through the USDT probe per invocation. The old single-shot
kernel_executed probe is still supported.
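
As a rough illustration of why batching helps, the sketch below splits a run of timing records into chunks of at most 128 and counts how many times the probe would fire. The `timing_record` layout, the `emit_fn` callback, and `emit_in_batches` are hypothetical names for this example only; the real record layout and probe call live in the shim and are not shown in this PR description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BATCH 128 /* batch limit stated in the PR description */

/* Hypothetical record; the real layout is defined by the shim. */
struct timing_record {
    uint64_t start_ns;
    uint64_t end_ns;
    uint32_t correlation_id;
};

/* Hypothetical callback standing in for one firing of the USDT probe. */
typedef void (*emit_fn)(const struct timing_record *recs, size_t n, void *ctx);

/* Example callback: adds the size of each batch to a size_t counter. */
static void count_records(const struct timing_record *recs, size_t n, void *ctx)
{
    (void)recs;
    *(size_t *)ctx += n;
}

/*
 * Calls emit once per chunk of at most MAX_BATCH records and returns
 * the number of calls, i.e. the number of probe firings.
 */
static size_t emit_in_batches(const struct timing_record *recs, size_t total,
                              emit_fn emit, void *ctx)
{
    size_t fires = 0;

    assert(emit != NULL);
    for (size_t off = 0; off < total; off += MAX_BATCH) {
        size_t n = total - off;
        if (n > MAX_BATCH)
            n = MAX_BATCH;
        emit(recs + off, n, ctx);
        fires++;
    }
    return fires;
}
```

With 300 pending records, `emit_in_batches` makes 3 calls instead of the 300 a per-record probe would need, which is the overhead reduction the description is after.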

Shim changes:

parca-dev/parcagpu#14


Copilot AI left a comment

Pull request overview

Adds support for a new parcagpu:activity_batch USDT probe to reduce BPF overhead by sending kernel timing activities in batches (up to 128), while keeping the legacy kernel_executed probe working.

Changes:

  • Extend CUDA USDT probe detection/attachment to prefer activity_batch when available and fall back to kernel_executed.
  • Add a new eBPF handler (cuda_activity_batch) that reads batched CUPTI activity records from user memory and emits timing events.
  • Update verifier/unit tests and program-name constants to include the new program/probe.
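
The prefer-then-fall-back selection in the first bullet can be sketched as follows. This is a plain C illustration, not the loader's code (the actual loader is the Go file interpreter/gpu/cuda.go); `pick_timing_probe` and the way available probe names are passed in are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns "activity_batch" if it is among the available probe names,
 * otherwise "kernel_executed" if that is present, otherwise NULL.
 * Hypothetical helper for illustration only.
 */
static const char *pick_timing_probe(const char *const *available, size_t n)
{
    const char *fallback = NULL;

    assert(available != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(available[i], "activity_batch") == 0)
            return available[i]; /* preferred probe: stop searching */
        if (strcmp(available[i], "kernel_executed") == 0)
            fallback = available[i]; /* legacy probe: remember, keep looking */
    }
    return fallback;
}
```

Scanning the whole list before settling on the legacy probe means the result does not depend on the order in which the probes are discovered.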

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| support/ebpf/cupti_activity_bpf.h | Introduces a minimal CUPTI kernel activity struct definition for BPF-side parsing. |
| support/ebpf/cuda.ebpf.c | Adds per-CPU scratch map, implements activity_batch handler, and routes cookie 'b' in cuda_probe. |
| interpreter/gpu/cuda.go | Prefers activity_batch probe in loader; adds program-name constant and cookie mapping. |
| interpreter/gpu/cuda_test.go | Verifies the new cuda_activity_batch program exists in the compiled collection. |
| test/cudaverify/cuda_verifier_test.go | Updates verifier tests to accept either kernel timing probe and to use explicit cookies/program names. |


Comment thread support/ebpf/cupti_activity_bpf.h Outdated
Comment thread interpreter/gpu/cuda.go
Comment thread support/ebpf/cuda.ebpf.c Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.



Comment thread interpreter/gpu/cuda.go Outdated
Comment thread processmanager/ebpf/ebpf.go
Comment thread support/ebpf/cuda.ebpf.c
Comment thread support/ebpf/cupti_activity_bpf.h Outdated
Comment thread interpreter/gpu/cuda.go Outdated
@gnurizen gnurizen requested review from brancz and umanwizard March 1, 2026 17:23
@gnurizen gnurizen marked this pull request as ready for review March 1, 2026 17:24
@gnurizen gnurizen force-pushed the batch-probe branch 2 times, most recently from f2c7a51 to 60d213f March 2, 2026 21:09
To reduce BPF overhead, send up to 128 kernel launch timing activities
through the USDT probe per invocation. The old single-shot
kernel_executed probe is still supported.

Inline correlation and kernel_exec into cuda_probe, tail-call only
activity_batch

The unwinder is sensitive to tail calls, so minimize them: inline
cuda_correlation and cuda_kernel_exec directly into cuda_probe's switch
statement, using bpf_usdt_arg() to read the USDT arguments.
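
The dispatch shape described in this commit message can be modeled in plain C as below. This is only a user-space sketch of the control flow, not eBPF code: the function-pointer table stands in for a BPF program array, and the cookies 'c' and 'k' are hypothetical placeholders (only cookie 'b' for the batch handler is named in the review summary).

```c
#include <assert.h>
#include <stddef.h>

enum action { HANDLED_INLINE, TAIL_CALLED, IGNORED };

/* Hypothetical stand-in for a BPF program array entry. */
typedef enum action (*prog_fn)(void);

/* Models the separate activity_batch program reached by tail call. */
static enum action activity_batch_prog(void)
{
    return TAIL_CALLED;
}

/* Slot 0 models the only program that is still tail-called. */
static prog_fn prog_array[] = { activity_batch_prog };

/* Models cuda_probe's switch: two cases inline, one tail call. */
static enum action dispatch(char cookie)
{
    switch (cookie) {
    case 'c': /* hypothetical cookie: correlation, handled inline */
    case 'k': /* hypothetical cookie: single-shot kernel_executed, inline */
        return HANDLED_INLINE;
    case 'b': /* cookie 'b': batch handler, reached through the table */
        assert(prog_array[0] != NULL);
        return prog_array[0]();
    default:
        return IGNORED;
    }
}
```

Only the batch path leaves the dispatcher, so the two inlined cases stay within a single program, which is the property the commit message says the unwinder cares about.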
@gnurizen gnurizen merged commit d2eb60c into main Mar 3, 2026
42 of 46 checks passed
2 participants