[CUDA Plugin EP] Add per-node attribution and explicit GPU→ORT event linkage to profiler #28614
Open
tianleiwu wants to merge 1 commit into
…age to profiler

Wire up the StopEvent callback to read NODE-category ORT profiling events via the 1.25 OrtEpApi::ProfilingEvent_* accessors, capturing the event name, op_name and node_index into a correlation → OrtNodeInfo map. In EndProfiling, stamp every GPU event with ort_correlation_id (always) and ort_event_name / ort_op_name / ort_node_index (when the map lookup hits).

This resolves the two known limitations in the CUDA Plugin EP profiler:
- GPU→ORT event linkage was implicit (timestamp proximity only)
- No per-node attribution for GPU kernel events

No new ORT C API surface is required.
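The capture-then-annotate flow in the commit message can be modeled in a few lines. This is a minimal Python sketch of the logic only, not the plugin's C++ code: `NodeInfo`, `ProfilerModel`, its method names, and the `correlation_id` key on the input events are hypothetical. Only the `ort_*` key names and the three `OrtNodeInfo` fields come from the description.

```python
import threading
from dataclasses import dataclass


@dataclass
class NodeInfo:
    """Hypothetical stand-in for the C++ OrtNodeInfo struct."""
    event_name: str
    op_name: str
    node_index: int


class ProfilerModel:
    """Toy model of the capture-then-annotate flow (not the real plugin)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._correlation_to_node = {}

    def stop_event(self, correlation_id, category, info):
        # Only NODE-category events contribute attribution metadata.
        if category == "NODE":
            with self._lock:
                self._correlation_to_node[correlation_id] = info

    def end_profiling(self, gpu_events):
        # Take ownership of the map under the lock, then iterate without it.
        with self._lock:
            snapshot, self._correlation_to_node = self._correlation_to_node, {}
        for event in gpu_events:
            args = event.setdefault("args", {})
            cid = event["correlation_id"]  # assumed input field
            args["ort_correlation_id"] = cid  # always emitted
            info = snapshot.get(cid)
            if info is not None:  # attribution only on a map hit
                args["ort_event_name"] = info.event_name
                args["ort_op_name"] = info.op_name
                args["ort_node_index"] = info.node_index
        return gpu_events
```

In this model, a GPU event whose correlation ID was never recorded by a NODE-category `stop_event` call still receives `ort_correlation_id`, which mirrors the "always" versus "when the map lookup hits" split described above.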
Pull request overview
This PR enhances the CUDA Plugin EP profiling output by adding explicit GPU→ORT linkage and best-effort per-node attribution, so downstream trace consumers can reliably associate CUPTI-recorded GPU kernel events with the originating ORT graph node without relying on timestamp proximity.
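To illustrate why timestamp proximity is fragile, here is a small synthetic comparison in Python. The timeline, the field names (`ts`, `correlation_id`, `name`), and both helper functions are made up for illustration and are not taken from the profiler output format.

```python
def nearest_node_by_timestamp(kernel_ts, node_events):
    """Proximity heuristic: pick the node whose start time is closest."""
    return min(node_events, key=lambda n: abs(n["ts"] - kernel_ts))["name"]


def node_by_correlation(kernel, node_events):
    """Explicit linkage: match on the shared correlation ID."""
    by_id = {n["correlation_id"]: n["name"] for n in node_events}
    return by_id.get(kernel["correlation_id"])


# Synthetic timeline: node A launches a kernel that only runs at ts=180,
# shortly before node B starts on the CPU side at ts=190.
nodes = [
    {"name": "A", "ts": 100, "correlation_id": 1},
    {"name": "B", "ts": 190, "correlation_id": 2},
]
kernel = {"ts": 180, "correlation_id": 1}

print(nearest_node_by_timestamp(kernel["ts"], nodes))  # prints "B" (wrong)
print(node_by_correlation(kernel, nodes))              # prints "A"
```

The heuristic picks B because the delayed kernel starts closer to B's timestamp, while the correlation lookup is unaffected by scheduling delays.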
Changes:
- Capture NODE-category ORT profiling metadata (`event_name`, `op_name`, `node_index`) at `StopEvent` time and store it keyed by ORT correlation ID.
- During `EndProfiling`, annotate every GPU Kernel event with `ort_correlation_id`, and additionally attach `ort_event_name` / `ort_op_name` / `ort_node_index` when attribution metadata is available.
- Extend the Python profiling test and update the CUDA Plugin EP design doc to reflect the new annotation pass and attribution behavior.
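On the consumer side, the new fields turn attribution into a dictionary lookup. The sketch below sums GPU kernel time per node. It assumes Chrome-trace-style event dictionaries with `cat`, `dur`, and `args` keys, and the helper name `gpu_time_per_node` is hypothetical; only the `ort_*` argument names come from this PR.

```python
from collections import defaultdict


def gpu_time_per_node(events):
    """Sum Kernel durations per (ort_node_index, ort_op_name).

    Kernel events without attribution fields are grouped under None.
    """
    totals = defaultdict(int)
    for event in events:
        if event.get("cat") != "Kernel":
            continue
        args = event.get("args", {})
        if "ort_node_index" in args:
            key = (int(args["ort_node_index"]), args.get("ort_op_name"))
        else:
            key = None  # linkage only, no per-node attribution
        totals[key] += int(event.get("dur", 0))
    return dict(totals)
```

Keeping unattributed kernels under a separate `None` bucket reflects the best-effort nature of the attribution: those events still carry `ort_correlation_id` but no node metadata.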
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| onnxruntime/test/python/transformers/test_cuda_plugin_ep.py | Adds assertions that GPU Kernel events include ort_correlation_id and validates grouped per-node attribution fields when present. |
| onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h | Introduces OrtNodeInfo plus a mutex-protected correlation→node metadata map in the CUDA Plugin EP profiler. |
| onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc | Implements metadata capture in StopEventImpl and annotates CUPTI-derived GPU events in EndProfilingImpl. |
| docs/cuda_plugin_ep/cuda_plugin_ep_design.md | Documents the annotation pass, explicit linkage, and the per-node attribution map lifecycle with an example trace record. |
Summary
Implements explicit GPU→ORT event linkage and per-node attribution in the CUDA Plugin EP profiler, resolving the two known profiling limitations tracked in the plugin-EP gap analysis.
Motivation
Previously, GPU kernel events emitted by the plugin EP profiler carried only CUPTI metadata (`stream`, `grid_*`, `block_*`). Consumers had to rely on timestamp proximity to correlate GPU activity with ORT graph nodes, an unreliable heuristic under concurrent execution. This PR wires up the `StopEvent` callback to capture node identity from `OrtProfilingEvent` and stamps it onto GPU events during `EndProfiling`.

Key Changes
Plugin Profiler Header (`cuda_profiler_plugin.h`)
- Added `OrtNodeInfo` struct holding `event_name`, `op_name`, `node_index`
- Added `std::mutex node_info_mutex_` and `std::unordered_map<uint64_t, OrtNodeInfo> correlation_to_node_` to `CudaPluginEpProfiler`

Plugin Profiler Implementation (`cuda_profiler_plugin.cc`)
- `StopEventImpl`: For `OrtProfilingEventCategory_NODE` events, reads the event name and the `op_name` / `node_index` args via `OrtEpApi::ProfilingEvent_*` accessors, then inserts into the correlation→node map under mutex. Accessor failures are non-fatal (releases `OrtStatus*`, continues). CUPTI pop always executes.
- `EndProfilingImpl`: Swaps the map under mutex for lock-free iteration. For each GPU event, always appends `ort_correlation_id`; on map hit, also appends `ort_event_name`, `ort_op_name`, `ort_node_index`.

Python Test (`test_cuda_plugin_ep.py`)
- Extended `_run_profiling_test()` to assert:
  - Every `Kernel` event carries a numeric `ort_correlation_id`
  - `ort_event_name` / `ort_op_name` / `ort_node_index` appear as a group when present
  - At least one event reports `ort_op_name` = `MatMul` (the test model op)

Design Doc (`cuda_plugin_ep_design.md` §14)
- `StopEvent` metadata and GPU→ORT linkage

Design Decisions
- Use the 1.25 `ProfilingEvent_*` accessors: no new ORT C API surface is required
- Capture only NODE-category events: `ort_op_name` always means an actual ONNX op
- `ort_*`-prefixed arg keys: avoids collision with existing CUPTI arg names
- Always emit `ort_correlation_id`: provides explicit linkage even on map miss
- `std::mutex` + `std::unordered_map`: simple, correct, low contention (`StopEvent` calls are serialized per-node, `EndProfiling` is single-threaded)

Testing Notes
- Build with `onnxruntime_ENABLE_CUDA_PROFILING=ON`
- Run `python -m pytest onnxruntime/test/python/transformers/test_cuda_plugin_ep.py -k test_session_profiling -v`
- Expect `ort_correlation_id` on all Kernel events and `ort_op_name=MatMul` on at least one
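The three checks in the testing notes can be expressed as one standalone validator. This is a hedged sketch, not the code in `test_cuda_plugin_ep.py`: the function name `check_kernel_events` and the assumed event layout (`cat` and `args` keys on each trace event) are illustrative assumptions.

```python
ATTRIBUTION_KEYS = ("ort_event_name", "ort_op_name", "ort_node_index")


def check_kernel_events(events):
    """Raise AssertionError if a linkage or attribution invariant is violated.

    Returns the number of Kernel events inspected.
    """
    kernels = [e for e in events if e.get("cat") == "Kernel"]
    assert kernels, "expected at least one Kernel event"
    op_names = set()
    for event in kernels:
        args = event.get("args", {})
        # Linkage: every Kernel event carries a numeric correlation ID.
        assert str(args.get("ort_correlation_id", "")).isdigit()
        # Attribution: the three fields appear together or not at all.
        present = [key in args for key in ATTRIBUTION_KEYS]
        assert all(present) or not any(present)
        if all(present):
            op_names.add(args["ort_op_name"])
    assert "MatMul" in op_names, "no Kernel event attributed to MatMul"
    return len(kernels)
```

A validator like this could be run against the event list loaded from the profile JSON that the session writes when profiling ends.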