
[CUDA Plugin EP] Add per-node attribution and explicit GPU→ORT event linkage to profiler#28614

Open: tianleiwu wants to merge 1 commit into main from tlwu/cuda_plugin_ep_profiling_ort_id

@tianleiwu
Contributor

Summary

Implements explicit GPU→ORT event linkage and per-node attribution in the CUDA Plugin EP profiler, resolving the two known profiling limitations tracked in the plugin-EP gap analysis.

Motivation

Previously, GPU kernel events emitted by the plugin EP profiler carried only CUPTI metadata (stream, grid_*, block_*). Consumers had to rely on timestamp proximity to correlate GPU activity with ORT graph nodes — an unreliable heuristic under concurrent execution. This PR wires up the StopEvent callback to capture node identity from OrtProfilingEvent and stamps it onto GPU events during EndProfiling.

Key Changes

Plugin Profiler Header (cuda_profiler_plugin.h)

  • Added OrtNodeInfo struct holding event_name, op_name, node_index
  • Added std::mutex node_info_mutex_ and std::unordered_map<uint64_t, OrtNodeInfo> correlation_to_node_ to CudaPluginEpProfiler

Plugin Profiler Implementation (cuda_profiler_plugin.cc)

  • StopEventImpl: For OrtProfilingEventCategory_NODE events, reads the event name and op_name/node_index args via OrtEpApi::ProfilingEvent_* accessors; inserts into the correlation→node map under mutex. Accessor failures are non-fatal (releases OrtStatus*, continues). CUPTI pop always executes.
  • EndProfilingImpl: Swaps the map out under the mutex so the annotation loop iterates a local copy without holding the lock. For each GPU event, always appends ort_correlation_id; on a map hit, also appends ort_event_name, ort_op_name, and ort_node_index.
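
The capture and annotation flow described above can be modelled in a few lines of Python. This is only a toy sketch of the logic, not the C++ implementation: `ProfilerModel`, the `"NODE"` / `"SESSION"` category strings, the `correlation_id` field on GPU events, and the sample values are all hypothetical stand-ins, and the way arg values are encoded is an assumption.

```python
import threading
from dataclasses import dataclass


@dataclass
class NodeInfo:
    # Mirrors the three fields the PR lists for OrtNodeInfo.
    event_name: str
    op_name: str
    node_index: int


class ProfilerModel:
    """Toy model of the capture/annotate flow (hypothetical, not the C++ code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._correlation_to_node = {}

    def stop_event(self, correlation_id, category, info):
        # Only NODE-category events populate the map.
        if category == "NODE":
            with self._lock:
                self._correlation_to_node[correlation_id] = info

    def end_profiling(self, gpu_events):
        # Swap the map out under the lock, then iterate without holding it.
        with self._lock:
            snapshot, self._correlation_to_node = self._correlation_to_node, {}
        for event in gpu_events:
            args = event.setdefault("args", {})
            cid = event["correlation_id"]
            args["ort_correlation_id"] = str(cid)  # always emitted
            info = snapshot.get(cid)
            if info is not None:  # map hit: add the attribution group
                args["ort_event_name"] = info.event_name
                args["ort_op_name"] = info.op_name
                args["ort_node_index"] = str(info.node_index)
        return gpu_events


profiler = ProfilerModel()
profiler.stop_event(7, "NODE", NodeInfo("MatMul_0_kernel_time", "MatMul", 0))
profiler.stop_event(8, "SESSION", NodeInfo("session_run", "", -1))  # ignored
annotated = profiler.end_profiling([
    {"name": "gemm_kernel", "correlation_id": 7, "args": {"stream": "0"}},
    {"name": "memset_kernel", "correlation_id": 9},  # map miss
])
```

In this model the second GPU event still receives `ort_correlation_id` even though no node metadata was recorded for it, which is the "explicit linkage even on map miss" behavior the PR describes.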

Python Test (test_cuda_plugin_ep.py)

  • Extended _run_profiling_test() to assert:
    • Every GPU Kernel event carries a numeric ort_correlation_id
    • ort_event_name/ort_op_name/ort_node_index appear as a group when present
    • At least one attributed event maps to MatMul (the test model op)
  • Graceful skip behavior preserved when CUPTI is unavailable
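
As a rough illustration of these assertions, the sketch below validates a list of trace events. It is not the code in `_run_profiling_test()`: the helper name `check_kernel_events`, the `"cat": "Kernel"` filter, the string-encoded arg values, and the sample events are assumptions made for the example.

```python
def check_kernel_events(events):
    """Validate GPU Kernel events and return the attributed op names."""
    group = ("ort_event_name", "ort_op_name", "ort_node_index")
    op_names = []
    for event in events:
        if event.get("cat") != "Kernel":
            continue
        args = event.get("args", {})
        # Every GPU Kernel event must carry a numeric correlation ID.
        assert "ort_correlation_id" in args, event
        int(args["ort_correlation_id"])  # raises ValueError if not numeric
        # Attribution keys appear all together or not at all.
        present = [key in args for key in group]
        assert all(present) or not any(present), event
        if all(present):
            op_names.append(args["ort_op_name"])
    return op_names


# Synthetic events for illustration only; not real profiler output.
sample_events = [
    {"cat": "Node", "name": "MatMul_0_kernel_time", "args": {"op_name": "MatMul"}},
    {"cat": "Kernel", "name": "gemm_kernel", "args": {
        "ort_correlation_id": "12",
        "ort_event_name": "MatMul_0_kernel_time",
        "ort_op_name": "MatMul",
        "ort_node_index": "0",
    }},
    {"cat": "Kernel", "name": "memset_kernel", "args": {"ort_correlation_id": "13"}},
]
attributed_ops = check_kernel_events(sample_events)
assert "MatMul" in attributed_ops
```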

Design Doc (cuda_plugin_ep_design.md §14)

  • Updated §14.4 to mention the annotation pass
  • Rewrote §14.5 table rows for StopEvent metadata and GPU→ORT linkage
  • Added new §14.6 "Per-Node Attribution" (map lifecycle, NODE-only rationale, worked JSON example)
  • Renumbered Build Configuration → §14.7, Files → §14.8

Design Decisions

  • No new ORT C API surface — reuses the 1.25 ProfilingEvent_* accessors
  • NODE-only filter — only graph-node executions populate the map; ort_op_name always means an actual ONNX op
  • ort_*-prefixed arg keys — avoids collision with existing CUPTI arg names
  • Always emit ort_correlation_id — provides explicit linkage even on map miss
  • std::mutex + std::unordered_map — simple, correct, low contention (StopEvent calls are serialized per-node, EndProfiling is single-threaded)

Testing Notes

  1. Build with onnxruntime_ENABLE_CUDA_PROFILING=ON
  2. Run: python -m pytest onnxruntime/test/python/transformers/test_cuda_plugin_ep.py -k test_session_profiling -v
  3. With CUPTI available, verify JSON output contains ort_correlation_id on all Kernel events and ort_op_name=MatMul on at least one
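
For step 3, one way a consumer might inspect the output is to total GPU kernel time per attributed op. The sketch below assumes the profile is a JSON array of Chrome-trace-style records with `cat`, `dur`, and `args` fields; the helper name `kernel_time_by_op`, the `<unattributed>` bucket, and the inline sample trace are hypothetical.

```python
import json
from collections import defaultdict


def kernel_time_by_op(trace_json):
    """Sum the `dur` of Kernel events, grouped by their ort_op_name arg."""
    totals = defaultdict(int)
    for event in json.loads(trace_json):
        if event.get("cat") != "Kernel":
            continue
        # Events without attribution (map miss) fall into a catch-all bucket.
        op = event.get("args", {}).get("ort_op_name", "<unattributed>")
        totals[op] += int(event.get("dur", 0))
    return dict(totals)


# Synthetic trace for illustration only; not real profiler output.
sample_trace = json.dumps([
    {"cat": "Node", "name": "MatMul_0_kernel_time", "dur": 100, "args": {}},
    {"cat": "Kernel", "name": "gemm_a", "dur": 10,
     "args": {"ort_correlation_id": "12", "ort_op_name": "MatMul"}},
    {"cat": "Kernel", "name": "gemm_b", "dur": 20,
     "args": {"ort_correlation_id": "12", "ort_op_name": "MatMul"}},
    {"cat": "Kernel", "name": "memset", "dur": 5,
     "args": {"ort_correlation_id": "13"}},
])
totals = kernel_time_by_op(sample_trace)
```

With a real profile, the same function could be fed the contents of the JSON file that the profiling session writes.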

…age to profiler

Wire up the StopEvent callback to read NODE-category ORT profiling events via
the 1.25 OrtEpApi::ProfilingEvent_* accessors, capturing the event name, op_name
and node_index into a correlation → OrtNodeInfo map. In EndProfiling, stamp every
GPU event with ort_correlation_id (always) and ort_event_name / ort_op_name /
ort_node_index (when the map lookup hits).

This resolves the two known limitations in the CUDA Plugin EP profiler:
- GPU→ORT event linkage was implicit (timestamp proximity only)
- No per-node attribution for GPU kernel events

No new ORT C API surface is required.
Contributor

Copilot AI left a comment


Pull request overview

This PR enhances the CUDA Plugin EP profiling output by adding explicit GPU→ORT linkage and best-effort per-node attribution, so downstream trace consumers can reliably associate CUPTI-recorded GPU kernel events with the originating ORT graph node without relying on timestamp proximity.

Changes:

  • Capture NODE-category ORT profiling metadata (event_name, op_name, node_index) at StopEvent time and store it keyed by ORT correlation ID.
  • During EndProfiling, annotate every GPU Kernel event with ort_correlation_id, and additionally attach ort_event_name/ort_op_name/ort_node_index when attribution metadata is available.
  • Extend the Python profiling test and update the CUDA Plugin EP design doc to reflect the new annotation pass and attribution behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| onnxruntime/test/python/transformers/test_cuda_plugin_ep.py | Adds assertions that GPU Kernel events include ort_correlation_id and validates grouped per-node attribution fields when present. |
| onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h | Introduces OrtNodeInfo plus a mutex-protected correlation→node metadata map in the CUDA Plugin EP profiler. |
| onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc | Implements metadata capture in StopEventImpl and annotates CUPTI-derived GPU events in EndProfilingImpl. |
| docs/cuda_plugin_ep/cuda_plugin_ep_design.md | Documents the annotation pass, explicit linkage, and the per-node attribution map lifecycle with an example trace record. |
