microsoft · tianleiwu · May 21, 2026
diff --git a/docs/cuda_plugin_ep/cuda_plugin_ep_design.md b/docs/cuda_plugin_ep/cuda_plugin_ep_design.md
@@ -872,8 +872,9 @@ The plugin API's `StartEvent`/`StopEvent` receive **absolute epoch-based** corre
 When ORT calls `EndProfiling`:
 1. CUPTI activity buffers are flushed (`cuptiActivityFlushAll`).
 2. GPU activity records are processed — kernel names, timestamps, durations, and stream/grid metadata are extracted.
-3. Events are converted to `Ort::ProfilingEvent` instances with `OrtProfilingEventCategory_KERNEL`.
-4. Events are appended to the `OrtProfilingEventsContainer` via `AddEvents`.
+3. The plugin runs an **annotation pass**: while flattening the per-correlation-ID event buckets returned by `CUPTIManager::Consume`, it stamps each GPU event with an explicit `ort_correlation_id` arg, and — when the correlation ID matches a NODE-category ORT event captured during `StopEvent` — also stamps `ort_event_name`, `ort_op_name`, and `ort_node_index`. See §14.6 for the lifecycle of the correlation-to-node map.
+4. Events are converted to `Ort::ProfilingEvent` instances with `OrtProfilingEventCategory_KERNEL`.
+5. Events are appended to the `OrtProfilingEventsContainer` via `AddEvents`.
 
 The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only, and the `PluginEpProfiler` bridge on the ORT side likewise appends EP events to ORT's profiling event collection without merge/sort by timestamp or correlation ID. Any ordering or interleaving into a global timeline is handled by downstream trace consumers.
 
@@ -883,11 +884,44 @@ The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUPro
 |--------|----------------|----------------|
 | Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge appends only, and trace consumers handle ordering |
 | Correlation IDs | Relative → absolute conversion in `GPUTracerManager::PushCorrelation` | Bridge provides absolute IDs directly; plugin pushes to CUPTI as-is |
-| `StopEvent` metadata | Ignored (just pops correlation) | ORT event metadata available; currently unused, can annotate GPU events in future |
-| GPU→ORT event linkage | Implicit via CUPTI external correlation IDs merged into timeline | GPU events carry only CUPTI metadata (`stream`, `grid_*`, `block_*`); no ORT correlation or parent identifier is attached. Downstream consumers must relate GPU kernels to ORT nodes via timestamp proximity. This is a known limitation; future work may attach `correlation_id` or parent event name via `StopEvent`'s `OrtProfilingEvent` parameter |
+| `StopEvent` metadata | Ignored (just pops correlation) | Reads category, name, `op_name`, and `node_index` via `OrtEpApi::ProfilingEvent_*` for NODE-category events and records them in a correlation → node-info map (see §14.6) |
+| GPU→ORT event linkage | Implicit via CUPTI external correlation IDs merged into timeline | Explicit — every GPU event carries `ort_correlation_id`; node-attributed events additionally carry `ort_event_name`, `ort_op_name`, and `ort_node_index` |
 | Singleton scope | Process-wide `CUPTIManager` in main ORT DLL | DLL-local `CUPTIManager` in plugin (process isolation) |
 
-### 14.6 Build Configuration
+### 14.6 Per-Node Attribution
+
+The plugin annotates GPU events with the identity of the ORT graph node that triggered them, so consumers can answer "which node ran this kernel?" without timestamp-proximity heuristics.
+
+**Map lifecycle.** `CudaPluginEpProfiler` holds a `std::unordered_map<uint64_t, OrtNodeInfo>` keyed by the absolute, epoch-based correlation ID the bridge passes to `StartEvent`/`StopEvent`. The map is guarded by `std::mutex node_info_mutex_` because ORT may execute nodes on multiple inter-op threads concurrently. Its lifecycle has two phases:
+
+1. `StopEventImpl` is called once per ORT event. It first pops the CUPTI external correlation, so the pop happens regardless of accessor outcome. It then queries `ProfilingEvent_GetCategory`; if the category is `OrtProfilingEventCategory_NODE`, it reads the event name (`ProfilingEvent_GetName`) plus the `op_name` and `node_index` args (`ProfilingEvent_GetArgValue`) and inserts an `OrtNodeInfo` under the correlation-ID key. Accessor failures are non-fatal: the returned `OrtStatus*` is released, a failed category query skips the event, and a failed name or arg query leaves that field empty. Affected GPU events keep whatever annotations were captured, with `ort_correlation_id` as the minimum.
+2. `EndProfilingImpl` swaps the map into a local container under the mutex (so subsequent lookups during event flattening run lock-free), then iterates the per-correlation buckets returned by `CUPTIManager::Consume`. For each bucket it stringifies the correlation ID once and, for every GPU event in the bucket, appends `ort_correlation_id` plus — if the lookup hit — `ort_event_name`, `ort_op_name`, and `ort_node_index`. The arg strings are copied by `OrtEpApi::ProfilingEventsContainer_AddEvents`, so local-scope storage is sufficient.
+
+**Why NODE-only.** CUPTI external correlation IDs are pushed in `StartEvent` and popped in `StopEvent` for *all* event categories, so non-NODE events (e.g. `SESSION` or `API`) can still produce GPU activity buckets keyed by their correlation ID. Filtering to `OrtProfilingEventCategory_NODE` in `StopEventImpl` means only graph-node executions populate the map; GPU events captured under, say, a session-init scope carry just `ort_correlation_id` and no `ort_op_name`. This keeps the annotation precise: an `ort_op_name` value always corresponds to an actual ONNX op type.
+
+**Example.** A GPU kernel event for a `MatMul` node at graph index 7 now looks like (Chrome trace JSON):
+
+```json
+{
+  "cat": "Kernel",
+  "name": "ampere_sgemm_64x32_nn",
+  "ts": 1234567,
+  "dur": 42,
+  "args": {
+    "stream": "0x55ab…",
+    "grid_x": "12",
+    "block_x": "32",
+    "ort_correlation_id": "1718000000000000007",
+    "ort_event_name": "MatMul_0_kernel_time",
+    "ort_op_name": "MatMul",
+    "ort_node_index": "7"
+  }
+}
+```
+
+The `ort_*`-prefixed keys are chosen to avoid colliding with existing CUPTI arg names (`stream`, `grid_*`, `block_*`, `name`, `correlation_id`).
+
+### 14.7 Build Configuration
 
 CUPTI profiling is conditional:
 - **CMake flag**: `onnxruntime_ENABLE_CUDA_PROFILING=ON`
@@ -897,7 +931,7 @@ CUPTI profiling is conditional:
 
 When profiling is disabled (default), `CudaEp::CreateProfiler` is set to `nullptr` and no CUPTI code is compiled.
 
-### 14.7 Files
+### 14.8 Files
 
 | File | Role |
 |------|------|
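
The record-then-drain lifecycle described in §14.6 can be modeled in a few lines of Python. This is a simplified sketch, not the plugin's C++ code: `NodeAttributionModel`, its method names, and the sample kernel names are hypothetical, the `"NODE"` string stands in for `OrtProfilingEventCategory_NODE`, and the `buckets` dict stands in for the per-correlation-ID output of `CUPTIManager::Consume`.

```python
import threading

NODE = "NODE"  # stand-in for OrtProfilingEventCategory_NODE (assumption)


class NodeAttributionModel:
    """Toy model of the correlation -> node-info map; names are hypothetical."""

    def __init__(self):
        self._lock = threading.Lock()
        self._correlation_to_node = {}

    def stop_event(self, correlation_id, category, name=None, op_name=None, node_index=None):
        # Only NODE-category events populate the map.
        if category != NODE:
            return
        info = {"ort_event_name": name, "ort_op_name": op_name, "ort_node_index": node_index}
        with self._lock:
            self._correlation_to_node[correlation_id] = info

    def end_profiling(self, buckets):
        # Swap the map out under the lock; the lookups below need no locking.
        with self._lock:
            node_info, self._correlation_to_node = self._correlation_to_node, {}

        annotated = []
        for correlation_id, records in buckets.items():
            correlation_id_str = str(correlation_id)  # stringify once per bucket
            info = node_info.get(correlation_id)
            for record in records:
                args = dict(record["args"])
                args["ort_correlation_id"] = correlation_id_str  # always emitted
                if info is not None:
                    # Skip fields that were not captured.
                    args.update({k: v for k, v in info.items() if v})
                annotated.append({**record, "args": args})
        return annotated


model = NodeAttributionModel()
model.stop_event(7, NODE, name="MatMul_0_kernel_time", op_name="MatMul", node_index="7")
model.stop_event(8, "SESSION", name="session_init")
events = model.end_profiling({
    7: [{"name": "gemm_kernel", "args": {"stream": "1"}}],
    8: [{"name": "memset_kernel", "args": {"stream": "1"}}],
})
```

In this model the event under correlation ID 7 gains all four `ort_*` args, while the one under ID 8 gains only `ort_correlation_id`. A second `end_profiling` call starts from an empty map because the first call drained it.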

diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc b/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc
@@ -6,7 +6,10 @@
 #if defined(ENABLE_CUDA_PROFILING)
 
 #include <map>
+#include <mutex>
 #include <string>
+#include <unordered_map>
+#include <utility>
 #include <vector>
 
 namespace onnxruntime {
@@ -87,14 +90,59 @@ OrtStatus* ORT_API_CALL CudaPluginEpProfiler::StartEventImpl(
 
 /*static*/
 OrtStatus* ORT_API_CALL CudaPluginEpProfiler::StopEventImpl(
-    OrtEpProfilerImpl* /*this_ptr*/,
-    uint64_t /*ort_event_correlation_id*/,
-    const OrtProfilingEvent* /*ort_event*/) noexcept {
+    OrtEpProfilerImpl* this_ptr,
+    uint64_t ort_event_correlation_id,
+    const OrtProfilingEvent* ort_event) noexcept {
   EXCEPTION_TO_STATUS_BEGIN
+  auto* self = static_cast<CudaPluginEpProfiler*>(this_ptr);
 
+  // Always pop the CUPTI external correlation push performed in StartEvent,
+  // regardless of category — even if metadata extraction below partially fails.
   auto& manager = profiling::CUPTIManager::GetInstance();
   manager.PopCorrelation();
 
+  // For NODE-category events, capture the originating node's identity now so that
+  // EndProfiling can annotate the GPU kernel/memcpy events produced under this
+  // correlation ID. Accessor failures are non-fatal: we simply skip annotation
+  // for this event and rely on ort_correlation_id alone for linkage.
+  if (ort_event != nullptr) {
+    const auto& api = self->ep_api;
+
+    OrtProfilingEventCategory category = OrtProfilingEventCategory_KERNEL;
+    if (OrtStatus* s = api.ProfilingEvent_GetCategory(ort_event, &category); s != nullptr) {
+      Ort::GetApi().ReleaseStatus(s);
+      return nullptr;
+    }
+
+    if (category == OrtProfilingEventCategory_NODE) {
+      OrtNodeInfo info;
+
+      const char* event_name = nullptr;
+      if (OrtStatus* s = api.ProfilingEvent_GetName(ort_event, &event_name); s != nullptr) {
+        Ort::GetApi().ReleaseStatus(s);
+      } else if (event_name != nullptr) {
+        info.event_name = event_name;
+      }
+
+      const char* op_name = nullptr;
+      if (OrtStatus* s = api.ProfilingEvent_GetArgValue(ort_event, "op_name", &op_name); s != nullptr) {
+        Ort::GetApi().ReleaseStatus(s);
+      } else if (op_name != nullptr) {
+        info.op_name = op_name;
+      }
+
+      const char* node_index = nullptr;
+      if (OrtStatus* s = api.ProfilingEvent_GetArgValue(ort_event, "node_index", &node_index); s != nullptr) {
+        Ort::GetApi().ReleaseStatus(s);
+      } else if (node_index != nullptr) {
+        info.node_index = node_index;
+      }
+
+      std::lock_guard<std::mutex> lock(self->node_info_mutex_);
+      self->correlation_to_node_[ort_event_correlation_id] = std::move(info);
+    }
+  }
+
   return nullptr;
   EXCEPTION_TO_STATUS_END
 }
@@ -113,22 +161,70 @@ OrtStatus* ORT_API_CALL CudaPluginEpProfiler::EndProfilingImpl(
   std::map<uint64_t, profiling::Events> event_map;
   manager.Consume(self->client_handle_, self->ort_profiling_start_, event_map);
 
+  // Snapshot the correlation→node map under lock and clear it; subsequent
+  // lookups can then run lock-free for the duration of event flattening.
+  std::unordered_map<uint64_t, OrtNodeInfo> node_info;
+  {
+    std::lock_guard<std::mutex> lock(self->node_info_mutex_);
+    node_info.swap(self->correlation_to_node_);
+  }
+
   // Flatten all GPU events and convert to OrtProfilingEvent.
   std::vector<Ort::ProfilingEvent> events;
   for (auto& kv : event_map) {
+    const uint64_t correlation_id = kv.first;
     auto& event_list = kv.second;
+
+    // Resolve ORT-side attribution for this correlation ID (if any).
+    const OrtNodeInfo* info = nullptr;
+    if (auto it = node_info.find(correlation_id); it != node_info.end()) {
+      info = &it->second;
+    }
+
+    // Stringify correlation ID once per outer iteration; storage must outlive
+    // every Ort::ProfilingEvent constructor call below. The constructor copies
+    // these strings into the container (see ProfilingEventsContainer_AddEvents),
+    // so per-record local storage would also work, but lifting it here avoids
+    // redundant work.
+    const std::string correlation_id_str = std::to_string(correlation_id);
+
     for (const auto& record : event_list) {
       // Build parallel key/value arrays to use the raw-pointer ProfilingEvent
       // constructor, avoiding a copy from InlinedHashMap to std::unordered_map.
+      // Reserve enough headroom for the CUPTI args plus up to 4 ORT annotations
+      // (ort_correlation_id always; ort_event_name / ort_op_name / ort_node_index
+      // when ORT-side metadata is available).
       InlinedVector<const char*> arg_keys;
       InlinedVector<const char*> arg_values;
-      arg_keys.reserve(record.args.size());
-      arg_values.reserve(record.args.size());
+      arg_keys.reserve(record.args.size() + 4);
+      arg_values.reserve(record.args.size() + 4);
       for (const auto& [k, v] : record.args) {
         arg_keys.push_back(k.c_str());
         arg_values.push_back(v.c_str());
       }
 
+      // Always emit ort_correlation_id so consumers can join GPU events back
+      // to ORT events even when per-node attribution wasn't captured (e.g. the
+      // event came from a non-NODE category, or StopEvent ran before the GPU
+      // activity was finalized).
+      arg_keys.push_back("ort_correlation_id");
+      arg_values.push_back(correlation_id_str.c_str());
+
+      if (info != nullptr) {
+        if (!info->event_name.empty()) {
+          arg_keys.push_back("ort_event_name");
+          arg_values.push_back(info->event_name.c_str());
+        }
+        if (!info->op_name.empty()) {
+          arg_keys.push_back("ort_op_name");
+          arg_values.push_back(info->op_name.c_str());
+        }
+        if (!info->node_index.empty()) {
+          arg_keys.push_back("ort_node_index");
+          arg_values.push_back(info->node_index.c_str());
+        }
+      }
+
       events.emplace_back(
           OrtProfilingEventCategory_KERNEL,
           record.pid,

diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h b/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h
@@ -5,13 +5,26 @@
 
 #if defined(ENABLE_CUDA_PROFILING)
 
+#include <mutex>
+#include <string>
+#include <unordered_map>
+
 #include "cuda_plugin_utils.h"
 #include "core/providers/cuda/cupti_manager.h"
 #include "core/common/gpu_profiler_common.h"
 
 namespace onnxruntime {
 namespace cuda_plugin {
 
+/// Per-node ORT profiling metadata captured during StopEvent and used in
+/// EndProfiling to annotate CUPTI-captured GPU events with explicit
+/// ORT-side attribution (node name, op type, node index).
+struct OrtNodeInfo {
+  std::string event_name;  ///< Full ORT event name (e.g. "<node>_kernel_time").
+  std::string op_name;     ///< ONNX op type for the node, if available.
+  std::string node_index;  ///< Node index in the graph as a decimal string, if available.
+};
+
 /// Plugin-side implementation of OrtEpProfilerImpl for CUDA.
 /// Delegates to CUPTIManager (within the plugin DLL) for GPU activity tracing
 /// and implements the C callback interface expected by ORT's PluginEpProfiler bridge.
@@ -20,6 +33,17 @@ struct CudaPluginEpProfiler : OrtEpProfilerImpl {
   uint64_t client_handle_ = 0;
   TimePoint ort_profiling_start_;
 
+  // Maps the absolute, epoch-based ORT event correlation ID for a NODE-category event
+  // (as passed to StartEvent/StopEvent) to the originating node's identity.
+  // Populated in StopEventImpl and drained in EndProfilingImpl, where the
+  // entries are joined against CUPTI-captured GPU events to attribute each
+  // GPU kernel back to a specific ORT graph node.
+  //
+  // Different ORT events may run on different threads (inter-op parallelism),
+  // so map access is protected by node_info_mutex_.
+  std::mutex node_info_mutex_;
+  std::unordered_map<uint64_t, OrtNodeInfo> correlation_to_node_;
+
   explicit CudaPluginEpProfiler(const OrtEpApi& api);
   ~CudaPluginEpProfiler();
 

diff --git a/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py b/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
@@ -2455,13 +2455,52 @@ def _run_profiling_test(self):
             # If GPU kernel events are present, validate their metadata.
             kernel_events = [e for e in profile_data if isinstance(e, dict) and e.get("cat") == "Kernel"]
             if kernel_events:
+                saw_matmul_attribution = False
                 for event in kernel_events:
                     self.assertIn("ts", event)
                     self.assertIn("dur", event)
                     self.assertGreaterEqual(event["dur"], 0)
                     args = event.get("args", {})
                     self.assertIn("stream", args, f"GPU kernel event missing 'stream': {event}")
                     self.assertIn("block_x", args, f"GPU kernel event missing 'block_x': {event}")
+
+                    # Every GPU kernel event must carry an explicit ORT correlation ID
+                    # so consumers can join it back to the originating ORT event without
+                    # relying on timestamp-proximity heuristics.
+                    self.assertIn(
+                        "ort_correlation_id",
+                        args,
+                        f"GPU kernel event missing 'ort_correlation_id': {event}",
+                    )
+                    self.assertTrue(
+                        args["ort_correlation_id"].isdigit(),
+                        f"'ort_correlation_id' must be a decimal string: {event}",
+                    )
+
+                    # Per-node attribution is best-effort (only NODE-category ORT events
+                    # populate the map). When 'ort_op_name' is present, validate the
+                    # other two node-attribution keys as well.
+                    if "ort_op_name" in args:
+                        self.assertIn("ort_event_name", args)
+                        self.assertIn("ort_node_index", args)
+                        self.assertTrue(
+                            args["ort_event_name"].endswith("_kernel_time"),
+                            f"'ort_event_name' should end with '_kernel_time': {event}",
+                        )
+                        self.assertTrue(
+                            args["ort_node_index"].isdigit(),
+                            f"'ort_node_index' must be a decimal string: {event}",
+                        )
+                        if args["ort_op_name"] == "MatMul":
+                            saw_matmul_attribution = True
+
+                # The test model contains exactly one MatMul node assigned to the plugin EP,
+                # so if any per-node attribution landed, at least one attributed event must
+                # be for MatMul.
+                any_op_attribution = any("ort_op_name" in (e.get("args") or {}) for e in kernel_events)
+                if any_op_attribution:
+                    self.assertTrue(
+                        saw_matmul_attribution,
+                        "Expected at least one GPU kernel event attributed to a MatMul node.",
+                    )
             else:
                 print("Note: No GPU Kernel events in profile (CUPTI may not be available).")
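
The new `ort_*` args let a downstream consumer group GPU kernels by graph node without timestamp heuristics. The following is a minimal sketch of such a consumer, assuming the profile is a JSON array of Chrome-trace-style event dicts as in the test above. The helper name `gpu_time_by_node` and the sample events (kernel names, durations, IDs) are made up for illustration.

```python
import json
from collections import defaultdict


def gpu_time_by_node(trace_events):
    """Sum GPU kernel durations per (node index, op name) using the ort_* args."""
    totals = defaultdict(int)
    for event in trace_events:
        if not isinstance(event, dict) or event.get("cat") != "Kernel":
            continue
        args = event.get("args") or {}
        if "ort_node_index" not in args:
            continue  # correlation-only event, e.g. from a non-NODE scope
        key = (int(args["ort_node_index"]), args.get("ort_op_name", ""))
        totals[key] += event.get("dur", 0)
    return dict(totals)


# Illustrative sample data only; not output from a real profiling run.
sample_trace = json.loads("""
[
  {"cat": "Node", "name": "MatMul_0_kernel_time", "dur": 100, "args": {"op_name": "MatMul"}},
  {"cat": "Kernel", "name": "gemm_kernel", "dur": 42,
   "args": {"ort_correlation_id": "1001", "ort_op_name": "MatMul", "ort_node_index": "7"}},
  {"cat": "Kernel", "name": "epilogue_kernel", "dur": 8,
   "args": {"ort_correlation_id": "1001", "ort_op_name": "MatMul", "ort_node_index": "7"}},
  {"cat": "Kernel", "name": "memset_kernel", "dur": 5,
   "args": {"ort_correlation_id": "1000"}}
]
""")

totals = gpu_time_by_node(sample_trace)
```

Events that carry only `ort_correlation_id` are skipped here. A consumer that needs them could instead join on `ort_correlation_id`, which every GPU event carries.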