From 1435f28866de3076a90740a2838718ae0115dc81 Mon Sep 17 00:00:00 2001
From: Tianlei Wu
Date: Thu, 21 May 2026 12:26:38 -0700
Subject: [PATCH] feat(cuda-plugin): add per-node attribution and explicit GPU→ORT linkage to profiler
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Wire up the StopEvent callback to read NODE-category ORT profiling events
via the 1.25 OrtEpApi::ProfilingEvent_* accessors, capturing the event
name, op_name and node_index into a correlation → OrtNodeInfo map. In
EndProfiling, stamp every GPU event with ort_correlation_id (always) and
ort_event_name / ort_op_name / ort_node_index (when the map lookup hits).

This resolves the two known limitations in the CUDA Plugin EP profiler:
- GPU→ORT event linkage was implicit (timestamp proximity only)
- No per-node attribution for GPU kernel events

No new ORT C API surface is required.
---
 docs/cuda_plugin_ep/cuda_plugin_ep_design.md  |  46 +++++++-
 .../cuda/plugin/cuda_profiler_plugin.cc       | 106 +++++++++++++++++-
 .../cuda/plugin/cuda_profiler_plugin.h        |  24 ++++
 .../transformers/test_cuda_plugin_ep.py       |  39 +++++++
 4 files changed, 204 insertions(+), 11 deletions(-)

diff --git a/docs/cuda_plugin_ep/cuda_plugin_ep_design.md b/docs/cuda_plugin_ep/cuda_plugin_ep_design.md
index 15f8188505b37..78914019e252c 100644
--- a/docs/cuda_plugin_ep/cuda_plugin_ep_design.md
+++ b/docs/cuda_plugin_ep/cuda_plugin_ep_design.md
@@ -872,8 +872,9 @@ The plugin API's `StartEvent`/`StopEvent` receive **absolute epoch-based** corre
 When ORT calls `EndProfiling`:
 1. CUPTI activity buffers are flushed (`cuptiActivityFlushAll`).
 2. GPU activity records are processed — kernel names, timestamps, durations, and stream/grid metadata are extracted.
-3. Events are converted to `Ort::ProfilingEvent` instances with `OrtProfilingEventCategory_KERNEL`.
-4. Events are appended to the `OrtProfilingEventsContainer` via `AddEvents`.
+3. The plugin runs an **annotation pass**: while flattening the per-correlation-ID event buckets returned by `CUPTIManager::Consume`, it stamps each GPU event with an explicit `ort_correlation_id` arg, and — when the correlation ID matches a NODE-category ORT event captured during `StopEvent` — also stamps `ort_event_name`, `ort_op_name`, and `ort_node_index`. See §14.6 for the lifecycle of the correlation-to-node map.
+4. Events are converted to `Ort::ProfilingEvent` instances with `OrtProfilingEventCategory_KERNEL`.
+5. Events are appended to the `OrtProfilingEventsContainer` via `AddEvents`.
 
 The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only, and the `PluginEpProfiler` bridge on the ORT side likewise appends EP events to ORT's profiling event collection without merge/sort by timestamp or correlation ID. Any ordering or interleaving into a global timeline is handled by downstream trace consumers.
 
@@ -883,11 +884,44 @@ The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUPro
 |--------|----------------|----------------|
 | Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge appends only, and trace consumers handle ordering |
 | Correlation IDs | Relative → absolute conversion in `GPUTracerManager::PushCorrelation` | Bridge provides absolute IDs directly; plugin pushes to CUPTI as-is |
-| `StopEvent` metadata | Ignored (just pops correlation) | ORT event metadata available; currently unused, can annotate GPU events in future |
-| GPU→ORT event linkage | Implicit via CUPTI external correlation IDs merged into timeline | GPU events carry only CUPTI metadata (`stream`, `grid_*`, `block_*`); no ORT correlation or parent identifier is attached. Downstream consumers must relate GPU kernels to ORT nodes via timestamp proximity. This is a known limitation; future work may attach `correlation_id` or parent event name via `StopEvent`'s `OrtProfilingEvent` parameter |
+| `StopEvent` metadata | Ignored (just pops correlation) | Reads category, name, `op_name`, and `node_index` via `OrtEpApi::ProfilingEvent_*` for NODE-category events and records them in a correlation → node-info map (see §14.6) |
+| GPU→ORT event linkage | Implicit via CUPTI external correlation IDs merged into timeline | Explicit — every GPU event carries `ort_correlation_id`; node-attributed events additionally carry `ort_event_name`, `ort_op_name`, and `ort_node_index` |
 | Singleton scope | Process-wide `CUPTIManager` in main ORT DLL | DLL-local `CUPTIManager` in plugin (process isolation) |
 
-### 14.6 Build Configuration
+### 14.6 Per-Node Attribution
+
+The plugin annotates GPU events with the identity of the ORT graph node that triggered them, so consumers can answer "which node ran this kernel?" without timestamp-proximity heuristics.
+
+**Map lifecycle.** `CudaPluginEpProfiler` holds an `std::unordered_map<uint64_t, OrtNodeInfo>` keyed by the absolute, epoch-based correlation ID the bridge passes to `StartEvent`/`StopEvent`. The map is guarded by `std::mutex node_info_mutex_` because ORT may execute nodes on multiple inter-op threads concurrently:
+
+1. `StopEventImpl` is called once per ORT event. For each `OrtProfilingEvent` it queries `ProfilingEvent_GetCategory`; if the category is `OrtProfilingEventCategory_NODE`, it reads the event name (`ProfilingEvent_GetName`) plus the `op_name` and `node_index` args (`ProfilingEvent_GetArgValue`) and inserts an `OrtNodeInfo` under the correlation-ID key. Accessor failures release any returned `OrtStatus*` and continue; the event simply falls back to `ort_correlation_id`-only linkage. The CUPTI external-correlation pop is always performed regardless of accessor outcome.
+2. `EndProfilingImpl` swaps the map into a local container under the mutex (so subsequent lookups during event flattening run lock-free), then iterates the per-correlation buckets returned by `CUPTIManager::Consume`. For each bucket it stringifies the correlation ID once and, for every GPU event in the bucket, appends `ort_correlation_id` plus — if the lookup hit — `ort_event_name`, `ort_op_name`, and `ort_node_index`. The arg strings are copied by `OrtEpApi::ProfilingEventsContainer_AddEvents`, so local-scope storage is sufficient.
+
+**Why NODE-only.** CUPTI external correlation IDs are pushed in `StartEvent` and popped in `StopEvent` for *all* event categories, so non-NODE events (e.g. `SESSION` or `API`) can still produce attributed GPU activity buckets. Filtering to `OrtProfilingEventCategory_NODE` in `StopEventImpl` means only graph-node executions populate the map — GPU events captured under, say, a session-init scope carry just `ort_correlation_id` and no `ort_op_name`. This keeps the annotation precise: an `ort_op_name` value always corresponds to an actual ONNX op type.
+
+**Example.** A GPU kernel event for a `MatMul` node at graph index 7 now looks like (Chrome trace JSON):
+
+```json
+{
+  "cat": "Kernel",
+  "name": "ampere_sgemm_64x32_nn",
+  "ts": 1234567,
+  "dur": 42,
+  "args": {
+    "stream": "0x55ab…",
+    "grid_x": "12",
+    "block_x": "32",
+    "ort_correlation_id": "1718000000000000007",
+    "ort_event_name": "MatMul_0_kernel_time",
+    "ort_op_name": "MatMul",
+    "ort_node_index": "7"
+  }
+}
+```
+
+The `ort_*`-prefixed keys are chosen to avoid colliding with existing CUPTI arg names (`stream`, `grid_*`, `block_*`, `name`, `correlation_id`).
+
+### 14.7 Build Configuration
 
 CUPTI profiling is conditional:
 - **CMake flag**: `onnxruntime_ENABLE_CUDA_PROFILING=ON`
@@ -897,7 +931,7 @@ CUPTI profiling is conditional:
 
 When profiling is disabled (default), `CudaEp::CreateProfiler` is set to `nullptr` and no CUPTI code is compiled.
 
-### 14.7 Files
+### 14.8 Files
 
 | File | Role |
 |------|------|
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc b/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc
index 9e8e973028f36..4cfb80f3bcb82 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc
@@ -6,7 +6,10 @@
 #if defined(ENABLE_CUDA_PROFILING)
 
 #include <map>
+#include <mutex>
 #include <string>
+#include <unordered_map>
+#include <utility>
 #include <vector>
 
 namespace onnxruntime {
@@ -87,14 +90,59 @@ OrtStatus* ORT_API_CALL CudaPluginEpProfiler::StartEventImpl(
 
 /*static*/
 OrtStatus* ORT_API_CALL CudaPluginEpProfiler::StopEventImpl(
-    OrtEpProfilerImpl* /*this_ptr*/,
-    uint64_t /*ort_event_correlation_id*/,
-    const OrtProfilingEvent* /*ort_event*/) noexcept {
+    OrtEpProfilerImpl* this_ptr,
+    uint64_t ort_event_correlation_id,
+    const OrtProfilingEvent* ort_event) noexcept {
   EXCEPTION_TO_STATUS_BEGIN
+  auto* self = static_cast<CudaPluginEpProfiler*>(this_ptr);
+  // Always pop the CUPTI external correlation push performed in StartEvent,
+  // regardless of category — even if metadata extraction below partially fails.
   auto& manager = profiling::CUPTIManager::GetInstance();
   manager.PopCorrelation();
 
+  // For NODE_EVENT events, capture the originating node's identity now so that
+  // EndProfiling can annotate the GPU kernel/memcpy events produced under this
+  // correlation ID. Accessor failures are non-fatal: we simply skip annotation
+  // for this event and rely on ort_correlation_id alone for linkage.
+  if (ort_event != nullptr) {
+    const auto& api = self->ep_api;
+
+    OrtProfilingEventCategory category = OrtProfilingEventCategory_KERNEL;
+    if (OrtStatus* s = api.ProfilingEvent_GetCategory(ort_event, &category); s != nullptr) {
+      Ort::GetApi().ReleaseStatus(s);
+      return nullptr;
+    }
+
+    if (category == OrtProfilingEventCategory_NODE) {
+      OrtNodeInfo info;
+
+      const char* event_name = nullptr;
+      if (OrtStatus* s = api.ProfilingEvent_GetName(ort_event, &event_name); s != nullptr) {
+        Ort::GetApi().ReleaseStatus(s);
+      } else if (event_name != nullptr) {
+        info.event_name = event_name;
+      }
+
+      const char* op_name = nullptr;
+      if (OrtStatus* s = api.ProfilingEvent_GetArgValue(ort_event, "op_name", &op_name); s != nullptr) {
+        Ort::GetApi().ReleaseStatus(s);
+      } else if (op_name != nullptr) {
+        info.op_name = op_name;
+      }
+
+      const char* node_index = nullptr;
+      if (OrtStatus* s = api.ProfilingEvent_GetArgValue(ort_event, "node_index", &node_index); s != nullptr) {
+        Ort::GetApi().ReleaseStatus(s);
+      } else if (node_index != nullptr) {
+        info.node_index = node_index;
+      }
+
+      std::lock_guard<std::mutex> lock(self->node_info_mutex_);
+      self->correlation_to_node_[ort_event_correlation_id] = std::move(info);
+    }
+  }
+
   return nullptr;
   EXCEPTION_TO_STATUS_END
 }
@@ -113,22 +161,70 @@ OrtStatus* ORT_API_CALL CudaPluginEpProfiler::EndProfilingImpl(
   std::map<uint64_t, profiling::Events> event_map;
   manager.Consume(self->client_handle_, self->ort_profiling_start_, event_map);
 
+  // Snapshot the correlation→node map under lock and clear it; subsequent
+  // lookups can then run lock-free for the duration of event flattening.
+  std::unordered_map<uint64_t, OrtNodeInfo> node_info;
+  {
+    std::lock_guard<std::mutex> lock(self->node_info_mutex_);
+    node_info.swap(self->correlation_to_node_);
+  }
+
   // Flatten all GPU events and convert to OrtProfilingEvent.
   std::vector<Ort::ProfilingEvent> events;
   for (auto& kv : event_map) {
+    const uint64_t correlation_id = kv.first;
     auto& event_list = kv.second;
+
+    // Resolve ORT-side attribution for this correlation ID (if any).
+    const OrtNodeInfo* info = nullptr;
+    if (auto it = node_info.find(correlation_id); it != node_info.end()) {
+      info = &it->second;
+    }
+
+    // Stringify correlation ID once per outer iteration; storage must outlive
+    // every Ort::ProfilingEvent constructor call below. The constructor copies
+    // these strings into the container (see ProfilingEventsContainer_AddEvents),
+    // so per-record local storage would also work, but lifting it here avoids
+    // redundant work.
+    const std::string correlation_id_str = std::to_string(correlation_id);
+
     for (const auto& record : event_list) {
       // Build parallel key/value arrays to use the raw-pointer ProfilingEvent
       // constructor, avoiding a copy from InlinedHashMap to std::unordered_map.
+      // Reserve enough headroom for the CUPTI args plus up to 4 ORT annotations
+      // (ort_correlation_id always; ort_event_name / ort_op_name / ort_node_index
+      // when ORT-side metadata is available).
       InlinedVector<const char*> arg_keys;
       InlinedVector<const char*> arg_values;
-      arg_keys.reserve(record.args.size());
-      arg_values.reserve(record.args.size());
+      arg_keys.reserve(record.args.size() + 4);
+      arg_values.reserve(record.args.size() + 4);
       for (const auto& [k, v] : record.args) {
         arg_keys.push_back(k.c_str());
         arg_values.push_back(v.c_str());
       }
 
+      // Always emit ort_correlation_id so consumers can join GPU events back
+      // to ORT events even when per-node attribution wasn't captured (e.g. the
+      // event came from a non-NODE category, or StopEvent ran before the GPU
+      // activity was finalized).
+      arg_keys.push_back("ort_correlation_id");
+      arg_values.push_back(correlation_id_str.c_str());
+
+      if (info != nullptr) {
+        if (!info->event_name.empty()) {
+          arg_keys.push_back("ort_event_name");
+          arg_values.push_back(info->event_name.c_str());
+        }
+        if (!info->op_name.empty()) {
+          arg_keys.push_back("ort_op_name");
+          arg_values.push_back(info->op_name.c_str());
+        }
+        if (!info->node_index.empty()) {
+          arg_keys.push_back("ort_node_index");
+          arg_values.push_back(info->node_index.c_str());
+        }
+      }
+
       events.emplace_back(
           OrtProfilingEventCategory_KERNEL,
           record.pid,
diff --git a/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h b/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h
index 77460a40341ac..fe4e5854e2686 100644
--- a/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h
+++ b/onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h
@@ -5,6 +5,10 @@
 
 #if defined(ENABLE_CUDA_PROFILING)
 
+#include <mutex>
+#include <string>
+#include <unordered_map>
+
 #include "cuda_plugin_utils.h"
 #include "core/providers/cuda/cupti_manager.h"
 #include "core/common/gpu_profiler_common.h"
@@ -12,6 +16,15 @@
 namespace onnxruntime {
 namespace cuda_plugin {
 
+/// Per-node ORT profiling metadata captured during StopEvent and used in
+/// EndProfiling to annotate CUPTI-captured GPU events with explicit
+/// ORT-side attribution (node name, op type, node index).
+struct OrtNodeInfo {
+  std::string event_name;  ///< Full ORT event name (e.g. "<node_name>_kernel_time").
+  std::string op_name;     ///< ONNX op type for the node, if available.
+  std::string node_index;  ///< Node index in the graph as a decimal string, if available.
+};
+
 /// Plugin-side implementation of OrtEpProfilerImpl for CUDA.
 /// Delegates to CUPTIManager (within the plugin DLL) for GPU activity tracing
 /// and implements the C callback interface expected by ORT's PluginEpProfiler bridge.
@@ -20,6 +33,17 @@ struct CudaPluginEpProfiler : OrtEpProfilerImpl {
   uint64_t client_handle_ = 0;
   TimePoint ort_profiling_start_;
 
+  // Maps the absolute, epoch-based ORT event correlation ID for a NODE_EVENT
+  // (as passed to StartEvent/StopEvent) to the originating node's identity.
+  // Populated in StopEventImpl and drained in EndProfilingImpl, where the
+  // entries are joined against CUPTI-captured GPU events to attribute each
+  // GPU kernel back to a specific ORT graph node.
+  //
+  // Different ORT events may run on different threads (inter-op parallelism),
+  // so map access is protected by node_info_mutex_.
+  std::mutex node_info_mutex_;
+  std::unordered_map<uint64_t, OrtNodeInfo> correlation_to_node_;
+
   explicit CudaPluginEpProfiler(const OrtEpApi& api);
   ~CudaPluginEpProfiler();
 
diff --git a/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py b/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
index c03545fc31435..cdefa877152b6 100644
--- a/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
+++ b/onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
@@ -2455,6 +2455,7 @@ def _run_profiling_test(self):
         # If GPU kernel events are present, validate their metadata.
         kernel_events = [e for e in profile_data if isinstance(e, dict) and e.get("cat") == "Kernel"]
         if kernel_events:
+            saw_matmul_attribution = False
             for event in kernel_events:
                 self.assertIn("ts", event)
                 self.assertIn("dur", event)
@@ -2462,6 +2463,44 @@ def _run_profiling_test(self):
                 args = event.get("args", {})
                 self.assertIn("stream", args, f"GPU kernel event missing 'stream': {event}")
                 self.assertIn("block_x", args, f"GPU kernel event missing 'block_x': {event}")
+
+                # Every GPU kernel event must carry an explicit ORT correlation ID
+                # so consumers can join it back to the originating ORT event without
+                # relying on timestamp-proximity heuristics.
+                self.assertIn(
+                    "ort_correlation_id",
+                    args,
+                    f"GPU kernel event missing 'ort_correlation_id': {event}",
+                )
+                self.assertTrue(
+                    args["ort_correlation_id"].isdigit(),
+                    f"'ort_correlation_id' must be a decimal string: {event}",
+                )
+
+                # Per-node attribution is best-effort (only NODE-category ORT events
+                # populate the map). When present, validate the four annotation keys.
+                if "ort_op_name" in args:
+                    self.assertIn("ort_event_name", args)
+                    self.assertIn("ort_node_index", args)
+                    self.assertTrue(
+                        args["ort_event_name"].endswith("_kernel_time"),
+                        f"'ort_event_name' should end with '_kernel_time': {event}",
+                    )
+                    self.assertTrue(
+                        args["ort_node_index"].isdigit(),
+                        f"'ort_node_index' must be a decimal string: {event}",
+                    )
+                    if args["ort_op_name"] == "MatMul":
+                        saw_matmul_attribution = True
+
+            # The test model contains exactly one MatMul node assigned to the plugin EP;
+            # if any per-node attribution landed, it must have been for MatMul.
+            any_op_attribution = any("ort_op_name" in (e.get("args") or {}) for e in kernel_events)
+            if any_op_attribution:
+                self.assertTrue(
+                    saw_matmul_attribution,
+                    "Expected at least one GPU kernel event attributed to a MatMul node.",
+                )
         else:
             print("Note: No GPU Kernel events in profile (CUPTI may not be available).")
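
Note for trace consumers (not part of the patch): with these args in place, GPU kernel time can be rolled up per ORT node with a plain key lookup instead of a timestamp heuristic. The following is a minimal sketch under stated assumptions. It assumes the profile is a list of Chrome-trace event dicts shaped like the ones the test above inspects. The function name `group_kernels_by_node`, the sample events, the `example_unattributed_kernel` name, and every duration are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_kernels_by_node(events):
    """Sum GPU kernel durations per attributed ORT node.

    Returns (per_node, unattributed):
      per_node     maps (ort_node_index, ort_op_name) -> summed 'dur'
      unattributed maps ort_correlation_id -> summed 'dur' for kernels that
                   carry only the correlation ID (no per-node annotation)
    """
    per_node = defaultdict(int)
    unattributed = defaultdict(int)
    for event in events:
        # Only GPU kernel events are of interest here.
        if not isinstance(event, dict) or event.get("cat") != "Kernel":
            continue
        args = event.get("args") or {}
        dur = int(event.get("dur", 0))
        if "ort_op_name" in args and "ort_node_index" in args:
            key = (int(args["ort_node_index"]), args["ort_op_name"])
            per_node[key] += dur
        elif "ort_correlation_id" in args:
            unattributed[args["ort_correlation_id"]] += dur
    return dict(per_node), dict(unattributed)


# Hypothetical sample events; all values are made up for illustration.
matmul_args = {
    "ort_correlation_id": "1718000000000000007",
    "ort_event_name": "MatMul_0_kernel_time",
    "ort_op_name": "MatMul",
    "ort_node_index": "7",
}
sample_events = [
    {"cat": "Node", "name": "MatMul_0_kernel_time", "dur": 60},
    {"cat": "Kernel", "name": "ampere_sgemm_64x32_nn", "dur": 42, "args": dict(matmul_args)},
    {"cat": "Kernel", "name": "ampere_sgemm_64x32_nn", "dur": 8, "args": dict(matmul_args)},
    {"cat": "Kernel", "name": "example_unattributed_kernel", "dur": 5,
     "args": {"ort_correlation_id": "1718000000000000001"}},
]

per_node, unattributed = group_kernels_by_node(sample_events)
print(per_node)
print(unattributed)
```

Kernels that carry only `ort_correlation_id` are kept in a separate bucket rather than dropped, which mirrors the best-effort nature of the per-node annotation described in the design doc.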