46 changes: 40 additions & 6 deletions docs/cuda_plugin_ep/cuda_plugin_ep_design.md
@@ -872,8 +872,9 @@ The plugin API's `StartEvent`/`StopEvent` receive **absolute epoch-based** corre
When ORT calls `EndProfiling`:
1. CUPTI activity buffers are flushed (`cuptiActivityFlushAll`).
2. GPU activity records are processed — kernel names, timestamps, durations, and stream/grid metadata are extracted.
3. Events are converted to `Ort::ProfilingEvent` instances with `OrtProfilingEventCategory_KERNEL`.
4. Events are appended to the `OrtProfilingEventsContainer` via `AddEvents`.
3. The plugin runs an **annotation pass**: while flattening the per-correlation-ID event buckets returned by `CUPTIManager::Consume`, it stamps each GPU event with an explicit `ort_correlation_id` arg, and — when the correlation ID matches a NODE-category ORT event captured during `StopEvent` — also stamps `ort_event_name`, `ort_op_name`, and `ort_node_index`. See §14.6 for the lifecycle of the correlation-to-node map.
4. Events are converted to `Ort::ProfilingEvent` instances with `OrtProfilingEventCategory_KERNEL`.
5. Events are appended to the `OrtProfilingEventsContainer` via `AddEvents`.

The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUProfilerBase::EndProfiling` does. The plugin API is append-only, and the `PluginEpProfiler` bridge on the ORT side likewise appends EP events to ORT's profiling event collection without merge/sort by timestamp or correlation ID. Any ordering or interleaving into a global timeline is handled by downstream trace consumers.
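
Because neither the plugin nor the bridge orders events, a trace consumer that wants a single timeline has to sort the combined list itself. The following is a minimal sketch of that step, assuming the profile is a Chrome-trace-style JSON list of dicts with a numeric `ts` field. The `build_timeline` helper and the sample events are hypothetical and not part of ORT.

```python
def build_timeline(profile_events):
    """Return trace events ordered by start timestamp.

    Entries without a numeric "ts" (for example metadata records) are
    dropped here; a real consumer may prefer to keep them separately.
    """
    timed = [
        e for e in profile_events
        if isinstance(e, dict) and isinstance(e.get("ts"), (int, float))
    ]
    # sorted() is stable, so events with equal timestamps keep append order.
    return sorted(timed, key=lambda e: e["ts"])


# Hypothetical append-order profile: ORT events first, EP (GPU) events after.
profile = [
    {"cat": "Session", "name": "model_run", "ts": 100, "dur": 400},
    {"cat": "Node", "name": "MatMul_0_kernel_time", "ts": 150, "dur": 60},
    {"cat": "Node", "name": "Add_1_kernel_time", "ts": 300, "dur": 30},
    {"cat": "Kernel", "name": "ampere_sgemm_64x32_nn", "ts": 160, "dur": 42},
    {"cat": "Kernel", "name": "add_kernel", "ts": 310, "dur": 5},
]
timeline = build_timeline(profile)
```

Sorting on `ts` alone only interleaves the events; it does not say which node launched which kernel.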

@@ -883,11 +884,44 @@ The plugin does **not** perform the post-hoc merge/sort that the in-tree `GPUPro
|--------|----------------|----------------|
| Event merge | `GPUProfilerBase::MergeEvents` interleaves GPU events into ORT's array (has known sort-order bug) | Append-only; ORT-side bridge appends only, and trace consumers handle ordering |
| Correlation IDs | Relative → absolute conversion in `GPUTracerManager::PushCorrelation` | Bridge provides absolute IDs directly; plugin pushes to CUPTI as-is |
| `StopEvent` metadata | Ignored (just pops correlation) | ORT event metadata available; currently unused, can annotate GPU events in future |
| GPU→ORT event linkage | Implicit via CUPTI external correlation IDs merged into timeline | GPU events carry only CUPTI metadata (`stream`, `grid_*`, `block_*`); no ORT correlation or parent identifier is attached. Downstream consumers must relate GPU kernels to ORT nodes via timestamp proximity. This is a known limitation; future work may attach `correlation_id` or parent event name via `StopEvent`'s `OrtProfilingEvent` parameter |
| `StopEvent` metadata | Ignored (just pops correlation) | Reads category, name, `op_name`, and `node_index` via `OrtEpApi::ProfilingEvent_*` for NODE-category events and records them in a correlation → node-info map (see §14.6) |
| GPU→ORT event linkage | Implicit via CUPTI external correlation IDs merged into timeline | Explicit — every GPU event carries `ort_correlation_id`; node-attributed events additionally carry `ort_event_name`, `ort_op_name`, and `ort_node_index` |
| Singleton scope | Process-wide `CUPTIManager` in main ORT DLL | DLL-local `CUPTIManager` in plugin (process isolation) |

### 14.6 Build Configuration
### 14.6 Per-Node Attribution

The plugin annotates GPU events with the identity of the ORT graph node that triggered them, so consumers can answer "which node ran this kernel?" without timestamp-proximity heuristics.

**Map lifecycle.** `CudaPluginEpProfiler` holds an `std::unordered_map<uint64_t, OrtNodeInfo>` keyed by the absolute, epoch-based correlation ID the bridge passes to `StartEvent`/`StopEvent`. The map is guarded by `std::mutex node_info_mutex_` because ORT may execute nodes on multiple inter-op threads concurrently:

1. `StopEventImpl` is called once per ORT event. It first pops the CUPTI external correlation ID, so the pop happens regardless of what follows. If an `OrtProfilingEvent` was supplied, it queries `ProfilingEvent_GetCategory`; if the category is `OrtProfilingEventCategory_NODE`, it reads the event name (`ProfilingEvent_GetName`) plus the `op_name` and `node_index` args (`ProfilingEvent_GetArgValue`) and inserts an `OrtNodeInfo` under the correlation-ID key. Accessor failures are non-fatal and the returned `OrtStatus*` is released: a failed category query skips the map insert, so the event falls back to `ort_correlation_id`-only linkage, while a failed name or arg query leaves just that field empty.
2. `EndProfilingImpl` swaps the map into a local container under the mutex (so subsequent lookups during event flattening run lock-free), then iterates the per-correlation buckets returned by `CUPTIManager::Consume`. For each bucket it stringifies the correlation ID once and, for every GPU event in the bucket, appends `ort_correlation_id` plus — if the lookup hit — `ort_event_name`, `ort_op_name`, and `ort_node_index`. The arg strings are copied by `OrtEpApi::ProfilingEventsContainer_AddEvents`, so local-scope storage is sufficient.
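
The two steps above can be modeled compactly outside of C++. The sketch below is a Python model of the lifecycle, not the plugin's code: the class name, method names, string category values, and sample events are hypothetical stand-ins, and only the flow (NODE-only insert under a lock, swap under the lock, lock-free stamping per bucket) follows the description.

```python
import threading


class NodeAttributionModel:
    """Illustrative model of the correlation -> node-info map lifecycle."""

    def __init__(self):
        self._lock = threading.Lock()
        self._correlation_to_node = {}

    def stop_event(self, correlation_id, category, name, args):
        # Only NODE-category events populate the map.
        if category != "NODE":
            return
        info = {
            "ort_event_name": name,
            "ort_op_name": args.get("op_name", ""),
            "ort_node_index": args.get("node_index", ""),
        }
        with self._lock:
            self._correlation_to_node[correlation_id] = info

    def end_profiling(self, buckets):
        # Swap the map out under the lock; the lookups below need no locking.
        with self._lock:
            node_info, self._correlation_to_node = self._correlation_to_node, {}

        flattened = []
        # Iterate in key order, mirroring the ordered map used for the buckets.
        for correlation_id, gpu_events in sorted(buckets.items()):
            correlation_str = str(correlation_id)  # stringify once per bucket
            info = node_info.get(correlation_id)
            for gpu_event in gpu_events:
                event_args = dict(gpu_event["args"])
                event_args["ort_correlation_id"] = correlation_str
                if info is not None:
                    # Empty fields are skipped rather than emitted.
                    event_args.update({k: v for k, v in info.items() if v})
                flattened.append({**gpu_event, "args": event_args})
        return flattened


model = NodeAttributionModel()
model.stop_event(7, "NODE", "MatMul_0_kernel_time", {"op_name": "MatMul", "node_index": "7"})
model.stop_event(8, "SESSION", "model_run", {})
events = model.end_profiling({
    8: [{"name": "memcpy", "args": {"stream": "1"}}],
    7: [{"name": "sgemm", "args": {"stream": "1"}}],
})
```

In this model the bucket for correlation ID 7 receives all four `ort_*` args, while the bucket for ID 8 receives only `ort_correlation_id`, and the map is empty again after `end_profiling` returns.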

**Why NODE-only.** CUPTI external correlation IDs are pushed in `StartEvent` and popped in `StopEvent` for *all* event categories, so non-NODE events (e.g. `SESSION` or `API`) can still produce GPU activity buckets keyed by their correlation IDs. Filtering to `OrtProfilingEventCategory_NODE` in `StopEventImpl` means only graph-node executions populate the map; GPU events captured under, say, a session-init scope carry just `ort_correlation_id` and no `ort_op_name`. This keeps the annotation precise: an `ort_op_name` value always corresponds to an actual ONNX op type.
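
On the consumer side, the NODE-only rule means the presence of `ort_op_name` can serve as the test for node attribution. Here is a minimal sketch, assuming kernel events are dicts shaped like Chrome trace entries; `split_by_attribution` and the sample data are hypothetical.

```python
def split_by_attribution(kernel_events):
    """Partition GPU kernel events into node-attributed and correlation-only lists."""
    node_attributed, correlation_only = [], []
    for event in kernel_events:
        args = event.get("args") or {}
        if "ort_op_name" in args:
            node_attributed.append(event)
        else:
            correlation_only.append(event)
    return node_attributed, correlation_only


# Hypothetical events: one launched under a node, one under a session-level scope.
kernels = [
    {"name": "sgemm", "args": {"ort_correlation_id": "7", "ort_op_name": "MatMul"}},
    {"name": "memcpy", "args": {"ort_correlation_id": "8"}},
]
attributed, unattributed = split_by_attribution(kernels)
```

A NODE-category event whose `op_name` accessor failed also lands in the second list, so that list should be read as "no op type known" rather than "definitely not a node".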

**Example.** A GPU kernel event for a `MatMul` node at graph index 7 now looks like (Chrome trace JSON):

```json
{
"cat": "Kernel",
"name": "ampere_sgemm_64x32_nn",
"ts": 1234567,
"dur": 42,
"args": {
"stream": "0x55ab…",
"grid_x": "12",
"block_x": "32",
"ort_correlation_id": "1718000000000000007",
"ort_event_name": "MatMul_0_kernel_time",
"ort_op_name": "MatMul",
"ort_node_index": "7"
}
}
```

The `ort_*`-prefixed keys are chosen to avoid colliding with existing CUPTI arg names (`stream`, `grid_*`, `block_*`, `name`, `correlation_id`).
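
With these keys in place, per-node GPU time can be computed by a plain group-by instead of timestamp matching. The sketch below assumes the profile is a JSON list of Chrome trace events as in the example above; the `kernel_time_per_node` helper and the sample payload are hypothetical.

```python
import json
from collections import defaultdict


def kernel_time_per_node(profile_events):
    """Sum GPU kernel durations per (node index, op name) pair."""
    totals = defaultdict(int)
    for event in profile_events:
        if not isinstance(event, dict) or event.get("cat") != "Kernel":
            continue
        args = event.get("args") or {}
        if "ort_node_index" not in args:
            continue  # correlation-only event: no node to charge it to
        key = (int(args["ort_node_index"]), args.get("ort_op_name", ""))
        totals[key] += event.get("dur", 0)
    return dict(totals)


# Hypothetical profile: two kernels from node 7, one unattributed kernel, one ORT event.
sample = json.loads("""
[
  {"cat": "Kernel", "name": "k1", "ts": 10, "dur": 42,
   "args": {"ort_correlation_id": "7", "ort_op_name": "MatMul", "ort_node_index": "7"}},
  {"cat": "Kernel", "name": "k2", "ts": 60, "dur": 10,
   "args": {"ort_correlation_id": "7", "ort_op_name": "MatMul", "ort_node_index": "7"}},
  {"cat": "Kernel", "name": "k3", "ts": 80, "dur": 5,
   "args": {"ort_correlation_id": "8"}},
  {"cat": "Node", "name": "MatMul_0_kernel_time", "ts": 5, "dur": 60, "args": {}}
]
""")
per_node = kernel_time_per_node(sample)
```

The arg values are strings in the trace, so `ort_node_index` is converted with `int()` before it is used as a key.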

### 14.7 Build Configuration

CUPTI profiling is conditional:
- **CMake flag**: `onnxruntime_ENABLE_CUDA_PROFILING=ON`
@@ -897,7 +931,7 @@ CUPTI profiling is conditional:

When profiling is disabled (default), `CudaEp::CreateProfiler` is set to `nullptr` and no CUPTI code is compiled.

### 14.7 Files
### 14.8 Files

| File | Role |
|------|------|
106 changes: 101 additions & 5 deletions onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc
@@ -6,7 +6,10 @@
#if defined(ENABLE_CUDA_PROFILING)

#include <map>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

namespace onnxruntime {
@@ -87,14 +90,59 @@ OrtStatus* ORT_API_CALL CudaPluginEpProfiler::StartEventImpl(

/*static*/
OrtStatus* ORT_API_CALL CudaPluginEpProfiler::StopEventImpl(
OrtEpProfilerImpl* /*this_ptr*/,
uint64_t /*ort_event_correlation_id*/,
const OrtProfilingEvent* /*ort_event*/) noexcept {
OrtEpProfilerImpl* this_ptr,
uint64_t ort_event_correlation_id,
const OrtProfilingEvent* ort_event) noexcept {
EXCEPTION_TO_STATUS_BEGIN
auto* self = static_cast<CudaPluginEpProfiler*>(this_ptr);

// Always pop the CUPTI external correlation push performed in StartEvent,
// regardless of category — even if metadata extraction below partially fails.
auto& manager = profiling::CUPTIManager::GetInstance();
manager.PopCorrelation();

// For NODE-category events, capture the originating node's identity now so that
// EndProfiling can annotate the GPU kernel/memcpy events produced under this
// correlation ID. Accessor failures are non-fatal: we simply skip annotation
// for this event and rely on ort_correlation_id alone for linkage.
if (ort_event != nullptr) {
const auto& api = self->ep_api;

OrtProfilingEventCategory category = OrtProfilingEventCategory_KERNEL;
if (OrtStatus* s = api.ProfilingEvent_GetCategory(ort_event, &category); s != nullptr) {
Ort::GetApi().ReleaseStatus(s);
return nullptr;
}

if (category == OrtProfilingEventCategory_NODE) {
OrtNodeInfo info;

const char* event_name = nullptr;
if (OrtStatus* s = api.ProfilingEvent_GetName(ort_event, &event_name); s != nullptr) {
Ort::GetApi().ReleaseStatus(s);
} else if (event_name != nullptr) {
info.event_name = event_name;
}

const char* op_name = nullptr;
if (OrtStatus* s = api.ProfilingEvent_GetArgValue(ort_event, "op_name", &op_name); s != nullptr) {
Ort::GetApi().ReleaseStatus(s);
} else if (op_name != nullptr) {
info.op_name = op_name;
}

const char* node_index = nullptr;
if (OrtStatus* s = api.ProfilingEvent_GetArgValue(ort_event, "node_index", &node_index); s != nullptr) {
Ort::GetApi().ReleaseStatus(s);
} else if (node_index != nullptr) {
info.node_index = node_index;
}

std::lock_guard<std::mutex> lock(self->node_info_mutex_);
self->correlation_to_node_[ort_event_correlation_id] = std::move(info);
}
}

return nullptr;
EXCEPTION_TO_STATUS_END
}
@@ -113,22 +161,70 @@ OrtStatus* ORT_API_CALL CudaPluginEpProfiler::EndProfilingImpl(
std::map<uint64_t, profiling::Events> event_map;
manager.Consume(self->client_handle_, self->ort_profiling_start_, event_map);

// Snapshot the correlation→node map under lock and clear it; subsequent
// lookups can then run lock-free for the duration of event flattening.
std::unordered_map<uint64_t, OrtNodeInfo> node_info;
{
std::lock_guard<std::mutex> lock(self->node_info_mutex_);
node_info.swap(self->correlation_to_node_);
}

// Flatten all GPU events and convert to OrtProfilingEvent.
std::vector<Ort::ProfilingEvent> events;
for (auto& kv : event_map) {
const uint64_t correlation_id = kv.first;
auto& event_list = kv.second;

// Resolve ORT-side attribution for this correlation ID (if any).
const OrtNodeInfo* info = nullptr;
if (auto it = node_info.find(correlation_id); it != node_info.end()) {
info = &it->second;
}

// Stringify correlation ID once per outer iteration; storage must outlive
// every Ort::ProfilingEvent constructor call below. The constructor copies
// these strings into the container (see ProfilingEventsContainer_AddEvents),
// so per-record local storage would also work, but lifting it here avoids
// redundant work.
const std::string correlation_id_str = std::to_string(correlation_id);

for (const auto& record : event_list) {
// Build parallel key/value arrays to use the raw-pointer ProfilingEvent
// constructor, avoiding a copy from InlinedHashMap to std::unordered_map.
// Reserve enough headroom for the CUPTI args plus up to 4 ORT annotations
// (ort_correlation_id always; ort_event_name / ort_op_name / ort_node_index
// when ORT-side metadata is available).
InlinedVector<const char*> arg_keys;
InlinedVector<const char*> arg_values;
arg_keys.reserve(record.args.size());
arg_values.reserve(record.args.size());
arg_keys.reserve(record.args.size() + 4);
arg_values.reserve(record.args.size() + 4);
for (const auto& [k, v] : record.args) {
arg_keys.push_back(k.c_str());
arg_values.push_back(v.c_str());
}

// Always emit ort_correlation_id so consumers can join GPU events back
// to ORT events even when per-node attribution wasn't captured (e.g. the
// event came from a non-NODE category, or StopEvent ran before the GPU
// activity was finalized).
arg_keys.push_back("ort_correlation_id");
arg_values.push_back(correlation_id_str.c_str());

if (info != nullptr) {
if (!info->event_name.empty()) {
arg_keys.push_back("ort_event_name");
arg_values.push_back(info->event_name.c_str());
}
if (!info->op_name.empty()) {
arg_keys.push_back("ort_op_name");
arg_values.push_back(info->op_name.c_str());
}
if (!info->node_index.empty()) {
arg_keys.push_back("ort_node_index");
arg_values.push_back(info->node_index.c_str());
}
}

events.emplace_back(
OrtProfilingEventCategory_KERNEL,
record.pid,
24 changes: 24 additions & 0 deletions onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h
@@ -5,13 +5,26 @@

#if defined(ENABLE_CUDA_PROFILING)

#include <mutex>
#include <string>
#include <unordered_map>

#include "cuda_plugin_utils.h"
#include "core/providers/cuda/cupti_manager.h"
#include "core/common/gpu_profiler_common.h"

namespace onnxruntime {
namespace cuda_plugin {

/// Per-node ORT profiling metadata captured during StopEvent and used in
/// EndProfiling to annotate CUPTI-captured GPU events with explicit
/// ORT-side attribution (node name, op type, node index).
struct OrtNodeInfo {
std::string event_name; ///< Full ORT event name (e.g. "<node>_kernel_time").
std::string op_name; ///< ONNX op type for the node, if available.
std::string node_index; ///< Node index in the graph as a decimal string, if available.
};

/// Plugin-side implementation of OrtEpProfilerImpl for CUDA.
/// Delegates to CUPTIManager (within the plugin DLL) for GPU activity tracing
/// and implements the C callback interface expected by ORT's PluginEpProfiler bridge.
@@ -20,6 +33,17 @@ struct CudaPluginEpProfiler : OrtEpProfilerImpl {
uint64_t client_handle_ = 0;
TimePoint ort_profiling_start_;

// Maps the absolute, epoch-based ORT event correlation ID for a NODE-category event
// (as passed to StartEvent/StopEvent) to the originating node's identity.
// Populated in StopEventImpl and drained in EndProfilingImpl, where the
// entries are joined against CUPTI-captured GPU events to attribute each
// GPU kernel back to a specific ORT graph node.
//
// Different ORT events may run on different threads (inter-op parallelism),
// so map access is protected by node_info_mutex_.
std::mutex node_info_mutex_;
std::unordered_map<uint64_t, OrtNodeInfo> correlation_to_node_;

explicit CudaPluginEpProfiler(const OrtEpApi& api);
~CudaPluginEpProfiler();

39 changes: 39 additions & 0 deletions onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
@@ -2455,13 +2455,52 @@ def _run_profiling_test(self):
# If GPU kernel events are present, validate their metadata.
kernel_events = [e for e in profile_data if isinstance(e, dict) and e.get("cat") == "Kernel"]
if kernel_events:
saw_matmul_attribution = False
for event in kernel_events:
self.assertIn("ts", event)
self.assertIn("dur", event)
self.assertGreaterEqual(event["dur"], 0)
args = event.get("args", {})
self.assertIn("stream", args, f"GPU kernel event missing 'stream': {event}")
self.assertIn("block_x", args, f"GPU kernel event missing 'block_x': {event}")

# Every GPU kernel event must carry an explicit ORT correlation ID
# so consumers can join it back to the originating ORT event without
# relying on timestamp-proximity heuristics.
self.assertIn(
"ort_correlation_id",
args,
f"GPU kernel event missing 'ort_correlation_id': {event}",
)
self.assertTrue(
args["ort_correlation_id"].isdigit(),
f"'ort_correlation_id' must be a decimal string: {event}",
)

# Per-node attribution is best-effort (only NODE-category ORT events
# populate the map). When `ort_op_name` is present, validate the other node-attribution keys.
if "ort_op_name" in args:
self.assertIn("ort_event_name", args)
self.assertIn("ort_node_index", args)
self.assertTrue(
args["ort_event_name"].endswith("_kernel_time"),
f"'ort_event_name' should end with '_kernel_time': {event}",
)
self.assertTrue(
args["ort_node_index"].isdigit(),
f"'ort_node_index' must be a decimal string: {event}",
)
if args["ort_op_name"] == "MatMul":
saw_matmul_attribution = True

# The test model contains exactly one MatMul node assigned to the plugin EP;
# if any per-node attribution landed, it must have been for MatMul.
any_op_attribution = any("ort_op_name" in (e.get("args") or {}) for e in kernel_events)
if any_op_attribution:
self.assertTrue(
saw_matmul_attribution,
"Expected at least one GPU kernel event attributed to a MatMul node.",
)
else:
print("Note: No GPU Kernel events in profile (CUPTI may not be available).")
