Zero-copy I/O for plugin EPs with HOST_ACCESSIBLE memory #28037
ericcraw wants to merge 2 commits into microsoft:main from
Conversation
| return provider.GetDevice().Type() == OrtDevice::CPU; | ||
| } | ||
|
|
| // Returns true if no data transfer is needed between the two devices. |
For allocations from the allocation planner I think that should be taken care of. For user-allocated I/O tensors, they should have been allocated using the appropriate host-accessible device allocators, which would need to correctly align the backing memory.
Regardless, I added a check for alignment as well.
| // Returns true if no data transfer is needed between the two devices. | ||
| // HOST_ACCESSIBLE memory is a superset — accessible by both host and device — so it can satisfy | ||
| // DEFAULT memory requirements on the same physical device without a copy. | ||
| static bool DevicesAreMemoryCompatible(const OrtDevice& a, const OrtDevice& b) { |
The function is symmetric: DevicesAreMemoryCompatible(a, b) == DevicesAreMemoryCompatible(b, a). But the underlying property is not symmetric:
| Direction | Safe? | Why |
|---|---|---|
| HOST_ACCESSIBLE → DEFAULT (same device) | Yes | HOST_ACCESSIBLE memory is in the device's address space; a device kernel can read/write it. |
| DEFAULT → HOST_ACCESSIBLE (same device) | Only for device-side access | DEFAULT (e.g. GPU global) memory is typically not CPU-readable. A consumer expecting CPU-readable HOST_ACCESSIBLE memory that receives a DEFAULT pointer will fault. |
All current call sites operate in the kernel-execution / data-transfer-planning context where only the device side reads the memory, so the symmetry is safe today. However, the function name "DevicesAreMemoryCompatible" and its doc comment don't convey this constraint. A future caller that uses it to decide whether the CPU can access a buffer (analogous to what GetPyObjFromTensor does in PR #28038) would silently get the wrong answer.
Recommendation: Either (a) add a prominent comment that the function assumes device-side access only, or (b) make it directional (CanSourceSatisfyTarget(src, tgt)) so the caller explicitly declares the access direction. Option (b) is safer long-term.
Took this suggestion and tweaked the check a bit
| // Populate device_fetches for the output-copy path. | ||
| // Reuses a pre-allocated user buffer when the memory is compatible (same device or HOST_ACCESSIBLE | ||
| // <-> DEFAULT on the same physical device); otherwise inserts an empty placeholder. | ||
| static void PopulateDeviceFetches(gsl::span<const MLValueCopyInfo> fetch_copy_info, |
If fetch_copy_info.size() < fetches.size(), indexing fetch_copy_info[i] is undefined behavior. The original inline code had the same gap, but factoring it into a reusable helper makes the contract less obvious. Add:
ORT_ENFORCE(fetch_copy_info.size() >= fetches.size());
Added the enforce.
| if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) { | ||
| // Use the host-accessible allocator device if one was registered by the plugin. | ||
| // This avoids unnecessary copies between CPU and HOST_ACCESSIBLE memory. | ||
| if (!ep_devices_.empty() && ep_devices_[0]->host_accessible_memory_info != nullptr) { |
This only inspects the first device.
For a multi-device plugin EP, this always returns the first device's host-accessible info. Acceptable for single-device EPs (the common case), but worth a comment stating the assumption, or an ORT_ENFORCE(ep_devices_.size() <= 1) if multi-device is truly unsupported here.
However, I suspect otherwise.
At the moment the OpenVINO EP allows multiple devices; we probably want to add a separate API to allow the EP to tell ORT which EP device it's using. Then ORT wouldn't have to make any assumptions.
There are no unit tests for DevicesAreMemoryCompatible. Given the function has five distinct logical branches (both CPU, CPU + HOST_ACCESSIBLE same device, CPU + HOST_ACCESSIBLE different device, HOST_ACCESSIBLE ↔ DEFAULT same device, incompatible), it should have dedicated unit tests covering each case.
Pull request overview
Adds HOST_ACCESSIBLE-aware device selection and copy-planning logic to reduce (or eliminate) unnecessary feed/fetch and boundary copies for plugin execution providers that register host-accessible memory.
Changes:
- Override PluginExecutionProvider::GetOrtDeviceByMemType to route CPU I/O mem types through a registered HOST_ACCESSIBLE allocator/device.
- Introduce DevicesAreMemoryCompatible() and apply it to feed/fetch copy planning and BatchOrCopyMLValue to skip transfers when devices are deemed memory-compatible.
- Update allocation planner CPU-memory checks to use OrtDevice::UsesCpuMemory() (so HOST_ACCESSIBLE is treated as CPU-memory-compatible).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h | Declares PluginExecutionProvider override for mem-type → device mapping. |
| onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc | Implements HOST_ACCESSIBLE routing for CPUInput/CPUOutput mem types. |
| onnxruntime/core/framework/utils.cc | Adds memory-compatibility logic and uses it to skip copies + reuse fetch buffers. |
| onnxruntime/core/framework/allocation_planner.cc | Treats HOST_ACCESSIBLE as CPU-memory-compatible in in-place planning checks. |
| OrtDevice PluginExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const { | ||
| if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) { | ||
| // Use the host-accessible allocator device if one was registered by the plugin. | ||
| // This avoids unnecessary copies between CPU and HOST_ACCESSIBLE memory. | ||
| if (!ep_devices_.empty() && ep_devices_[0]->host_accessible_memory_info != nullptr) { | ||
| return ep_devices_[0]->host_accessible_memory_info->device; | ||
| } |
GetOrtDeviceByMemType uses ep_devices_[0]->host_accessible_memory_info to select the CPUInput/CPUOutput device, but the constructor only enforces consistency for device_memory_info across OrtEpDevice instances (not host_accessible_memory_info). If multiple OrtEpDevice entries are present and host_accessible_memory_info differs (or is only set on some), this will return an arbitrary device and can misroute allocations/copies. Consider validating that all host_accessible_memory_info devices are equivalent (or selecting based on the active/default device id) and failing fast if they are inconsistent.
| OrtDevice PluginExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const { | ||
| if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) { | ||
| // Use the host-accessible allocator device if one was registered by the plugin. | ||
| // This avoids unnecessary copies between CPU and HOST_ACCESSIBLE memory. | ||
| if (!ep_devices_.empty() && ep_devices_[0]->host_accessible_memory_info != nullptr) { | ||
| return ep_devices_[0]->host_accessible_memory_info->device; | ||
| } | ||
| return OrtDevice(); | ||
| } | ||
| return GetDevice(); | ||
| } |
This new override changes core allocation/copy routing for plugin EPs when a host_accessible allocator is registered, but there’s no accompanying regression test in the existing plugin EP test suite (e.g., verifying CPUInput/CPUOutput now map to HOST_ACCESSIBLE, and that feed/fetch copy planning behaves correctly). Please add a unit/integration test that exercises a plugin EP with HOST_ACCESSIBLE memory and asserts that unnecessary copies are avoided without breaking correctness.
| // HOST_ACCESSIBLE <-> DEFAULT: compatible only on the same physical device. | ||
| if ((a_is_cpu_mem != b_is_cpu_mem) && | ||
| a.Type() == b.Type() && | ||
| a.Vendor() == b.Vendor() && | ||
| a.Id() == b.Id()) { | ||
| return true; | ||
| } |
DevicesAreMemoryCompatible treats HOST_ACCESSIBLE <-> DEFAULT as copy-free when Type/Vendor/Id match, but for existing EPs (e.g., CUDA/ROCm) HOST_ACCESSIBLE commonly represents pinned host memory that is not the same address space as device DEFAULT memory. This would cause feed/fetch planning and BatchOrCopyMLValue to skip required transfers (and potentially ignore user-provided output buffers), producing incorrect results. Please restrict this optimization to an explicit “shared memory” contract (EP/device capability) or remove the HOST_ACCESSIBLE <-> DEFAULT compatibility path so DEFAULT<->HOST_ACCESSIBLE still triggers a copy for EPs that need it.
| @@ -918,7 +918,7 @@ class PlannerImpl { | |||
| // We only do it for CPU based EPs. We are not likely to encounter | |||
| // non CPU devices here since they are already taken care of by using MemCpy nodes earlier. | |||
| // However, we still ignore them. | |||
This comment now appears stale.
| // non CPU devices here since they are already taken care of by using MemCpy nodes earlier. | ||
| // However, we still ignore them. | ||
| if (output_device.Type() == OrtDevice::CPU) { | ||
| if (output_device.UsesCpuMemory()) { |
The change appears to hold water. However,
there is a slight performance regression risk with determine_device():
When the output is CUDA pinned (GPU, HOST_ACCESSIBLE, NVIDIA) and the consumer is a CPU EP node, determine_device prefers the HOST_ACCESSIBLE device over the CPU device. This means the allocation planner might now place the tensor in pinned memory instead of regular CPU memory. This is functionally correct (pinned memory is CPU-readable), but over-allocating pinned memory can degrade system performance — the NVIDIA blog explicitly warns about this: "You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system."
However, this scenario only triggers for nodes that already declare OutputMemoryType(OrtMemTypeCPUOutput) — meaning the CUDA EP already intended for this output to be in pinned memory. The allocation planner is just deciding whether to override it for the consumer. With the old code, the override was silently skipped; with the new code, determine_device still prefers the pinned location. So the actual allocation doesn't change — the pinned location was already what GetOrtDeviceByMemType returned and what would have been set by plan_.SetLocation(...) at the end.
Force-pushed from cf5a86b to 204dcd7.
Thanks for the feedback! I've run out of time today, unfortunately, and I'm going to be out until Wednesday next week. Hopefully you'll hear from me again by then. 😄
Description
Adds DevicesAreMemoryCompatible() to skip data copies between devices that share memory (CPU <-> HOST_ACCESSIBLE, or HOST_ACCESSIBLE <-> DEFAULT on the same physical device). Applied in feed/fetch copy planning and in BatchOrCopyMLValue.
Overrides GetOrtDeviceByMemType() in PluginExecutionProvider so the allocation planner routes CPU-type I/O through the HOST_ACCESSIBLE allocator when the plugin EP has registered one. This enables the planner to place intermediate tensors (CPU EP -> plugin EP boundary) in HOST_ACCESSIBLE memory, avoiding copies at the partition boundary.
Updates the in-place optimization check in the allocation planner to use UsesCpuMemory() so it recognizes HOST_ACCESSIBLE outputs as CPU-memory-compatible.
Motivation and Context
Removes unnecessary copies for non-CPU HOST_ACCESSIBLE device allocations.