
Zero-copy I/O for plugin EPs with HOST_ACCESSIBLE memory#28037

Draft
ericcraw wants to merge 2 commits into microsoft:main from ericcraw:host-accessible-allocator

Conversation

@ericcraw
Contributor

Description

Adds DevicesAreMemoryCompatible() to skip data copies between devices that share memory (CPU <-> HOST_ACCESSIBLE, or HOST_ACCESSIBLE <-> DEFAULT on the same physical device). Applied in feed/fetch copy planning and in BatchOrCopyMLValue.

Overrides GetOrtDeviceByMemType() in PluginExecutionProvider so the allocation planner routes CPU-type I/O through the HOST_ACCESSIBLE allocator when the plugin EP has registered one. This enables the planner to place intermediate tensors (CPU EP -> plugin EP boundary) in HOST_ACCESSIBLE memory, avoiding copies at the partition boundary.

Updates the in-place optimization check in the allocation planner to use UsesCpuMemory() so it recognizes HOST_ACCESSIBLE outputs as CPU-memory-compatible.
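A minimal standalone sketch of the compatibility check described above, using stand-in types rather than the real OrtDevice class (enum and field names here are illustrative only):

```cpp
#include <cassert>

// Stand-in for ORT's OrtDevice; field and enum names are illustrative only.
struct Device {
  enum Type { CPU, GPU };
  enum MemType { DEFAULT, HOST_ACCESSIBLE };
  Type type;
  MemType mem_type;
  int vendor;
  int id;

  // Mirrors the idea behind OrtDevice::UsesCpuMemory(): plain CPU memory
  // and host-accessible device memory are both readable from the host.
  bool UsesCpuMemory() const {
    return type == CPU || mem_type == HOST_ACCESSIBLE;
  }
};

// Sketch of the check the PR describes: returns true when no data transfer
// is needed between the two devices.
static bool DevicesAreMemoryCompatible(const Device& a, const Device& b) {
  // Identical device and memory type: trivially compatible.
  if (a.type == b.type && a.mem_type == b.mem_type &&
      a.vendor == b.vendor && a.id == b.id) {
    return true;
  }
  const bool a_cpu = a.UsesCpuMemory();
  const bool b_cpu = b.UsesCpuMemory();
  // CPU <-> HOST_ACCESSIBLE: both sides are host-readable.
  if (a_cpu && b_cpu) return true;
  // HOST_ACCESSIBLE <-> DEFAULT: compatible only on the same physical device.
  if ((a_cpu != b_cpu) && a.type == b.type &&
      a.vendor == b.vendor && a.id == b.id) {
    return true;
  }
  return false;
}
```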

Motivation and Context

Remove unnecessary copies for non-cpu HOST_ACCESSIBLE device allocations.

Comment thread onnxruntime/core/framework/utils.cc Outdated
return provider.GetDevice().Type() == OrtDevice::CPU;
}

// Returns true if no data transfer is needed between the two devices.
Member

Returns true if no data transfer is needed between the two devices.

Does alignment play a part here?
I know some devices require 4K alignment for memory to be device-accessible.

Contributor Author

For allocations from the allocation planner, I think that's taken care of. User-allocated I/O tensors should have been allocated with the appropriate host-accessible device allocators, which would correctly align the backing memory.

Regardless, I added an alignment check as well.

Comment thread onnxruntime/core/framework/utils.cc Outdated
// Returns true if no data transfer is needed between the two devices.
// HOST_ACCESSIBLE memory is a superset — accessible by both host and device — so it can satisfy
// DEFAULT memory requirements on the same physical device without a copy.
static bool DevicesAreMemoryCompatible(const OrtDevice& a, const OrtDevice& b) {
Member

@yuslepukhin Apr 16, 2026

The function is symmetric: DevicesAreMemoryCompatible(a, b) == DevicesAreMemoryCompatible(b, a). But the underlying property is not symmetric:

| Direction | Safe? | Why |
| --- | --- | --- |
| HOST_ACCESSIBLE → DEFAULT (same device) | Yes | HOST_ACCESSIBLE memory is in the device's address space; a device kernel can read/write it. |
| DEFAULT → HOST_ACCESSIBLE (same device) | Only for device-side access | DEFAULT (e.g. GPU global) memory is typically not CPU-readable. A consumer expecting CPU-readable HOST_ACCESSIBLE memory that receives a DEFAULT pointer will fault. |

All current call sites operate in the kernel-execution / data-transfer-planning context where only the device side reads the memory, so the symmetry is safe today. However, the function name "DevicesAreMemoryCompatible" and its doc comment don't convey this constraint. A future caller that uses it to decide whether the CPU can access a buffer (analogous to what GetPyObjFromTensor does in PR #28038) would silently get the wrong answer.

Recommendation: Either (a) add a prominent comment that the function assumes device-side access only, or (b) make it directional (CanSourceSatisfyTarget(src, tgt)) so the caller explicitly declares the access direction. Option (b) is safer long-term.
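A possible shape for option (b), sketched with a stand-in device type (the real OrtDevice API differs; kind and field names here are hypothetical):

```cpp
#include <cassert>

// Stand-in device description; fields are illustrative, not the real OrtDevice.
struct Dev {
  enum Kind { CPU_DEFAULT, DEV_DEFAULT, DEV_HOST_ACCESSIBLE };
  Kind kind;
  int device_id;  // only meaningful for the DEV_* kinds
};

// Directional check: can a buffer allocated as `src` be consumed where
// `tgt` memory is expected, without a copy?
static bool CanSourceSatisfyTarget(const Dev& src, const Dev& tgt) {
  switch (tgt.kind) {
    case Dev::CPU_DEFAULT:
      // Consumer reads from the host: src must be host-visible.
      return src.kind == Dev::CPU_DEFAULT || src.kind == Dev::DEV_HOST_ACCESSIBLE;
    case Dev::DEV_HOST_ACCESSIBLE:
      // Consumer may read from either side: src must be host-visible and,
      // if device-resident, on the same device. Plain DEFAULT never qualifies.
      return src.kind == Dev::CPU_DEFAULT ||
             (src.kind == Dev::DEV_HOST_ACCESSIBLE && src.device_id == tgt.device_id);
    case Dev::DEV_DEFAULT:
      // Consumer reads from the device: src must be in that device's address
      // space (DEFAULT or HOST_ACCESSIBLE on the same device).
      return (src.kind == Dev::DEV_DEFAULT || src.kind == Dev::DEV_HOST_ACCESSIBLE) &&
             src.device_id == tgt.device_id;
  }
  return false;
}
```

The asymmetry the comment describes falls out directly: `CanSourceSatisfyTarget(host_accessible, device_default)` holds on the same device, while the reverse direction does not.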

Contributor Author

Took this suggestion and tweaked the check a bit

// Populate device_fetches for the output-copy path.
// Reuses a pre-allocated user buffer when the memory is compatible (same device or HOST_ACCESSIBLE
// <-> DEFAULT on the same physical device); otherwise inserts an empty placeholder.
static void PopulateDeviceFetches(gsl::span<const MLValueCopyInfo> fetch_copy_info,
Member

If fetch_copy_info.size() < fetches.size(), indexing fetch_copy_info[i] is undefined behavior. The original inline code had the same gap, but factoring it into a reusable helper makes the contract less obvious. Add:

ORT_ENFORCE(fetch_copy_info.size() >= fetches.size());

Contributor Author

Added the enforce.

if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) {
// Use the host-accessible allocator device if one was registered by the plugin.
// This avoids unnecessary copies between CPU and HOST_ACCESSIBLE memory.
if (!ep_devices_.empty() && ep_devices_[0]->host_accessible_memory_info != nullptr) {
Member

@yuslepukhin Apr 16, 2026

ep_devices_[0]

This only inspects the first device. For a multi-device plugin EP, it always returns the first device's host-accessible info. That is acceptable for single-device EPs (the common case), but worth a comment stating the assumption, or an ORT_ENFORCE(ep_devices_.size() <= 1) if multi-device is truly unsupported here.
However, I suspect otherwise.

Contributor Author

At the moment the OpenVINO EP allows multiple devices. We probably want to add a separate API that lets the EP tell ORT which EP device it's using; then ORT wouldn't have to make any assumptions.

@yuslepukhin
Member

There are no unit tests for DevicesAreMemoryCompatible. Given the function has five distinct logical branches (both CPU, CPU + HOST_ACCESSIBLE same device, CPU + HOST_ACCESSIBLE different device, HOST_ACCESSIBLE ↔ DEFAULT same device, incompatible), it should have dedicated unit tests covering each case.
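A sketch of what such branch-by-branch tests could look like, using a stand-in device type and a local reimplementation of the check (the real function and types live in onnxruntime internals, so names here are illustrative):

```cpp
#include <cassert>

// Stand-in device and a local copy of the check under test; the actual
// implementation lives in onnxruntime/core/framework/utils.cc.
struct Device {
  enum Type { CPU, GPU };
  enum MemType { DEFAULT, HOST_ACCESSIBLE };
  Type type;
  MemType mem_type;
  int vendor;
  int id;
  bool UsesCpuMemory() const { return type == CPU || mem_type == HOST_ACCESSIBLE; }
};

static bool DevicesAreMemoryCompatible(const Device& a, const Device& b) {
  const bool a_cpu = a.UsesCpuMemory();
  const bool b_cpu = b.UsesCpuMemory();
  if (a_cpu && b_cpu) return true;
  if ((a_cpu != b_cpu) && a.type == b.type && a.vendor == b.vendor && a.id == b.id)
    return true;
  return false;
}

// One assertion per logical branch named in the review comment.
void RunDeviceCompatibilityTests() {
  const Device cpu{Device::CPU, Device::DEFAULT, 0, 0};
  const Device gpu0_ha{Device::GPU, Device::HOST_ACCESSIBLE, 1, 0};
  const Device gpu1_ha{Device::GPU, Device::HOST_ACCESSIBLE, 1, 1};
  const Device gpu0_def{Device::GPU, Device::DEFAULT, 1, 0};
  const Device gpu1_def{Device::GPU, Device::DEFAULT, 1, 1};

  assert(DevicesAreMemoryCompatible(cpu, cpu));            // both CPU
  assert(DevicesAreMemoryCompatible(cpu, gpu0_ha));        // CPU + HOST_ACCESSIBLE, same device
  assert(DevicesAreMemoryCompatible(cpu, gpu1_ha));        // CPU + HOST_ACCESSIBLE, different device
  assert(DevicesAreMemoryCompatible(gpu0_ha, gpu0_def));   // HOST_ACCESSIBLE <-> DEFAULT, same device
  assert(!DevicesAreMemoryCompatible(gpu0_def, gpu1_def)); // incompatible
}
```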

Member

@yuslepukhin left a comment


🕐

Contributor

Copilot AI left a comment


Pull request overview

Adds HOST_ACCESSIBLE-aware device selection and copy-planning logic to reduce (or eliminate) unnecessary feed/fetch and boundary copies for plugin execution providers that register host-accessible memory.

Changes:

  • Override PluginExecutionProvider::GetOrtDeviceByMemType to route CPU I/O mem types through a registered HOST_ACCESSIBLE allocator/device.
  • Introduce DevicesAreMemoryCompatible() and apply it to feed/fetch copy planning and BatchOrCopyMLValue to skip transfers when devices are deemed memory-compatible.
  • Update allocation planner CPU-memory checks to use OrtDevice::UsesCpuMemory() (so HOST_ACCESSIBLE is treated as CPU-memory-compatible).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h | Declares the PluginExecutionProvider override for mem-type → device mapping. |
| onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc | Implements HOST_ACCESSIBLE routing for CPUInput/CPUOutput mem types. |
| onnxruntime/core/framework/utils.cc | Adds memory-compatibility logic and uses it to skip copies and reuse fetch buffers. |
| onnxruntime/core/framework/allocation_planner.cc | Treats HOST_ACCESSIBLE as CPU-memory-compatible in in-place planning checks. |


Comment on lines +218 to +224
OrtDevice PluginExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const {
if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) {
// Use the host-accessible allocator device if one was registered by the plugin.
// This avoids unnecessary copies between CPU and HOST_ACCESSIBLE memory.
if (!ep_devices_.empty() && ep_devices_[0]->host_accessible_memory_info != nullptr) {
return ep_devices_[0]->host_accessible_memory_info->device;
}

Copilot AI Apr 16, 2026


GetOrtDeviceByMemType uses ep_devices_[0]->host_accessible_memory_info to select the CPUInput/CPUOutput device, but the constructor only enforces consistency for device_memory_info across OrtEpDevice instances (not host_accessible_memory_info). If multiple OrtEpDevice entries are present and host_accessible_memory_info differs (or is only set on some), this will return an arbitrary device and can misroute allocations/copies. Consider validating that all host_accessible_memory_info devices are equivalent (or selecting based on the active/default device id) and failing fast if they are inconsistent.
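One way the suggested fail-fast validation could look, sketched with stand-in types (the real OrtEpDevice and memory-info records differ; the helper name is hypothetical):

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Stand-ins for the plugin EP device records; real types differ.
struct MemInfo {
  int type, vendor, id;
  bool operator==(const MemInfo& o) const {
    return type == o.type && vendor == o.vendor && id == o.id;
  }
};
struct EpDevice {
  const MemInfo* host_accessible_memory_info = nullptr;
};

// Sketch: reject plugin device lists whose host-accessible memory infos
// disagree, so a lookup via devices[0] cannot silently misroute allocations.
void ValidateHostAccessibleConsistency(const std::vector<EpDevice>& devices) {
  const MemInfo* first = nullptr;
  for (const auto& d : devices) {
    if (d.host_accessible_memory_info == nullptr) continue;
    if (first == nullptr) {
      first = d.host_accessible_memory_info;
      continue;
    }
    if (!(*first == *d.host_accessible_memory_info)) {
      throw std::runtime_error(
          "inconsistent host_accessible_memory_info across plugin devices");
    }
  }
}
```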

Comment on lines +218 to +228
OrtDevice PluginExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const {
if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) {
// Use the host-accessible allocator device if one was registered by the plugin.
// This avoids unnecessary copies between CPU and HOST_ACCESSIBLE memory.
if (!ep_devices_.empty() && ep_devices_[0]->host_accessible_memory_info != nullptr) {
return ep_devices_[0]->host_accessible_memory_info->device;
}
return OrtDevice();
}
return GetDevice();
}

Copilot AI Apr 16, 2026


This new override changes core allocation/copy routing for plugin EPs when a host_accessible allocator is registered, but there’s no accompanying regression test in the existing plugin EP test suite (e.g., verifying CPUInput/CPUOutput now map to HOST_ACCESSIBLE, and that feed/fetch copy planning behaves correctly). Please add a unit/integration test that exercises a plugin EP with HOST_ACCESSIBLE memory and asserts that unnecessary copies are avoided without breaking correctness.

Comment thread onnxruntime/core/framework/utils.cc Outdated
Comment on lines +70 to +76
// HOST_ACCESSIBLE <-> DEFAULT: compatible only on the same physical device.
if ((a_is_cpu_mem != b_is_cpu_mem) &&
a.Type() == b.Type() &&
a.Vendor() == b.Vendor() &&
a.Id() == b.Id()) {
return true;
}

Copilot AI Apr 16, 2026


DevicesAreMemoryCompatible treats HOST_ACCESSIBLE <-> DEFAULT as copy-free when Type/Vendor/Id match, but for existing EPs (e.g., CUDA/ROCm) HOST_ACCESSIBLE commonly represents pinned host memory that is not the same address space as device DEFAULT memory. This would cause feed/fetch planning and BatchOrCopyMLValue to skip required transfers (and potentially ignore user-provided output buffers), producing incorrect results. Please restrict this optimization to an explicit “shared memory” contract (EP/device capability) or remove the HOST_ACCESSIBLE <-> DEFAULT compatibility path so DEFAULT<->HOST_ACCESSIBLE still triggers a copy for EPs that need it.

@@ -918,7 +918,7 @@ class PlannerImpl {
// We only do it for CPU based EPs. We are not likely to encounter
// non CPU devices here since they are already taken care of by using MemCpy nodes earlier.
// However, we still ignore them.
Member

This comment now appears stale.

// non CPU devices here since they are already taken care of by using MemCpy nodes earlier.
// However, we still ignore them.
if (output_device.Type() == OrtDevice::CPU) {
if (output_device.UsesCpuMemory()) {
Member

if (output_device.UsesCpuMemory())

The change appears to hold water. However,
there is a slight performance regression risk with determine_device():

When the output is CUDA pinned (GPU, HOST_ACCESSIBLE, NVIDIA) and the consumer is a CPU EP node, determine_device prefers the HOST_ACCESSIBLE device over the CPU device. This means the allocation planner might now place the tensor in pinned memory instead of regular CPU memory. This is functionally correct (pinned memory is CPU-readable), but over-allocating pinned memory can degrade system performance — the NVIDIA blog explicitly warns about this: "You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system."

However, this scenario only triggers for nodes that already declare OutputMemoryType(OrtMemTypeCPUOutput) — meaning the CUDA EP already intended for this output to be in pinned memory. The allocation planner is just deciding whether to override it for the consumer. With the old code, the override was silently skipped; with the new code, determine_device still prefers the pinned location. So the actual allocation doesn't change — the pinned location was already what GetOrtDeviceByMemType returned and what would have been set by plan_.SetLocation(...) at the end.

@ericcraw force-pushed the host-accessible-allocator branch from cf5a86b to 204dcd7 on April 17, 2026 00:53
@ericcraw
Contributor Author

Thanks for the feedback! Unfortunately I've run out of time today, and I'm going to be out until Wednesday next week. Hopefully you'll hear from me again by then. 😄
