Zero-copy I/O for plugin EPs with HOST_ACCESSIBLE memory #28037
ericcraw wants to merge 2 commits into microsoft:main from
Conversation
| return provider.GetDevice().Type() == OrtDevice::CPU; | ||
| } | ||
|
|
| // Returns true if no data transfer is needed between the two devices. |
For allocations from the allocation planner I think that should be taken care of. For user-allocated I/O tensors, they should have been allocated using the appropriate host-accessible device allocators, which would need to correctly align the backing memory.
Regardless, I added a check for alignment as well.
| // Returns true if no data transfer is needed between the two devices. | ||
| // HOST_ACCESSIBLE memory is a superset — accessible by both host and device — so it can satisfy | ||
| // DEFAULT memory requirements on the same physical device without a copy. | ||
| static bool DevicesAreMemoryCompatible(const OrtDevice& a, const OrtDevice& b) { |
The function is symmetric: DevicesAreMemoryCompatible(a, b) == DevicesAreMemoryCompatible(b, a). But the underlying property is not symmetric:
| Direction | Safe? | Why |
|---|---|---|
| HOST_ACCESSIBLE → DEFAULT (same device) | Yes | HOST_ACCESSIBLE memory is in the device's address space; a device kernel can read/write it. |
| DEFAULT → HOST_ACCESSIBLE (same device) | Only for device-side access | DEFAULT (e.g. GPU global) memory is typically not CPU-readable. A consumer expecting CPU-readable HOST_ACCESSIBLE memory that receives a DEFAULT pointer will fault. |
All current call sites operate in the kernel-execution / data-transfer-planning context where only the device side reads the memory, so the symmetry is safe today. However, the function name "DevicesAreMemoryCompatible" and its doc comment don't convey this constraint. A future caller that uses it to decide whether the CPU can access a buffer (analogous to what GetPyObjFromTensor does in PR #28038) would silently get the wrong answer.
Recommendation: Either (a) add a prominent comment that the function assumes device-side access only, or (b) make it directional (CanSourceSatisfyTarget(src, tgt)) so the caller explicitly declares the access direction. Option (b) is safer long-term.
Took this suggestion and tweaked the check a bit
| // Populate device_fetches for the output-copy path. | ||
| // Reuses a pre-allocated user buffer when the memory is compatible (same device or HOST_ACCESSIBLE | ||
| // <-> DEFAULT on the same physical device); otherwise inserts an empty placeholder. | ||
| static void PopulateDeviceFetches(gsl::span<const MLValueCopyInfo> fetch_copy_info, |
If fetch_copy_info.size() < fetches.size(), indexing fetch_copy_info[i] is undefined behavior. The original inline code had the same gap, but factoring it into a reusable helper makes the contract less obvious. Add:
ORT_ENFORCE(fetch_copy_info.size() >= fetches.size());
Added the enforce.
| if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) { | ||
| // Use the host-accessible allocator device if one was registered by the plugin. | ||
| // This avoids unnecessary copies between CPU and HOST_ACCESSIBLE memory. | ||
| if (!ep_devices_.empty() && ep_devices_[0]->host_accessible_memory_info != nullptr) { |
This only inspects the first device.
For a multi-device plugin EP, this always returns the first device's host-accessible info. Acceptable for single-device EPs (the common case), but worth a comment stating the assumption, or an ORT_ENFORCE(ep_devices_.size() <= 1) if multi-device is truly unsupported here.
However, I suspect otherwise.
At the moment the OpenVINO EP allows multiple devices; we probably want to add a separate API to allow the EP to tell ORT which EP device it's using. Then ORT wouldn't have to make any assumptions.
There are no unit tests for DevicesAreMemoryCompatible. Given the function has five distinct logical branches (both CPU, CPU + HOST_ACCESSIBLE same device, CPU + HOST_ACCESSIBLE different device, HOST_ACCESSIBLE ↔ DEFAULT same device, incompatible), it should have dedicated unit tests covering each case.
Pull request overview
Adds HOST_ACCESSIBLE-aware device selection and copy-planning logic to reduce (or eliminate) unnecessary feed/fetch and boundary copies for plugin execution providers that register host-accessible memory.
Changes:
- Override PluginExecutionProvider::GetOrtDeviceByMemType to route CPU I/O mem types through a registered HOST_ACCESSIBLE allocator/device.
- Introduce DevicesAreMemoryCompatible() and apply it to feed/fetch copy planning and BatchOrCopyMLValue to skip transfers when devices are deemed memory-compatible.
- Update allocation planner CPU-memory checks to use OrtDevice::UsesCpuMemory() (so HOST_ACCESSIBLE is treated as CPU-memory-compatible).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.h | Declares PluginExecutionProvider override for mem-type → device mapping. |
| onnxruntime/core/session/plugin_ep/ep_plugin_provider_interfaces.cc | Implements HOST_ACCESSIBLE routing for CPUInput/CPUOutput mem types. |
| onnxruntime/core/framework/utils.cc | Adds memory-compatibility logic and uses it to skip copies + reuse fetch buffers. |
| onnxruntime/core/framework/allocation_planner.cc | Treats HOST_ACCESSIBLE as CPU-memory-compatible in in-place planning checks. |
| OrtDevice PluginExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const { | ||
| if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) { | ||
| // Use the host-accessible allocator device if one was registered by the plugin. | ||
| // This avoids unnecessary copies between CPU and HOST_ACCESSIBLE memory. | ||
| if (!ep_devices_.empty() && ep_devices_[0]->host_accessible_memory_info != nullptr) { | ||
| return ep_devices_[0]->host_accessible_memory_info->device; | ||
| } |
GetOrtDeviceByMemType uses ep_devices_[0]->host_accessible_memory_info to select the CPUInput/CPUOutput device, but the constructor only enforces consistency for device_memory_info across OrtEpDevice instances (not host_accessible_memory_info). If multiple OrtEpDevice entries are present and host_accessible_memory_info differs (or is only set on some), this will return an arbitrary device and can misroute allocations/copies. Consider validating that all host_accessible_memory_info devices are equivalent (or selecting based on the active/default device id) and failing fast if they are inconsistent.
| OrtDevice PluginExecutionProvider::GetOrtDeviceByMemType(OrtMemType mem_type) const { | ||
| if (mem_type == OrtMemTypeCPUInput || mem_type == OrtMemTypeCPUOutput) { | ||
| // Use the host-accessible allocator device if one was registered by the plugin. | ||
| // This avoids unnecessary copies between CPU and HOST_ACCESSIBLE memory. | ||
| if (!ep_devices_.empty() && ep_devices_[0]->host_accessible_memory_info != nullptr) { | ||
| return ep_devices_[0]->host_accessible_memory_info->device; | ||
| } | ||
| return OrtDevice(); | ||
| } | ||
| return GetDevice(); | ||
| } |
This new override changes core allocation/copy routing for plugin EPs when a host_accessible allocator is registered, but there’s no accompanying regression test in the existing plugin EP test suite (e.g., verifying CPUInput/CPUOutput now map to HOST_ACCESSIBLE, and that feed/fetch copy planning behaves correctly). Please add a unit/integration test that exercises a plugin EP with HOST_ACCESSIBLE memory and asserts that unnecessary copies are avoided without breaking correctness.
| // HOST_ACCESSIBLE <-> DEFAULT: compatible only on the same physical device. | ||
| if ((a_is_cpu_mem != b_is_cpu_mem) && | ||
| a.Type() == b.Type() && | ||
| a.Vendor() == b.Vendor() && | ||
| a.Id() == b.Id()) { | ||
| return true; | ||
| } |
DevicesAreMemoryCompatible treats HOST_ACCESSIBLE <-> DEFAULT as copy-free when Type/Vendor/Id match, but for existing EPs (e.g., CUDA/ROCm) HOST_ACCESSIBLE commonly represents pinned host memory that is not the same address space as device DEFAULT memory. This would cause feed/fetch planning and BatchOrCopyMLValue to skip required transfers (and potentially ignore user-provided output buffers), producing incorrect results. Please restrict this optimization to an explicit “shared memory” contract (EP/device capability) or remove the HOST_ACCESSIBLE <-> DEFAULT compatibility path so DEFAULT<->HOST_ACCESSIBLE still triggers a copy for EPs that need it.
| @@ -918,7 +918,7 @@ class PlannerImpl { | |||
| // We only do it for CPU based EPs. We are not likely to encounter | |||
| // non CPU devices here since they are already taken care of by using MemCpy nodes earlier. | |||
| // However, we still ignore them. | |||
This comment now appears stale.
| // non CPU devices here since they are already taken care of by using MemCpy nodes earlier. | ||
| // However, we still ignore them. | ||
| if (output_device.Type() == OrtDevice::CPU) { | ||
| if (output_device.UsesCpuMemory()) { |
The change appears to hold water. However,
there is a slight performance regression risk with determine_device():
When the output is CUDA pinned (GPU, HOST_ACCESSIBLE, NVIDIA) and the consumer is a CPU EP node, determine_device prefers the HOST_ACCESSIBLE device over the CPU device. This means the allocation planner might now place the tensor in pinned memory instead of regular CPU memory. This is functionally correct (pinned memory is CPU-readable), but over-allocating pinned memory can degrade system performance — the NVIDIA blog explicitly warns about this: "You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system."
However, this scenario only triggers for nodes that already declare OutputMemoryType(OrtMemTypeCPUOutput) — meaning the CUDA EP already intended for this output to be in pinned memory. The allocation planner is just deciding whether to override it for the consumer. With the old code, the override was silently skipped; with the new code, determine_device still prefers the pinned location. So the actual allocation doesn't change — the pinned location was already what GetOrtDeviceByMemType returned and what would have been set by plan_.SetLocation(...) at the end.
Force-pushed from cf5a86b to 204dcd7.
Thanks for the feedback! I've run out of time today, unfortunately, and I'm going to be out until Wednesday next week. Hopefully you'll hear from me again by then. 😄
Description
Adds DevicesAreMemoryCompatible() to skip data copies between devices that share memory (CPU <-> HOST_ACCESSIBLE, or HOST_ACCESSIBLE <-> DEFAULT on the same physical device). Applied in feed/fetch copy planning and in BatchOrCopyMLValue.
Overrides GetOrtDeviceByMemType() in PluginExecutionProvider so the allocation planner routes CPU-type I/O through the HOST_ACCESSIBLE allocator when the plugin EP has registered one. This enables the planner to place intermediate tensors (CPU EP -> plugin EP boundary) in HOST_ACCESSIBLE memory, avoiding copies at the partition boundary.
Updates the in-place optimization check in the allocation planner to use UsesCpuMemory() so it recognizes HOST_ACCESSIBLE outputs as CPU-memory-compatible.
Motivation and Context
Removes unnecessary copies for non-CPU HOST_ACCESSIBLE device allocations.