route EthosU input/output memcpy through overridable hook #19264
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19264

❌ 18 New Failures, 3 Unrelated Failures
As of commit ffc9927 with merge base a7e44bf. Some of the failing jobs were already failing on the merge base (broken trunk); rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@3l1 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103455766.
Summary:

The EthosU backend's input/output scratch shuffling currently does a plain CPU std::memcpy of every input tensor into the scratch buffer and every output tensor out of it on every inference. On Cortex-M55-based firmware targets that have a DMA engine, this is a significant CPU load: inference time is spent in memcpy that could instead be DMA-offloaded so the M55 sleeps while the transfer runs.

This change introduces a thin extern-C indirection, `arm_ethos_io_memcpy`, that the EthosU backend uses everywhere it currently calls memcpy for input/output scratch shuffling. The default (weak) implementation lives in a separate translation unit (EthosUBackend_IoMemcpy.cpp) and just calls std::memcpy, so behavior is unchanged for any consumer that doesn't override it.

Firmware targets can supply a strong-symbol override (e.g. routing through a DMA engine) without touching the upstream backend code.

Implementation notes:
- The weak default lives in its own TU so the compiler in the call-site TUs cannot inline its body and bypass the link-time override. This is the same pattern bolt_arm_memcpy_external uses.
- Three call sites updated: input scratch copy in EthosUBackend.cpp, the layout-adjustment chunk loop in EthosUBackend.cpp, and the output scratch copy in EthosUBackend_Cortex_M.cpp.

Differential Revision: D103455766
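For reviewers who want to see the shape of a firmware-side override, here is a minimal sketch. The summary does not show the actual declaration, so the memcpy-like signature of `arm_ethos_io_memcpy` is an assumption. `fake_dma_copy_blocking`, `g_dma_transfers`, `kDmaThresholdBytes`, and `copy_input_to_scratch` are hypothetical names invented for illustration; the "DMA" here is simulated with std::memcpy so the snippet runs on a host.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Upstream side (its own TU, EthosUBackend_IoMemcpy.cpp per the summary) would
// look roughly like this with GCC/Clang; the signature is an assumption:
//
//   extern "C" __attribute__((weak))
//   void* arm_ethos_io_memcpy(void* dst, const void* src, size_t n) {
//     return std::memcpy(dst, src, n);
//   }

// Hypothetical stand-in for a vendor DMA driver call. A real implementation
// would program the DMA engine and wait for completion (e.g. sleeping until
// the completion interrupt) instead of copying on the CPU.
static std::size_t g_dma_transfers = 0;

static void fake_dma_copy_blocking(void* dst, const void* src, std::size_t n) {
  std::memcpy(dst, src, n);  // simulated transfer
  ++g_dma_transfers;
}

// Hypothetical cut-off: below this size, DMA setup cost is assumed to
// outweigh the benefit, so the copy stays on the CPU.
constexpr std::size_t kDmaThresholdBytes = 64;

// Firmware side: a strong (non-weak) definition with the same name. At link
// time it replaces the weak default without any change to backend sources.
extern "C" void* arm_ethos_io_memcpy(void* dst, const void* src,
                                     std::size_t n) {
  if (n >= kDmaThresholdBytes) {
    fake_dma_copy_blocking(dst, src, n);
  } else {
    std::memcpy(dst, src, n);
  }
  return dst;
}

// Illustrative call site: the backend calls the hook where it previously
// called memcpy directly.
void copy_input_to_scratch(std::uint8_t* scratch, const std::uint8_t* input,
                           std::size_t nbytes) {
  arm_ethos_io_memcpy(scratch, input, nbytes);
}
```

On real hardware an override along these lines would likely also need to deal with cache maintenance and buffer alignment around the transfer; those details depend on the target and are left out of the sketch.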
cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell