route EthosU input/output memcpy through overridable hook #19264

Open
3l1 wants to merge 1 commit into main from export-D103455766

@3l1 3l1 commented May 1, 2026

Summary:
The EthosU backend's input/output scratch shuffling currently does plain
CPU std::memcpy of every input tensor into the scratch buffer and every
output tensor out of it on every inference. On Cortex-M55-based firmware
targets that have a DMA engine, this is a significant CPU load: inference
time is spent in memcpy that could instead be DMA-offloaded so that the M55
sleeps while the transfer runs.

This change introduces a thin extern-C indirection — arm_ethos_io_memcpy
— that the EthosU backend uses everywhere it currently calls memcpy for
input/output scratch shuffling. The default (weak) implementation lives
in a separate translation unit (EthosUBackend_IoMemcpy.cpp) and just
calls std::memcpy, so behavior is unchanged for any consumer that doesn't
override it.
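
As a rough illustration of the weak default described above, the separate translation unit might look like the sketch below. The PR text names the symbol but not its parameters, so the memcpy-like signature and `void` return type here are assumptions, and `__attribute__((weak))` is the GCC/Clang spelling of a weak definition.

```cpp
// Sketch of a weak default in its own translation unit
// (the PR names this file EthosUBackend_IoMemcpy.cpp).
#include <cstddef>
#include <cstring>

// ASSUMPTION: memcpy-like parameters and a void return type. The PR
// description does not spell out the actual signature.
extern "C" __attribute__((weak)) void arm_ethos_io_memcpy(
    void* dst, const void* src, std::size_t size) {
  // Default behavior: plain CPU copy, same as before the change.
  std::memcpy(dst, src, size);
}
```

Because the definition is weak, a firmware image that links in a non-weak definition of the same symbol gets that one instead, with no change to the backend sources.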

Firmware targets can supply a strong-symbol override (e.g. routing
through a DMA engine) without touching the upstream backend code.
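
A firmware-side override could be sketched roughly as follows. Everything except the symbol name `arm_ethos_io_memcpy` is hypothetical: the signature is assumed to be memcpy-like, `platform_dma_copy_blocking` stands in for a vendor DMA driver (here simulated with a CPU copy so the sketch runs on a host), and the size threshold is an illustrative choice, not something the PR specifies.

```cpp
// Sketch of a strong-symbol override supplied by a firmware target.
#include <cstddef>
#include <cstring>

// HYPOTHETICAL driver stand-in. A real target would program a DMA channel
// and let the core sleep until the completion interrupt; it would likely
// also need cache maintenance around the transfer on cores with a data
// cache. Here the transfer is simulated with a CPU copy and a counter.
static std::size_t g_dma_transfers = 0;
static void platform_dma_copy_blocking(
    void* dst, const void* src, std::size_t size) {
  std::memcpy(dst, src, size);
  ++g_dma_transfers;
}

// Strong (non-weak) definition: the linker prefers it over a weak default.
// ASSUMPTION: memcpy-like signature, not confirmed by the PR text.
extern "C" void arm_ethos_io_memcpy(
    void* dst, const void* src, std::size_t size) {
  // HYPOTHETICAL threshold: tiny copies are not worth the DMA setup cost.
  constexpr std::size_t kDmaThresholdBytes = 64;
  if (size < kDmaThresholdBytes) {
    std::memcpy(dst, src, size);
    return;
  }
  platform_dma_copy_blocking(dst, src, size);
}
```

The override has to preserve memcpy semantics from the caller's point of view: when it returns, the destination must hold the copied bytes.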

Implementation notes:

  • The weak default lives in its own TU so the compiler in the call-site
    TUs cannot inline its body and bypass the link-time override. This is
    the same pattern bolt_arm_memcpy_external uses.
  • Three call sites updated: input scratch copy in EthosUBackend.cpp, the
    layout-adjustment chunk loop in EthosUBackend.cpp, and the output
    scratch copy in EthosUBackend_Cortex_M.cpp.
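
The call-site side of the notes above can be sketched as follows. The call-site translation unit sees only a declaration of the hook, so the compiler has no body to inline there. The helper `copy_rows_via_hook`, its parameters, and the row/stride layout are hypothetical illustrations of a chunked copy, not the backend's actual loop, and the hook signature is again an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Call-site view: declaration only. ASSUMPTION: memcpy-like signature.
extern "C" void arm_ethos_io_memcpy(
    void* dst, const void* src, std::size_t size);

// HYPOTHETICAL chunk loop: copies `rows` packed rows of `row_bytes` each
// into a destination whose rows start `dst_stride` bytes apart, routing
// every chunk through the hook instead of calling memcpy directly.
static void copy_rows_via_hook(
    std::uint8_t* dst, const std::uint8_t* src,
    std::size_t rows, std::size_t row_bytes, std::size_t dst_stride) {
  for (std::size_t r = 0; r < rows; ++r) {
    arm_ethos_io_memcpy(dst + r * dst_stride, src + r * row_bytes, row_bytes);
  }
}

// Stand-in definition so this sketch links on its own. In the design the
// PR describes, the definition lives in a different translation unit.
extern "C" void arm_ethos_io_memcpy(
    void* dst, const void* src, std::size_t size) {
  std::memcpy(dst, src, size);
}
```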

Differential Revision: D103455766

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

@3l1 3l1 requested a review from digantdesai as a code owner May 1, 2026 21:06

pytorch-bot Bot commented May 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19264

Note: Links to docs will display an error until the docs builds have been completed.

❌ 18 New Failures, 3 Unrelated Failures

As of commit ffc9927 with merge base a7e44bf (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label May 1, 2026

meta-codesync Bot commented May 1, 2026

@3l1 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103455766.


github-actions Bot commented May 1, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot force-pushed the export-D103455766 branch from ddea8da to ffc9927 Compare May 1, 2026 21:07
@3l1 3l1 added the partner: arm label May 1, 2026
@3l1 3l1 requested a review from gggekov May 1, 2026 21:10
Labels

- ciflow/trunk
- CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
- fb-exported
- meta-exported
- module: arm (Issues related to arm backend)
- partner: arm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm)
