Commit d43fa35
Arm backend: Document validated TinyML models for Cortex-M
Adds a Validated Models section to the Cortex-M backend overview listing the six models exported, INT8 quantized, and run on the Corstone-300 FVP by CI: mv2 and ds_cnn on trunk; mv3, mobilenet_v1_025, resnet8, and deep_autoencoder nightly. For each model the table links the source file and the per-model dialect/implementation test. A short note calls out that mobilenet_v1_025 is the MLPerf Tiny Visual Wake Words reference model — the canonical TinyML person-detection benchmark — since that lineage is not obvious from the name.

The page also documents the bundled (.bpte) testing flow that CI uses: aot_arm_compiler --bundleio embeds reference inputs and expected outputs in the program, and examples/arm/run.sh drives the full export → build → FVP chain with Test_result PASS/FAIL self-checking, so a reader can reproduce what trunk and nightly do. An admonition clarifies that CI validates INT8 numerical parity between the exported .bpte and the eager-mode quantized model, not task accuracy (VWW / KWS / ImageNet).

This change was authored with Claude (claude-opus-4-7).
1 parent e6efe18 commit d43fa35

1 file changed

Lines changed: 61 additions & 0 deletions

File tree

docs/source/backends/arm-cortex-m/arm-cortex-m-overview.md

@@ -164,3 +164,64 @@ backends/arm/scripts/run_fvp.sh --elf=build/arm_executor_runner --target=ethos-u

For a complete end-to-end walkthrough including dataset setup, calibration, and result validation, see the [Cortex-M MobileNetV2 notebook](https://github.com/pytorch/executorch/blob/main/examples/arm/cortex_m_mv2_example.ipynb).
## Testing with Bundled I/O

The tutorial above produces a plain `.pte`. For programmatic testing,
`aot_arm_compiler --bundleio` instead produces a bundled (`.bpte`) program
that embeds reference inputs and expected outputs; the Cortex-M test runner
loads the bundle via semihosting and self-checks its outputs against the
embedded references, emitting `Test_result: PASS` or `Test_result: FAIL`
on the UART.
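
A CI harness can grade a run by scanning the FVP's UART output for that
marker. A minimal sketch — only the `Test_result:` line comes from the
runner; everything else in the example log is an assumption:

```python
import re


def grade_uart_log(log_text: str) -> bool:
    """Return True iff the bundled-I/O run self-reported success.

    Searches the captured UART output for the `Test_result: PASS` /
    `Test_result: FAIL` marker emitted by the Cortex-M test runner.
    """
    match = re.search(r"Test_result:\s*(PASS|FAIL)", log_text)
    if match is None:
        raise ValueError("no Test_result marker found in UART log")
    return match.group(1) == "PASS"
```

A harness would feed this the log captured from the FVP's UART and fail
the job when it returns `False` or raises.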
The driver for this flow is `examples/arm/run.sh`, which exports the model,
builds the Cortex-M test runner, launches the Corstone-300 FVP with
semihosting enabled, and checks the bundled output. Run it from the
ExecuTorch repo root after `./install_executorch.sh`:
```bash
# One-time: install the Arm toolchain + FVP.
examples/arm/setup.sh --i-agree-to-the-contained-eula
source examples/arm/arm-scratch/setup_path.sh

# Per model: export, build, and run on the FVP in one step.
# (Quantization is the default for the cortex-m55+int8 target.)
examples/arm/run.sh \
  --model_name=<model> \
  --target=cortex-m55+int8 \
  --bundleio
```
Replace `<model>` with any of the validated-model names in the table
below. Without `--calibration_data`, calibration falls back to the model's
`get_example_inputs()` (random data) — enough for bundled-I/O numerical
parity, but not for task-accuracy claims. On `Test_result: FAIL`, inspect
the FVP UART log for the per-tensor diff; supplying a representative
calibration dataset via `--calibration_data=<dir>` often resolves
mismatches caused by random-input calibration.
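
To reproduce the full trunk + nightly coverage locally, the same
invocation can be swept over all six validated models. The sketch below
only builds and prints the command lines; executing them (for example via
`subprocess.run`) is left to the caller:

```python
# The six validated model names from the table below.
VALIDATED_MODELS = [
    "mv2", "mv3", "ds_cnn",
    "mobilenet_v1_025", "resnet8", "deep_autoencoder",
]


def run_sh_command(model: str) -> list[str]:
    """Build the run.sh argument list for one model (dry run only)."""
    return [
        "examples/arm/run.sh",
        f"--model_name={model}",
        "--target=cortex-m55+int8",
        "--bundleio",
    ]


for model in VALIDATED_MODELS:
    print(" ".join(run_sh_command(model)))
```

Each printed line is exactly the per-model invocation shown above, so the
sweep stays in lockstep with the documented single-model flow.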
:::{important}
Bundled I/O checks INT8 **numerical parity** between the exported `.bpte`
and the eager-mode quantized model on reference inputs; it does not
validate task accuracy (VWW / KWS / ImageNet).
:::
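
"Numerical parity" here means elementwise agreement of the INT8 outputs.
A toy illustration of that kind of check — the flat-list layout and the
zero default tolerance are assumptions of this sketch, not the runner's
actual implementation:

```python
def int8_parity(device_out, reference_out, atol: int = 0):
    """Compare two flat INT8 output tensors elementwise.

    Returns (passed, max_abs_diff). atol=0 demands exact agreement;
    a small nonzero atol would tolerate off-by-one rounding.
    """
    assert len(device_out) == len(reference_out), "shape mismatch"
    max_diff = max(
        (abs(a - b) for a, b in zip(device_out, reference_out)),
        default=0,
    )
    return max_diff <= atol, max_diff
```

On a mismatch, the reported maximum absolute difference plays the same
role as the per-tensor diff printed on the FVP UART log.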
## Validated Models

The following models are exported, INT8 quantized, lowered, and validated
end-to-end on the Corstone-300 FVP:
| Model | Task | Input shape | Source | Test |
|--------------------|------------------------------------|---------------|---------------------------------------------------|----------------------------------------------------------|
| `mv2` | Image classification | `1x3x224x224` | `examples/models/mobilenet_v2/` | `backends/cortex_m/test/models/test_mobilenet_v2.py` |
| `mv3` | Image classification | `1x3x224x224` | `examples/models/mobilenet_v3/` | `backends/cortex_m/test/models/test_mobilenet_v3.py` |
| `ds_cnn` | Keyword spotting (MLPerf Tiny) | `1x1x49x10` | `examples/models/mlperf_tiny/ds_cnn.py` | `backends/cortex_m/test/models/test_ds_cnn.py` |
| `mobilenet_v1_025` | Visual Wake Words (MLPerf Tiny) | `1x3x96x96` | `examples/models/mlperf_tiny/mobilenet_v1_025.py` | `backends/cortex_m/test/models/test_mobilenet_v1_025.py` |
| `resnet8` | Image classification (MLPerf Tiny) | `1x3x32x32` | `examples/models/mlperf_tiny/resnet8.py` | `backends/cortex_m/test/models/test_resnet8.py` |
| `deep_autoencoder` | Anomaly detection (MLPerf Tiny) | `1x640` | `examples/models/mlperf_tiny/deep_autoencoder.py` | `backends/cortex_m/test/models/test_deep_autoencoder.py` |
:::{note}
`mobilenet_v1_025` is the MLPerf Tiny Visual Wake Words benchmark
(MobileNetV1 with width multiplier 0.25) — the canonical person-detection
reference model for TinyML.
:::
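
The input-shape column above is enough to size dummy reference inputs for
a quick smoke test. A small sketch — the shape strings are copied from the
table, while the parsing helpers are illustrative only:

```python
# Input shapes from the validated-models table (NxCxHxW, or flat N x D
# for the autoencoder), as `x`-separated strings.
INPUT_SHAPES = {
    "mv2": "1x3x224x224",
    "mv3": "1x3x224x224",
    "ds_cnn": "1x1x49x10",
    "mobilenet_v1_025": "1x3x96x96",
    "resnet8": "1x3x32x32",
    "deep_autoencoder": "1x640",
}


def parse_shape(spec: str) -> tuple[int, ...]:
    """Turn a `1x3x96x96`-style spec into a shape tuple."""
    return tuple(int(d) for d in spec.split("x"))


def numel(spec: str) -> int:
    """Number of elements a reference input of this shape holds."""
    n = 1
    for d in parse_shape(spec):
        n *= d
    return n
```

For example, `parse_shape` on the Visual Wake Words entry yields the
`(1, 3, 96, 96)` tensor shape a harness would allocate.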
