diff --git a/docs/source/pico2_tutorial.md b/docs/source/pico2_tutorial.md
index 6718e63d05a..cbe2e8aa122 100644
--- a/docs/source/pico2_tutorial.md
+++ b/docs/source/pico2_tutorial.md
@@ -9,6 +9,7 @@ A 28×28 MNIST digit classifier running on memory constrained, low power microco
- Input: ASCII art digits (0, 1, 4, 7)
- Output: Real-time predictions via USB serial
- Memory: <400KB total footprint
+- Two variants: FP32 (portable ops) and INT8 (CMSIS-NN accelerated)

## Prerequisites

@@ -24,16 +25,32 @@ which arm-none-eabi-gcc # --> arm/arm-scratch/arm-gnu-toolchain-13.3.rel1-x86_64

## Step 1: Generate pte from given example Model

+### FP32 model (default)
+
- Use the [provided example model](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/export_mlp_mnist.py)

```bash
+cd examples/raspberry_pi/pico2
python export_mlp_mnist.py # Creates balanced_tiny_mlp_mnist.pte
```

- **Note:** This is a hand-crafted MNIST classifier (proof of concept), not a production-trained model. This tiny MLP recognizes digits 0, 1, 4, and 7 using manually designed feature detectors.

+### INT8 quantized model (CMSIS-NN accelerated)
+
+- Use the [CMSIS-NN export script](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/export_mlp_mnist_cmsis.py)
+
+```bash
+cd examples/raspberry_pi/pico2
+python export_mlp_mnist_cmsis.py # Creates balanced_tiny_mlp_mnist_cmsis.pte
+```
+
+This script uses the `CortexMQuantizer` to produce INT8 quantized ops that map to CMSIS-NN kernels on the Cortex-M33. The model I/O stays float; quantize and dequantize nodes are inserted inside the graph.
+
## Step 2: Build Firmware for Pico2

+### FP32 build
+
```bash
# Generate model (Creates balanced_tiny_mlp_mnist.pte)
cd ./examples/raspberry_pi/pico2
@@ -41,12 +58,30 @@ python export_mlp_mnist.py
cd -

# Build Pico2 firmware (one command!)
+./examples/raspberry_pi/pico2/build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte
+```
+
+### INT8 CMSIS-NN build
+
+```bash
+# Generate INT8 quantized model
+cd ./examples/raspberry_pi/pico2
+python export_mlp_mnist_cmsis.py
+cd -

-./examples/raspberry_pi/pico2/build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte # This creates executorch_pico.uf2, a firmware image for Pico2
+# Build with CMSIS-NN backend
+./examples/raspberry_pi/pico2/build_firmware_pico.sh --cmsis --model=balanced_tiny_mlp_mnist_cmsis.pte
```

Output: **executorch_pico.uf2** firmware file (examples/raspberry_pi/pico2/build/)

+**Script options:**
+
+| Flag | Description |
+|------|-------------|
+| `--model=FILE` | Model file to embed (path relative to pico2/) |
+| `--cmsis` | Build with CMSIS-NN INT8 kernels for Cortex-M33 acceleration |
+| `--clean` | Clean build directories and exit; run separately before building if needed |
+
**Note:** The [build_firmware_pico.sh](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/build_firmware_pico.sh) script converts the given model `.pte` into a hex array and generates C code for it via this helper [script](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/pte_to_array.py). That C code is then compiled into the final `.uf2` binary, which is flashed to Pico2 (a simplified sketch of this conversion follows).
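+
+For illustration, here is a simplified Python sketch of what that conversion step does. This is not the real `pte_to_array.py` (which may structure and name things differently); it only shows the idea of embedding the `.pte` bytes as a C array so the firmware needs no filesystem. The symbol names `model_pte` and `model_pte_len` are hypothetical:
+
+```python
+# Hypothetical sketch: turn a .pte file into a C source file with a byte array.
+from pathlib import Path
+
+def pte_to_c_array(pte_path: str, var_name: str = "model_pte") -> str:
+    data = Path(pte_path).read_bytes()
+    # Format the raw bytes as a comma-separated hex initializer, 12 per line.
+    rows = [
+        ", ".join(f"0x{b:02x}" for b in data[i : i + 12])
+        for i in range(0, len(data), 12)
+    ]
+    body = ",\n    ".join(rows)
+    return (
+        "#include <stddef.h>\n"
+        "#include <stdint.h>\n\n"
+        f"const uint8_t {var_name}[] = {{\n    {body}\n}};\n"
+        f"const size_t {var_name}_len = sizeof({var_name});\n"
+    )
+
+if __name__ == "__main__":
+    print(pte_to_c_array("balanced_tiny_mlp_mnist.pte"))
+```
+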
## Step 3: Flash to Pico2
@@ -72,6 +107,10 @@ screen /dev/tty.usbmodem1101 115200

Something like:

+📊 Memory usage after method load:
+   Method allocator: 45632 / 204800 bytes used
+   Activation pool: 204800 bytes allocated
+
=== Digit 7 ===
############################
############################
@@ -104,6 +143,7 @@ Something like:
Input stats: 159 white pixels out of 784 total
Running neural network inference...

+⏱️ Inference time: 245 us
✅ Neural network results:
   Digit 0: 370.000
   Digit 1: 0.000
@@ -116,7 +156,16 @@ Running neural network inference...
   Digit 8: -3.000
   Digit 9: -3.000

-� PREDICTED: 7 (Expected: 7) ✅ CORRECT!
+🎯 PREDICTED: 7 (Expected: 7) ✅ CORRECT!
+
+==================================================
+
+📊 Inference latency summary:
+   Digit 0: 312 us
+   Digit 1: 198 us
+   Digit 4: 267 us
+   Digit 7: 245 us
+   Average: 255 us
```

## Memory Optimization Tips
@@ -184,12 +233,29 @@ arm-none-eabi-objdump -t examples/raspberry_pi/pico2/build/executorch_pico.elf |
arm-none-eabi-readelf -l examples/raspberry_pi/pico2/build/executorch_pico.elf
```

+## CMSIS-NN INT8 Acceleration
+
+The Pico2 uses an RP2350 SoC with a Cortex-M33 core. The CMSIS-NN library provides optimized INT8 kernels that leverage the Cortex-M33's DSP instructions for faster inference than the FP32 portable ops.
+
+### How it works
+
+1. `export_mlp_mnist_cmsis.py` uses `CortexMQuantizer` to quantize the model to INT8
+2. The model I/O remains float; quantize/dequantize nodes are inserted inside the graph
+3. The `--cmsis` flag builds ExecuTorch with the Cortex-M backend and links the CMSIS-NN kernels
+4. At runtime, quantized linear ops dispatch to CMSIS-NN instead of the portable kernels
+
+### When to use CMSIS-NN
+
+- Lower latency on supported ops (linear, conv2d)
+- Smaller model size (INT8 weights vs FP32)
+- Trade-off: slight accuracy loss from quantization
+
## Next Steps

### Scale up your deployment

- Use a real, production-trained model
-- Optimize further → INT8 quantization, pruning
+- Optimize further → INT8 quantization with CMSIS-NN, pruning

### Happy Inference!

diff --git a/examples/raspberry_pi/pico2/README.md b/examples/raspberry_pi/pico2/README.md
index cab24d58278..9b61b5fc462 100644
--- a/examples/raspberry_pi/pico2/README.md
+++ b/examples/raspberry_pi/pico2/README.md
@@ -82,18 +82,39 @@ This involves two steps:

### Generate your model:

+**FP32 model (default):**
```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist.py # Creates balanced_tiny_mlp_mnist.pte
```

+**INT8 quantized model (CMSIS-NN accelerated):**
+```bash
+cd examples/raspberry_pi/pico2
+python export_mlp_mnist_cmsis.py # Creates balanced_tiny_mlp_mnist_cmsis.pte
+```
+
### Build firmware:

+**FP32 build:**
```bash
# In the dir examples/raspberry_pi/pico2
-./build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte # This creates executorch_pico.uf2, a firmware image for Pico2
+./build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte
```

+**INT8 CMSIS-NN build:**
+```bash
+# In the dir examples/raspberry_pi/pico2
+./build_firmware_pico.sh --cmsis --model=balanced_tiny_mlp_mnist_cmsis.pte
+```
+
+**Script options:**
+
+| Flag | Description |
+|------|-------------|
+| `--model=FILE` | Model file to embed (path relative to pico2/) |
+| `--cmsis` | Build with CMSIS-NN INT8 kernels for Cortex-M33 acceleration |
+| `--clean` | Clean build directories and exit (run separately before building) |
+
### Flash Firmware

Hold the BOOTSEL button on Pico2 and connect to your computer. It mounts as `RPI-RP2`.
Copy `executorch_pico.uf2` to this drive.
@@ -105,10 +126,14 @@ The Pico2 LED blinks 10 times at 500ms intervals for successful execution. Via s

```bash
...
...
-PREDICTED: 4 (Expected: 4) ✅ CORRECT!
+🎯 PREDICTED: 4 (Expected: 4) ✅ CORRECT!

==================================================

+📊 Memory usage after method load:
+   Method allocator: 45632 / 204800 bytes used
+   Activation pool: 204800 bytes allocated
+
=== Digit 7 ===
############################
############################
@@ -141,6 +166,7 @@ PREDICTED: 4 (Expected: 4) ✅ CORRECT!
Input stats: 159 white pixels out of 784 total
Running neural network inference...

+⏱️ Inference time: 245 us
✅ Neural network results:
   Digit 0: 370.000
   Digit 1: 0.000
@@ -153,11 +179,18 @@ Running neural network inference...
   Digit 8: -3.000
   Digit 9: -3.000

-� PREDICTED: 7 (Expected: 7) ✅ CORRECT!
+🎯 PREDICTED: 7 (Expected: 7) ✅ CORRECT!

==================================================

-🎉 All tests complete! PyTorch neural network works on Pico2!
+📊 Inference latency summary:
+   Digit 0: 312 us
+   Digit 1: 198 us
+   Digit 4: 267 us
+   Digit 7: 245 us
+   Average: 255 us
+
+🎉 All tests complete! ExecuTorch neural network inference works on Pico2!
```

### Debugging via Serial Terminal
@@ -170,4 +203,21 @@ screen /dev/tty.usbmodem1101 115200

Replace `/dev/tty.usbmodem1101` with your device path. If the LED blinks 10 times at 100ms intervals, check the logs for errors; if it blinks 10 times at 500ms intervals, the run was successful!

-Result: A complete PyTorch → ExecuTorch → Pico2 demo MNIST deployment! 🚀
+## CMSIS-NN INT8 Acceleration
+
+The Pico2 uses an RP2350 SoC with a Cortex-M33 core. The CMSIS-NN library provides optimized INT8 kernels that leverage the Cortex-M33's DSP instructions for faster inference than the FP32 portable ops.
+
+### How it works
+
+1. `export_mlp_mnist_cmsis.py` uses `CortexMQuantizer` to quantize the model to INT8 (see the export sketch at the end of this README)
+2. The model I/O remains float; quantize/dequantize nodes are inserted inside the graph
+3. The `--cmsis` flag builds ExecuTorch with the Cortex-M backend and links the CMSIS-NN kernels
+4. At runtime, quantized linear ops dispatch to CMSIS-NN instead of the portable kernels
+
+### When to use CMSIS-NN
+
+- Lower latency on supported ops (linear, conv2d)
+- Smaller model size (INT8 weights vs FP32)
+- Trade-off: slight accuracy loss from quantization
+
+Result: A complete PyTorch → ExecuTorch → Pico2 demo MNIST deployment! 🚀
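+
+The sketch below illustrates steps 1 and 2 of "How it works" using the standard PT2E quantization flow. It is a minimal, hypothetical sketch rather than the real `export_mlp_mnist_cmsis.py`: the `CortexMQuantizer` import path, the `TinyMLP` stand-in module, and the lowering call are assumptions, so check the actual script for the real flow:
+
+```python
+import torch
+from torch.export import export, export_for_training  # older torch uses capture_pre_autograd_graph
+from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+# Assumed import path; see export_mlp_mnist_cmsis.py for the real one.
+from executorch.backends.cortex_m.quantizer import CortexMQuantizer
+from executorch.exir import to_edge_transform_and_lower
+
+class TinyMLP(torch.nn.Module):
+    """Hypothetical stand-in for the hand-crafted 784-input MLP in this example."""
+    def __init__(self):
+        super().__init__()
+        self.net = torch.nn.Sequential(
+            torch.nn.Linear(784, 16), torch.nn.ReLU(), torch.nn.Linear(16, 10)
+        )
+
+    def forward(self, x):
+        return self.net(x)
+
+model = TinyMLP().eval()
+example_inputs = (torch.zeros(1, 784),)
+
+# Step 1: annotate and convert to INT8 with the Cortex-M quantizer.
+graph = export_for_training(model, example_inputs).module()
+prepared = prepare_pt2e(graph, CortexMQuantizer())
+prepared(*example_inputs)  # calibration pass; the real script would feed sample digits
+quantized = convert_pt2e(prepared)
+
+# Step 2: I/O stays float; quantize/dequantize nodes now live inside the graph.
+# Lower to an ExecuTorch program and save the .pte for the firmware build.
+program = to_edge_transform_and_lower(export(quantized, example_inputs)).to_executorch()
+with open("balanced_tiny_mlp_mnist_cmsis.pte", "wb") as f:
+    f.write(program.buffer)
+```
+
+Before flashing, you can sanity-check the generated `.pte` on the host (for example with ExecuTorch's Python runtime bindings) and compare its outputs against the FP32 model.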