docs/source/pico2_tutorial.md (72 changes: 69 additions & 3 deletions)
@@ -9,6 +9,7 @@ A 28×28 MNIST digit classifier running on memory-constrained, low-power microcontrollers
- Input: ASCII art digits (0, 1, 4, 7)
- Output: Real-time predictions via USB serial
- Memory: <400KB total footprint
- Two variants: FP32 (portable ops) and INT8 (CMSIS-NN accelerated)

## Prerequisites

@@ -24,29 +25,63 @@ which arm-none-eabi-gcc # --> arm/arm-scratch/arm-gnu-toolchain-13.3.rel1-x86_64

## Step 1: Generate a .pte from the Example Model

### FP32 model (default)

- Use the [provided example model](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/export_mlp_mnist.py)

```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist.py # Creates balanced_tiny_mlp_mnist.pte
```

- **Note:** This is a hand-crafted MNIST classifier (a proof of concept), not a production-trained model. The tiny MLP recognizes digits 0, 1, 4, and 7 using manually designed feature detectors; a sketch of the idea follows.
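For intuition, here is a minimal sketch of what such hand-designed detectors can look like. This is illustrative only; the `TinyMLP` name, layer sizes, and template weights below are assumptions, not the actual contents of `export_mlp_mnist.py`:

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Illustrative 784 -> 16 -> 10 MLP with one hand-set detector."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 16)
        self.fc2 = nn.Linear(16, 10)
        with torch.no_grad():
            # Hypothetical detector: respond to a bright horizontal stroke
            # across the top rows, a feature typical of handwritten 7s.
            template = torch.zeros(28, 28)
            template[3:6, 4:24] = 1.0
            self.fc1.weight[0] = template.flatten()

    def forward(self, x):  # x: (batch, 784), pixel values in [0, 1]
        return self.fc2(torch.relu(self.fc1(x)))
```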

### INT8 quantized model (CMSIS-NN accelerated)

- Use the [CMSIS-NN export script](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/export_mlp_mnist_cmsis.py)

```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist_cmsis.py # Creates balanced_tiny_mlp_mnist_cmsis.pte
```

This uses the `CortexMQuantizer` to produce INT8 quantized ops that map to CMSIS-NN kernels on the Cortex-M33. The model I/O stays float; quantize and dequantize nodes are inserted inside the graph.
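A condensed sketch of that export flow, reusing the `TinyMLP` sketch above. The `CortexMQuantizer` import path and the exact capture API are assumptions, so treat `export_mlp_mnist_cmsis.py` as the authoritative version:

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

from executorch.exir import to_edge_transform_and_lower
# Assumed import path -- check export_mlp_mnist_cmsis.py for the real one.
from executorch.backends.cortex_m.quantizer import CortexMQuantizer

model = TinyMLP().eval()
example = (torch.rand(1, 784),)

# Capture, annotate with the quantizer, calibrate, then convert to INT8.
captured = torch.export.export(model, example).module()
prepared = prepare_pt2e(captured, CortexMQuantizer())
prepared(*example)  # one calibration pass with a representative input
quantized = convert_pt2e(prepared)

# Lower to an ExecuTorch program; the real script may also apply
# backend-specific passes or partitioners. I/O stays float, and the
# quantize/dequantize nodes live inside the graph.
et_program = to_edge_transform_and_lower(
    torch.export.export(quantized, example)
).to_executorch()

with open("balanced_tiny_mlp_mnist_cmsis.pte", "wb") as f:
    f.write(et_program.buffer)
```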

## Step 2: Build Firmware for Pico2

### FP32 build

```bash
# Generate model (Creates balanced_tiny_mlp_mnist.pte)
cd ./examples/raspberry_pi/pico2
python export_mlp_mnist.py
cd -

# Build Pico2 firmware (one command!)
./examples/raspberry_pi/pico2/build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte
```

### INT8 CMSIS-NN build

```bash
# Generate INT8 quantized model
cd ./examples/raspberry_pi/pico2
python export_mlp_mnist_cmsis.py
cd -

# Build with CMSIS-NN backend
./examples/raspberry_pi/pico2/build_firmware_pico.sh --cmsis --model=balanced_tiny_mlp_mnist_cmsis.pte
```

Output: **executorch_pico.uf2** firmware image in `examples/raspberry_pi/pico2/build/`

**Script options:**
| Flag | Description |
|------|-------------|
| `--model=FILE` | Specify model file to embed (relative to pico2/) |
| `--cmsis` | Build with CMSIS-NN INT8 kernels for Cortex-M33 acceleration |
| `--clean` | Clean build directories and exit; run separately before building if needed |

**Note:** The '[build_firmware_pico.sh](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/build_firmware_pico.sh)' script converts the given model .pte into a hex array and generates C code for it via this helper [script](https://github.com/pytorch/executorch/blob/main/examples/raspberry_pi/pico2/pte_to_array.py). That C code is then compiled into the final .uf2 binary, which is flashed to Pico2.
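The conversion idea is simple; a minimal sketch of it (variable names and output format are illustrative, not what `pte_to_array.py` actually emits):

```python
# Embed a .pte file into firmware as a C byte array (illustrative sketch).
def pte_to_c_array(pte_path: str, var_name: str = "model_pte") -> str:
    with open(pte_path, "rb") as f:
        data = f.read()
    hex_bytes = ", ".join(f"0x{b:02x}" for b in data)
    return (
        "#include <stddef.h>\n"
        f"const unsigned char {var_name}[] = {{{hex_bytes}}};\n"
        f"const size_t {var_name}_len = sizeof({var_name});\n"
    )

if __name__ == "__main__":
    print(pte_to_c_array("balanced_tiny_mlp_mnist.pte"))
```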

## Step 3: Flash to Pico2
@@ -72,6 +107,10 @@ screen /dev/tty.usbmodem1101 115200

Something like:

📊 Memory usage after method load:
Method allocator: 45632 / 204800 bytes used
Activation pool: 204800 bytes allocated

=== Digit 7 ===
############################
############################
@@ -104,6 +143,7 @@ Something like:

Input stats: 159 white pixels out of 784 total
Running neural network inference...
⏱️ Inference time: 245 us
✅ Neural network results:
Digit 0: 370.000
Digit 1: 0.000
@@ -116,7 +156,16 @@ Running neural network inference...
Digit 8: -3.000
Digit 9: -3.000

🎯 PREDICTED: 7 (Expected: 7) ✅ CORRECT!

==================================================

📊 Inference latency summary:
Digit 0: 312 us
Digit 1: 198 us
Digit 4: 267 us
Digit 7: 245 us
Average: 255 us
```

## Memory Optimization Tips
@@ -184,12 +233,29 @@ arm-none-eabi-objdump -t examples/raspberry_pi/pico2/build/executorch_pico.elf |
arm-none-eabi-readelf -l examples/raspberry_pi/pico2/build/executorch_pico.elf
```

## CMSIS-NN INT8 Acceleration

The Pico2 uses an RP2350 SoC with a Cortex-M33 core. The CMSIS-NN library provides optimized INT8 kernels that leverage the Cortex-M33's DSP instructions for faster inference compared to FP32 portable ops.

### How it works

1. `export_mlp_mnist_cmsis.py` uses `CortexMQuantizer` to quantize the model to INT8
2. The model I/O remains float; quantize/dequantize nodes are inserted inside the graph
3. The `--cmsis` flag builds ExecuTorch with the Cortex-M backend and links CMSIS-NN kernels
4. At runtime, quantized linear ops dispatch to CMSIS-NN instead of portable kernels

### When to use CMSIS-NN

- Lower latency on supported ops (linear, conv2d)
- Smaller model size (INT8 weights vs FP32)
- Trade-off: slight accuracy loss from quantization (a quick way to measure it is sketched below)
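Before committing to the INT8 build, you can gauge that accuracy loss on the host. A quick sketch, assuming the `model` (FP32) and `quantized` objects from the export sketch in Step 1:

```python
import torch

# Compare FP32 logits against the INT8-simulated graph (float I/O).
x = torch.rand(32, 784)
with torch.no_grad():
    ref = model(x)
    q = quantized(x)

print("max abs logit error:", (ref - q).abs().max().item())
agreement = (ref.argmax(dim=-1) == q.argmax(dim=-1)).float().mean().item()
print(f"prediction agreement: {agreement:.1%}")
```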

## Next Steps

### Scale up your deployment

- Use a real, production-trained model
- Optimize further with INT8 quantization (CMSIS-NN) and pruning; see the pruning sketch below
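A minimal pruning sketch using PyTorch's built-in utilities, assuming the `TinyMLP` model from Step 1 (a starting point, not a tuned recipe; fine-tune afterwards to recover accuracy):

```python
import torch.nn.utils.prune as prune

# Zero the 50% smallest-magnitude weights in each linear layer.
for module in (model.fc1, model.fc2):
    prune.l1_unstructured(module, name="weight", amount=0.5)
    prune.remove(module, "weight")  # bake the mask into the weight tensor
```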

### Happy Inference!

examples/raspberry_pi/pico2/README.md (60 changes: 55 additions & 5 deletions)
@@ -82,18 +82,39 @@ This involves two steps:

### Generate your model:

**FP32 model (default):**
```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist.py # Creates balanced_tiny_mlp_mnist.pte
```

**INT8 quantized model (CMSIS-NN accelerated):**
```bash
cd examples/raspberry_pi/pico2
python export_mlp_mnist_cmsis.py # Creates balanced_tiny_mlp_mnist_cmsis.pte
```

### Build firmware:

**FP32 build:**
```bash
# In the dir examples/raspberry_pi/pico2
./build_firmware_pico.sh --model=balanced_tiny_mlp_mnist.pte
```

**INT8 CMSIS-NN build:**
```bash
# In the dir examples/raspberry_pi/pico2
./build_firmware_pico.sh --cmsis --model=balanced_tiny_mlp_mnist_cmsis.pte
```

**Script options:**
| Flag | Description |
|------|-------------|
| `--model=FILE` | Specify model file to embed (relative to pico2/) |
| `--cmsis` | Build with CMSIS-NN INT8 kernels for Cortex-M33 acceleration |
| `--clean` | Clean build directories and exit (run separately before building) |

### Flash Firmware

Hold the BOOTSEL button on Pico2 while connecting it to your computer. It mounts as a USB mass-storage drive (named `RP2350` on Pico2). Copy `executorch_pico.uf2` to this drive.
@@ -105,10 +126,14 @@ The Pico2 LED blinks 10 times at 500ms intervals for successful execution. Via serial:
```bash
...
...
🎯 PREDICTED: 4 (Expected: 4) ✅ CORRECT!

==================================================

📊 Memory usage after method load:
Method allocator: 45632 / 204800 bytes used
Activation pool: 204800 bytes allocated

=== Digit 7 ===
############################
############################
@@ -141,6 +166,7 @@

Input stats: 159 white pixels out of 784 total
Running neural network inference...
⏱️ Inference time: 245 us
✅ Neural network results:
Digit 0: 370.000
Digit 1: 0.000
@@ -153,11 +179,18 @@ Running neural network inference...
Digit 8: -3.000
Digit 9: -3.000

🎯 PREDICTED: 7 (Expected: 7) ✅ CORRECT!

==================================================

📊 Inference latency summary:
Digit 0: 312 us
Digit 1: 198 us
Digit 4: 267 us
Digit 7: 245 us
Average: 255 us

🎉 All tests complete! ExecuTorch inference of neural network works on Pico2!
```

### Debugging via Serial Terminal
Expand All @@ -170,4 +203,21 @@ screen /dev/tty.usbmodem1101 115200

Replace `/dev/tty.usbmodem1101` with your device path. If the LED blinks 10 times at 100ms intervals, check the logs for errors; if it blinks 10 times at 500ms intervals, the run succeeded.

## CMSIS-NN INT8 Acceleration

The Pico2 uses an RP2350 SoC with a Cortex-M33 core. The CMSIS-NN library provides optimized INT8 kernels that leverage the Cortex-M33's DSP instructions for faster inference compared to FP32 portable ops.

### How it works

1. `export_mlp_mnist_cmsis.py` uses `CortexMQuantizer` to quantize the model to INT8
2. The model I/O remains float; quantize/dequantize nodes are inserted inside the graph
3. The `--cmsis` flag builds ExecuTorch with the Cortex-M backend and links CMSIS-NN kernels
4. At runtime, quantized linear ops dispatch to CMSIS-NN instead of portable kernels

### When to use CMSIS-NN

- Lower latency on supported ops (linear, conv2d)
- Smaller model size (INT8 weights vs FP32)
- Trade-off: slight accuracy loss from quantization

Result: A complete PyTorch → ExecuTorch → Pico2 MNIST demo deployment!