"This tutorial walks through the complete pipeline for deploying **SpeechNet** (a lightweight CNN for EMG-based silent speech recognition) on the **Siracusa RISC-V MCU** using **Deeploy**.\n",
"\n",
"You will learn:\n",
"1. How to define a Deeploy-friendly PyTorch model\n",
"2. How to export inference and training ONNX graphs using Onnx4Deeploy\n",
"3. How to run untiled and tiled Deeploy deployment on Siracusa (GVSoC)\n",
"4. Key design decisions and pitfalls\n",
"\n",
"**Prerequisites**: Familiarity with PyTorch, ONNX, and basic knowledge of RISC-V MCU architectures.\n",
"\n",
"**Reference**: Spacone et al., \"SilentWear: an Ultra-Low Power Wearable System for EMG-based Silent Speech Recognition\", arXiv: 2603.02847."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Model Architecture\n",
"\n",
"SpeechNet is a 5-block CNN processing 14-channel EMG signals:\n",
"## 2. Defining a Deeploy-Friendly PyTorch Model\n",
"\n",
"When designing a model for Deeploy deployment, follow these rules:\n",
"\n",
"### Rule 1: No dynamic ONNX ops\n",
"Avoid `torch.flatten()`, `x.size()`, and `x.shape[N]` in the forward pass. They export as dynamic `Shape`/`Gather`/`Flatten` ops in ONNX, which Deeploy cannot handle.\n",
"**Symptom**: Tiled training hangs; GVSoC keeps running but produces no output.\n",
"**Cause**: The Im2Col ConvGradX kernel derives `ctxtBufferSize` from the full-op dimensions (e.g., 1.2 MB), but the actual L1 allocation is only ~120 KB. The kernel's `co_block` auto-tuning therefore overestimates the available buffer and overflows L1.\n",
"**Fix**: Use the naive ConvGradX kernel (`referenceConvGradX2DTemplate`), which does not require an im2col buffer. Make the change in `Bindings.py`.\n",
"**Symptom**: `exitcode: -9` (SIGKILL) — simulations kill each other.\n",
"**Fix**: Use `PYTEST_XDIST_WORKER=<unique_id>` to isolate build directories.\n",
"\n",
"### Pitfall 5: GVSoC stdout is fully buffered\n",
"**Symptom**: The simulation runs, but no `printf` output is visible.\n",
"**Fix**: Use `--trace=cluster/pe0/insn` to force output, or use `ring_tee.py` for bounded trace capture with heartbeat monitoring."
]
},
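As an illustration of the fix for the `exitcode: -9` pitfall above, here is a minimal sketch of how a per-worker build directory could be derived from `PYTEST_XDIST_WORKER`. The helper name `isolated_build_dir` and the `build_<worker>` naming scheme are hypothetical; how Deeploy's test harness actually consumes the variable is not shown in this tutorial.

```python
import os
from pathlib import Path


def isolated_build_dir(base, env=None):
    """Return a build directory that is unique to the current worker.

    Hypothetical helper: the directory naming is an assumption made
    for illustration, not Deeploy's actual layout.
    """
    env = os.environ if env is None else env
    # pytest-xdist sets PYTEST_XDIST_WORKER (e.g. "gw0", "gw1") in each worker.
    # Fall back to "main" when the variable is not set.
    worker = env.get("PYTEST_XDIST_WORKER", "main")
    return Path(base) / f"build_{worker}"


if __name__ == "__main__":
    # Each worker prints a different path, so builds cannot clobber each other.
    print(isolated_build_dir("/tmp/deeploy_tests"))
```

When launching simulations by hand rather than through pytest-xdist, the same effect can be had by exporting a distinct `PYTEST_XDIST_WORKER` value per shell before each run.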
{
"cell_type": "markdown",
"metadata": {},
"source": "## 7. Debugging with GVSoC Traces\n\nWhen a simulation hangs or produces wrong results, use GVSoC's built-in tracing:\n\n### Trace FC (fabric controller) instructions\n```bash\ngvsoc --target=siracusa --binary=<bin> --work-dir=<dir> \\\n --trace=fc/insn image flash run 2>trace_fc.txt\n```\nShows every instruction the FC executes. Useful for finding where FC is stuck (e.g., `pi_task_wait_on` = waiting for cluster, `memcpy` = initializing data).\n\n### Trace cluster PE instructions\n```bash\ngvsoc --target=siracusa --binary=<bin> --work-dir=<dir> \\\n --trace=cluster/pe0/insn image flash run 2>trace_pe0.txt\n```\nShows PE0's instructions. Look for the function name in the trace to identify which kernel is running:\n```\n125461135406: 9037685: [/chip/cluster/pe0/insn] PULP_Conv2d_Im2Col_fp32_fp32_f:0 M 1c031d58 flw ...\n```\n\n### Trace memory accesses (LSU)\n```bash\n--trace=cluster/pe0/lsu\n```\nCatches invalid memory accesses:\n```\nInvalid access (pc: 0x1c01c94c, offset: 0x3c9cf7a9, size: 0x3, is_write: 0)\n```\nThis means a kernel tried to read address `0x3c9cf7a9` which is outside L1/L2 — indicates a buffer overflow or wrong DMA offset.\n\n### Useful trace targets\n\n| Trace flag | What it shows |\n|-----------|--------------|\n| `fc/insn` | FC instruction stream |\n| `cluster/pe0/insn` | Cluster PE0 instructions |\n| `cluster/pe0/lsu` | PE0 memory load/store events |\n| `cluster/dma` | DMA transfer events |\n\n### Tips\n- Redirect trace to a file (`2>trace.txt`) — trace output goes to stderr\n- Use `timeout 30 gvsoc ...` to limit trace duration\n- Look at the **last few lines** of the trace to find where it's stuck\n- Use `llvm-objdump -d <binary>` to map PC addresses to function names"
},
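The trace-reading tips in Section 7 (look at the last few lines, spot invalid accesses) can be sketched in Python. The patterns below are modelled only on the sample `cluster/pe0/insn` line and the `Invalid access` message quoted in that section; real GVSoC output may differ in detail. The function names `last_function` and `find_invalid_accesses` are illustrative.

```python
import re

# Assumed layout, taken from the sample line in Section 7 (not a GVSoC spec):
# "<time>: <cycle>: [<path>] <function>:<n> M <pc> <mnemonic> ..."
INSN_RE = re.compile(
    r"^(?P<time>\d+):\s+(?P<cycle>\d+):\s+\[(?P<path>[^\]]+)\]\s+"
    r"(?P<func>\S+?):\d+\s+M\s+(?P<pc>[0-9a-fA-F]+)\s+(?P<mnemonic>\S+)"
)
INVALID_RE = re.compile(
    r"Invalid access \(pc: (?P<pc>0x[0-9a-fA-F]+), "
    r"offset: (?P<offset>0x[0-9a-fA-F]+), "
    r"size: (?P<size>0x[0-9a-fA-F]+), is_write: (?P<is_write>[01])\)"
)


def last_function(lines):
    """Return (function, pc) for the last parseable instruction line, or None."""
    for line in reversed(list(lines)):
        match = INSN_RE.match(line.strip())
        if match:
            return match.group("func"), int(match.group("pc"), 16)
    return None


def find_invalid_accesses(lines):
    """Collect every 'Invalid access' report as a dict of integers."""
    reports = []
    for line in lines:
        match = INVALID_RE.search(line)
        if match:
            reports.append({k: int(v, 16) for k, v in match.groupdict().items()})
    return reports


if __name__ == "__main__":
    sample = [
        "125461135406: 9037685: [/chip/cluster/pe0/insn] "
        "PULP_Conv2d_Im2Col_fp32_fp32_f:0 M 1c031d58 flw ...",
        "Invalid access (pc: 0x1c01c94c, offset: 0x3c9cf7a9, size: 0x3, is_write: 0)",
    ]
    print(last_function(sample))
    print(find_invalid_accesses(sample))
```

For a captured trace, pass the open file instead of the sample list, for example `last_function(open("trace_pe0.txt"))`. The reported `pc` can then be looked up in the `llvm-objdump -d <binary>` listing.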
{
"cell_type": "markdown",
"metadata": {},
"source": "## 8. Exercises\n\n1. **Export and deploy SpeechNet inference** on Siracusa. Compare the ONNX node count with the training graph.\n\n2. **Try `last_layer` training strategy** — only fine-tune the FC layer. Compare cycle count with full training.\n\n3. **Increase training steps** — export with `--n-batches 16` (or `--n-steps 8 --n-accum 2`). Run on GVSoC and observe how loss evolves over more steps. Does it converge?\n\n4. **Debug a hang**: Intentionally use `torch.flatten(x, 1)` in the model, export training ONNX, and observe what extra ops appear. Then fix it."