Commit 24bae6b

runwangdl and claude committed
docs: add SpeechNet on-device training tutorial notebook
Step-by-step tutorial covering PyTorch model design, Onnx4Deeploy export, untiled/tiled Deeploy deployment, tiling pipeline overview, common pitfalls, and GVSoC trace debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 95fef65 commit 24bae6b

1 file changed

Lines changed: 380 additions & 0 deletions

@@ -0,0 +1,380 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# SpeechNet On-Device Training Tutorial\n",
8+
"\n",
9+
"This tutorial walks through the complete pipeline for deploying **SpeechNet** (a lightweight CNN for EMG-based silent speech recognition) on the **Siracusa RISC-V MCU** using **Deeploy**.\n",
10+
"\n",
11+
"You will learn:\n",
12+
"1. How to define a Deeploy-friendly PyTorch model\n",
13+
"2. How to export inference and training ONNX graphs using Onnx4Deeploy\n",
14+
"3. How to run untiled and tiled Deeploy deployment on Siracusa (GVSoC)\n",
15+
"4. Key design decisions and pitfalls\n",
16+
"\n",
17+
"**Prerequisites**: Familiarity with PyTorch and ONNX, plus basic knowledge of RISC-V MCU architectures.\n",
18+
"\n",
19+
"**Reference**: Spacone et al., \"SilentWear: an Ultra-Low Power Wearable System for EMG-based Silent Speech Recognition\", arXiv: 2603.02847."
20+
]
21+
},
22+
{
23+
"cell_type": "markdown",
24+
"metadata": {},
25+
"source": [
26+
"## 1. Model Architecture\n",
27+
"\n",
28+
"SpeechNet is a 5-block CNN that takes a `(1, 14, 700)` input (14 EMG channels × 700 time steps). Each block applies Conv → BatchNorm → ReLU → AvgPool:\n",
29+
"\n",
30+
"| Block | Conv kernel | In→Out channels | Output shape |\n",
31+
"|-------|------------|-----------------|-------------|\n",
32+
"| 0 | (1, 4) | 1 → 8 | (8, 14, 87) after AvgPool(1,8) |\n",
33+
"| 1 | (1, 16) | 8 → 16 | (16, 14, 22) after AvgPool(1,4) |\n",
34+
"| 2 | (1, 8) | 16 → 16 | (16, 14, 5) after AvgPool(1,4) |\n",
35+
"| 3 | (7, 1) | 16 → 32 | (32, 8, 5) after AvgPool(1,1) |\n",
36+
"| 4 | (7, 1) | 32 → 32 | (32, 2, 5) after AvgPool(1,1) |\n",
37+
"\n",
38+
"Followed by GlobalAvgPool → Reshape → Linear(32, 9).\n",
39+
"\n",
40+
"Total: ~15K parameters, 9 output classes (8 speech commands + rest)."
41+
]
42+
},
43+
{
44+
"cell_type": "markdown",
45+
"metadata": {},
46+
"source": [
47+
"## 2. Defining a Deeploy-Friendly PyTorch Model\n",
48+
"\n",
49+
"When designing a model for Deeploy deployment, follow these rules:\n",
50+
"\n",
51+
"### Rule 1: No dynamic ONNX ops\n",
52+
"Avoid `torch.flatten()`, `x.size()`, `x.shape[N]` in the forward pass. These generate dynamic `Shape`/`Gather`/`Flatten` ops in ONNX that Deeploy cannot handle.\n",
53+
"\n",
54+
"**Bad:**\n",
55+
"```python\n",
56+
"x = torch.flatten(x, 1) # generates Flatten + Shape in backward\n",
57+
"```\n",
58+
"\n",
59+
"**Good:**\n",
60+
"```python\n",
61+
"x = x.reshape(1, self._fc_in) # static reshape, batch=1 for deployment\n",
62+
"```\n",
63+
"\n",
64+
"### Rule 2: Use AvgPool instead of MaxPool\n",
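65+
"The MaxPool gradient needs the argmax indices saved from the forward pass, which costs extra memory. The AvgPool gradient needs no saved state: each output gradient is divided by the window size and spread evenly over its window.\n",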
66+
"\n",
67+
"### Rule 3: No Dropout\n",
68+
"Dropout is a no-op at inference and unnecessary for on-device fine-tuning."
69+
]
70+
},
71+
{
72+
"cell_type": "code",
73+
"execution_count": null,
74+
"metadata": {},
75+
"outputs": [],
76+
"source": [
77+
"import torch\n",
78+
"import torch.nn as nn\n",
79+
"from typing import Any, Dict, List, Optional\n",
80+
"\n",
81+
"\n",
82+
"class SpeechNetDeploy(nn.Module):\n",
83+
" \"\"\"Deployment-ready SpeechNet for Deeploy on PULP MCUs.\"\"\"\n",
84+
"\n",
85+
" def __init__(\n",
86+
" self,\n",
87+
" num_channels: int = 14,\n",
88+
" time_steps: int = 700,\n",
89+
" num_classes: int = 9,\n",
90+
" blocks_config: Optional[List[Dict[str, Any]]] = None,\n",
91+
" ):\n",
92+
" super().__init__()\n",
93+
" if blocks_config is None:\n",
94+
" blocks_config = [\n",
95+
" dict(out_channels=8, kernel=(1, 4), pool=(1, 8)),\n",
96+
" dict(out_channels=16, kernel=(1, 16), pool=(1, 4)),\n",
97+
" dict(out_channels=16, kernel=(1, 8), pool=(1, 4)),\n",
98+
" dict(out_channels=32, kernel=(7, 1), pool=(1, 1)),\n",
99+
" dict(out_channels=32, kernel=(7, 1), pool=(1, 1)),\n",
100+
" ]\n",
101+
"\n",
102+
" self.blocks = nn.ModuleList()\n",
103+
" in_ch = 1\n",
104+
" for cfg in blocks_config:\n",
105+
" out_ch = cfg[\"out_channels\"]\n",
106+
" k_c, k_t = cfg[\"kernel\"]\n",
107+
" pool_c, pool_t = cfg.get(\"pool\", (1, 1))\n",
108+
" layers = [\n",
109+
" nn.Conv2d(in_ch, out_ch, kernel_size=(k_c, k_t),\n",
110+
" padding=(0, k_t // 2), bias=True),\n",
111+
" nn.BatchNorm2d(out_ch),\n",
112+
" nn.ReLU(inplace=False), # inplace=False for clean ONNX\n",
113+
" nn.AvgPool2d(kernel_size=(pool_c, pool_t),\n",
114+
" stride=(pool_c, pool_t)),\n",
115+
" ]\n",
116+
" self.blocks.append(nn.Sequential(*layers))\n",
117+
" in_ch = out_ch\n",
118+
"\n",
119+
" self.global_pool = nn.AdaptiveAvgPool2d((1, 1))\n",
120+
" self._fc_in = in_ch # stored as Python int for static reshape\n",
121+
" self.fc = nn.Linear(in_ch, num_classes)\n",
122+
"\n",
123+
" def forward(self, x: torch.Tensor) -> torch.Tensor:\n",
124+
" for block in self.blocks:\n",
125+
" x = block(x)\n",
126+
" x = self.global_pool(x)\n",
127+
" # Static reshape: avoids dynamic Shape/Flatten ops in ONNX\n",
128+
" x = x.reshape(1, self._fc_in)\n",
129+
" x = self.fc(x)\n",
130+
" return x\n",
131+
"\n",
132+
"\n",
133+
"model = SpeechNetDeploy()\n",
134+
"x = torch.randn(1, 1, 14, 700)\n",
135+
"y = model(x)\n",
136+
"print(f\"Input: {x.shape} → Output: {y.shape}\")\n",
137+
"print(f\"Parameters: {sum(p.numel() for p in model.parameters()):,}\")"
138+
]
139+
},
140+
{
141+
"cell_type": "markdown",
142+
"metadata": {},
143+
"source": [
144+
"## 3. Exporting ONNX with Onnx4Deeploy\n",
145+
"\n",
146+
"Onnx4Deeploy provides a unified CLI for exporting models to ONNX format compatible with Deeploy.\n",
147+
"\n",
148+
"### 3.1 Inference Export\n",
149+
"\n",
150+
"```bash\n",
151+
"cd /path/to/Onnx4Deeploy\n",
152+
"python Onnx4Deeploy.py -model SpeechNet -mode infer\n",
153+
"```\n",
154+
"\n",
155+
"This produces:\n",
156+
"- `onnx/model/speechnet_infer/network.onnx` — inference graph (BN folded into Conv)\n",
157+
"- `onnx/model/speechnet_infer/inputs.npz` — test input\n",
158+
"- `onnx/model/speechnet_infer/outputs.npz` — reference output\n",
159+
"\n",
160+
"### 3.2 Training Export\n",
161+
"\n",
162+
"```bash\n",
163+
"python Onnx4Deeploy.py -model SpeechNet -mode train\n",
164+
"```\n",
165+
"\n",
166+
"This produces:\n",
167+
"- `onnx/model/speechnet_train/network.onnx` — training graph (forward + backward + gradient accumulation)\n",
168+
"- `onnx/model/speechnet_train/inputs.npz` — multi-batch training data\n",
169+
"- `onnx/model/speechnet_train/outputs.npz` — reference updated weights + losses\n",
170+
"- `onnx/model/speechnet_optimizer/network.onnx` — SGD optimizer graph"
171+
]
172+
},
173+
{
174+
"cell_type": "code",
175+
"execution_count": null,
176+
"metadata": {},
177+
"outputs": [],
178+
"source": [
179+
"# Verify the training ONNX graph structure\n",
180+
"import onnx\n",
181+
"from collections import Counter\n",
182+
"\n",
183+
"m = onnx.load(\"onnx/model/speechnet_train/network.onnx\")\n",
184+
"c = Counter(n.op_type for n in m.graph.node)\n",
185+
"print(f\"Total nodes: {len(m.graph.node)}\")\n",
186+
"print(f\"Forward ops: Conv={c['Conv']}, BN={c['BatchNormInternal']}, Relu={c['Relu']}, AvgPool={c['AveragePool']}\")\n",
187+
"print(f\"Backward ops: ConvGrad={c['ConvGrad']}, BNGrad={c['BatchNormalizationGrad']}, ReluGrad={c['ReluGrad']}\")\n",
188+
"print(f\"Training ops: InPlaceAccumulatorV2={c['InPlaceAccumulatorV2']}, SoftmaxCELoss={c['SoftmaxCrossEntropyLoss']}\")\n",
189+
"\n",
190+
"# Check for dynamic ops (should be 0)\n",
191+
"dynamic_ops = ['Shape', 'Flatten', 'Expand', 'Gather']\n",
192+
"bad = {op: c[op] for op in dynamic_ops if c.get(op, 0) > 0}\n",
193+
"assert not bad, f\"Dynamic ops found: {bad}\"\n",
194+
"print(\"\\n✅ Clean graph — no dynamic ops\")"
195+
]
196+
},
197+
{
198+
"cell_type": "markdown",
199+
"metadata": {},
200+
"source": [
201+
"### 3.3 Training Strategies\n",
202+
"\n",
203+
"You can control which layers are trainable:\n",
204+
"\n",
205+
"```bash\n",
206+
"# Full training (all layers)\n",
207+
"python Onnx4Deeploy.py -model SpeechNet -mode train\n",
208+
"\n",
209+
"# Last-layer only (transfer learning)\n",
210+
"python Onnx4Deeploy.py -model SpeechNet -mode train --training-strategy last_layer\n",
211+
"```\n",
212+
"\n",
213+
"The training strategy controls the backward graph size:\n",
214+
"\n",
215+
"| Strategy | Trainable tensors | Backward ops | Use case |\n",
216+
"|----------|-----------------|-------------|----------|\n",
217+
"| `full` | 22 | ConvGrad×5, BNGrad×5, ReluGrad×5, AvgPoolGrad×5 | Full fine-tuning |\n",
218+
"| `last_layer` | 2 (fc only) | Gemm backward only | Quick adaptation |\n",
219+
"| `custom` | User-defined | Depends on selection | Selective fine-tuning |"
220+
]
221+
},
222+
{
223+
"cell_type": "markdown",
224+
"metadata": {},
225+
"source": [
226+
"## 4. Deploying with Deeploy on Siracusa\n",
227+
"\n",
228+
"### 4.1 Environment Setup\n",
229+
"\n",
230+
"```bash\n",
231+
"# Activate the TrainDeeploy environment\n",
232+
"source /path/to/TrainDeeploy/activate_traindeeploy.sh\n",
233+
"cd TrainDeeploy/DeeployTest\n",
234+
"```\n",
235+
"\n",
236+
"### 4.2 Untiled Deployment (Smoke Test)\n",
237+
"\n",
238+
"Run the untiled version first to verify numerical correctness:\n",
239+
"\n",
240+
"```bash\n",
241+
"python deeployTrainingRunner_siracusa.py \\\n",
242+
" -t /path/to/Onnx4Deeploy/onnx/model/speechnet_train\n",
243+
"```\n",
244+
"\n",
245+
"Expected output:\n",
246+
"```\n",
247+
"=== Siracusa Training Harness (Phase 2 — with OptimizerNetwork) ===\n",
248+
"N_TRAIN_STEPS=4 N_ACCUM_STEPS=1 DATA_INPUTS=2\n",
249+
"Initializing TrainingNetwork...\n",
250+
"Initializing OptimizerNetwork...\n",
251+
"Starting training (4 optimizer steps x 1 accum steps)...\n",
252+
"update 1/4 accum 1/1 (mini-batch 0)\n",
253+
"...\n",
254+
"[loss 0] computed=2.267950 ref=2.267950 diff=0.000000 TOL=0.001000\n",
255+
"[loss 1] computed=2.498553 ref=2.498553 diff=0.000000 TOL=0.001000\n",
256+
"[loss 2] computed=2.083153 ref=2.083153 diff=0.000000 TOL=0.001000\n",
257+
"[loss 3] computed=1.905963 ref=1.905963 diff=0.000000 TOL=0.001000\n",
258+
"Errors: 0 out of 4\n",
259+
"BENCH train_cycles=285250543 opt_cycles=429083 weight_sram=61956\n",
260+
"\n",
261+
"✓ Test speechnet_train PASSED - No errors found\n",
262+
"```\n",
263+
"\n",
264+
"### 4.3 Tiled Deployment\n",
265+
"\n",
266+
"For real MCU deployment, use tiling to fit within L1 memory:\n",
267+
"\n",
268+
"```bash\n",
269+
"python deeployTrainingRunner_tiled_siracusa.py \\\n",
270+
" -t /path/to/Onnx4Deeploy/onnx/model/speechnet_train \\\n",
271+
" --l1 128000 --l2 2000000\n",
272+
"```\n",
273+
"\n",
274+
"The tiler automatically splits large activations into tiles that fit within the L1 budget set by `--l1` (128,000 bytes here, about 128 KB)."
275+
]
276+
},
277+
{
278+
"cell_type": "markdown",
279+
"metadata": {},
280+
"source": [
281+
"## 5. Understanding the Tiling Pipeline\n",
282+
"\n",
283+
"Deeploy's tiling pipeline works as follows:\n",
284+
"\n",
285+
"```\n",
286+
"ONNX graph\n",
287+
"  ↓\n",
288+
"FrontEnd: graph lowering, node renaming, constant folding\n",
289+
"  ↓\n",
290+
"Parse: match each node to a NodeMapper (Parser + Bindings)\n",
291+
"  ↓\n",
292+
"Broadcast: compute/update tensor shapes\n",
293+
"  ↓\n",
294+
"TypeCheck: select the best NodeBinding (Template + TypeChecker)\n",
295+
"  ↓\n",
296+
"Bind: hoist transient buffers (e.g., im2col), set up execution blocks\n",
297+
"  ↓\n",
298+
"Tile: OR-Tools solver finds tile dimensions under L1/L2 constraints\n",
299+
"  ↓\n",
300+
"CodeGen: render C code with per-tile DMA + kernel calls\n",
301+
"  ↓\n",
302+
"Build: compile with LLVM for RISC-V\n",
303+
"  ↓\n",
304+
"Simulate: run on GVSoC cycle-accurate simulator\n",
305+
"```\n",
306+
"\n",
307+
"### Key concepts:\n",
308+
"\n",
309+
"- **TileConstraint**: Defines how each op can be tiled (which dims are free, which are pinned)\n",
310+
"- **Transient buffers**: Scratch memory needed by kernels (e.g., im2col buffer for Conv)\n",
311+
"- **Memory hierarchy**: L1 (128 KB SRAM, fast) → L2 (2 MB SRAM) → L3 (HyperFlash, slow)"
312+
]
313+
},
314+
{
315+
"cell_type": "markdown",
316+
"metadata": {},
317+
"source": [
318+
"## 6. Common Pitfalls and Solutions\n",
319+
"\n",
320+
"### Pitfall 1: `torch.flatten` generates dynamic Shape ops\n",
321+
"**Symptom**: Training graph has `Shape` + `Reshape` nodes from Flatten backward.\n",
322+
"**Fix**: Use `x.reshape(1, C)` with static dimensions.\n",
323+
"\n",
324+
"### Pitfall 2: ConvGradX Im2Col buffer exceeds L1\n",
325+
"**Symptom**: Tiled training hangs — GVSoC runs but no output.\n",
326+
"**Cause**: The Im2Col ConvGradX kernel computes `ctxtBufferSize` from the full, untiled operator dimensions (e.g., 1.2 MB), while the actual L1 allocation is only ~120 KB. The kernel's `co_block` auto-tuning therefore overestimates the available buffer and overflows L1.\n",
327+
"**Fix**: Use the naive ConvGradX kernel (`referenceConvGradX2DTemplate`), which does not require an im2col buffer. The change goes in `Bindings.py`.\n",
328+
"\n",
329+
"### Pitfall 3: ConvLayer.computeShapes corrupts bias shape\n",
330+
"**Symptom**: `TypeError: 'int' object is not iterable` during graph export.\n",
331+
"**Cause**: `inputShapes[2] = inputShapes[1][0]` sets the bias shape to a scalar int instead of a tuple.\n",
332+
"**Fix**: `inputShapes[2] = (inputShapes[1][0],)` in `Layers.py`.\n",
333+
"\n",
334+
"### Pitfall 4: Multiple GVSoC simulations sharing workdir\n",
335+
"**Symptom**: `exitcode: -9` (SIGKILL) — simulations kill each other.\n",
336+
"**Fix**: Use `PYTEST_XDIST_WORKER=<unique_id>` to isolate build directories.\n",
337+
"\n",
338+
"### Pitfall 5: GVSoC stdout is fully buffered\n",
339+
"**Symptom**: Simulation runs but no printf output visible.\n",
340+
"**Fix**: Use `--trace=cluster/pe0/insn` to force output, or use `ring_tee.py` for bounded trace capture with heartbeat monitoring."
341+
]
342+
},
343+
{
344+
"cell_type": "markdown",
345+
"metadata": {},
346+
"source": "## 7. Debugging with GVSoC Traces\n\nWhen a simulation hangs or produces wrong results, use GVSoC's built-in tracing:\n\n### Trace FC (fabric controller) instructions\n```bash\ngvsoc --target=siracusa --binary=<bin> --work-dir=<dir> \\\n --trace=fc/insn image flash run 2>trace_fc.txt\n```\nShows every instruction the FC executes. Useful for finding where FC is stuck (e.g., `pi_task_wait_on` = waiting for cluster, `memcpy` = initializing data).\n\n### Trace cluster PE instructions\n```bash\ngvsoc --target=siracusa --binary=<bin> --work-dir=<dir> \\\n --trace=cluster/pe0/insn image flash run 2>trace_pe0.txt\n```\nShows PE0's instructions. Look for the function name in the trace to identify which kernel is running:\n```\n125461135406: 9037685: [/chip/cluster/pe0/insn] PULP_Conv2d_Im2Col_fp32_fp32_f:0 M 1c031d58 flw ...\n```\n\n### Trace memory accesses (LSU)\n```bash\n--trace=cluster/pe0/lsu\n```\nCatches invalid memory accesses:\n```\nInvalid access (pc: 0x1c01c94c, offset: 0x3c9cf7a9, size: 0x3, is_write: 0)\n```\nThis means a kernel tried to read address `0x3c9cf7a9` which is outside L1/L2 — indicates a buffer overflow or wrong DMA offset.\n\n### Useful trace targets\n\n| Trace flag | What it shows |\n|-----------|--------------|\n| `fc/insn` | FC instruction stream |\n| `cluster/pe0/insn` | Cluster PE0 instructions |\n| `cluster/pe0/lsu` | PE0 memory load/store events |\n| `cluster/dma` | DMA transfer events |\n\n### Tips\n- Redirect trace to a file (`2>trace.txt`) — trace output goes to stderr\n- Use `timeout 30 gvsoc ...` to limit trace duration\n- Look at the **last few lines** of the trace to find where it's stuck\n- Use `llvm-objdump -d <binary>` to map PC addresses to function names"
347+
},
348+
{
349+
"cell_type": "markdown",
350+
"metadata": {},
351+
"source": "## 8. Exercises\n\n1. **Export and deploy SpeechNet inference** on Siracusa. Compare the ONNX node count with the training graph.\n\n2. **Try `last_layer` training strategy** — only fine-tune the FC layer. Compare cycle count with full training.\n\n3. **Increase training steps** — export with `--n-batches 16` (or `--n-steps 8 --n-accum 2`). Run on GVSoC and observe how loss evolves over more steps. Does it converge?\n\n4. **Debug a hang**: Intentionally use `torch.flatten(x, 1)` in the model, export training ONNX, and observe what extra ops appear. Then fix it."
352+
},
353+
{
354+
"cell_type": "markdown",
355+
"metadata": {},
356+
"source": [
357+
"## 9. Reference\n",
358+
"\n",
359+
"- [SilentWear paper](https://arxiv.org/abs/2603.02847)\n",
360+
"- [Onnx4Deeploy repo](https://github.com/runwangdl/Onnx4Deeploy) — PR #2: SpeechNet exporter\n",
361+
"- [TrainDeeploy repo](https://github.com/runwangdl/TrainDeeploy) — PR #31: SpeechNet training test\n",
362+
"- [Deeploy TileConstraint docs](../AI_AGENT/Deeploy_Basics/Deeploy_TileConstraint.md)\n",
363+
"- [Deeploy Kernel docs](../AI_AGENT/Deeploy_Basics/Deeploy_Kernel.md)"
364+
]
365+
}
366+
],
367+
"metadata": {
368+
"kernelspec": {
369+
"display_name": "Python 3",
370+
"language": "python",
371+
"name": "python3"
372+
},
373+
"language_info": {
374+
"name": "python",
375+
"version": "3.10.0"
376+
}
377+
},
378+
"nbformat": 4,
379+
"nbformat_minor": 4
380+
}
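
The output shapes in the notebook's Section 1 table and its "~15K parameters" figure can be cross-checked without PyTorch. The sketch below is a minimal example under stated assumptions: the default `blocks_config` and the `(1, 1, 14, 700)` input from the notebook, stride-1 convolutions padded only along the time axis by `k_t // 2`, and pooling whose stride equals its window. The names `BLOCKS`, `conv_out`, `pool_out`, and `trace_speechnet` are hypothetical helpers for illustration and do not exist in Deeploy or Onnx4Deeploy.

```python
# Hypothetical helpers (not part of Deeploy or Onnx4Deeploy): re-derive the
# per-block output shapes and the parameter count with integer arithmetic.

# (out_channels, (k_c, k_t), (pool_c, pool_t)), copied from the default blocks_config.
BLOCKS = [
    (8, (1, 4), (1, 8)),
    (16, (1, 16), (1, 4)),
    (16, (1, 8), (1, 4)),
    (32, (7, 1), (1, 1)),
    (32, (7, 1), (1, 1)),
]


def conv_out(size, kernel, padding):
    """Output length of a stride-1 convolution along one axis."""
    return size + 2 * padding - kernel + 1


def pool_out(size, window):
    """Output length of a pooling layer whose stride equals its window."""
    return (size - window) // window + 1


def trace_speechnet(channels=14, time_steps=700, num_classes=9):
    """Return ([(C, H, W) per block], total trainable parameter count)."""
    shapes, params = [], 0
    in_ch, h, w = 1, channels, time_steps
    for out_ch, (k_c, k_t), (p_c, p_t) in BLOCKS:
        # Assumption taken from the notebook: padding=(0, k_t // 2), time axis only.
        h = pool_out(conv_out(h, k_c, 0), p_c)
        w = pool_out(conv_out(w, k_t, k_t // 2), p_t)
        shapes.append((out_ch, h, w))
        params += in_ch * out_ch * k_c * k_t + out_ch  # Conv weight + bias
        params += 2 * out_ch                           # BatchNorm weight + bias
        in_ch = out_ch
    params += in_ch * num_classes + num_classes        # final Linear layer
    return shapes, params


shapes, total_params = trace_speechnet()
for i, shape in enumerate(shapes):
    print(f"block {i}: {shape}")
print(f"parameters: {total_params:,}")
```

Under these assumptions the five shapes agree with the table rows, and the total comes to 15,489, consistent with the "~15K" stated in the notebook. BatchNorm running statistics are buffers rather than parameters, so they are left out of the count.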
