power collection #37

Open
runwangdl wants to merge 6 commits into devel from power-collection


@runwangdl

Owner

Stores the flashable board build artifacts + GPIO power-measurement trigger for three GAP9 training experiments, for migration to the power-measurement platform (Nordic PPK2).

What

  • DeeployTest/PowerCollection/<Model>/build_master/ — full build dir incl. flashable ELF + board_workdir flash image
  • DeeployTest/PowerCollection/<Model>/hex/ — L3 weight payload (readfs) required to flash
  • DeeployTest/PowerCollection/flash_power.sh — gapy flash/run helper (uses test platform GAP_SDK_HOME; --power drives gapy PPK2 capture synced to the pin-89 GPIO window)
  • deeploytraintest.c — ports the inference power GPIO pulse (pin 89) into the training harness, held high across the full training loop (POWER_MEASUREMENT was inference-only before)

Configs (CI test_gap9_tiled_training_l3_singlebuffer, -s board, -DPOWER_MEASUREMENT=ON)

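| Model | l1 | cc_stack | CHW (conv_channels_first) | Board result (errors, train cycles) |
|---|---|---|---|---|
| SleepConViT | 122000 | 4096 | no | 0/4, ~131.6M cyc |
| MCUNet | 116000 | 8192 | yes | 0/4, ~717.6M cyc |
| TSDR | 122000 | 4096 | yes | 0/4, ~904.9M cyc |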

Note: also carries the feat/zo-sleepconvit-training training commits (branch base).

runwangdl and others added 6 commits June 18, 2026 22:15
…onViT BP

Lets onnx4deeploy-generated FP32 training graphs that use Concat + Slice
(e.g. SleepConViT: ConvStem branch-concat + cls-token slice) run backprop on
Siracusa and tiled GAP9. Validated: SleepConViT BP 0/4 errors on both
(GAP9 tiled l1=122000/cc_stack=4096 -> train 134.6M cyc; Siracusa untiled ->
losses match PyTorch autograd to ~1e-5).

- Generic/Parsers.py: Slice default `steps` int64 (np.ones() was float64 ->
  poisoned int-only tiling math).
- PULPOpen/Bindings.py, GAP9/Bindings.py: float Concat bindings (template is
  byte-wise memcpy, dtype-agnostic; only integer bindings existed).
- PULPOpen/TileConstraints/SliceConstraint.py: int() the slice step (OR-tools
  IntExpr*float unsupported).
- TilingExtension/MemoryConstraintFlows.py: kill-set skips folded constant
  inputs (Slice starts/ends/axes/steps) instead of asserting.
- TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py: coerce numpy
  ints in hoisted value tables.
- GAP9/Tiler.py + GAP9/Platform.py: GAP9 Slice tiling-ready binding via the GAP9
  transformer/mchan (the PULP one emits unlinkable PULP mchan calls on GAP9).
- Tests/Models/Training/SleepConViT: onnx4deeploy training graph + optimizer +
  reference (4 SGD steps).
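
The dtype problem behind the `steps` and `int()` fixes above can be reproduced with NumPy alone. This is a minimal sketch, not Deeploy's parser code: the variable names and the tile extents are made up for illustration.

```python
import numpy as np

num_axes = 3  # illustrative: a Slice over three axes

# np.ones() defaults to float64, so default `steps` built this way are floats.
float_steps = np.ones(num_axes)
# Requesting int64 explicitly keeps the default steps integral.
int_steps = np.ones(num_axes, dtype=np.int64)

# Integer tile extents multiplied by float steps are promoted to float.
tile_extent = np.array([4, 8, 16], dtype=np.int64)
mixed = tile_extent * float_steps
clean = tile_extent * int_steps

# int() turns a NumPy scalar into a plain Python int, which is the kind of
# coercion applied to the slice step and to hoisted value tables.
step = int(int_steps[0])

print(float_steps.dtype, int_steps.dtype)  # float64 int64
print(mixed.dtype, clean.dtype)            # float64 int64
print(type(step).__name__)                 # int
```

Once a float enters the arithmetic, every derived extent is a float as well, which is why the fix is applied at the point where the default is created.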
… cc_stack=4096, tol=5e-3)

The existing gap9-training-tiled-l3-singlebuffer CI job is parametrized over
L3_SINGLEBUFFER_TRAINING_MODELS, so registering SleepConViT here makes CI run
its backprop test. Validated locally: 0/4 errors (train ~134.6M cyc).
…adW Cout-full

Enable end-to-end backprop training for the TSDR spectrogram transformer on
GAP9 (tiled, L3 single-buffer). Runs Errors 0/4 vs ORT reference @ tol 5e-3
(train_cycles ~940M, opt ~12.3M).

ConvGradConstraint (CoutHWSliceStrategy): make coutHWSlice_force_cout_full
conditional on dW size. Forcing Cout-full pins the whole dW in L1, which is
infeasible for large-Cout regular convs (TSDR patch-embed dW ~123KB > L1). Now:
  - dW <= 64KB  -> keep forced Cout-full (validated Siracusa drift-fix path,
                   numerics unchanged; re-verified ResNet8 0/4 max-diff 1.1e-5,
                   MobileNetV1 0/4 max-diff 3.1e-5).
  - dW  > 64KB  -> no Cout/HW restriction; let the tiler tile Cout freely so the
                   conv is feasible (drift caught by the numerical reference).
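
The size-dependent choice above can be sketched as a small predicate. This is a hedged illustration, not the ConvGradConstraint code: `dw_bytes`, `force_cout_full`, and both example weight shapes are hypothetical, FP32 elements are assumed, and "64KB" is assumed to mean 64 * 1024 bytes because the commit message does not say.

```python
from math import prod

FP32_BYTES = 4
# Assumption: "64KB" is read as 64 * 1024 bytes; the source does not specify.
DW_LIMIT_BYTES = 64 * 1024


def dw_bytes(weight_shape, elem_bytes=FP32_BYTES):
    """Size in bytes of a weight gradient with the given shape."""
    return prod(weight_shape) * elem_bytes


def force_cout_full(weight_shape, limit=DW_LIMIT_BYTES):
    """True if dW is small enough to keep whole (Cout-full) in L1."""
    return dw_bytes(weight_shape) <= limit


small = (16, 16, 3, 3)   # hypothetical conv weights: 9216 bytes
large = (64, 64, 3, 3)   # hypothetical conv weights: 147456 bytes

print(dw_bytes(small), force_cout_full(small))  # 9216 True
print(dw_bytes(large), force_cout_full(large))  # 147456 False
```

When the predicate is false, the tiler is left free to split Cout, and correctness rests on the numerical reference check rather than on the pinned layout.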

Register TSDR in the GAP9 tiled L3 training CI (l1=122000, cc_stack=4096,
tol=5e-3, num_data_inputs=1, conv_channels_first).
Enable end-to-end backprop training for MCUNet (MnasNet-style) on GAP9, tiled
into L1=116KB (< the 128KB usable L1). Runs Errors 0/4 vs ORT reference @ tol
5e-3 (train_cycles ~714M, opt ~16M).

Root cause it fixes: PWConvGradWTileConstraint pinned the pointwise ConvGradW's
X and dY to FULL spatial (the mixed Cout+HW memset path is unimplemented, so it
fell back to Cout-only tiling). For MCUNet's 48x48 pointwise layers that means
e.g. conv2d_2 X=[16,48,48]=147KB alone exceeds L1 -> proven-infeasible tiling.
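
The 147KB figure follows directly from the tensor shape, assuming 4-byte FP32 elements. The check below is a sketch: the `[16, 12, 48]` tile is a hypothetical split chosen for illustration, not a tile the Deeploy tiler is known to pick.

```python
from math import prod

FP32_BYTES = 4
L1_BUDGET = 116_000  # l1 value from the MCUNet CI config


def tensor_bytes(shape, elem_bytes=FP32_BYTES):
    """Size in bytes of a dense tensor with the given shape."""
    return prod(shape) * elem_bytes


full_x = tensor_bytes([16, 48, 48])  # full-spatial X of conv2d_2
tile_x = tensor_bytes([16, 12, 48])  # hypothetical tile: H split in four

print(full_x, full_x > L1_BUDGET)   # 147456 True
print(tile_x, tile_x <= L1_BUDGET)  # 36864 True
```

With the spatial dimensions pinned, X alone exceeds the budget before dY and dW are counted, so no tiling can be feasible.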

Fix: for small-dW PW convs, set coutHWSlice_force_cout_full and make the
full-spatial pin CONDITIONAL on dW>64KB. With Cout pinned full there is no Cout
tiling, so allowing H/W to tile is the SAFE "HW-only" memset-once case (mirrors
the #34 regular-ConvGradW fix) — X then tiles spatially and fits. Large-dW PW
(e.g. MobileNetV1 block_11, dW~128KB) keeps the original full-spatial path.
Tiling L1 floor for MCUNet drops ~160KB -> ~80KB. Regression: MobileNetV1 GAP9
re-verified 0/4 (274M cyc, unchanged).

MCUNet uses bias-free convs (drop conv biases) to avoid the 1-D tiled-bias
codegen path, same approach as TSDR. Registered in the GAP9 tiled L3 training CI
(l1=116000, cc_stack=8192, conv_channels_first).
…Net/TSDR training

Adds a GPIO power-measurement trigger to the GAP9 training harness and stores the
flashable build artifacts for three training experiments, built with `-s board` +
`-DPOWER_MEASUREMENT=ON` at their CI configs (test_gap9_tiled_training_l3_singlebuffer).

Per experiment (DeeployTest/PowerCollection/<Model>/):
  build_master/  full build dir incl. the flashable ELF + board_workdir flash image
  hex/           L3 weight payload written to external flash (readfs) — required to flash

flash_power.sh: gapy flash/run helper for the power-measurement platform. Flashes a
migrated build_master via the test platform's own GAP_SDK_HOME (SSBL/openocd come from
there, not the build machine — keep SDK versions aligned). `--power` runs gapy's PPK2
capture, synchronized to the pin-89 GPIO window.

Harness (deeploytraintest.c): port the inference power-measurement GPIO pulse (pin 89)
into the training harness — held high across the full training loop (fwd/bwd + optimizer)
for external power capture (PPK2). Previously POWER_MEASUREMENT was inference-only.
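
The ordering of the trigger relative to the training loop can be modeled in a few lines. The real harness is C (deeploytraintest.c); this Python sketch only illustrates the window, and `RecordingGpio`, `power_window`, and `run_training` are hypothetical names that do not exist in the repository.

```python
from contextlib import contextmanager

POWER_PIN = 89  # pin number from the PR description


class RecordingGpio:
    """Stand-in for a GPIO driver that only records writes."""

    def __init__(self):
        self.events = []

    def write(self, pin, level):
        self.events.append((pin, level))


@contextmanager
def power_window(gpio, pin=POWER_PIN):
    """Drive the pin high for the duration of the with-block."""
    gpio.write(pin, 1)
    try:
        yield
    finally:
        gpio.write(pin, 0)


def run_training(gpio, num_steps, log):
    # One window spans every step, so fwd/bwd and the optimizer are captured.
    with power_window(gpio):
        for step in range(num_steps):
            log.append(f"fwd/bwd {step}")
            log.append(f"optimizer {step}")


gpio = RecordingGpio()
log = []
run_training(gpio, num_steps=4, log=log)
print(gpio.events)  # [(89, 1), (89, 0)]
print(len(log))     # 8
```

Raising the pin once outside the loop, rather than per step, gives the external capture a single contiguous window to integrate over.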

All validated 0/4 errors on board:
- SleepConViT  l1=122000 cc_stack=4096            train ~131.6M cyc
- MCUNet       l1=116000 cc_stack=8192 CHW        train ~717.6M cyc
- TSDR         l1=122000 cc_stack=4096 CHW        train ~904.9M cyc

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>