power collection#37
Open
runwangdl wants to merge 6 commits into
…onViT BP
Lets onnx4deeploy-generated FP32 training graphs that use Concat + Slice (e.g. SleepConViT: ConvStem branch-concat + cls-token slice) run backprop on Siracusa and tiled GAP9.
Validated: SleepConViT BP 0/4 errors on both (GAP9 tiled l1=122000/cc_stack=4096 -> train 134.6M cyc; Siracusa untiled -> losses match PyTorch autograd to ~1e-5).
- Generic/Parsers.py: Slice default `steps` int64 (np.ones() was float64 -> poisoned int-only tiling math).
- PULPOpen/Bindings.py, GAP9/Bindings.py: float Concat bindings (template is byte-wise memcpy, dtype-agnostic; only integer bindings existed).
- PULPOpen/TileConstraints/SliceConstraint.py: int() the slice step (OR-tools IntExpr*float unsupported).
- TilingExtension/MemoryConstraintFlows.py: kill-set skips folded constant inputs (Slice starts/ends/axes/steps) instead of asserting.
- TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py: coerce numpy ints in hoisted value tables.
- GAP9/Tiler.py + GAP9/Platform.py: GAP9 Slice tiling-ready binding via the GAP9 transformer/mchan (the PULP one emits unlinkable PULP mchan calls on GAP9).
- Tests/Models/Training/SleepConViT: onnx4deeploy training graph + optimizer + reference (4 SGD steps).
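The dtype issue behind the Generic/Parsers.py and SliceConstraint.py items can be reproduced in isolation. The snippet below is a minimal sketch, not the Deeploy parser code: `default_slice_steps` is a hypothetical helper name, and it only demonstrates that `np.ones()` defaults to float64 unless an integer dtype is requested, plus the `int()` coercion used before integer-only math.

```python
import numpy as np


def default_slice_steps(num_axes: int) -> np.ndarray:
    """Hypothetical helper: one unit step per sliced axis, integer-typed."""
    return np.ones(num_axes, dtype=np.int64)


# np.ones() without a dtype argument yields float64 values.
float_steps = np.ones(3)
# Requesting int64 explicitly keeps the default steps integer-typed.
int_steps = default_slice_steps(3)

print(float_steps.dtype)  # float64
print(int_steps.dtype)    # int64

# Coerce a NumPy scalar to a plain Python int before handing it to
# integer-only arithmetic (e.g. a constraint solver expression).
step = int(int_steps[0])
print(type(step).__name__, step)  # int 1
```

Keeping the default integer-typed at the source means downstream consumers do not each need their own cast, although the commit also adds defensive `int()` coercions where the values are used.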
… cc_stack=4096, tol=5e-3)
The existing gap9-training-tiled-l3-singlebuffer CI job is parametrized over L3_SINGLEBUFFER_TRAINING_MODELS, so registering SleepConViT here makes CI run its backprop test.
Validated locally: 0/4 errors (train ~134.6M cyc).
…adW Cout-full
Enable end-to-end backprop training for the TSDR spectrogram transformer on
GAP9 (tiled, L3 single-buffer). Runs Errors 0/4 vs ORT reference @ tol 5e-3
(train_cycles ~940M, opt ~12.3M).
ConvGradConstraint (CoutHWSliceStrategy): make coutHWSlice_force_cout_full
conditional on dW size. Forcing Cout-full pins the whole dW in L1, which is
infeasible for large-Cout regular convs (TSDR patch-embed dW ~123KB > L1). Now:
- dW <= 64KB -> keep forced Cout-full (validated Siracusa drift-fix path,
  numerics unchanged; re-verified ResNet8 0/4 max-diff 1.1e-5,
  MobileNetV1 0/4 max-diff 3.1e-5).
- dW > 64KB -> no Cout/HW restriction; let the tiler tile Cout freely so the
  conv is feasible (drift caught by the numerical reference).
Register TSDR in the GAP9 tiled L3 training CI (l1=122000, cc_stack=4096,
tol=5e-3, num_data_inputs=1, conv_channels_first).
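The size gate on dW can be sketched roughly as follows. This is an illustration only, not the ConvGradConstraint code: `dw_bytes`, `force_cout_full`, the constant name, and both example shapes are hypothetical, and reading "64KB" as 64 * 1024 bytes with 4-byte FP32 elements is an assumption.

```python
from math import prod

FP32_BYTES = 4
# Assumption: the "64KB" threshold is interpreted as 64 KiB.
DW_FORCE_COUT_FULL_LIMIT = 64 * 1024


def dw_bytes(weight_shape, elem_bytes=FP32_BYTES):
    """Size in bytes of a weight-gradient tensor with the given shape."""
    return prod(weight_shape) * elem_bytes


def force_cout_full(weight_shape):
    """True when dW is small enough to stay pinned whole in L1."""
    return dw_bytes(weight_shape) <= DW_FORCE_COUT_FULL_LIMIT


# Hypothetical shapes in (Cout, Cin, kH, kW) order, chosen only to
# land on either side of the threshold.
small_dw = (32, 16, 3, 3)
large_dw = (64, 64, 3, 3)

print(dw_bytes(small_dw), force_cout_full(small_dw))  # 18432 True
print(dw_bytes(large_dw), force_cout_full(large_dw))  # 147456 False
```

Under this reading, a small dW keeps the forced Cout-full path, while a large one leaves Cout free for the tiler.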
Enable end-to-end backprop training for MCUNet (MnasNet-style) on GAP9, tiled into L1=116KB (< the 128KB usable L1). Runs Errors 0/4 vs ORT reference @ tol 5e-3 (train_cycles ~714M, opt ~16M).
Root cause it fixes: PWConvGradWTileConstraint pinned the pointwise ConvGradW's X and dY to FULL spatial (the mixed Cout+HW memset path is unimplemented, so it fell back to Cout-only tiling). For MCUNet's 48x48 pointwise layers that means e.g. conv2d_2 X=[16,48,48]=147KB alone exceeds L1 -> proven-infeasible tiling.
Fix: for small-dW PW convs, set coutHWSlice_force_cout_full and make the full-spatial pin CONDITIONAL on dW>64KB. With Cout pinned full there is no Cout tiling, so allowing H/W to tile is the SAFE "HW-only" memset-once case (mirrors the #34 regular-ConvGradW fix); X then tiles spatially and fits. Large-dW PW (e.g. MobileNetV1 block_11, dW~128KB) keeps the original full-spatial path. Tiling L1 floor for MCUNet drops ~160KB -> ~80KB.
Regression: MobileNetV1 GAP9 re-verified 0/4 (274M cyc, unchanged).
MCUNet uses bias-free convs (drop conv biases) to avoid the 1-D tiled-bias codegen path, same approach as TSDR. Registered in the GAP9 tiled L3 training CI (l1=116000, cc_stack=8192, conv_channels_first).
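The infeasibility of the full-spatial pin follows from simple arithmetic on the activation size. The sketch below assumes 4-byte FP32 elements and uses a hypothetical tile height of 12 rows; `tensor_bytes` is an illustrative helper, and the real tiler solves for X, dY and dW jointly rather than checking one buffer at a time.

```python
FP32_BYTES = 4
L1_BUDGET = 116_000  # l1=116000 from the MCUNet CI config


def tensor_bytes(channels, height, width, elem_bytes=FP32_BYTES):
    """Bytes needed for a [C, H, W] tensor."""
    return channels * height * width * elem_bytes


# conv2d_2 input X = [16, 48, 48], pinned to full spatial extent.
full_x = tensor_bytes(16, 48, 48)
# Same tensor tiled along H only (tile height 12 is a hypothetical choice).
tiled_x = tensor_bytes(16, 12, 48)

print(full_x, full_x > L1_BUDGET)     # 147456 True
print(tiled_x, tiled_x <= L1_BUDGET)  # 36864 True
```

With the full-spatial pin, X alone overshoots the L1 budget before dY or dW are counted; once H is allowed to tile, the per-tile footprint leaves room for the other buffers.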
…Net/TSDR training
Adds a GPIO power-measurement trigger to the GAP9 training harness and stores the flashable build artifacts for three training experiments, built with `-s board` + `-DPOWER_MEASUREMENT=ON` at their CI configs (test_gap9_tiled_training_l3_singlebuffer).
Per experiment (DeeployTest/PowerCollection/<Model>/):
- build_master/: full build dir incl. the flashable ELF + board_workdir flash image
- hex/: L3 weight payload written to external flash (readfs); required to flash
flash_power.sh: gapy flash/run helper for the power-measurement platform. Flashes a migrated build_master via the test platform's own GAP_SDK_HOME (SSBL/openocd come from there, not the build machine; keep SDK versions aligned). `--power` runs gapy's PPK2 capture, synchronized to the pin-89 GPIO window.
Harness (deeploytraintest.c): port the inference power-measurement GPIO pulse (pin 89) into the training harness, held high across the full training loop (fwd/bwd + optimizer) for external power capture (PPK2). Previously POWER_MEASUREMENT was inference-only.
All validated 0/4 errors on board:
- SleepConViT l1=122000 cc_stack=4096 train ~131.6M cyc
- MCUNet l1=116000 cc_stack=8192 CHW train ~717.6M cyc
- TSDR l1=122000 cc_stack=4096 CHW train ~904.9M cyc
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stores the flashable board build artifacts + GPIO power-measurement trigger for three GAP9 training experiments, for migration to the power-measurement platform (Nordic PPK2).
What
- `DeeployTest/PowerCollection/<Model>/build_master/`: full build dir incl. flashable ELF + `board_workdir` flash image
- `DeeployTest/PowerCollection/<Model>/hex/`: L3 weight payload (readfs) required to flash
- `DeeployTest/PowerCollection/flash_power.sh`: gapy flash/run helper (uses test platform GAP_SDK_HOME; `--power` drives gapy PPK2 capture synced to the pin-89 GPIO window)
- `deeploytraintest.c`: ports the inference power GPIO pulse (pin 89) into the training harness, held high across the full training loop (POWER_MEASUREMENT was inference-only before)
Configs (CI test_gap9_tiled_training_l3_singlebuffer, -s board, -DPOWER_MEASUREMENT=ON)
Note: also carries the feat/zo-sleepconvit-training training commits (branch base).