Commit 022f045
perf(ConvGrad): GradWStrategy framework + ConvGradX/W speedups (#28)
* perf(ConvGradX): im2col+GEMM for stride=1 + loop reorder for stride>1
Replace the naive 7-deep direct loop in PULP_ConvGradX2d_fp32_fp32_fp32_CHW_Im2Col_tiled
with two optimized paths:
stride=1: custom im2col + W transpose + GEMM with Cout blocking (reference sketch below)
- Build dY_col matrix by gathering dY patches (handles padding via boundary checks)
- Transpose W[Cout,Cin,P,Q] -> W_flat[Cin, Cout*P*Q] for row-major GEMM
- GEMM with M-dimension unroll-by-4 for cache-line-friendly B access
- Internal Cout blocking when im2col buffer < full Cout*P*Q*Hin*Win
- 8-core parallel over Cin (output rows), with barriers between steps
stride>1: loop reorder (ci outermost, co innermost)
- Moves co to the innermost loop so W[co,...] and dY[co,...] are accessed
with unit stride, improving cache behavior vs the original co-outermost order
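For reference, a minimal NumPy sketch of the stride=1 math (im2col on dY, W transpose, one GEMM). The function name, signature, and naive loop structure are illustrative only; the PULP C kernel's Cout blocking, unroll-by-4, and 8-core parallelism are not shown.

```python
import numpy as np

def convgradx_s1_reference(dY, W, Hin, Win, pad):
    """Illustrative stride=1 ConvGradX: im2col on dY + GEMM vs. transposed W."""
    Cout, Cin, P, Q = W.shape
    _, Hout, Wout = dY.shape

    # Gather dY patches: row index (co, p, q), column index (hi, wi);
    # out-of-range taps contribute zero (the kernel's boundary checks).
    dY_col = np.zeros((Cout * P * Q, Hin * Win), dtype=dY.dtype)
    for co in range(Cout):
        for p in range(P):
            for q in range(Q):
                row = (co * P + p) * Q + q
                for hi in range(Hin):
                    ho = hi + pad - p
                    if not (0 <= ho < Hout):
                        continue
                    for wi in range(Win):
                        wo = wi + pad - q
                        if 0 <= wo < Wout:
                            dY_col[row, hi * Win + wi] = dY[co, ho, wo]

    # W[Cout,Cin,P,Q] -> W_flat[Cin, Cout*P*Q], then a single row-major GEMM.
    W_flat = W.transpose(1, 0, 2, 3).reshape(Cin, Cout * P * Q)
    return (W_flat @ dY_col).reshape(Cin, Hin, Win)
```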
ResNet8 training results (4 steps, Siracusa 8-core, L3 mode):
Original: 1,301,512,576 cycles
Optimized: 844,186,047 cycles (1.54x speedup, -35%)
ConvGradX step-0 breakdown:
Original: 185.6 Mc (56.8% of step)
Optimized: 73.3 Mc (34.2% of step) → 2.53x on ConvGradX alone
Numerical accuracy verified: all 4 training steps within tolerance.
* perf(ConvGradW): tile Cin instead of Cout to reduce tile count
Change ConvGradW tiling policy from Cout+spatial tiling to Cin tiling:
- Old: Cout split (47+17) × Hout×Wout (8×8) = 128 tiles, GEMM K=1
- New: Cin split (22+22+20) = 3 tiles, GEMM K=Hout*Wout=64
For ResNet8 layer3_conv2 ConvGradW:
- 128 tiles → 3 tiles (71% DMA overhead eliminated)
- GEMM: dY[Cout, Hout*Wout] × X_col[Hout*Wout, Cin_tile*P*Q] → dW[Cout, Cin_tile*P*Q],
with K = Hout*Wout = 64 (a proper reduction) instead of K = 1 (an outer product); see the sketch below
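A hedged NumPy sketch of that per-Cin-tile GEMM for reference; the function name and the naive X_col construction are illustrative, not the tiled C kernel.

```python
import numpy as np

def convgradw_cin_tile_reference(dY, X_tile, P, Q, stride, pad):
    """Illustrative Cin-tiled ConvGradW: one GEMM with K = Hout*Wout."""
    Cout, Hout, Wout = dY.shape
    Cin_t, Hin, Win = X_tile.shape

    # X_col[Hout*Wout, Cin_t*P*Q]: one row per output pixel, one column per
    # (ci, p, q) weight tap; taps that fall into padding stay zero.
    X_col = np.zeros((Hout * Wout, Cin_t * P * Q), dtype=X_tile.dtype)
    for ho in range(Hout):
        for wo in range(Wout):
            for ci in range(Cin_t):
                for p in range(P):
                    for q in range(Q):
                        hi = ho * stride + p - pad
                        wi = wo * stride + q - pad
                        if 0 <= hi < Hin and 0 <= wi < Win:
                            X_col[ho * Wout + wo, (ci * P + p) * Q + q] = X_tile[ci, hi, wi]

    # Single GEMM over K = Hout*Wout yields the whole dW slice of this Cin tile.
    dW_tile = dY.reshape(Cout, Hout * Wout) @ X_col      # [Cout, Cin_t*P*Q]
    return dW_tile.reshape(Cout, Cin_t, P, Q)
```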
ResNet8 training (4 steps, Siracusa 8-core, L3 mode):
Before (ConvGradX opt only): 844,186,047 cycles
After (+ ConvGradW Cin tile): 546,778,941 cycles (1.54x additional)
Total vs original baseline: 2.38x speedup
* refactor(ConvGradW): introduce GradWStrategy framework + HWSlice strategy
Extract ConvGradW tiling policy/serialize into pluggable Strategy classes
so each (layer-shape regime, hardware budget) pair can pick its own tiling.
Strategy interface (sketched in code below):
- applies(): when this strategy is feasible/preferred for a given layer
- add_constraints(): policy constraints emitted to the OR-tools tiler
- matches_solution(): recognize the strategy's signature in a tiler result
- serialize(): emit the per-tile codegen schedule
Strategies in this commit:
- CinSliceStrategy: lifts Step-2 behavior verbatim (Cout/dY full, Cin tile).
Applies when dy_bytes <= 32KB; best for small-spatial / big-channel
layers (ResNet8 deep convs).
- HWSliceStrategy: new. dY tiled over H/W, dW full as L1 accumulation
target, X derived as halo. Required for layers whose dY exceeds L1
(e.g. MobileNetV1 stem: dY = 16x96x96 = 576KB > 128KB L1).
ConvGradWTileConstraintBase.addPolicyConstraint /
serializeTilingSolution now dispatch through `cls.strategies` (an ordered
priority list). Subclasses configure their strategy set:
- ConvGradW2DTileConstraint: [CinSlice, HWSlice] (new: HW fallback)
- PWConvGradWTileConstraint: [CinSlice] (inherits default)
- DWConvGradW2DTileConstraint: [CinSlice] (inherits default)
PW / DW keep their existing addPolicyConstraint / addGeometricalConstraint
overrides (refactor of those is Commit B).
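A condensed Python sketch of the dispatch described here. Only the class and method names quoted above are real; the placeholder bodies, the `dy_bytes` field, and the exact override shapes are illustrative assumptions (the 32 KB threshold is the one stated above).

```python
class GradWStrategy:
    """Sketch of the strategy interface (placeholder bodies)."""

    def applies(self, layer):
        """Is this strategy feasible / preferred for the given layer?"""
        return False

    def add_constraints(self, layer, tiler):
        """Emit this policy's constraints to the OR-tools tiler."""

    def matches_solution(self, layer, solution):
        """Recognize this strategy's signature in a tiler result."""
        return True

    def serialize(self, layer, solution):
        """Emit the per-tile codegen schedule."""


class CinSliceStrategy(GradWStrategy):
    DY_BUDGET_BYTES = 32 * 1024          # 32 KB threshold from this commit

    def applies(self, layer):
        # Hypothetical layer.dy_bytes field: size of dY in bytes.
        return layer.dy_bytes <= self.DY_BUDGET_BYTES


class HWSliceStrategy(GradWStrategy):
    def applies(self, layer):
        return True                       # fallback when dY exceeds L1


class ConvGradWTileConstraintBase:
    strategies = [CinSliceStrategy()]     # default at this commit; PW / DW inherit it

    @classmethod
    def addPolicyConstraint(cls, layer, tiler):
        # First applicable strategy wins; later entries act as fallbacks.
        for strategy in cls.strategies:
            if strategy.applies(layer):
                return strategy.add_constraints(layer, tiler)

    @classmethod
    def serializeTilingSolution(cls, layer, solution):
        for strategy in cls.strategies:
            if strategy.matches_solution(layer, solution):
                return strategy.serialize(layer, solution)


class ConvGradW2DTileConstraint(ConvGradWTileConstraintBase):
    strategies = [CinSliceStrategy(), HWSliceStrategy()]   # new HW fallback
```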
Verification: ResNet8 must reproduce 546M training cycles exactly
(CinSlice is a verbatim lift of base behavior at ee1288d).
* refactor(ConvGradW): rename HWSlice -> CoutHWSlice, free Cout/HW like devel default
The previous HWSliceStrategy pinned dy_dim_1 (Cout) to full, which forced
forward Conv's output Y to keep full Cout at every memory level. For
MobileNetV1 stem (dY = 16x96x96 = 576KB), this made the joint tiling
problem infeasible: the forward Conv couldn't fit Y in L1, and forward and
backward couldn't agree on a shared tiling for Y.
Replace with CoutHWSliceStrategy that matches devel's pre-Cin-slice
default policy:
- X Cin (dim 1) full
- dW Cin / kH / kW full, Cout (dim 0) free
- dY HW free, dY Cout free
The OR-tools tiler picks the optimal Cout / HW split per layer. dW slices
are along Cout (disjoint per Cout slab); each (ho, wo) tile inside a Cout
slab accumulates partial sums via mm_add. Serialize iterates Cout slabs
outer, HW tiles inner (port of devel's serializeTilingSolution).
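A hedged sketch of that accumulation order. All names and shapes are illustrative; `x_col_tiles` stands for the im2col'ed X halo of each spatial tile, and the dict/list containers are stand-ins for the real per-tile cubes.

```python
import numpy as np

def cout_hw_slice_schedule(dW, dy_tiles, x_col_tiles):
    # dW:          [Cout, Cin*P*Q]  full accumulation target kept in L1
    # dy_tiles:    dict cout_slab(slice) -> list of [slab_size, tile_pixels] arrays
    # x_col_tiles: list of [tile_pixels, Cin*P*Q] arrays, shared by all slabs
    for cout_slab, slab_dy in dy_tiles.items():        # disjoint dW Cout slabs, outer
        acc = np.zeros_like(dW[cout_slab])
        for dy_t, x_t in zip(slab_dy, x_col_tiles):    # (ho, wo) tiles, inner
            acc += dy_t @ x_t                          # mm_add: accumulate partial sum
        dW[cout_slab] = acc
    return dW
```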
Strategy selection for ConvGradW2DTileConstraint:
- dy_bytes <= 32KB -> CinSlice (big GEMM K = Hout*Wout, ResNet8 deep)
- dy_bytes > 32KB -> CoutHWSlice (MobileNetV1 stem etc., tiler picks)
PW / DW classes still inherit [CinSlice] default; their fallback handling
is Commit B.
* refactor(ConvGradW): default strategies include CoutHWSlice fallback
The PW and DW classes' addPolicyConstraint calls super(), which goes through
the dispatch in the refactored base. The default strategy list was just
[CinSlice]; for big-dY PW/DW layers (e.g. MobileNetV1 shallow PW),
CinSlice does not apply, but as the only entry the dispatcher fell back to
it anyway, yielding infeasible constraints (dY spatial must be full).
Change base default to [CinSlice, CoutHWSlice] so PW/DW + ConvGradW2D
all share the same priority list:
- dy_bytes <= 32KB -> CinSlice (big GEMM K)
- otherwise -> CoutHWSlice (free Cout / HW)
PW's extra "dy spatial full" constraint still applies on top (its
docstring rationale: forbid HW tiling for memset correctness). The
combined effect for big-dY PW: only Cout tiling allowed, matching
devel's behavior.
MobileNetV1 now passes the tiling solver and reaches simulation; an
out-of-bound L1 bank access in sim remains to debug.
* fix(ConvGradW): restrict CinSlice to regular Conv; PW/DW use CoutHWSlice only
CinSliceStrategy was being dispatched to PW (small-dy deep PW layers) and DW
classes via base default strategies, causing L1 bank out-of-bound accesses
during MobileNetV1 sim — 1000+ OOB warnings starting at training step 1.
Root cause (verified by isolation test):
- CinSlice.serialize iterates dW Cin slices from
absoluteOutputCubes[i].rectangle.dims[1], assuming the standard [Cout, Cin/group, P, Q] layout.
- For DW: dW has layout [C, 1, P, Q]; dim_1 == 1 makes Cin slicing
degenerate, and X tile derivation breaks DW's channel semantics.
- For PW: dW is [Cout, Cin, 1, 1]; CinSlice can split Cin but PW's
extra dy_spatial/x_spatial=full constraints combined with CinSlice's
constraint set produce a tile configuration whose dW pointer reads
past the L1 bank boundary in the per-tile mm kernel.
Fix: explicit ``strategies = [CoutHWSliceStrategy]`` on PW and DW
subclasses. Regular ConvGradW2DTileConstraint keeps [CinSlice,
CoutHWSlice] for the ResNet8-style speedup.
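In code, the fix is just the explicit override; a sketch building on the strategy-dispatch sketch earlier (with HWSliceStrategy renamed to CoutHWSliceStrategy as in the previous sub-commit). Class bodies are placeholders; only the names and the `strategies` override come from the commit message.

```python
class PWConvGradWTileConstraint(ConvGradWTileConstraintBase):
    strategies = [CoutHWSliceStrategy()]   # never dispatch CinSlice to PW

class DWConvGradW2DTileConstraint(ConvGradWTileConstraintBase):
    strategies = [CoutHWSliceStrategy()]   # dW is [C, 1, P, Q]; Cin slicing degenerates
```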
Verification:
- ResNet8: 511M cycles (CinSlice path active for deep conv layers)
- MobileNetV1: 950M cycles, 0 OOB, matches devel baseline (948M)
No MobileNetV1 speedup yet; that requires a per-subclass strategy
designed for PW/DW data layout (future commit).
* style: apply pre-commit (yapf + clang-format)
Pure formatter output from the repo's pre-commit hooks; no logic changes.
Brings PR #28 in line with the CI lint check.
* ci: emit performance summary at end of pytest session
pytest_runtest_logreport now scans each test's captured stdout/stderr for
'Runtime: N cycles' (the harness already prints this; output_parser.py
extracts it for TestResult). pytest_terminal_summary writes:
- a 'Performance Summary' block to the terminal (sorted by nodeid)
- a Markdown table to GITHUB_STEP_SUMMARY when running under GitHub
Actions, so the cycle counts surface on the PR check page
Tests without a Runtime line are silently skipped (e.g. lint-only jobs).
xdist-safe: pytest_runtest_logreport fires on the master with every worker's
reports, so the cycle counts are collected in a single process.
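A condensed conftest.py sketch of the two hooks; the regex, the module-level dict, and the table layout are illustrative simplifications of what the commit describes, not the repository's exact implementation.

```python
import os
import re

_PERF = {}                                           # nodeid -> cycle count
_RUNTIME_RE = re.compile(r"Runtime:\s*(\d+)\s*cycles")

def pytest_runtest_logreport(report):
    if report.when != "call":
        return
    text = (report.capstdout or "") + (report.capstderr or "")
    match = _RUNTIME_RE.search(text)
    if match:                                        # tests without a Runtime line are skipped
        _PERF[report.nodeid] = int(match.group(1))

def pytest_terminal_summary(terminalreporter):
    if not _PERF:
        return
    terminalreporter.section("Performance Summary")
    for nodeid in sorted(_PERF):
        terminalreporter.write_line(f"{nodeid}: {_PERF[nodeid]:,} cycles")
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")  # set by GitHub Actions
    if summary_path:
        with open(summary_path, "a") as f:
            f.write("| Test | Cycles |\n|---|---|\n")
            for nodeid in sorted(_PERF):
                f.write(f"| {nodeid} | {_PERF[nodeid]:,} |\n")
```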
* perf(PWConvGradX): direct axpy kernel, drop Cin*Cout transpose scratch
The old PW ConvGradX kernel did W transpose + pulp-trainlib mm, which
required a Cin*Cout*sizeof(float) transient buffer. On MobileNetV1
block 6-10 PW layers (Cin=Cout=128) that buffer ate 64 KB of the 128 KB
L1 scratch, forcing the tiler to fragment Cin*H*W into 36 micro-tiles
each paying 95% of its wall time on L3<->L2 DMA / sync instead of compute.
This commit replaces the kernel with a direct dX[ci,hw] = sum_co
W[co,ci] * dY[co,hw] worker that parallelizes over Cin and streams W
rows / dY rows contiguously. No scratch is required, so the template's
transient-buffer hoist is removed too and the kernel signature drops
pTransposeBuffer / transposeBufferSize.
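For reference, the computation the new worker performs, as a hedged NumPy/loop sketch (function name hypothetical; the real kernel is C, with the Cin loop split across the 8 cores). Note that no scratch buffer is needed.

```python
import numpy as np

def pw_convgradx_reference(W, dY):
    """Direct PW ConvGradX (stride=1, pad=0): dX[ci,hw] = sum_co W[co,ci] * dY[co,hw]."""
    Cout, Cin = W.shape[0], W.shape[1]          # W is [Cout, Cin, 1, 1]
    dY2 = dY.reshape(Cout, -1)                  # [Cout, H*W]
    dX = np.zeros((Cin, dY2.shape[1]), dtype=dY.dtype)
    for ci in range(Cin):                       # each core takes a slice of Cin
        for co in range(Cout):                  # axpy: stream one W column, one dY row
            dX[ci] += W[co, ci, 0, 0] * dY2[co]
    return dX
```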
MobileNetV1 block 6-10 PW ConvGradX: 26.17M -> 0.79M cycles each (33x)
End-to-end MobileNetV1 step: 245M -> 145M cycles before the matching
tile policy change (~1.7x); see follow-up commit for full 2.24x.
* perf(PWConvGradX): pin HW=full in tile policy when dY full fits
Without this, the tiler's default cost model splits the PW ConvGradX
spatial extent into single-pixel tiles for layers with small NHW.
On MobileNetV1 block_11/12 (Cin/Cout=128-256, NHW=9) that produced
schedules of 12 / 18 tiles in which the direct axpy's 9-iteration
inner loop never amortised its overhead — block_12 alone cost 23.6M
cycles, *worse* than the baseline mm-based kernel.
Pinning H/W to full in the policy forces Cin to absorb all tiling
pressure. The constraint is conditional on the full dY fitting under
HW_PIN_BUDGET_BYTES (24 KB) so early MobileNetV1 PW layers (NHW=2304
with dY full = 144 KB) keep their HW tiling.
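A hedged sketch of that conditional constraint. Only HW_PIN_BUDGET_BYTES and the 24 KB value come from the commit message; the layer fields and the `tiler.pin_full` call are hypothetical placeholders for the real TileConstraint API.

```python
HW_PIN_BUDGET_BYTES = 24 * 1024

def add_pw_convgradx_policy(layer, tiler):
    dy_bytes = layer.Cout * layer.H_out * layer.W_out * 4   # fp32 dY size
    if dy_bytes <= HW_PIN_BUDGET_BYTES:
        # dY fits whole: pin H/W to full so the short inner loop is never
        # split into single-pixel tiles; Cin absorbs the tiling pressure.
        tiler.pin_full(layer.dY, dims=("H", "W"))
    # else: large-NHW layers (e.g. NHW=2304, dY = 144 KB) keep their HW tiling
```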
MobileNetV1 block_11 PW ConvGradX: 14.41M -> 0.56M cycles (25x)
MobileNetV1 block_12 PW ConvGradX: 23.62M -> 1.08M cycles (22x)
End-to-end MobileNetV1 step: 245M -> 109M cycles (2.24x)
* ci(perf-summary): also scrape BENCH train_cycles / opt_cycles for training tests
The original perf-summary hook only matched ``Runtime: N cycles`` (the
inference harness format). Training tests use a different banner:
BENCH train_cycles=N opt_cycles=N weight_sram=N
so the GITHUB_STEP_SUMMARY tables came out empty for the
siracusa-training-tiled jobs. The hook now scrapes both formats,
splits the Markdown summary into separate Training / Inference
sections, and renders train / opt / weight_sram side-by-side.
Verified locally with a dummy pytest module that emits both formats.
* fix(PWConvGradX): only accept stride=1 — stride>1 falls back to im2col path
The PW ConvGradX kernel computes `dX[ci, hw] = sum_co W[co, ci] * dY[co, hw]`
and indexes dX with `ci * HW` where `HW = H_out * W_out` — implicitly
assuming dX and dY share the same spatial extent (stride=1, pad=0). For
stride>1 1x1 convolutions (e.g. ResNet8 layer2/3 downsample shortcuts,
shape (16, 32, 32) -> (32, 16, 16)) the correct backward writes are
sparse:
dX[ci, stride*y, stride*x] = sum_co W[co, ci] * dY[co, y, x]
dX[ci, otherwise] = 0
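A minimal NumPy reference of those strided writes (assuming pad=0), purely to illustrate the sparsity the PW kernel cannot express; this is not the fallback im2col kernel.

```python
import numpy as np

def pw_convgradx_strided_reference(W, dY, Hin, Win, stride):
    """Correct stride>1 1x1 ConvGradX: sparse writes, untouched pixels stay zero."""
    Cout, Cin = W.shape[0], W.shape[1]               # W is [Cout, Cin, 1, 1]
    _, Hout, Wout = dY.shape
    dX = np.zeros((Cin, Hin, Win), dtype=dY.dtype)
    for ci in range(Cin):
        for y in range(Hout):
            for x in range(Wout):
                dX[ci, stride * y, stride * x] = np.dot(W[:, ci, 0, 0], dY[:, y, x])
    return dX
```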
The PW kernel ignores stride and writes dY-sized data into the first
`H_out * W_out` slots of every dX channel, which mostly overlaps the
wrong spatial positions and leaves the rest at zero. The old transpose
+ mm kernel had the same bug; both happened to land within tolerance
historically due to lucky rounding accumulation, but the new direct
axpy kernel pushed the residual past the 0.001 loss tolerance in CI
(ResNet8 L3 single-buffer: loss[1..3] diff 0.025-0.042 vs TOL 0.001).
Reject stride>1 in the PW parser so those layers go through the
NodeMapper's next candidate — `ConvGradX2DIm2ColHWTileConstraint` with
PULP_ConvGradX2d_*_Im2Col_tiled, which handles arbitrary stride.
ResNet8 (Siracusa, L1=128KB, L3): losses now within TOL, 0 errors,
BENCH train_cycles=513M (vs baseline 511M, ~+0.4% from the im2col
fallback on 2 tiny downsample shortcuts).
6 files changed: 942 additions, 316 deletions
File tree:
- DeeployTest
- Deeploy/Targets/PULPOpen
  - Templates
  - TileConstraints
- TargetLibraries/PULPOpen
  - inc/kernel
  - src