
Commit 022f045

perf(ConvGrad): GradWStrategy framework + ConvGradX/W speedups (#28)
* perf(ConvGradX): im2col+GEMM for stride=1 + loop reorder for stride>1

  Replace the naive 7-deep direct loop in
  PULP_ConvGradX2d_fp32_fp32_fp32_CHW_Im2Col_tiled with two optimized paths.

  stride=1: custom im2col + W transpose + GEMM with Cout blocking
  - Build the dY_col matrix by gathering dY patches (padding handled via
    boundary checks)
  - Transpose W[Cout,Cin,P,Q] -> W_flat[Cin, Cout*P*Q] for row-major GEMM
  - GEMM with M-dimension unroll-by-4 for cache-line-friendly B access
  - Internal Cout blocking when the im2col buffer is smaller than the full
    Cout*P*Q*Hin*Win
  - 8-core parallel over Cin (output rows), with barriers between steps
  (A NumPy reference of this stride=1 path follows this group of entries.)

  stride>1: loop reorder (ci outermost, co innermost)
  - Moves co to the innermost loop so W[co,...] and dY[co,...] are accessed
    with unit stride, improving cache behavior vs the original co-outermost
    order

  ResNet8 training results (4 steps, Siracusa 8-core, L3 mode):
    Original:  1,301,512,576 cycles
    Optimized:   844,186,047 cycles (1.54x speedup, -35%)

  ConvGradX step-0 breakdown:
    Original:  185.6 Mc (56.8% of step)
    Optimized:  73.3 Mc (34.2% of step) → 2.53x on ConvGradX alone

  Numerical accuracy verified: all 4 training steps within tolerance.

* perf(ConvGradW): tile Cin instead of Cout to reduce tile count

  Change the ConvGradW tiling policy from Cout+spatial tiling to Cin tiling:
  - Old: Cout split (47+17) x Hout x Wout (8x8) = 128 tiles, GEMM K=1
  - New: Cin split (22+22+20) = 3 tiles, GEMM K = Hout*Wout = 64

  For ResNet8 layer3_conv2 ConvGradW:
  - 128 tiles -> 3 tiles (71% of the DMA overhead eliminated)
  - GEMM producing dW[Cout, Cin_tile*P*Q] with K = Hout*Wout = 64 (a proper
    reduction) instead of K=1 (an outer product); a reference sketch follows
    this group of entries

  ResNet8 training (4 steps, Siracusa 8-core, L3 mode):
    Before (ConvGradX opt only):  844,186,047 cycles
    After (+ ConvGradW Cin tile): 546,778,941 cycles (1.54x additional)
  Total vs the original baseline: 2.38x speedup

* refactor(ConvGradW): introduce GradWStrategy framework + HWSlice strategy

  Extract the ConvGradW tiling policy / serialization into pluggable Strategy
  classes so each (layer-shape regime, hardware budget) pair can pick its own
  tiling.

  Strategy interface (a dispatch sketch follows this group of entries):
  - applies(): whether this strategy is feasible/preferred for a given layer
  - add_constraints(): policy constraints emitted to the OR-tools tiler
  - matches_solution(): recognize the strategy's signature in a tiler result
  - serialize(): emit the per-tile codegen schedule

  Strategies in this commit:
  - CinSliceStrategy: lifts the Step-2 behavior verbatim (Cout/dY full, Cin
    tiled). Applies when dy_bytes <= 32 KB; best for small-spatial /
    big-channel layers (ResNet8 deep convs).
  - HWSliceStrategy: new. dY tiled over H/W, dW kept full as the L1
    accumulation target, X derived as a halo. Required for layers whose dY
    exceeds L1 (e.g. the MobileNetV1 stem: dY = 16x96x96 = 576 KB > 128 KB L1).

  ConvGradWTileConstraintBase.addPolicyConstraint / serializeTilingSolution
  now dispatch through `cls.strategies` (an ordered priority list).
  Subclasses configure their strategy set:
  - ConvGradW2DTileConstraint: [CinSlice, HWSlice] (new: HW fallback)
  - PWConvGradWTileConstraint: [CinSlice] (inherits default)
  - DWConvGradW2DTileConstraint: [CinSlice] (inherits default)

  PW / DW keep their existing addPolicyConstraint / addGeometricalConstraint
  overrides (refactoring those is Commit B).

  Verification: ResNet8 must reproduce the 546M training cycles exactly
  (CinSlice is a verbatim lift of the base behavior at ee1288d).
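> Note: the following is a minimal NumPy reference of what the stride=1 ConvGradX path computes (im2col over dY, W transposed to [Cin, Cout*P*Q], one GEMM). It is a sketch of the math only; the function name, signature, and default padding are assumptions for illustration, and it omits the Cout blocking, M-unrolling, and 8-core split of the actual PULP kernel.

```python
import numpy as np

def conv_grad_x_stride1(dY, W, pad=(1, 1)):
    """Reference for the stride=1 ConvGradX path: im2col over dY,
    W transposed to [Cin, Cout*P*Q], then a single GEMM."""
    Cout, Hout, Wout = dY.shape
    _, Cin, P, Q = W.shape
    ph, pw = pad
    Hin, Win = Hout - 1 + P - 2 * ph, Wout - 1 + Q - 2 * pw  # stride=1 shape relation

    # dY_col[(co*P + p)*Q + q, hi*Win + wi] = dY[co, hi - p + ph, wi - q + pw],
    # zero when the index falls outside dY (the "boundary checks").
    dY_col = np.zeros((Cout * P * Q, Hin * Win), dtype=dY.dtype)
    for co in range(Cout):
        for p in range(P):
            for q in range(Q):
                row = (co * P + p) * Q + q
                for hi in range(Hin):
                    ho = hi - p + ph
                    if not 0 <= ho < Hout:
                        continue
                    for wi in range(Win):
                        wo = wi - q + pw
                        if 0 <= wo < Wout:
                            dY_col[row, hi * Win + wi] = dY[co, ho, wo]

    # W[Cout, Cin, P, Q] -> W_flat[Cin, Cout*P*Q], rows ordered like dY_col
    W_flat = W.transpose(1, 0, 2, 3).reshape(Cin, Cout * P * Q)

    # One GEMM: dX_flat[Cin, Hin*Win] = W_flat @ dY_col
    return (W_flat @ dY_col).reshape(Cin, Hin, Win)
```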
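> Note: likewise, a sketch of the Cin-tiled ConvGradW GEMM with K = Hout*Wout, as referenced in the perf(ConvGradW) entry above. Function name and signature are illustrative, not Deeploy's; the real kernel builds X_col per tile in L1.

```python
import numpy as np

def conv_grad_w_cin_slice(dY, X, P, Q, cin_slice, stride=1, pad=1):
    """One ConvGradW GEMM per Cin slice with K = Hout*Wout (a true reduction),
    instead of a K=1 outer product per spatial tile."""
    Cout, Hout, Wout = dY.shape
    n_cin = len(cin_slice)

    # X_col[ho*Wout + wo, (t*P + p)*Q + q] = X[ci, ho*stride + p - pad, wo*stride + q - pad]
    X_col = np.zeros((Hout * Wout, n_cin * P * Q), dtype=X.dtype)
    for t, ci in enumerate(cin_slice):
        for p in range(P):
            for q in range(Q):
                col = (t * P + p) * Q + q
                for ho in range(Hout):
                    hi = ho * stride + p - pad
                    if not 0 <= hi < X.shape[1]:
                        continue
                    for wo in range(Wout):
                        wi = wo * stride + q - pad
                        if 0 <= wi < X.shape[2]:
                            X_col[ho * Wout + wo, col] = X[ci, hi, wi]

    # GEMM with K = Hout*Wout: dW_slice[Cout, Cin_tile*P*Q]
    dW_slice = dY.reshape(Cout, Hout * Wout) @ X_col
    return dW_slice.reshape(Cout, n_cin, P, Q)
```

For the ResNet8 layer3_conv2 example above, the three Cin slices would be along the lines of range(0, 22), range(22, 44), range(44, 64), each producing one GEMM with K = 64.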
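> Note: a hypothetical sketch of the strategy dispatch described in the refactor(ConvGradW) entry: an ordered `strategies` list consulted by addPolicyConstraint / serializeTilingSolution. Class layout and method signatures are assumptions for illustration, not Deeploy's actual API.

```python
from abc import ABC, abstractmethod
from typing import List

class GradWStrategy(ABC):
    """One tiling policy for ConvGradW (illustrative interface)."""

    @abstractmethod
    def applies(self, layer) -> bool:
        """Is this strategy feasible / preferred for the given layer shapes?"""

    @abstractmethod
    def add_constraints(self, tiler_model, layer) -> None:
        """Emit this policy's constraints to the OR-tools tiler."""

    @abstractmethod
    def matches_solution(self, solution) -> bool:
        """Recognize this strategy's signature in a tiler result."""

    @abstractmethod
    def serialize(self, solution, layer):
        """Emit the per-tile codegen schedule."""


class ConvGradWTileConstraintBase:
    # Ordered priority list; subclasses override it. By the end of this PR the
    # regular Conv class uses [CinSlice, CoutHWSlice] and PW/DW use [CoutHWSlice].
    strategies: List[GradWStrategy] = []

    @classmethod
    def addPolicyConstraint(cls, tiler_model, layer):
        for strategy in cls.strategies:
            if strategy.applies(layer):
                return strategy.add_constraints(tiler_model, layer)
        raise RuntimeError("no applicable GradW strategy for this layer")

    @classmethod
    def serializeTilingSolution(cls, solution, layer):
        for strategy in cls.strategies:
            if strategy.matches_solution(solution):
                return strategy.serialize(solution, layer)
        raise RuntimeError("tiling solution matches no registered strategy")
```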
* refactor(ConvGradW): rename HWSlice -> CoutHWSlice, free Cout/HW like the devel default

  The previous HWSliceStrategy pinned dy_dim_1 (Cout) to full, which forced
  the forward Conv's output Y to keep the full Cout at every memory level.
  For the MobileNetV1 stem (dY = 16x96x96 = 576 KB) this made the joint
  tiling problem infeasible: the forward Conv couldn't fit Y in L1, and the
  backward pass couldn't agree on Y's tiling.

  Replace it with a CoutHWSliceStrategy that matches devel's pre-Cin-slice
  default policy:
  - X Cin (dim 1) full
  - dW Cin / kH / kW full, Cout (dim 0) free
  - dY HW free, dY Cout free

  The OR-tools tiler picks the optimal Cout / HW split per layer. dW slices
  are along Cout (disjoint per Cout slab); each (ho, wo) tile inside a Cout
  slab accumulates partial sums via mm_add. Serialization iterates Cout
  slabs in the outer loop and HW tiles in the inner loop (a port of devel's
  serializeTilingSolution; see the sketch after this group of entries).

  Strategy selection for ConvGradW2DTileConstraint (also sketched below):
  - dy_bytes <= 32 KB -> CinSlice (big GEMM K = Hout*Wout, ResNet8 deep convs)
  - dy_bytes > 32 KB -> CoutHWSlice (MobileNetV1 stem etc.; the tiler picks)

  PW / DW classes still inherit the [CinSlice] default; their fallback
  handling is Commit B.

* refactor(ConvGradW): default strategies include CoutHWSlice fallback

  PW and DW classes' addPolicyConstraint calls super(), which goes through
  the dispatch in the refactored base. The default strategy list was just
  [CinSlice]: for big-dY PW/DW layers (e.g. shallow MobileNetV1 PW layers)
  CinSlice doesn't apply, yet the dispatcher fell back to it anyway, yielding
  infeasible constraints (dY spatial must be full).

  Change the base default to [CinSlice, CoutHWSlice] so PW / DW and
  ConvGradW2D all share the same priority list:
  - dy_bytes <= 32 KB -> CinSlice (big GEMM K)
  - otherwise -> CoutHWSlice (free Cout / HW)

  PW's extra "dY spatial full" constraint still applies on top (its
  docstring rationale: forbid HW tiling for memset correctness). The
  combined effect for big-dY PW layers: only Cout tiling is allowed,
  matching devel's behavior.

  MobileNetV1 now passes the tiling solver and reaches simulation; an
  out-of-bound L1 bank access in sim remains to be debugged.

* fix(ConvGradW): restrict CinSlice to regular Conv; PW/DW use CoutHWSlice only

  CinSliceStrategy was being dispatched to PW (small-dY deep PW layers) and
  DW classes via the base default strategies, causing L1 bank out-of-bound
  accesses during MobileNetV1 simulation: 1000+ OOB warnings starting at
  training step 1.

  Root cause (verified by an isolation test):
  - CinSlice.serialize iterates dW Cin slices from
    absoluteOutputCubes[i].rectangle.dims[1], assuming the standard
    [Cout, Cin/group, P, Q] layout.
  - For DW: dW has layout [C, 1, P, Q]; dim_1 == 1 makes Cin slicing
    degenerate, and the X tile derivation breaks DW's channel semantics.
  - For PW: dW is [Cout, Cin, 1, 1]; CinSlice can split Cin, but PW's extra
    dy_spatial/x_spatial=full constraints combined with CinSlice's constraint
    set produce a tile configuration whose dW pointer reads past the L1 bank
    boundary in the per-tile mm kernel.

  Fix: explicit ``strategies = [CoutHWSliceStrategy]`` on the PW and DW
  subclasses. The regular ConvGradW2DTileConstraint keeps
  [CinSlice, CoutHWSlice] for the ResNet8-style speedup.

  Verification:
  - ResNet8: 511M cycles (the CinSlice path stays active for deep conv layers)
  - MobileNetV1: 950M cycles, 0 OOB, matches the devel baseline (948M)

  No MobileNetV1 speedup yet; that requires a per-subclass strategy designed
  for the PW/DW data layout (future commit).

* style: apply pre-commit (yapf + clang-format)

  Pure formatter output from the repo's pre-commit hooks; no logic changes.
  Brings PR #28 in line with the CI lint check.
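> Note: a sketch of the CoutHWSlice accumulation order described above: Cout slabs of dW are disjoint and iterated in the outer loop, while HW tiles inside a slab accumulate partial sums into the same slab, which is the role mm_add plays per tile in the generated code. Names and the flattened shapes (dY [Cout, Hout*Wout], X_col [Hout*Wout, Cin*P*Q]) are illustrative assumptions.

```python
import numpy as np

def grad_w_cout_hw_slices(dY, X_col, cout_slabs, hw_tiles):
    """dY: [Cout, Hout*Wout] flattened; X_col: [Hout*Wout, Cin*P*Q].
    Cout slabs are disjoint slices of dW (outer loop); HW tiles inside a
    slab accumulate partial sums into the same slab (the mm_add role)."""
    dW = np.zeros((dY.shape[0], X_col.shape[1]), dtype=dY.dtype)
    for co_lo, co_hi in cout_slabs:        # outer: disjoint Cout slabs of dW
        for hw_lo, hw_hi in hw_tiles:      # inner: HW tiles accumulate
            dW[co_lo:co_hi] += dY[co_lo:co_hi, hw_lo:hw_hi] @ X_col[hw_lo:hw_hi]
    return dW
```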
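> Note: the strategy selection rule quoted in the rename entry above, as a one-line sketch. The constant and helper names are illustrative, not Deeploy identifiers; only the 32 KB threshold comes from the commit message.

```python
CIN_SLICE_DY_BUDGET_BYTES = 32 * 1024  # dy_bytes threshold quoted above

def pick_gradw_strategy(dy_bytes: int) -> str:
    """Small dY: CinSlice (big GEMM K); otherwise CoutHWSlice (free Cout / HW)."""
    return "CinSlice" if dy_bytes <= CIN_SLICE_DY_BUDGET_BYTES else "CoutHWSlice"
```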
* ci: emit performance summary at end of pytest session

  pytest_runtest_logreport now scans each test's captured stdout/stderr for
  'Runtime: N cycles' (the harness already prints this; output_parser.py
  extracts it for TestResult). pytest_terminal_summary writes:
  - a 'Performance Summary' block to the terminal (sorted by nodeid)
  - a Markdown table to GITHUB_STEP_SUMMARY when running under GitHub
    Actions, so the cycle counts surface on the PR check page

  Tests without a Runtime line are silently skipped (e.g. lint-only jobs).
  xdist-safe: pytest_runtest_logreport fires on the master with all worker
  reports, so collection is single-process. (A conftest.py sketch of these
  hooks follows the last entry below.)

* perf(PWConvGradX): direct axpy kernel, drop Cin*Cout transpose scratch

  The old PW ConvGradX kernel did a W transpose + pulp-trainlib mm, which
  required a Cin*Cout*sizeof(float) transient buffer. On MobileNetV1 block
  6-10 PW layers (Cin=Cout=128) that buffer ate 64 KB of the 128 KB L1
  scratch, forcing the tiler to fragment Cin*H*W into 36 micro-tiles, each
  paying 95% of its wall time in L3<->L2 DMA / sync instead of compute.

  This commit replaces the kernel with a direct
  dX[ci, hw] = sum_co W[co, ci] * dY[co, hw] worker that parallelizes over
  Cin and streams W rows / dY rows contiguously (see the reference sketch
  after the last entry below). No scratch is required, so the template's
  transient-buffer hoist is removed too and the kernel signature drops
  pTransposeBuffer / transposeBufferSize.

  MobileNetV1 block 6-10 PW ConvGradX: 26.17M -> 0.79M cycles each (33x)
  End-to-end MobileNetV1 step: 245M -> 145M cycles before the matching tile
  policy change (~1.7x); see the follow-up commit for the full 2.24x.

* perf(PWConvGradX): pin HW=full in tile policy when the full dY fits

  Without this, the tiler's default cost model splits the PW ConvGradX
  spatial extent into single-pixel tiles for layers with small NHW. On
  MobileNetV1 block_11/12 (Cin/Cout=128-256, NHW=9) that produced schedules
  of 12 / 18 tiles in which the direct axpy's 9-iteration inner loop never
  amortised its overhead; block_12 alone cost 23.6M cycles, *worse* than the
  baseline mm-based kernel.

  Pinning H/W to full in the policy forces Cin to absorb all the tiling
  pressure. The constraint is conditional on the full dY fitting under
  HW_PIN_BUDGET_BYTES (24 KB), so early MobileNetV1 PW layers (NHW=2304,
  full dY = 144 KB) keep their HW tiling. (A sketch of this budget check
  follows the last entry below.)

  MobileNetV1 block_11 PW ConvGradX: 14.41M -> 0.56M cycles (25x)
  MobileNetV1 block_12 PW ConvGradX: 23.62M -> 1.08M cycles (22x)
  End-to-end MobileNetV1 step: 245M -> 109M cycles (2.24x)

* ci(perf-summary): also scrape BENCH train_cycles / opt_cycles for training tests

  The original perf-summary hook only matched ``Runtime: N cycles`` (the
  inference harness format). Training tests use a different banner:

      BENCH train_cycles=N opt_cycles=N weight_sram=N

  so the GITHUB_STEP_SUMMARY tables came out empty for the
  siracusa-training-tiled jobs.

  The hook now scrapes both formats, splits the Markdown summary into
  separate Training / Inference sections, and renders train / opt /
  weight_sram side by side (both formats are covered by the conftest sketch
  below). Verified locally with a dummy pytest module that emits both
  formats.

* fix(PWConvGradX): only accept stride=1; stride>1 falls back to the im2col path

  The PW ConvGradX kernel computes
  `dX[ci, hw] = sum_co W[co, ci] * dY[co, hw]` and indexes dX with `ci * HW`
  where `HW = H_out * W_out`, implicitly assuming dX and dY share the same
  spatial extent (stride=1, pad=0). For stride>1 1x1 convolutions (e.g.
  ResNet8 layer2/3 downsample shortcuts, shape (16, 32, 32) -> (32, 16, 16))
  the correct backward writes are sparse (see the reference sketch below):

      dX[ci, stride*y, stride*x] = sum_co W[co, ci] * dY[co, y, x]
      dX[ci, otherwise]          = 0

  The PW kernel ignores stride and writes dY-sized data into the first
  `H_out * W_out` slots of every dX channel, which mostly lands on the wrong
  spatial positions and leaves the rest at zero. The old transpose + mm
  kernel had the same bug; both happened to stay within tolerance
  historically thanks to lucky rounding accumulation, but the new direct
  axpy kernel pushed the residual past the 0.001 loss tolerance in CI
  (ResNet8 L3 single-buffer: loss[1..3] diff 0.025-0.042 vs TOL 0.001).

  Reject stride>1 in the PW parser so those layers go through the
  NodeMapper's next candidate, `ConvGradX2DIm2ColHWTileConstraint` with
  PULP_ConvGradX2d_*_Im2Col_tiled, which handles arbitrary stride.

  ResNet8 (Siracusa, L1=128KB, L3): losses now within TOL, 0 errors,
  BENCH train_cycles=513M (vs baseline 511M, ~+0.4% from the im2col fallback
  on the 2 tiny downsample shortcuts).
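> Note: a conftest.py sketch of the perf-summary hooks described in the two ci entries above, scraping both the `Runtime: N cycles` and the `BENCH ...` banners. The regexes, storage dict, and table layout are illustrative assumptions; only the pytest hook names, the banner formats, and the GITHUB_STEP_SUMMARY mechanism come from the commits.

```python
# conftest.py (sketch)
import os
import re

RUNTIME_RE = re.compile(r"Runtime:\s*(\d+)\s*cycles")  # inference harness banner
BENCH_RE = re.compile(r"BENCH train_cycles=(\d+) opt_cycles=(\d+) weight_sram=(\d+)")  # training banner

_perf = {}  # nodeid -> scraped metrics


def pytest_runtest_logreport(report):
    # Under pytest-xdist this hook fires on the master for every worker's
    # report, so collecting into a plain dict stays single-process.
    if report.when != "call":
        return
    text = report.capstdout + report.capstderr
    if m := RUNTIME_RE.search(text):
        _perf[report.nodeid] = {"runtime": int(m.group(1))}
    elif m := BENCH_RE.search(text):
        _perf[report.nodeid] = {
            "train_cycles": int(m.group(1)),
            "opt_cycles": int(m.group(2)),
            "weight_sram": int(m.group(3)),
        }


def pytest_terminal_summary(terminalreporter, exitstatus, config):
    if not _perf:
        return  # e.g. lint-only jobs: no Runtime/BENCH banner, stay silent
    terminalreporter.write_sep("=", "Performance Summary")
    for nodeid in sorted(_perf):
        terminalreporter.write_line(f"{nodeid}: {_perf[nodeid]}")
    # Surface the numbers on the PR check page when running under GitHub Actions.
    if summary_path := os.environ.get("GITHUB_STEP_SUMMARY"):
        with open(summary_path, "a") as f:
            f.write("| test | metrics |\n|---|---|\n")
            for nodeid in sorted(_perf):
                f.write(f"| {nodeid} | {_perf[nodeid]} |\n")
```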
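> Note: a NumPy reference for the PW ConvGradX semantics discussed above: the stride=1 direct form the new kernel computes, and the sparse-write form that stride>1 would require (which is why such layers now fall back to the im2col kernel). Function name and signature are illustrative, not the PULP worker's.

```python
import numpy as np

def pw_conv_grad_x(dY, W, H_in, W_in, stride=1):
    """1x1 (PW) ConvGradX reference. W layout: [Cout, Cin, 1, 1]."""
    Cout, H_out, W_out = dY.shape
    Cin = W.shape[1]
    W2d = W.reshape(Cout, Cin)

    dX = np.zeros((Cin, H_in, W_in), dtype=dY.dtype)
    if stride == 1:
        # Direct form: dX[ci, hw] = sum_co W[co, ci] * dY[co, hw]
        # (dX and dY share the same spatial extent)
        dX[:] = (W2d.T @ dY.reshape(Cout, H_out * W_out)).reshape(Cin, H_in, W_in)
    else:
        # Sparse writes: only the strided positions receive gradient,
        # everything else stays zero.
        for y in range(H_out):
            for x in range(W_out):
                dX[:, stride * y, stride * x] = W2d.T @ dY[:, y, x]
    return dX
```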
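> Note: finally, a sketch of the conditional HW=full pin: the spatial dims are only pinned when the whole dY fits under the 24 KB budget. The helper name is hypothetical; HW_PIN_BUDGET_BYTES is the constant named in the commit.

```python
HW_PIN_BUDGET_BYTES = 24 * 1024  # budget named in the commit message

def should_pin_hw_full(cout: int, h_out: int, w_out: int, elem_bytes: int = 4) -> bool:
    """Pin H/W to full (letting Cin absorb the tiling pressure) only when the
    full dY fits under the budget; big-spatial layers keep their HW tiling."""
    dy_bytes = cout * h_out * w_out * elem_bytes
    return dy_bytes <= HW_PIN_BUDGET_BYTES
```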
1 parent d88b620 commit 022f045

6 files changed

Lines changed: 942 additions & 316 deletions


Deeploy/Targets/PULPOpen/Parsers.py

Lines changed: 11 additions & 0 deletions
@@ -552,6 +552,17 @@ def parseNode(self, node: gs.Node) -> bool:
         if kernel_shape != [1, 1]:
             return False
 
+        # The PW ConvGradX kernel (direct dX[ci, hw] = sum_co W[co, ci] *
+        # dY[co, hw]) implicitly assumes dX and dY share the same spatial
+        # extent — i.e. stride == 1. Stride>1 1x1 convolutions (e.g.
+        # ResNet8 downsample shortcuts) need the sparse-write semantics
+        # `dX[ci, stride*y, stride*x] = ...` which only the im2col-tiled
+        # general kernel implements correctly, so reject them here and let
+        # them fall through to ConvGradX2DIm2ColHWTileConstraint.
+        strides = self.operatorRepresentation.get('strides', [1, 1])
+        if list(strides) != [1, 1]:
+            return False
+
         return wellFormed and True

Deeploy/Targets/PULPOpen/Templates/FloatConvGradTemplate.py

Lines changed: 10 additions & 26 deletions
@@ -353,34 +353,19 @@ def hoistTransientBuffers(self, ctxt: NetworkContext,
 
 
 class PULP2DFloatPWConvGradXTemplate(NodeTemplate):
+    """PW (1x1) ConvGradX template.
+
+    The direct PULP_PWConvGradX2d_fp32_fp32_fp32_CHW kernel parallelises over
+    Cin and streams W rows / dY rows contiguously, so no weight-transpose
+    scratch is required. Allocating a Cin*Cout transient (the old
+    transpose buffer) used to eat ~64 KB of L1 for the MobileNetV1
+    block 6-10 PW layers and forced the tiler to fragment Cin/H/W into
+    ~36 tiles per layer; removing it lets the tiler pick coarse tiles.
+    """
 
     def __init__(self, templateStr):
        super().__init__(templateStr)
 
-    @staticmethod
-    def computeTransientBuffersSize(
-            ctxt: NetworkContext,
-            operatorRepresentation: OperatorRepresentation) -> List[Tuple[str, Union[int, IntVar]]]:
-        # Transpose buffer for weight matrix transpose (C_out x C_in)
-        # For pointwise convolution, kernel size is 1x1
-        bt_dim = (operatorRepresentation["weight_type"].typeWidth // 8) * \
-            operatorRepresentation['ch_im_in'] * operatorRepresentation['ch_im_out']
-
-        bt_name = operatorRepresentation['nodeName'] + "_transpose_buffer"
-
-        return [(bt_name, bt_dim)]
-
-    def hoistTransientBuffers(self, ctxt: NetworkContext,
-                              operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, Dict, List[str]]:
-        bt_name, bt_dim = PULP2DFloatPWConvGradXTemplate.computeTransientBuffersSize(ctxt, operatorRepresentation)[0]
-
-        ctxt.hoistTransientBuffer(bt_name, bt_dim)
-
-        operatorRepresentation['transposeBuffer'] = bt_name
-        operatorRepresentation['transposeBufferSize'] = bt_dim
-
-        return ctxt, operatorRepresentation, [bt_name]
 
 
 referencePWConvGradW2DTemplate = _ConvGradWTemplate("""
 // 2D FP Pointwise ConvGradW (1x1) NCHW using pulp-trainlib pw interface (Name: ${nodeName}, Op: ${nodeOp})
@@ -424,8 +409,7 @@ def hoistTransientBuffers(self, ctxt: NetworkContext,
         ref_${grad_in}_${weight},
         ${ch_im_in},
         ref_${grad_in}_out,
-        ${dim_im_in_x}, ${dim_im_in_y},
-        ${transposeBuffer}, ${transposeBufferSize}
+        ${dim_im_in_x}, ${dim_im_in_y}
     );
 
     ref_${grad_in}_${grad_out} += ${ch_im_out} * ${dim_im_out_y} * ${dim_im_out_x};
