
Commit 9e51c34

runwangdl and claude authored
Replace DW ConvGradX with trainlib gather kernel (#32)
* Replace DW ConvGradX scatter kernel with trainlib gather kernel

  Replace the Deeploy-written scatter-based DW ConvGradX kernel with a tile-aware extension of pulp-trainlib's gather kernel. The gather pattern accumulates all contributions per dX pixel into a register and writes once, vs the old scatter pattern which wrote each dX element C_out times.

  Changes:
  - ConvGrad.c: Add trainlib_tiled wrapper, remove old scatter kernel and unused PULP_DWConvTrans2d_fp32_fp32_fp32_HWC
  - Bindings.py: DW ConvGradX binding ForkTransformer → ClusterTransformer (trainlib kernels use pi_cl_team_fork internally)
  - FloatConvGradTemplate.py: Call _trainlib_tiled variant
  - TilingCodeGeneration.py: Edge-tile minimization fallback

  Tested: DSCNN, ResNet8, MobileNetV1 tiled training all PASS (Errors: 0)

* ci: trigger rebuild after pushing trainlib submodule

* Clean up PW ConvGradX: remove unused declarations, unify naming

  - Remove unused pulp_conv_pw_fp32_bw_input_grads_cl forward declaration
  - Remove unused transp_args, matMul_args, transpose_matrix, mm declarations (were for trainlib PW path, but PW uses direct AXPY kernel instead)
  - Move pw_convgradx_args_t to point of use (internal detail)
  - Rename worker: pulp_pw_convgradx_fp32_worker → PULP_PWConvGradX2d_worker (consistent with PULP_PWConvGradX2d_fp32_fp32_fp32_CHW naming)

* Rename PW ConvGradX kernel: PULP_PWConvGradX2d_worker → pw_kernel_input_grad

* Add ConvGrad.h header and reorganize ConvGrad.c

  - New kernel/ConvGrad.h: all public kernel function declarations organized by type (Regular/DW/PW) and direction (GradW/GradX)
  - Include ConvGrad.h via DeeployPULPMath.h
  - ConvGrad.c: compact trainlib interface structs, add section comments

* Split ConvGrad.c into Dense/DW/PW files

  - ConvGrad.c: regular (dense) conv GradW + GradX kernels (770 lines)
  - DWConvGrad.c: depthwise GradW + GradX trainlib tiled (145 lines)
  - PWConvGrad.c: pointwise GradW + GradX AXPY (131 lines)
  - ConvGrad_internal.h: shared trainlib interface structs + utilities

  No functional changes. CMake picks up new files via GLOB_RECURSE.

* Move structs and trainlib declarations into kernel/ConvGrad.h

  Remove ConvGrad_internal.h: all shared struct definitions (blob, Conv2D_args, DepthWise_Conv_args, PointWise_Conv_args), trainlib dispatch declarations, and utility functions now live in the public header kernel/ConvGrad.h alongside the kernel function prototypes.

* Replace dense ConvGradX with trainlib gather kernel, remove unused functions

  - Add trainlib_tiled wrapper for dense ConvGradX (ClusterTransformer)
  - Remove 4 unused ConvGradX functions: _trainlib (non-tiled), _Im2Col (non-tiled), _CHW_tiled (scatter), _CHW (no-offset)
  - Update ConvGrad.h: add trainlib_tiled declaration, Conv2D_args offset fields
  - Template: referenceConvGradX2DTemplate now calls _trainlib_tiled
  - Keep im2col_tiled kernel (ForkTransformer) unchanged

* DMA hoisting: skip redundant transfers for unchanged tiles

  Detect input tensors whose HyperRectangle is identical across all tile iterations and emit their DMA transfer once before the tile loop instead of on every iteration.

  Key example: CinSlice strategy keeps dY full across all Cin tiles. Previously dY was re-transferred every tile; now loaded once.

  Results (cycles):
    DSCNN: 2,654,391 → 2,649,898 (-0.2%)
    ResNet8: 513,114,910 → 506,531,380 (-1.3%)
    MobileNetV1: 418,380,763 → 411,330,774 (-1.7%)

* Add roofline DMA cost model for ConvGradW strategy selection (analysis only)

  Add _estimate_gradw_dma() cost estimator that computes expected DMA cycles per strategy, considering tensor promotion levels (L2 vs L3 bandwidth). Currently used for analysis/logging only; strategy selection still uses the empirically-validated priority order (CinSlice > CoutHWSlice).

  The cost model correctly identifies per-tensor bandwidth but underestimates solver-imposed overhead when switching strategies, causing regression on ResNet8 (506M → 842M) when used for active selection.

  Hardware parameters from gvsoc Siracusa config:
    BW_L2_TO_L1 = 16 B/cycle (MCHAN 4 ports × 4B)
    BW_L3_TO_L1 = 1 B/cycle (HyperBus)
    DMA_SETUP = 200 cycles/transaction

* Post-solve roofline cost model using actual tile sizes from solver

  Replace pre-solve cost estimation with post-solve analysis that uses the actual tile shapes produced by OR-Tools. Logs per-layer cost breakdown (compute vs DMA cycles, memory-bound vs compute-bound, tile shapes, hoistability) for paper analysis.

  Hardware params from gvsoc Siracusa config (documented with source files). No strategy selection changes; cost model is analysis/logging only. All 3 tests PASS with no regression.

* Per-tile roofline cost model with SB serial DMA+compute

  Rewrite _post_solve_cost to use per-tile arithmetic intensity:
    tile_AI = tile_FLOPs / tile_bytes_moved
    tile_attainable = min(peak, AI × bandwidth)
    tile_cost = tile_compute + tile_dma (SB serial, no overlap)

  Hardware params: L3→L2 = 1 B/cycle (HyperBus), L2→L1 = 16 B/cycle (MCHAN)
  Source: AI_AGENT/GVSOC/siracusa_bandwidth_parameters.md

  Key findings on ResNet8 (L3 mode):
  - layer1 3x3 convs: AI=3.7, memory-bound, only 23% peak utilization
  - shortcut 1x1 convs: AI=3.9-6.4, memory-bound, 6-10x overhead
  - layer2/3 with CinSlice: AI=22-68, compute-bound, near peak
  → Memory-bound layers are the primary optimization targets for promotion

* Cost-model-driven min Cin constraint for ConvGradX tiling

  Add minimum Cin tile size constraint in ConvGradX addPolicyConstraint, derived from roofline AI target (AI_MIN=4.0).

  For layers with large W (e.g., ResNet8 layer3.conv2: W=147KB > L1), the solver was forced to tile Cin to 1 → 64 tiny tiles with AI=0.45 (2.5% peak utilization). The new constraint computes cin_min from:
    AI = 2×Cout×cin_t×K×Ho×Wo / (dY_bytes + cin_t × per_cin_bytes)
  ensuring each tile has enough compute to amortize DMA cost.

  ResNet8 layer3.conv2: cin=1→5, tiles=64→13, AI=0.45→2.23, overhead=71x→14.6x

  Results (all PASS, Errors: 0/4):
    DSCNN: 2,649,898 (unchanged)
    ResNet8: 506M → 418M (-17.3%, -18.4% vs original baseline)
    MobileNetV1: 411M (unchanged)

* Revert spatial min constraint (caused regression on layer1)

  Spatial minimum tile constraint improved layer2/3 (12→8 tiles) but worsened layer1 (128→168 tiles) and layer2.conv2 (64→128 tiles) by restricting solver freedom. The Cin-only constraint is sufficient for the large-W layers where tile explosion is worst.

  Kept: min Cin constraint (fixes layer3 from 64→13 tiles, -18% cycles)
  Removed: spatial min constraint (net regression 418M→525M)

* Replace im2col ConvGradX with scatter kernel (no transient buffer)

  Switch dense ConvGradX from im2col+GEMM (3 barriers, 2 transient buffers) to scatter-add kernel (no barriers, no buffers). The scatter kernel uses the forward index relation ih=oh*s+ky-pad directly, no kernel flip.

  Benefits:
  - No ctxtBuffer/btBuffer → more L1 for data tiles → fewer tiles
  - No im2col/transpose/GEMM barriers → less sync overhead
  - Parallel over Cin (each core owns exclusive dX slice, no race)

  Results (all PASS, Errors: 0/4):
    DSCNN: 2,649,899 (unchanged)
    ResNet8: 418M → 384M (-8.1%, -25.1% vs original baseline)
    MobileNetV1: 411M (unchanged)

* Cap W_tile to L1/2 in ConvGradX tileconstraint

  When W is large (e.g. layer3.conv2 W=147KB > L1=128KB), the solver allocates most of L1 to W_tile, leaving tiny spatial tiles (5x1) and producing 32 tiles. Cap W_tile to L1/2 so spatial tiles get adequate space, reducing tile count and DMA overhead.

  ResNet8 backward time now comparable to forward (~20% faster).

* Fix W tile cap feasibility + gitignore build artifacts

  - ConvGradConstraint: only apply W tile cap when max_cin >= cin_min (prevents infeasible constraints on MobileNetV1)
  - gitignore: add DeeployTest build artifacts

* Revert DMA hoisting (causes DMA counter underflow on CCT)

* InPlaceAccumulatorV2: pin dim1 full to reduce tile count

* Revert InPlaceAccumulator pin dim1 (triggers DMA counter underflow on CCT)

* Fix TransposeTileConstraint for weight Transpose with broadcast batch dims

  When MatMulLayer.computeShapes injects a leading batch dim into the Transpose output (e.g. weight [K,N] → [1,N,K]), the perm only covers the spatial dims. The old addGeometricalConstraint applied perm indices directly to output dims without accounting for the extra leading dims, causing the tiler to severely underestimate the input tile size (e.g. 128 bytes instead of 4096 for a [32,32] weight).

  This led to MiniMalloc allocating overlapping L1 buffers for the Transpose's data_in and data_out, corrupting the transposed weight matrix. The corruption manifested as ~0.001–0.04 loss drift in tiled CCT training (attention Q/K/V weight transposes), while non-tiled mode matched ORT reference perfectly.

  Fix: shift perm constraint indices by numExtra and pin the extra leading output dims to their full size.

  Tested: CCT (img=8/16/32, dim=32/64/128), ResNet8, all 0/4 errors.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix lint: yapf, clang-format, autoflake on branch files

  - yapf: Bindings.py, ConvGradConstraint.py
  - clang-format: ConvGrad.h, ConvGrad.c, DWConvGrad.c
  - autoflake: remove unused import in ConvGradConstraint.py

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* CCT-2 with updated size

* update CCT-2 with new tolerance

* Fix yapf formatting in test_siracusa_tiled_config.py

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf: InPlaceAccumulatorV2 pin dim1 full to reduce tile count

  Re-apply the pin-dim1-full policy constraint for InPlaceAccumulatorV2. This was previously reverted (d3e508a) because it triggered a DMA counter underflow on CCT. The root cause was the TransposeTileConstraint L1 buffer overlap bug (fixed in ade87b5) which corrupted weight data and caused downstream DMA failures.

  With the Transpose fix in place, pinning dim1 is now safe. For a [128, 128] gradient tensor this reduces tiles from 256 → ~2, cutting DMA overhead from 91% to negligible.

  CCT perf: 470M → 443M cycles/step (-5.7%, 27M cycles saved).
  Correctness: 0/4 errors, max diff 0.000063.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
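The per-tile roofline formulas and the min-Cin constraint quoted in the commit message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in `_post_solve_cost` or `addPolicyConstraint`: the function names `tile_cost` and `min_cin_tile`, the default `peak` value, and the reading of `K` as the number of kernel taps are all hypothetical. Only the formulas and the two bandwidth figures come from the message. `min_cin_tile` simply solves the quoted AI expression for `cin_t` at `AI = AI_MIN`.

```python
import math

# Bandwidths quoted in the commit message (bytes per cycle).
BW_L2_TO_L1 = 16.0   # MCHAN
BW_L3 = 1.0          # HyperBus

# ASSUMPTION: peak compute throughput in FLOPs/cycle is not stated in the
# commit message; 16.0 is a placeholder chosen for illustration only.
PEAK_FLOPS_PER_CYCLE = 16.0


def tile_cost(tile_flops, tile_bytes, bandwidth, peak=PEAK_FLOPS_PER_CYCLE):
    """Per-tile roofline with serial DMA + compute (no overlap)."""
    ai = tile_flops / tile_bytes              # tile_AI
    attainable = min(peak, ai * bandwidth)    # tile_attainable
    compute_cycles = tile_flops / peak        # tile_compute
    dma_cycles = tile_bytes / bandwidth       # tile_dma
    return {
        "ai": ai,
        "attainable": attainable,
        "memory_bound": ai * bandwidth < peak,
        "cycles": compute_cycles + dma_cycles,
    }


def min_cin_tile(ai_min, cout, k, ho, wo, dy_bytes, per_cin_bytes):
    """Smallest cin_t with
        2*cout*cin_t*k*ho*wo / (dy_bytes + cin_t*per_cin_bytes) >= ai_min.
    Returns None when no cin_t can reach ai_min.
    ASSUMPTION: k is the number of kernel taps (e.g. 9 for a 3x3 kernel)."""
    flops_per_cin = 2 * cout * k * ho * wo
    denom = flops_per_cin - ai_min * per_cin_bytes
    if denom <= 0:
        return None  # each extra channel costs more bytes than it adds FLOPs
    return max(1, math.ceil(ai_min * dy_bytes / denom))
```

A tile with 64 FLOPs and 16 bytes moved over the L2 link has AI = 4 and is compute-bound under the placeholder peak, while the same byte count over the 1 B/cycle link with 16 FLOPs is memory-bound and dominated by its DMA cycles.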
1 parent f025566 commit 9e51c34

18 files changed

Lines changed: 906 additions & 829 deletions


.gitignore

Lines changed: 10 additions & 1 deletion
@@ -41,7 +41,16 @@ docs/_build
 DeeployTest/TestFiles/
 DeeployTest/Tests/**/*.txt
 DeeployTest/**/BUILD/*
-DeeployTest/TEST_*/*
+DeeployTest/TEST_*/
+DeeployTest/CMakeCache.txt
+DeeployTest/CMakeFiles/
+DeeployTest/Makefile
+DeeployTest/cmake_install.cmake
+DeeployTest/lib/
+DeeployTest/bin/
+DeeployTest/TargetLibraries/
+DeeployTest/DeeployTest/
+DeeployTest/build_*/
 DeeployTest/deeployStates*/*
 DeeployTest/DeeployState*
 DeeployTest/testUtils/graphDebug.py

Deeploy/Targets/Generic/TileConstraints/TransposeTileConstraint.py

Lines changed: 18 additions & 4 deletions
@@ -28,11 +28,25 @@ def addGeometricalConstraint(tilerModel: TilerModel, parseDict: Dict, ctxt: Netw
         for bufferName in [inputBufferName, outputBufferName]:
             tilerModel.addTensorDimToModel(ctxt, bufferName)
 
-        # Map output dims to inputs dims
-        for idx, perm_idx in enumerate(parseDict["perm"]):
+        inputShape = ctxt.lookup(inputBufferName).shape
+        outputShape = ctxt.lookup(outputBufferName).shape
+        perm = parseDict["perm"]
+
+        # When output has extra leading batch dims compared to input
+        # (e.g. weight Transpose [K,N] -> [1,N,K] injected by MatMulLayer),
+        # the perm only covers the spatial (last len(perm)) dims of the output.
+        # Pin the extra leading output dims to their full size (they are batch=1)
+        # and apply the perm constraints with shifted output indices.
+        numExtra = len(outputShape) - len(perm)
+
+        for i in range(numExtra):
             tilerModel.addConstraint(
-                tilerModel.getTensorDimVar(tensorName = outputBufferName, dimIdx = idx) == tilerModel.getTensorDimVar(
-                    tensorName = inputBufferName, dimIdx = perm_idx))
+                tilerModel.getTensorDimVar(tensorName = outputBufferName, dimIdx = i) == outputShape[i])
+
+        for idx, perm_idx in enumerate(perm):
+            tilerModel.addConstraint(
+                tilerModel.getTensorDimVar(tensorName = outputBufferName, dimIdx = numExtra + idx) ==
+                tilerModel.getTensorDimVar(tensorName = inputBufferName, dimIdx = perm_idx))
 
         return tilerModel
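The index shift in this hunk is easier to see on concrete shapes. The sketch below is a plain-Python illustration of the mapping that the new constraints encode; it does not use the Deeploy `TilerModel` API, and the helper names `output_tile_dims` and `old_constraint_pairs` are hypothetical.

```python
def output_tile_dims(input_tile, output_shape, perm):
    """Output tile dims implied by the fixed constraints: the leading
    numExtra output dims are pinned to their full size and the remaining
    dims follow the permutation of the input tile."""
    num_extra = len(output_shape) - len(perm)
    pinned = list(output_shape[:num_extra])
    permuted = [input_tile[p] for p in perm]
    return pinned + permuted


def old_constraint_pairs(perm):
    """(output dim, input dim) pairs tied together by the old code, which
    applied perm indices to the output dims without any shift."""
    return [(idx, p) for idx, p in enumerate(perm)]


# Weight Transpose [K, N] -> [1, N, K] with K = N = 32 and perm = [1, 0].
print(output_tile_dims([32, 8], [1, 32, 32], [1, 0]))
print(old_constraint_pairs([1, 0]))
```

For this example the old pairing ties output dim 0 (the injected batch dim) to input dim 1 and leaves output dim 2 unconstrained, which is the mismatch the commit message describes. With the shift, output dims 1 and 2 are tied to input dims 1 and 0, and output dim 0 is pinned to 1.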

Deeploy/Targets/PULPOpen/Bindings.py

Lines changed: 2 additions & 2 deletions
@@ -252,7 +252,7 @@
 
 PULPFloatConvGradX2DBindings = [
     NodeBinding(ConvChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
-                FloatConvGradTemplate.referenceConvGradX2DIm2ColTiledTemplate, ForkTransformer)
+                FloatConvGradTemplate.referenceConvGradX2DTemplate, ForkTransformer)
 ]
 
 PULPFloatDWConv2DBindings = [
@@ -265,7 +265,7 @@
 
 PULPFloatDWConvGradX2DBindings = [
     NodeBinding(ConvChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
-                FloatConvGradTemplate.referenceDWConvGradX2DTiledTemplate, ForkTransformer)
+                FloatConvGradTemplate.referenceDWConvGradX2DTiledTemplate, ClusterTransformer)
 ]
 
 PULPFloatDWConvGradW2DBindings = [

Deeploy/Targets/PULPOpen/Templates/FloatConvGradTemplate.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -150,7 +150,7 @@ def hoistTransientBuffers(self, ctxt: NetworkContext,
150150
${grad_in_type.typeName} ref_${grad_in} = ${grad_in}; // dX
151151
152152
for (uint32_t n=0; n<${batch}; ++n) {
153-
PULP_ConvGradX2d_fp${grad_out_type.referencedType.typeWidth}_fp${weight_type.referencedType.typeWidth}_fp${grad_in_type.referencedType.typeWidth}_CHW_tiled(
153+
PULP_ConvGradX2d_fp${grad_out_type.referencedType.typeWidth}_fp${weight_type.referencedType.typeWidth}_fp${grad_in_type.referencedType.typeWidth}_CHW_scatter_tiled(
154154
ref_${grad_out},
155155
${dim_im_out_x}, ${dim_im_out_y}, ${ch_im_out},
156156
ref_${weight},
@@ -322,13 +322,13 @@ def hoistTransientBuffers(self, ctxt: NetworkContext,
322322
""")
323323

324324
referenceDWConvGradX2DTiledTemplate = NodeTemplate("""
325-
// 2D FP DW ConvGradX (dX) CHW tiled (Name: ${nodeName}, Op: ${nodeOp})
325+
// 2D FP DW ConvGradX (dX) CHW tiled — trainlib gather kernel (Name: ${nodeName}, Op: ${nodeOp})
326326
${grad_out_type.typeName} ref_${grad_out} = ${grad_out}; // dY
327327
${weight_type.typeName} ref_${weight} = ${weight}; // W
328328
${grad_in_type.typeName} ref_${grad_in}_out = ${grad_in}; // dX
329329
330330
for (uint32_t n=0; n<${batch}; ++n) {
331-
PULP_DWConvGradX2d_fp${grad_out_type.referencedType.typeWidth}_fp${weight_type.referencedType.typeWidth}_fp${grad_in_type.referencedType.typeWidth}_CHW_tiled(
331+
PULP_DWConvGradX2d_fp${grad_out_type.referencedType.typeWidth}_fp${weight_type.referencedType.typeWidth}_fp${grad_in_type.referencedType.typeWidth}_CHW_trainlib_tiled(
332332
ref_${grad_out},
333333
${dim_im_out_x}, ${dim_im_out_y}, ${ch_im_out},
334334
ref_${weight},
