Commit 9e51c34
Replace DW ConvGradX with trainlib gather kernel (#32)
* Replace DW ConvGradX scatter kernel with trainlib gather kernel
Replace the Deeploy-written scatter-based DW ConvGradX kernel with a
tile-aware extension of pulp-trainlib's gather kernel. The gather
pattern accumulates all contributions per dX pixel into a register and
writes once, vs the old scatter pattern which wrote each dX element
C_out times.
Changes:
- ConvGrad.c: Add trainlib_tiled wrapper, remove old scatter kernel
and unused PULP_DWConvTrans2d_fp32_fp32_fp32_HWC
- Bindings.py: DW ConvGradX binding ForkTransformer → ClusterTransformer
(trainlib kernels use pi_cl_team_fork internally)
- FloatConvGradTemplate.py: Call _trainlib_tiled variant
- TilingCodeGeneration.py: Edge-tile minimization fallback
Tested: DSCNN, ResNet8, MobileNetV1 tiled training all PASS (Errors: 0)
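A minimal sketch of the two access patterns described above, reduced to a single-channel 1-D case in plain Python. The function names, the 1-D simplification, and the write counters are illustrative assumptions; this is not the trainlib or Deeploy kernel code.

```python
def dw_gradx_scatter(dy, w, in_len, stride=1, pad=0):
    """Scatter: iterate over dY and read-modify-write dX once per kernel tap."""
    dx = [0.0] * in_len
    writes = 0
    for o, g in enumerate(dy):
        for k, wk in enumerate(w):
            i = o * stride + k - pad  # forward index relation
            if 0 <= i < in_len:
                dx[i] += wk * g
                writes += 1
    return dx, writes


def dw_gradx_gather(dy, w, in_len, stride=1, pad=0):
    """Gather: iterate over dX, accumulate locally, store each element once."""
    dx = [0.0] * in_len
    writes = 0
    for i in range(in_len):
        acc = 0.0  # stands in for the register accumulator
        for k, wk in enumerate(w):
            num = i + pad - k
            if num >= 0 and num % stride == 0:
                o = num // stride
                if o < len(dy):
                    acc += wk * dy[o]
        dx[i] = acc
        writes += 1
    return dx, writes
```

With dy = [1, 2, 3, 4], a 3-tap kernel, stride 1 and pad 1, both functions return the same dX. The gather version stores each of the 4 elements once, while the scatter version performs 10 read-modify-writes.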
* ci: trigger rebuild after pushing trainlib submodule
* Clean up PW ConvGradX: remove unused declarations, unify naming
- Remove unused pulp_conv_pw_fp32_bw_input_grads_cl forward declaration
- Remove unused transp_args, matMul_args, transpose_matrix, mm declarations
(were for trainlib PW path, but PW uses direct AXPY kernel instead)
- Move pw_convgradx_args_t to point of use (internal detail)
- Rename worker: pulp_pw_convgradx_fp32_worker → PULP_PWConvGradX2d_worker
(consistent with PULP_PWConvGradX2d_fp32_fp32_fp32_CHW naming)
* Rename PW ConvGradX kernel: PULP_PWConvGradX2d_worker → pw_kernel_input_grad
* Add ConvGrad.h header and reorganize ConvGrad.c
- New kernel/ConvGrad.h: all public kernel function declarations
organized by type (Regular/DW/PW) and direction (GradW/GradX)
- Include ConvGrad.h via DeeployPULPMath.h
- ConvGrad.c: compact trainlib interface structs, add section comments
* Split ConvGrad.c into Dense/DW/PW files
- ConvGrad.c: regular (dense) conv GradW + GradX kernels (770 lines)
- DWConvGrad.c: depthwise GradW + GradX trainlib tiled (145 lines)
- PWConvGrad.c: pointwise GradW + GradX AXPY (131 lines)
- ConvGrad_internal.h: shared trainlib interface structs + utilities
No functional changes. CMake picks up new files via GLOB_RECURSE.
* Move structs and trainlib declarations into kernel/ConvGrad.h
Remove ConvGrad_internal.h — all shared struct definitions (blob,
Conv2D_args, DepthWise_Conv_args, PointWise_Conv_args), trainlib
dispatch declarations, and utility functions now live in the public
header kernel/ConvGrad.h alongside the kernel function prototypes.
* Replace dense ConvGradX with trainlib gather kernel, remove unused functions
- Add trainlib_tiled wrapper for dense ConvGradX (ClusterTransformer)
- Remove 4 unused ConvGradX functions: _trainlib (non-tiled), _Im2Col
(non-tiled), _CHW_tiled (scatter), _CHW (no-offset)
- Update ConvGrad.h: add trainlib_tiled declaration, Conv2D_args offset fields
- Template: referenceConvGradX2DTemplate now calls _trainlib_tiled
- Keep im2col_tiled kernel (ForkTransformer) unchanged
* DMA hoisting: skip redundant transfers for unchanged tiles
Detect input tensors whose HyperRectangle is identical across all tile
iterations and emit their DMA transfer once before the tile loop instead
of on every iteration.
Key example: CinSlice strategy keeps dY full across all Cin tiles.
Previously dY was re-transferred every tile; now loaded once.
Results (cycles):
DSCNN: 2,654,391 → 2,649,898 (-0.2%)
ResNet8: 513,114,910 → 506,531,380 (-1.3%)
MobileNetV1: 418,380,763 → 411,330,774 (-1.7%)
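The detection step in the entry above can be sketched as follows. This is an illustration only: `Rect` is a hypothetical stand-in for Deeploy's HyperRectangle (an offset tuple plus a dims tuple), and `split_hoistable` is an invented helper name, not a function from the code base.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a HyperRectangle: (offset, dims)
Rect = Tuple[Tuple[int, ...], Tuple[int, ...]]


def split_hoistable(schedule: Dict[str, List[Rect]]) -> Tuple[List[str], List[str]]:
    """Split input tensors into those loaded once and those loaded per tile.

    schedule maps a tensor name to the rectangle it uses in each tile
    iteration. A tensor is hoistable when every iteration uses the same
    rectangle, so one transfer before the tile loop is enough.
    """
    hoisted, per_tile = [], []
    for name, rects in schedule.items():
        if rects and all(r == rects[0] for r in rects):
            hoisted.append(name)
        else:
            per_tile.append(name)
    return hoisted, per_tile
```

In a CinSlice-like schedule, dY would carry the same rectangle in every iteration and land in the hoisted list, while W, whose Cin offset changes per tile, would stay in the per-tile list.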
* Add roofline DMA cost model for ConvGradW strategy selection (analysis only)
Add _estimate_gradw_dma() cost estimator that computes expected DMA cycles
per strategy, considering tensor promotion levels (L2 vs L3 bandwidth).
Currently used for analysis/logging only — strategy selection still uses
the empirically-validated priority order (CinSlice > CoutHWSlice). The
cost model correctly identifies per-tensor bandwidth but underestimates
solver-imposed overhead when switching strategies, causing regression
on ResNet8 (506M → 842M) when used for active selection.
Hardware parameters from gvsoc Siracusa config:
BW_L2_TO_L1 = 16 B/cycle (MCHAN 4 ports × 4B)
BW_L3_TO_L1 = 1 B/cycle (HyperBus)
DMA_SETUP = 200 cycles/transaction
* Post-solve roofline cost model using actual tile sizes from solver
Replace pre-solve cost estimation with post-solve analysis that uses
the actual tile shapes produced by OR-Tools. Logs per-layer cost
breakdown (compute vs DMA cycles, memory-bound vs compute-bound,
tile shapes, hoistability) for paper analysis.
Hardware params from gvsoc Siracusa config (documented with source files).
No strategy selection changes — cost model is analysis/logging only.
All 3 tests PASS with no regression.
* Per-tile roofline cost model with SB serial DMA+compute
Rewrite _post_solve_cost to use per-tile arithmetic intensity:
tile_AI = tile_FLOPs / tile_bytes_moved
tile_attainable = min(peak, AI × bandwidth)
tile_cost = tile_compute + tile_dma (SB serial, no overlap)
Hardware params: L3→L2 = 1 B/cycle (HyperBus), L2→L1 = 16 B/cycle (MCHAN)
Source: AI_AGENT/GVSOC/siracusa_bandwidth_parameters.md
Key findings on ResNet8 (L3 mode):
- layer1 3x3 convs: AI=3.7, memory-bound, only 23% peak utilization
- shortcut 1x1 convs: AI=3.9-6.4, memory-bound, 6-10x overhead
- layer2/3 with CinSlice: AI=22-68, compute-bound, near peak
→ Memory-bound layers are the primary optimization targets for promotion
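A minimal sketch of the per-tile formulas above. The source gives only the three relations; how tile_compute and tile_dma are derived is an assumption here (FLOPs divided by peak, bytes divided by bandwidth plus a per-transfer setup cost). `tile_roofline`, `TileCost`, and the `peak` value used in the test are hypothetical and do not come from `_post_solve_cost`.

```python
from dataclasses import dataclass


@dataclass
class TileCost:
    ai: float              # FLOPs per byte moved
    attainable: float      # FLOPs/cycle allowed by the roofline
    compute_cycles: float
    dma_cycles: float
    total_cycles: float    # serial sum: no DMA/compute overlap
    memory_bound: bool


def tile_roofline(flops, bytes_moved, peak, bandwidth, dma_setup=200.0, transfers=1):
    """Per-tile roofline with serial DMA + compute.

    peak      : FLOPs/cycle (assumed parameter, not stated in the source)
    bandwidth : bytes/cycle of the link feeding the tile (1 or 16 in the source)
    dma_setup : cycles per DMA transaction (200 in the source)
    """
    ai = flops / bytes_moved
    attainable = min(peak, ai * bandwidth)
    compute = flops / peak                              # assumption
    dma = bytes_moved / bandwidth + transfers * dma_setup  # assumption
    return TileCost(ai, attainable, compute, dma, compute + dma, attainable < peak)
```

Under these assumptions a tile with AI = 3.7 on a 1 B/cycle link is memory-bound for any peak above 3.7 FLOPs/cycle, and the same tile on a 16 B/cycle link is not memory-bound unless the peak exceeds 59.2 FLOPs/cycle.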
* Cost-model-driven min Cin constraint for ConvGradX tiling
Add minimum Cin tile size constraint in ConvGradX addPolicyConstraint,
derived from roofline AI target (AI_MIN=4.0). For layers with large W
(e.g., ResNet8 layer3.conv2: W=147KB > L1), the solver was forced to
tile Cin to 1 → 64 tiny tiles with AI=0.45 (2.5% peak utilization).
The new constraint computes cin_min from:
AI = 2×Cout×cin_t×K×Ho×Wo / (dY_bytes + cin_t × per_cin_bytes)
ensuring each tile has enough compute to amortize DMA cost.
ResNet8 layer3.conv2: cin=1→5, tiles=64→13, AI=0.45→2.23, overhead=71x→14.6x
Results (all PASS, Errors: 0/4):
DSCNN: 2,649,898 (unchanged)
ResNet8: 506M → 418M (-17.3%, -18.4% vs original baseline)
MobileNetV1: 411M (unchanged)
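The cin_min derivation in the entry above can be sketched by solving the AI formula for cin_t. Rearranging AI >= AI_MIN gives cin_t × (2×Cout×K×Ho×Wo − AI_MIN × per_cin_bytes) >= AI_MIN × dY_bytes. The helper names and the fallback for an unreachable target are assumptions. The reported result (AI=2.23 at cin=5) is below the 4.0 target, so the actual constraint presumably applies further limits that this sketch does not model.

```python
import math


def tile_ai(cin_t, cout, k, ho, wo, dy_bytes, per_cin_bytes):
    """Arithmetic intensity of one Cin tile, as in the formula above."""
    return 2 * cout * cin_t * k * ho * wo / (dy_bytes + cin_t * per_cin_bytes)


def cin_min_for_ai(ai_min, cin, cout, k, ho, wo, dy_bytes, per_cin_bytes):
    """Smallest Cin tile whose AI reaches ai_min, clamped to [1, cin]."""
    flops_per_cin = 2 * cout * k * ho * wo
    slack = flops_per_cin - ai_min * per_cin_bytes
    if slack <= 0:
        # AI saturates below ai_min however large the tile gets;
        # falling back to the full Cin is an assumption of this sketch.
        return cin
    need = math.ceil(ai_min * dy_bytes / slack)
    return max(1, min(cin, need))
```

For example, with cout=4, k=9, ho=wo=2, dy_bytes=64 and per_cin_bytes=16, a tile of cin_t=1 has AI=3.6 and cin_t=2 has AI=6.0, so the helper returns 2 for AI_MIN=4.0.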
* Revert spatial min constraint (caused regression on layer1)
Spatial minimum tile constraint improved layer2/3 (12→8 tiles) but
worsened layer1 (128→168 tiles) and layer2.conv2 (64→128 tiles) by
restricting solver freedom. The Cin-only constraint is sufficient for
the large-W layers where tile explosion is worst.
Kept: min Cin constraint (fixes layer3 from 64→13 tiles, -18% cycles)
Removed: spatial min constraint (net regression 418M→525M)
* Replace im2col ConvGradX with scatter kernel (no transient buffer)
Switch dense ConvGradX from im2col+GEMM (3 barriers, 2 transient buffers)
to scatter-add kernel (no barriers, no buffers). The scatter kernel uses
the forward index relation ih=oh*s+ky-pad directly, no kernel flip.
Benefits:
- No ctxtBuffer/btBuffer → more L1 for data tiles → fewer tiles
- No im2col/transpose/GEMM barriers → less sync overhead
- Parallel over Cin (each core owns exclusive dX slice, no race)
Results (all PASS, Errors: 0/4):
DSCNN: 2,649,899 (unchanged)
ResNet8: 418M → 384M (-8.1%, -25.1% vs original baseline)
MobileNetV1: 411M (unchanged)
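A plain-Python sketch of the scatter-add pattern from the entry above, using the forward index relation ih=oh*s+ky-pad with no kernel flip. The nested-list layout (dy[co][oh][ow], w[co][ci][ky][kx]) and the function name are assumptions for illustration; the real kernel is C code running on the cluster.

```python
def conv_gradx_scatter(dy, w, in_h, in_w, stride=1, pad=0):
    """Dense ConvGradX by scatter-add: returns dx[ci][ih][iw]."""
    cout, cin = len(w), len(w[0])
    kh, kw = len(w[0][0]), len(w[0][0][0])
    dx = [[[0.0] * in_w for _ in range(in_h)] for _ in range(cin)]
    # The outer loop over ci is the parallel axis: each core would take a
    # disjoint range of ci, so no two cores ever write the same dx slice.
    for ci in range(cin):
        for co in range(cout):
            for oh, row in enumerate(dy[co]):
                for ow, g in enumerate(row):
                    for ky in range(kh):
                        ih = oh * stride + ky - pad
                        if not 0 <= ih < in_h:
                            continue
                        for kx in range(kw):
                            iw = ow * stride + kx - pad
                            if 0 <= iw < in_w:
                                dx[ci][ih][iw] += w[co][ci][ky][kx] * g
    return dx
```

Because every write targets dx[ci] for the ci owned by the current iteration, splitting the ci range across cores needs no barrier and no intermediate buffer in this sketch.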
* Cap W_tile to L1/2 in ConvGradX tileconstraint
When W is large (e.g. layer3.conv2 W=147KB > L1=128KB), the solver
allocates most of L1 to W_tile, leaving tiny spatial tiles (5x1) and
producing 32 tiles. Cap W_tile to L1/2 so spatial tiles get adequate
space, reducing tile count and DMA overhead.
ResNet8 backward time now comparable to forward (~20% faster).
* Fix W tile cap feasibility + gitignore build artifacts
- ConvGradConstraint: only apply W tile cap when max_cin >= cin_min
(prevents infeasible constraints on MobileNetV1)
- gitignore: add DeeployTest build artifacts
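The W-tile cap and its feasibility guard from the two entries above can be sketched as a small helper. The name `capped_cin_tile`, the assumption that the cap is expressed as a maximum Cin tile, and the fp32 element size are all illustrative; the real logic lives in the solver constraints of ConvGradConstraint.

```python
def capped_cin_tile(cout, k_taps, l1_bytes, cin_min=1, elem_bytes=4):
    """Largest Cin tile whose weight slice fits in half of L1.

    Returns None when that cap would fall below cin_min, meaning the cap
    should not be applied because it would make the problem infeasible.
    """
    per_cin_bytes = cout * k_taps * elem_bytes  # weight bytes per input channel
    max_cin = (l1_bytes // 2) // per_cin_bytes
    if max_cin < cin_min:
        return None
    return max_cin
```

As a hypothetical example, a 64-output-channel 3x3 fp32 layer uses 2304 weight bytes per input channel, so with L1 = 128 KiB the cap allows at most 28 input channels per tile.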
* Revert DMA hoisting (causes DMA counter underflow on CCT)
* InPlaceAccumulatorV2: pin dim1 full to reduce tile count
* Revert InPlaceAccumulator pin dim1 (triggers DMA counter underflow on CCT)
* Fix TransposeTileConstraint for weight Transpose with broadcast batch dims
When MatMulLayer.computeShapes injects a leading batch dim into the
Transpose output (e.g. weight [K,N] → [1,N,K]), the perm only covers
the spatial dims. The old addGeometricalConstraint applied perm indices
directly to output dims without accounting for the extra leading dims,
causing the tiler to severely underestimate the input tile size
(e.g. 128 bytes instead of 4096 for a [32,32] weight).
This led to MiniMalloc allocating overlapping L1 buffers for the
Transpose's data_in and data_out, corrupting the transposed weight
matrix. The corruption manifested as ~0.001–0.04 loss drift in
tiled CCT training (attention Q/K/V weight transposes), while
non-tiled mode matched ORT reference perfectly.
Fix: shift perm constraint indices by numExtra and pin the extra
leading output dims to their full size.
Tested: CCT (img=8/16/32, dim=32/64/128), ResNet8 — all 0/4 errors.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
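A sketch of the index shift described in the entry above, working on concrete tile shapes instead of solver variables. `input_tile_from_output` and `naive_input_tile` are hypothetical helper names; Deeploy's addGeometricalConstraint builds constraints and is not reproduced here. The naive variant is a reconstruction of the described bug, not the original code.

```python
def input_tile_from_output(out_tile, out_shape, perm):
    """Map an output tile shape back to the input tile shape.

    out_shape may carry extra leading (broadcast) dims that perm does not
    cover. Those dims are pinned to full size and perm indices are shifted
    by num_extra before being applied.
    """
    num_extra = len(out_shape) - len(perm)
    if num_extra < 0:
        raise ValueError("perm is longer than the output rank")
    for j in range(num_extra):
        if out_tile[j] != out_shape[j]:
            raise ValueError("extra leading dims must stay untiled")
    in_tile = [0] * len(perm)
    for j, p in enumerate(perm):
        in_tile[p] = out_tile[num_extra + j]
    return in_tile


def naive_input_tile(out_tile, perm):
    """Buggy mapping: applies perm to output dims without the shift."""
    in_tile = [0] * len(perm)
    for j, p in enumerate(perm):
        in_tile[p] = out_tile[j]
    return in_tile
```

For a [32, 32] fp32 weight transposed into a [1, 32, 32] output with perm [1, 0], the shifted mapping yields an input tile of [32, 32] (4096 bytes), while the naive mapping yields [32, 1] (128 bytes), which matches the underestimate reported above.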
* Fix lint: yapf, clang-format, autoflake on branch files
- yapf: Bindings.py, ConvGradConstraint.py
- clang-format: ConvGrad.h, ConvGrad.c, DWConvGrad.c
- autoflake: remove unused import in ConvGradConstraint.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Update CCT-2 test with new size
* Update CCT-2 test with new tolerance
* Fix yapf formatting in test_siracusa_tiled_config.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* perf: InPlaceAccumulatorV2 pin dim1 full to reduce tile count
Re-apply the pin-dim1-full policy constraint for InPlaceAccumulatorV2.
This was previously reverted (d3e508a) because it triggered a DMA
counter underflow on CCT. The root cause was the TransposeTileConstraint
L1 buffer overlap bug (fixed in ade87b5) which corrupted weight data
and caused downstream DMA failures.
With the Transpose fix in place, pinning dim1 is now safe. For a
[128, 128] gradient tensor this reduces tiles from 256 → ~2, cutting
DMA overhead from 91% to negligible.
CCT perf: 470M → 443M cycles/step (-5.7%, 27M cycles saved).
Correctness: 0/4 errors, max diff 0.000063.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent f025566 commit 9e51c34
18 files changed
Lines changed: 906 additions & 829 deletions
File tree
- DeeployTest
- Tests/Models/Training/CCT
- cct_optimizer
- cct_train
- Deeploy
- Targets
- Generic/TileConstraints
- PULPOpen
- Templates
- TileConstraints
- TilingExtension/CodeTransformationPasses
- TargetLibraries/PULPOpen
- inc
- kernel
- src
- third_party
Lines changed: 18 additions & 4 deletions
Lines changed: 3 additions & 3 deletions