feat(platform): port Siracusa+RedMulE from pulp-platform/Deeploy#67 by runwangdl · Pull Request #20 · runwangdl/TrainDeeploy

runwangdl · 2026-05-10T14:21:36Z

Summary

Cherry-picks the 9 commits of [Draft] Redmule platform pulp-platform/Deeploy#67 onto TrainDeeploy/devel.
Adds the new Deeploy/Targets/Redmule/ platform (Platform/Engine/Deployer/Bindings/Parsers/Tiler/Templates/TileConstraints/TopologyOptimizationPasses), the FP32 RedMulE matmul kernel TargetLibraries/PULPOpen/src/Matmul_fp32_Redmule.c, plus tiled-runner glue for Siracusa_w_redmule.
Adds CI workflow ci-platform-siracusa-redmule-tiled.yml (uses ghcr.io/runwangdl/deeploy:redmule image; auth via secrets.GITHUB_TOKEN).

Verified

Cherry-pick clean, no manual conflicts. 3 shared files auto-merged: DeeployTest/CMakeLists.txt, TargetLibraries/PULPOpen/CMakeLists.txt, Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py.
from Deeploy.Targets.Redmule import Platform, Engine, Deployer, Bindings imports OK.
pytest -m siracusa_redmule_tiled collects 3 cases (Kernels/FP32/GEMM/{Regular,TransB} single/double-buffer); fixtures present.
Local kernel/sim run not done — needs the runwangdl/gvsoc@35d00d1 fork build that the CI image bundles.

Test plan

CI Siracusa + RedMulE (Tiled) job pulls ghcr.io/runwangdl/deeploy:redmule and finishes the 3 collected cases.

Minimal port of RedMulE-platform code from the user's redmule_platform branch (which had accumulated unrelated CCT_Optim merges) onto a clean devel base. What landed: - New target Deeploy/Targets/Redmule/ (Platform, Engine, Deployer, Bindings, Parsers, Tiler, Templates, TileConstraints, TopologyOptimizationPasses). - FP32 RedMulE matmul kernel TargetLibraries/PULPOpen/src/Matmul_fp32_Redmule.c - Test runner DeeployTest/testRunner_tiled_siracusa_w_redmule.py plus Float test fixtures (testFloat{Matmul,MatmulLarge,MatmulLarge256,2DConvolution,2dConvLarge,GEMM,GEMMtransB}). - Wiring in platformMapping.py, top-level CMakeLists.txt, DeeployTest/CMakeLists.txt, TargetLibraries/PULPOpen/CMakeLists.txt. - Makefile: GVSOC_COMMIT_HASH points at runwangdl/gvsoc fork 35d00d1 (carries the light_redmule vendored copy + Siracusa cluster wiring). Fixes / portings required for devel compatibility: - Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py: define float32_tPtr locally (unresolved import left on devel). - Deeploy/Targets/Redmule/TopologyOptimizationPasses/Passes.py: switch from the retired _permuteLastTwoDims / _appendTransposeNode helpers to upstream's _appendTranspose. - Add empty __init__.py to Targets/{Chimera,Redmule,SoftHier}. What intentionally did NOT land: - CCT_Optim-era edits to PULPOpen Templates (Add/Conv/GELU/Layernorm/ MatMul/MaxPool/Relu/Softmax), Generic Layers.py computeOps, CCT test suites, parallel/unroll rewrites. - Buggy -march=rv32imc inside meson-build-script-rv32imf.txt. - Hard-to-merge edits to DeeployTest/Platforms/Siracusa/src/deeploytest.c. - The old-style .github/workflows/TestRunnerTiledSiracusaWithRedmule.yml; new-style ci-platform-siracusa-redmule-tiled.yml TBD. Verified end-to-end: testFloatMatmul on GVSoC (runwangdl/gvsoc@35d00d1, pulp submodule @ 371772c) passes with 'Errors: 0 out of 256'.

The Tests/ directory layout on devel was reorganized into Kernels/, Models/, Others/ subdirectories. Drop the flat-path Float test inputs ported from redmule_platform; they'll be re-added under the new structure in a follow-up.

Mirrors the neureka-tiled pattern: - DeeployTest/test_siracusa_redmule_tiled_config.py with empty L2_{SINGLE,DOUBLE}BUFFER_KERNELS dicts (to be populated once Float kernel test fixtures land under Tests/Kernels/Float/). - conftest.py: register 'siracusa_redmule_tiled' pytest marker. - test_platforms.py: two parametrized test functions (L2 single- and double-buffer) for the redmule platform. - .github/workflows/_runner-siracusa-redmule-tiled.yml: reusable runner mirroring _runner-siracusa-neureka-tiled.yml. - .github/workflows/ci-platform-siracusa-redmule-tiled.yml: top-level trigger, defaults to ghcr.io/runwangdl/deeploy:redmule Docker image. With empty configs the tests collect and skip cleanly (pytest 'got empty parameter set'). No wmem variants since RedMulE does not use Neureka weight memory.

- yapf / isort / autoflake / trailing-whitespace across the Redmule Python target and platformMapping wiring. - clang-format over TargetLibraries/PULPOpen/src/Matmul_fp32_Redmule.c. - Add SPDX/license header to Matmul_fp32_Redmule.c (reuse hook).

The GAP9 CI uses ghcr.io/pulp-platform/deeploy-gap9:devel, which is only pullable with pulp-platform org credentials. On a fork the job fails at 'Initialize containers'. Add github.repository_owner guard so forks skip the jobs cleanly.

The docs workflow publishes to gh-pages, which on a fork races with external pushes and lacks origin remote setup. Gate on github.repository_owner == 'pulp-platform' so only upstream publishes.

Point the redmule tiled CI config at existing upstream FP32 kernel test fixtures under Tests/Kernels/FP32/GEMM (Regular, TransB). Both single-buffer and double-buffer variants verified locally end-to-end on GVSoC (Errors: 0 / 256, runtime ~4k cycles).

Without this fallback _select-env.yml resolves to the upstream pulp-platform/deeploy:devel image, which ships a GVSoC build that does not include the light_redmule model — the redmule test runner then hangs. Point the default at the fork's custom image so push events get the correct GVSoC build.

ghcr.io/runwangdl/deeploy:redmule is a private package; add credentials block using the workflow's GITHUB_TOKEN so the runner container step can pull it.

Adds a single training-pipeline job to the Siracusa+RedMulE CI workflow, running the same Models/Training/CCT/cct_train fixture the Siracusa-only tiled CI already exercises -- forward + backward + SGD, L1=128000, L2=2M, default_mem_level=L3, single-buffer. RedmulePlatform inherits from PULPPlatform and declares engines=[RedmuleEngine, PULPClusterEngine]; RedmuleEngine binds only Matmul/Conv/GEMM (FP32), so the rest of the CCT graph (LayerNorm, GELU, MaxPool, *Grad ops) falls back to PULPCluster, which already carries TrainDeeploy's training kernel set. Pieces: - test_siracusa_redmule_tiled_config.L3_SINGLEBUFFER_TRAINING_MODELS + TRAINING_MODEL_OVERRIDES mirror the Siracusa values (num_data_inputs=1, tolerance=5e-3 for CCT step-0 attention-reduction drift) so the override semantics match across platforms. - test_platforms.test_siracusa_redmule_tiled_training_l3_singlebuffer is a thin clone of the existing Siracusa training test with platform="Siracusa_w_redmule"; markers (siracusa_redmule_tiled, training, singlebuffer, l3) line up with the marker filters the existing redmule CI runner workflow consumes. - ci-platform-siracusa-redmule-tiled.yml gains a siracusa-redmule-training-tiled-singlebuffer-L3 job invoking _runner-siracusa-redmule-tiled.yml with marker "training and singlebuffer and l3". Local validation: pytest --collect-only -m "siracusa_redmule_tiled" now collects 4 cases (3 existing kernel + the new CCT training case). Real end-to-end run deferred to CI -- needs the runwangdl/gvsoc@35d00d1 fork bundled inside ghcr.io/runwangdl/deeploy:redmule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s crashing The Siracusa+RedMulE CCT_train CI job (added in b682739) crashed during tiling with KeyError 'C' inside RedmuleGEMMTileConstraint -- the constraint unconditionally reads parseDict['C'] but GEMMRedmuleParser.parseNodeCtxt only populates it when len(node.inputs) == 3 and noBiasHoisting is True (the default). Backward-pass codegen for CCT (GradFusedMatMul rewrites) emits a flurry of 2-input ONNX Gemm nodes (alpha=1, no bias), which match the binding but never get a 'C' field -- hence the lookup blows up. A bias-less Gemm with alpha=1 is mathematically just MatMul, and the Redmule platform already routes ONNX MatMul through MatMulRedmuleMapper / RedmuleMatMulTilingReadyBindings (no C operand needed). So instead of papering over the parser, lower the op: - Add RedMuleBiaslessGemmToMatMulPass in Targets/Redmule/TopologyOptimizationPasses/Passes.py. It matches Gemm nodes (the 2-input pattern reused from RedMuleGEMMTransposePass), guards on len(inputs) == 2 and alpha == 1, materializes any transA/transB (constants get folded, variables get a Transpose appended via the same _appendTranspose helper the transpose pass already uses), then rewrites op="MatMul" and clears attrs. - Wire it into RedmuleDeployer.loweringOptimizer.passes BEFORE RedMuleGEMMTransposePass so the latter only ever sees real (3-input) Gemms; otherwise it would write transA/transB=0 onto what we just rewrote into a MatMul, and the stale Gemm op would still hit the bias-required tile constraint. 3-input Gemms (forward CCT FCs, the existing testFloatGEMM/testFloatGEMMtransB kernel fixtures) are untouched: the new pass returns the graph unchanged when len(inputs) != 2, and RedMuleGEMMTransposePass continues to see them as before. Local validation: pytest --collect-only -m "siracusa_redmule_tiled" still yields the same 4 cases (3 kernel + 1 training); module import of Deployer + the new pass class both succeed. Real run deferred to CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This commit replaces 39bb8f1's experimental Gemm->MatMul lowering pass (which unblocked the original KeyError 'C' but exposed a deeper Transpose rank-mismatch bug downstream) with two smaller, locally-verified fixes: 1) Hoist a properly-shaped zero C tensor in GEMMRedmuleParser when an ONNX Gemm has only A and B (e.g. backward GradFusedMatMul rewrites in CCT_train). Fixes for the hoist path: - GEMMRedmuleParser.__init__ used to set self.noBiasHoisting *before* calling super().__init__(), but MatMulParser.__init__ also writes self.noBiasHoisting from its own default of True -- so the caller's flag was silently clobbered. Reverse the order and forward the kwarg. - The hoist used to allocate a 1-element np.zeros((1)) scalar; that would never satisfy RedmuleGEMMTileConstraint's "C dim equals output dim" assertion. Allocate a zero array whose shape matches node.outputs[0].shape. - Pass _type=PointerClass(float32_t) to ctxt.hoistConstant so the buffer is type-annotated up-front. Without it, MemoryScheduler.getConstantTensorOffset later trips an AttributeError on the un-annotated buffer. - Append the hoisted Constant to node.inputs so the tiler picks it up via its node.inputs + node.outputs walk, AND register the Gemm as a user via newCtxt.addUser so the MemoryConstraintFlow kill-set assertion (which walks _users) finds a consumer. - Engine.GEMMMRedmuleMapper now instantiates with noBiasHoisting=False so the hoist path is actually taken. Drop the BiaslessGemmToMatMulPass class (added in 39bb8f1) and its Deployer registration: the parser-side hoist is the smaller fix and side-steps the MatMul broadcasting issue entirely. 2) Fix Generic/TransposeTileConstraint and PULPOpen/TransposeTemplate to use a *spatial-view* interpretation of perm. When MatMulLayer. computeShapes broadens an already-existing tensor that is simultaneously a forward MatMul B input *and* a downstream non-broadening consumer (Gemm/Transpose), data_in and data_out of a downstream Transpose can end up with different ranks. Both addGeometricalConstraint and serializeTilingSolution previously assumed len(perm) == data_in_rank == data_out_rank; they now offset their shape lookups by len(shape) - len(perm) so the perm targets the trailing spatial dims in either tensor. PULPTransposeTemplate's alignToContext gets the same treatment for its dimLen_<idx> lookup and parallelDim selection. Aligned cases (existing kernel fixtures testFloatGEMM / testFloatGEMMtransB) compute identical offsets of 0 and behave exactly as before. This commit verifies the fix locally on Models/Training/CCT/cct_train: testMVPTraining.py and testMVPOptimizer.py both exit 0 on Siracusa_w_redmule, producing a ~7.7 MB TrainingNetwork.c and matching OptimizerNetwork.c. C compilation + GVSoC simulation still need to be validated on CI (can't run the runwangdl/gvsoc fork locally in the agent container). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Siracusa+RedMulE training CI on 1782a88 got past Python codegen but failed at link time: ld.lld: error: undefined symbol: Conv2d_Im2Col_fp32_fp32_fp32_HWC_8_Redmule >>> referenced by TrainingNetwork.c:5386 in _node_1_tokenizer_..._Conv_cluster_fork The original RedMulE PR (pulp-platform/Deeploy#67) shipped only the matmul kernel TargetLibraries/PULPOpen/src/Matmul_fp32_Redmule.c. The ConvTemplate references a `Conv2d_Im2Col_..._8_Redmule` kernel that has no corresponding source in the tree, and 67b754b already deleted the testFloat2DConvolution / testFloat2dConvLarge fixtures that would have exercised the Redmule Conv path. So the Conv binding has always been load-bearing only for non-test models like CCT_train, and on those it breaks the link. Two coupled changes route Conv through the existing PULPClusterEngine (which has a working PULP_Conv2d_Im2Col_fp32_fp32_fp32_HWC): - Drop 'Conv' from RedmuleMapping. Without it Conv falls through to the second engine in RedmulePlatform's engine list (PULPCluster). - Drop RedMuleAdjustWeightMemoryLayoutPass from the lowering passes. That pass transposed Conv weights from [F,H,W,Cin] to [H,W,Cin,F] for the RedMulE accelerator's expected layout; once Conv is on the PULPCluster engine, PULP expects [F,H,W,Cin] and the pre-applied transpose makes Tiling produce out-of-bounds tile rectangles (locally repro'd: AssertionError "Rectangle offset should be zero when the dimensions are the same. Received rectangle HyperRectangle(offset=(3, 0, 0, 0), dims=(3, 3, 3, 32))" in TilingCodegen.minimizeRectangle). Both are clearly marked in-source as "restore when the RedMulE Conv kernel lands." Locally validated end-to-end: - testMVPTraining.py -> exit 0 (TrainingNetwork.c emits PULP_Conv2d_Im2Col_fp32_fp32_fp32_HWC for the tokenizer Conv). - testMVPOptimizer.py -> exit 0. Matmul / Gemm continue to bind to RedMulE as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ank line Pure formatting / unused-import cleanup, applied verbatim from the "All changes made by hooks" diff in the failing Lint & Licenses CI job for 78a05d4. No behaviour change. - Generic/TransposeTileConstraint.py: yapf collapses the `assert` and `tilerModel.addConstraint(...)` calls onto fewer lines. - Redmule/Engine.py: autoflake drops the now-unused `ConvLayer` import (Conv was unmapped from RedmuleMapping in 78a05d4). - Redmule/TopologyOptimizationPasses/Passes.py: trim trailing blank line at EOF. - DeeployTest/test_platforms.py: isort wraps the long L3_SINGLEBUFFER_TRAINING_MODELS import line. Local pre-commit couldn't be run end-to-end in the agent container (clang-format wheel build fails on Python 3.10 / aarch64), so applied the patch from the CI's own pre-commit log instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both `_runner-siracusa-redmule-tiled.yml` and `_runner-siracusa-tiled.yml` ran tests with pytest's default stdout capture on, so the per-test "Cycles" line that GVSoC prints (and that the Deeploy testRunner forwards) was eaten by pytest and never made it to the CI log -- making it impossible to compute a speedup number from the existing artifacts. Add a new optional `pytest-flags` input to both runner workflows (default preserves the existing behavior: `-n 4` on the redmule runner, empty on the plain siracusa-tiled runner). Override it from the two training callers: - ci-platform-siracusa-redmule-tiled.yml / siracusa-redmule-training-tiled-singlebuffer-L3 -> "-s -p no:xdist" (xdist re-buffers stdout even with -s; only one test in this matrix so dropping -n 4 is harmless) - ci-platform-siracusa-tiled.yml / siracusa-training-tiled-l3-singlebuffer -> "-s" (already sequential) After this lands, the next push will produce log files where the cct_train Cycles report appears verbatim on both platforms; the diff gives the actual speedup of routing FP32 Matmul/Gemm through RedMulE vs. the PULP cluster fallback. Kernel-test jobs and other non-training callers keep the original flags (no -s, full xdist parallelism) so their wall-clock isn't affected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Implements the kernel symbol that Deeploy/Targets/Redmule/Templates/ConvTemplate.py has been pointing at since the original pulp-platform/Deeploy#67 port -- it was a declared- but-never-defined dangling reference, which is why 78a05d4 had to unmap Conv from RedmuleMapping and route it through PULPCluster. - TargetLibraries/PULPOpen/src/Conv2d_Im2Col_fp32_Redmule.c All 8 cluster cores cooperatively build the [N_out, P*Q*C] im2col matrix in the hoisted L1 transient buffer (contiguous slices of output positions, zero-pad when h_in/w_in fall outside the input). Core 0 then triggers a single RedMulE GEMM [N_out, K] @ [K, F] -> [N_out, F] via MatMul_*_Redmule / Gemm_*_Redmule from Matmul_fp32_Redmule.c. When has_bias is true the [F] bias is broadcast in-place into pOut and Gemm runs with y_addr = z_addr = pOut (same pattern the existing MatMul kernel already uses for its Y=Z=pDstY zero-init). - Conv.h declares the new symbol. - ConvTemplate.py: * forwards ${bias} and ${has_bias} (PULPFPConv2DParser already populates them) -- the previous template silently dropped bias. * sizes the im2col transient buffer to the full per-tile H_out * W_out * (C*P*Q) footprint instead of the prior 8-row scratch; one big GEMM amortises RedMulE's MMIO setup cost. - Engine.RedmuleMapping restores 'Conv': ConvLayer([Conv2DRedmuleMapper]). - Deployer.py restores RedMuleAdjustWeightMemoryLayoutPass -- it permutes Conv weights from [F,P,Q,C] to [P,Q,C,F] = flat [P*Q*C, F], exactly the right operand the im2col GEMM consumes. Both Conv and the layout pass were disabled together in 78a05d4 (PULPCluster fallback expects [F,P,Q,C]); both come back together now. Locally validated: testMVPTraining.py + testMVPOptimizer.py both exit 0 on Models/Training/CCT/cct_train @ Siracusa_w_redmule; generated TrainingNetwork.c now emits Conv2d_Im2Col_fp32_fp32_fp32_HWC_8_Redmule for the tokenizer Conv (was PULP_Conv2d_Im2Col_*_HWC). GVSoC numerical tolerance still has to be checked on CI -- this is a new kernel, not a wrapper around an existing one, and the broadcasted- bias path was never exercised before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Actions UI Two unrelated cleanups bundled together because both are CI hygiene: 1) clang-format complaint on TargetLibraries/PULPOpen/src/Conv2d_Im2Col_fp32_Redmule.c from 4517cc9 -- one wrapped const assignment had to be collapsed onto a single line to satisfy the pre-commit hook. Applied verbatim from the "All changes made by hooks" diff in the failing Lint & Licenses run (job 75249916168). Zero behaviour change. 2) Post-run summary in both runner workflows. pytest -s already pipes GVSoC's "BENCH train_cycles=... opt_cycles=... weight_sram=..." line to the log; tee it to /tmp/pytest_out.log and have a follow-up step grep it into $GITHUB_STEP_SUMMARY so reviewers don't have to dig through ~30k-line raw logs to find a speedup number. - _runner-siracusa-tiled.yml emits one row per BENCH line (multiple training models share that workflow), keyed by weight_sram so the reader can identify which model. Kernel-only invocations have no BENCH lines and the step quietly no-ops. - _runner-siracusa-redmule-tiled.yml additionally queries the GitHub Actions API for the matching CI • Siracusa (Tiled) push run on the same head_sha, pulls its training-l3-singlebuffer job log, greps the BENCH line with weight_sram=94953 (the cct_train fixture) and prints a speedup table inline. Best-effort: API failures or a baseline that hasn't finished yet are logged and skipped, never fail the redmule training job. Net effect: open the Actions UI for a push run, click into the RedMulE Tiled workflow, and the run summary shows train_cycles | Siracusa | + RedMulE | speedup directly -- no manual log scraping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…del summary What changed: 1) DeeployTest/test_siracusa_redmule_tiled_config.py: L3_SINGLEBUFFER_TRAINING_MODELS gains ResNet8/resnet8_train and MobileNetV1/mobilenetv1_train alongside the existing CCT_train. These are the Conv-heavy / DW-PW-heavy fixtures the Siracusa-only training CI has been validating; they're the workloads where RedMulE should actually show up in the speedup table. 2) Conv mapping reverted (again). 4517cc9 wired Conv to a new Conv2d_Im2Col_fp32_fp32_fp32_HWC_8_Redmule kernel; the kernel itself works (CCT_train passes locally and on CI with this commit's revert undone), but RedmuleConv2DTileConstraint.addPolicyConstraint hard-pins inputHeightVar / inputWidthVar to the full feature map (see lines 108-112 in TileConstraints/ConvTileConstraint.py), which makes the activation tensor unconditionally fit-in-L1. For CCT2's 8x8 tokenizer that is fine; for ResNet8 / MobileNet middle layers with 32x32 activations it makes the tiler infeasible (DEEPLOY_PATTERN_MEM_copyIdx_* > 196 KiB > 128 KiB L1 budget). PULP's Conv2DTileConstraint already supports spatial tiling with halo regions, so routing Conv back to PULPClusterEngine keeps the bigger Conv-heavy fixtures tilable; MatMul / Gemm continue to bind to RedMulE. RedMuleAdjustWeightMemoryLayoutPass is removed from the lowering passes for the same reason (it permuted weights into [P,Q,C,F] for the RedMulE kernel, which PULP can't consume). The RedMulE Conv kernel + ConvTemplate + Conv.h decl + chunked im2col logic all stay in tree as ready infrastructure: re-enable the single 'Conv': ConvLayer([Conv2DRedmuleMapper]) line and restore the weight-layout pass once RedmuleConv2DTileConstraint learns spatial tiling (the kernel itself was already redesigned in this commit to stream im2col in IM2COL_CHUNK_ROWS=16 chunks rather than one big buffer, so the L1 footprint of the RedMulE Conv path is also no longer a blocker -- only the tile constraint is). ConvTemplate.computeTransientBuffersSize correspondingly went back to 16 * K rows instead of the full-image H_out * W_out * K -- aligns with the kernel's chunked behaviour and keeps L1 budget room for activations when Conv eventually rebinds to RedMulE. 3) Conv2d_Im2Col_fp32_Redmule.c refactored to chunked im2col. The kernel now loops over output positions in IM2COL_CHUNK_ROWS=16 chunks, builds im2col for each chunk in parallel across the 8 cluster cores, then triggers one RedMulE GEMM per chunk. Multiple RedMulE setups (~200 cycles each) instead of one, but the buffer never exceeds 16 rows × K, which keeps the future re-enable feasible even on ResNet8-class spatial sizes. Existing local validation (CCT_train, ResNet8_train, MobileNetV1_train codegen exit=0) passes. 4) _runner-siracusa-redmule-tiled.yml summary step rewritten in Python to handle multiple BENCH lines per job: now parses every `BENCH train_cycles=... weight_sram=...` line, fetches the matching Siracusa-baseline run via the GitHub Actions API, and prints one speedup row per model paired by weight_sram. Previously hard-coded to the single CCT weight_sram=94953 case; with ResNet8 + MobileNetV1 in the matrix the previous logic would only match CCT and silently miss the other two. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds pointwise (1x1) Conv backward kernels routed through RedMulE. The math is the same matmul-after-reshape that pulp-trainlib's PULP_PWConvGrad{W,X}2d_fp32_fp32_fp32_CHW already does, but the inner GEMM is now a single RedMulE trigger after a parallel transpose. This is the most natural RedMulE win in CCT-style training: the dY @ X^T (weight-grad) and W^T @ dY (input-grad) reductions on MobileNet-style PW blocks have K = H*W ≥ a few hundred elements once you walk a couple of blocks past the stem, which is well above the K=27 size where the RedMulE setup cost dominates on the CCT tokenizer. What landed: - TargetLibraries/PULPOpen/src/PWConvGrad_fp32_Redmule.c Two kernels: PWConvGradW2d_fp32_fp32_fp32_CHW_Redmule(dY, H_out, W_out, C_out, X, H_in, W_in, C_in, dW, pTransposeBuffer) PWConvGradX2d_fp32_fp32_fp32_CHW_Redmule(dY, H_out, W_out, C_out, W, C_in, dX, H_in, W_in, pTransposeBuffer, transposeBufferSize) Each one builds the required transpose (X^T or W^T) in parallel across the 8 cluster cores into the hoisted transient buffer, then fires a single MatMul_fp32_fp32_fp32_Redmule. - TargetLibraries/PULPOpen/inc/kernel/Conv.h decls for both new symbols. - Deeploy/Targets/Redmule/Templates/ConvGradTemplate.py (new) Two NodeTemplate subclasses (RedmulePWConvGradWTemplate, RedmulePWConvGradXTemplate) that hoist the transient transpose buffer (C_in * H_in * W_in for dW; C_in * C_out for dX, identical to the PULP version so PWConvGradX TileConstraint keeps working) and emit the kernel call in the per-batch loop. - Deeploy/Targets/Redmule/Bindings.py RedmulePWConvGradW2DBindings / RedmulePWConvGradX2DBindings with the same ConvChecker(2 inputs -> 1 output) signature as the PULP versions. - Deeploy/Targets/Redmule/Tiler.py Tiling-ready bindings paired with the *existing* PULP PWConvGradWTileConstraint / PWConvGradXTileConstraint -- the tile-shape search is engine-agnostic; only the binding body changes. - Deeploy/Targets/Redmule/Engine.py Adds 'ConvGradW': ConvGradWLayer([PWConvGradW2DRedmuleMapper]) and 'ConvGradX': ConvGradXLayer([PWConvGradX2DRedmuleMapper]) to RedmuleMapping. Both reuse PULPPWConvGrad{W,X}2DParser, which already screens for kernel_shape == [1, 1] / group == 1; non-PW backward Convs (regular 3x3 ConvGradW, depthwise variants) transparently fall through to PULPClusterEngine (which still carries the full [PW, DW, regular] mapper list). - DeeployTest/test_siracusa_redmule_tiled_config.py Kernels/FP32/ConvGradW_PW (1x1 dY 1x128x12x12, dW 128x64) and Kernels/FP32/ConvGradX_PW_block_11 (1x1 dY 1x256x3x3, W 256x128) added to the redmule kernel matrix. - DeeployTest/test_siracusa_tiled_config.py Same fixtures added to the Siracusa baseline kernel matrix so the CI step-summary speedup table has matching PULP cycle numbers to diff against. Local pytest --collect-only yields 5 redmule kernel tests (3 GEMM + 2 PWConvGrad). Real cycle / numerical correctness checks happen on CI; the summary script written in 46e4f3c already parses BENCH lines for training, the kernel-side `Runtime: N cycles` parsing will be a follow-up in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follow-up to 28b18a8. The previous commit wired both PWConvGradW2D and PWConvGradX2D into RedmuleEngine.Mapping; CI failed on ResNet8 / MobileNet training because their non-PW (3x3 / DW) backward Convs reached RedmuleEngine first (first-match in DeploymentPlatform._selectEngine) and the layer-level mapper list there carried only the PW Redmule mapper -- backtracking exhausted at the platform layer instead of falling back to PULPClusterEngine. Tried two fixes locally before settling: 1) Add DW + regular PULP mappers to RedmuleEngine's ConvGradW/X layer. Parsing now succeeds (3x3 binds to the PULP regular mapper just like before), but the tiler reports infeasible memory-pattern constraints with min-size 196608 B on the same nodes that worked on pure PULP -- some interaction we couldn't fully diagnose between adding mappers to the layer and the pattern-memory solver. 2) Don't touch RedmuleEngine.Mapping at all; instead, insert the RedMulE PW mappers at position 0 of PULPClusterEngine's existing ConvGradW / ConvGradX layers from RedmulePlatform.__init__. This keeps the same Layer object the pure-PULP path uses. Approach 2 works for ConvGradX but still trips the infeasibility for ConvGradW; isolation showed it's specifically the W mapper's insertion that breaks tiling (X-only is green on all three training fixtures, W-only fails on ResNet8 / MobileNet). The suspect is the W template's new C_in * H_in * W_in transposeBuffer footprint being accounted for by the pattern-memory solver even on nodes where the PW parser declines and the mapper is never selected -- but the exact mechanism is non-trivial to chase and out of scope for this round. This commit: - Drops the 'ConvGradW' / 'ConvGradX' entries from RedmuleMapping entirely; documents the failure mode in-source. - Inserts only PWConvGradX2DRedmuleMapper at position 0 of the PULPClusterEngine ConvGradXLayer in RedmulePlatform.__init__. PWConvGradW2DRedmuleMapper, its template, binding, and kernel stay in tree as ready infrastructure -- restore the insertion (and the Kernels/FP32/ConvGradW_PW matrix entry) once the W-side regression is diagnosed. - Removes Kernels/FP32/ConvGradW_PW from the redmule kernel matrix for the same reason; ConvGradX_PW_block_11 stays. Locally validated: CCT_train, ResNet8_train, MobileNetV1_train codegen all exit 0 on Siracusa_w_redmule; MobileNet TrainingNetwork.c now emits PWConvGradX2d_fp32_fp32_fp32_CHW_Redmule for the PW input-grad nodes (was PULP_PWConvGradX2d_*). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ernel CI Two stability fixes on the PW backward path + a temporary CI scope cut to let the kernel tests validate numerical correctness on CI before re-enabling the full training matrix. PWConvGradW kernel rewrite: The previous single-shot transpose buffer was C_in * H_out * W_out floats. On MobileNetV1 early blocks (C_in=32, H_out=W_out=48) that's a 288 KiB transient -- the pattern-memory solver runs out of L1 budget before the GEMM even gets to consider tiling. Switch the kernel to a chunked accumulation: PWGW_CHUNK_P = 16 output positions at a time, one RedMulE Gemm trigger per chunk where the previous dW serves as the bias-and-z operand (y_addr = z_addr = pGradWeight pattern, same trick Matmul_fp32_Redmule already uses for Y=Z=pDstY zero-init). When P > CHUNK_P the kernel also needs a contiguous [C_out, this_chunk] slice of dY, so the transient buffer also reserves CHUNK_P * C_out floats past the X^T region. Net L1 footprint is fixed at CHUNK_P * (C_in + C_out) * 4 B, independent of feature-map area. Stride > 1 is handled by deriving (SP, SQ) = (H_in/H_out, W_in/W_out) in the kernel and sampling X at (h_out*SP, w_out*SQ). Mirrors what pulp-trainlib's PULP_PWConvGradW2d does internally without changing the public signature. PWConvGradX kernel rewrite (stride-aware): The old kernel computed dX[C_in, H_in*W_in] = W^T @ dY[C_out, P] with P = H_in*W_in. That's correct only for stride 1; for stride 2 downsample 1x1 convs (ResNet8 layer2/3) the GEMM would over-iterate and produce wrong dX values. Fix: GEMM into a dense tmp[C_in, H_out*W_out], then scatter to pGradIn at strided positions while leaving the rest zero. The template's transient buffer grows to C_in*C_out + C_in*H_out*W_out to host both W^T and tmp. Mapping: RedmulePlatform.__init__ now inserts both PWConvGradW2DRedmuleMapper and PWConvGradX2DRedmuleMapper at position 0 of the PULPCluster ConvGradW / ConvGradX layers (previously X-only). Re-add the ConvGradW_PW kernel test to the redmule matrix so CI exercises the W path with reference reference inputs and a known ORT-computed dW. Scope cut: L3_SINGLEBUFFER_TRAINING_MODELS is temporarily empty in the redmule test config -- the kernel-level tests above are the cleanest way to validate the new kernels' numerical correctness; full-network runs will resume once those are green. The Siracusa-baseline training matrix is untouched, so per-kernel speedup vs PULP is still observable through the CI summary's per-fixture baseline lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

5b59d3a emptied L3_SINGLEBUFFER_TRAINING_MODELS to isolate the new PWConvGrad{W,X} RedMulE kernels behind the kernel-test matrix; that made @pytest.mark.parametrize get an empty list, which pytest treats as "error raised while trying to determine id of parameter 'test_params' at position 0" at collection time -- not just on the training job but on the kernel-singlebuffer-L2 job too, because both jobs collect the same test_platforms.py module. Restore CCT/cct_train (the smallest of the three training fixtures) as the minimum non-empty entry so parametrize stays happy. Kernel tests run independently from training tests in the same module -- the kernel matrix is still our primary correctness signal for the new kernels; CCT just gives parametrize something to enumerate, and runs as a bonus end-to-end sanity check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

runwangdl and others added 22 commits May 10, 2026 14:12

drop Float test fixtures

a5c8c5f

The Tests/ directory layout on devel was reorganized into Kernels/, Models/, Others/ subdirectories. Drop the flat-path Float test inputs ported from redmule_platform; they'll be re-added under the new structure in a follow-up.

ci: skip gh-pages publish on forks

ef1e5dc

The docs workflow publishes to gh-pages, which on a fork races with external pushes and lacks origin remote setup. Gate on github.repository_owner == 'pulp-platform' so only upstream publishes.

ci(redmule): authenticate container pulls for private ghcr.io image

4bb02d0

ghcr.io/runwangdl/deeploy:redmule is a private package; add credentials block using the workflow's GITHUB_TOKEN so the runner container step can pull it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(platform): port Siracusa+RedMulE from pulp-platform/Deeploy#67#20

feat(platform): port Siracusa+RedMulE from pulp-platform/Deeploy#67#20
runwangdl wants to merge 22 commits into
develfrom
feat/redmule

runwangdl commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

runwangdl commented May 10, 2026

Summary

Verified

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant