x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693
x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693crafcat7 wants to merge 5 commits into
Conversation
Summary: Add x86 SIMD fast-path for CumulativeSum targeting the serial-scan axes (dims=1, dims=2 axis=1, dims=3 axis=2) where the base scalar code carries a true prefix-sum data dependency and the compiler cannot auto-vectorize. The kernel performs an in-register Kogge-Stone scan (AVX2 8-lane / SSE2 4-lane) with a running tile base, turning N scalar adds into log2(vec) SIMD adds per tile. Bandwidth-bound axes with no inner dependency are left to the base implementation since the compiler already auto-vectorizes them at memory bandwidth. Changes: 1. Add src/layer/x86/cumulativesum_x86.h declaring CumulativeSum_x86 2. Implement prefix_sum_row() with AVX2 8-wide Kogge-Stone scan (3 stages + cross-128 propagation) and SSE2 4-wide fallback 3. Route dims=1 / dims=2 axis=1 / dims=3 axis=2 to the SIMD scan path; dims=3 axis=2 runs in parallel over (channel, row) via OpenMP collapse(2) 4. Fall back to CumulativeSum::forward_inplace for non-pack1, non-fp32, and bandwidth-bound axes
Summary:
Add dedicated boundary test cases that exercise w at, just below, and
just above common SIMD vector widths (4 / 8 / 16). The cases cover the
single-tile no-tail path, the running base propagation across multiple
tiles, and the scalar tail seam (sum carried via ptr[j-1]). The existing
tests only hit these paths incidentally; the new cases make the coverage
explicit and remain valid for any vectorized backend.
Changes:
1. Add test_cumulativesum_boundary() covering dims=1/2/3 with w in {7, 8, 9, 15, 16, 17, 32}
2. Include axis=1 (row-parallel) and axis=2 (collapse(2) channel+row) variants to validate the OpenMP parallel paths
3. Wire the new function into main() alongside the existing 1d/2d/3d groups
|
@codex review |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6693 +/- ##
==========================================
+ Coverage 93.95% 93.97% +0.01%
==========================================
Files 933 936 +3
Lines 299730 299942 +212
==========================================
+ Hits 281624 281866 +242
+ Misses 18106 18076 -30 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e4f8e403ac
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| { | ||
| int j = 0; | ||
|
|
||
| #if __AVX2__ |
There was a problem hiding this comment.
Enable AVX2 kernel in runtime-dispatch builds
The AVX2 fast path is guarded by #if __AVX2__, but this layer only adds cumulativesum_x86.cpp and no cumulativesum_x86_avx2.cpp helper. In the default NCNN_RUNTIME_CPU=ON x86 build, this source is instantiated as avx/fma variants (see cmake/ncnn_add_layer.cmake flags -mavx and -mavx -mfma), so __AVX2__ is false on AVX2/FMA CPUs without AVX512 and the code falls back to SSE2/scalar. That means the new 8-lane kernel is effectively unavailable for the most common AVX2-only runtime target.
Useful? React with 👍 / 👎.
Summary: Add an x86-specific CumulativeSum implementation that accelerates inner-axis prefix scans with AVX and an out-of-line AVX2 Kogge-Stone helper, while covering all valid base-layer dims/axis cases directly. This also replaces the tail base-class bounce with an explicit invalid-input return path and adds boundary tests for SIMD tile edges. Changes: 1. Add x86 cumulativesum AVX2 helper source and runtime dispatch from AVX-capable variants. 2. Implement SIMD prefix-scan and outer-axis accumulation paths for all valid 1D, 2D, and 3D cases in the x86 layer. 3. Extend cumulativesum tests with boundary widths to validate vector tails and running-base propagation.
|
I've revised my submission and updated the current PR information. The main changes are as follows:
|
Summary: Move the CumulativeSum x86 SIMD implementation and AVX2 runtime dispatch into a shared header included by both the layer wrapper and the AVX2 wrapper. This matches the existing x86 packed helper structure while preserving optimized prefix scan behavior. Changes: 1. Add a shared CumulativeSum x86 packed helper header with prefix/add kernels and forwarding logic 2. Simplify the x86 layer implementation to call the shared forward path 3. Export an AVX2 wrapper translation unit for runtime CPU dispatch
|
I've restructured the code, splitting it into:
|
There was a problem hiding this comment.
Pull request overview
Adds an x86-specific CumulativeSum implementation to accelerate inner-dimension prefix scans using SIMD (AVX2 Kogge-Stone prefix scan with AVX/SSE fallbacks) while keeping all valid (dims, axis) cases handled within the x86 layer and extending test coverage to boundary sizes that stress SIMD tiling/tails.
Changes:
- Introduces
CumulativeSum_x86and an x86 implementation offorward_inplace()that covers all(dims, axis)cases. - Adds a SIMD prefix-sum kernel for contiguous inner-dimension scans plus a SIMD “cur += prev” helper for outer-axis scans, including runtime AVX2 dispatch for
__AVX__ && !__AVX2__builds. - Extends
test_cumulativesumwith boundary cases targeting vector-width edges and tail handling.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_cumulativesum.cpp | Adds boundary-focused test cases for 1D/2D/3D shapes to exercise SIMD tiling and tails. |
| src/layer/x86/cumulativesum_x86.h | Declares the x86-specific CumulativeSum_x86 layer override. |
| src/layer/x86/cumulativesum_x86.cpp | Implements CumulativeSum_x86::forward_inplace() and routes to the packed SIMD implementation. |
| src/layer/x86/cumulativesum_x86_packed.h | Provides the SIMD prefix-sum (AVX2/AVX/SSE2) and add helpers plus the main x86 forward implementation with runtime AVX2 dispatch. |
| src/layer/x86/cumulativesum_x86_avx2.cpp | Defines the out-of-line AVX2 entry point used by runtime dispatch builds. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…rd path
Summary:
Restructure cumulativesum_x86 to follow the established x86 layer
pattern (e.g. deconvolution, interp, cast): normal implementation in
x86.cpp, ISA-specific variant in packed.h + avx2.cpp wrapper.
Changes:
1. Move normal forward_inplace logic and SSE/AVX helpers into
cumulativesum_x86.cpp
2. Rename packed.h helpers to _avx2 suffix and guard with #if __AVX2__
3. Add runtime AVX2 dispatch in forward_inplace via
cumulative_sum_forward_inplace_avx2 free function
4. Update cumulativesum_x86_avx2.cpp wrapper to call _impl variant
Summary
Adds an x86-specific implementation of
CumulativeSumthat replaces the scalar prefix-sum loop on inner-dim scans with an AVX2 8-lane Kogge-Stone helper plus AVX / SSE fallbacks in the main x86 source. On the refreshed benchmark matrix this yields 1.75×–2.46× single-thread speedup and 1.46×–2.00× 8-thread speedup across the four Stage 2 demos. The x86 layer now covers all valid(dims, axis)cases; outer-axis scans use a simple SIMDcur += prevhelper instead of falling back to the native implementation.Motivation
CumulativeSum::forward_inplaceinsrc/layer/cumulativesum.cppwalks data with an inner serial recurrence:Of the 6
(dims, axis)cases handled by the base, three have the scan along the inner contiguous dimension and suffer from this serial dependency:dims == 1dims == 2, axis == 1dims == 3, axis == 2The other three cases (
dims=2 axis=0,dims=3 axis=0/1) scan across the outer dimension, so each step updates an independent row/channel slice withcur += prev. These paths are memory-bandwidth-bound, but the x86 implementation still handles them with a small SIMD add helper so all valid cases stay inside the x86 layer.Algorithm
Inner row of
wfloats, in-place SIMD prefix sum using a classic Kogge-Stone tree on 8 lanes:_mm256_slli_si256by 4B), then add.After stages 1–2 each half of the register holds the prefix sum of its 4 lanes.
_mm256_permute2f128_ps(v, v, 0x08) + _mm256_shuffle_ps(_, _, 0xff), then add. The full 8-lane prefix is now inv.Per-lane top-to-bottom view of one 8-lane tile (lanes 0..7 =
a..h). Each column is one Kogge-Stone stage; values inside a stage are computed in parallel.Tail (<8 lanes) is finished with a scalar accumulator. The SSE2 path uses the same structure on 4 lanes with just stages 1 and 2; the AVX path stitches two 128-bit prefix sums into one 256-bit tile when
__AVX__is available but__AVX2__is not.We intentionally do not extend this to a 16-lane AVX-512 version: that requires a 4th stage plus 3 cross-lane permutes, lengthening the serial dependency chain faster than the lane count grows. Empirically the 8-lane AVX2 path already saturates the single-core ALU throughput on Zen5.
Dispatch
The x86 layer handles all six valid
(dims, axis)cases.support_packingis left at its inherited default so the packing machinery auto-unpacks topack1beforeforward_inplace.AVX2 dispatch:
cumulativesum_x86.cppcontains the x86 layer and compile-time AVX2 / AVX / SSE kernels.cumulativesum_x86.hdeclares both the x86 layer class and the AVX2 helper interface, matching existing x86 patterns that avoid extra ISA-only headers for small helper shims.cumulativesum_x86_avx2.cppcontains out-of-line AVX2 helpers compiled byncnn_add_arch_opt_source()with-mavx2or/arch:AVX2.forward_inplace()call only from generatedfma/avxvariants whencpu_support_x86_avx2()is true. This fixes the common AVX2-only target where the layer creator resolves to thefma/avxvariant, where__AVX2__is otherwise false, while keeping compile-time AVX2 code in the main x86 source for fixed-ISA builds.Multi-threading:
dims == 2, axis == 1:#pragma omp parallel forover rows.dims == 3, axis == 2: flattened#pragma omp parallel forover(channel, row)— same parallelism ascollapse(2), but compatible with MSVC OpenMP.Correctness
tests/test_cumulativesum.cppis extended with boundary cases that exercise every edge of the SIMD tiling:axis=1widths 1, 3, 8, 16, 17 with 5 rows.axis=2widths 1, 3, 8, 16, 17 with 5 rows × 3 channels — verifies the flattened(channel, row)parallel region.All 22 cases pass on the SIMD build.
Performance
Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen5, full AVX-512 family), g++,
-O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements usebenchncnnwithloop=100,cooldown=0,taskset -c 0-7; reported as theminmetric (most stable at sub-millisecond workloads).Baseline build is rebuilt in a detached source tree with
src/layer/x86/cumulativesum_x86.{h,cpp}removed so the base scalar implementation is used.Benchmark matrix
Four demo graphs, each stacking 3
CumulativeSumlayers to amortize benchmark harness noise:cumsum_1d_demo[65536]cumsum_2d_axis1_demo[512, 512]cumsum_axis2_demo[256, 256, 32]cumsum_demo[256, 256, 32]All numbers are
minmilliseconds over 100 iterations on cores 0-7.Interpretation
axis=2demo at 2.46× (1T) and 1.89× (8T).cumulativesum_x86.cppcaused generatedavx512variants on the benchmark machine to bypass the dedicated helper and regress noticeably. Routing all__AVX__variants back through the out-of-line AVX2 helper recovered the earlier benchmark envelope.axis=2layer still carries the largest win, while the first two layers benefit from the x86 SIMD add helper.No regressions
All other layers and networks are unaffected — the new class only overrides
forward_inplaceforCumulativeSum, preserves the existing in-place API, and falls back to the base only for unexpected invalid cases.