x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan by crafcat7 · Pull Request #6693 · Tencent/ncnn

crafcat7 · 2026-04-24T13:50:08Z

Summary

Adds an x86-specific implementation of CumulativeSum that replaces the scalar prefix-sum loop on inner-dim scans with an AVX2 8-lane Kogge-Stone helper plus AVX / SSE fallbacks in the main x86 source. On the refreshed benchmark matrix this yields 1.75×–2.46× single-thread speedup and 1.46×–2.00× 8-thread speedup across the four Stage 2 demos. The x86 layer now covers all valid (dims, axis) cases; outer-axis scans use a simple SIMD cur += prev helper instead of falling back to the native implementation.

Motivation

CumulativeSum::forward_inplace in src/layer/cumulativesum.cpp walks data with an inner serial recurrence:

for (int k = 1; k < w; k++)
    ptr[k] = ptr[k] + ptr[k - 1];  // RAW dependency → cannot auto-vectorize

Of the 6 (dims, axis) cases handled by the base, three have the scan along the inner contiguous dimension and suffer from this serial dependency:

dims == 1
dims == 2, axis == 1
dims == 3, axis == 2

The other three cases (dims=2 axis=0, dims=3 axis=0/1) scan across the outer dimension, so each step updates an independent row/channel slice with cur += prev. These paths are memory-bandwidth-bound, but the x86 implementation still handles them with a small SIMD add helper so all valid cases stay inside the x86 layer.

Algorithm

Inner row of w floats, in-place SIMD prefix sum using a classic Kogge-Stone tree on 8 lanes:

Stage 1 — shift-by-1 within each 128-bit half (_mm256_slli_si256 by 4B), then add.
Stage 2 — shift-by-2 within each 128-bit half (8B), then add.
After stages 1–2 each half of the register holds the prefix sum of its 4 lanes.
Stage 3 — broadcast the last lane of the low half into all 4 lanes of the high half via _mm256_permute2f128_ps(v, v, 0x08) + _mm256_shuffle_ps(_, _, 0xff), then add. The full 8-lane prefix is now in v.
Running base — add a broadcast of the previous tile's last lane, then update the base with the new last lane for the next tile.

Per-lane top-to-bottom view of one 8-lane tile (lanes 0..7 = a..h). Each column is one Kogge-Stone stage; values inside a stage are computed in parallel.

 lane  | input |  stage 1   |   stage 2    |     stage 3      |  + running_base
       |       | shift-1 +  | shift-2 +    | broadcast low.3  |
       |       | add (half) | add (half)   | → high + add     |
-------+-------+------------+--------------+------------------+--------------------
   0   |   a   |     a      |      a       |        a         |  a + base
   1   |   b   |    a+b     |     a+b      |       a+b        |  a+b + base
   2   |   c   |    b+c     |    a+b+c     |      a+b+c       |  a+b+c + base
   3   |   d   |    c+d     |   a+b+c+d    |     a+b+c+d      |  a+b+c+d + base
   ----  128-bit lane boundary  ------------------------------+--------------------
   4   |   e   |     e      |      e       |    a+b+c+d + e   |  ...+e + base
   5   |   f   |    e+f     |     e+f      |   a+b+c+d + e+f  |  ...+ef + base
   6   |   g   |    f+g     |    e+f+g     |  a+b+c+d + e+f+g |  ...+efg + base
   7   |   h   |    g+h     |   e+f+g+h    | a+b+c+d + e+f+g+h|  ...+efgh + base
                                                                       │
                                              base for next tile  ◄────┘
                                              (broadcast of lane 7)

Tail (<8 lanes) is finished with a scalar accumulator. The SSE2 path uses the same structure on 4 lanes with just stages 1 and 2; the AVX path stitches two 128-bit prefix sums into one 256-bit tile when __AVX__ is available but __AVX2__ is not.

We intentionally do not extend this to a 16-lane AVX-512 version: that requires a 4th stage plus 3 cross-lane permutes, lengthening the serial dependency chain faster than the lane count grows. Empirically the 8-lane AVX2 path already saturates the single-core ALU throughput on Zen5.

Dispatch

The x86 layer handles all six valid (dims, axis) cases. support_packing is left at its inherited default so the packing machinery auto-unpacks to pack1 before forward_inplace.

AVX2 dispatch:

cumulativesum_x86.cpp contains the x86 layer and compile-time AVX2 / AVX / SSE kernels.
cumulativesum_x86.h declares both the x86 layer class and the AVX2 helper interface, matching existing x86 patterns that avoid extra ISA-only headers for small helper shims.
cumulativesum_x86_avx2.cpp contains out-of-line AVX2 helpers compiled by ncnn_add_arch_opt_source() with -mavx2 or /arch:AVX2.
Runtime CPU builds select the AVX2 helper once per forward_inplace() call only from generated fma / avx variants when cpu_support_x86_avx2() is true. This fixes the common AVX2-only target where the layer creator resolves to the fma / avx variant, where __AVX2__ is otherwise false, while keeping compile-time AVX2 code in the main x86 source for fixed-ISA builds.

Multi-threading:

dims == 2, axis == 1: #pragma omp parallel for over rows.
dims == 3, axis == 2: flattened #pragma omp parallel for over (channel, row) — same parallelism as collapse(2), but compatible with MSVC OpenMP.

Correctness

ctest --test-dir build -R test_cumulativesum --output-on-failure
1/1 Test #44: test_cumulativesum ............... Passed

tests/test_cumulativesum.cpp is extended with boundary cases that exercise every edge of the SIMD tiling:

1D lengths 1, 2, 3, 4, 5, 7, 8, 9, 15, 16, 17, 32 — covers tail-only, exact vector, partial-tail, and the first tile past the running-base propagation.
2D axis=1 widths 1, 3, 8, 16, 17 with 5 rows.
3D axis=2 widths 1, 3, 8, 16, 17 with 5 rows × 3 channels — verifies the flattened (channel, row) parallel region.

All 22 cases pass on the SIMD build.

Performance

Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen5, full AVX-512 family), g++, -O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements use benchncnn with loop=100, cooldown=0, taskset -c 0-7; reported as the min metric (most stable at sub-millisecond workloads).

Baseline build is rebuilt in a detached source tree with src/layer/x86/cumulativesum_x86.{h,cpp} removed so the base scalar implementation is used.

Benchmark matrix

Four demo graphs, each stacking 3 CumulativeSum layers to amortize benchmark harness noise:

Demo	Shape	Axis path	1T base	1T opt	1T ×	8T base	8T opt	8T ×
`cumsum_1d_demo`	`[65536]`	dims=1	0.07	0.04	1.75×	0.07	0.04	1.75×
`cumsum_2d_axis1_demo`	`[512, 512]`	dims=2 axis=1	0.29	0.13	2.23×	0.06	0.03	2.00×
`cumsum_axis2_demo`	`[256, 256, 32]`	dims=3 axis=2	2.24	0.91	2.46×	0.53	0.28	1.89×
`cumsum_demo`	`[256, 256, 32]`	mixed axis=0→1→2	1.05	0.56	1.88×	0.41	0.28	1.46×

All numbers are min milliseconds over 100 iterations on cores 0-7.

Interpretation

RAW-dependency paths (dims=1, dims=2 axis=1, dims=3 axis=2): the scalar recurrence is still the core bottleneck, and the dedicated AVX2 helper restores the stronger performance profile after the temporary AVX-main-path regression. The best current result is the pure axis=2 demo at 2.46× (1T) and 1.89× (8T).
Regression note: inlining the AVX2 prefix kernel into cumulativesum_x86.cpp caused generated avx512 variants on the benchmark machine to bypass the dedicated helper and regress noticeably. Routing all __AVX__ variants back through the out-of-line AVX2 helper recovered the earlier benchmark envelope.
Mixed demo (1.88× at 1T, 1.46× at 8T): the final axis=2 layer still carries the largest win, while the first two layers benefit from the x86 SIMD add helper.

No regressions

All other layers and networks are unaffected — the new class only overrides forward_inplace for CumulativeSum, preserves the existing in-place API, and falls back to the base only for unexpected invalid cases.

Summary: Add x86 SIMD fast-path for CumulativeSum targeting the serial-scan axes (dims=1, dims=2 axis=1, dims=3 axis=2) where the base scalar code carries a true prefix-sum data dependency and the compiler cannot auto-vectorize. The kernel performs an in-register Kogge-Stone scan (AVX2 8-lane / SSE2 4-lane) with a running tile base, turning N scalar adds into log2(vec) SIMD adds per tile. Bandwidth-bound axes with no inner dependency are left to the base implementation since the compiler already auto-vectorizes them at memory bandwidth. Changes: 1. Add src/layer/x86/cumulativesum_x86.h declaring CumulativeSum_x86 2. Implement prefix_sum_row() with AVX2 8-wide Kogge-Stone scan (3 stages + cross-128 propagation) and SSE2 4-wide fallback 3. Route dims=1 / dims=2 axis=1 / dims=3 axis=2 to the SIMD scan path; dims=3 axis=2 runs in parallel over (channel, row) via OpenMP collapse(2) 4. Fall back to CumulativeSum::forward_inplace for non-pack1, non-fp32, and bandwidth-bound axes

Summary: Add dedicated boundary test cases that exercise w at, just below, and just above common SIMD vector widths (4 / 8 / 16). The cases cover the single-tile no-tail path, the running base propagation across multiple tiles, and the scalar tail seam (sum carried via ptr[j-1]). The existing tests only hit these paths incidentally; the new cases make the coverage explicit and remain valid for any vectorized backend. Changes: 1. Add test_cumulativesum_boundary() covering dims=1/2/3 with w in {7, 8, 9, 15, 16, 17, 32} 2. Include axis=1 (row-parallel) and axis=2 (collapse(2) channel+row) variants to validate the OpenMP parallel paths 3. Wire the new function into main() alongside the existing 1d/2d/3d groups

nihui · 2026-04-27T02:48:24Z

@codex review

codecov-commenter · 2026-04-27T02:51:26Z

Codecov Report

❌ Patch coverage is 99.16667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 93.97%. Comparing base (d0d5063) to head (e0a3dfd).
⚠️ Report is 5 commits behind head on master.

Files with missing lines	Patch %	Lines
src/layer/x86/cumulativesum_x86_packed.h	99.12%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6693      +/-   ##
==========================================
+ Coverage   93.95%   93.97%   +0.01%     
==========================================
  Files         933      936       +3     
  Lines      299730   299942     +212     
==========================================
+ Hits       281624   281866     +242     
+ Misses      18106    18076      -30

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4f8e403ac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-27T02:53:46Z

+{
+    int j = 0;
+
+#if __AVX2__


Enable AVX2 kernel in runtime-dispatch builds

The AVX2 fast path is guarded by #if __AVX2__, but this layer only adds cumulativesum_x86.cpp and no cumulativesum_x86_avx2.cpp helper. In the default NCNN_RUNTIME_CPU=ON x86 build, this source is instantiated as avx/fma variants (see cmake/ncnn_add_layer.cmake flags -mavx and -mavx -mfma), so __AVX2__ is false on AVX2/FMA CPUs without AVX512 and the code falls back to SSE2/scalar. That means the new 8-lane kernel is effectively unavailable for the most common AVX2-only runtime target.

Useful? React with 👍 / 👎.

Summary: Add an x86-specific CumulativeSum implementation that accelerates inner-axis prefix scans with AVX and an out-of-line AVX2 Kogge-Stone helper, while covering all valid base-layer dims/axis cases directly. This also replaces the tail base-class bounce with an explicit invalid-input return path and adds boundary tests for SIMD tile edges. Changes: 1. Add x86 cumulativesum AVX2 helper source and runtime dispatch from AVX-capable variants. 2. Implement SIMD prefix-scan and outer-axis accumulation paths for all valid 1D, 2D, and 3D cases in the x86 layer. 3. Extend cumulativesum tests with boundary widths to validate vector tails and running-base propagation.

crafcat7 · 2026-04-28T02:42:54Z

I've revised my submission and updated the current PR information. The main changes are as follows:

Added x86-optimized implementation for CumulativeSum

Covered all valid (dims, axis) scenarios in the base layer in src/layer/x86/cumulativesum_x86.cpp
Used SIMD paths for inner-axis prefix scan
Used SIMD cur += prev paths for outer-axis accumulation

Added a separate AVX2 helper

src/layer/x86/cumulativesum_x86_avx2.cpp
Handled stronger ISA paths with out-of-line AVX2 Kogge-Stone prefix scan
Upgraded to AVX2 helper via runtime dispatch in the AVX-capable x86 variant

Corrected the behavior boundaries of the x86 layer

No longer falls back to the base class for repeated execution
Directly returns for uncovered illegal input -100, consistent with the semantics of the base class's missed branch

Supplementary tests

Extend tests/test_cumulativesum.cpp
Add boundary width tests to cover SIMD tile, tail, and running-base propagation scenarios

Summary: Move the CumulativeSum x86 SIMD implementation and AVX2 runtime dispatch into a shared header included by both the layer wrapper and the AVX2 wrapper. This matches the existing x86 packed helper structure while preserving optimized prefix scan behavior. Changes: 1. Add a shared CumulativeSum x86 packed helper header with prefix/add kernels and forwarding logic 2. Simplify the x86 layer implementation to call the shared forward path 3. Export an AVX2 wrapper translation unit for runtime CPU dispatch

crafcat7 · 2026-04-28T16:52:04Z

I've restructured the code, splitting it into:

cumulativesum_x86_packed.h -> Contains the main implementation
cumulativesum_x86.h -> External interface
cumulativesum_x86.cpp -> Forward
cumulativesum_x86_avx2.cpp -> Forward

Copilot

Pull request overview

Adds an x86-specific CumulativeSum implementation to accelerate inner-dimension prefix scans using SIMD (AVX2 Kogge-Stone prefix scan with AVX/SSE fallbacks) while keeping all valid (dims, axis) cases handled within the x86 layer and extending test coverage to boundary sizes that stress SIMD tiling/tails.

Changes:

Introduces CumulativeSum_x86 and an x86 implementation of forward_inplace() that covers all (dims, axis) cases.
Adds a SIMD prefix-sum kernel for contiguous inner-dimension scans plus a SIMD “cur += prev” helper for outer-axis scans, including runtime AVX2 dispatch for __AVX__ && !__AVX2__ builds.
Extends test_cumulativesum with boundary cases targeting vector-width edges and tail handling.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
tests/test_cumulativesum.cpp	Adds boundary-focused test cases for 1D/2D/3D shapes to exercise SIMD tiling and tails.
src/layer/x86/cumulativesum_x86.h	Declares the x86-specific `CumulativeSum_x86` layer override.
src/layer/x86/cumulativesum_x86.cpp	Implements `CumulativeSum_x86::forward_inplace()` and routes to the packed SIMD implementation.
src/layer/x86/cumulativesum_x86_packed.h	Provides the SIMD prefix-sum (AVX2/AVX/SSE2) and add helpers plus the main x86 forward implementation with runtime AVX2 dispatch.
src/layer/x86/cumulativesum_x86_avx2.cpp	Defines the out-of-line AVX2 entry point used by runtime dispatch builds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…rd path Summary: Restructure cumulativesum_x86 to follow the established x86 layer pattern (e.g. deconvolution, interp, cast): normal implementation in x86.cpp, ISA-specific variant in packed.h + avx2.cpp wrapper. Changes: 1. Move normal forward_inplace logic and SSE/AVX helpers into cumulativesum_x86.cpp 2. Rename packed.h helpers to _avx2 suffix and guard with #if __AVX2__ 3. Add runtime AVX2 dispatch in forward_inplace via cumulative_sum_forward_inplace_avx2 free function 4. Update cumulativesum_x86_avx2.cpp wrapper to call _impl variant

crafcat7 added 2 commits April 23, 2026 22:25

github-actions Bot added test x86 labels Apr 24, 2026

chatgpt-codex-connector Bot reviewed Apr 27, 2026

View reviewed changes

nihui requested a review from Copilot May 7, 2026 11:53

Copilot started reviewing on behalf of nihui May 7, 2026 11:54 View session

Copilot AI reviewed May 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693

x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693
crafcat7 wants to merge 5 commits into
Tencent:masterfrom
crafcat7:feat/x86-cumulativesum

crafcat7 commented Apr 24, 2026 •

edited

Loading

Uh oh!

nihui commented Apr 27, 2026

Uh oh!

codecov-commenter commented Apr 27, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 27, 2026

Uh oh!

crafcat7 commented Apr 28, 2026 •

edited

Loading

Uh oh!

crafcat7 commented Apr 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

crafcat7 commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Algorithm

Dispatch

Correctness

Performance

Benchmark matrix

Interpretation

No regressions

Uh oh!

nihui commented Apr 27, 2026

Uh oh!

codecov-commenter commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

crafcat7 commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

crafcat7 commented Apr 28, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

crafcat7 commented Apr 24, 2026 •

edited

Loading

codecov-commenter commented Apr 27, 2026 •

edited

Loading

crafcat7 commented Apr 28, 2026 •

edited

Loading