Skip to content

x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693

Open
crafcat7 wants to merge 5 commits into
Tencent:masterfrom
crafcat7:feat/x86-cumulativesum
Open

x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693
crafcat7 wants to merge 5 commits into
Tencent:masterfrom
crafcat7:feat/x86-cumulativesum

Conversation

@crafcat7
Copy link
Copy Markdown
Contributor

@crafcat7 crafcat7 commented Apr 24, 2026

Summary

Adds an x86-specific implementation of CumulativeSum that replaces the scalar prefix-sum loop on inner-dim scans with an AVX2 8-lane Kogge-Stone helper plus AVX / SSE fallbacks in the main x86 source. On the refreshed benchmark matrix this yields 1.75×–2.46× single-thread speedup and 1.46×–2.00× 8-thread speedup across the four Stage 2 demos. The x86 layer now covers all valid (dims, axis) cases; outer-axis scans use a simple SIMD cur += prev helper instead of falling back to the native implementation.

Motivation

CumulativeSum::forward_inplace in src/layer/cumulativesum.cpp walks data with an inner serial recurrence:

for (int k = 1; k < w; k++)
    ptr[k] = ptr[k] + ptr[k - 1];  // RAW dependency → cannot auto-vectorize

Of the 6 (dims, axis) cases handled by the base, three have the scan along the inner contiguous dimension and suffer from this serial dependency:

  • dims == 1
  • dims == 2, axis == 1
  • dims == 3, axis == 2

The other three cases (dims=2 axis=0, dims=3 axis=0/1) scan across the outer dimension, so each step updates an independent row/channel slice with cur += prev. These paths are memory-bandwidth-bound, but the x86 implementation still handles them with a small SIMD add helper so all valid cases stay inside the x86 layer.

Algorithm

Inner row of w floats, in-place SIMD prefix sum using a classic Kogge-Stone tree on 8 lanes:

  1. Stage 1 — shift-by-1 within each 128-bit half (_mm256_slli_si256 by 4B), then add.
  2. Stage 2 — shift-by-2 within each 128-bit half (8B), then add.
    After stages 1–2 each half of the register holds the prefix sum of its 4 lanes.
  3. Stage 3 — broadcast the last lane of the low half into all 4 lanes of the high half via _mm256_permute2f128_ps(v, v, 0x08) + _mm256_shuffle_ps(_, _, 0xff), then add. The full 8-lane prefix is now in v.
  4. Running base — add a broadcast of the previous tile's last lane, then update the base with the new last lane for the next tile.

Per-lane top-to-bottom view of one 8-lane tile (lanes 0..7 = a..h). Each column is one Kogge-Stone stage; values inside a stage are computed in parallel.

 lane  | input |  stage 1   |   stage 2    |     stage 3      |  + running_base
       |       | shift-1 +  | shift-2 +    | broadcast low.3  |
       |       | add (half) | add (half)   | → high + add     |
-------+-------+------------+--------------+------------------+--------------------
   0   |   a   |     a      |      a       |        a         |  a + base
   1   |   b   |    a+b     |     a+b      |       a+b        |  a+b + base
   2   |   c   |    b+c     |    a+b+c     |      a+b+c       |  a+b+c + base
   3   |   d   |    c+d     |   a+b+c+d    |     a+b+c+d      |  a+b+c+d + base
   ----  128-bit lane boundary  ------------------------------+--------------------
   4   |   e   |     e      |      e       |    a+b+c+d + e   |  ...+e + base
   5   |   f   |    e+f     |     e+f      |   a+b+c+d + e+f  |  ...+ef + base
   6   |   g   |    f+g     |    e+f+g     |  a+b+c+d + e+f+g |  ...+efg + base
   7   |   h   |    g+h     |   e+f+g+h    | a+b+c+d + e+f+g+h|  ...+efgh + base
                                                                       │
                                              base for next tile  ◄────┘
                                              (broadcast of lane 7)

Tail (<8 lanes) is finished with a scalar accumulator. The SSE2 path uses the same structure on 4 lanes with just stages 1 and 2; the AVX path stitches two 128-bit prefix sums into one 256-bit tile when __AVX__ is available but __AVX2__ is not.

We intentionally do not extend this to a 16-lane AVX-512 version: that requires a 4th stage plus 3 cross-lane permutes, lengthening the serial dependency chain faster than the lane count grows. Empirically the 8-lane AVX2 path already saturates the single-core ALU throughput on Zen5.

Dispatch

The x86 layer handles all six valid (dims, axis) cases. support_packing is left at its inherited default so the packing machinery auto-unpacks to pack1 before forward_inplace.

AVX2 dispatch:

  • cumulativesum_x86.cpp contains the x86 layer and compile-time AVX2 / AVX / SSE kernels.
  • cumulativesum_x86.h declares both the x86 layer class and the AVX2 helper interface, matching existing x86 patterns that avoid extra ISA-only headers for small helper shims.
  • cumulativesum_x86_avx2.cpp contains out-of-line AVX2 helpers compiled by ncnn_add_arch_opt_source() with -mavx2 or /arch:AVX2.
  • Runtime CPU builds select the AVX2 helper once per forward_inplace() call only from generated fma / avx variants when cpu_support_x86_avx2() is true. This fixes the common AVX2-only target where the layer creator resolves to the fma / avx variant, where __AVX2__ is otherwise false, while keeping compile-time AVX2 code in the main x86 source for fixed-ISA builds.

Multi-threading:

  • dims == 2, axis == 1: #pragma omp parallel for over rows.
  • dims == 3, axis == 2: flattened #pragma omp parallel for over (channel, row) — same parallelism as collapse(2), but compatible with MSVC OpenMP.

Correctness

ctest --test-dir build -R test_cumulativesum --output-on-failure
1/1 Test #44: test_cumulativesum ............... Passed

tests/test_cumulativesum.cpp is extended with boundary cases that exercise every edge of the SIMD tiling:

  • 1D lengths 1, 2, 3, 4, 5, 7, 8, 9, 15, 16, 17, 32 — covers tail-only, exact vector, partial-tail, and the first tile past the running-base propagation.
  • 2D axis=1 widths 1, 3, 8, 16, 17 with 5 rows.
  • 3D axis=2 widths 1, 3, 8, 16, 17 with 5 rows × 3 channels — verifies the flattened (channel, row) parallel region.

All 22 cases pass on the SIMD build.

Performance

Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen5, full AVX-512 family), g++, -O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements use benchncnn with loop=100, cooldown=0, taskset -c 0-7; reported as the min metric (most stable at sub-millisecond workloads).

Baseline build is rebuilt in a detached source tree with src/layer/x86/cumulativesum_x86.{h,cpp} removed so the base scalar implementation is used.

Benchmark matrix

Four demo graphs, each stacking 3 CumulativeSum layers to amortize benchmark harness noise:

Demo Shape Axis path 1T base 1T opt 1T × 8T base 8T opt 8T ×
cumsum_1d_demo [65536] dims=1 0.07 0.04 1.75× 0.07 0.04 1.75×
cumsum_2d_axis1_demo [512, 512] dims=2 axis=1 0.29 0.13 2.23× 0.06 0.03 2.00×
cumsum_axis2_demo [256, 256, 32] dims=3 axis=2 2.24 0.91 2.46× 0.53 0.28 1.89×
cumsum_demo [256, 256, 32] mixed axis=0→1→2 1.05 0.56 1.88× 0.41 0.28 1.46×

All numbers are min milliseconds over 100 iterations on cores 0-7.

Interpretation

  • RAW-dependency paths (dims=1, dims=2 axis=1, dims=3 axis=2): the scalar recurrence is still the core bottleneck, and the dedicated AVX2 helper restores the stronger performance profile after the temporary AVX-main-path regression. The best current result is the pure axis=2 demo at 2.46× (1T) and 1.89× (8T).
  • Regression note: inlining the AVX2 prefix kernel into cumulativesum_x86.cpp caused generated avx512 variants on the benchmark machine to bypass the dedicated helper and regress noticeably. Routing all __AVX__ variants back through the out-of-line AVX2 helper recovered the earlier benchmark envelope.
  • Mixed demo (1.88× at 1T, 1.46× at 8T): the final axis=2 layer still carries the largest win, while the first two layers benefit from the x86 SIMD add helper.

No regressions

All other layers and networks are unaffected — the new class only overrides forward_inplace for CumulativeSum, preserves the existing in-place API, and falls back to the base only for unexpected invalid cases.

Summary:
  Add x86 SIMD fast-path for CumulativeSum targeting the serial-scan
  axes (dims=1, dims=2 axis=1, dims=3 axis=2) where the base scalar
  code carries a true prefix-sum data dependency and the compiler
  cannot auto-vectorize. The kernel performs an in-register Kogge-Stone
  scan (AVX2 8-lane / SSE2 4-lane) with a running tile base, turning
  N scalar adds into log2(vec) SIMD adds per tile. Bandwidth-bound
  axes with no inner dependency are left to the base implementation
  since the compiler already auto-vectorizes them at memory bandwidth.

Changes:
  1. Add src/layer/x86/cumulativesum_x86.h declaring CumulativeSum_x86
  2. Implement prefix_sum_row() with AVX2 8-wide Kogge-Stone scan (3 stages + cross-128 propagation) and SSE2 4-wide fallback
  3. Route dims=1 / dims=2 axis=1 / dims=3 axis=2 to the SIMD scan path; dims=3 axis=2 runs in parallel over (channel, row) via OpenMP collapse(2)
  4. Fall back to CumulativeSum::forward_inplace for non-pack1, non-fp32, and bandwidth-bound axes
Summary:
  Add dedicated boundary test cases that exercise w at, just below, and
  just above common SIMD vector widths (4 / 8 / 16). The cases cover the
  single-tile no-tail path, the running base propagation across multiple
  tiles, and the scalar tail seam (sum carried via ptr[j-1]). The existing
  tests only hit these paths incidentally; the new cases make the coverage
  explicit and remain valid for any vectorized backend.

Changes:
  1. Add test_cumulativesum_boundary() covering dims=1/2/3 with w in {7, 8, 9, 15, 16, 17, 32}
  2. Include axis=1 (row-parallel) and axis=2 (collapse(2) channel+row) variants to validate the OpenMP parallel paths
  3. Wire the new function into main() alongside the existing 1d/2d/3d groups
@nihui
Copy link
Copy Markdown
Member

nihui commented Apr 27, 2026

@codex review

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 99.16667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 93.97%. Comparing base (d0d5063) to head (e0a3dfd).
⚠️ Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
src/layer/x86/cumulativesum_x86_packed.h 99.12% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6693      +/-   ##
==========================================
+ Coverage   93.95%   93.97%   +0.01%     
==========================================
  Files         933      936       +3     
  Lines      299730   299942     +212     
==========================================
+ Hits       281624   281866     +242     
+ Misses      18106    18076      -30     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4f8e403ac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/layer/x86/cumulativesum_x86.cpp Outdated
{
int j = 0;

#if __AVX2__
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enable AVX2 kernel in runtime-dispatch builds

The AVX2 fast path is guarded by #if __AVX2__, but this layer only adds cumulativesum_x86.cpp and no cumulativesum_x86_avx2.cpp helper. In the default NCNN_RUNTIME_CPU=ON x86 build, this source is instantiated as avx/fma variants (see cmake/ncnn_add_layer.cmake flags -mavx and -mavx -mfma), so __AVX2__ is false on AVX2/FMA CPUs without AVX512 and the code falls back to SSE2/scalar. That means the new 8-lane kernel is effectively unavailable for the most common AVX2-only runtime target.

Useful? React with 👍 / 👎.

Summary:
  Add an x86-specific CumulativeSum implementation that accelerates inner-axis prefix scans with AVX and an out-of-line AVX2 Kogge-Stone helper, while covering all valid base-layer dims/axis cases directly. This also replaces the tail base-class bounce with an explicit invalid-input return path and adds boundary tests for SIMD tile edges.

Changes:
  1. Add x86 cumulativesum AVX2 helper source and runtime dispatch from AVX-capable variants.
  2. Implement SIMD prefix-scan and outer-axis accumulation paths for all valid 1D, 2D, and 3D cases in the x86 layer.
  3. Extend cumulativesum tests with boundary widths to validate vector tails and running-base propagation.
@crafcat7
Copy link
Copy Markdown
Contributor Author

crafcat7 commented Apr 28, 2026

I've revised my submission and updated the current PR information. The main changes are as follows:

  • Added x86-optimized implementation for CumulativeSum
  1. Covered all valid (dims, axis) scenarios in the base layer in src/layer/x86/cumulativesum_x86.cpp
  2. Used SIMD paths for inner-axis prefix scan
  3. Used SIMD cur += prev paths for outer-axis accumulation
  • Added a separate AVX2 helper
  1. src/layer/x86/cumulativesum_x86_avx2.cpp
  2. Handled stronger ISA paths with out-of-line AVX2 Kogge-Stone prefix scan
  3. Upgraded to AVX2 helper via runtime dispatch in the AVX-capable x86 variant
  • Corrected the behavior boundaries of the x86 layer
  1. No longer falls back to the base class for repeated execution
  2. Directly returns for uncovered illegal input -100, consistent with the semantics of the base class's missed branch
  • Supplementary tests
  1. Extend tests/test_cumulativesum.cpp
  2. Add boundary width tests to cover SIMD tile, tail, and running-base propagation scenarios

Summary:

  Move the CumulativeSum x86 SIMD implementation and AVX2 runtime dispatch into a shared header included by both the layer wrapper and the AVX2 wrapper. This matches the existing x86 packed helper structure while preserving optimized prefix scan behavior.

Changes:

  1. Add a shared CumulativeSum x86 packed helper header with prefix/add kernels and forwarding logic

  2. Simplify the x86 layer implementation to call the shared forward path

  3. Export an AVX2 wrapper translation unit for runtime CPU dispatch
@crafcat7
Copy link
Copy Markdown
Contributor Author

I've restructured the code, splitting it into:

  • cumulativesum_x86_packed.h -> Contains the main implementation

  • cumulativesum_x86.h -> External interface

  • cumulativesum_x86.cpp -> Forward

  • cumulativesum_x86_avx2.cpp -> Forward

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an x86-specific CumulativeSum implementation to accelerate inner-dimension prefix scans using SIMD (AVX2 Kogge-Stone prefix scan with AVX/SSE fallbacks) while keeping all valid (dims, axis) cases handled within the x86 layer and extending test coverage to boundary sizes that stress SIMD tiling/tails.

Changes:

  • Introduces CumulativeSum_x86 and an x86 implementation of forward_inplace() that covers all (dims, axis) cases.
  • Adds a SIMD prefix-sum kernel for contiguous inner-dimension scans plus a SIMD “cur += prev” helper for outer-axis scans, including runtime AVX2 dispatch for __AVX__ && !__AVX2__ builds.
  • Extends test_cumulativesum with boundary cases targeting vector-width edges and tail handling.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/test_cumulativesum.cpp Adds boundary-focused test cases for 1D/2D/3D shapes to exercise SIMD tiling and tails.
src/layer/x86/cumulativesum_x86.h Declares the x86-specific CumulativeSum_x86 layer override.
src/layer/x86/cumulativesum_x86.cpp Implements CumulativeSum_x86::forward_inplace() and routes to the packed SIMD implementation.
src/layer/x86/cumulativesum_x86_packed.h Provides the SIMD prefix-sum (AVX2/AVX/SSE2) and add helpers plus the main x86 forward implementation with runtime AVX2 dispatch.
src/layer/x86/cumulativesum_x86_avx2.cpp Defines the out-of-line AVX2 entry point used by runtime dispatch builds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…rd path

Summary:
  Restructure cumulativesum_x86 to follow the established x86 layer
  pattern (e.g. deconvolution, interp, cast): normal implementation in
  x86.cpp, ISA-specific variant in packed.h + avx2.cpp wrapper.

Changes:
  1. Move normal forward_inplace logic and SSE/AVX helpers into
     cumulativesum_x86.cpp
  2. Rename packed.h helpers to _avx2 suffix and guard with #if __AVX2__
  3. Add runtime AVX2 dispatch in forward_inplace via
     cumulative_sum_forward_inplace_avx2 free function
  4. Update cumulativesum_x86_avx2.cpp wrapper to call _impl variant
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants