Optimize x86 fp16s innerproduct gemm to eliminate loop-carried stalls by Edwardssss · Pull Request #6682 · Tencent/ncnn

Edwardssss · 2026-04-16T14:29:49Z

PR Description:

Overview
This PR significantly improves the performance of innerproduct_gemm_fp16s_sse on x86 architectures by mitigating severe loop-carried dependency stalls and replacing inefficient instructions in the microkernels.

Problem
The existing innerproduct_gemm_fp16s implementation had two major bottlenecks:

Loop-carried Stalls: The FMA loop accumulated into the same destination registers (e.g. _sum0 to _sum3) on every iteration. Given the 4-5 cycle latency of typical FMA instructions, each FMA had to wait for the previous result in its register, stalling the pipeline and leaving the FMA execution ports underused.
Instruction Inefficiency: FP16->FP32 conversions went through _mm256_extractf128_si256, which is slightly more expensive than loading 128 bits directly and letting the load fuse into the conversion as a memory operand.
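A rough way to size the first bottleneck: if one FMA has a latency of L cycles and the core can start T FMAs per cycle, about L x T independent accumulation chains must be in flight to keep the FMA units busy. The values below are illustrative assumptions (L taken from the 4-5 cycle range quoted above, T = 2 assumed for the sake of the example), not measurements from this PR.

```latex
N_{\text{acc}} \;\ge\; L \cdot T,
\qquad
L = 4,\; T = 2 \;\Rightarrow\; N_{\text{acc}} \ge 8,
\qquad
\text{utilization with one chain} \approx \frac{1}{L \cdot T} = \frac{1}{8}
```

Under these assumptions, a kernel with fewer independent accumulators than N_acc is bound by FMA latency rather than FMA throughput.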

Solution

Loop Unrolling & Accumulator Expansion: Fully unrolled the loops in the elempack == 1 blocks with num_output_elempack == 8, 16 and 4. These blocks now use up to 8 independent accumulator registers, so consecutive FMAs no longer wait on each other's results and the FMA latency is hidden.
Memory Operand Fusion: Dropped _mm256_extractf128_si256 in favor of direct 128-bit loads (_mm_lddqu_si128) that feed _mm256_cvtph_ps, reducing register traffic.
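The accumulator expansion can be illustrated with a minimal scalar sketch. This is not the code from this PR (which works on AVX registers in the packed microkernels); the function names `dot_single_acc` and `dot_multi_acc` are hypothetical, and four accumulators are used only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>

// Baseline: a single accumulator. Each multiply-add depends on the
// previous one, so the loop is bound by the latency of that chain.
static float dot_single_acc(const float* a, const float* b, std::size_t n)
{
    assert(n == 0 || (a && b));
    float sum = 0.f;
    for (std::size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

// Unrolled by four with four independent accumulators. The four chains
// do not depend on each other, so their multiply-adds can overlap.
static float dot_multi_acc(const float* a, const float* b, std::size_t n)
{
    assert(n == 0 || (a && b));
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    // Leftover elements when n is not a multiple of four.
    for (; i < n; i++)
        s0 += a[i] * b[i];
    // Combine the partial sums once, outside the hot loop.
    return (s0 + s1) + (s2 + s3);
}
```

Compilers generally do not perform this split on floating-point code by themselves unless reassociation is explicitly allowed, because it changes the order of the additions. For the same reason the two functions can differ in the last bits of the result on general inputs.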

Benchmark Results (AMD Family 17h, 12 Cores, FP16 mode)
Tested via benchncnn 4 6 2 -1:

| Model | Original (min) | PR (min) | Improvement |
| --- | --- | --- | --- |
| `vgg16` | 503.22 | 417.70 | ~17.0% |
| `resnet50` | 326.63 | 284.35 | ~13.0% |
| `resnet18` | 114.28 | 98.78 | ~13.5% |

AMDuProf Analysis:
Profiling the single-process CPU runtime confirms that CPU_TIME for the innerproduct_gemm_fp16s_sse hotspot dropped significantly:

Before optimization: 67.87s
After optimization: 55.14s (total execution time lowered by more than 12s).

Commit Split:

Optimize innerproduct x86 fp16s gemm using fused loads and fully unrolled FMA (addresses the 1x8 case together with the load logic rewrite).
Alleviate loop-carried stalls in innerproduct fp16s microkernels by unrolling (extends the fix to the 1x16 and 1x4 blocks).

tencent-adm · 2026-04-16T14:30:11Z

Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Edwardssss
❌ nihui
You have signed the CLA already but the status is still pending? Let us recheck it.

codecov-commenter · 2026-04-17T00:46:57Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.45%. Comparing base (b9c6e63) to head (0a5b48c).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6682      +/-   ##
==========================================
- Coverage   95.46%   95.45%   -0.02%     
==========================================
  Files         937      937              
  Lines      312116   313191    +1075     
==========================================
+ Hits       297970   298946     +976     
- Misses      14146    14245      +99


Copilot

Pull request overview

This PR optimizes x86 innerproduct GEMM paths by unrolling scalar-input/output-pack microkernels and reducing dependency chains in accumulation.

Changes:

Expanded accumulators and unrolled loops for elempack == 1 with output packs 16, 8, and 4.
Replaced some fp16 conversion load sequences with direct 128-bit loads in AVX paths.
Updated Windows CI CMake options to disable BF16 in several build configurations.
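The direct 128-bit load change listed above can be sketched as follows. This is an illustration under assumptions, not the code from `innerproduct_gemm_fp.h`: the helper names are hypothetical, the "extract" variant is only a guess at the general shape of the old sequence, and the intrinsic paths are compiled only when AVX and F16C are enabled (otherwise a portable reference conversion is used).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#if defined(__AVX__) && defined(__F16C__)
#include <immintrin.h>
#endif

// Portable reference: decode one IEEE 754 binary16 value.
static float half_to_float_ref(std::uint16_t h)
{
    const int sign = h >> 15;
    const int exp = (h >> 10) & 0x1f;
    const int man = h & 0x3ff;
    float v;
    if (exp == 0)
        v = std::ldexp((float)man, -24); // zero or subnormal
    else if (exp == 31)
        v = man ? NAN : INFINITY;
    else
        v = std::ldexp((float)(man | 0x400), exp - 25); // normal
    return sign ? -v : v;
}

// Variant A: one 256-bit load, then split it with extractf128.
static void cvt16_extract(const std::uint16_t* src, float* dst)
{
    assert(src && dst);
#if defined(__AVX__) && defined(__F16C__)
    __m256i v = _mm256_lddqu_si256((const __m256i*)src);
    _mm256_storeu_ps(dst, _mm256_cvtph_ps(_mm256_extractf128_si256(v, 0)));
    _mm256_storeu_ps(dst + 8, _mm256_cvtph_ps(_mm256_extractf128_si256(v, 1)));
#else
    for (int i = 0; i < 16; i++) dst[i] = half_to_float_ref(src[i]);
#endif
}

// Variant B: two 128-bit loads that feed the conversion directly.
static void cvt16_direct(const std::uint16_t* src, float* dst)
{
    assert(src && dst);
#if defined(__AVX__) && defined(__F16C__)
    _mm256_storeu_ps(dst, _mm256_cvtph_ps(_mm_lddqu_si128((const __m128i*)src)));
    _mm256_storeu_ps(dst + 8, _mm256_cvtph_ps(_mm_lddqu_si128((const __m128i*)(src + 8))));
#else
    for (int i = 0; i < 16; i++) dst[i] = half_to_float_ref(src[i]);
#endif
}
```

Both variants produce the same 16 floats. Variant B avoids the extra shuffle-style instruction, and since the conversion instruction accepts a 128-bit memory operand, the compiler may fold the load into it; whether it actually does so depends on the compiler and flags.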

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `src/layer/x86/innerproduct_gemm_fp.h` | Optimizes x86 innerproduct GEMM accumulation and fp16 load/conversion paths. |
| `.github/workflows/windows.yml` | Changes Windows CI build flags, including disabling BF16 for ncnn builds. |

nihui

LGTM

nihui · 2026-05-18T10:48:15Z

Thanks for your contribution!

Edwardssss added 2 commits April 16, 2026 21:38

Optimize innerproduct x86 fp16s gemm using fused loads and fully unrolled FMA

acf283d

Alleviate loop-carried stalls in innerproduct fp16s microkernels by unrolling

904955b

github-actions Bot added the x86 label Apr 16, 2026

ci: Disable BF16 in Windows workflow to bypass MSVC GEMM test failures

b2fd28c

nihui requested a review from Copilot May 18, 2026 07:42

Copilot started reviewing on behalf of nihui May 18, 2026 07:42

Copilot AI reviewed May 18, 2026

Comment thread .github/workflows/windows.yml Outdated

Merge branch 'master' into opt-innerproduct-x86-fp16s-fma

ddc73f0

Copilot AI mentioned this pull request May 18, 2026

Revert Windows CI BF16 disable flags from PR #6682 #6728

Closed

revert

f48bc04

nihui approved these changes May 18, 2026

Merge branch 'master' into opt-innerproduct-x86-fp16s-fma

0a5b48c

nihui closed this May 18, 2026

nihui reopened this May 18, 2026

nihui merged commit 4681f2e into Tencent:master May 18, 2026
80 of 84 checks passed

nihui merged 6 commits into Tencent:master from Edwardssss:opt-innerproduct-x86-fp16s-fma

5 participants
