Optimize x86 fp16s innerproduct gemm to eliminate loop-carried stalls #6682

Merged: nihui merged 6 commits into Tencent:master from Edwardssss:opt-innerproduct-x86-fp16s-fma on May 18, 2026

@Edwardssss
Contributor

PR Description:

Overview
This PR significantly improves the performance of innerproduct_gemm_fp16s_sse on x86 architectures by mitigating severe loop-carried dependency stalls and replacing inefficient instructions in the microkernels.

Problem
The existing innerproduct_gemm_fp16s implementation had two major bottlenecks:

  1. Loop-carried stalls: The FMA loop reused the same destination registers (e.g. _sum0 to _sum3) in every accumulation iteration. With a typical FMA latency of 4-5 cycles, each FMA had to wait for the previous result written to its register, which stalled the pipeline and left the FMA execution ports underused.
  2. Instruction inefficiency: FP16->FP32 conversion used _mm256_extractf128_si256 to split a 256-bit register, which is slightly more expensive than letting the conversion take its 128-bit input directly from memory (memory operand fusion).
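
The stall in item 1 can be illustrated with a minimal sketch. It uses plain scalar C++ rather than the AVX intrinsics in the PR, and the function names `dot_single_chain` and `dot_four_chains` are hypothetical. The first loop has one accumulator, so each multiply-add must wait for the previous one. The second keeps four independent accumulators, so consecutive multiply-adds can overlap in the pipeline.

```cpp
#include <cassert>
#include <cstddef>

// One accumulator: iteration i cannot start its add until iteration i-1
// has finished, so the loop is bound by the multiply-add latency.
float dot_single_chain(const float* a, const float* b, std::size_t n)
{
    assert(a != nullptr && b != nullptr);
    float sum = 0.f;
    for (std::size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

// Four accumulators: the four updates in one iteration do not depend on
// each other, so the CPU can keep several of them in flight at once.
float dot_four_chains(const float* a, const float* b, std::size_t n)
{
    assert(a != nullptr && b != nullptr);
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    std::size_t i = 0;
    for (; i + 3 < n; i += 4)
    {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    // Remaining elements when n is not a multiple of 4.
    for (; i < n; i++)
        s0 += a[i] * b[i];
    // Combine the partial sums once, after the loop.
    return (s0 + s1) + (s2 + s3);
}
```

Splitting the sum changes the order of floating-point additions, so the two functions can differ in the last bits for general inputs. Without relaxed floating-point flags a compiler is generally not allowed to make this transformation on its own, which is why it is usually written by hand.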

Solution

  • Loop unrolling and accumulator expansion: Fully unrolled the loops in the elempack == 1 blocks for num_output_elempack == 8, 16, and 4. These now use up to 8 independent accumulation registers, which breaks the loop-carried dependency chain and hides FMA latency.
  • Memory operand fusion: Dropped _mm256_extractf128_si256 in favor of direct 128-bit loads (_mm_lddqu_si128) fed straight into _mm256_cvtph_ps, reducing register traffic.
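
The two load strategies can be sketched side by side. This is an illustration under stated assumptions, not the PR's code: the function names `convert16_extract` and `convert16_fused` are hypothetical, and the intrinsic path is only compiled when AVX and F16C are enabled (for example with `-mavx -mf16c`); otherwise a scalar fallback is used so the sketch still builds. Whether the 128-bit load is actually folded into the conversion instruction's memory operand is up to the compiler.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__AVX__) && defined(__F16C__)
#include <immintrin.h>

// Old pattern: one 256-bit load, then two lane extracts before converting.
void convert16_extract(const uint16_t* src, float* dst)
{
    assert(src != nullptr && dst != nullptr);
    __m256i w = _mm256_lddqu_si256((const __m256i*)src);
    _mm256_storeu_ps(dst, _mm256_cvtph_ps(_mm256_extractf128_si256(w, 0)));
    _mm256_storeu_ps(dst + 8, _mm256_cvtph_ps(_mm256_extractf128_si256(w, 1)));
}

// New pattern: two 128-bit loads feed the conversions directly.
void convert16_fused(const uint16_t* src, float* dst)
{
    assert(src != nullptr && dst != nullptr);
    _mm256_storeu_ps(dst, _mm256_cvtph_ps(_mm_lddqu_si128((const __m128i*)src)));
    _mm256_storeu_ps(dst + 8, _mm256_cvtph_ps(_mm_lddqu_si128((const __m128i*)(src + 8))));
}
#else
// Scalar fallback: convert one IEEE 754 binary16 value to float.
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0 && man == 0)
    {
        bits = sign; // signed zero
    }
    else if (exp == 0)
    {
        // Subnormal half: shift the mantissa up until the implicit bit appears.
        int shift = 0;
        while ((man & 0x400u) == 0)
        {
            man <<= 1;
            shift++;
        }
        bits = sign | ((uint32_t)(113 - shift) << 23) | ((man & 0x3FFu) << 13);
    }
    else if (exp == 31)
    {
        bits = sign | 0x7F800000u | (man << 13); // infinity or NaN
    }
    else
    {
        bits = sign | ((exp + 112u) << 23) | (man << 13); // rebias 15 -> 127
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

void convert16_extract(const uint16_t* src, float* dst)
{
    assert(src != nullptr && dst != nullptr);
    for (int i = 0; i < 16; i++)
        dst[i] = half_to_float(src[i]);
}

void convert16_fused(const uint16_t* src, float* dst)
{
    convert16_extract(src, dst); // no distinct load strategy without F16C
}
#endif
```

Both functions produce the same 16 floats; the difference is only in how many instructions and temporary registers sit between the load and the conversion.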

Benchmark Results (AMD Family 17h, 12 Cores, FP16 mode)
Tested via benchncnn 4 6 2 -1:

| Model    | Original (min) | PR (min) | Improvement |
|----------|----------------|----------|-------------|
| vgg16    | 503.22         | 417.70   | ~17.0%      |
| resnet50 | 326.63         | 284.35   | ~13.0%      |
| resnet18 | 114.28         | 98.78    | ~13.5%      |
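
The improvement figures are consistent with the relative reduction in minimum time, (original - PR) / original. A small sketch of that arithmetic, with the hypothetical helper name `improvement_percent`:

```cpp
#include <cassert>
#include <cmath>

// Relative reduction of `after` with respect to `before`, in percent.
double improvement_percent(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

For vgg16 this gives (503.22 - 417.70) / 503.22, which is about 17.0%.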

AMDuProf Analysis:
Profiling the single-process CPU runtime confirms that CPU_TIME for the innerproduct_gemm_fp16s_sse hotspot dropped significantly:

  • Before optimization: 67.87s
  • After optimization: 55.14s (more than 12s lower).

Commit Split:

  1. Optimize innerproduct x86 fp16s gemm using fused loads and fully unrolled FMA (addresses the 1x8 case together with the load logic rewrite).
  2. Alleviate loop-carried stalls in innerproduct fp16s microkernels by unrolling (extends the same fix to the 1x16 and 1x4 blocks).

github-actions bot added the x86 label on Apr 16, 2026
@tencent-adm
Member

tencent-adm commented Apr 16, 2026

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Edwardssss
❌ nihui

@codecov-commenter

codecov-commenter commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.45%. Comparing base (b9c6e63) to head (0a5b48c).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6682      +/-   ##
==========================================
- Coverage   95.46%   95.45%   -0.02%     
==========================================
  Files         937      937              
  Lines      312116   313191    +1075     
==========================================
+ Hits       297970   298946     +976     
- Misses      14146    14245      +99     


Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes x86 innerproduct GEMM paths by unrolling scalar-input/output-pack microkernels and reducing dependency chains in accumulation.

Changes:

  • Expanded accumulators and unrolled loops for elempack == 1 with output packs 16, 8, and 4.
  • Replaced some fp16 conversion load sequences with direct 128-bit loads in AVX paths.
  • Updated Windows CI CMake options to disable BF16 in several build configurations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|------|-------------|
| src/layer/x86/innerproduct_gemm_fp.h | Optimizes x86 innerproduct GEMM accumulation and fp16 load/conversion paths. |
| .github/workflows/windows.yml | Changes Windows CI build flags, including disabling BF16 for ncnn builds. |

Comment thread .github/workflows/windows.yml Outdated
Member

nihui left a comment

LGTM

nihui closed this on May 18, 2026
nihui reopened this on May 18, 2026
nihui merged commit 4681f2e into Tencent:master on May 18, 2026
80 of 84 checks passed
@nihui
Member

nihui commented May 18, 2026

Thanks for your contribution!

5 participants