Optimize x86 fp16s innerproduct gemm to eliminate loop-carried stalls #6682
Codecov Report
✅ All modified and coverable lines are covered by tests.
| Coverage Diff | master | #6682 | +/- |
|---|---|---|---|
| Coverage | 95.46% | 95.45% | -0.02% |
| Files | 937 | 937 | |
| Lines | 312116 | 313191 | +1075 |
| Hits | 297970 | 298946 | +976 |
| Misses | 14146 | 14245 | +99 |
Pull request overview
This PR optimizes x86 innerproduct GEMM paths by unrolling scalar-input/output-pack microkernels and reducing dependency chains in accumulation.
Changes:
- Expanded accumulators and unrolled loops for `elempack == 1` with output packs 16, 8, and 4.
- Replaced some fp16 conversion load sequences with direct 128-bit loads in AVX paths.
- Updated Windows CI CMake options to disable BF16 in several build configurations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/layer/x86/innerproduct_gemm_fp.h` | Optimizes x86 innerproduct GEMM accumulation and fp16 load/conversion paths. |
| `.github/workflows/windows.yml` | Changes Windows CI build flags, including disabling BF16 for ncnn builds. |
Thanks for your contribution!
PR Description:
Overview
This PR significantly improves the performance of `innerproduct_gemm_fp16s_sse` on x86 architectures by mitigating severe loop-carried dependency stalls and replacing inefficient instructions in the microkernels.
Problem
The existing `innerproduct_gemm_fp16s` implementation had two major bottlenecks:
- Loop-carried dependencies on the accumulators (`_sum0` to `_sum3`) in accumulation iterations. Given the 4-5 cycle latency of typical FMA instructions, this caused severe pipeline stalls and left execution ports idle.
- Use of `_mm256_extractf128_si256` for FP16->FP32 conversions, which is slightly more expensive than memory-operand fusion.
Solution
- Unrolled the `elempack == 1` & `num_output_elempack == 8 / 16 / 4` blocks. We now use up to 8 independent accumulation registers to break the loop-carried dependency chain, hiding FMA latency.
- Dropped `_mm256_extractf128_si256` in favor of direct 128-bit loads (`_mm_lddqu_si128`) fused into `_mm256_cvtph_ps`, reducing register traffic.
Benchmark Results (AMD Family 17h, 12 Cores, FP16 mode)
Tested via `benchncnn 4 6 2 -1`:
- vgg16
- resnet50
- resnet18
AMDuProf Analysis:
Profiling the single-process CPU runtime confirms that CPU_TIME for the `innerproduct_gemm_fp16s_sse` hotspot dropped significantly, from 67.87s to 55.14s (total execution lowered by >12s).
Commit Split:
- `Optimize innerproduct x86 fp16s gemm using fused loads and fully unrolled FMA` (addressed the `1x8` case alongside the load logic rewrite).
- `Alleviate loop-carried stalls in innerproduct fp16s microkernels by unrolling` (extended the fixes to the `1x16` and `1x4` blocks).
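For readers unfamiliar with the accumulator-splitting idea described under Solution, here is a minimal scalar sketch. It is not the code from `innerproduct_gemm_fp.h`: the function names `dot_single_chain` and `dot_eight_chains` are hypothetical, and plain `float` arithmetic stands in for the AVX FMA intrinsics so the example compiles without special flags. It only illustrates why several independent accumulators shorten the dependency chain.

```cpp
#include <cstddef>

// Baseline: a single accumulator. Each multiply-add depends on the result
// of the previous one, so the loop runs at one multiply-add per FMA latency.
float dot_single_chain(const float* a, const float* b, std::size_t n)
{
    float sum = 0.f;
    for (std::size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

// Unrolled by 8 with eight independent accumulators. The eight multiply-adds
// inside one iteration do not depend on each other, so the CPU can overlap them.
float dot_eight_chains(const float* a, const float* b, std::size_t n)
{
    float s[8] = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f};

    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
    {
        s[0] += a[i + 0] * b[i + 0];
        s[1] += a[i + 1] * b[i + 1];
        s[2] += a[i + 2] * b[i + 2];
        s[3] += a[i + 3] * b[i + 3];
        s[4] += a[i + 4] * b[i + 4];
        s[5] += a[i + 5] * b[i + 5];
        s[6] += a[i + 6] * b[i + 6];
        s[7] += a[i + 7] * b[i + 7];
    }

    // Leftover elements (fewer than 8) go into the first accumulator.
    for (; i < n; i++)
        s[0] += a[i] * b[i];

    // Pairwise reduction of the partial sums, done once after the loop.
    return ((s[0] + s[1]) + (s[2] + s[3])) + ((s[4] + s[5]) + (s[6] + s[7]));
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits on general inputs, and compilers will not perform this reassociation on their own unless given flags such as `-ffast-math`. As a rough rule of thumb, the number of independent chains needed is about the FMA latency times the number of FMAs issued per cycle; assuming 2 FMAs per cycle (an assumption, not a figure from this PR) and the 4-5 cycle latency quoted above, that gives roughly 8 to 10 chains, in line with the "up to 8" accumulators used here.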