massive mips and loongarch optimization by nihui · Pull Request #6662 · Tencent/ncnn

nihui · 2026-04-09T08:56:07Z

No description provided.

tencent-adm · 2026-04-09T08:56:26Z

Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

codecov-commenter · 2026-04-09T09:01:43Z

Codecov Report

❌ Patch coverage is 98.19639% with 117 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.95%. Comparing base (4681f2e) to head (069b4a5).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
src/layer/loongarch/convolution_loongarch.cpp	75.40%	107 Missing ⚠️
src/layer/loongarch/convolution_packed_bf16s.h	99.75%	3 Missing ⚠️
src/layer/loongarch/convolution_packed_int8.h	98.87%	3 Missing ⚠️
src/layer/loongarch/convolution1d_loongarch.cpp	95.00%	2 Missing ⚠️
src/layer/loongarch/binaryop_loongarch.cpp	99.80%	1 Missing ⚠️
src/layer/loongarch/convolution_packed.h	99.88%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6662      +/-   ##
==========================================
+ Coverage   95.84%   95.95%   +0.10%     
==========================================
  Files         937      970      +33     
  Lines      313206   403584   +90378     
==========================================
+ Hits       300205   387240   +87035     
- Misses      13001    16344    +3343

Add jj+=12 loop unrolling to pack_B_tile, transpose_pack_B_tile, transpose_unpack_output_tile, and gemm_transB_packed_tile for all ii sections (8, 4, 2, 1).

MIPS MSA has 32 SIMD registers so jj+=12 fits well (24 registers for ii+=8, 12 for ii+=4).

Update get_optimal_tile_mnk to align TILE_N to multiples of 12 for better utilization of the new kernel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
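The register budget and the TILE_N alignment described in this commit message can be checked with a little arithmetic. The sketch below is illustrative only: `round_up_to_multiple`, `accumulator_registers` and `pick_tile_n` are hypothetical names, not functions from the PR, and the real get_optimal_tile_mnk most likely weighs cache size and thread count as well. It assumes 4 fp32 lanes per 128-bit MSA register.

```cpp
#include <cassert>

// Hypothetical helper: round n up to the next multiple of step (step > 0).
constexpr int round_up_to_multiple(int n, int step)
{
    return (n + step - 1) / step * step;
}

// Number of 128-bit accumulator registers an ii x jj fp32 output tile needs,
// assuming 4 fp32 lanes per MSA register.
constexpr int accumulator_registers(int ii, int jj)
{
    return ii * jj / 4;
}

// Hypothetical tile rule: keep TILE_N a multiple of 12, capped at max_tile_n.
inline int pick_tile_n(int N, int max_tile_n)
{
    assert(N > 0 && max_tile_n >= 12);
    const int tile_n = round_up_to_multiple(N, 12);
    const int cap = max_tile_n / 12 * 12; // largest multiple of 12 not above the cap
    return tile_n < cap ? tile_n : cap;
}

// Matches the counts quoted in the commit message (32 registers available).
static_assert(accumulator_registers(8, 12) == 24, "ii+=8 uses 24 accumulators");
static_assert(accumulator_registers(4, 12) == 12, "ii+=4 uses 12 accumulators");
```

With ii+=8 the 24 accumulators leave 8 of the 32 registers free for loading A and B operands, which is presumably why the commit calls jj+=12 a good fit.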

…ngArch

Integrate bf16 storage support into multiple operators:
- MIPS: batchnorm, clip, dropout, selu, erf
- LoongArch: batchnorm, clip, dropout

Each operator now declares forward_inplace_bf16s in its header, sets support_bf16_storage=true in the constructor, dispatches bf16 inputs from forward_inplace, and implements the bf16s path using the existing bf16s helper headers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add support_bf16_storage = true in constructors for both architectures
- Add crop_pack4_bf16s_msa() for MIPS MSA using int64_t copies (8 bytes)
- Add crop_pack4_bf16s_lsx() for LoongArch LSX using int64_t copies
- Add crop_pack8_lasx() for LoongArch LASX float pack8 (256-bit)
- Add crop_pack8_bf16s_lsx() for LoongArch LASX bf16 pack8 (128-bit)
- Dispatch to bf16 variants when elemsize matches bf16 packing
- Remove debug fprintf statements from MIPS deconvolution_packed_bf16s.h

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
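The byte widths in the list above follow from elemsize = bytes per element x elempack: pack4 bf16 is 4 x 2 = 8 bytes (one int64_t copy) and pack8 bf16 is 8 x 2 = 16 bytes (one 128-bit register). A rough sketch of such a dispatch follows; the `CropKernel` enum and `select_crop_kernel` are hypothetical names for illustration, and the PR's actual dispatch code is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical labels for the kernel variants named in the commit message.
enum class CropKernel
{
    Fp32Pack4,
    Bf16Pack4,
    Fp32Pack8,
    Bf16Pack8,
    Generic
};

// Pick a kernel from the packed element size in bytes and the pack width.
inline CropKernel select_crop_kernel(size_t elemsize, int elempack)
{
    assert(elempack > 0);
    const size_t lanes = static_cast<size_t>(elempack);
    const size_t bf16_bytes = sizeof(uint16_t) * lanes; // 2 bytes per bf16 lane
    const size_t fp32_bytes = sizeof(float) * lanes;    // 4 bytes per fp32 lane

    if (elempack == 4 && elemsize == bf16_bytes) return CropKernel::Bf16Pack4; // 8 bytes
    if (elempack == 4 && elemsize == fp32_bytes) return CropKernel::Fp32Pack4; // 16 bytes
    if (elempack == 8 && elemsize == bf16_bytes) return CropKernel::Bf16Pack8; // 16 bytes
    if (elempack == 8 && elemsize == fp32_bytes) return CropKernel::Fp32Pack8; // 32 bytes
    return CropKernel::Generic;
}
```

Note that pack4 fp32 and pack8 bf16 are both 16 bytes per element, so elemsize alone is ambiguous; the sketch keys on elempack as well.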

Add interp_bilinear_pack8.h and interp_bicubic_pack8.h implementing 256-bit SIMD (8 floats) resize operations using LASX intrinsics.

Update interp_loongarch.cpp to:
- Include lasxintrin.h and the new pack8 headers under __loongarch_asx
- Add elempack == 8 paths for dims 1, 2, and 3 (nearest, bilinear, bicubic)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… approach

- Replace hand-written kernel packing and convolution loops with convolution1d_transform_kernel_packed() and convolution1d_packed() from convolution1d_packed.h
- Rename weight_data_packed to weight_data_tm to match x86 pattern
- Add LASX (256-bit) support with pack8 out_elempack
- Add NCNN_BF16 support using cast-based approach (bf16->fp32->conv->bf16)
- Add bf16 weight/bias cast in dynamic weight forward path
- Include cpu.h, lasxintrin.h headers for new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
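The cast-based approach (bf16->fp32->compute->bf16) relies on bf16 being the upper 16 bits of an IEEE fp32 value. A scalar sketch follows, with a simple multiply standing in for the convolution. The helper names are hypothetical, and narrowing by truncation is an assumption here; the helpers used in the PR may round differently.

```cpp
#include <cstdint>
#include <cstring>

// Narrow fp32 to bf16 by keeping the upper 16 bits (truncation, assumed).
inline uint16_t float_to_bf16_truncate(float v)
{
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Widen bf16 to fp32 by placing it in the upper 16 bits; this step is exact.
inline float bf16_to_float(uint16_t v)
{
    const uint32_t bits = static_cast<uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Cast-based pipeline: widen, compute in fp32, narrow back to bf16.
// The multiply is a stand-in for the real fp32 kernel.
inline uint16_t scale_bf16(uint16_t x, float scale)
{
    return float_to_bf16_truncate(bf16_to_float(x) * scale);
}
```

The trade-off is that the arithmetic reuses the existing fp32 kernels unchanged, at the cost of two conversion passes over the data.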

nihui · 2026-05-05T13:43:33Z

3a6000 loongnix-20
4.19.0-19-loongson-3
gcc 8.3.0

1t	baseline	pr6662	pr6662-bf16s
squeezenet	21.25	12.82	13.05
squeezenet_int8	35.65	13.23	12.63
mobilenet	37.77	22.03	25.56
mobilenet_int8	75.81	24.87	25.59
mobilenet_v2	25.06	16.10	17.05
mobilenet_v3	19.97	13.62	14.69
shufflenet	12.67	8.13	10.07
shufflenet_v2	12.24	8.84	14.43
mnasnet	25.07	16.17	16.97
proxylessnasnet	30.95	19.47	18.32
efficientnet_b0	49.33	32.41	33.05
efficientnetv2_b0	55.41	34.56	37.89
regnety_400m	33.78	20.21	23.00
blazeface	5.34	2.06	2.71
googlenet	87.11	46.94	48.28
googlenet_int8	133.64	46.47	45.32
resnet18	68.85	40.69	41.37
resnet18_int8	114.22	37.45	36.93
alexnet	96.10	29.38	29.89
vgg16	360.85	205.36	189.28
vgg16_int8	631.80	216.57	213.89
resnet50	187.97	108.57	118.25
resnet50_int8	295.43	95.68	94.19
squeezenet_ssd	62.01	37.09	36.14
squeezenet_ssd_int8	81.80	35.83	35.54
mobilenet_ssd	75.75	47.04	53.45
mobilenet_ssd_int8	147.48	49.18	50.39
mobilenet_yolo	197.19	108.46	129.84
mobilenetv2_yolov3	86.50	58.05	60.71
yolov4-tiny	117.88	74.90	71.26
nanodet_m	28.51	20.61	32.22
yolo-fastest-1.1	11.88	9.12	10.54
yolo-fastestv2	10.55	9.13	9.43
vision_transformer	4582.75	463.24	559.67
FastestDet	11.81	10.38	10.17

4t	baseline	pr6662	pr6662-bf16s
squeezenet	7.81	5.18	4.50
squeezenet_int8	10.22	5.47	4.95
mobilenet	12.95	7.21	6.79
mobilenet_int8	19.32	7.44	7.61
mobilenet_v2	8.57	6.18	5.11
mobilenet_v3	6.71	5.51	5.22
shufflenet	4.62	4.07	4.31
shufflenet_v2	4.40	4.24	5.67
mnasnet	7.72	5.40	5.32
proxylessnasnet	9.19	6.04	5.74
efficientnet_b0	14.82	10.85	9.77
efficientnetv2_b0	18.47	12.68	12.42
regnety_400m	14.93	11.15	12.31
blazeface	1.84	0.85	1.05
googlenet	28.37	18.81	17.46
googlenet_int8	36.93	18.89	18.32
resnet18	25.74	19.43	18.74
resnet18_int8	32.41	15.41	15.08
alexnet	31.40	16.05	14.43
vgg16	154.29	111.99	100.75
vgg16_int8	189.38	90.29	84.84
resnet50	66.72	42.46	40.87
resnet50_int8	78.62	37.31	33.88
squeezenet_ssd	28.95	19.86	18.49
squeezenet_ssd_int8	28.73	18.73	15.63
mobilenet_ssd	28.96	16.80	15.12
mobilenet_ssd_int8	38.48	16.25	16.07
mobilenet_yolo	96.66	41.22	45.44
mobilenetv2_yolov3	35.14	25.40	18.41
yolov4-tiny	56.07	37.81	32.35
nanodet_m	10.74	9.49	12.20
yolo-fastest-1.1	4.57	4.78	5.39
yolo-fastestv2	4.33	5.50	5.12
vision_transformer	1217.27	134.59	161.67
FastestDet	4.73	5.51	5.09

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd6b5905e4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1c814c823

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a2119f149

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f12a23258c

chatgpt-codex-connector · 2026-05-18T13:42:42Z

+                v4f32 _beta = (v4f32)__msa_ld_w(beta_ptr, 0);
+                _p = __ncnn_msa_fmadd_w(_mean0, _p, _var0);
+                _p = __ncnn_msa_fmadd_w(_beta, _p, _gamma);
+                *(int64_t*)ptr = __msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0);


Replace unaligned int64 stores in bf16 LayerNorm tail loop

This tail path writes bf16 results via *(int64_t*)ptr, but ptr is not guaranteed to be 8-byte aligned for elempack == 1 (row starts are computed as data + w * y * elemsize, so odd w on bf16 rows yields only 2-byte alignment). On MIPS targets with strict alignment, this can raise an address error (and is UB in C++ even when it “works”), so valid shapes can crash in forward_inplace_bf16s; use an unaligned-safe store helper (memcpy/__msa_storel_d) instead.
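One way to make the 8-byte store safe regardless of alignment is to copy through memcpy, as the review suggests. The sketch below is a minimal illustration, not the fix that was merged: `store_4xbf16_unaligned` is a hypothetical name, and `packed` stands for the 64-bit lane that `__msa_copy_s_d` returns in the quoted diff.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: write four packed bf16 values (8 bytes) to dst, which
// only needs the 2-byte alignment of a bf16 element. Compilers can lower this
// to an unaligned-safe store sequence for the target, so it avoids both the
// undefined behavior and the possible address error of *(int64_t*)dst = packed.
inline void store_4xbf16_unaligned(uint16_t* dst, int64_t packed)
{
    assert(dst != nullptr);
    std::memcpy(dst, &packed, sizeof(packed));
}
```

In the tail loop quoted above, the store would then read `store_4xbf16_unaligned(ptr, __msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0));`, assuming `ptr` is an `unsigned short*`.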

chatgpt-codex-connector · 2026-05-18T13:42:42Z

+                v4f32 _gamma = (v4f32)__msa_ld_w(gamma_ptr, 0);
+                _p = __msa_fmul_w(_p, _rms0);
+                _p = __msa_fmul_w(_p, _gamma);
+                *(int64_t*)ptr = __msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0);


Use unaligned-safe bf16 stores in RMSNorm vector tail

The bf16 vector tail stores through *(int64_t*)ptr, which assumes 8-byte alignment that is not guaranteed for pack1 tensors (row offsets are w * elemsize, so odd widths can produce misaligned row pointers). That makes this path undefined behavior and can fault on MIPS alignment checks, causing runtime crashes for otherwise valid inputs; switch to an unaligned-safe 64-bit store pattern (memcpy/__msa_storel_d).

nihui · 2026-05-18T13:55:50Z

3a4000

3a6000

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 069b4a59a5

chatgpt-codex-connector · 2026-05-18T15:05:00Z

+            v4f32 _p = bfloat2float_msa(ptr);
+            _p = __msa_fsub_w(_p, _max);
+            _p = exp_ps(_p);
+            *(int64_t*)ptr = __msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0);


Replace unaligned bf16 vector store with safe helper

This bf16 softmax kernel writes through *(int64_t*)ptr, but ptr can come from row-based offsets (for example bottom_top_blob.row<unsigned short>(i) in forward_inplace_bf16s), and row starts are computed as w * y * elemsize, which is only 2-byte aligned for bf16 and can be misaligned for odd w. On strict-alignment MIPS targets this can fault at runtime (and is undefined behavior in C++), so valid bf16 inputs can crash in softmax; use an unaligned-safe store path such as __msa_storel_d/memcpy.
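The alignment claim can be verified by plugging numbers into the row-offset formula quoted in the comment. The sketch below assumes the base data pointer itself is at least 8-byte aligned, so only the offset matters; `row_byte_offset` and `is_aligned` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of row y, following the w * y * elemsize formula from the review.
inline size_t row_byte_offset(size_t w, size_t y, size_t elemsize)
{
    return w * y * elemsize;
}

// True when byte_offset is a multiple of alignment (alignment > 0).
inline bool is_aligned(size_t byte_offset, size_t alignment)
{
    assert(alignment > 0);
    return byte_offset % alignment == 0;
}
```

For bf16 (elemsize 2) with w = 7, row 1 starts at byte 14, which is 2-byte aligned but not 8-byte aligned, so an int64_t store there is misaligned. With w = 8 the same row starts at byte 16 and the store happens to be aligned, which is why the problem only shows up for some shapes.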

massive mips and loongarch optimization

6529782

github-actions Bot added core loongarch mips labels Apr 9, 2026

opt

8a0c38d

nihui force-pushed the mips-opt3 branch from dc8fc0f to 8a0c38d Compare April 10, 2026 07:08

nihui and others added 11 commits April 10, 2026 07:10

apply code-format changes

8b2010e

wip

d1f9876

apply code-format changes

df9cac1

fix

f4bb8c7

wip

cccdcc2

wip

5e68a2f

nihui force-pushed the mips-opt3 branch from 6bbdc54 to 5e68a2f Compare April 15, 2026 02:33

github-actions Bot added the test label Apr 15, 2026

nihui and others added 3 commits April 15, 2026 02:35

apply code-format changes

19b564e

cc

e5b89af

cc

1d498d6

nihui force-pushed the mips-opt3 branch from 5431c84 to 1d498d6 Compare April 15, 2026 06:31

cc

bc43dcc

nihui force-pushed the mips-opt3 branch from 0720de1 to bc43dcc Compare April 15, 2026 07:30

nihui and others added 3 commits April 15, 2026 07:32

apply code-format changes

7ef3ae0

fix bias

d84a773

cc

39a8f45

nihui and others added 6 commits May 5, 2026 23:06

opt

21dea17

Merge branch 'Tencent:master' into mips-opt3

2ab4f7e

Merge branch 'master' into mips-opt3

a809ad8

Merge branch 'master' into mips-opt3

feb9fb1

opt++

a5c3fc8

apply code-format changes

cd6b590

nihui closed this May 11, 2026

nihui reopened this May 11, 2026

chatgpt-codex-connector Bot reviewed May 11, 2026

Comment thread src/layer/mips/rmsnorm_mips.cpp

Comment thread src/layer/loongarch/rmsnorm_loongarch.cpp

Merge branch 'master' into mips-opt3

b1c814c

chatgpt-codex-connector Bot reviewed May 12, 2026

Comment thread src/layer/mips/layernorm_mips.cpp

Comment thread src/layer/loongarch/layernorm_loongarch.cpp

Merge branch 'master' into mips-opt3

1a2119f

chatgpt-codex-connector Bot reviewed May 13, 2026

Comment thread src/layer/loongarch/layernorm_loongarch.cpp

Comment thread src/layer/mips/rmsnorm_mips.cpp

nihui added 9 commits May 13, 2026 19:12

opt

4694a6c

unroll kk

ac91091

re

434195c

w

b885070

Revert "w"

95873b8

This reverts commit b885070.

ww

572866d

f

8fb7334

Merge branch 'master' into mips-opt3

dc76c98

Merge branch 'master' into mips-opt3

f12a232

chatgpt-codex-connector Bot reviewed May 18, 2026

nihui added 3 commits May 18, 2026 22:38

gcc8 workaround

e2edce0

cc

340f51c

cc

069b4a5

chatgpt-codex-connector Bot reviewed May 18, 2026

nihui merged commit 0f5c6ef into Tencent:master May 19, 2026
162 of 166 checks passed

4 participants
