massive mips and loongarch optimization #6662

Merged
nihui merged 175 commits into Tencent:master from nihui:mips-opt3 on May 19, 2026
@nihui
Member

@nihui nihui commented Apr 9, 2026

No description provided.

@tencent-adm
Member

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@codecov-commenter

codecov-commenter commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 98.19639% with 117 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.95%. Comparing base (4681f2e) to head (069b4a5).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---:|---|
| src/layer/loongarch/convolution_loongarch.cpp | 75.40% | 107 Missing ⚠️ |
| src/layer/loongarch/convolution_packed_bf16s.h | 99.75% | 3 Missing ⚠️ |
| src/layer/loongarch/convolution_packed_int8.h | 98.87% | 3 Missing ⚠️ |
| src/layer/loongarch/convolution1d_loongarch.cpp | 95.00% | 2 Missing ⚠️ |
| src/layer/loongarch/binaryop_loongarch.cpp | 99.80% | 1 Missing ⚠️ |
| src/layer/loongarch/convolution_packed.h | 99.88% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6662      +/-   ##
==========================================
+ Coverage   95.84%   95.95%   +0.10%     
==========================================
  Files         937      970      +33     
  Lines      313206   403584   +90378     
==========================================
+ Hits       300205   387240   +87035     
- Misses      13001    16344    +3343     


nihui and others added 11 commits April 10, 2026 07:10
Add jj+=12 loop unrolling to pack_B_tile, transpose_pack_B_tile,
transpose_unpack_output_tile, and gemm_transB_packed_tile for all
ii sections (8, 4, 2, 1). MIPS MSA has 32 SIMD registers so
jj+=12 fits well (24 registers for ii+=8, 12 for ii+=4).

Update get_optimal_tile_mnk to align TILE_N to multiples of 12
for better utilization of the new kernel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
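
The register budget quoted in the commit message above can be checked with a little arithmetic. This is a minimal sketch, not code from the PR: `accumulators_needed` and `align_tile_n` are hypothetical helper names, it assumes one 128-bit MSA register holds 4 fp32 lanes with one accumulator per 4 output rows per column, and rounding TILE_N down (rather than up) is an assumption, since the commit message does not say which way `get_optimal_tile_mnk` rounds.

```cpp
#include <cassert>

// Assumption: one 128-bit MSA register holds 4 fp32 lanes.
constexpr int kLanes = 4;
// MSA has 32 SIMD registers, as stated in the commit message.
constexpr int kVectorRegisters = 32;

// Hypothetical helper: accumulator registers needed for an ii x jj output
// tile, assuming one register per group of 4 rows per column.
constexpr int accumulators_needed(int ii, int jj)
{
    return ((ii + kLanes - 1) / kLanes) * jj;
}

// Hypothetical helper: round a candidate TILE_N down to a multiple of the
// jj unroll factor, but never below one full unroll step.
inline int align_tile_n(int tile_n, int unroll = 12)
{
    assert(unroll > 0);
    const int aligned = tile_n / unroll * unroll;
    return aligned < unroll ? unroll : aligned;
}

// These match the figures in the commit message.
static_assert(accumulators_needed(8, 12) == 24, "ii+=8 needs 24 accumulators");
static_assert(accumulators_needed(4, 12) == 12, "ii+=4 needs 12 accumulators");
static_assert(accumulators_needed(8, 12) < kVectorRegisters, "registers remain for operands");
```

Under these assumptions the ii+=8 case leaves 8 of the 32 registers free for the A and B operands, which is presumably why jj+=12 "fits well".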
…ngArch

Integrate bf16 storage support into multiple operators:

MIPS: batchnorm, clip, dropout, selu, erf
LoongArch: batchnorm, clip, dropout

Each operator now declares forward_inplace_bf16s in its header,
sets support_bf16_storage=true in the constructor, dispatches bf16
inputs from forward_inplace, and implements the bf16s path using
the existing bf16s helper headers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
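
As a rough illustration of what a bf16s in-place path does, here is a scalar sketch of a clip operator working on bf16 storage. It is not the PR's code: the function names are hypothetical, the real kernels use MSA/LSX intrinsics and the existing bf16s helper headers, and converting fp32 to bf16 by truncation is an assumption made for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// bf16 is the upper 16 bits of an IEEE-754 binary32 value.
inline float bf16_to_float(uint16_t v)
{
    const uint32_t bits = static_cast<uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Assumption: convert by truncating the low 16 mantissa bits (no rounding).
inline uint16_t float_to_bf16(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Hypothetical scalar stand-in for a forward_inplace_bf16s-style clip:
// widen each element to fp32, clamp it, and narrow it back to bf16.
inline void clip_inplace_bf16(uint16_t* ptr, std::size_t n, float lo, float hi)
{
    assert(lo <= hi);
    for (std::size_t i = 0; i < n; i++)
    {
        float v = bf16_to_float(ptr[i]);
        if (v < lo) v = lo;
        if (v > hi) v = hi;
        ptr[i] = float_to_bf16(v);
    }
}
```

The arithmetic stays in fp32; only the storage is 16-bit, so the memory traffic per element is halved.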
- Add support_bf16_storage = true in constructors for both architectures
- Add crop_pack4_bf16s_msa() for MIPS MSA using int64_t copies (8 bytes)
- Add crop_pack4_bf16s_lsx() for LoongArch LSX using int64_t copies
- Add crop_pack8_lasx() for LoongArch LASX float pack8 (256-bit)
- Add crop_pack8_bf16s_lsx() for LoongArch LASX bf16 pack8 (128-bit)
- Dispatch to bf16 variants when elemsize matches bf16 packing
- Remove debug fprintf statements from MIPS deconvolution_packed_bf16s.h

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
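
The "int64_t copies (8 bytes)" in the list above follow from the layout: a pack4 bf16 element is 4 lanes of 2 bytes each. The sketch below shows that idea with a hypothetical `crop_pack4_bf16_sketch` function; it is not the PR's `crop_pack4_bf16s_msa()`, and it uses `std::memcpy` for the 8-byte move so it stays valid for any pointer alignment.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// One pack4 bf16 element: 4 lanes x 2 bytes = 8 bytes.
constexpr int kPack = 4;
static_assert(kPack * sizeof(uint16_t) == sizeof(int64_t), "pack4 bf16 fits in 64 bits");

// Hypothetical helper: copy a w x h window whose top-left corner is
// (left, top) out of a source image that is src_w packed elements wide.
inline void crop_pack4_bf16_sketch(const uint16_t* src, int src_w, uint16_t* dst,
                                   int top, int left, int w, int h)
{
    assert(top >= 0 && left >= 0 && left + w <= src_w);
    for (int y = 0; y < h; y++)
    {
        const uint16_t* s = src + ((top + y) * src_w + left) * kPack;
        uint16_t* d = dst + y * w * kPack;
        for (int x = 0; x < w; x++)
        {
            // Move one packed element (4 bf16 lanes) as a single 8-byte copy.
            std::memcpy(d, s, sizeof(int64_t));
            s += kPack;
            d += kPack;
        }
    }
}
```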
Add interp_bilinear_pack8.h and interp_bicubic_pack8.h implementing
256-bit SIMD (8 floats) resize operations using LASX intrinsics.

Update interp_loongarch.cpp to:
- Include lasxintrin.h and the new pack8 headers under __loongarch_asx
- Add elempack == 8 paths for dims 1, 2, and 3 (nearest, bilinear, bicubic)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… approach

- Replace hand-written kernel packing and convolution loops with
  convolution1d_transform_kernel_packed() and convolution1d_packed()
  from convolution1d_packed.h
- Rename weight_data_packed to weight_data_tm to match x86 pattern
- Add LASX (256-bit) support with pack8 out_elempack
- Add NCNN_BF16 support using cast-based approach (bf16->fp32->conv->bf16)
- Add bf16 weight/bias cast in dynamic weight forward path
- Include cpu.h, lasxintrin.h headers for new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@nihui
Member Author

nihui commented May 5, 2026

3a6000 loongnix-20
4.19.0-19-loongson-3
gcc 8.3.0

| 1t | baseline | pr6662 | pr6662-bf16s |
|---|---:|---:|---:|
| squeezenet | 21.25 | 12.82 | 13.05 |
| squeezenet_int8 | 35.65 | 13.23 | 12.63 |
| mobilenet | 37.77 | 22.03 | 25.56 |
| mobilenet_int8 | 75.81 | 24.87 | 25.59 |
| mobilenet_v2 | 25.06 | 16.10 | 17.05 |
| mobilenet_v3 | 19.97 | 13.62 | 14.69 |
| shufflenet | 12.67 | 8.13 | 10.07 |
| shufflenet_v2 | 12.24 | 8.84 | 14.43 |
| mnasnet | 25.07 | 16.17 | 16.97 |
| proxylessnasnet | 30.95 | 19.47 | 18.32 |
| efficientnet_b0 | 49.33 | 32.41 | 33.05 |
| efficientnetv2_b0 | 55.41 | 34.56 | 37.89 |
| regnety_400m | 33.78 | 20.21 | 23.00 |
| blazeface | 5.34 | 2.06 | 2.71 |
| googlenet | 87.11 | 46.94 | 48.28 |
| googlenet_int8 | 133.64 | 46.47 | 45.32 |
| resnet18 | 68.85 | 40.69 | 41.37 |
| resnet18_int8 | 114.22 | 37.45 | 36.93 |
| alexnet | 96.10 | 29.38 | 29.89 |
| vgg16 | 360.85 | 205.36 | 189.28 |
| vgg16_int8 | 631.80 | 216.57 | 213.89 |
| resnet50 | 187.97 | 108.57 | 118.25 |
| resnet50_int8 | 295.43 | 95.68 | 94.19 |
| squeezenet_ssd | 62.01 | 37.09 | 36.14 |
| squeezenet_ssd_int8 | 81.80 | 35.83 | 35.54 |
| mobilenet_ssd | 75.75 | 47.04 | 53.45 |
| mobilenet_ssd_int8 | 147.48 | 49.18 | 50.39 |
| mobilenet_yolo | 197.19 | 108.46 | 129.84 |
| mobilenetv2_yolov3 | 86.50 | 58.05 | 60.71 |
| yolov4-tiny | 117.88 | 74.90 | 71.26 |
| nanodet_m | 28.51 | 20.61 | 32.22 |
| yolo-fastest-1.1 | 11.88 | 9.12 | 10.54 |
| yolo-fastestv2 | 10.55 | 9.13 | 9.43 |
| vision_transformer | 4582.75 | 463.24 | 559.67 |
| FastestDet | 11.81 | 10.38 | 10.17 |

| 4t | baseline | pr6662 | pr6662-bf16s |
|---|---:|---:|---:|
| squeezenet | 7.81 | 5.18 | 4.50 |
| squeezenet_int8 | 10.22 | 5.47 | 4.95 |
| mobilenet | 12.95 | 7.21 | 6.79 |
| mobilenet_int8 | 19.32 | 7.44 | 7.61 |
| mobilenet_v2 | 8.57 | 6.18 | 5.11 |
| mobilenet_v3 | 6.71 | 5.51 | 5.22 |
| shufflenet | 4.62 | 4.07 | 4.31 |
| shufflenet_v2 | 4.40 | 4.24 | 5.67 |
| mnasnet | 7.72 | 5.40 | 5.32 |
| proxylessnasnet | 9.19 | 6.04 | 5.74 |
| efficientnet_b0 | 14.82 | 10.85 | 9.77 |
| efficientnetv2_b0 | 18.47 | 12.68 | 12.42 |
| regnety_400m | 14.93 | 11.15 | 12.31 |
| blazeface | 1.84 | 0.85 | 1.05 |
| googlenet | 28.37 | 18.81 | 17.46 |
| googlenet_int8 | 36.93 | 18.89 | 18.32 |
| resnet18 | 25.74 | 19.43 | 18.74 |
| resnet18_int8 | 32.41 | 15.41 | 15.08 |
| alexnet | 31.40 | 16.05 | 14.43 |
| vgg16 | 154.29 | 111.99 | 100.75 |
| vgg16_int8 | 189.38 | 90.29 | 84.84 |
| resnet50 | 66.72 | 42.46 | 40.87 |
| resnet50_int8 | 78.62 | 37.31 | 33.88 |
| squeezenet_ssd | 28.95 | 19.86 | 18.49 |
| squeezenet_ssd_int8 | 28.73 | 18.73 | 15.63 |
| mobilenet_ssd | 28.96 | 16.80 | 15.12 |
| mobilenet_ssd_int8 | 38.48 | 16.25 | 16.07 |
| mobilenet_yolo | 96.66 | 41.22 | 45.44 |
| mobilenetv2_yolov3 | 35.14 | 25.40 | 18.41 |
| yolov4-tiny | 56.07 | 37.81 | 32.35 |
| nanodet_m | 10.74 | 9.49 | 12.20 |
| yolo-fastest-1.1 | 4.57 | 4.78 | 5.39 |
| yolo-fastestv2 | 4.33 | 5.50 | 5.12 |
| vision_transformer | 1217.27 | 134.59 | 161.67 |
| FastestDet | 4.73 | 5.51 | 5.09 |

@nihui nihui closed this May 11, 2026
@nihui nihui reopened this May 11, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd6b5905e4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/layer/mips/rmsnorm_mips.cpp
Comment thread src/layer/loongarch/rmsnorm_loongarch.cpp

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b1c814c823

Comment thread src/layer/mips/layernorm_mips.cpp
Comment thread src/layer/loongarch/layernorm_loongarch.cpp

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a2119f149

Comment thread src/layer/loongarch/layernorm_loongarch.cpp
Comment thread src/layer/mips/rmsnorm_mips.cpp

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f12a23258c

v4f32 _beta = (v4f32)__msa_ld_w(beta_ptr, 0);
_p = __ncnn_msa_fmadd_w(_mean0, _p, _var0);
_p = __ncnn_msa_fmadd_w(_beta, _p, _gamma);
*(int64_t*)ptr = __msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0);

[P1] Replace unaligned int64 stores in bf16 LayerNorm tail loop

This tail path writes bf16 results via *(int64_t*)ptr, but ptr is not guaranteed to be 8-byte aligned for elempack == 1 (row starts are computed as data + w * y * elemsize, so odd w on bf16 rows yields only 2-byte alignment). On MIPS targets with strict alignment, this can raise an address error (and is UB in C++ even when it “works”), so valid shapes can crash in forward_inplace_bf16s; use an unaligned-safe store helper (memcpy/__msa_storel_d) instead.
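
One way to do the unaligned-safe store the review asks for is to go through `std::memcpy`, which places no alignment requirement on the destination. This is a portable sketch with hypothetical helper names, not the fix that was merged; the `__msa_storel_d` helper named in the review is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Store four bf16 values (64 bits) to an address that may be only 2-byte
// aligned. Unlike *(int64_t*)dst, memcpy is well defined for any alignment.
inline void store_4xbf16_unaligned(uint16_t* dst, uint64_t packed)
{
    assert(dst != nullptr);
    std::memcpy(dst, &packed, sizeof(packed));
}

// Matching unaligned-safe load, used here only to read the value back.
inline uint64_t load_4xbf16_unaligned(const uint16_t* src)
{
    assert(src != nullptr);
    uint64_t packed;
    std::memcpy(&packed, src, sizeof(packed));
    return packed;
}
```

For example, writing at `buf + 1` of a `uint16_t` array (an address that is 2-byte but not 8-byte aligned) is valid with this helper, and compilers typically lower the fixed-size `memcpy` to whatever unaligned store sequence the target supports.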


v4f32 _gamma = (v4f32)__msa_ld_w(gamma_ptr, 0);
_p = __msa_fmul_w(_p, _rms0);
_p = __msa_fmul_w(_p, _gamma);
*(int64_t*)ptr = __msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0);

[P1] Use unaligned-safe bf16 stores in RMSNorm vector tail

The bf16 vector tail stores through *(int64_t*)ptr, which assumes 8-byte alignment that is not guaranteed for pack1 tensors (row offsets are w * elemsize, so odd widths can produce misaligned row pointers). That makes this path undefined behavior and can fault on MIPS alignment checks, causing runtime crashes for otherwise valid inputs; switch to an unaligned-safe 64-bit store pattern (memcpy/__msa_storel_d).


@nihui
Member Author

nihui commented May 18, 2026

3a4000
[image]

3a6000
[image]


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 069b4a59a5

v4f32 _p = bfloat2float_msa(ptr);
_p = __msa_fsub_w(_p, _max);
_p = exp_ps(_p);
*(int64_t*)ptr = __msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0);

[P1] Replace unaligned bf16 vector store with safe helper

This bf16 softmax kernel writes through *(int64_t*)ptr, but ptr can come from row-based offsets (for example bottom_top_blob.row<unsigned short>(i) in forward_inplace_bf16s), and row starts are computed as w * y * elemsize, which is only 2-byte aligned for bf16 and can be misaligned for odd w. On strict-alignment MIPS targets this can fault at runtime (and is undefined behavior in C++), so valid bf16 inputs can crash in softmax; use an unaligned-safe store path such as __msa_storel_d/memcpy.
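
The alignment argument can be made concrete by computing row offsets. The sketch below uses the row-offset formula as described in the review (w * y * elemsize) and assumes the tensor base pointer itself is 8-byte aligned; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of row y from the tensor base, per the review's description.
constexpr std::size_t row_offset_bytes(std::size_t w, std::size_t y, std::size_t elemsize)
{
    return w * y * elemsize;
}

// Hypothetical helper: index of the first row whose start is not 8-byte
// aligned (assuming an 8-byte-aligned base), or -1 if every row is aligned.
inline int first_misaligned_row(std::size_t w, int h, std::size_t elemsize)
{
    assert(elemsize > 0);
    for (int y = 0; y < h; y++)
    {
        if (row_offset_bytes(w, static_cast<std::size_t>(y), elemsize) % 8 != 0)
            return y;
    }
    return -1;
}

// bf16 (elemsize 2) with odd w = 7: row 1 starts at byte 14, which is not
// a multiple of 8, so a 64-bit store through int64_t* there is misaligned.
static_assert(row_offset_bytes(7, 1, 2) == 14, "row 1 offset for w = 7, bf16");
```

With w = 8 every row start is a multiple of 16 bytes, which may explain why such a bug goes unnoticed until a test uses an odd width.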


@nihui nihui merged commit 0f5c6ef into Tencent:master May 19, 2026
162 of 166 checks passed
4 participants