massive mips and loongarch optimization #6662
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #6662 +/- ##
==========================================
+ Coverage 95.84% 95.95% +0.10%
==========================================
Files 937 970 +33
Lines 313206 403584 +90378
==========================================
+ Hits 300205 387240 +87035
- Misses 13001 16344 +3343
Add jj+=12 loop unrolling to pack_B_tile, transpose_pack_B_tile, transpose_unpack_output_tile, and gemm_transB_packed_tile for all ii sections (8, 4, 2, 1). MIPS MSA has 32 SIMD registers, so jj+=12 fits well (24 registers for ii+=8, 12 for ii+=4). Update get_optimal_tile_mnk to align TILE_N to multiples of 12 for better utilization of the new kernel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
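The register arithmetic and the tile alignment described in this commit can be sketched as follows. This is a minimal illustration, not the code in get_optimal_tile_mnk: both helper names are hypothetical, and it assumes each 128-bit MSA accumulator holds 4 floats along the ii dimension, which is what makes the commit's counts (24 and 12) come out.

```cpp
// Hypothetical helper: number of 128-bit accumulator registers needed for an
// ii x jj register block, assuming each register holds 4 floats along ii.
static int accumulator_registers(int ii, int jj)
{
    return (ii / 4) * jj;
}

// Hypothetical helper: round a candidate TILE_N down to a multiple of 12 so
// that the jj+=12 loop covers the tile without a remainder.
// N is the full extent of the dimension being tiled.
static int align_tile_n_to_12(int tile_n, int N)
{
    if (tile_n >= N)
        return N; // the whole dimension fits in one tile, nothing to align
    if (tile_n < 12)
        return tile_n; // too small to hold even one 12-wide block
    return tile_n / 12 * 12;
}
```

With 32 SIMD registers available, the 24 accumulators of the ii+=8 case leave 8 registers for loading A and B operands, which is presumably why a wider unroll such as jj+=16 was not chosen.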
…ngArch
Integrate bf16 storage support into multiple operators:
- MIPS: batchnorm, clip, dropout, selu, erf
- LoongArch: batchnorm, clip, dropout

Each operator now declares forward_inplace_bf16s in its header, sets support_bf16_storage=true in the constructor, dispatches bf16 inputs from forward_inplace, and implements the bf16s path using the existing bf16s helper headers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
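The dispatch pattern this commit describes might look roughly like the sketch below. The `Blob`, `Options`, and `ClipSketch` types are hypothetical stand-ins so the example compiles on its own; only the names forward_inplace, forward_inplace_bf16s, and support_bf16_storage come from the commit message. The return values here are path markers for illustration, not real status codes.

```cpp
// Hypothetical stand-ins for the tensor and option types.
struct Blob
{
    int elembits; // bits per scalar element: 32 for fp32, 16 for bf16
};

struct Options
{
    bool use_bf16_storage;
};

struct ClipSketch
{
    bool support_bf16_storage;

    // The constructor advertises that this operator accepts bf16 storage.
    ClipSketch() : support_bf16_storage(true) {}

    // Entry point: route 16-bit inputs to the bf16 path when bf16 storage is enabled.
    int forward_inplace(Blob& blob, const Options& opt) const
    {
        if (opt.use_bf16_storage && blob.elembits == 16)
            return forward_inplace_bf16s(blob, opt);

        return 0; // marker: fp32 path taken
    }

    int forward_inplace_bf16s(Blob& /*blob*/, const Options& /*opt*/) const
    {
        return 1; // marker: bf16 path taken
    }
};
```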
- Add support_bf16_storage = true in constructors for both architectures
- Add crop_pack4_bf16s_msa() for MIPS MSA using int64_t copies (8 bytes)
- Add crop_pack4_bf16s_lsx() for LoongArch LSX using int64_t copies
- Add crop_pack8_lasx() for LoongArch LASX float pack8 (256-bit)
- Add crop_pack8_bf16s_lsx() for LoongArch LASX bf16 pack8 (128-bit)
- Dispatch to bf16 variants when elemsize matches bf16 packing
- Remove debug fprintf statements from MIPS deconvolution_packed_bf16s.h

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add interp_bilinear_pack8.h and interp_bicubic_pack8.h implementing 256-bit SIMD (8 floats) resize operations using LASX intrinsics.

Update interp_loongarch.cpp to:
- Include lasxintrin.h and the new pack8 headers under __loongarch_asx
- Add elempack == 8 paths for dims 1, 2, and 3 (nearest, bilinear, bicubic)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… approach
- Replace hand-written kernel packing and convolution loops with convolution1d_transform_kernel_packed() and convolution1d_packed() from convolution1d_packed.h
- Rename weight_data_packed to weight_data_tm to match x86 pattern
- Add LASX (256-bit) support with pack8 out_elempack
- Add NCNN_BF16 support using cast-based approach (bf16->fp32->conv->bf16)
- Add bf16 weight/bias cast in dynamic weight forward path
- Include cpu.h, lasxintrin.h headers for new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
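The cast-based bf16 approach listed above (bf16->fp32->conv->bf16) can be illustrated with a scalar sketch. Everything here is hypothetical and simplified: the helper names are invented, the fp32-to-bf16 conversion truncates the low 16 bits (an assumption about the rounding mode), and the convolution is a single-channel, stride-1, unpadded loop rather than the packed kernel from convolution1d_packed.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// bf16 is the upper 16 bits of an IEEE-754 float (truncating conversion assumed).
static inline uint16_t float_to_bf16(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return (uint16_t)(u >> 16);
}

static inline float bf16_to_float(uint16_t v)
{
    uint32_t u = (uint32_t)v << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

// Hypothetical cast-based 1D convolution: widen the bf16 input to fp32,
// accumulate in fp32, then narrow each output back to bf16.
static std::vector<uint16_t> conv1d_bf16_via_fp32(const std::vector<uint16_t>& in,
                                                  const std::vector<float>& kernel)
{
    if (kernel.empty() || in.size() < kernel.size())
        return std::vector<uint16_t>();

    // Step 1: bf16 -> fp32
    std::vector<float> in32(in.size());
    for (size_t i = 0; i < in.size(); i++)
        in32[i] = bf16_to_float(in[i]);

    // Step 2: fp32 convolution, step 3: fp32 -> bf16
    const size_t outw = in.size() - kernel.size() + 1;
    std::vector<uint16_t> out(outw);
    for (size_t i = 0; i < outw; i++)
    {
        float sum = 0.f;
        for (size_t k = 0; k < kernel.size(); k++)
            sum += in32[i + k] * kernel[k];
        out[i] = float_to_bf16(sum);
    }
    return out;
}
```

The appeal of this design is that the fp32 kernel is reused unchanged; the cost is two extra conversion passes and a temporary fp32 buffer.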
3a6000 loongnix-20
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cd6b5905e4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b1c814c823
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a2119f149
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f12a23258c
v4f32 _beta = (v4f32)__msa_ld_w(beta_ptr, 0);
_p = __ncnn_msa_fmadd_w(_mean0, _p, _var0);
_p = __ncnn_msa_fmadd_w(_beta, _p, _gamma);
*(int64_t*)ptr = __msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0);
Replace unaligned int64 stores in bf16 LayerNorm tail loop
This tail path writes bf16 results via *(int64_t*)ptr, but ptr is not guaranteed to be 8-byte aligned for elempack == 1 (row starts are computed as data + w * y * elemsize, so odd w on bf16 rows yields only 2-byte alignment). On MIPS targets with strict alignment, this can raise an address error (and is UB in C++ even when it “works”), so valid shapes can crash in forward_inplace_bf16s; use an unaligned-safe store helper (memcpy/__msa_storel_d) instead.
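The memcpy option suggested in this comment could be sketched as below. The helper names are hypothetical and nothing here uses MSA intrinsics, so it compiles anywhere; in the real kernel the 64-bit value would come from `__msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0)` as in the snippet above.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: gather four bf16 values into one 64-bit word,
// standing in for the 64-bit lane extracted from the MSA register.
static inline uint64_t pack_bf16x4(const unsigned short v[4])
{
    uint64_t packed;
    std::memcpy(&packed, v, sizeof(packed));
    return packed;
}

// Hypothetical helper: store 8 bytes to a pointer that is only guaranteed
// to be 2-byte aligned. memcpy has no alignment requirement, so this avoids
// the undefined behavior of *(int64_t*)ptr on a misaligned address.
static inline void store_bf16x4_unaligned(unsigned short* ptr, uint64_t packed)
{
    std::memcpy(ptr, &packed, sizeof(packed));
}
```

Compilers typically lower a fixed-size 8-byte memcpy to whatever store sequence is legal for the target, so on targets that tolerate unaligned access it should cost no more than the original cast.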
v4f32 _gamma = (v4f32)__msa_ld_w(gamma_ptr, 0);
_p = __msa_fmul_w(_p, _rms0);
_p = __msa_fmul_w(_p, _gamma);
*(int64_t*)ptr = __msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0);
Use unaligned-safe bf16 stores in RMSNorm vector tail
The bf16 vector tail stores through *(int64_t*)ptr, which assumes 8-byte alignment that is not guaranteed for pack1 tensors (row offsets are w * elemsize, so odd widths can produce misaligned row pointers). That makes this path undefined behavior and can fault on MIPS alignment checks, causing runtime crashes for otherwise valid inputs; switch to an unaligned-safe 64-bit store pattern (memcpy/__msa_storel_d).
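The alignment argument can be checked with a little arithmetic. The sketch below uses a hypothetical helper and assumes the base data pointer itself is at least 8-byte aligned, so only the row offset w * y * elemsize decides whether a row start is safe for a 64-bit store.

```cpp
#include <cstddef>

// Hypothetical helper: true if row y of a pack1 tensor with width w starts
// on an 8-byte boundary, assuming the base pointer is 8-byte aligned.
static bool row_start_is_8byte_aligned(size_t w, size_t y, size_t elemsize)
{
    const size_t offset = w * y * elemsize; // byte offset of the row start
    return offset % 8 == 0;
}
```

For bf16 (elemsize 2) and w = 5, row 1 starts at byte offset 10, which is only 2-byte aligned; row 4 starts at offset 40 and happens to be aligned. The store is therefore safe for some rows and faulting for others, which makes the bug shape-dependent.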
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 069b4a59a5
v4f32 _p = bfloat2float_msa(ptr);
_p = __msa_fsub_w(_p, _max);
_p = exp_ps(_p);
*(int64_t*)ptr = __msa_copy_s_d((v2i64)float2bfloat_msa(_p), 0);
Replace unaligned bf16 vector store with safe helper
This bf16 softmax kernel writes through *(int64_t*)ptr, but ptr can come from row-based offsets (for example bottom_top_blob.row<unsigned short>(i) in forward_inplace_bf16s), and row starts are computed as w * y * elemsize, which is only 2-byte aligned for bf16 and can be misaligned for odd w. On strict-alignment MIPS targets this can fault at runtime (and is undefined behavior in C++), so valid bf16 inputs can crash in softmax; use an unaligned-safe store path such as __msa_storel_d/memcpy.


No description provided.