arm multiheadattention bf16 storage #6717

Merged
nihui merged 4 commits into Tencent:master from nihui:mha-arm-bf16-1 on May 14, 2026

Conversation

nihui (Member) commented May 12, 2026

No description provided.

@nihui nihui requested a review from Copilot May 12, 2026 09:37
nihui (Member, Author) commented May 12, 2026

@codex review

@github-actions github-actions Bot added the arm label May 12, 2026
tencent-adm (Member) commented
CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

codecov-commenter commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.97%. Comparing base (421de78) to head (86ee7be).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6717      +/-   ##
==========================================
+ Coverage   93.90%   93.97%   +0.07%     
==========================================
  Files         933      930       -3     
  Lines      310041   310141     +100     
==========================================
+ Hits       291140   291466     +326     
+ Misses      18901    18675     -226     

Copilot AI (Contributor) left a comment

Pull request overview

This PR enables bf16 storage support for the ARM MultiHeadAttention implementation by allowing bf16-backed pipelines while forcing specific intermediate GEMM outputs to fp32, and extends the ARM bf16s GEMM path to support fp32 output (via output_elemtype).

Changes:

  • Enable support_bf16_storage for MultiHeadAttention_arm when NCNN_BF16 is enabled, with special handling/disablement under int8_scale_term.
  • Force qk/qkv GEMM intermediates to output fp32 (output_elemtype=fp32) and add casting for bf16 v_affine where needed.
  • Extend ARM bf16s GEMM unpack/output handling to optionally write fp32 output (output_elemtype == 1).
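A minimal sketch of why the qk/qkv intermediates may be worth keeping in fp32, assuming bf16 is produced by truncating an fp32 value to its upper 16 bits. The helper names below (`fp32_to_bf16_trunc`, `bf16_to_fp32`, `dot_bf16_inputs_fp32_acc`) are hypothetical and are not taken from the PR; they only illustrate bf16 inputs being accumulated in fp32, and what is lost if that result is stored back as bf16.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: truncate fp32 to bf16 by keeping the upper 16 bits.
static inline uint16_t fp32_to_bf16_trunc(float v)
{
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Hypothetical helper: widen bf16 back to fp32 by zero-filling the low 16 bits.
static inline float bf16_to_fp32(uint16_t h)
{
    const uint32_t bits = static_cast<uint32_t>(h) << 16;
    float v;
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}

// Store an fp32 value as bf16 and read it back, to show what storage loses.
static inline float roundtrip_bf16(float v)
{
    return bf16_to_fp32(fp32_to_bf16_trunc(v));
}

// Dot product over bf16-stored inputs with an fp32 accumulator,
// standing in for one q.k attention score.
static float dot_bf16_inputs_fp32_acc(const uint16_t* q, const uint16_t* k, int n)
{
    assert(n >= 0);
    float sum = 0.f;
    for (int i = 0; i < n; i++)
    {
        sum += bf16_to_fp32(q[i]) * bf16_to_fp32(k[i]);
    }
    return sum;
}
```

With q = {1, 1} and k = {1, 2^-8}, the fp32 accumulator holds 1 + 2^-8 exactly, while a bf16 round trip of that score collapses to 1.0 because bf16 keeps only 7 explicit mantissa bits.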

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/layer/arm/multiheadattention_arm.cpp | Enables bf16 storage support and adjusts GEMM options/intermediate buffers for fp32 outputs. |
| src/layer/arm/gemm_bf16s.h | Adds output_elemtype handling to allow bf16s GEMM unpack to write fp32 output. |
| src/layer/arm/gemm_arm.cpp | Plumbs output_elemtype through bf16s GEMM helpers and allocates output buffer size accordingly. |
| src/layer/arm/gemm_arm_bf16.cpp | Updates the bf16 wrapper signature to forward output_elemtype. |
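The output_elemtype handling summarized above could look roughly like the following. This is a sketch under stated assumptions, not the code from the PR: `unpack_output` and `output_elemsize` are hypothetical names, only the values 0 (bf16 storage) and 1 (fp32) are modeled, and a plain byte vector stands in for the real output buffer. It shows the two pieces the review mentions: the output buffer is sized by element type, and the unpack step writes either fp32 or bf16 from the same fp32 accumulators.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: truncate fp32 to bf16 by keeping the upper 16 bits.
static inline uint16_t fp32_to_bf16_trunc(float v)
{
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Bytes per output element: 4 for fp32 (output_elemtype == 1), otherwise 2 for bf16.
static inline std::size_t output_elemsize(int output_elemtype)
{
    return output_elemtype == 1 ? 4u : 2u;
}

// Hypothetical unpack step: copy fp32 accumulators into an output buffer
// whose element type is selected by output_elemtype.
static std::vector<unsigned char> unpack_output(const std::vector<float>& acc, int output_elemtype)
{
    // Assumption: this sketch only models 0 (bf16 storage) and 1 (fp32).
    assert(output_elemtype == 0 || output_elemtype == 1);

    const std::size_t elemsize = output_elemsize(output_elemtype);
    std::vector<unsigned char> out(acc.size() * elemsize);

    for (std::size_t i = 0; i < acc.size(); i++)
    {
        if (output_elemtype == 1)
        {
            // fp32 output: keep the accumulator bits unchanged
            std::memcpy(out.data() + i * elemsize, &acc[i], sizeof(float));
        }
        else
        {
            // bf16 output: narrow before storing
            const uint16_t h = fp32_to_bf16_trunc(acc[i]);
            std::memcpy(out.data() + i * elemsize, &h, sizeof(h));
        }
    }
    return out;
}
```

Calling `unpack_output(acc, 1)` yields 4 bytes per element with the accumulator values preserved, while `unpack_output(acc, 0)` yields 2 bytes per element.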

Comment thread src/layer/arm/multiheadattention_arm.cpp
chatgpt-codex-connector commented

Codex Review: Didn't find any major issues. Nice work!

nihui (Member, Author) commented May 12, 2026

Dimensity 9000 bf16s

| big core 1t | baseline | pr |
| --- | --- | --- |
| squeezenet | 13.35 | 9.47 |
| mobilenet | 23.28 | 15.77 |
| mobilenet_v2 | 16.28 | 14.54 |
| mobilenet_v3 | 13.38 | 12.61 |
| shufflenet | 13.01 | 7.58 |
| shufflenet_v2 | 14.29 | 7.92 |
| mnasnet | 14.90 | 13.94 |
| proxylessnasnet | 17.50 | 15.42 |
| efficientnet_b0 | 25.40 | 19.17 |
| efficientnetv2_b0 | 32.45 | 23.53 |
| regnety_400m | 21.30 | 15.12 |
| blazeface | 3.50 | 2.13 |
| googlenet | 54.96 | 49.58 |
| resnet18 | 58.69 | 55.14 |
| alexnet | 49.48 | 50.94 |
| vgg16 | 371.84 | 385.67 |
| resnet50 | 136.50 | 96.35 |
| squeezenet_ssd | 50.67 | 50.10 |
| mobilenet_ssd | 50.43 | 29.04 |
| mobilenet_yolo | 127.77 | 81.33 |
| mobilenetv2_yolov3 | 59.00 | 39.01 |
| yolov4-tiny | 99.66 | 91.64 |
| nanodet_m | 21.51 | 14.96 |
| yolo-fastest-1.1 | 14.84 | 10.46 |
| yolo-fastestv2 | 11.95 | 8.43 |
| vision_transformer | 515.18 | 263.99 |
| FastestDet | 13.28 | 9.18 |

| big core 4t | baseline | pr |
| --- | --- | --- |
| squeezenet | 9.31 | 7.34 |
| mobilenet | 12.37 | 6.75 |
| mobilenet_v2 | 9.79 | 7.99 |
| mobilenet_v3 | 9.14 | 7.36 |
| shufflenet | 6.41 | 5.87 |
| shufflenet_v2 | 5.58 | 4.77 |
| mnasnet | 9.24 | 7.10 |
| proxylessnasnet | 9.90 | 7.76 |
| efficientnet_b0 | 14.15 | 10.98 |
| efficientnetv2_b0 | 19.34 | 15.84 |
| regnety_400m | 22.80 | 20.63 |
| blazeface | 1.72 | 1.64 |
| googlenet | 32.66 | 33.94 |
| resnet18 | 48.23 | 42.19 |
| alexnet | 26.30 | 27.50 |
| vgg16 | 274.30 | 275.21 |
| resnet50 | 80.83 | 72.96 |
| squeezenet_ssd | 45.54 | 45.60 |
| mobilenet_ssd | 28.00 | 16.86 |
| mobilenet_yolo | 80.72 | 56.09 |
| mobilenetv2_yolov3 | 37.76 | 28.44 |
| yolov4-tiny | 70.16 | 64.22 |
| nanodet_m | 14.72 | 11.04 |
| yolo-fastest-1.1 | 7.35 | 5.97 |
| yolo-fastestv2 | 7.52 | 6.35 |
| vision_transformer | 282.33 | 146.52 |
| FastestDet | 7.67 | 6.98 |

| small core 1t | baseline | pr |
| --- | --- | --- |
| squeezenet | 86.68 | 55.85 |
| mobilenet | 153.99 | 110.53 |
| mobilenet_v2 | 102.56 | 91.72 |
| mobilenet_v3 | 89.11 | 73.32 |
| shufflenet | 56.79 | 47.14 |
| shufflenet_v2 | 64.98 | 54.40 |
| mnasnet | 101.14 | 79.93 |
| proxylessnasnet | 132.60 | 109.52 |
| efficientnet_b0 | 176.80 | 150.42 |
| efficientnetv2_b0 | 202.85 | 165.31 |
| regnety_400m | 140.22 | 113.87 |
| blazeface | 17.25 | 14.75 |
| googlenet | 330.04 | 323.66 |
| resnet18 | 276.94 | 253.82 |
| alexnet | 230.89 | 220.68 |
| vgg16 | 1105.18 | 1106.81 |
| resnet50 | 748.32 | 597.05 |
| squeezenet_ssd | 210.70 | 197.29 |
| mobilenet_ssd | 287.63 | 229.86 |
| mobilenet_yolo | 834.49 | 639.62 |
| mobilenetv2_yolov3 | 387.28 | 300.92 |
| yolov4-tiny | 452.39 | 412.05 |
| nanodet_m | 156.76 | 130.47 |
| yolo-fastest-1.1 | 37.22 | 54.16 |
| yolo-fastestv2 | 48.02 | 38.35 |
| vision_transformer | 3259.91 | 2445.90 |
| FastestDet | 51.76 | 26.63 |

| small core 4t | baseline | pr |
| --- | --- | --- |
| squeezenet | 22.74 | 22.40 |
| mobilenet | 35.53 | 35.05 |
| mobilenet_v2 | 27.61 | 27.77 |
| mobilenet_v3 | 25.91 | 24.95 |
| shufflenet | 19.33 | 18.23 |
| shufflenet_v2 | 19.02 | 18.66 |
| mnasnet | 27.86 | 26.88 |
| proxylessnasnet | 33.05 | 32.12 |
| efficientnet_b0 | 43.23 | 41.96 |
| efficientnetv2_b0 | 52.29 | 51.41 |
| regnety_400m | 57.88 | 56.44 |
| blazeface | 5.21 | 4.91 |
| googlenet | 85.05 | 84.38 |
| resnet18 | 76.30 | 75.60 |
| alexnet | 65.15 | 61.12 |
| vgg16 | 345.80 | 344.46 |
| resnet50 | 186.29 | 183.74 |
| squeezenet_ssd | 69.52 | 70.35 |
| mobilenet_ssd | 75.46 | 72.09 |
| mobilenet_yolo | 242.58 | 237.27 |
| mobilenetv2_yolov3 | 103.28 | 99.18 |
| yolov4-tiny | 144.65 | 143.35 |
| nanodet_m | 42.26 | 41.14 |
| yolo-fastest-1.1 | 19.74 | 18.51 |
| yolo-fastestv2 | 19.05 | 18.35 |
| vision_transformer | 799.67 | 818.26 |
| FastestDet | 18.72 | 18.11 |
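
One way to read the baseline/pr figures above, assuming they are timings where lower is better (the comment does not state the unit): divide baseline by pr to get a speedup ratio. The `speedup` helper is hypothetical and only does that arithmetic.

```cpp
#include <cassert>

// Hypothetical helper: ratio > 1 means the pr column is faster than baseline,
// ratio < 1 means it is slower. Assumes both values are timings (lower is better).
static inline double speedup(double baseline, double pr)
{
    assert(pr > 0.0);
    return baseline / pr;
}
```

For example, `speedup(515.18, 263.99)` for vision_transformer on big core 1t is roughly 1.95, while `speedup(799.67, 818.26)` for the same model on small core 4t and `speedup(37.22, 54.16)` for yolo-fastest-1.1 on small core 1t both come out below 1, meaning the pr figure is higher than baseline in those rows.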

@nihui nihui merged commit 4d68240 into Tencent:master May 14, 2026
57 of 60 checks passed

4 participants