Add GEMM-based standard SDPA benchmark #18646
kimishpatel wants to merge 2 commits into gh/kimishpatel/219/base from
Conversation
Add bench_sdpa.cpp with a standalone GEMM-based SDPA implementation (run_standard_sdpa) alongside ExecuTorch's tiled flash attention (custom_sdpa_out) for comparative benchmarking.

The standalone SDPA uses a full GEMM per head with a 3-pass softmax and supports both [B,S,H,D] and [B,H,S,D] layouts via BLAS leading-dimension parameters, allowing isolation of algorithm vs. layout effects. Validation tests verify that the GEMM-based implementation matches custom_sdpa_out within tolerance.

Differential Revision: [D96044313](https://our.internmc.facebook.com/intern/diff/D96044313/)
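A single-head sketch of the GEMM-based path described above: one GEMM for Q·Kᵀ, a 3-pass softmax over each score row (max, exp-and-sum, normalize), and one GEMM for P·V. This is an illustrative reimplementation with plain loops in place of BLAS calls, not the PR's actual code; the name `sdpa_single_head` and the single `row_stride` parameter are hypothetical simplifications.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-head SDPA sketch. q, k, v, out each hold S rows of
// D floats; row_stride is the distance between consecutive rows, which is
// how a leading-dimension parameter lets the same code walk either layout.
void sdpa_single_head(const float* q, const float* k, const float* v,
                      float* out, std::size_t S, std::size_t D,
                      std::size_t row_stride) {
  const float scale = 1.0f / std::sqrt(static_cast<float>(D));
  std::vector<float> scores(S * S);

  // GEMM 1: scores[i][j] = scale * dot(Q row i, K row j)
  for (std::size_t i = 0; i < S; ++i)
    for (std::size_t j = 0; j < S; ++j) {
      float acc = 0.0f;
      for (std::size_t d = 0; d < D; ++d)
        acc += q[i * row_stride + d] * k[j * row_stride + d];
      scores[i * S + j] = acc * scale;
    }

  // 3-pass softmax per row: (1) row max, (2) exp and sum, (3) normalize.
  for (std::size_t i = 0; i < S; ++i) {
    float* row = scores.data() + i * S;
    float m = row[0];
    for (std::size_t j = 1; j < S; ++j) m = std::max(m, row[j]);
    float sum = 0.0f;
    for (std::size_t j = 0; j < S; ++j) {
      row[j] = std::exp(row[j] - m);
      sum += row[j];
    }
    for (std::size_t j = 0; j < S; ++j) row[j] /= sum;
  }

  // GEMM 2: out = P * V
  for (std::size_t i = 0; i < S; ++i)
    for (std::size_t d = 0; d < D; ++d) {
      float acc = 0.0f;
      for (std::size_t j = 0; j < S; ++j)
        acc += scores[i * S + j] * v[j * row_stride + d];
      out[i * row_stride + d] = acc;
    }
}
```

The subtract-max step in pass 1 is what keeps `std::exp` from overflowing; the flash-attention kernel being benchmarked fuses these passes into its tiling instead of materializing the full S×S score matrix.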
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18646
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Cancelled Jobs as of commit 5cbe9f5 with merge base fb1618e.
NEW FAILURE: The following job has failed:
CANCELLED JOBS: The following jobs were cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
digantdesai
left a comment
Review automatically exported from Phabricator review in Meta.
This PR needs a
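The layout flexibility mentioned in the description comes down to computing, per head, a base offset and a leading dimension (the stride between consecutive sequence positions) and handing both to BLAS. A minimal sketch of that index arithmetic, assuming the hypothetical helper names `head_view_bshd` / `head_view_bhsd`:

```cpp
#include <cassert>
#include <cstddef>

// A head's data as a base offset plus a leading dimension: the stride, in
// elements, between consecutive sequence positions s of the same head.
struct HeadView {
  std::size_t offset;  // index of the (b, h, s = 0, d = 0) element
  std::size_t ld;      // leading dimension passed to the BLAS GEMM call
};

// [B,S,H,D]: one head's rows are interleaved with the other heads, so
// consecutive s positions are H*D elements apart.
HeadView head_view_bshd(std::size_t b, std::size_t h,
                        std::size_t S, std::size_t H, std::size_t D) {
  return {b * S * H * D + h * D, H * D};
}

// [B,H,S,D]: one head's rows are contiguous, so consecutive s positions
// are only D elements apart.
HeadView head_view_bhsd(std::size_t b, std::size_t h,
                        std::size_t S, std::size_t H, std::size_t D) {
  return {b * H * S * D + h * S * D, D};
}
```

Because only the offset and leading dimension change between the two layouts, the same GEMM kernel serves both, which is what lets the benchmark separate the cost of the algorithm from the cost of the memory layout.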