Add benchmarks for transposed vs standard KV cache layout #18711
kimishpatel wants to merge 1 commit into gh/kimishpatel/231/base
Benchmarks comparing transposed [B, H, S, D] vs. standard [B, S, H, D] KV cache layouts in the custom_sdpa and update_cache ops, using the Llama 3 8B config (32 Q heads, 8 KV heads, D=128). Both C++ (Google Benchmark) and Python benchmarks are included, covering decode (seq_len=1) at various cache fill levels and prefill scenarios.

Results on Apple M-series show that the transposed cache improves SDPA performance at longer cache fills (1.64x at start_pos=1024, 1.13x for prefill seq_len=512) due to better memory locality in the attn_score @ V GEMM: the V stride along S_kv drops from H*D to D elements.

Authored with Claude.

Differential Revision: [D99677680](https://our.internmc.facebook.com/intern/diff/D99677680/)
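To make the stride claim concrete, here is a minimal sketch (not code from this PR) that computes contiguous row-major strides, in elements, for both layouts. The helper `contiguous_strides` and the cache length `S_MAX = 2048` are assumptions made for illustration; the head count and head dimension come from the Llama 3 8B config above.

```python
# Illustrative only: element strides of a contiguous (row-major) tensor.
# `contiguous_strides` is a hypothetical helper, not part of the PR.
def contiguous_strides(shape):
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


B, H_KV, D = 1, 8, 128  # batch of 1, 8 KV heads, head dim 128
S_MAX = 2048            # assumed maximum cache length

standard = contiguous_strides((B, S_MAX, H_KV, D))    # [B, S, H, D]
transposed = contiguous_strides((B, H_KV, S_MAX, D))  # [B, H, S, D]

# Distance, in elements, between consecutive cache positions of one head.
seq_stride_standard = standard[1]
seq_stride_transposed = transposed[2]

print(seq_stride_standard)    # 1024 == H_KV * D
print(seq_stride_transposed)  # 128 == D
```

In the standard layout, walking one head's V rows along S_kv skips over the other seven heads at every step; in the transposed layout those rows are back to back, which is the locality difference the description attributes the speedup to.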
Dr. CI: ❌ 9 new failures, 2 cancelled jobs as of commit 2bd6e06 with merge base fb1618e. Artifacts and rendered test results: hud.pytorch.org/pr/pytorch/executorch/18711
Submitted by accident; not meant to land immediately.