Commit d644388

Update base for Update on "Add quantized input support to cpu_sdpa"
cpu_sdpa (unfused SDPA) previously supported only float inputs. When the model uses quantized Q/K/V (int8 with per-channel scales and zero_points), decode fell back to cpu_flash_attention, missing the ~25-30% throughput improvement from unfused SDPA. This adds quantized support to cpu_sdpa by:

- Accepting optional quantization params (zero_points and scales for Q/K/V)
- Using _q_at_k_gemm for the QK^T GEMM (handles both int8 and float)
- Using _qk_at_v_gemm for the scores × V GEMM (handles both int8 and float)
- Applying the scaling factor separately (fused with the mask add or the max reduction)
- Allocating a dequantization buffer for V when it is quantized

The dispatch in op_sdpa.cpp is updated to route quantized decode (seq_len==1) through cpu_sdpa instead of cpu_flash_attention.

Differential Revision: [D96044310](https://our.internmc.facebook.com/intern/diff/D96044310/)

[ghstack-poisoned]
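For context, per-channel int8 dequantization (as needed when expanding the quantized V tensor into a float buffer before the scores × V GEMM) follows the standard affine scheme `float_val = scale[c] * (q_val - zero_point[c])`. The sketch below is illustrative only; the function name and layout are assumptions, not ExecuTorch's actual `_qk_at_v_gemm` internals:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: dequantize a row-major [rows, channels] int8 tensor
// using one (scale, zero_point) pair per channel. Real kernels would fuse
// this into the GEMM rather than materialize a full float buffer.
std::vector<float> dequantize_per_channel(
    const std::vector<int8_t>& q,            // quantized values
    const std::vector<float>& scales,        // one scale per channel
    const std::vector<int32_t>& zero_points, // one zero_point per channel
    size_t channels) {
  std::vector<float> out(q.size());
  for (size_t i = 0; i < q.size(); ++i) {
    const size_t c = i % channels; // channel index of this element
    out[i] = scales[c] *
        (static_cast<float>(q[i]) - static_cast<float>(zero_points[c]));
  }
  return out;
}
```

With float inputs this step is skipped entirely, which is why the unfused path can accept both int8 and float through the same GEMM helpers.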
1 parent 550e894 commit d644388

0 files changed
