opencl: flash attention improvement #25069

Open
wanghqc wants to merge 10 commits into ggml-org:master from qualcomm:hq/fa-rework

@wanghqc wanghqc commented Jun 26, 2026

Contributor

Overview

Rework flash attention (FA) for the OpenCL backend to improve precision and performance, and add support for a quantized KV cache. Tested with the gpt-oss-20b model. It works well with models that have a head_dim of 64. For larger head_dim values, the main benefit is reduced data traffic.

Additional information

This targets Adreno GPUs. It was tested on Adreno GPUs in flagship Android devices and on Windows on Snapdragon (WoS) devices (X1 and X2 GPUs).

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, used for prototyping; the code was manually reviewed and refactored

wanghqc and others added 10 commits June 25, 2026 18:49
- flash_attn_kv_pad_f16    pads the tail KV tile to a BLOCK_N multiple
- flash_attn_mask_pad_f16  pads the matching mask tile
- flash_attn_blk_f16       classifies each KV tile per query block as
                           fully masked / mixed / fully unmasked, so
                           the main kernel can skip fully-masked tiles
                           and the mask lookup for fully-unmasked ones
@wanghqc wanghqc requested a review from a team as a code owner June 26, 2026 23:44
@github-actions github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (issues specific to the OpenCL backend) labels Jun 26, 2026
ggml-gh-bot Bot commented Jun 26, 2026

Hi @wanghqc, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
