opencl: flash attention improvement by wanghqc · Pull Request #25069 · ggml-org/llama.cpp

wanghqc · 2026-06-26T23:44:02Z

Overview

Rework flash attention (FA) for the OpenCL backend to improve precision and performance and to support a quantized KV cache. Tested with the gpt-oss-20b model. It works well for models with a head_dim of 64; for larger head_dim, the main benefit is the reduced data traffic.
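For readers unfamiliar with the quantized KV cache types named in the commits (q8_0 and q4_0), here is a minimal host-side sketch of how one block of each type dequantizes. It follows the standard ggml block layout of 32 values per block with one scale. The struct and function names are hypothetical, the scale is stored as `float` instead of fp16 for portability, and the OpenCL backend may lay the data out differently (the commit list mentions a struct-of-arrays arrangement), so this is not the PR's kernel code.

```cpp
#include <array>
#include <cstdint>

// Values per block in ggml's q8_0 and q4_0 formats.
constexpr int QK = 32;

// Hypothetical host-side mirrors of the ggml blocks.
// Assumption: ggml stores the scale d as fp16; float is used here for portability.
struct BlockQ8_0 { float d; int8_t  qs[QK];     };  // one signed byte per value
struct BlockQ4_0 { float d; uint8_t qs[QK / 2]; };  // two 4-bit values per byte

// q8_0: value[i] = d * qs[i]
inline std::array<float, QK> dequant_q8_0(const BlockQ8_0 & b) {
    std::array<float, QK> out{};
    for (int i = 0; i < QK; ++i) {
        out[i] = b.d * static_cast<float>(b.qs[i]);
    }
    return out;
}

// q4_0: each nibble is an unsigned 0..15 code with an implicit offset of 8.
// The low nibble of byte j holds element j, the high nibble holds element j + 16.
inline std::array<float, QK> dequant_q4_0(const BlockQ4_0 & b) {
    std::array<float, QK> out{};
    for (int j = 0; j < QK / 2; ++j) {
        out[j]          = b.d * static_cast<float>((b.qs[j] & 0x0F) - 8);
        out[j + QK / 2] = b.d * static_cast<float>((b.qs[j] >> 4)   - 8);
    }
    return out;
}
```

An attention kernel reading a quantized K or V tile would apply this kind of per-block expansion before (or fused into) the dot products, which is where the memory traffic reduction relative to an f16 cache comes from.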

Additional information

This targets Adreno GPUs. Tested on Adreno GPUs in flagship Android devices and on Windows on Snapdragon (WoS) with X1 and X2 GPUs.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, used for prototyping; manually reviewed and refactored

- flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple
- flash_attn_mask_pad_f16 pads the matching mask tile
- flash_attn_blk_f16 classifies each KV tile per query block as fully masked / mixed / fully unmasked, so the main kernel can skip fully-masked tiles and the mask lookup for fully-unmasked ones
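The padding and classification steps described above can be sketched on the host side as follows. This is an illustration only, not the PR's OpenCL kernels: the function names, the `TileClass` enum, the `float` mask, and the convention that `-inf` means masked and `0` means unmasked are all assumptions made for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-tile classification, one entry per (query block, KV tile).
enum class TileClass : uint8_t { FullyMasked, Mixed, FullyUnmasked };

// Round n up to the next multiple of block (e.g. the KV length up to BLOCK_N).
inline int pad_to_multiple(int n, int block) {
    return (n + block - 1) / block * block;
}

// Copy a row-major [n_q][n_kv] mask into [n_q][n_kv_pad], filling the padded tail
// with -inf so padded keys never contribute. Assumption: -inf = masked, 0 = unmasked.
std::vector<float> pad_mask(const std::vector<float> & mask, int n_q, int n_kv, int block_n) {
    const int n_kv_pad = pad_to_multiple(n_kv, block_n);
    std::vector<float> out(static_cast<size_t>(n_q) * n_kv_pad, -INFINITY);
    for (int q = 0; q < n_q; ++q) {
        for (int k = 0; k < n_kv; ++k) {
            out[static_cast<size_t>(q) * n_kv_pad + k] = mask[static_cast<size_t>(q) * n_kv + k];
        }
    }
    return out;
}

// Classify every KV tile for every query block of a padded mask.
// Result is indexed as [q_block * n_kv_tiles + kv_tile].
std::vector<TileClass> classify_tiles(const std::vector<float> & mask,
                                      int n_q, int n_kv_pad, int block_m, int block_n) {
    const int n_q_blocks = (n_q + block_m - 1) / block_m;
    const int n_kv_tiles = n_kv_pad / block_n;
    std::vector<TileClass> cls(static_cast<size_t>(n_q_blocks) * n_kv_tiles);
    for (int qb = 0; qb < n_q_blocks; ++qb) {
        for (int kt = 0; kt < n_kv_tiles; ++kt) {
            bool all_masked = true;
            bool all_open   = true;
            for (int q = qb * block_m; q < n_q && q < (qb + 1) * block_m; ++q) {
                for (int k = kt * block_n; k < (kt + 1) * block_n; ++k) {
                    const float m = mask[static_cast<size_t>(q) * n_kv_pad + k];
                    if (m != -INFINITY) all_masked = false;  // at least one visible key
                    if (m != 0.0f)      all_open   = false;  // mask value must be read
                }
            }
            cls[static_cast<size_t>(qb) * n_kv_tiles + kt] =
                all_masked ? TileClass::FullyMasked :
                all_open   ? TileClass::FullyUnmasked : TileClass::Mixed;
        }
    }
    return cls;
}
```

With such a table, a main loop over KV tiles could skip `FullyMasked` tiles entirely, skip the mask read for `FullyUnmasked` tiles, and fall back to the per-element mask lookup only for `Mixed` tiles.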

ggml-gh-bot · 2026-06-26T23:48:21Z

Hi @wanghqc, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

wanghqc and others added 10 commits June 25, 2026 18:49

opencl: rework FA kernel for f16 and f32

de97345

opencl: FA kernels for q4_0 and q8_0

bd05512

opencl: set_rows for f32 to q8_0/q4_0

0fc396d

opencl: dequant kernels for q4_0 and q8_0

1ca6acf

opencl: add FA tile tuning table with override

350d26d

opencl: wire host side for FA

6088e8b

opencl: q4_0 MoE tensors are also SOA'ed

00c1ffb

opencl: cosmetic fix

5d59efb

opencl: refactor, also clarify some code paths in comments

7110431

wanghqc requested a review from a team as a code owner June 26, 2026 23:44

github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (issues specific to the OpenCL backend) labels Jun 26, 2026

wanghqc wants to merge 10 commits into ggml-org:master from qualcomm:hq/fa-rework
