Feature/vulkan fa large buffer by Yvi71 · Pull Request #181 · TheTom/llama-cpp-turboquant

Yvi71 · 2026-06-13T00:50:33Z

Overview

This PR fixes a crash/abort in the Vulkan backend at large context lengths (or high parallel slot counts), where the intermediate split_k buffer or the model inputs exceed the GPU's maxStorageBufferRange (capped at 4GB on AMD discrete GPUs like the RX 6800).
Although the scheduler's supports_op already allows tensors larger than maxStorageBufferRange when shader_64b_indexing is enabled, the Flash Attention implementation (ggml_vk_flash_attn) still hard-aborted on the preallocation size check and did not select the 64-bit indexing pipeline variants at runtime.
This change:

Relaxes the preallocation limit check to validate against max_buffer_size instead of maxStorageBufferRange when the device supports shader_64b_indexing.
Checks whether 64-bit indexing is required, based on the size of the bound buffers.
Retrieves and dispatches the 64-bit indexing variants of both the main Flash Attention pipeline and the split-k reduction pipeline (pipeline_flash_attn_split_k_reduce).
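
The three steps above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `DeviceLimits`, `prealloc_limit`, `needs_64b_indexing`, `PipelineVariant`, and `select_variant` are hypothetical names chosen for illustration, and the real backend reads these values from its own device structures.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical, simplified stand-in for the device properties named in the PR.
struct DeviceLimits {
    uint64_t max_storage_buffer_range; // largest range a single storage buffer binding may cover
    uint64_t max_buffer_size;          // largest single buffer the device can allocate
    bool     shader_64b_indexing;      // 64-bit indexing shader variants are available
};

// Step 1: the preallocation check validates against max_buffer_size when
// 64-bit indexing is supported, and against maxStorageBufferRange otherwise.
inline uint64_t prealloc_limit(const DeviceLimits & dev) {
    return dev.shader_64b_indexing ? dev.max_buffer_size
                                   : dev.max_storage_buffer_range;
}

// Step 2: 64-bit indexing is required as soon as any bound buffer is larger
// than what a single storage buffer binding can address.
inline bool needs_64b_indexing(const DeviceLimits & dev,
                               std::initializer_list<uint64_t> bound_buffer_sizes) {
    if (!dev.shader_64b_indexing) {
        return false; // no 64-bit variants to fall back on
    }
    for (uint64_t size : bound_buffer_sizes) {
        if (size > dev.max_storage_buffer_range) {
            return true;
        }
    }
    return false;
}

// Step 3: the same decision picks the variant for both the main pipeline
// and the split-k reduction pipeline.
enum class PipelineVariant { Index32, Index64 };

inline PipelineVariant select_variant(bool use_64b_indexing) {
    return use_64b_indexing ? PipelineVariant::Index64 : PipelineVariant::Index32;
}
```

With a 4GB binding limit and a larger `max_buffer_size`, a 5GB split_k buffer would pass the relaxed check and select `PipelineVariant::Index64`, while a 2GB buffer would stay on the 32-bit variant.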

Additional information

Bypasses the 4GB maxStorageBufferRange limit by leveraging the existing 64-bit indexing pipeline variants already compiled in the Vulkan backend.
Verified with an 80,000-token prompt on dual AMD Radeon RX 6800 GPUs.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - Used an AI coding assistant (Antigravity) in an assistive capacity to help identify the missing 64-bit pipeline selections in the Flash Attention code path and draft the initial patch. All changes have been manually reviewed, compiled, and tested on physical hardware.

hip: bypass memory pool for FA f16 temp buffers

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently. For quantized KV flash attention, the f16 dequant temp buffers (K_f16, V_f16) stay allocated in the pool after use, consuming more VRAM than the KV compression saves. This causes quantized KV (q8_0, q4_0) to OOM before f16 at equivalent context lengths on HIP/ROCm where VMM is unavailable.

Root cause: ggml_cuda_pool_leg::free() stores buffers in buffer_pool[] for reuse and never calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory. On HIP without VMM (all consumer RDNA 3/4 GPUs), the pool permanently consumes peak VRAM.

Fix: on HIP, allocate f16 temp buffers with cudaMalloc and free with cudaFree (via RAII wrapper) instead of the pool. Memory is released after the FA kernel completes via cudaStreamSynchronize.

Trade-off: one cudaStreamSynchronize per FA call (~5% overhead at 32K).

Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP only).

Confirmed: gfx1100 (RX 7900 XT), gfx1201 (RX 9070 XT)

Fixes: ggml-org#22107
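
The difference between a retaining pool and a scoped allocation can be shown with a toy model. This is only a sketch under stated assumptions: `FakeDevice`, `RetainingPool`, and `ScopedBuffer` are hypothetical names, host `malloc`/`free` stand in for `cudaMalloc`/`cudaFree`, and the stream synchronization that the real fix performs before freeing is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy stand-in for a GPU: counts bytes currently held (assumption: host
// malloc/free replace cudaMalloc/cudaFree for illustration only).
struct FakeDevice {
    size_t in_use = 0;
    void * alloc(size_t n) { in_use += n; return std::malloc(n); }
    void   release(void * p, size_t n) { in_use -= n; std::free(p); }
};

// Pool that mimics the retention behaviour described above: free() parks the
// buffer for reuse and never returns it to the device while the pool lives.
class RetainingPool {
    struct Slot { void * ptr; size_t size; };
    FakeDevice &      dev;
    std::vector<Slot> cached;
public:
    explicit RetainingPool(FakeDevice & d) : dev(d) {}
    ~RetainingPool() {
        for (const Slot & s : cached) dev.release(s.ptr, s.size);
    }
    void * alloc(size_t n, size_t & actual_size) {
        for (size_t i = 0; i < cached.size(); ++i) {
            if (cached[i].size >= n) { // reuse a parked buffer if it is big enough
                Slot s = cached[i];
                cached.erase(cached.begin() + static_cast<std::ptrdiff_t>(i));
                actual_size = s.size;
                return s.ptr;
            }
        }
        actual_size = n;
        return dev.alloc(n);
    }
    void free(void * p, size_t size) { cached.push_back({p, size}); }
};

// RAII wrapper in the spirit of the fix: memory goes back to the device as
// soon as the buffer leaves scope.
class ScopedBuffer {
    FakeDevice & dev;
    void *       ptr;
    size_t       size;
public:
    ScopedBuffer(FakeDevice & d, size_t n) : dev(d), ptr(d.alloc(n)), size(n) {}
    ~ScopedBuffer() { dev.release(ptr, size); }
    ScopedBuffer(const ScopedBuffer &) = delete;
    ScopedBuffer & operator=(const ScopedBuffer &) = delete;
    void * get() const { return ptr; }
};
```

In this model, a temp buffer taken from `RetainingPool` keeps counting against `in_use` after `free()`, whereas one held by `ScopedBuffer` drops back to zero at the end of its scope. That mirrors why routing the f16 temp buffers around the pool lowers steady-state VRAM, at the cost of an allocation and a synchronization per call.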


TheTom and others added 12 commits April 20, 2026 09:14

Merge pull request TheTom#92 from TheTom/fix/hip-fa-pool-retention

57f6b93

hip: bypass memory pool for FA f16 temp buffers

Merge origin/master into feature/turboquant-kv-cache

bd0d153

feat(vulkan): add native Vulkan support for TurboQuant KV Cache (turbo3/turbo4)

19b1864

Merge origin/tturney/vulkan-rdna4-repro resolving conflicts via --theirs

b071787

Merge remote-tracking branch 'origin/feature/turboquant-kv-cache' into feature/vulkan-turboquant-kv-cache

210f76e

vulkan: fast path for walsh-hadamard transform (ggml-org#23687)

d316c52

* vulkan: fast path for walsh-hadamard transform
* disable for intel due to segfault

vulkan: fix wrong index variable in inner loop (ggml-org#23665)

1a8ab21

Merge remote-tracking branch 'origin/feature/turboquant-kv-cache' into feature/vulkan-turboquant-kv-cache

ed3deaa

Merge remote-tracking branch origin/feature/turboquant-kv-cache

fd6052b

Merge remote-tracking branch 'origin/feature/turboquant-kv-cache' into feature/vulkan-turboquant-kv-cache

e2c0fa8

vulkan: enable large flash attention buffers (>4GB) via 64-bit indexing

9a7b53d

github-actions (bot) added the Nvidia GPU, ggml, Vulkan, and testing labels on Jun 13, 2026

Yvi71 wants to merge 12 commits into TheTom:feature/turboquant-kv-cache from Yvi71:feature/vulkan-fa-large-buffer

4 participants
