Feature/vulkan fa large buffer #181

Open
Yvi71 wants to merge 12 commits into TheTom:feature/turboquant-kv-cache from Yvi71:feature/vulkan-fa-large-buffer

Yvi71 commented Jun 13, 2026

Overview

This PR fixes a crash/abort in the Vulkan backend at large context lengths (or high parallel slot counts), where the intermediate split_k buffer or the model inputs exceed the GPU's maxStorageBufferRange (capped at 4GB on AMD discrete GPUs like the RX 6800).
The scheduler's supports_op already allows tensors larger than maxStorageBufferRange when shader_64b_indexing is enabled. However, the Flash Attention implementation (ggml_vk_flash_attn) still hard-aborted on the preallocation size limit and did not select the 64-bit indexing pipeline variants at runtime.
This change:

  1. Relaxes the preallocation limit check to validate against max_buffer_size instead of maxStorageBufferRange if the device supports shader_64b_indexing.
  2. Checks if 64-bit indexing is required based on the size of the bound buffers.
  3. Retrieves and dispatches the 64-bit indexing variants of both the main Flash Attention pipeline and the split-k reduction pipeline (pipeline_flash_attn_split_k_reduce).
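
The three steps above can be sketched roughly as follows. This is a simplified, standalone illustration and not the code from the patch: `DeviceLimits`, `prealloc_limit`, `needs_64b_indexing`, `PipelineVariant`, and `select_variant` are hypothetical names, and the assumption that 64-bit indexing is needed once any bound buffer exceeds maxStorageBufferRange is inferred from the description.

```cpp
#include <cstdint>
#include <initializer_list>

// Hypothetical, simplified view of the device properties named in this PR.
struct DeviceLimits {
    uint64_t max_storage_buffer_range; // largest range addressable with 32-bit indexing
    uint64_t max_buffer_size;          // largest single buffer the device allows
    bool     shader_64b_indexing;      // device can run the 64-bit indexing variants
};

// Step 1: the preallocation check validates against max_buffer_size
// when 64-bit indexing is available, otherwise against maxStorageBufferRange.
constexpr uint64_t prealloc_limit(DeviceLimits dev) {
    return dev.shader_64b_indexing ? dev.max_buffer_size
                                   : dev.max_storage_buffer_range;
}

// Step 2 (assumed rule): 64-bit indexing is required if the device supports it
// and at least one bound buffer is larger than maxStorageBufferRange.
constexpr bool needs_64b_indexing(DeviceLimits dev,
                                  std::initializer_list<uint64_t> bound_sizes) {
    if (!dev.shader_64b_indexing) {
        return false;
    }
    for (uint64_t size : bound_sizes) {
        if (size > dev.max_storage_buffer_range) {
            return true;
        }
    }
    return false;
}

// Step 3: pick the pipeline variant. The same choice would apply to both the
// main Flash Attention pipeline and the split-k reduction pipeline.
enum class PipelineVariant { Index32, Index64 };

constexpr PipelineVariant select_variant(bool use_64b) {
    return use_64b ? PipelineVariant::Index64 : PipelineVariant::Index32;
}
```

On a device without shader_64b_indexing, this sketch keeps the old limit and never selects the 64-bit variant, so the oversized case would still have to be rejected there.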

Additional information

  • Bypasses the 4GB maxStorageBufferRange limit by leveraging the existing 64-bit indexing pipeline variants already compiled in the Vulkan backend.
  • Successfully verified with a prompt containing 80,000 tokens on dual AMD Radeon RX 6800 GPUs.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - Used an AI coding assistant (Antigravity) in an assistive capacity to help identify the missing 64-bit pipeline selections in the Flash Attention code path and draft the initial patch. All changes have been manually reviewed, compiled, and tested on physical hardware.

TheTom and others added 12 commits April 20, 2026 09:14
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized
allocations permanently. For quantized KV flash attention, the f16
dequant temp buffers (K_f16, V_f16) stay allocated in the pool after
use, consuming more VRAM than the KV compression saves. This causes
quantized KV (q8_0, q4_0) to OOM before f16 at equivalent context
lengths on HIP/ROCm where VMM is unavailable.

Root cause: ggml_cuda_pool_leg::free() stores buffers in buffer_pool[]
for reuse and never calls cudaFree. On CUDA with VMM the OS can
reclaim unused virtual memory. On HIP without VMM (all consumer RDNA
3/4 GPUs), the pool permanently consumes peak VRAM.

Fix: on HIP, allocate f16 temp buffers with cudaMalloc and free with
cudaFree (via RAII wrapper) instead of the pool. Memory is released
after the FA kernel completes via cudaStreamSynchronize.

Trade-off: one cudaStreamSynchronize per FA call (~5% overhead at 32K).

Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP only).
Confirmed: gfx1100 (RX 7900 XT), gfx1201 (RX 9070 XT)
Fixes: ggml-org#22107
hip: bypass memory pool for FA f16 temp buffers
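
The RAII approach described in this commit message can be illustrated with a minimal sketch. It is not the commit's code: `device_alloc` and `device_free` are hypothetical host-side stand-ins for cudaMalloc and cudaFree so the example compiles without a GPU toolchain, and `ScopedDeviceBuffer`, `g_live_allocs`, and `live_allocs_during_call` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Counts outstanding allocations so the release-on-scope-exit behavior is observable.
static int g_live_allocs = 0;

// Hypothetical stand-ins for cudaMalloc / cudaFree (host memory, no GPU needed).
inline void * device_alloc(std::size_t n) {
    void * p = std::malloc(n);
    if (p) {
        ++g_live_allocs;
    }
    return p;
}

inline void device_free(void * p) {
    if (p) {
        --g_live_allocs;
        std::free(p);
    }
}

// RAII wrapper: the buffer is freed when the object goes out of scope,
// instead of being handed back to a pool that keeps it alive.
class ScopedDeviceBuffer {
public:
    explicit ScopedDeviceBuffer(std::size_t n) : ptr_(device_alloc(n)) {}
    ~ScopedDeviceBuffer() { device_free(ptr_); }

    ScopedDeviceBuffer(const ScopedDeviceBuffer &) = delete;
    ScopedDeviceBuffer & operator=(const ScopedDeviceBuffer &) = delete;

    void * get() const { return ptr_; }

private:
    void * ptr_;
};

// Mimics one FA call: two temp buffers live only for the duration of the call.
inline int live_allocs_during_call() {
    ScopedDeviceBuffer k_f16(1024);
    ScopedDeviceBuffer v_f16(1024);
    assert(k_f16.get() && v_f16.get());
    // In the real code path, the kernel launch and the stream synchronization
    // would happen here, before the destructors release the buffers.
    return g_live_allocs;
}
```

Because the destructors run at the end of the call, the synchronization has to complete before the scope exits. That ordering is the source of the per-call overhead the commit message lists as a trade-off.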
* vulkan: fast path for walsh-hadamard transform

* disable for intel due to segfault
4 participants