
Commit 0b05974

hip: bypass memory pool for flash attention f16 temp buffers
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently. For quantized KV flash attention, the f16 dequant temp buffers (K_f16, V_f16) stay allocated in the pool after use, consuming more VRAM than the KV compression saves. This causes quantized KV (q8_0, q4_0) to OOM before f16 at equivalent context lengths on HIP/ROCm, where VMM is unavailable.

Root cause: ggml_cuda_pool_leg::free() stores buffers in buffer_pool[] for reuse and never calls cudaFree. On CUDA with VMM, the OS can reclaim unused virtual memory. On HIP without VMM (all consumer RDNA 3/4 GPUs), the pool permanently consumes peak VRAM.

Fix: on HIP, allocate the f16 temp buffers with cudaMalloc and free them with cudaFree (via an RAII wrapper) instead of the pool. Memory is released after the FA kernel completes via cudaStreamSynchronize.

Trade-off: one cudaStreamSynchronize per FA call (~5% overhead at 32K context).

Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP only).

Confirmed: gfx1100 (RX 7900 XT), gfx1201 (RX 9070 XT)

Fixes: ggml-org#22107
1 parent d3271ac commit 0b05974

1 file changed: ggml/src/ggml-cuda/fattn-common.cuh

Lines changed: 9 additions & 4 deletions
@@ -1299,14 +1299,19 @@ void launch_fattn(
 
 #ifdef GGML_USE_HIP
     // HIP/ROCm: bypass the memory pool for f16 temp buffers.
-    // The legacy pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently.
-    // For quantized KV dequant, this means the f16 temp buffer stays allocated,
-    // consuming more VRAM than the quantized KV compression saves — causing OOM.
-    // Using raw alloc+free ensures the memory is released after the kernel completes.
+    // The legacy pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently
+    // because free() stores buffers for reuse rather than releasing them.
+    // On HIP without VMM support (RDNA 3/4), this means the f16 dequant temp buffers
+    // for quantized KV stay allocated after use, consuming more VRAM than the KV
+    // compression saves — causing OOM before f16 at equivalent context lengths.
+    // Using raw cudaMalloc/cudaFree ensures memory is released after the kernel completes.
+    // Ref: https://github.com/ggml-org/llama.cpp/issues/22107
     struct hip_f16_alloc {
         half * ptr = nullptr;
         cudaStream_t stream;
         hip_f16_alloc(cudaStream_t s) : stream(s) {}
+        hip_f16_alloc(const hip_f16_alloc &) = delete;
+        hip_f16_alloc & operator=(const hip_f16_alloc &) = delete;
         ~hip_f16_alloc() {
             if (ptr) {
                 cudaStreamSynchronize(stream);
