
Commit 30c3c23

hip: bypass memory pool for flash attention f16 temp buffers
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently. For quantized KV flash attention, the f16 dequant temp buffers (K_f16, V_f16) stay allocated in the pool after use, consuming more VRAM than the KV compression saves. This causes quantized KV (q8_0, q4_0) to OOM before f16 at equivalent context lengths on HIP/ROCm, where VMM is unavailable.

Root cause: ggml_cuda_pool_leg::free() stores buffers in buffer_pool[] for reuse and never calls cudaFree. On CUDA with VMM, the OS can reclaim unused virtual memory; on HIP without VMM (all consumer RDNA 3/4 GPUs), the pool permanently consumes peak VRAM.

Fix: on HIP, allocate the f16 temp buffers with cudaMalloc and free them with cudaFree (via an RAII wrapper) instead of the pool. Memory is released after the FA kernel completes via cudaStreamSynchronize.

Trade-off: one cudaStreamSynchronize per FA call (~5% overhead at 32K context).

Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP only).

Confirmed on: gfx1100 (RX 7900 XT), gfx1201 (RX 9070 XT)

Fixes: ggml-org#22107
1 parent 23b8cc4 commit 30c3c23

1 file changed

Lines changed: 29 additions & 0 deletions

File tree

ggml/src/ggml-cuda/fattn-common.cuh

@@ -946,8 +946,37 @@ void launch_fattn(
     const int cc  = ggml_cuda_info().devices[id].cc;
     const int nsm = ggml_cuda_info().devices[id].nsm;

+#ifdef GGML_USE_HIP
+    // HIP/ROCm: bypass the memory pool for f16 temp buffers.
+    // The legacy pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently
+    // because free() stores buffers for reuse rather than releasing them.
+    // On HIP without VMM support (RDNA 3/4), this means the f16 dequant temp buffers
+    // for quantized KV stay allocated after use, consuming more VRAM than the KV
+    // compression saves — causing OOM before f16 at equivalent context lengths.
+    // Using raw cudaMalloc/cudaFree ensures memory is released after the kernel completes.
+    // Ref: https://github.com/ggml-org/llama.cpp/issues/22107
+    struct hip_f16_alloc {
+        half * ptr = nullptr;
+        cudaStream_t stream;
+        hip_f16_alloc(cudaStream_t s) : stream(s) {}
+        hip_f16_alloc(const hip_f16_alloc &) = delete;
+        hip_f16_alloc & operator=(const hip_f16_alloc &) = delete;
+        ~hip_f16_alloc() {
+            if (ptr) {
+                cudaStreamSynchronize(stream);
+                cudaFree(ptr);
+            }
+        }
+        void alloc(size_t nelements) {
+            CUDA_CHECK(cudaMalloc(&ptr, nelements * sizeof(half)));
+        }
+    };
+    hip_f16_alloc K_f16(main_stream);
+    hip_f16_alloc V_f16(main_stream);
+#else
     ggml_cuda_pool_alloc<half> K_f16(pool);
     ggml_cuda_pool_alloc<half> V_f16(pool);
+#endif
     ggml_cuda_pool_alloc<int>    KV_max(pool);
     ggml_cuda_pool_alloc<float>  dst_tmp(pool);
     ggml_cuda_pool_alloc<float2> dst_tmp_meta(pool);
