Commit 0b05974
hip: bypass memory pool for flash attention f16 temp buffers
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized
allocations permanently. For quantized KV flash attention, the f16
dequant temp buffers (K_f16, V_f16) stay allocated in the pool after
use, consuming more VRAM than the KV compression saves. This causes
quantized KV (q8_0, q4_0) to OOM before f16 at equivalent context
lengths on HIP/ROCm where VMM is unavailable.
Root cause: ggml_cuda_pool_leg::free() stores buffers in buffer_pool[]
for reuse and never calls cudaFree. On CUDA with VMM the OS can
reclaim unused virtual memory. On HIP without VMM (all consumer RDNA
3/4 GPUs), the pool permanently consumes peak VRAM.
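The retention behaviour described above can be illustrated with a host-only toy model. This is a sketch under stated assumptions, not ggml's code: `MockDevice`, `ToyPool`, and both helper functions are hypothetical names, and the "device" only counts bytes so that the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the device allocator: it only counts reserved bytes.
struct MockDevice {
    std::size_t reserved = 0;
    void alloc(std::size_t n)   { reserved += n; }
    void release(std::size_t n) { assert(n <= reserved); reserved -= n; }
};

// Toy caching pool: free() parks the buffer for reuse and never releases it.
struct ToyPool {
    MockDevice & dev;
    std::vector<std::size_t> cached;  // sizes of parked buffers

    explicit ToyPool(MockDevice & d) : dev(d) {}

    std::size_t alloc(std::size_t n) {
        for (std::size_t i = 0; i < cached.size(); ++i) {
            if (cached[i] >= n) {     // reuse the first parked buffer that fits
                std::size_t got = cached[i];
                cached.erase(cached.begin() + static_cast<std::ptrdiff_t>(i));
                return got;
            }
        }
        dev.alloc(n);                 // nothing fits: device usage grows
        return n;
    }

    void free(std::size_t buf) { cached.push_back(buf); }  // no dev.release()
};

// Bytes still reserved on the device after one call that used the pool.
std::size_t reserved_after_pooled(std::size_t k_bytes, std::size_t v_bytes) {
    MockDevice dev;
    ToyPool pool(dev);
    std::size_t k = pool.alloc(k_bytes);
    std::size_t v = pool.alloc(v_bytes);
    pool.free(v);
    pool.free(k);
    return dev.reserved;
}

// Bytes still reserved after the same call using direct alloc/release.
std::size_t reserved_after_direct(std::size_t k_bytes, std::size_t v_bytes) {
    MockDevice dev;
    dev.alloc(k_bytes);
    dev.alloc(v_bytes);
    dev.release(v_bytes);
    dev.release(k_bytes);
    return dev.reserved;
}
```

In this model the pooled path leaves the full size of both temp buffers reserved after the call, while the direct path returns to zero, which is the difference the commit targets.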
Fix: on HIP, allocate the f16 temp buffers with cudaMalloc and free them
with cudaFree (via an RAII wrapper) instead of going through the pool. A
cudaStreamSynchronize after the FA kernel ensures the kernel has finished
before the memory is released.
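The RAII idea in the fix can be sketched as follows. This is a minimal host-only illustration, not the wrapper from the commit: `TempBuffer`, `fake_device_malloc`, `fake_device_free`, and `live_buffers_during_call` are hypothetical names, and the two `fake_device_*` functions stand in for cudaMalloc and cudaFree so that the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical counter of live allocations, used only to make the sketch checkable.
static int g_live_buffers = 0;

// Stand-ins for cudaMalloc / cudaFree (assumption: host malloc/free for testability).
static void * fake_device_malloc(std::size_t n) {
    ++g_live_buffers;
    return std::malloc(n ? n : 1);  // avoid a possible null result for n == 0
}

static void fake_device_free(void * p) {
    --g_live_buffers;
    std::free(p);
}

// RAII owner for a temporary buffer: allocate in the constructor,
// release in the destructor, never copy.
class TempBuffer {
public:
    explicit TempBuffer(std::size_t n) : ptr_(fake_device_malloc(n)), size_(n) {}
    ~TempBuffer() { fake_device_free(ptr_); }

    TempBuffer(const TempBuffer &) = delete;
    TempBuffer & operator=(const TempBuffer &) = delete;

    void *      get()  const { return ptr_; }
    std::size_t size() const { return size_; }

private:
    void *      ptr_;
    std::size_t size_;
};

// Models one FA call: the temp buffers exist only for the duration of the call.
int live_buffers_during_call(std::size_t k_bytes, std::size_t v_bytes) {
    TempBuffer k_f16(k_bytes);
    TempBuffer v_f16(v_bytes);
    assert(k_f16.get() != nullptr && v_f16.get() != nullptr);
    // A real implementation would launch the kernel here and synchronize
    // the stream before the buffers go out of scope.
    return g_live_buffers;
}   // destructors run here and release both buffers
```

Because the release is tied to scope exit, every return path frees the buffers. In a real GPU version the synchronization has to come before the destructors run, which is where the per-call cost noted in the trade-off comes from.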
Trade-off: one cudaStreamSynchronize per FA call (~5% overhead at 32K context).
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP only).
Confirmed: gfx1100 (RX 7900 XT), gfx1201 (RX 9070 XT)
Fixes: ggml-org#22107
1 parent d3271ac commit 0b05974
1 file changed
Lines changed: 9 additions & 4 deletions
Diff body not captured. Line numbers only: original lines 1302-1305 removed; new lines 1302-1308 and 1313-1314 added (hunk spans original lines 1299-1312, new lines 1299-1317).