Commit 30c3c23
hip: bypass memory pool for flash attention f16 temp buffers

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized
allocations permanently. For flash attention with a quantized KV cache,
the f16 dequantization temp buffers (K_f16, V_f16) stay allocated in the
pool after use and consume more VRAM than the KV compression saves. As a
result, quantized KV (q8_0, q4_0) hits OOM before f16 at equivalent
context lengths on HIP/ROCm, where VMM is unavailable.

Root cause: ggml_cuda_pool_leg::free() stores buffers in buffer_pool[]
for reuse and never calls cudaFree. On CUDA with VMM, the OS can
reclaim unused virtual memory. On HIP without VMM (all consumer RDNA
3/4 GPUs), the pool permanently holds peak VRAM.
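
The retention behaviour described above can be illustrated with a small host-only sketch. It is not the ggml_cuda_pool_leg code: `FakeDevice`, `RetainingPool`, `Buffer` and `bytes_retained_after_free` are hypothetical names, and the "device" is only a byte counter, so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the device allocator: it only tracks reserved bytes.
struct FakeDevice {
    size_t in_use = 0;
    void * alloc(size_t size) { in_use += size; return ::operator new(size); }
    void release(void * ptr, size_t size) {
        assert(in_use >= size);
        in_use -= size;
        ::operator delete(ptr);
    }
};

struct Buffer {
    void * ptr;
    size_t size;
};

// Simplified pool that keeps freed buffers for reuse and never hands them
// back to the device while the pool is alive (assumed model of the legacy pool).
class RetainingPool {
public:
    explicit RetainingPool(FakeDevice & dev) : dev_(dev) {}
    ~RetainingPool() {
        for (const Buffer & b : cache_) dev_.release(b.ptr, b.size);
    }

    Buffer alloc(size_t size) {
        // Reuse the first cached buffer that is large enough.
        for (size_t i = 0; i < cache_.size(); ++i) {
            if (cache_[i].size >= size) {
                Buffer b = cache_[i];
                cache_.erase(cache_.begin() + i);
                return b;
            }
        }
        return Buffer{dev_.alloc(size), size};
    }

    // "Free" only moves the buffer into the cache; device memory stays reserved.
    void free(Buffer b) { cache_.push_back(b); }

private:
    FakeDevice & dev_;
    std::vector<Buffer> cache_;
};

// Device bytes still reserved after both temp buffers have been "freed".
size_t bytes_retained_after_free(size_t buf_size) {
    FakeDevice dev;
    RetainingPool pool(dev);
    Buffer k_f16 = pool.alloc(buf_size);
    Buffer v_f16 = pool.alloc(buf_size);
    pool.free(k_f16);
    pool.free(v_f16);
    return dev.in_use; // read before the pool destructor runs
}
```

In this model, `bytes_retained_after_free(n)` returns `2 * n`: the two temp buffers still count against device memory after they are no longer in use.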

Fix: on HIP, allocate the f16 temp buffers with cudaMalloc and free them
with cudaFree (via an RAII wrapper) instead of going through the pool.
cudaStreamSynchronize waits for the FA kernel to complete before the
memory is released.
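
A rough sketch of the allocate, launch, synchronize, free ordering that the fix relies on. This is not the code from the commit: `DirectTempBuffer`, `run_fa_call` and the `fake_*` functions are hypothetical stand-ins so the example runs on the host. In a HIP build they would correspond to cudaMalloc, cudaFree and cudaStreamSynchronize.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-ins for the device runtime; they log each call so the
// ordering can be inspected.
static std::vector<std::string> g_events;
static size_t g_device_bytes = 0;

static void * fake_device_malloc(size_t size) {
    g_device_bytes += size;
    g_events.push_back("malloc");
    return std::malloc(size);
}

static void fake_device_free(void * ptr, size_t size) {
    assert(g_device_bytes >= size);
    g_device_bytes -= size;
    g_events.push_back("free");
    std::free(ptr);
}

static void fake_stream_synchronize() { g_events.push_back("sync"); }

// RAII owner for one temporary buffer: released directly in the destructor,
// never cached in a pool.
class DirectTempBuffer {
public:
    explicit DirectTempBuffer(size_t size)
        : ptr_(fake_device_malloc(size)), size_(size) {}
    ~DirectTempBuffer() { fake_device_free(ptr_, size_); }

    DirectTempBuffer(const DirectTempBuffer &) = delete;
    DirectTempBuffer & operator=(const DirectTempBuffer &) = delete;

    void * get() const { return ptr_; }

private:
    void * ptr_;
    size_t size_;
};

// Shape of one flash attention call with quantized KV under this scheme.
// Returns the device bytes still reserved once the call has finished.
size_t run_fa_call(size_t kv_f16_bytes) {
    {
        DirectTempBuffer k_f16(kv_f16_bytes);
        DirectTempBuffer v_f16(kv_f16_bytes);
        g_events.push_back("launch"); // dequantize + FA kernel would be enqueued here
        fake_stream_synchronize();    // wait until the kernel no longer reads the buffers
    } // destructors release both buffers here
    return g_device_bytes;
}
```

The synchronize call sits inside the scope that owns the buffers, so the destructors can only run after the stream has drained. That per-call wait is where the overhead comes from.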

Trade-off: one cudaStreamSynchronize per FA call (~5% overhead at 32K context).
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP only).
Confirmed: gfx1100 (RX 7900 XT), gfx1201 (RX 9070 XT)
Fixes: ggml-org#22107
1 parent 23b8cc4 commit 30c3c23
1 file changed
Lines changed: 29 additions & 0 deletions
Diff body not captured. From the line-number columns: new lines 949-976 and 979 are additions (29 lines); the neighbouring context lines are unchanged.