hip: bypass memory pool for FA f16 temp buffers by TheTom · Pull Request #92 · TheTom/llama-cpp-turboquant

TheTom · 2026-04-20T01:52:45Z

Backport of upstream draft PR ggml-org#22094.

Bypasses the legacy memory pool for flash attention f16 temp buffers on HIP/ROCm to prevent OOM with quantized KV cache types (q8_0, q4_0).

Includes code review feedback: deleted copy constructor and assignment operator on the RAII wrapper to prevent accidental double-free.

Touches a single file, ggml/src/ggml-cuda/fattn-common.cuh, and every change is guarded by #ifdef GGML_USE_HIP.

Refs: ggml-org#22107, ROCm/rocm-systems#2516
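
The RAII wrapper with deleted copy operations described above can be sketched roughly as follows. This is not the code from fattn-common.cuh: the names `direct_alloc`, `demo_device_malloc`, and `demo_device_free` are hypothetical, and host `malloc`/`free` stand in for `cudaMalloc`/`cudaFree` so the sketch builds without a GPU toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <type_traits>

// Hypothetical stand-ins for cudaMalloc/cudaFree; they only count live allocations.
static int g_live_allocs = 0;

static void * demo_device_malloc(std::size_t size) {
    void * p = std::malloc(size);
    if (p != nullptr) {
        ++g_live_allocs;
    }
    return p;
}

static void demo_device_free(void * p) {
    if (p != nullptr) {
        --g_live_allocs;
        std::free(p);
    }
}

// Owns exactly one direct allocation and releases it when it goes out of scope.
struct direct_alloc {
    void * ptr = nullptr;

    explicit direct_alloc(std::size_t size) : ptr(demo_device_malloc(size)) {}
    ~direct_alloc() { demo_device_free(ptr); }

    // A copy would create a second owner of the same pointer and free it twice.
    direct_alloc(const direct_alloc &) = delete;
    direct_alloc & operator=(const direct_alloc &) = delete;
};

static_assert(!std::is_copy_constructible<direct_alloc>::value, "copying must not compile");
static_assert(!std::is_copy_assignable<direct_alloc>::value, "copy-assignment must not compile");

// Returns the number of live allocations while two temp buffers are in scope.
static int live_allocs_inside_scope() {
    direct_alloc k_f16(1024);
    direct_alloc v_f16(1024);
    assert(k_f16.ptr != nullptr && v_f16.ptr != nullptr);
    return g_live_allocs;
}
```

Deleting the copy constructor and copy assignment turns an accidental copy into a compile error instead of a runtime double-free, which is the point of the review feedback.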

actions/labeler@v6 removed the `all:` / `any:` composition keys. The `server/webui` and `server` entries used `all:` to combine `any-glob-to-any-file` with negated `all-globs-to-all-files`, which now errors on every PR with:

    Unknown config options were under "changed-files": all

Flatten both entries to a single `any-glob-to-any-file`. PRs touching both webui and other server files will now receive both labels instead of only `server/webui`.

Co-authored-by: Marxist-Leninist <noreply@users.noreply.github.com>

* sycl : add flash-attn support for head size 512

  This patch extends the SYCL Flash Attention implementation to support head sizes (DKQ/DV) of 512.

  Changes:
  - Added DKQ/DV 512 cases to both tile and vector Flash Attention kernels.
  - Updated kernel selection logic to allow vector kernels for head sizes up to 512 (previously 256).
  - Removed unused/redundant AMD and RDNA-specific configuration functions in `fattn-tile.hpp`.
  - Refactored `ggml_backend_sycl_buffer_init_tensor` to use a switch statement for clearer tensor extra buffer initialization.
  - Added necessary template instances for the new 512 head size across various quantization types.

* remove defunct mxfp4 reorder from setting buffer type

…gml-org#21034)

Co-authored-by: AUTOMATIC <->

…lf-filtering (ggml-org#21623)

* feat: jinja engine improvements for reka-edge

  Port three Jinja engine improvements needed for the reka-edge model:
  1. Python-style string repetition ("ab" * 3 → "ababab")
  2. ensure_ascii=true support for tojson filter (escapes non-ASCII to \uXXXX)
  3. int() builtin on value_int_t (identity, needed for Reka Edge template)

* fix: escape invalid utf8 bytes when ensure_ascii=true

  The json_ensure_ascii_preserving_format function does not correctly handle an edge case where if UTF-8 parsing fails, it adds the non-ascii character back to the output as a raw byte. This commit fixes that by adding the unicode standard replacement character \\ufffd to the output instead. This is the standard behavior for various programming languages like Python, Rust, Go, etc.

* chore: address PR comments

  1. Add todo comment for supporting string repetition for array/tuples
  2. Add support for float identity operation
  3. Move invalid ascii test case to test_fuzzing

* chore: accept suggestion for common/jinja/value.cpp

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common : simplify autoparser tagged parser rules
* cont : remove upper limit on optional args
* cont : revert changes to parsing at the end
* cont : undo arbitrary ordering of optional args
* cont : fix uninitialized required parameters
* revert to simplify merge
* re-apply patches
* restore flexible optional arg ordering tests

* requirements : update transformers to 5.5.0

  This commit updates the transformers dependency to version 5.5.0.

  The motivation for this is that transformers 5.5.0 includes support for Gemma4 and is required to be able to convert Gemma4 models. This is also causing issues for users of gguf-my-repo.

  Refs: https://huggingface.co/spaces/ggml-org/gguf-my-repo/discussions/202

* fix huggingface_hub version

* set version of transformers to 5.5.0

* convert : add ty ignore directives to convert_hf_to_gguf.py

  This commit adds `ty: ignore` directives to transformers tokenizers field/methods to avoid type check errors. There might be better ways to handle this and perhaps this can be done in a follow up commit.

  The motivation for this is that it looks like in transformers 5.5.0 AutoTokenizer.from_pretrained can return generic tokenizer types or None and the type checker now produces an error when the conversion script accesses fields like tokenizer.vocab.

* convert : add ty ignore to suppress type check errors

* convert : remove incorrect type ignores

* convert : fix remaining python checks

  I was running a newer version of ty locally but I've switched to version 0.0.26 which is what CI uses and I was then able to reproduce the errors. Sorry about the noise.

* update transformers version to 5.5.1

…y all return cudaError_t) (ggml-org#21676) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

)

* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

  Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0

  Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)

  KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)

  There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for MoE for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type

  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
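
The Qwen-30B-A3B note in the commit message above (intermediate dimension 768, granularity 256) can be checked with a little arithmetic. The helper below is a hypothetical illustration, not the meta backend's splitting code; the two-device setup and the finer granularity of 32 (the Q4_0 block size in ggml) are assumptions chosen only to show the contrast.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: split `dim` elements over `n_devs` devices in whole
// blocks of `granularity` elements, as evenly as the block count allows.
static std::vector<int> split_by_granularity(int dim, int granularity, int n_devs) {
    assert(granularity > 0 && n_devs > 0 && dim % granularity == 0);
    const int n_blocks = dim / granularity;
    const int base     = n_blocks / n_devs;  // blocks every device receives
    const int extra    = n_blocks % n_devs;  // leftover blocks, one each to the first devices
    std::vector<int> sizes;
    for (int i = 0; i < n_devs; ++i) {
        sizes.push_back((base + (i < extra ? 1 : 0)) * granularity);
    }
    return sizes;
}
```

With 768 elements and a granularity of 256 there are only three blocks, so two devices receive 512 and 256 elements. With a granularity of 32 there are 24 blocks and each device receives 384.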

…org#21570)

Add AMD Instinct MI350X/MI355X (gfx950, CDNA4) support:
- vendors/hip.h: Add CDNA4 preprocessor define for __gfx950__
- common.cuh: Add GGML_CUDA_CC_CDNA4 and GGML_CUDA_CC_IS_CDNA4 macros
- mma.cuh: Route CDNA4 to compatible MFMA instructions:
  * f32 matmul: mfma_f32_16x16x4f32 (xf32 variant unavailable on gfx950)
  * bf16 matmul: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8 matmul: mfma_i32_16x16x32_i8/32x32x16 (same as CDNA3)
- mmq.cuh: Include CDNA4 in stream-k kernel dispatch

CDNA4 is largely compatible with CDNA3 except:
- No xf32 MFMA (mfma_f32_16x16x8_xf32): routes to f32 path
- Different FP8 format (e4m3fn vs e4m3_fnuz): not changed here

Tested on AMD Instinct MI355X (gfx950), ROCm 7.0.1:
- Build: compiles cleanly with -DAMDGPU_TARGETS=gfx950
- llama-bench (Qwen2.5-1.5B Q4_K_M, single GPU):
  * f16+FA: 40,013 tok/s prefill, 254 tok/s decode
  * q8_0+FA: functional
- Flash attention: works correctly
- MMQ: works correctly with stream-k dispatch

Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com>

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* fix: enable reasoning budget sampler for gemma4

  Add thinking_start_tag and thinking_end_tag to common_chat_params_init_gemma4(). Without these, the reasoning budget sampler never activates for gemma4.

  Make the newline after "thought" optional in the PEG parser to handle budget=0 (sampler forces end tag before the newline).

  Add test case for empty thinking block.

  Fixes ggml-org#21487

* use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser

…ml-org#21670) Signed-off-by: Adrien Gallouët <angt@huggingface.co>

I'm not sure what the purpose of keeping `--alias` was when using `--models-preset`, but the result is really weird, as shown in the following logs:

    $ build/bin/llama-server --models-preset preset.ini --alias "Gemma 4 E4B UD Q8_K_XL"
    ...
    init: using 31 threads for HTTP server
    srv load_models: Loaded 2 cached model presets
    srv load_models: Loaded 1 custom model presets from preset.ini
    main: failed to initialize router models: alias 'Gemma 4 E4B UD Q8_K_XL' for model 'angt/test-split-model-stories260K:F32' conflicts with existing model name

So I propose to simply ignore `--alias` too in this case. With this commit, the server starts in routing mode correctly.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

…agment (ggml-org#21521)

* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after ggml-org#20618, and remove the busy webgpu log
* Merge with upstream
* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants
* Update Unary wgsl EXP and EXPM1 for f16 stability
* Fix GET_ROWS IQ4_XS struct for NaN f16 canonicalization
* Fix numerical precision for unary sqrt when working with f16
* Fix NaN canonicalization for packed integers using f16
* Update err threshold for binary div ops when using f16
* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend
* clean: uncomment existing code logs
* clean: clean the unnecessary debug info
* Refactor and generalize dequant helpers
* Remove deprecated quant structs
* Refactor shader defines to reduce repetition
* Remove error override for F16 type
* fix: fix the accidental removal of the proper initialization of ctx
* clean: clean legacy and format code
* fix: did not modify tests ops

---------

Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>

ggml-org#21669)

…gml-org#21739)

The TurboFlash two-pass fused attention kernel produces garbage output on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by default routes turbo3 through the standard FA path which works correctly.

Users can opt-in with TURBO_FLASH=1 for testing/debugging.

No perf regression: the standard FA path matches TurboFlash speed within noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: gate turbo V unpad on V type, not K type

The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently. For quantized KV flash attention, the f16 dequant temp buffers (K_f16, V_f16) stay allocated in the pool after use, consuming more VRAM than the KV compression saves. This causes quantized KV (q8_0, q4_0) to OOM before f16 at equivalent context lengths on HIP/ROCm where VMM is unavailable.

Root cause: ggml_cuda_pool_leg::free() stores buffers in buffer_pool[] for reuse and never calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory. On HIP without VMM (all consumer RDNA 3/4 GPUs), the pool permanently consumes peak VRAM.

Fix: on HIP, allocate f16 temp buffers with cudaMalloc and free with cudaFree (via RAII wrapper) instead of the pool. Memory is released after the FA kernel completes via cudaStreamSynchronize.

Trade-off: one cudaStreamSynchronize per FA call (~5% overhead at 32K).

Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP only).

Confirmed: gfx1100 (RX 7900 XT), gfx1201 (RX 9070 XT)

Fixes: ggml-org#22107
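
The retention behaviour described in this commit message can be illustrated with a toy model. Nothing below is ggml code: `toy_retaining_pool`, `toy_direct`, and `reserved_after_fa_call` are hypothetical names, and plain byte counters stand in for VRAM. The sketch only assumes what the message states, namely that the pool's free() keeps buffers for reuse while the direct path returns them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy pool: free() caches the buffer for reuse and never returns it to the device.
struct toy_retaining_pool {
    std::vector<std::size_t> cached;  // sizes of buffers kept for reuse
    std::size_t reserved = 0;         // bytes currently held from the "device"

    std::size_t alloc(std::size_t size) {
        for (std::size_t i = 0; i < cached.size(); ++i) {
            if (cached[i] >= size) {
                const std::size_t got = cached[i];
                cached.erase(cached.begin() + static_cast<std::ptrdiff_t>(i));
                return got;  // reuse a cached buffer, reserved is unchanged
            }
        }
        reserved += size;    // nothing cached fits: take new memory
        return size;
    }

    void free(std::size_t size) { cached.push_back(size); }  // reserved never drops
};

// Toy direct allocator: every free() returns the bytes immediately.
struct toy_direct {
    std::size_t reserved = 0;

    std::size_t alloc(std::size_t size) { reserved += size; return size; }
    void free(std::size_t size) { assert(reserved >= size); reserved -= size; }
};

// Models one FA call that needs K_f16 and V_f16 temp buffers, then releases them.
template <typename Allocator>
std::size_t reserved_after_fa_call(Allocator & a, std::size_t k_bytes, std::size_t v_bytes) {
    const std::size_t k = a.alloc(k_bytes);
    const std::size_t v = a.alloc(v_bytes);
    a.free(v);
    a.free(k);
    return a.reserved;
}
```

In this model the pool still holds both temp buffers after the call, while the direct allocator holds nothing. That gap is the memory the fix hands back, at the cost of the per-call synchronization listed under the trade-off.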

0cc4m and others added 30 commits April 9, 2026 07:31

vulkan: unify type macros to use Vx instead of _VECx (ggml-org#21605)

8a132fa

webui: Add option to pre-encode conversation for faster next turns (ggml-org#21034)

75511a8

server : fix grammar commandline args (ggml-org#21543)

3ee9da0

Co-authored-by: AUTOMATIC <->

fix: Model Selector choice sync (ggml-org#21628)

9949ad0

metal : add missing mm-id specializations for q1_0 (ggml-org#21662)

5e9c635

vocab: add gemma4 tokenizer tests, fix edge case (ggml-org#21534)

0ec191e

* YATF (Yet Another Tokenizer Fix) for Gemma 4. With tests!
* Remove unnecessary hash from update script.
* minor: move constant

mtmd: support dots.ocr (ggml-org#17575)

501aeed

* convert gguf
* clip impl
* fix conversion
* wip
* corrections
* update docs
* add gguf to test script

model: fix multimodal padding token for gemma3n/gemma4 (ggml-org#21625)

057dba3

* model: fix multimodal padding token for gemma3n/gemma4
* nits

common : fix ambiguous grammar rule in gemma4 (ggml-org#21661)

ddf03c6

* common : fix ambiguous grammar rule in gemma4
* cont : fix missing comma...

webui: add "Send message on Enter" setting (ggml-org#21577)

4ef9301

* webui: make Enter to send chat a setting
* Shorten description
* Use isMobile hook from $lib/hooks
* Rebuild static output

ggml : check return value of CUB calls used in argsort and top-k (they all return cudaError_t) (ggml-org#21676)

009a113

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

CUDA: fuse muls (ggml-org#21665)

e34f042

common : add fluidity to the progress bar (ggml-org#21671)

e095a48

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

vulkan: Support Q1_0 (ggml-org#21539)

7b69125

* vulkan: Support Q1_0
* use get_dm

docs : fix broken link to ggml-openvino in OPENVINO.md (ggml-org#21709)

3f8752b

webui: Static build output improvements (ggml-org#21667)

f989a6e

* refactor: Build improvements
* chore: Formatting + package lock update

common: mark --split-mode tensor as experimental (ggml-org#21684)

0893f50

common : fix when loading a cached HF models with unavailable API (ggml-org#21670)

fb38d6f

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

ggml-webgpu: support non-square subgroup matrix configs for Intel GPUs (ggml-org#21669)

bfd1f45

model : make Gemma 4 shared-KV tail attn_k tensors optional on load (ggml-org#21739)

e62fa13

github-actions Bot added ggml examples server Apple Metal Vulkan testing devops python script model OpenCL SYCL build nix jinja parser Ascend NPU Hexagon WebGPU IBM zDNN AMD ZenDNN server/webui OpenVINO android labels Apr 20, 2026

TheTom force-pushed the fix/hip-fa-pool-retention branch from f4a6fdb to 30c3c23 Compare April 20, 2026 01:53

TheTom and others added 4 commits April 20, 2026 08:45

fix: gate turbo V unpad on V type, not K type

6112eb4

fix: gate turbo V unpad on V type + disable TurboFlash on Apple10 (#91)

d3271ac

fix: gate turbo V unpad on V type, not K type

TheTom force-pushed the fix/hip-fa-pool-retention branch from 30c3c23 to 0b05974 Compare April 20, 2026 14:15

TheTom merged commit 57f6b93 into master Apr 20, 2026
2 checks passed

TheTom merged 312 commits into master from fix/hip-fa-pool-retention

20 participants
