Performance of llama.cpp on AMD ROCm (HIP) #15021
Replies: 62 comments 114 replies
-
RX 7800 XT (Sapphire Pulse 280W)
ggml_cuda_init: found 1 ROCm devices:
build: 00131d6 (6031)
ggml_vulkan: Found 1 Vulkan devices:
build: baad948 (6056)
Notes:
-
Happy to replicate:
ggml_cuda_init: found 1 ROCm devices:
build: 9c35706 (6060)
On Linux
-
RX 7600 XT
ggml_cuda_init: found 1 ROCm devices:
build: 9c35706 (6060)
Running on Linux 6.12.32, mainline amdgpu, ROCm 6.4.1.
ggml_vulkan: Found 1 Vulkan devices:
build: 9c35706 (6060)
-
AMD MI60. Happy to contribute.
I will post FA=1 and vulkan results once I have time during the weekend.
-
MI100
Using ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
build: 9c35706 (6060)
I'm running Ubuntu 24.04.2 and ROCm 6.4.1
-
AMD Instinct MI300X
root@0-4-9-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: 2bf3fbf (6069)
Ref: #14640
-
Pro V620
Why does FA slow down the V620 so much? Been a question I've been trying to answer for a while now.
build: 03d4698 (6074)
Linux, ROCm 6.4.1 (will try upgrading soon)
-
Powercolor Hellhound RX 7900 XTX (400W power limit)
Opensuse tumbleweed system with rocm packages from
build: 5c0eb5e (6075)
Sapphire Nitro 7900 XTX (400W power limit)
In a different PC unfortunately because these GPUs are too chonky to fit in a regular case
build: 9c35706 (6060)
-
Powercolor Red Devil 7900XTX
Adrenalin 25.8.1 just came out, so time to test again
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
build: 2572689 (6099)
Still lower than the historical highs on May 26th (3599 and 3743), and a loss and a win against July 22nd (3529 and 3598).
-
RX 7900 XTX (ASUS TUF)
build: 6c7e9a5 (6118)
-
RX 6800 (16GB 203W)
ROCm 6.3.4 on Ubuntu 24.04 in a Docker container
build: 79c1160 (6123)
Bonus benchmarks
I ran these to compare ROCm versions on various models. Obviously the results are specific to my RX 6800 and shouldn't be used to make any judgments about ROCm performance in general, especially on RDNA3 and later GPUs. I use 6.3.4 because I don't care about Llama 3 8B. Note how fast the new MoE models are: gpt-oss-20B even at Q6_K_XL is faster than this 7B Q4_0 model. (Do make sure that you have a fixed version because the original gpt-oss releases had some issues; I used https://huggingface.co/unsloth/gpt-oss-20b-GGUF.)
ROCm 6.3.4
ROCm 6.4.3
-
RX 7900 XTX (ASUS TUF, slightly overclocked by 100 MHz on core and VRAM)
./build/bin/llama-bench -m /home/vk/Downloads/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: 648ebcd (6146)
-
RX 6900 XT AMD Reference Card (Stock clocks)
Debian Testing
llama.cpp version: gguf-v0.17.1-386-gfd1234cb
-
GigaByte R9700
-
Radeon RX 9070 (non-XT)
build: 65349f2 (6183)
I tried to enable the use of rocwmma with
Still surprising that these numbers are better than the 9070 XT.
-
Phew went through some issues here. ASRock Radeon AI PRO R9700.
Rocm 7.2
Stay away from rocm7.2:
Rocm 7.1.1
-
Running on Manjaro Linux, using the ROCm docker image:
build: 1a29907 (8202)
-
AMD Radeon RX 7900 XTX (Sapphire Pulse 7900 XTX)
System details
ROCm results
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
build: 8ff0207 (8400)
Vulkan results
ggml_vulkan: Found 1 Vulkan devices:
build: bcdc4eb (8400)
-
No ROCMinfo on Windows, but: AMD Radeon RX 7900 XTX reference card, watercooled:
ROCM 7.2:
build: 1e64534 (8429)
Vulkan:
build: c9ced49 (7549)
Newer Vulkan version gives me the same speed as ROCM
-
Vulkan Instance Version: 1.4.341
Linux fedora 6.19.6-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC
-
./llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -fa 1,0
build: 406f4e3 (8514)
The results above are from the pre-built binaries. Building with AUR's llama.cpp-hip script gives higher tg128 but slower pp512:
llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -fa 1,0
build: unknown (8514)
-
Adrenalin 26.3.1 WSL Debian 13.3 ROCDXG
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
build: d43375f (8611)
-
HP g1a laptop with APU capped at 70W. AMD RYZEN AI MAX+ PRO 395 with 128GB ram using TheRock 7.12
$ llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024
$ llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,1024,2048,4096 -n 128,256
$ llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 8192 -n 1024
-
llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -fa 1,0
build: e365e65 (8851)
-
ROCm 7.2.3
/workspace/llama.cpp/build/bin/llama-bench -m /data2/llm/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512 -n 128 --device ROCm0
build: 046e284 (9085)
2 card on PCI 4
build: 046e284 (9085)
-
rocm Version : 7.13.0a20260508-1
-
Radeon 890M
Running the more detailed test of:
Getting results of:
build: 338ea1e (9127)
-
A data point from the image-diffusion side, since the rocWMMA flash-attn numbers here are (understandably) all LLM token-generation. Diffusion attention is a different shape: non-GQA (
Setup: AMD Radeon AI PRO R9700 (Navi 48 / RDNA4 / gfx1201, 32 GB),
So ~6% on the attention kernel, ~2.5% on total wall, output verified correct at the same seed.
Two things worth noting for other AMD/diffusion users:
The small gain is consistent with this workload being memory-bandwidth-bound at 1024² (O(n²) over 4096 tokens), not matmul-FLOP-bound; matrix cores help, just not a lot here. The same card sees a bigger relative benefit at smaller, more compute-bound configs.
Happy to run more diffusion points (resolutions/steps/other RDNA cards) if it's useful to the rocWMMA tuning work in #16827. It's a fixed-context, non-GQA stress case the LLM benchmarks don't cover.
-
Quick follow-up to my earlier diffusion numbers: the rocWMMA flash-attn win on RDNA4 turns out to be strongly config-dependent, which lines up nicely with the LLM results others have posted in this thread. Same setup as before (R9700 / gfx1201 / ROCm 7.0.2 / rocWMMA 2.0, FLUX.1 Krea Q4,
So at small / compute-bound configs the matrix-core path is a big win (~2.4×), but as resolution grows the attention becomes memory-bandwidth-bound (O(n²) K/V streaming) and the gain shrinks to a few percent. Same shape as the LLM side here: large on long-context prefill, modest on short prompts / decode.
Net for image-diffusion users on RDNA4: definitely worth building with rocWMMA fattn on (ROCm 7 / rocWMMA 2.0); just be aware the upside is very workload-dependent. For reference, with FA on, the R9700 lands right next to an A6000 (CUDA) at the same diffusion recipe.
-
OS: CachyOS
build: ba5b911 (9518)
Just for giggles, I also ran Vulkan; it is quite a bit faster.


-
This is similar to the threads "Performance of llama.cpp on Apple Silicon M-series", "Performance of llama.cpp on Nvidia CUDA" and "Performance of llama.cpp with Vulkan", but for ROCm! I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other threads to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
Instructions
Either run the commands below or download one of our ROCm (HIP) releases. If you have multiple GPUs please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM.
Share your llama-bench results along with the git hash and ROCm info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
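The original command listing did not survive in this copy of the post, so the following is only a minimal sketch of a single-GPU run, pieced together from the flags commenters used in this thread. The paths in BENCH and MODEL and the helper name run_bench are assumptions for illustration, not part of the original instructions.

```shell
#!/bin/sh
# Sketch only: BENCH and MODEL are assumed paths, adjust them for your setup.
BENCH="${BENCH:-./build/bin/llama-bench}"
MODEL="${MODEL:-llama-2-7b.Q4_0.gguf}"

# run_bench GPU_INDEX (hypothetical helper)
#   -ngl 99    offload all layers to the GPU
#   -fa 0,1    run once without and once with flash attention
#   -sm none   do not split the model across GPUs
#   -mg N      use GPU N as the main (here: only) GPU
run_bench() {
    gpu="${1:-0}"
    set -- "$BENCH" -m "$MODEL" -ngl 99 -fa 0,1 -sm none -mg "$gpu"
    if [ -x "$BENCH" ] && [ -f "$MODEL" ]; then
        "$@"                      # binary and model found: run the benchmark
    else
        echo "would run: $*"      # otherwise just show the command
    fi
}

run_bench 0
```

On a single-GPU machine the -sm none -mg 0 part can be dropped, as several commenters did.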
If multiple entries are posted for the same device I'll prioritize newer commits with substantial ROCm updates, otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary depending on driver, operating system, board manufacturer, etc. even if the chip is the same. For integrated graphics note that your memory speed and number of channels will greatly affect your inference speed!
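To see why memory speed and channel count matter so much on integrated graphics, here is a rough back-of-the-envelope sketch. It assumes token generation has to read roughly the whole model from memory once per token, so tokens per second cannot exceed memory bandwidth divided by model size. The 3.56 GiB model size, the 64-bit (8-byte) channel width, and the dual-channel DDR5-5600 example are illustrative assumptions, not measurements from this thread.

```shell
#!/bin/sh
# tg_upper_bound CHANNELS MT_PER_S MODEL_GIB (hypothetical helper)
# Prints an approximate upper bound on token generation speed in t/s,
# assuming 64-bit channels and one full read of the model per token.
tg_upper_bound() {
    awk -v ch="$1" -v mts="$2" -v gib="$3" 'BEGIN {
        bw   = ch * 8 * mts * 1e6          # peak bandwidth in bytes per second
        size = gib * 1024 * 1024 * 1024    # model size in bytes
        printf "%.1f\n", bw / size
    }'
}

# Example: dual-channel DDR5-5600 with a ~3.56 GiB Q4_0 model.
tg_upper_bound 2 5600 3.56    # prints 23.4
```

Real results land below this bound, but doubling the channel count or memory speed roughly doubles the ceiling, which is why two systems with the same APU can post very different tg128 numbers.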
ROCm Scoreboard for Llama 2 7B, Q4_0 (no FA)
ROCm Scoreboard for Llama 2 7B, Q4_0 (with FA)
More detailed test
The main idea of this test is to show a decrease in performance with increasing size.
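The exact command for the detailed test is not preserved in this copy. As a hedged sketch, assuming "size" refers to prompt length, one way to see the decrease is to sweep -p over several prompt sizes, as one commenter in this thread did, and then compare throughput between the smallest and largest size. BENCH, MODEL, SIZES and the helpers sweep_bench and pct_drop are assumptions for illustration, and the numbers passed to pct_drop are made up.

```shell
#!/bin/sh
# Sketch only: paths and sizes are assumptions, adjust them for your setup.
BENCH="${BENCH:-./build/bin/llama-bench}"
MODEL="${MODEL:-llama-2-7b.Q4_0.gguf}"
SIZES="${SIZES:-512,1024,2048,4096,8192}"

# Run (or print) a prompt-size sweep, without and with flash attention.
sweep_bench() {
    set -- "$BENCH" -m "$MODEL" -ngl 99 -fa 0,1 -p "$SIZES" -n 128
    if [ -x "$BENCH" ] && [ -f "$MODEL" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# pct_drop BASE_TS CUR_TS: percentage drop from a baseline t/s to a later t/s.
pct_drop() {
    awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (base - cur) / base * 100 }'
}

sweep_bench
pct_drop 1000 750    # hypothetical pp512 vs pp8192 values; prints 25.0
```

Reporting the drop as a percentage makes cards with very different absolute speeds easier to compare.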