Performance of llama.cpp on AMD ROCm (HIP) #15021

olegshulyakov · 2025-08-01T20:10:19Z

This is similar to the "Performance of llama.cpp on Apple Silicon M-series", "Performance of llama.cpp on Nvidia CUDA" and "Performance of llama.cpp with Vulkan" threads, but for ROCm! I think it's good to consolidate and discuss our results here.

We'll be testing the Llama 2 7B model, like the other threads, to keep things consistent, and we'll use Q4_0 because it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.

Instructions

Either run the commands below or download one of our ROCm (HIP) releases. If you have multiple GPUs, please run the test on a single GPU using -sm none -mg YOUR_GPU_NUMBER unless the model is too big to fit in VRAM.

wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
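
On a multi-GPU machine, the single-GPU invocation described above can be scripted. The following is a minimal Python sketch and not part of llama.cpp: `bench_command` is a hypothetical helper, the GPU count of 2 is a made-up example, and it only uses flags that already appear in this post.

```python
import shlex

def bench_command(model_path, gpu_index=None, binary="llama-bench"):
    """Build the llama-bench argument list used in this thread.

    gpu_index=None keeps the default device selection; an integer pins the
    run to one GPU via `-sm none -mg <index>`.
    """
    args = [binary, "-m", model_path, "-ngl", "99", "-fa", "0,1"]
    if gpu_index is not None:
        args += ["-sm", "none", "-mg", str(gpu_index)]
    return args

# Print one command line per GPU (2 GPUs is a hypothetical example).
for gpu in range(2):
    print(shlex.join(bench_command("llama-2-7b.Q4_0.gguf", gpu)))
```

The list form can be passed directly to `subprocess.run` if you prefer to launch the runs from Python instead of copying the printed commands.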

Share your llama-bench results along with the git hash and ROCm info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.

If multiple entries are posted for the same device, I'll prioritize newer commits with substantial ROCm updates; otherwise I'll pick the one with the highest overall score at my discretion. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same. For integrated graphics, note that your memory speed and number of channels will greatly affect your inference speed!
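
To illustrate why memory speed and channel count matter so much for integrated graphics, here is a rough Python sketch. It assumes token generation is bound by memory bandwidth and that every generated token reads all model weights once; both are simplifications. The DDR5-5600 configurations and the 64 bits per channel are hypothetical examples, not measurements from this thread.

```python
MODEL_BYTES = 3.56 * 2**30  # "3.56 GiB", the model size llama-bench reports in this thread

def peak_bandwidth_gbs(channels, bits_per_channel, mega_transfers_per_s):
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = channels * bits_per_channel / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9

def tg_upper_bound(bandwidth_gbs, model_bytes=MODEL_BYTES):
    """Rough ceiling on tokens/s if each token reads all weights exactly once."""
    return bandwidth_gbs * 1e9 / model_bytes

# Hypothetical configurations, chosen only to show the effect of channel count.
for label, channels, mts in [("single-channel DDR5-5600", 1, 5600),
                             ("dual-channel DDR5-5600", 2, 5600)]:
    bw = peak_bandwidth_gbs(channels, 64, mts)
    print(f"{label}: {bw:.1f} GB/s, tg ceiling ~{tg_upper_bound(bw):.1f} t/s")
```

Real results land below this ceiling because of KV cache reads, kernel overhead and imperfect bandwidth utilization, but doubling the channel count doubles the ceiling.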

ROCm Scoreboard for Llama 2 7B, Q4_0 (no FA)

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
Instinct MI300X	192 GB / HBM3 / 8192 bit	11476.40 ± 72.79	232.92 ± 0.53	`ee3a9fc`	@yeahdongcn
RX 7900 XTX	24 GB / GDDR6 / 384 bit	3552.27 ± 101.96	167.11 ± 0.50	`2f0c2db`	@Diablo-D3
Instinct MI210	64 GB / HBM2e / 4096 bit	2486.22 ± 9.58	124.51 ± 0.04	`8160b38`	@65a
Pro W7900	48 GB / GDDR6 / 384 bit	3213.17 ± 80.47	121.18 ± 0.06	`8160b38`	@65a
RX 7900 XT	20 GB / GDDR6 / 320 bit	3098.38 ± 24.02	116.15 ± 0.06	`1e15bfd`	@AdamNiederer
RX 9070	16 GB / GDDR6 / 256 bit	2381.77 ± 3.68	114.48 ± 0.60	`d0660f2`	@andj1210
Instinct MI100	32 GB / HBM2 / 4096 bit	2732.83 ± 1.98	110.48 ± 0.14	`9c35706`	@firefox42
RX 9070 XT	16 GB / GDDR6 / 256 bit	5055.19 ± 109.58	101.27 ± 0.27	`583cb83`	@Hadrianneue
RX 7800 XT	16 GB / GDDR6 / 256 bit	2151.81 ± 17.94	100.94 ± 0.10	`00131d6`	@olegshulyakov
Instinct MI50	32 GB / HBM2 / 4096 bit	1057.24 ± 0.53	98.95 ± 0.25	`97d5117`	@wtarreau
RX 7900 GRE	16 GB / GDDR6 / 256 bit	1456.98 ± 12.39	96.07 ± 0.10	`6fa3b55`	@MihaiBojescu
AI PRO R9700	32 GB / GDDR6 / 256 bit	4443.54 ± 339.25	93.84 ± 0.26	`bd4ef13`	@gogich77
Instinct MI60	32 GB / HBM2 / 4096 bit	1289.11 ± 0.62	91.46 ± 0.13	`504af20`	@Said-Akbar
RX 6900 XT	16 GB / GDDR6 / 256 bit	1889.84 ± 31.21	88.49 ± 0.00	`a972fae`	@notgood
Pro VII	16 GB / HBM2 / 4096 bit	1064.99 ± 1.18	87.45 ± 0.04	`2739a71`	@8XXD8
RX 6800	16 GB / GDDR6 / 256 bit	1447.07 ± 1.36	83.92 ± 0.03	`79c1160`	@MrLavender
Pro V620	32 GB / GDDR6 / 256 bit	1803.65 ± 2.54	74.66 ± 0.01	`5c0eb5e`	@samteezy
RX 9060 XT	16 GB / GDDR6 / 128 bit	1419.67 ± 3.64	67.58 ± 0.24	`a0e13dc`	@lcy0321
RX 5700 XT	8 GB / GDDR6 / 256 bit	354.17 ± 0.18	67.55 ± 0.04	`c05e8c9`	@daniandtheweb
Instinct MI25	16 GB / HBM2 / 2048 bit	409.83 ± 0.23	63.94 ± 0.06	`2739a71`	@8XXD8
AI Max+ 395	128 GB / LPDDR5	911.36 ± 1.79	50.01 ± 0.07	`e60f241`	@firefox42
RX 7600 XT	16 GB / GDDR6 / 128 bit	1099.64 ± 2.05	48.58 ± 0.06	`9c35706`	@wbruna
RX Vega 64	8 GB / HBM2 / 2048 bit	240.68 ± 0.09	48.46 ± 0.09	`ec428b0`	@davispuh
Radeon 8060S	System Shared / DDR5	351.36 ± 0.67	47.97 ± 0.33	`1d0125b`	@hspak
Radeon 880M	System Shared / DDR5	163.25 ± 13.86	12.97 ± 1.63	`c55d53a`	@Hedede

ROCm Scoreboard for Llama 2 7B, Q4_0 (with FA)

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
Instinct MI300X	192 GB / HBM3 / 8192 bit	11945.97 ± 54.29	218.53 ± 0.09	`ee3a9fc`	@yeahdongcn
RX 7900 XTX	24 GB / GDDR6 / 384 bit	3874.25 ± 11.92	170.12 ± 0.56	`2f0c2db`	@Diablo-D3
Instinct MI210	64 GB / HBM2e / 4096 bit	2571.82 ± 2.89	130.18 ± 0.06	`8160b38`	@65a
Pro W7900	48 GB / GDDR6 / 384 bit	3472.86 ± 52.86	127.43 ± 0.12	`8160b38`	@65a
RX 9070	16 GB / GDDR6 / 256 bit	2452.68 ± 1.33	115.32 ± 0.52	`d0660f2`	@andj1210
RX 7900 XT	20 GB / GDDR6 / 320 bit	3261.75 ± 9.09	112.30 ± 0.06	`1e15bfd`	@AdamNiederer
Instinct MI50	32 GB / HBM2 / 4096 bit	1129.43 ± 0.15	105.82 ± 0.07	`97d5117`	@wtarreau
Instinct MI100	32 GB / HBM2 / 4096 bit	2755.00 ± 3.68	104.71 ± 0.10	`9c35706`	@firefox42
AI PRO R9700	32 GB / GDDR6 / 256 bit	4773.07 ± 49.30	97.98 ± 0.13	`bd4ef13`	@gogich77
RX 7900 GRE	16 GB / GDDR6 / 256 bit	1598.79 ± 11.48	97.53 ± 0.06	`6fa3b55`	@MihaiBojescu
RX 9070 XT	16 GB / GDDR6 / 256 bit	4903.51 ± 96.36	97.28 ± 0.13	`583cb83`	@Hadrianneue
RX 7800 XT	16 GB / GDDR6 / 256 bit	2304.63 ± 2.85	95.99 ± 0.21	`00131d6`	@olegshulyakov
RX 6900 XT	16 GB / GDDR6 / 256 bit	1948.31 ± 13.51	85.04 ± 0.02	`a972fae`	@notgood
Pro V620	32 GB / GDDR6 / 256 bit	1256.86 ± 0.55	70.83 ± 0.02	`5c0eb5e`	@samteezy
RX 9060 XT	16 GB / GDDR6 / 128 bit	1479.27 ± 0.71	65.42 ± 0.19	`a0e13dc`	@lcy0321
RX 5700 XT	8 GB / GDDR6 / 256 bit	314.17 ± 0.29	62.02 ± 0.05	`c05e8c9`	@daniandtheweb
AI Max+ 395	128 GB / LPDDR5	1003.53 ± 2.91	49.87 ± 0.02	`e60f241`	@firefox42
Radeon 8060S	System Shared / DDR5	366.08 ± 1.44	48.97 ± 0.15	`1d0125b`	@hspak
RX 7600 XT	16 GB / GDDR6 / 128 bit	1199.16 ± 1.07	47.65 ± 0.06	`9c35706`	@wbruna
RX Vega 64	8 GB / HBM2 / 2048 bit	153.17 ± 0.72	42.46 ± 0.40	`ec428b0`	@davispuh
Radeon 880M	System Shared / DDR5	213.31 ± 14.05	16.16 ± 1.41	`c55d53a`	@Hedede
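
The two scoreboards can be compared row by row to see where flash attention helps and where it hurts. The Python sketch below does this for prompt processing on four chips; the numbers are copied from the tables above, while the choice of chips and the `percent_change` helper are only for illustration.

```python
# (chip, pp512 t/s without FA, pp512 t/s with FA), copied from the two scoreboards.
ROWS = [
    ("RX 7900 XTX", 3552.27, 3874.25),
    ("Pro V620", 1803.65, 1256.86),
    ("RX Vega 64", 240.68, 153.17),
    ("Radeon 880M", 163.25, 213.31),
]

def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

for chip, no_fa, fa in ROWS:
    print(f"{chip:12s} pp512 change with FA: {percent_change(no_fa, fa):+.1f}%")
```

The same helper works for the tg128 column if you swap in those values.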

More detailed test

The main idea of this test is to show how performance decreases as the prompt and generation lengths increase.

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048

olegshulyakov · 2025-08-01T20:20:29Z

Author

RX 7800 XT (Sapphire Pulse 280W)

ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7800 XT, gfx1101 (0x1101), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	0	pp512	2151.81 ± 17.94
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	0	tg128	100.94 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	1	pp512	2304.63 ± 2.85
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	1	tg128	95.99 ± 0.21

build: 00131d6 (6031)

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	0	pp512	2145.60 ± 23.14
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	0	tg128	96.89 ± 0.22
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	1	pp512	2063.66 ± 2.92
llama 7B Q4_0	3.56 GiB	6.74 B	RPC,Vulkan	99	1	tg128	96.03 ± 0.09

build: baad948 (6056)

Notes:

Sapphire RX 7800 XT Pulse (Power Limit +15% - 280W)
Windows 10.
Drivers - Radeon Pro.

1 reply

olegshulyakov Aug 8, 2025
Author

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7800 XT, gfx1101 (0x1101), VMM: no, Wave Size: 32
load_backend: loaded ROCm backend from C:\Users\oleg.llama.cpp\llama-b6104-bin-win-hip-radeon-x64\ggml-hip.dll
load_backend: loaded RPC backend from C:\Users\oleg.llama.cpp\llama-b6104-bin-win-hip-radeon-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\oleg.llama.cpp\llama-b6104-bin-win-hip-radeon-x64\ggml-cpu-icelake.dll

model	size	params	backend	ngl	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	pp512	2109.38 ± 15.79
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	pp1024	1749.56 ± 12.69
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	pp2048	1165.15 ± 1.02
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	pp4096	997.83 ± 0.53
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	pp8192	789.89 ± 0.46
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	pp16384	196.02 ± 0.96
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	tg128	99.55 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	tg256	98.16 ± 0.15
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	tg512	90.29 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	tg1024	80.56 ± 0.11
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	0	tg2048	62.58 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	pp512	2296.72 ± 3.40
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	pp1024	2225.68 ± 2.84
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	pp2048	2069.86 ± 2.06
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	pp4096	1814.41 ± 2.23
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	pp8192	1423.62 ± 0.94
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	pp16384	992.13 ± 0.81
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	tg128	96.80 ± 1.30
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	tg256	95.98 ± 0.60
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	tg512	95.92 ± 0.27
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	tg1024	91.30 ± 0.74
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	none	1	tg2048	85.44 ± 0.29

build: e725a1a (6104)
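
One way to read the table above is to express each prompt length as a fraction of the pp512 throughput. The Python sketch below does that for the RX 7800 XT numbers in this reply; the `retention` helper and the choice of pp512 as the baseline are illustrative assumptions, not something llama-bench reports.

```python
# Prompt-processing throughput (t/s) from the RX 7800 XT run above (build e725a1a),
# keyed by the fa setting and then by prompt length.
PP = {
    0: {512: 2109.38, 1024: 1749.56, 2048: 1165.15, 4096: 997.83, 8192: 789.89, 16384: 196.02},
    1: {512: 2296.72, 1024: 2225.68, 2048: 2069.86, 4096: 1814.41, 8192: 1423.62, 16384: 992.13},
}

def retention(series, baseline=512):
    """Fraction of the baseline throughput that is kept at each prompt length."""
    base = series[baseline]
    return {n: ts / base for n, ts in series.items()}

for fa, series in PP.items():
    kept = retention(series)
    summary = ", ".join(f"pp{n}: {frac:.0%}" for n, frac in kept.items())
    print(f"fa={fa}: {summary}")
```

On this card the run with flash attention keeps a much larger share of its pp512 speed at 16K tokens than the run without it.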

AdamNiederer · 2025-08-01T20:24:57Z

Happy to replicate:

ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	2967.12 ± 31.25
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	116.00 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	3163.24 ± 4.06
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	112.75 ± 0.04

build: 9c35706 (6060)

On Linux

wbruna · 2025-08-01T21:53:04Z

RX 7600 XT

ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7600 XT, gfx1102 (0x1102), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	1099.64 ± 2.05
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	48.58 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	1199.16 ± 1.07
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	47.65 ± 0.06

build: 9c35706 (6060)

Running on Linux 6.12.32, mainline amdgpu, ROCm 6.4.1.

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	606.24 ± 0.31
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	52.84 ± 0.44
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	612.33 ± 0.53
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	53.70 ± 0.33

build: 9c35706 (6060)

1 reply

wbruna Aug 5, 2025

@olegshulyakov , the 7600 XT actually has a 128 bit memory bus.

Said-Akbar · 2025-08-02T02:11:59Z

AMD MI60.

Happy to contribute.
I am on Ubuntu 24.04 and ROCm 6.3.4. The GPU is connected at PCIe 4.0 x8 speed. AMD 5950X CPU with 96 GB RAM at 3200 MHz. Flash attention is disabled (FA=0).

model	size	params	backend	ngl	sm	test	t/s	build
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	none	pp512	1289.11 ± 0.62	`504af20` (4476)
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	none	tg128	91.46 ± 0.13	`504af20` (4476)

I will post FA=1 and vulkan results once I have time during the weekend.

firefox42 · 2025-08-02T07:40:51Z

MI100

Using ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64
Device 1: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64

model	size	params	backend	ngl	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	0	pp512	2732.83 ± 1.98
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	0	tg128	110.48 ± 0.14
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	1	pp512	2755.00 ± 3.68
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	1	tg128	104.71 ± 0.10

build: 9c35706 (6060)

I'm running Ubuntu 24.04.2 and ROCm 6.4.1

2 replies

olegshulyakov Aug 2, 2025
Author

I expected it to be faster than RX 7800 XT because of HBM2... Have you tried to launch with a single device only?

IMbackK Aug 18, 2025
Collaborator

Bandwidth utilization is still fairly low on the GCN/CDNA parts (GCN and CDNA behave the same for tg).
GCN/CDNA is quite difficult to get decent utilization on, as these parts are very register-starved and have very small caches.
The MI100 also doesn't really have 1.2 TB/s of bandwidth; it is limited to a sustained 1024 GB/s by its fabric bandwidth.
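
To put a number on the utilization point, the Python sketch below estimates the effective bandwidth implied by the MI100 tg128 result posted above. It assumes each generated token reads the full 3.56 GiB of weights once and ignores KV cache and activation traffic, so treat the result as a rough estimate rather than a measurement.

```python
MODEL_GIB = 3.56  # model size reported by llama-bench in this thread

def effective_bandwidth_gbs(tokens_per_s, model_gib=MODEL_GIB):
    """GB/s of weight reads implied by a tg rate, assuming one full pass per token."""
    return tokens_per_s * model_gib * 2**30 / 1e9

def utilization(tokens_per_s, peak_gbs):
    """Share of the given peak bandwidth that the implied weight reads account for."""
    return effective_bandwidth_gbs(tokens_per_s) / peak_gbs

# MI100: tg128 = 110.48 t/s (posted above), 1024 GB/s sustained (figure quoted above).
bw = effective_bandwidth_gbs(110.48)
print(f"MI100: ~{bw:.0f} GB/s effective, {utilization(110.48, 1024):.0%} of 1024 GB/s")
```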

yeahdongcn · 2025-08-02T14:10:55Z

Collaborator

AMD Instinct MI300X

root@0-4-9-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Instinct MI300X VF, gfx942:sramecc+:xnack- (0x942), VMM: no, Wave Size: 64

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	11476.40 ± 72.79
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	218.87 ± 0.61
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	4037.07 ± 8.61
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	158.12 ± 0.21

build: 2bf3fbf (6069)

Ref: #14640

7 replies

rohan-sircar Aug 5, 2025

I'm just referring to the rocWMMA flag from the build instructions: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#hip

To enhance flash attention performance on RDNA3+ or CDNA architectures, you can utilize the rocWMMA library by enabling the -DGGML_HIP_ROCWMMA_FATTN=ON option. This requires rocWMMA headers to be installed on the build system.

It should work for CDNA too but we have only tested with our RDNA3 cards (7900 XTX) and saw huge performance jumps in PP with FA on: #10879 (reply in thread)

Please try it out because 1/3rd the performance in PP with FA on is just... strange at best

yeahdongcn Aug 5, 2025
Collaborator

With -DGGML_HIP_ROCWMMA_FATTN=ON:

root@0-4-9-gpu-mi300x1-192gb-devcloud-atl1:~/llama.cpp# ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Instinct MI300X VF, gfx942:sramecc+:xnack- (0x942), VMM: no, Wave Size: 64

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	11021.13 ± 210.87
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	232.92 ± 0.53
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	11945.97 ± 54.29
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	218.53 ± 0.09

build: ee3a9fc (6090)

olegshulyakov Aug 5, 2025
Author

Performance is in the middle between RTX 4090 and RTX 5090.

rohan-sircar Aug 5, 2025

So rocWMMA does work for CDNA in FA :)

IMbackK Aug 7, 2025
Collaborator

not very well

samteezy · 2025-08-02T19:58:08Z

Pro V620

Why does FA slow down the V620 so much? Been a question I've been trying to answer for a while now.

root@llama:/mnt/models# /root/llama-builds/llama.cpp/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -mg 0 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon PRO V620, gfx1030 (0x1030), VMM: no, Wave Size: 32
  Device 1: AMD Radeon (TM) Pro WX 3200 Series, gfx803 (0x803), VMM: no, Wave Size: 64
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon (TM) Pro WX 3200 Series (RADV POLARIS12) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon PRO V620 (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

model	size	params	backend	threads	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan,BLAS	10	none	0	pp512	1801.16 ± 3.33
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan,BLAS	10	none	0	tg128	74.48 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan,BLAS	10	none	1	pp512	1258.12 ± 0.69
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan,BLAS	10	none	1	tg128	70.74 ± 0.02

build: 03d4698 (6074)

Linux, ROCm 6.4.1 (will try upgrading soon)

2 replies

olegshulyakov Aug 2, 2025
Author

@samteezy Can you please run it on each device separately (PRO V620 and Pro WX 3200), with the ROCm backend only?

samteezy Aug 2, 2025

@olegshulyakov The numbers come out the same. Forcing -sm none -mg 0 ensures only the V620 is running. I don't benchmark the WX 3200.

root@llama:~# /root/llama-builds/llama.cpp/bin/llama-bench -m /mnt/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -mg 0 -sm none
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon PRO V620, gfx1030 (0x1030), VMM: no, Wave Size: 32
Device 1: AMD Radeon (TM) Pro WX 3200 Series, gfx803 (0x803), VMM: no, Wave Size: 64

model	size	params	backend	ngl	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	0	pp512	1803.65 ± 2.54
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	0	tg128	74.66 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	1	pp512	1256.86 ± 0.55
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	1	tg128	70.83 ± 0.02

build: 5c0eb5e (6075)

rohan-sircar · 2025-08-05T03:56:34Z

Powercolor Hellhound RX 7900 XTX (400W power limit)

openSUSE Tumbleweed system with ROCm packages installed from the AMD ROCm repository

Information for package rocm-hip:
---------------------------------
Repository     : AMD ROCm (openSUSE_Factory)
Name           : rocm-hip
Version        : 6.4.1-6.5
Arch           : x86_64
Vendor         : obs://build.opensuse.org/science
Installed Size : 25.5 MiB
Installed      : Yes
Status         : up-to-date
Source package : rocclr-6.4.1-6.5.src
Upstream URL   : https://github.com/ROCm/clr
Summary        : ROCm HIP platform and device tool
Description    : 
    HIP is a C++ Runtime API and Kernel Language that allows developers to create
    portable applications for AMD and NVIDIA GPUs from the same source code.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	3243.15 ± 10.32
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	125.84 ± 0.11
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	3557.68 ± 13.45
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	122.71 ± 0.11

build: 5c0eb5e (6075)

Sapphire Nitro 7900 XTX (400W power limit)

In a different PC, unfortunately, because these GPUs are too chonky to fit in a regular case.
So no TP for now, but it serves my use case of running an LLM on one card and STT/TTS on the other to get a fully local voice-to-voice chatbot (just tried with Amica and it works great! Very entertaining!)

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	3369.65 ± 10.61
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	122.06 ± 0.14
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	3573.30 ± 14.31
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	118.71 ± 0.14

build: 9c35706 (6060)

12 replies

davispuh Sep 17, 2025

So what are the recommended settings, or what should I do, to get the best performance on a 7900 XTX? I have a SAPPHIRE NITRO+ AMD Radeon RX 7900 XTX Vapor-X 24GB and, without changing anything, I get WAY worse results.
Using Arch Linux with everything updated to the latest versions (ROCm 6.4.3) and a freshly compiled llama.cpp.

$ llama-bench -ngl 99 -fa 0,1 -m ~/.cache/llama.cpp/TheBloke_Llama-2-7B-GGUF_llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	0	pp512	2817.81 ± 26.69
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	0	tg128	112.18 ± 0.45
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	1	pp512	3053.77 ± 17.38
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	99	1	tg128	110.70 ± 0.08

build: b9be58d (1005)

And this is what LACT showed while running it

IMbackK Sep 17, 2025
Collaborator

At these high speeds with fast GPUs, the CPU becomes important for the results; your results and his would be within the expected variance for differing CPU performance.

Not that it really matters much, as the CPU becomes almost irrelevant when a model large enough to fill the device is used.

Benchmarking Llama 7B Q4_0 is not really that great, as it doesn't reflect actual usage much. This hurts the most on CDNA devices, which scale better than you would expect performance-wise when increasing the number of parameters.

fassn Sep 18, 2025

Go for ROCm 7.0, it has officially been released now. And compile llama.cpp with ROCWMMA enabled, see #15021 (comment) . You should get much better results.

rohan-sircar Sep 18, 2025

ROCm 7 is released? That's great news! I'll try it out as soon as it lands in my distro's package manager.

That said, I don't think it'll give any performance improvements for the 7900 XTX. The reason the 9070 XT gets a boost with ROCm 7 is that WMMA was not implemented for RDNA4 before ROCm 7. But we'll see.

BTW FYI in my benchmarks, the powercolor 7900XTX was paired with a ryzen 2700 with 64GB RAM, and the sapphire 7900XTX was paired with a 5700X3d also 64GB RAM. I did not see an appreciable performance difference between the two setups in LLM inference.

davispuh Sep 23, 2025

I see. To me, 3573 vs 3053 seems like a big difference. I'm running this on an AMD Threadripper 1920X (12 cores) with PCIe 3.0 x16. What benchmark could I use to compare real-world inference performance more accurately?
ROCm 7 isn't in the Arch repos yet, so I'll have to wait a bit. It also wouldn't be an accurate comparison with this older result.

Now that I've booted with the amdgpu.ppfeaturemask=0xffffffff kernel parameter, I can increase the TDP up to 402W, but I don't see any performance impact at all: the 305W limit and the 402W limit give the same benchmark results, so it looks like it doesn't matter.

Diablo-D3 · 2025-08-06T11:33:29Z

Powercolor Red Devil 7900XTX

Adrenalin 25.8.1 just came out, so time to test again
Ryzen 9800x3D
Windows 11 24H2 26100.4652

llama-win-hip/llama-bench.exe -m ./models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf -ngl 100 -r 100 -fa 0,1

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	100	0	pp512	3434.01 ± 38.33
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	100	0	tg128	153.91 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	100	1	pp512	3633.86 ± 10.29
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,RPC	100	1	tg128	145.23 ± 0.10

build: 2572689 (6099)

Still lower than the historical highs on May 26th (3599 and 3743), and a loss and a win against July 22nd (3529 and 3598).

totaldev · 2025-08-08T10:40:33Z

RX 7900 XTX (ASUS TUF)
Ubuntu 24.04.2
ROCm 6.4.2

./build/bin/llama-bench -m /home/vk/Downloads/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	3386.75 ± 5.33
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	128.25 ± 0.24
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	3674.25 ± 11.35
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	124.61 ± 0.06

build: 6c7e9a5 (6118)

MrLavender · 2025-08-11T15:38:14Z

RX 6800 (16GB 203W)

ROCm 6.3.4 on Ubuntu 24.04 in a Docker container

llama-bench --prio 1 -m /llama-cpp/models/local/llama-2-7b-Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	pp512	1447.07 ± 1.36
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	tg128	83.92 ± 0.03

build: 79c1160 (6123)

Bonus benchmarks

I ran these to compare ROCm versions on various models. Obviously the results are specific to my RX 6800 and shouldn't be used to make any judgments about ROCm performance in general, especially on RDNA3 and later GPUs. I use 6.3.4 because I don't care about Llama 3 8B.

Note how fast the new MoE models are - gpt-oss-20B even at Q6_K_XL is faster than this 7B Q4_0 model. (Do make sure that you have a fixed version because the original gpt-oss releases had some issues - I used https://huggingface.co/unsloth/gpt-oss-20b-GGUF).

ROCm 6.3.4

~4% performance regression in Llama 3 8B prompt processing. This is noted as a known issue:

Lower than expected performance may be observed while running Llama 3 8B inference workloads with Llama.cpp

ROCm 6.4.3

The Llama 3 8B issue still exists
~8% performance regression in qwen2 14B Q6_K and qwen3 14B Q6_K prompt processing

1 reply

lzivadinovic Jun 4, 2026

RX 6800 (16GB 203W)

ROCm 7.2.4 arch 7.0.10-arch1-1

llama-b9505/llama-bench -m models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16368 MiB):
  Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32, VRAM: 16368 MiB
load_backend: loaded ROCm backend from #############//llama-b9505/libggml-hip.so
load_backend: loaded RPC backend from #############//llama-b9505/libggml-rpc.so
load_backend: loaded CPU backend from #############//llama-b9505/libggml-cpu-haswell.so
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |   0 |           pp512 |      1382.74 ± 21.98 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |   0 |           tg128 |         79.78 ± 0.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |   1 |           pp512 |      1675.31 ± 13.21 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |   1 |           tg128 |         86.91 ± 0.25 |
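
When collecting results like the table above for comparison, it can help to parse the markdown output into records. The Python sketch below is a minimal parser for the table layout shown in this post; `parse_bench_table` is a hypothetical helper, and it assumes the last column always has the form "mean ± stddev" as printed here.

```python
import re

# Two rows copied from the table above, with the header and separator lines.
SAMPLE = """\
| model                          |       size |     params | backend    | ngl |  fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |   0 |           pp512 |      1382.74 ± 21.98 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |   1 |           tg128 |         86.91 ± 0.25 |
"""

def parse_bench_table(text):
    """Turn a llama-bench markdown table into a list of dicts, one per result row."""
    rows = [line.strip() for line in text.splitlines() if line.strip().startswith("|")]
    header = [cell.strip() for cell in rows[0].strip("|").split("|")]
    results = []
    for line in rows[1:]:
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue  # skip the |---|---:| separator row
        record = dict(zip(header, cells))
        # Split "mean ± stddev" into two floats.
        mean, stddev = (float(x) for x in re.split(r"\s*±\s*", record["t/s"]))
        record["t/s"] = mean
        record["stddev"] = stddev
        results.append(record)
    return results

for row in parse_bench_table(SAMPLE):
    print(row["test"], "fa=" + row["fa"], row["t/s"], "+/-", row["stddev"])
```

Lines that do not start with a pipe, such as the `ggml_cuda_init` and `load_backend` log lines, are ignored, so the full console output can be passed in as-is.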

totaldev · 2025-08-13T10:42:07Z

RX 7900 XTX (ASUS TUF, slightly overclocked by 100 MHz on core and VRAM)
Ubuntu 24.04.2
ROCm 7.0-rc1

./build/bin/llama-bench -m /home/vk/Downloads/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	3473.24 ± 12.30
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	132.17 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	3698.73 ± 17.60
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	127.43 ± 0.04

build: 648ebcd (6146)

tdjb · 2025-08-15T06:39:58Z

RX 6900 XT AMD Reference Card (Stock clocks)
Ryzen 7 5800X3D with 32GB 3600MHz C18 ram

Debian Testing
Using Docker image rocm/rocm-terminal with additions.

llama.cpp version: gguf-v0.17.1-386-gfd1234cb

./src/llama.cpp/build/bin/llama-bench -m models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -sm none -mg 0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
Device 0: AMD Radeon RX 6900 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32
Device 1: AMD Radeon RX 6900 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32

model	size	params	backend	ngl	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	0	pp512	1824.47 ± 1.02
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	0	tg128	83.02 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	1	pp512	1250.68 ± 0.73
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	1	tg128	80.45 ± 0.02

4 replies

olegshulyakov Aug 15, 2025
Author

@tdjb Results are pretty low, can you re-test using llama.cpp standalone without Docker?

tdjb Aug 15, 2025

Just a quick test, installed the llama.cpp build from Debian sid (was surprised to even find a build to be honest), which appears to be b5882 and the results came in quite similar. I tried the benchmark on both of my devices, as one is on a slower PCIe 4x slot, the results below are from the faster run.

Why do you think the 6900 XT should perform better?
Seeing the 6800 XT results above being a little slower made mine seem reasonable.
While reading the post again, I saw those were also being run using Docker.

model	size	params	backend	ngl	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	0	pp512	1835.54 ± 2.20
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	0	tg128	74.90 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	1	pp512	1314.84 ± 0.77
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	none	1	tg128	68.28 ± 0.07

Happy to run further tests.

olegshulyakov Aug 15, 2025
Author

As I understand it, it should be about 10% faster according to the specs: RX 6900 XT and RX 6800 XT

IMbackK Aug 18, 2025
Collaborator

the rdna2 results are mostly surprisingly high, given the hardware capabilities, not low.

TheyreEatingTheGeese · 2025-08-16T06:54:23Z

GigaByte R9700
build: e2c1bff (6177) | llama.cpp Vulkan and ROCm Docker containers

Note: both Vulkan and ROCm results are below.
The Vulkan benchmarks showed: WARNING: radv is not a conformant Vulkan implementation, testing use only.

llama-cli --bench --model /models/Qwen3-32B-Q4_K_M.gguf -ngl 100 -fa 0 -p 512,1024,2048,4096,8192,16384,30720 -n 128,256,512,1024

The Vulkan 32K prompt ran out of memory, so I changed it to 30K.
ROCm: the 16K+ prompts also had errors (though not out of memory).

model	size	params	backend	ngl	test	t/s
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	pp512	196.90 ± 0.43
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	pp1024	193.73 ± 0.22
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	pp2048	191.62 ± 0.36
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	pp4096	184.77 ± 0.14
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	pp8192	171.50 ± 0.08
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	pp16384	149.20 ± 0.11
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	pp30720	118.38 ± 1.08
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	pp512	203.35 ± 0.47
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	tg128	28.20 ± 0.03
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	tg256	28.14 ± 0.01
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	tg512	27.96 ± 0.01
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	Vulkan	100	tg1024	27.67 ± 0.01
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	ROCm	100	pp512	498.66 ± 0.59
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	ROCm	100	pp1024	473.24 ± 0.84
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	ROCm	100	pp2048	435.33 ± 0.62
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	ROCm	100	pp4096	380.48 ± 0.39
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	ROCm	100	pp8192	304.56 ± 0.15
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	ROCm	100	pp512	501.91 ± 0.66
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	ROCm	100	tg128	24.03 ± 0.04
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	ROCm	100	tg256	24.06 ± 0.02
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	ROCm	100	tg512	23.67 ± 0.02
qwen3 32B Q4_K - Medium	18.40 GiB	32.76 B	ROCm	100	tg1024	22.88 ± 0.01

llama-cli --bench --model /models/llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024

ROCm: the 32K prompt had errors.

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp512	1943.56 ± 6.92
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp1024	1879.03 ± 6.97
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp2048	1758.15 ± 2.78
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp4096	1507.73 ± 2.83
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp8192	1078.38 ± 0.53
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp16384	832.26 ± 0.67
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	pp32768	466.09 ± 0.19
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg128	124.13 ± 0.95
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg256	123.30 ± 0.19
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg512	119.96 ± 0.13
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	0	tg1024	114.71 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp512	1863.64 ± 6.66
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp1024	1780.54 ± 7.25
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp2048	1640.52 ± 3.72
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp4096	1417.17 ± 4.65
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp8192	1119.76 ± 0.41
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp16384	786.26 ± 0.83
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	pp32768	490.12 ± 0.47
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg128	124.65 ± 0.13
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg256	124.72 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg512	122.66 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	100	1	tg1024	119.27 ± 0.11
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp512	2746.39 ± 57.09
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp1024	2672.60 ± 7.19
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp2048	2475.62 ± 9.50
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp4096	2059.84 ± 0.94
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp8192	1333.60 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp16384	1014.06 ± 0.35
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp24576	769.31 ± 0.37
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	tg128	92.29 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	tg256	92.34 ± 0.25
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	tg512	90.28 ± 0.13
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	tg1024	86.91 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp512	1300.26 ± 3.04
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp1024	1009.69 ± 1.54
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp2048	695.68 ± 0.34
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp4096	428.36 ± 0.04
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp8192	242.06 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp16384	129.46 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp24576	88.34 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg128	93.28 ± 0.45
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg256	93.22 ± 0.12
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg512	91.31 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg1024	88.87 ± 0.35
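
Editor's sketch: pasted llama-bench tables like the one above are tab-separated, so they can be parsed for quick comparisons. The snippet below is a minimal example, not part of llama.cpp; the helper name `parse_rows` is hypothetical, and the four rows are copied from the ROCm section of the table above. It compares the flash-attention (fa=1) prompt-processing rate against the fa=0 rate at two prompt lengths.

```python
# Four rows copied verbatim from the ROCm part of the table above (tab-separated).
ROWS = """\
llama 7B Q4_0\t3.56 GiB\t6.74 B\tROCm\t100\t0\tpp512\t2746.39 ± 57.09
llama 7B Q4_0\t3.56 GiB\t6.74 B\tROCm\t100\t0\tpp16384\t1014.06 ± 0.35
llama 7B Q4_0\t3.56 GiB\t6.74 B\tROCm\t100\t1\tpp512\t1300.26 ± 3.04
llama 7B Q4_0\t3.56 GiB\t6.74 B\tROCm\t100\t1\tpp16384\t129.46 ± 0.01
"""


def parse_rows(text):
    """Map (backend, fa, test) -> mean tokens/s for each pasted row (hypothetical helper)."""
    results = {}
    for line in text.strip().splitlines():
        model, size, params, backend, ngl, fa, test, tps = line.split("\t")
        mean = float(tps.split("±")[0])  # drop the "± stddev" part
        results[(backend, int(fa), test)] = mean
    return results


results = parse_rows(ROWS)
ratios = {}
for test in ("pp512", "pp16384"):
    # Fraction of the fa=0 throughput that the fa=1 run achieves.
    ratios[test] = results[("ROCm", 1, test)] / results[("ROCm", 0, test)]
    print(f"{test}: fa=1 runs at {ratios[test]:.2f}x the fa=0 rate")
```

In this particular run the fa=1 rate falls further behind fa=0 as the prompt grows, which is the opposite of what later posts in the thread report on newer builds.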

3 replies

davispuh Aug 16, 2025

So 7900 XTX has better performance, that's sad. Also weird that Vulkan performs way worse on pp than ROCm but better on tg.

TheyreEatingTheGeese Aug 16, 2025

We'll see when someone a bit more experienced gives it a shot. My benchmarks are about as vanilla as it gets. Threw it in an unraid server (12700k and 128GB DDR4-2133), made docker images and ran benchmarks. Many of the 7900 XTX results are baremetal, have factory overclock or are manually overclocked, installed additional drivers, and/or have raised power limits. I bet someone will beat my benchmarks shortly.

IMbackK Aug 18, 2025
Collaborator

> So 7900 XTX has better performance, that's sad. Also weird that Vulkan performs way worse on pp than ROCm but better on tg.

there is zero reason to expect 9070(xt) to perform better than the xtx

prototypicall · 2025-08-17T04:55:27Z

prototypicall
Aug 17, 2025

Radeon RX 9070 (non-XT)

ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 9070, gfx1201 (0x1201), VMM: no, Wave Size: 32

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp512	2361.10 ± 0.88
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	tg128	99.39 ± 0.57
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp512	1147.66 ± 1.06
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg128	97.10 ± 0.33

build: 65349f2 (6183)

I tried to enable rocWMMA with -DGGML_HIP_ROCWMMA_FATTN=ON, but I don't think it worked. cmake complained that it couldn't find the header, so I provided the include path, but I didn't check whether the compiler was able to use it.

Still surprising that these numbers are better than the 9070 XT.

0 replies

krampenschiesser · 2026-03-01T20:23:57Z

krampenschiesser
Mar 1, 2026

Phew went through some issues here. ASRock Radeon AI PRO R9700.

ROCm 7.2

Stay away from ROCm 7.2:

  Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |  0 |           pp512 |      1520.24 ± 31.48 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |  0 |           tg128 |         92.28 ± 0.09 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |  1 |           pp512 |       1601.25 ± 5.70 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |  1 |           tg128 |         99.96 ± 0.06 |

build: 319146247 (8184)

Rocm 7.1.1

  Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |  0 |           pp512 |      4932.17 ± 67.62 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |  0 |           tg128 |         95.04 ± 0.33 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |  1 |           pp512 |      5025.17 ± 19.49 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |  1 |           tg128 |         98.71 ± 0.10 |

build: 319146247 (8184)

7 replies

krampenschiesser Mar 3, 2026

that worked! getting similar performance with rocm 7.2 now

pedapudi Mar 15, 2026

I'm seeing that the --amdgpu-unroll-threshold-local=600 flag is being ignored at compile time when llama.cpp is building the hip.so artifacts. Is this a more recent change and is the regression back with rocm7.2?

aviallon Mar 18, 2026

@pedapudi You have to set it in -DCMAKE_HIP_FLAGS. -DCMAKE_CXX_FLAGS won't do.

IMbackK Mar 18, 2026
Collaborator

*On Linux. On Windows you must set -DCMAKE_CXX_FLAGS instead.

pedapudi Mar 18, 2026

Yeah, I set it using -DCMAKE_HIP_FLAGS on Ubuntu. -DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm --mllvm -amdgpu-unroll-threshold-local=600"

cody-vibe · 2026-03-05T13:31:18Z

cody-vibe
Mar 5, 2026

Running on Manjaro Linux, using the ROCm docker image:

./llama-bench -m /models/TheBloke_Llama-2-7B-GGUF_llama-2-7b.Q4_0.gguf -ngl 100 -fa 0,1
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon 890M, gfx1150 (0x1150), VMM: no, Wave Size: 32
load_backend: loaded ROCm backend from /app/libggml-hip.so
load_backend: loaded CPU backend from /app/libggml-cpu-zen4.so

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	pp512	121.84 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	0	tg128	18.24 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	pp512	110.39 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	100	1	tg128	19.08 ± 0.01

build: 1a29907 (8202)

0 replies

MihaiBojescu · 2026-03-18T11:18:12Z

MihaiBojescu
Mar 18, 2026

AMD Radeon RX 7900 XTX (Sapphire Pulse 7900 XTX)

System details

CPU: AMD Ryzen 7 7700x
RAM: 64GB DDR5
OS: Arch Linux x86-64
ROCm: 7.2.0-2

ROCm results

$ HIP_VISIBLE_DEVICES=0 llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	3623.53 ± 22.80
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	139.89 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	3966.95 ± 19.81
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	145.32 ± 0.07

build: 8ff0207 (8400)

Vulkan results

$ GGML_VK_VISIBLE_DEVICES=0 llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	3164.42 ± 29.55
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	161.53 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	3374.48 ± 12.33
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	170.44 ± 0.07

build: bcdc4eb (8400)

0 replies

daMustermann · 2026-03-20T00:53:22Z

daMustermann
Mar 20, 2026

No ROCMinfo on Windows, but:

AMD Radeon RX 7900 XTX reference card, watercooled:

ROCM 7.2:

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	3552.47 ± 39.58
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	144.84 ± 0.36
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	3458.31 ± 33.75
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	150.40 ± 0.87

build: 1e64534 (8429)

Vulkan:

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	0	pp512	6695.94 ± 211.86
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	0	tg128	308.88 ± 0.73
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	1	pp512	6508.13 ± 118.87
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	1	tg128	310.91 ± 1.26

build: c9ced49 (7549)

Newer Vulkan version gives me the same speed as ROCM

8 replies

rohan-sircar Mar 20, 2026

What Vulkan version?

daMustermann Mar 20, 2026

I somehow used Vulkan and ROCM together, guess that's a huge speed up. Yeah, that is insane.

rohan-sircar Mar 20, 2026

:O m8 write a guide on it or something

daMustermann Mar 20, 2026

That is so funny. I know how to do it. I guess I will write a guide tomorrow.

PS C:\Users\<username>\Downloads\llama-b8429-bin-win-hip-radeon-x64> llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
HIP Library Path: C:\Windows\SYSTEM32\amdhip64_7.dll
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
load_backend: loaded ROCm backend from C:\Users\<username>\Downloads\llama-b8429-bin-win-hip-radeon-x64\ggml-hip.dll
load_backend: loaded RPC backend from C:\Users\<username>\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\<username>\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\<username>\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-cpu-alderlake.dll

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	0	pp512	3245.11 ± 107.60
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	0	tg128	134.46 ± 0.27
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	1	pp512	3321.11 ± 85.61
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	1	tg128	139.81 ± 0.40

build: 07ba6d2 (8417)

PS C:\Users\<username>\Downloads\llama-b8429-bin-win-hip-radeon-x64> llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
HIP Library Path: C:\Windows\SYSTEM32\amdhip64_7.dll
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 24560 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
load_backend: loaded ROCm backend from C:\Users\<username>\Downloads\llama-b8429-bin-win-hip-radeon-x64\ggml-hip.dll
load_backend: loaded RPC backend from C:\Users\<username>\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\Users\<username>\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\<username>\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-cpu-alderlake.dll

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	0	pp512	6695.40 ± 228.49
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	0	tg128	298.50 ± 6.80
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	1	pp512	6287.12 ± 324.69
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm,Vulkan	99	1	tg128	278.22 ± 13.25

build: c9ced49 (7549)
PS C:\Users\<username>\Downloads\llama-b8429-bin-win-hip-radeon-x64>

I had an older Vulkan version installed via winget (build: c9ced49 (7549), so this is on PATH), and when I run llama-bench in the newest llama HIP version folder it combines them and doubles everything in terms of speed. That should not work, but it does. With a newer Vulkan version (build: 07ba6d2 (8417)) it doesn't work. I guess it loads the model 2 times and blasts both?

aviallon Mar 22, 2026

@daMustermann what kind of performance do you get for Qwen3.5-35b-a3b on this card? I have an extremely bad 250 t/s for pp! And I use ROCm 7.1.1.
This happens with both my own builds and the ggml-org/llama.cpp:server-rocm build!

jschoch · 2026-03-22T17:04:00Z

jschoch
Mar 22, 2026

Vulkan Instance Version: 1.4.341

Linux fedora 6.19.6-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC

./llama-bench -m  /home/schoch/dev/llama.cpp/bench/llama-2-7b.Q4_0.gguf -fa 1 -ngl 99 -p 512,1024,2048,4096,8192,16384 -n 128,256,512,1024
load_backend: loaded RPC backend from /home/schoch/dev/build_llama.cpp/vulcan/llama-b8472/libggml-rpc.so
WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon AI PRO R9700 (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Ryzen 9 9950X 16-Core Processor (RADV RAPHAEL_MENDOCINO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/schoch/dev/build_llama.cpp/vulcan/llama-b8472/libggml-vulkan.so
load_backend: loaded CPU backend from /home/schoch/dev/build_llama.cpp/vulcan/llama-b8472/libggml-cpu-zen4.so
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |     4164.89 ± 741.52 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp1024 |       5194.62 ± 6.23 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp2048 |       5027.57 ± 3.13 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp4096 |       4692.49 ± 6.14 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp8192 |      4109.16 ± 10.75 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |         pp16384 |       3310.65 ± 3.68 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        128.26 ± 0.45 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg256 |        127.51 ± 0.07 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg512 |        125.44 ± 0.14 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          tg1024 |        122.26 ± 0.09 |

2 replies

IMbackK Mar 22, 2026
Collaborator

This thread is for HIP results not vulkan.

piotrp88 Mar 27, 2026

@jschoch HINT: try disabling CPU and you will probably see a decrease in pp and an increase in tg

Hadrianneue · 2026-03-25T16:21:06Z

Hadrianneue
Mar 25, 2026

./llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -fa 1,0
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 16304 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	5167.28 ± 106.65
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	116.50 ± 0.32
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	4753.07 ± 57.42
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	106.70 ± 0.08

build: 406f4e3 (8514)

The results above are from the pre-built binaries. Building with the AUR llama.cpp-hip script gives higher tg128 but slower pp512:

llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -fa 1,0
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 16304 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	4824.33 ± 27.00
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	119.97 ± 0.50
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	4661.14 ± 53.96
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	109.81 ± 0.30

build: unknown (8514)

0 replies

bloodsweatncode · 2026-04-01T20:01:02Z

bloodsweatncode
Apr 1, 2026

Adrenalin 26.3.1 WSL Debian 13.3 ROCDXG

./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16200 MiB):
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 16200 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	4766.48 ± 51.84
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	108.45 ± 0.49
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	1728.29 ± 26.50
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	32.78 ± 1.77

build: d43375f (8611)

0 replies

pakar · 2026-04-03T10:47:33Z

pakar
Apr 3, 2026

HP g1a laptop with APU capped at 70W.

AMD RYZEN AI MAX+ PRO 395 with 128GB ram using TheRock 7.12
llama.cpp version b8639
kernel 6.19.10-arch1-1

$ llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 100864 MiB):
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 100864 MiB

model	size	params	backend	ngl	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	pp512	1106.09 ± 17.99
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	pp1024	943.35 ± 31.11
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	pp2048	825.25 ± 0.63
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	pp4096	678.23 ± 1.59
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	pp8192	457.73 ± 0.57
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	pp16384	285.17 ± 0.16
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	pp32768	149.55 ± 0.68
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	tg128	44.15 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	tg256	44.06 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	tg512	40.62 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	tg1024	34.24 ± 0.01

$ llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,1024,2048,4096 -n 128,256
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 100864 MiB):
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 100864 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	1223.09 ± 32.28
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp1024	1222.64 ± 1.41
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp2048	1155.04 ± 0.86
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp4096	1027.97 ± 1.46
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	49.50 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg256	49.49 ± 0.05

$ llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 8192 -n 1024
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 100864 MiB):
Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 100864 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp8192	843.04 ± 2.52
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg1024	46.15 ± 0.06

0 replies

Hadrianneue · 2026-04-20T01:31:06Z

Hadrianneue
Apr 20, 2026

llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -fa 1,0
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 16304 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	5166.48 ± 100.00
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	116.46 ± 0.50
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	4732.96 ± 59.88
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	106.63 ± 0.30

build: e365e65 (8851)

1 reply

Hadrianneue May 9, 2026

llama-bench -m ~/models/llama-2-7b.Q4_0.gguf -fa 1,0

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	5135.83 ± 24.28
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	120.86 ± 0.49
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	4721.81 ± 66.83
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	110.69 ± 0.30

build: 5757c4d (9090)

gogich77 · 2026-05-09T18:17:28Z

gogich77
May 9, 2026

ROCm 7.2.3
AMD Radeon Graphics, gfx1201 (0x1201) -> AI PRO R9700 (ASRock)-> PCI 5x16

/workspace/llama.cpp/build/bin/llama-bench -m /data2/llm/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512 -n 128 --device ROCm0
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 65248 MiB):
Device 0: AMD Radeon Graphics, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
Device 1: AMD Radeon Graphics, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	ROCm0	pp512	4174.85 ± 316.24
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	ROCm0	tg128	112.57 ± 0.27
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	ROCm0	pp512	4945.17 ± 17.05
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	ROCm0	tg128	121.69 ± 0.11

build: 046e284 (9085)

Second card, on PCIe 4:
################
AMD Radeon Graphics, gfx1201 (0x1201) -> AI PRO R9700 (ASRock)-> PCI 4x4
/workspace/llama.cpp/build/bin/llama-bench -m /data2/llm/models/llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512 -n 128 --device ROCm1
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 65248 MiB):
Device 0: AMD Radeon Graphics, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
Device 1: AMD Radeon Graphics, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB

model	size	params	backend	ngl	fa	dev	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	ROCm1	pp512	4309.94 ± 2.70
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	ROCm1	tg128	102.80 ± 0.76
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	ROCm1	pp512	4989.28 ± 46.93
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	ROCm1	tg128	117.18 ± 1.07

build: 046e284 (9085)

1 reply

JonRickSanchez Jun 4, 2026

Is the "AMD Radeon Graphics, gfx1201 (0x1201) -> AI PRO R9700 (ASRock)-> PCI 4x4" by any chance on chipset PCIe or via M.2? I found chipset PCIe VERY unstable, and rocm-bandwidth-test -A (or -a) very inconsistent and way below its theoretical speed. I ended up using bifurcation on my B650M and running PCIe 4.0 x8/x8 (with the iGPU disabled, as it was also messing with the bandwidth somehow).
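
Editor's sketch: for reference when judging a measured number against "theoretical speed", the raw per-direction link rate of a PCIe slot follows from the per-lane transfer rate and the 128b/130b line encoding used by PCIe 3.0 through 5.0. The function name below is hypothetical, and real transfers land below these figures because of packet and protocol overhead.

```python
# Per-lane transfer rate in GT/s for each PCIe generation that uses 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130  # 128 payload bits per 130 bits on the wire


def pcie_raw_gbytes_per_s(gen: int, lanes: int) -> float:
    """Raw one-direction link bandwidth in GB/s (hypothetical helper, ignores protocol overhead)."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY / 8 * lanes


# The slot configurations mentioned in this sub-thread.
for gen, lanes in ((4, 4), (4, 8), (5, 16)):
    print(f"PCIe {gen}.0 x{lanes}: {pcie_raw_gbytes_per_s(gen, lanes):.2f} GB/s")
```

Link bandwidth mostly affects model loading and multi-GPU traffic; the single-GPU pp512/tg128 runs above keep the whole model in VRAM.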

wadeflaw · 2026-05-09T22:47:11Z

wadeflaw
May 9, 2026

rocm Version : 7.13.0a20260508-1
os: cachy
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 8176 MiB):
Device 0: AMD Radeon RX 7600S, gfx1102 (0x1102), VMM: no, Wave Size: 32, VRAM: 8176 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	783.23 ± 5.71
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	45.02 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	915.55 ± 4.70
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	49.62 ± 0.05

0 replies

imrehg · 2026-05-13T09:11:58Z

imrehg
May 13, 2026

Radeon 890M

Framework 13" with Ryzen AI 9 HX 370 / Radeon 890M GPU
DDR5-5600 - 96GB
ArchLinux with llama.cpp-hip b9127-1

Running the more detailed test of:

llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1 -p 512,1024,2048,4096,8192,16384,32768 -n 128,256,512,1024,2048

Getting results of:

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 76800 MiB):
  Device 0: AMD Radeon 890M Graphics, gfx1150 (0x1150), VMM: no, Wave Size: 32, VRAM: 76800 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	417.37 ± 2.25
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp1024	387.80 ± 3.20
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp2048	322.97 ± 4.57
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp4096	267.51 ± 5.69
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp8192	201.26 ± 0.19
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp16384	131.99 ± 0.13
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp32768	65.26 ± 0.07
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	16.12 ± 0.10
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg256	15.88 ± 0.05
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg512	15.14 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg1024	14.15 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg2048	12.64 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	421.30 ± 6.79
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp1024	406.40 ± 0.99
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp2048	386.99 ± 0.62
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp4096	342.35 ± 0.22
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp8192	265.00 ± 0.24
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp16384	170.29 ± 0.09
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp32768	96.47 ± 0.08
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	16.73 ± 0.11
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg256	17.05 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg512	16.50 ± 0.03
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg1024	15.92 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg2048	14.97 ± 0.00

build: 338ea1e (9127)

0 replies

RapidMark · 2026-06-01T15:24:32Z

RapidMark
Jun 1, 2026

A data point from the image-diffusion side, since the rocWMMA flash-attn numbers here are (understandably) all LLM token-generation. Diffusion attention is a different shape — non-GQA (ncols2 == 1), a single fixed ~4096-token context, no KV-cache growth — so I figured the routing/payoff was worth a number.

Setup: AMD Radeon AI PRO R9700 (Navi 48 / RDNA4 / gfx1201, 32 GB), rocm/dev-ubuntu-24.04:7.0.2-complete (rocWMMA 2.0). Workload via stable-diffusion.cpp (vendors ggml): FLUX.1 Q4_K_S, 1024×1024, 20 steps, flash-attn on. Two builds, identical except -DGGML_HIP_ROCWMMA_FATTN:

Build	FA kernel dispatched	Attention	Total wall
`ROCWMMA_FATTN=OFF`	`flash_attn_tile` (no matrix cores)	7.76 s	98.0 s
`ROCWMMA_FATTN=ON`	`flash_attn_ext_f16` (rocWMMA)	7.27 s	95.5 s

So ~6% on the attention kernel, ~2.5% on total wall, output verified correct at the same seed.
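
Editor's sketch: the two percentages can be reproduced from the table with a relative time reduction. Only the four timings come from the post; the function name is hypothetical.

```python
# Timings in seconds, copied from the table above.
attention_off, attention_on = 7.76, 7.27
wall_off, wall_on = 98.0, 95.5


def percent_time_saved(before: float, after: float) -> float:
    """Time saved relative to the 'before' run, in percent (hypothetical helper)."""
    return (before - after) / before * 100.0


attention_gain = percent_time_saved(attention_off, attention_on)
wall_gain = percent_time_saved(wall_off, wall_on)
print(f"attention kernel: {attention_gain:.1f}% less time")
print(f"total wall clock: {wall_gain:.1f}% less time")
```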

Two things worth noting for other AMD/diffusion users:

Without the flag, non-GQA attention falls to the generic flash_attn_tile — the mma path is gated off for non-GQA (fattn.cu gqa_opt_applies, and fattn-mma-f16.cuh bails on ncols2 == 1), so on AMD the rocWMMA fattn-wmma-f16 path is the only way non-GQA attention reaches the matrix cores.
On RDNA4 that path needs rocWMMA ≥ 2.0 (ROCm 7) — rocWMMA 1.7 (ROCm 6.4.2) won't compile the RDNA4 WMMA fattn instantiation. Might be worth a one-line note next to the existing RDNA3/CDNA guidance in docs/build.md, since the RDNA4-needs-ROCm-7 requirement isn't obvious.

The small gain is consistent with this workload being memory-bandwidth-bound at 1024² (O(n²) over 4096 tokens), not matmul-FLOP-bound — matrix cores help, just not a lot here. The same card sees a bigger relative benefit at smaller, more compute-bound configs. Happy to run more diffusion points (resolutions/steps/other RDNA cards) if it's useful to the rocWMMA tuning work in #16827 — it's a fixed-context, non-GQA stress case the LLM benchmarks don't cover.

1 reply

IMbackK Jun 2, 2026
Collaborator

it should not dispatch the tile kernel for the non rocwmma case for most shapes, it should dispatch the mma kernel:

llama.cpp/ggml/src/ggml-cuda/fattn.cu

Line 519 in 354ebac

    
           if ((amd_wmma_available(cc) && gqa_opt_applies && Q->ne[0] <= 128) && Q->ne[0] != 40 && Q->ne[0] != 72 && Q->ne[1] * gqa_ratio_eff > 8) {

for CDNA this is the fastest kernel and for RDNA4 this kernel should also be fastest, at least for the shapes used in llms.
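
Editor's sketch: the quoted condition is easier to reason about as a standalone predicate. The Python below is only a transliteration of that one line, not the real dispatch code. The function and parameter names are hypothetical, and reading `Q->ne[0]` as the head size and `Q->ne[1]` as the number of query tokens is an assumption about ggml's tensor layout. The inputs in the examples are made up for illustration.

```python
def quoted_condition_holds(amd_wmma_available: bool, gqa_opt_applies: bool,
                           q_ne0: int, q_ne1: int, gqa_ratio_eff: int) -> bool:
    """Transliteration of the quoted `if` from fattn.cu (hypothetical helper).

    q_ne0 stands for Q->ne[0] (assumed: head size),
    q_ne1 stands for Q->ne[1] (assumed: number of query tokens).
    """
    return (amd_wmma_available and gqa_opt_applies and q_ne0 <= 128
            and q_ne0 != 40 and q_ne0 != 72
            and q_ne1 * gqa_ratio_eff > 8)


# Illustrative inputs only; none of these values are taken from a real run.
print(quoted_condition_holds(True, True, 128, 512, 1))    # long prompt batch
print(quoted_condition_holds(True, False, 128, 4096, 1))  # gqa_opt_applies is false
print(quoted_condition_holds(True, True, 128, 1, 8))      # 1 * 8 is not > 8
```

Note that the condition as quoted still requires `gqa_opt_applies`, so whether a non-GQA diffusion shape satisfies it depends on how that flag is computed, which the quoted line does not show.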

RapidMark · 2026-06-01T17:32:26Z

RapidMark
Jun 1, 2026

Quick follow-up to my earlier diffusion numbers — the rocWMMA flash-attn win on RDNA4 turns out to be strongly config-dependent, which lines up nicely with the LLM results others have posted in this thread.

Same setup as before (R9700 / gfx1201 / ROCm 7.0.2 / rocWMMA 2.0, FLUX.1 Krea Q4, -DGGML_HIP_ROCWMMA_FATTN=ON), sampling time only:

config	FA off	FA on	speedup
512² / 4 steps (compute-bound)	6.76 s	2.82 s	~2.4×
1024² / 20 steps (memory-bound)	—	—	~6%

So at small / compute-bound configs the matrix-core path is a big win (~2.4×), but as resolution grows the attention becomes memory-bandwidth-bound (O(n²) K/V streaming) and the gain shrinks to a few percent. Same shape as the LLM side here — large on long-context prefill, modest on short prompts / decode.
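
Editor's sketch: the ~2.4× figure for the 512² / 4-step row follows directly from the two sampling times. The same numbers are also shown as a time reduction, which is the form the ~6% figure for the 1024² row was derived in (7.76 s to 7.27 s in the earlier post). Variable names are illustrative.

```python
# Sampling times in seconds for the 512x512 / 4-step row of the table above.
fa_off_s, fa_on_s = 6.76, 2.82

speedup = fa_off_s / fa_on_s                      # how many times faster FA on is
time_saved_pct = (1 - fa_on_s / fa_off_s) * 100   # same result as a time reduction

print(f"speedup: {speedup:.2f}x, time saved: {time_saved_pct:.1f}%")
```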

Net for image-diffusion users on RDNA4: definitely worth building with rocWMMA fattn on (ROCm 7 / rocWMMA 2.0) — just be aware the upside is very workload-dependent. For reference, with FA on the R9700 lands right next to an A6000 (CUDA) at the same diffusion recipe.

2 replies

IMbackK Jun 2, 2026
Collaborator

you should probably not be using rocwmma fattn at all on gfx12, it's slower than the default fattn implementation on this arch, at least for llms

patrickzel Jun 5, 2026

I get better results with rocwmma on my R9700 and Gemma 4 31B for prompt processing.

JonRickSanchez · 2026-06-04T22:09:59Z

JonRickSanchez
Jun 4, 2026

OS: CachyOS
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16304 MiB):
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 16304 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	4733.95 ± 129.90
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	116.95 ± 0.85
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	5541.12 ± 60.59
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	128.38 ± 0.28

build: ba5b911 (9518)

Just for giggles, here is Vulkan. It is quite a bit faster.
GGML_VK_VISIBLE_DEVICES=0 llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 9070 XT (RADV GFX1201) (radv) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	5673.65 ± 64.11
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	136.64 ± 0.61
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	6004.06 ± 24.44
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	142.88 ± 0.26

3 replies

sswtodo Jun 5, 2026

9070 xt > 7900 xtx ? I'm so surprised... by 162% more PP ;-)

JonRickSanchez Jun 5, 2026

Funnily enough, these are the results for my 7900 XT on the same PCIe 4.0 x8 slot; it has quite a bit improved TG from what I see. Vulkan is more than 10% faster than ROCm in TG and just a smidgen slower in PP.

GGML_VK_VISIBLE_DEVICES=1 llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XT (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	pp512	3052.66 ± 17.27
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	0	tg128	142.47 ± 0.37
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	pp512	3208.19 ± 13.59
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	1	tg128	149.22 ± 0.37

And for ROCm results:

CUDA_VISIBLE_DEVICES=0 llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 0,1
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 20464 MiB):
Device 0: AMD Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 20464 MiB

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	pp512	3157.90 ± 56.43
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	0	tg128	126.05 ± 0.18
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	3389.57 ± 32.50
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	130.43 ± 0.27
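
Whether the PP gap on the 7900 XT is real or just run-to-run noise can be sanity-checked from the ± values. A rough sketch, assuming the ± figure is a standard deviation and the two runs are independent; `gap_in_sigmas` is a hypothetical helper, and this is a back-of-envelope check rather than a proper significance test.

```python
import math


def gap_in_sigmas(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Difference of two means in units of their combined standard deviation."""
    return (mean_a - mean_b) / math.hypot(sd_a, sd_b)


# RX 7900 XT, fa=0, values from the tables above: mean t/s and its ± figure.
pp = gap_in_sigmas(3052.66, 17.27, 3157.90, 56.43)  # Vulkan vs ROCm, pp512
tg = gap_in_sigmas(142.47, 0.37, 126.05, 0.18)      # Vulkan vs ROCm, tg128

print(f"pp512 gap: {pp:+.1f} sigma")
print(f"tg128 gap: {tg:+.1f} sigma")
```

Under those assumptions the pp512 gap stays below two combined deviations, while the tg128 gap is far outside them, which fits the reading that Vulkan is only marginally slower in PP but clearly faster in TG on this card.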

JonRickSanchez Jun 5, 2026

9070 xt > 7900 xtx ? I'm so surprised... by 162% more PP ;-)

In all honesty, I'd take TG over PP any day :) I am more surprised by how well the Vulkan backend performs; AMD should be ashamed.

Replies: 62 comments · 114 replies
