Name and Version
version: 8685 (0988acc)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA, CPU, Vulkan
Hardware
Ryzen 7 5700X + GeForce RTX 5060 Ti (driver 580.126.20, CUDA 13.0)
Models
Bonsai-8B.gguf
Problem description & steps to reproduce
After support for Bonsai models and Q1_0 quantization was added, Bonsai-8B inference on all recent versions of llama.cpp runs at about 0.3–0.5 t/s on every backend (CPU, Vulkan, CUDA).
Tested on various hardware; the speed is the same everywhere, so it appears that no hardware acceleration is being used (see the llama-bench sketch below).
The original fork reaches about 130–140 t/s on the same hardware.
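A quick way to quantify the regression per backend is llama-bench; this is a sketch assuming the same model file, where -ngl sets the number of layers offloaded to the GPU (99 requests full offload, 0 forces pure CPU):

llama-bench -m Bonsai-8B.gguf -ngl 99
llama-bench -m Bonsai-8B.gguf -ngl 0

If both runs report roughly the same ~0.3–0.5 t/s, that is consistent with the Q1_0 path falling back to an unaccelerated implementation on every backend.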
First Bad Commit
No response
Relevant log output
llama-cli -m Bonsai-8B.gguf -ngl 99 -p "Tell me your name"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15825 MiB):
Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15825 MiB
Loading model...
[llama.cpp ASCII art banner]
build : b8685-0988accf8
model : Bonsai-8B.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read <file> add a text file
/glob <pattern> add text files using globbing pattern
> Tell me your name
My name is Bonsai. I am an AI assistant developed by PrismML. How can I help you today?
[ Prompt: 0,3 t/s | Generation: 0,3 t/s ]
>