Eval bug: Very slow inference of Q1_0 Bonsai model #21574

@nbkgroup

Description

Name and Version

version: 8685 (0988acc)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA, CPU, Vulkan

Hardware

Ryzen 7 5700X + GeForce RTX 5060 Ti Driver Version: 580.126.20 CUDA Version: 13.0

Models

Bonsai-8B.gguf

Problem description & steps to reproduce

After support for Bonsai models and Q1_0 quantization was added, Bonsai-8B inference runs at about 0.3 - 0.5 t/s on all recent versions of llama.cpp, on the CPU, Vulkan, and CUDA backends alike.

Tested on various hardware; the speed is the same everywhere, which suggests no hardware acceleration is being used.

For comparison, the original fork reaches about 130-140 t/s on the hardware reported above.
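A rough sanity check supports the "no acceleration" suspicion: token generation is normally memory-bandwidth-bound, so even a pessimistic bound sits far above 0.3 t/s. The sketch below is a back-of-envelope estimate only; the model size, effective bits per weight, and bandwidth figures are all assumptions, not measured values.

```python
# Back-of-envelope: expected t/s ~= memory bandwidth / bytes read per token,
# since each generated token reads roughly the whole weight matrix set once.
# All figures below are assumptions for illustration.

def expected_tps(model_bytes: float, bandwidth_bytes_s: float) -> float:
    """Upper-bound tokens/s if every weight byte is read once per token."""
    return bandwidth_bytes_s / model_bytes

GiB = 1024 ** 3
# Assume ~8B params at an effective ~2 bits/weight (1-bit quant plus scales).
model_size = 8e9 * 2 / 8            # ~2 GB of weights
vram_bw = 448 * GiB                 # assumed RTX 5060 Ti-class bandwidth
ddr4_bw = 50 * GiB                  # assumed dual-channel DDR4 bandwidth

print(f"GPU-bound estimate: {expected_tps(model_size, vram_bw):.1f} t/s")
print(f"CPU-bound estimate: {expected_tps(model_size, ddr4_bw):.1f} t/s")
# Observed 0.3-0.5 t/s is orders of magnitude below either bound,
# consistent with a scalar fallback path rather than a bandwidth limit.
```

Even the CPU-bound estimate is two orders of magnitude above the observed rate, which points at the Q1_0 matmul path falling back to unoptimized scalar code rather than at any bandwidth limit.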

First Bad Commit

No response

Relevant log output

llama-cli -m Bonsai-8B.gguf -ngl 99 -p "Tell me your name"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15825 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15825 MiB

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8685-0988accf8
model      : Bonsai-8B.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> Tell me your name

My name is Bonsai. I am an AI assistant developed by PrismML. How can I help you today?

[ Prompt: 0,3 t/s | Generation: 0,3 t/s ]

> 
