Bonsai Q1_0 benchmarks — Strix Halo Vulkan (gfx1151) #26

@stampby

Description

Bonsai 1-bit Benchmark Results — AMD Strix Halo

Hardware: AMD Ryzen AI MAX+ 395, Radeon 8060S (gfx1151), 128GB unified RAM
OS: Arch Linux, Kernel 7.0.0-1-mainline
Build: Prism llama.cpp b8796-e2d6742, Vulkan backend
Runs: 3, stddev reported

Results

| Model | Params | Size | pp512 t/s | tg128 t/s |
|---|---|---|---|---|
| Bonsai-1.7B | 1.72B | 231 MB | 3,120.8 ±33 | 136.8 ±0.2 |
| Bonsai-4B | 4.02B | 540 MB | 1,401.3 ±7 | 85.0 ±0.3 |
| Bonsai-8B | 8.19B | 1.07 GB | 831.4 ±2 | 63.8 ±0.1 |
| Qwen3-Coder-Next 80B.A3B | 79.67B (MoE) | 17.6 GB | 712.4 ±7 | 64.9 ±0.0 |
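As a rough sanity check, the file sizes line up with the detected bits-per-weight: params × bpw / 8 gives the quantized weight payload, with the remainder accounted for by embeddings and other tensors kept at higher precision. A minimal sketch (the IQ1_S bpw is from the notes below; Q1_0 at ~1.0 bpw is an assumption):

```python
# Rough size check: params * bits-per-weight / 8 ~ quantized weight payload.
# Q1_0 is assumed to be ~1.0 bpw; IQ1_S is 1.5625 bpw per the detection log.
models = [
    # (name, params, bpw, reported size in bytes)
    ("Bonsai-1.7B", 1.72e9, 1.0, 231e6),
    ("Bonsai-4B", 4.02e9, 1.0, 540e6),
    ("Bonsai-8B", 8.19e9, 1.0, 1.07e9),
    ("Qwen3-Coder-Next 80B.A3B", 79.67e9, 1.5625, 17.6e9),
]
for name, params, bpw, reported in models:
    weights = params * bpw / 8  # bytes for the quantized weights alone
    print(f"{name}: {weights / 1e9:.2f} GB weights vs {reported / 1e9:.2f} GB on disk")
```

In each case the estimate comes in somewhat under the on-disk size, as expected.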

Notes

  • All models loaded fully on GPU (-ngl 99)
  • llama.cpp auto-selected the zen4 CPU backend alongside the Vulkan GPU backend
  • Qwen3-Coder-Next detected as IQ1_S - 1.5625 bpw (TQ1_0 GGUF)
  • Bonsai models detected as Q1_0
  • Strix Halo has 512MB dedicated VRAM + 115GB shared GTT (unified memory architecture)
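For reference, an invocation along these lines should reproduce the table above; the build and model paths are placeholders, and the flags match the setup described here (`-ngl 99` for full GPU offload, `-r 3` for the three-run stddev; `llama-bench` defaults to pp512 and tg128):

```shell
# Hypothetical invocation matching the reported setup (paths are placeholders).
./build/bin/llama-bench -m models/bonsai-8b-q1_0.gguf -ngl 99 -r 3
```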

What didn't work

  • The i2_s format (Microsoft BitNet 2B-4T, Falcon3-7B 1.58-bit) fails to load with "failed to load model"
  • This is expected: i2_s needs dedicated BitNet kernel support, which is not yet in Prism b8796

Context

Running a multi-backend AI stack on this hardware. Full results across MLX ROCm, vLLM ROCm, Vulkan llama.cpp, and NPU (FLM) at: https://github.com/stampby/bleeding-edge

Raw CSV: https://github.com/stampby/bleeding-edge/blob/main/results/bench-1bit-20260415.csv

Happy to run additional benchmarks — different models, context sizes, batch sizes. Just let me know.
