Supporting Ternary Bonsai in llama.cpp; group-128 ternary format discussion #22019
khosravipasha asked this question in Q&A
Follow-up to the 1-bit `Q1_0` work we did earlier for the Bonsai family of models. We just released Ternary Bonsai (1.7B / 4B / 8B) and would like to discuss what the right path is for llama.cpp support.

Collection: Ternary-Bonsai
White paper: Ternary-Bonsai-8B-Whitepaper
Live demo in your browser: webml/WebGPU
We get much better quality with a bit more overhead.

The format
Weights take values from {-1, 0, +1}, with one shared FP16 scale per group of 128 weights.
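As a rough illustration of the layout, here is what a group-128 2-bit block and its reference dequantization could look like. This is only a sketch under my assumptions: the name `block_q2_0`, the field names, and the packing order are placeholders, not an existing ggml type.

```c
#include <stdint.h>

#define QK2_0 128  // weights per block (group size)

// Hypothetical group-128 block. In the real format the scale would be FP16
// (ggml_half), i.e. 2 + 128/4 = 34 bytes per block, ~2.125 bits per weight.
// A plain float is used here only to keep the sketch self-contained.
typedef struct {
    float   d;               // shared group scale
    uint8_t qs[QK2_0 / 4];   // 2-bit codes, 4 weights per byte
} block_q2_0;

// Reference dequantization: stored code q in {0, 1, 2} maps to the ternary
// value q - 1 in {-1, 0, +1}, multiplied by the group scale.
static void dequantize_row_q2_0(const block_q2_0 * restrict x, float * restrict y, int64_t k) {
    const int64_t nb = k / QK2_0;
    for (int64_t i = 0; i < nb; ++i) {
        for (int j = 0; j < QK2_0; ++j) {
            const int q = (x[i].qs[j / 4] >> (2 * (j % 4))) & 0x3;  // unpack one 2-bit code
            y[i * QK2_0 + j] = x[i].d * (float)(q - 1);             // ternary value times scale
        }
    }
}
```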
Similar formats

- `Q2_0`: Was thinking of doing `Q2_0` with group size 128, similar to what we did for `Q1_0`, but wanted to check here if there are better solutions before going that route. I only did a quick POC on NEON CPU and it works great (needs more thorough testing and tuning though); a rough sketch of the reference path is at the end of this post.
- `TQ1_0` / `TQ2_0`: These seem to be the closest match to our format, but the group size is 256, which isn't native for our models: we would have to fit two of our groups into one block and might lose too much quality. This was already in the pipeline before we realized it would need group size 256. Those types also seem to target the CPU use case (and might be hard to support efficiently on Metal/CUDA).
- `Q2_K`: This one seems related, but we need to check whether our model can be packed into it exactly, since the scales are also quantized inside the superblocks. We might be able to work out an exact formula to pack into this format, but the bpw would also be higher than `Q2_0` or pure ternary (rough numbers after this list). How are the speeds for this one?
- Difference vs BitNet: Based on my understanding, the BitNet format uses one scale per row, while we use one scale per group of 128, so there wouldn't be native support there either.
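For context, some rough bits-per-weight arithmetic (going from memory on the existing block layouts, so please correct me if these are off):

- `Q2_0` with group size 128: 2 x 128 bits of codes + 16 bits of FP16 scale = 272 bits per 128 weights, i.e. 2.125 bpw.
- `TQ2_0` (group 256): 64 bytes of codes + 2 bytes of scale = 66 bytes per 256 weights, i.e. about 2.06 bpw.
- `Q2_K`: around 2.6 bpw once the quantized sub-block scales and mins are counted.

So the group-128 route would sit just above `TQ2_0` and noticeably below `Q2_K`.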
Questions
We know adding a new type increases the maintenance cost, so we wanted to check here first what the preferred route is. We will be releasing more and more models in both formats (or even mixed) going forward, so having native support would be nice to have.
Right now we're leaning towards `Q2_0` with group size 128, but wanted to check here first to see if there are better solutions we've overlooked. Another advantage of `Q2_0` is that it would also be useful in case we do 2-bit models in the future, and it completes the `Q1_0`, `Q2_0`, `Q4_0`, `Q8_0` family.
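To make that option concrete, here is roughly what a scalar reference dot product against 8-bit activations could look like (the sketch our NEON POC vectorizes). Again, everything here is hypothetical: `block_q8_128`, the function name, and the group-128 activation blocking are my assumptions, not existing llama.cpp code, and the real hookup would go through ggml's type traits.

```c
#include <stdint.h>

#define QK2_0 128

// Same hypothetical ternary block as in the sketch above.
typedef struct {
    float   d;               // shared group scale (FP16 in the real format)
    uint8_t qs[QK2_0 / 4];   // 2-bit codes, 4 weights per byte
} block_q2_0;

// Hypothetical 8-bit activation block with the same group size and one scale.
typedef struct {
    float  d;                // activation group scale
    int8_t qs[QK2_0];        // quantized activations
} block_q8_128;

// Scalar reference dot product between a ternary weight row and quantized activations:
// accumulate in integers per block, then apply both group scales once.
static float vec_dot_q2_0_q8(int64_t n, const block_q2_0 * restrict x, const block_q8_128 * restrict y) {
    const int64_t nb = n / QK2_0;
    float sum = 0.0f;
    for (int64_t i = 0; i < nb; ++i) {
        int32_t isum = 0;
        for (int j = 0; j < QK2_0; ++j) {
            const int q = ((x[i].qs[j / 4] >> (2 * (j % 4))) & 0x3) - 1;  // ternary weight in {-1, 0, +1}
            isum += q * y[i].qs[j];                                       // integer accumulate
        }
        sum += (float)isum * x[i].d * y[i].d;                             // scale once per block
    }
    return sum;
}
```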