Supporting Ternary Bonsai in llama.cpp; group-128 ternary format discussion #22019
khosravipasha asked this question in Q&A
Follow-up to the 1-bit `Q1_0` work we did earlier for the Bonsai family of models. We just released Ternary Bonsai (1.7B / 4B / 8B) and would like to discuss what the right path is for llama.cpp support.

Collection: Ternary-Bonsai
White paper: Ternary-Bonsai-8B-Whitepaper
Live demo in your browser: webml/WebGPU
We get much better quality with a bit more overhead.

The format
Weights take values from {-1, 0, +1}, with one shared FP16 scale per group of 128 weights.
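As a rough illustration of the layout, here is what a group-128 2-bit block and its reference dequantization could look like. This is only a sketch under my assumptions: the name `block_q2_0`, the field names, and the packing order are placeholders, not an existing ggml type.

```c
#include <stdint.h>

#define QK2_0 128  // weights per block (group size)

// Hypothetical group-128 block. In the real format the scale would be FP16
// (ggml_half), i.e. 2 + 128/4 = 34 bytes per block, ~2.125 bits per weight.
// A plain float is used here only to keep the sketch self-contained.
typedef struct {
    float   d;               // shared group scale
    uint8_t qs[QK2_0 / 4];   // 2-bit codes, 4 weights per byte
} block_q2_0;

// Reference dequantization: stored code q in {0, 1, 2} maps to the ternary
// value q - 1 in {-1, 0, +1}, multiplied by the group scale.
static void dequantize_row_q2_0(const block_q2_0 * restrict x, float * restrict y, int64_t k) {
    const int64_t nb = k / QK2_0;
    for (int64_t i = 0; i < nb; ++i) {
        for (int j = 0; j < QK2_0; ++j) {
            const int q = (x[i].qs[j / 4] >> (2 * (j % 4))) & 0x3;  // unpack one 2-bit code
            y[i * QK2_0 + j] = x[i].d * (float)(q - 1);             // ternary value times scale
        }
    }
}
```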
Similar formats

- `Q2_0`: Was thinking of doing `Q2_0` with group size 128, similar to what we did for `Q1_0`, but wanted to check here if there are better solutions before going that route. I only did a quick POC on NEON CPU and it works great (needs more thorough testing and tuning though); a rough sketch of the reference path is at the end of this post.
- `TQ1_0` / `TQ2_0`: These seem to be the closest match to our format, but the group size is 256, which isn't native for our models: we would have to fit two of our groups into one block and might lose too much quality. This was already in the pipeline before we realized it would need group size 256. Those types also seem to target the CPU use case (and might be hard to support efficiently on Metal/CUDA).
- `Q2_K`: This one seems related, but we need to check whether our model can be packed into it exactly, since the scales are also quantized inside the superblocks. We might be able to work out an exact formula to pack into this format, but the bpw would also be higher than `Q2_0` or pure ternary (rough numbers after this list). How are the speeds for this one?
- Difference vs BitNet: Based on my understanding, the BitNet format uses one scale per row, while we use one scale per group of 128, so there wouldn't be native support there either.
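For context, some rough bits-per-weight arithmetic (going from memory on the existing block layouts, so please correct me if these are off):

- `Q2_0` with group size 128: 2 x 128 bits of codes + 16 bits of FP16 scale = 272 bits per 128 weights, i.e. 2.125 bpw.
- `TQ2_0` (group 256): 64 bytes of codes + 2 bytes of scale = 66 bytes per 256 weights, i.e. about 2.06 bpw.
- `Q2_K`: around 2.6 bpw once the quantized sub-block scales and mins are counted.

So the group-128 route would sit just above `TQ2_0` and noticeably below `Q2_K`.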
Questions
We know adding a new type increases the maintenance cost, so we wanted to check here first what the preferred route is. We will be releasing more and more models in both formats (or even mixed) going forward, so having native support would be nice to have.
Right now we're leaning towards `Q2_0` with group size 128, but wanted to check here first to see if there are better solutions we've overlooked. Another advantage of `Q2_0` is that it would also be useful in case we do 2-bit models in the future, and it completes the `Q1_0`, `Q2_0`, `Q4_0`, `Q8_0` family.
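To make that option concrete, here is roughly what a scalar reference dot product against 8-bit activations could look like (the sketch our NEON POC vectorizes). Again, everything here is hypothetical: `block_q8_128`, the function name, and the group-128 activation blocking are my assumptions, not existing llama.cpp code, and the real hookup would go through ggml's type traits.

```c
#include <stdint.h>

#define QK2_0 128

// Same hypothetical ternary block as in the sketch above.
typedef struct {
    float   d;               // shared group scale (FP16 in the real format)
    uint8_t qs[QK2_0 / 4];   // 2-bit codes, 4 weights per byte
} block_q2_0;

// Hypothetical 8-bit activation block with the same group size and one scale.
typedef struct {
    float  d;                // activation group scale
    int8_t qs[QK2_0];        // quantized activations
} block_q8_128;

// Scalar reference dot product between a ternary weight row and quantized activations:
// accumulate in integers per block, then apply both group scales once.
static float vec_dot_q2_0_q8(int64_t n, const block_q2_0 * restrict x, const block_q8_128 * restrict y) {
    const int64_t nb = n / QK2_0;
    float sum = 0.0f;
    for (int64_t i = 0; i < nb; ++i) {
        int32_t isum = 0;
        for (int j = 0; j < QK2_0; ++j) {
            const int q = ((x[i].qs[j / 4] >> (2 * (j % 4))) & 0x3) - 1;  // ternary weight in {-1, 0, +1}
            isum += q * y[i].qs[j];                                       // integer accumulate
        }
        sum += (float)isum * x[i].d * y[i].d;                             // scale once per block
    }
    return sum;
}
```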