Hi, thanks for the great work on Bonsai-8B and the Q1_0 quantization.
I'd like to ask about plans for running Q1_0 models in the browser via WebGPU.
Current state (as far as I can see):
- The Q1_0 custom kernels exist for CUDA (.cu), Metal (.metal), and Vulkan (GLSL/SPIR-V), plus a CPU implementation.
- Upstream ggml-org/llama.cpp has an active WebGPU backend in development (multiple PRs merged in recent weeks, including Intel subgroup matrix support).
- However, the upstream WebGPU backend only targets the standard quant types; there is no WGSL implementation of the Q1_0 kernel yet.
Questions:
- Is a WGSL port of the Q1_0 kernel on your roadmap, or under consideration?
- If not planned internally, would you be open to a community PR once the upstream WebGPU backend stabilizes?
- Any preference between waiting for upstream WebGPU to mature vs. an earlier experimental port?
Running Bonsai-8B purely in-browser (no server, no install) would be a great showcase of the 1-bit quantization story — curious where this sits in your plans.
Thanks!