TL;DR
Is loading FP16 LoRA adapters at runtime via --lora (and POST /lora-adapters on llama-server) tested or expected to work with prism's Q1_0_g128 / Q2_0_g128 quantization formats? Or is the LoRA path on this fork untested for the prism formats specifically?
Context
We're using Bonsai-1.7B as a specialist slot in a local-first application (D&D companion AI for a tactical game), running it via this fork to take advantage of prism quantization. We want to LoRA-fine-tune the model on a few narrow tasks (~2-3K examples each, ~1 hour training per task on a 7.6 GB GPU) by distilling from a stronger teacher, and then deploy the adapters on top of the prism-quantized base.
For the deployment shape, we're following the QVAC Fabric BitNet b1.58 + LoRA pattern from their March 2026 Hugging Face blog post:
Train LoRA at FP16 against prism-ml/Ternary-Bonsai-1.7B-unpacked
Keep the LoRA as an FP16 GGUF sidecar (via convert_lora_to_gguf.py)
Do not merge the adapter back into the base
Load it at inference via --lora adapter.gguf on top of an unmodified prism-quantized base
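Concretely, the shape we have in mind is the sketch below. The adapter and model paths/filenames are placeholders of ours, and we're assuming this fork keeps the upstream convert_lora_to_gguf.py arguments and llama-server flags unchanged:

```bash
# Convert the FP16 PEFT adapter directory to a GGUF sidecar
# (paths and filenames here are placeholders, not real artifacts).
python convert_lora_to_gguf.py ./bonsai-task-lora \
  --base ./Ternary-Bonsai-1.7B-unpacked \
  --outfile bonsai-task-lora-f16.gguf \
  --outtype f16

# Serve the unmodified prism-quantized base with the adapter applied at load time.
llama-server \
  -m ./Ternary-Bonsai-1.7B-Q2_0_g128.gguf \
  --lora ./bonsai-task-lora-f16.gguf \
  --port 8080
```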
Upstream llama.cpp has supported this pattern since PR ggml-org#8332 (mid-2024) for standard quantization formats (Q4_K_M, Q5_K_M, etc.), and QVAC's blog confirms it works for upstream BitNet TQ1_0/TQ2_0. The question is whether it works for prism's Q1_0_g128 / Q2_0_g128 specifically, since those formats are unique to this fork.
Specific questions
Has --lora been tested against Q1_0_g128 or Q2_0_g128 base models in this fork? If yes, are there caveats (precision loss, runtime overhead, format constraints)?
Does the POST /lora-adapters endpoint on llama-server work correctly with prism quantized bases at runtime?
Does set_lora_adapter_scale (per-adapter scale at runtime) compose correctly with the prism dequantization path?
Are there known issues with LoRA-modified weights when the base uses prism's group-128 format vs upstream's group-256?
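For the runtime-swap questions above, the request shape we'd be exercising is the standard upstream llama-server one, sketched below on the assumption that this fork exposes the same /lora-adapters endpoint and JSON format as upstream (port and adapter id are placeholders):

```bash
# List the adapters registered at startup, with their current scales.
curl http://localhost:8080/lora-adapters

# Set the per-adapter scale at runtime: 0.0 disables the adapter,
# 1.0 applies it at full strength on top of the quantized base.
curl -X POST http://localhost:8080/lora-adapters \
  -H "Content-Type: application/json" \
  -d '[{"id": 0, "scale": 1.0}]'
```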
What we'd test if no published answer exists
We're prepared to run a three-way comparison ourselves and report results (and PR documentation back to this fork if useful):
Train a trivial test LoRA (~200 examples teaching a distinctive style marker, ~30 min on a 7.6 GB GPU) on prism-ml/Ternary-Bonsai-1.7B-unpacked.
Convert the FP16 LoRA to GGUF via convert_lora_to_gguf.py.
Compare on a 100-prompt eval set:
Reference: merged-FP16 inference (ground truth — what the LoRA is supposed to do)
Route 2: prism-ml/Ternary-Bonsai-1.7B-gguf (Q2_0_g128) + --lora adapter.gguf
Measure: token-level KL between Route 2 and merged-FP16, IFEval delta across all three, output divergence on the test prompts.
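For the token-level KL, one practical way to collect per-token probabilities is the server's n_probs option on /completion, comparing the returned distributions offline. The sketch below assumes this fork keeps upstream's /completion request and response fields (completion_probabilities); the prompt and top-k value are placeholders:

```bash
# Greedy decode with top-20 per-token probabilities returned. Run the same
# request against the merged-FP16 reference server and the Q2_0_g128 + --lora
# server, then compute an approximate (top-k) KL offline from the
# completion_probabilities in each response.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "<one of the 100 eval prompts>",
        "n_predict": 128,
        "temperature": 0.0,
        "n_probs": 20
      }'
```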
If maintainers have already done this, we'd love to skip the work and get a pointer. If not, happy to run it and post results here.
Why we're asking before training
The whole training plan (~four task-specific LoRAs, ~9-14K total synthesized examples, ~1 hour each) is gated on this answer. If --lora doesn't work cleanly with the prism formats today, we'd rather know before generating training data than after. We've seen Discussion ggml-org#22019 about the prism Q2_0 group-64 transition; happy to test against either group size or against Q1_0_g128 if that's preferred.
Thanks for the work on this fork — the prism quantization recipe has been a great fit for our deployment shape.