TL;DR
Is loading FP16 LoRA adapters at runtime via --lora (and POST /lora-adapters on llama-server) tested or expected to work with prism's Q1_0_g128 / Q2_0_g128 quantization formats? Or is the LoRA path on this fork untested for the prism formats specifically?
Context
We're using Bonsai-1.7B as a specialist slot in a local-first application (D&D companion AI for a tactical game), running it via this fork to take advantage of prism quantization. We want to LoRA-fine-tune the model on a few narrow tasks (~2-3K examples each, ~1 hour training per task on a 7.6 GB GPU) by distilling from a stronger teacher, and then deploy the adapters on top of the prism-quantized base.
For the deployment shape, we're following the QVAC Fabric BitNet b1.58 + LoRA pattern from their March 2026 Hugging Face blog post:
Train LoRA at FP16 against prism-ml/Ternary-Bonsai-1.7B-unpacked
Keep the LoRA as an FP16 GGUF sidecar (via convert_lora_to_gguf.py)
Do not merge the adapter back into the base
Load it at inference via --lora adapter.gguf on top of an unmodified prism-quantized base
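Concretely, the shape we have in mind is the sketch below. The adapter and model paths/filenames are placeholders of ours, and we're assuming this fork keeps the upstream convert_lora_to_gguf.py arguments and llama-server flags unchanged:

```bash
# Convert the FP16 PEFT adapter directory to a GGUF sidecar
# (paths and filenames here are placeholders, not real artifacts).
python convert_lora_to_gguf.py ./bonsai-task-lora \
  --base ./Ternary-Bonsai-1.7B-unpacked \
  --outfile bonsai-task-lora-f16.gguf \
  --outtype f16

# Serve the unmodified prism-quantized base with the adapter applied at load time.
llama-server \
  -m ./Ternary-Bonsai-1.7B-Q2_0_g128.gguf \
  --lora ./bonsai-task-lora-f16.gguf \
  --port 8080
```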
Upstream llama.cpp has supported this pattern since PR ggml-org#8332 (mid-2024) for standard quantization formats (Q4_K_M, Q5_K_M, etc.), and QVAC's blog confirms it works for upstream BitNet TQ1_0/TQ2_0. The question is whether it works for prism's Q1_0_g128 / Q2_0_g128 specifically, since those formats are unique to this fork.
Specific questions
Has --lora been tested against Q1_0_g128 or Q2_0_g128 base models in this fork? If yes, are there caveats (precision loss, runtime overhead, format constraints)?
Does the POST /lora-adapters endpoint on llama-server work correctly with prism quantized bases at runtime?
Does set_lora_adapter_scale (per-adapter scale at runtime) compose correctly with the prism dequantization path?
Are there known issues with LoRA-modified weights when the base uses prism's group-128 format vs upstream's group-256?
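For the runtime-swap questions above, the request shape we'd be exercising is the standard upstream llama-server one, sketched below on the assumption that this fork exposes the same /lora-adapters endpoint and JSON format as upstream (port and adapter id are placeholders):

```bash
# List the adapters registered at startup, with their current scales.
curl http://localhost:8080/lora-adapters

# Set the per-adapter scale at runtime: 0.0 disables the adapter,
# 1.0 applies it at full strength on top of the quantized base.
curl -X POST http://localhost:8080/lora-adapters \
  -H "Content-Type: application/json" \
  -d '[{"id": 0, "scale": 1.0}]'
```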
What we'd test if no published answer exists
We're prepared to run a three-way comparison ourselves and report results (and PR documentation back to this fork if useful):
Train a trivial test LoRA (~200 examples teaching a distinctive style marker, ~30 min on a 7.6 GB GPU) on prism-ml/Ternary-Bonsai-1.7B-unpacked.
Convert the FP16 LoRA to GGUF via convert_lora_to_gguf.py.
Compare on a 100-prompt eval set:
Reference: merged-FP16 inference (ground truth — what the LoRA is supposed to do)
Route 2: prism-ml/Ternary-Bonsai-1.7B-gguf (Q2_0_g128) + --lora adapter.gguf
Measure: token-level KL between Route 2 and merged-FP16, IFEval delta across all three, output divergence on the test prompts.
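For the token-level KL, one practical way to collect per-token probabilities is the server's n_probs option on /completion, comparing the returned distributions offline. The sketch below assumes this fork keeps upstream's /completion request and response fields (completion_probabilities); the prompt and top-k value are placeholders:

```bash
# Greedy decode with top-20 per-token probabilities returned. Run the same
# request against the merged-FP16 reference server and the Q2_0_g128 + --lora
# server, then compute an approximate (top-k) KL offline from the
# completion_probabilities in each response.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "<one of the 100 eval prompts>",
        "n_predict": 128,
        "temperature": 0.0,
        "n_probs": 20
      }'
```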
If maintainers have already done this, we'd love to skip the work and get a pointer. If not, happy to run it and post results here.
Why we're asking before training
The whole training plan (~four task-specific LoRAs, ~9-14K total synthesized examples, ~1 hour each) is gated on this answer. If --lora doesn't work cleanly with the prism formats today, we'd rather know before generating training data than after. We've seen Discussion ggml-org#22019 about the prism Q2_0 group-64 transition; happy to test against either group size or against Q1_0_g128 if that's preferred.
Thanks for the work on this fork — the prism quantization recipe has been a great fit for our deployment shape.