Scratch/try laguna merge by Vect0rM · Pull Request #30 · AtomicBot-ai/atomic-llama-cpp-turboquant

Vect0rM · 2026-07-03T08:32:22Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

HF->GGUF conversion for poolside's Laguna (XS.2 and M.1): per-expert MoE tensor stacking, sigmoid routing with a score-correction bias and a shared expert, the attention output gate (per-head on XS.2, per-element on M.1), QK-norm, and per-layer-type RoPE (YaRN on full-attention layers, plain RoPE on sliding-window layers; M.1 is full-attention only).

- conversion/laguna.py: LagunaModel converter. eos_token_id is a list [2, 24]; token 2 (which also serves as BOS) is kept as eos and token 24 (</assistant>, the chat turn-end) is registered as eot so generation halts natively.
- gguf-py: MODEL_ARCH.LAGUNA plus its tensor list and the attention-gate / expert-score-correction tensor mappings.
- convert_hf_to_gguf_update.py: register the laguna BPE pre-tokenizer.

The gate type (per-head vs per-element) is resolved from the config and cross-checked against the g_proj weight width; a contradiction fails the conversion rather than silently mis-encoding the gate. (cherry picked from commit e457ddf)
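The gate-type cross-check described above could look roughly like the following. This is a minimal sketch, not the code in conversion/laguna.py: the function name, the `"per_head"` / `"per_element"` labels, and the assumption that the two valid g_proj widths are `n_head` and `n_head * head_dim` are all illustrative.

```python
def resolve_gate_type(config_gate: str, g_proj_width: int,
                      n_head: int, head_dim: int) -> str:
    """Return the gate type, failing loudly on any contradiction.

    Hypothetical helper: names and width conventions are assumptions.
    """
    # Assumed mapping from g_proj output width to gate type.
    valid_widths = {
        n_head: "per_head",                # one gate value per attention head
        n_head * head_dim: "per_element",  # one gate value per output element
    }

    from_weights = valid_widths.get(g_proj_width)
    if from_weights is None:
        raise ValueError(
            f"g_proj width {g_proj_width} matches neither valid width "
            f"({n_head} or {n_head * head_dim})"
        )
    if from_weights != config_gate:
        # Abort instead of silently encoding the wrong gate layout.
        raise ValueError(
            f"config says {config_gate!r} but g_proj width implies {from_weights!r}"
        )
    return from_weights
```

Raising on a mismatch follows the stated design choice: a failed conversion is easier to diagnose than a GGUF file with a mis-encoded gate.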

PEG/autoparser chat handler for Laguna (XS.2 and M.1 share the format). GLM-4-MoE-style tool calls: <tool_call>{name} <arg_key>{k}</arg_key> <arg_value>{v}</arg_value> ... </tool_call> with string args emitted raw and all other args as JSON, plus <think>...</think> reasoning; the turn ends with </assistant>. An optional </assistant> exit marker in the trigger rule lets the lazy grammar complete after the tool call(s), so the model terminates cleanly instead of looping additional calls under parallel_tool_calls. Detection keys on <arg_key>/<arg_value> together with the <assistant> role tags. Adds a test-chat case (models/templates/laguna.jinja). (cherry picked from commit d211291)
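To make the tool-call layout concrete, here is a small sketch that renders one call in the format quoted above. It is an illustration only: `render_tool_call` is a hypothetical helper, not part of the chat handler, and the single-space separators simply mirror the format string as shown in this description (the real template may use newlines).

```python
import json


def render_tool_call(name: str, args: dict) -> str:
    """Render one tool call: string args raw, all other args as JSON."""
    parts = [f"<tool_call>{name}"]
    for key, value in args.items():
        # Strings are emitted verbatim; everything else is JSON-encoded.
        text = value if isinstance(value, str) else json.dumps(value)
        parts.append(f"<arg_key>{key}</arg_key>")
        parts.append(f"<arg_value>{text}</arg_value>")
    parts.append("</tool_call>")
    return " ".join(parts)


print(render_tool_call("get_weather", {"city": "Paris", "days": 3}))
```

Because string arguments are not quoted, a parser cannot tell the string `3` from the number 3 by inspecting the value alone; it presumably needs the tool's parameter schema to decide.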

@CISC

- Use the shared logger from .base instead of a local logging.getLogger.
- Call super().set_gguf_parameters() and override only head_count with the per-layer array, instead of re-emitting the base keys by hand (overwriting an already-set value is fine). Keep vocab_size, which the base does not emit for the gpt2 vocab path.

Addresses @CISC review comments on conversion/laguna.py. (cherry picked from commit ea288fa)


Graph builder for poolside's Laguna (XS.2 and M.1). Softplus attention output gate (per-head or per-element, taken from the g_proj tensor width and required to be exactly one of the two valid widths), QK-RMSNorm, optional hybrid full / sliding-window attention with per-layer head counts, partial-rotary YaRN on full-attention layers and plain RoPE on sliding-window layers, and sigmoid-routed MoE with a score-correction bias, sum-normalization, a routed scaling factor and an always-on shared expert. Validated on Laguna-XS.2 (4-shot GSM8K in line with the reference implementation) and Laguna-M.1. (cherry picked from commit b0c66b6)
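The routing steps named above (sigmoid scores, score-correction bias, sum-normalization, routed scaling factor) can be sketched in plain Python. This is a sketch under assumptions, not the graph builder's logic: it assumes the bias affects only which experts are selected while the mixing weights come from the unbiased sigmoid scores, a convention used by some other sigmoid-routed MoE models but not confirmed here for Laguna. All names are hypothetical, and the always-on shared expert is left out because its output is added independently of routing.

```python
import math


def route_token(logits, score_bias, top_k, routed_scale):
    """Return {expert_index: weight} for a single token.

    Assumption: the bias influences selection only, not the final weights.
    """
    # Sigmoid routing: each expert is scored independently.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Score-correction bias shifts the selection ranking.
    biased = [s + b for s, b in zip(scores, score_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]

    # Sum-normalize the unbiased scores of the chosen experts, then scale.
    total = sum(scores[i] for i in chosen)
    return {i: routed_scale * scores[i] / total for i in chosen}


weights = route_token(
    logits=[0.0, 0.0, 0.0, 0.0],
    score_bias=[0.0, 0.5, 0.0, 0.2],
    top_k=2,
    routed_scale=2.0,
)
```

With equal logits, the bias alone picks experts 1 and 3, and the normalized weights sum to `routed_scale`.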


… ratios The non-Volta MMA path in ggml_cuda_flash_attn_ext_mma_f16_switch_ncols2 chose the GQA head-packing factor (ncols2) by threshold (gqa_ratio > 4 -> 8, > 2 -> 4, > 1 -> 2) rather than by divisibility. When gqa_ratio is not a multiple of the chosen factor, the kernel packs more query heads per KV head than actually share one, giving wrong KV indexing and corrupted output -- but only when the GQA optimization is active (padded K/V during decode), which is why prefill looked correct. E.g. 48 query / 8 KV heads (gqa_ratio = 6) hits 6 > 4 -> ncols2 = 8, packing 8 where only 6 share. Switch the non-Volta branch to divisibility (% 8 == 0 -> 8, % 4 == 0 -> 4, % 2 == 0 -> 2), matching the Volta branch above; gqa_ratio = 6 now correctly selects ncols2 = 2. Power-of-two ratios are unchanged. (cherry picked from commit c4aa4af)
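The difference between the two selection rules is easy to model outside CUDA. The sketch below is a Python model of the logic described in this commit message, not the kernel source; the function names are hypothetical and the fallback of 1 when no factor applies is an assumption.

```python
def ncols2_by_threshold(gqa_ratio: int) -> int:
    """Old non-Volta rule: pick the packing factor by magnitude."""
    if gqa_ratio > 4:
        return 8
    if gqa_ratio > 2:
        return 4
    if gqa_ratio > 1:
        return 2
    return 1  # assumed fallback


def ncols2_by_divisibility(gqa_ratio: int) -> int:
    """New rule: pick the largest factor that divides gqa_ratio."""
    for factor in (8, 4, 2):
        if gqa_ratio % factor == 0:
            return factor
    return 1  # assumed fallback


for ratio in (2, 4, 6, 8):
    print(ratio, ncols2_by_threshold(ratio), ncols2_by_divisibility(ratio))
```

For 48 query heads over 8 KV heads (gqa_ratio = 6) the threshold rule returns 8 and the divisibility rule returns 2, while the power-of-two ratios 2, 4 and 8 give the same answer under both rules.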

joerowell added 7 commits July 2, 2026 13:30

conversion : address review feedback (chat_template.jinja + dedupe se…

821e3f8

…t_gguf_parameters) Signed-off-by: Joe Rowell <joerowell4@gmail.com> (cherry picked from commit 588d27a)

laguna : address review (dedicated vocab pre-type, converter registra…

027c4b1

…tion, optional expert_shared_count) (cherry picked from commit 348fd3e)

Vect0rM merged commit 519f0c5 into feature/turboquant-kv-cache Jul 3, 2026
14 of 40 checks passed

github-actions Bot added testing ggml model CUDA conversion labels Jul 3, 2026

Vect0rM merged 7 commits into feature/turboquant-kv-cache from scratch/try-laguna-merge
