Scratch/try laguna merge #30

Merged
Vect0rM merged 7 commits into
feature/turboquant-kv-cache from
scratch/try-laguna-merge
Jul 3, 2026
@Vect0rM Vect0rM commented Jul 3, 2026

Collaborator

Overview

Additional information

Requirements

joerowell added 7 commits July 2, 2026 13:30
HF->GGUF conversion for poolside's Laguna (XS.2 and M.1): per-expert MoE tensor
stacking, sigmoid routing with a score-correction bias and a shared expert, the
attention output gate (per-head on XS.2, per-element on M.1), QK-norm, and
per-layer-type RoPE (YaRN on full-attention layers, plain RoPE on sliding-window
layers; M.1 is full-attention only).

- conversion/laguna.py: LagunaModel converter. eos_token_id is a list [2, 24];
  token 2 (which also serves as BOS) is kept as eos and token 24 (</assistant>,
  the chat turn-end) is registered as eot so generation halts natively.
- gguf-py: MODEL_ARCH.LAGUNA plus its tensor list and the attention-gate /
  expert-score-correction tensor mappings.
- convert_hf_to_gguf_update.py: register the laguna BPE pre-tokenizer.
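The eos/eot split described in the first bullet can be sketched roughly as follows. This is an illustration only, not the code in conversion/laguna.py: the helper name `split_stop_tokens` is hypothetical, and it assumes the first entry of `eos_token_id` is the true EOS and the second is the chat turn-end token.

```python
def split_stop_tokens(eos_token_id):
    """Split an HF-style eos_token_id (int or list) into (eos, eot).

    Hypothetical helper: assumes the first id is EOS and an optional
    second id is the end-of-turn token, as with Laguna's [2, 24].
    """
    ids = [eos_token_id] if isinstance(eos_token_id, int) else list(eos_token_id)
    if not ids:
        raise ValueError("eos_token_id is empty")
    if len(ids) > 2:
        # More than two stop tokens is outside what this sketch handles.
        raise ValueError(f"expected at most two stop tokens, got {ids}")
    eos = ids[0]
    eot = ids[1] if len(ids) == 2 else None
    return eos, eot


eos, eot = split_stop_tokens([2, 24])
print(eos, eot)
```

For Laguna's `[2, 24]` this yields eos = 2 and eot = 24, which the converter would then write as two separate tokenizer metadata entries.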

The gate type (per-head vs per-element) is resolved from the config and
cross-checked against the g_proj weight width; a contradiction fails the
conversion rather than silently mis-encoding the gate.

(cherry picked from commit e457ddf)
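A minimal sketch of the gate-type cross-check described in this commit. The function name, the `"per_head"` / `"per_element"` labels, and the expected widths (one gate value per head versus one per head element, i.e. `n_heads` versus `n_heads * head_dim`) are assumptions made for illustration; the converter's actual config keys and shapes may differ.

```python
def resolve_gate_type(config_gate_type, g_proj_width, n_heads, head_dim):
    """Return the gate type, failing if config and weight width disagree.

    Assumed widths (not confirmed by the source):
      per_head    -> n_heads
      per_element -> n_heads * head_dim
    """
    expected = {"per_head": n_heads, "per_element": n_heads * head_dim}
    if config_gate_type not in expected:
        raise ValueError(f"unknown gate type {config_gate_type!r}")
    # Gate types whose expected width matches the actual g_proj weight.
    matching = [name for name, width in expected.items() if width == g_proj_width]
    if config_gate_type not in matching:
        raise ValueError(
            f"config says {config_gate_type!r} (width {expected[config_gate_type]}) "
            f"but g_proj has width {g_proj_width}"
        )
    return config_gate_type


print(resolve_gate_type("per_head", 48, n_heads=48, head_dim=128))
```

Raising on a mismatch is the point of the check: a wrong gate type would still produce a loadable file, just one that encodes the gate incorrectly.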
PEG/autoparser chat handler for Laguna (XS.2 and M.1 share the format).
GLM-4-MoE-style tool calls:

  <tool_call>{name}
  <arg_key>{k}</arg_key>
  <arg_value>{v}</arg_value>
  ...
  </tool_call>

with string args emitted raw and all other args as JSON, plus <think>...</think>
reasoning; the turn ends with </assistant>. An optional </assistant> exit marker
in the trigger rule lets the lazy grammar complete after the tool call(s), so the
model terminates cleanly instead of looping additional calls under
parallel_tool_calls. Detection keys on <arg_key>/<arg_value> together with the
<assistant> role tags. Adds a test-chat case (models/templates/laguna.jinja).

(cherry picked from commit d211291)
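To make the wire format concrete, here is a small regex-based Python sketch that extracts calls from a finished completion. It is not the PEG/autoparser handler added by this commit. The `string_args` mapping (tool name to the set of string-typed argument names) is a hypothetical stand-in for the tool schema, which is what decides whether a value is taken raw or decoded as JSON.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ARG_RE = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)


def parse_tool_calls(text, string_args):
    """Return a list of {"name", "arguments"} dicts found in text."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        # The tool name is everything before the first <arg_key>.
        name = body.split("<arg_key>", 1)[0].strip()
        arguments = {}
        for key, value in ARG_RE.findall(body):
            key = key.strip()
            if key in string_args.get(name, ()):
                arguments[key] = value  # string args are emitted raw
            else:
                arguments[key] = json.loads(value)  # everything else is JSON
        calls.append({"name": name, "arguments": arguments})
    return calls


completion = (
    "<think>need the forecast</think>"
    "<tool_call>get_weather\n"
    "<arg_key>city</arg_key>\n<arg_value>Paris</arg_value>\n"
    "<arg_key>days</arg_key>\n<arg_value>3</arg_value>\n"
    "</tool_call></assistant>"
)
print(parse_tool_calls(completion, {"get_weather": {"city"}}))
```

The schema lookup matters because a raw string such as `3` is indistinguishable from the JSON number `3` without knowing the declared argument type.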
- Use the shared logger from .base instead of a local logging.getLogger.
- Call super().set_gguf_parameters() and override only head_count with the
  per-layer array, instead of re-emitting the base keys by hand (overwriting
  an already-set value is fine). Keep vocab_size, which the base does not emit
  for the gpt2 vocab path.

Addresses @CISC review comments on conversion/laguna.py.

(cherry picked from commit ea288fa)
…t_gguf_parameters)

Signed-off-by: Joe Rowell <joerowell4@gmail.com>
(cherry picked from commit 588d27a)
Graph builder for poolside's Laguna (XS.2 and M.1). Softplus attention output
gate (per-head or per-element, taken from the g_proj tensor width and required
to be exactly one of the two valid widths), QK-RMSNorm, optional hybrid full /
sliding-window attention with per-layer head counts, partial-rotary YaRN on
full-attention layers and plain RoPE on sliding-window layers, and sigmoid-routed
MoE with a score-correction bias, sum-normalization, a routed scaling factor and
an always-on shared expert.

Validated on Laguna-XS.2 (4-shot GSM8K in line with the reference implementation)
and Laguna-M.1.

(cherry picked from commit b0c66b6)
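The MoE routing named above can be sketched in plain Python for a single token. This follows the DeepSeek-V3-style convention in which the score-correction bias affects only which experts are selected, while the mixing weights come from the unbiased sigmoid scores; whether Laguna's graph matches this in every detail is an assumption here. All function and parameter names are illustrative, and experts are modelled as scalar callables.

```python
import math


def route(logits, score_bias, top_k, routed_scale):
    """Pick top_k experts and their mixing weights for one token."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid routing
    # Assumption: the bias is used for selection only, not for the weights.
    biased = [s + b for s, b in zip(scores, score_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)  # sum-normalization
    weights = [routed_scale * scores[i] / total for i in chosen]
    return chosen, weights


def moe_forward(x, logits, score_bias, experts, shared_expert, top_k, routed_scale):
    """Weighted sum of the routed experts plus the always-on shared expert."""
    chosen, weights = route(logits, score_bias, top_k, routed_scale)
    routed = sum(w * experts[i](x) for i, w in zip(chosen, weights))
    return routed + shared_expert(x)


chosen, weights = route([0.0, 2.0, -1.0, 1.0], [0.0, 0.0, 5.0, 0.0], top_k=2, routed_scale=2.5)
print(chosen, weights)
```

In the example, expert 2 has the lowest raw score but is still selected because of its bias, and the two weights sum to the routed scaling factor.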
…tion, optional expert_shared_count)

(cherry picked from commit 348fd3e)
… ratios

The non-Volta MMA path in ggml_cuda_flash_attn_ext_mma_f16_switch_ncols2 chose
the GQA head-packing factor (ncols2) by threshold (gqa_ratio > 4 -> 8, > 2 -> 4,
> 1 -> 2) rather than by divisibility. When gqa_ratio is not a multiple of the
chosen factor, the kernel packs more query heads per KV head than actually share
one, giving wrong KV indexing and corrupted output -- but only when the GQA
optimization is active (padded K/V during decode), which is why prefill looked
correct. E.g. 48 query / 8 KV heads (gqa_ratio = 6) hits 6 > 4 -> ncols2 = 8,
packing 8 where only 6 share.

Switch the non-Volta branch to divisibility (% 8 == 0 -> 8, % 4 == 0 -> 4,
% 2 == 0 -> 2), matching the Volta branch above; gqa_ratio = 6 now correctly
selects ncols2 = 2. Power-of-two ratios are unchanged.

(cherry picked from commit c4aa4af)
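The two selection rules can be compared side by side in a few lines of Python. This restates only the logic from the commit message, not the CUDA source; the function names are hypothetical, and the fallback of 1 when no factor applies is an assumption.

```python
def ncols2_by_threshold(gqa_ratio):
    """Old non-Volta rule: pick the packing factor by magnitude."""
    if gqa_ratio > 4:
        return 8
    if gqa_ratio > 2:
        return 4
    if gqa_ratio > 1:
        return 2
    return 1  # assumed fallback: no head packing


def ncols2_by_divisibility(gqa_ratio):
    """New rule: pick the largest factor that divides gqa_ratio evenly."""
    for factor in (8, 4, 2):
        if gqa_ratio % factor == 0:
            return factor
    return 1  # assumed fallback: no head packing


for ratio in (1, 2, 3, 4, 6, 8, 16):
    print(ratio, ncols2_by_threshold(ratio), ncols2_by_divisibility(ratio))
```

For gqa_ratio = 6 (48 query / 8 KV heads) the old rule returns 8 and the new one returns 2, while every power-of-two ratio gives the same answer under both rules.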
@Vect0rM Vect0rM merged commit 519f0c5 into feature/turboquant-kv-cache Jul 3, 2026
14 of 40 checks passed
2 participants