Scratch/try laguna merge #30
Merged
HF->GGUF conversion for poolside's Laguna (XS.2 and M.1): per-expert MoE tensor stacking, sigmoid routing with a score-correction bias and a shared expert, the attention output gate (per-head on XS.2, per-element on M.1), QK-norm, and per-layer-type RoPE (YaRN on full-attention layers, plain RoPE on sliding-window layers; M.1 is full-attention only).

- conversion/laguna.py: LagunaModel converter. eos_token_id is a list [2, 24]; token 2 (which also serves as BOS) is kept as eos and token 24 (</assistant>, the chat turn-end) is registered as eot so generation halts natively.
- gguf-py: MODEL_ARCH.LAGUNA plus its tensor list and the attention-gate / expert-score-correction tensor mappings.
- convert_hf_to_gguf_update.py: register the laguna BPE pre-tokenizer.

The gate type (per-head vs per-element) is resolved from the config and cross-checked against the g_proj weight width; a contradiction fails the conversion rather than silently mis-encoding the gate.

(cherry picked from commit e457ddf)
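The gate-type cross-check described in the conversion commit can be sketched roughly as follows. This is an illustration, not the code in conversion/laguna.py: the function name, the `"per_head"` / `"per_element"` labels, and the assumption that a per-head gate has width `n_head` while a per-element gate has width `n_head * head_dim` are all hypothetical.

```python
def resolve_gate_type(config_gate: str, g_proj_width: int,
                      n_head: int, head_dim: int) -> str:
    """Return the gate type, or raise if config and weights disagree.

    Assumed widths (not confirmed by the PR text):
      per_head    -> one gate value per attention head
      per_element -> one gate value per head per channel
    """
    expected = {
        "per_head": n_head,
        "per_element": n_head * head_dim,
    }
    if config_gate not in expected:
        raise ValueError(f"unknown gate type in config: {config_gate!r}")
    if g_proj_width != expected[config_gate]:
        # Fail loudly instead of writing a mis-encoded gate into the GGUF.
        raise ValueError(
            f"config says {config_gate} (width {expected[config_gate]}), "
            f"but g_proj has width {g_proj_width}"
        )
    return config_gate
```

Raising on a mismatch mirrors the stated design choice: a contradiction between config and weights aborts the conversion rather than producing a file that loads but computes the wrong gate.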
PEG/autoparser chat handler for Laguna (XS.2 and M.1 share the format).
GLM-4-MoE-style tool calls:
<tool_call>{name}
<arg_key>{k}</arg_key>
<arg_value>{v}</arg_value>
...
</tool_call>
with string args emitted raw and all other args as JSON, plus <think>...</think>
reasoning; the turn ends with </assistant>. An optional </assistant> exit marker
in the trigger rule lets the lazy grammar complete after the tool call(s), so the
model terminates cleanly instead of looping additional calls under
parallel_tool_calls. Detection keys on <arg_key>/<arg_value> together with the
<assistant> role tags. Adds a test-chat case (models/templates/laguna.jinja).
(cherry picked from commit d211291)
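For readers unfamiliar with the format, the tool-call layout above can be parsed with a few lines of Python. This is only a sketch of the wire format, not the PEG/autoparser handler added by the commit; `parse_tool_calls` and the `string_keys` parameter are hypothetical, and the sketch assumes the caller already knows which arguments are string-typed (in practice this would come from the tool schema).

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ARG_RE = re.compile(
    r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL
)

def parse_tool_calls(text, string_keys=frozenset()):
    """Extract tool calls from model output.

    Keys listed in string_keys are kept as raw strings; every other
    value is decoded as JSON, matching the "string args raw, all other
    args as JSON" rule.
    """
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        body = match.group(1)
        # The function name is everything before the first <arg_key>.
        name = body.split("<arg_key>", 1)[0].strip()
        arguments = {}
        for key, value in ARG_RE.findall(body):
            arguments[key] = value if key in string_keys else json.loads(value)
        calls.append({"name": name, "arguments": arguments})
    return calls
```

For example, a turn containing `<tool_call>get_weather` with a raw `city` argument and a JSON `days` argument yields one call whose `days` value is an integer and whose `city` value is left untouched.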
- Use the shared logger from .base instead of a local logging.getLogger.
- Call super().set_gguf_parameters() and override only head_count with the per-layer array, instead of re-emitting the base keys by hand (overwriting an already-set value is fine). Keep vocab_size, which the base does not emit for the gpt2 vocab path.

Addresses @CISC review comments on conversion/laguna.py.

(cherry picked from commit ea288fa)
…t_gguf_parameters) Signed-off-by: Joe Rowell <joerowell4@gmail.com> (cherry picked from commit 588d27a)
Graph builder for poolside's Laguna (XS.2 and M.1). Softplus attention output gate (per-head or per-element, taken from the g_proj tensor width and required to be exactly one of the two valid widths), QK-RMSNorm, optional hybrid full / sliding-window attention with per-layer head counts, partial-rotary YaRN on full-attention layers and plain RoPE on sliding-window layers, and sigmoid-routed MoE with a score-correction bias, sum-normalization, a routed scaling factor and an always-on shared expert.

Validated on Laguna-XS.2 (4-shot GSM8K in line with the reference implementation) and Laguna-M.1.

(cherry picked from commit b0c66b6)
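The routing step named in the graph-builder commit (sigmoid scores, score-correction bias, sum-normalization, routed scaling factor) can be illustrated in plain Python. This sketch assumes the common convention in which the bias influences only which experts are selected, while the mixing weights come from the unbiased sigmoid scores; the PR text does not confirm that detail, and `route_tokens` and its parameters are hypothetical names.

```python
import math

def route_tokens(logits, bias, top_k, scale):
    """Pick top_k experts for one token and return {expert_index: weight}.

    Assumption: the correction bias is added for selection only, and
    the weights are the unbiased sigmoid scores, normalized to sum to 1
    and then multiplied by the routed scaling factor.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    # Select the top_k experts by biased score (ties keep index order).
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scale * scores[i] / total for i in chosen}
```

The always-on shared expert is not part of this selection: its output would simply be added to the weighted sum of the routed experts.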
…tion, optional expert_shared_count) (cherry picked from commit 348fd3e)
… ratios

The non-Volta MMA path in ggml_cuda_flash_attn_ext_mma_f16_switch_ncols2 chose the GQA head-packing factor (ncols2) by threshold (gqa_ratio > 4 -> 8, > 2 -> 4, > 1 -> 2) rather than by divisibility. When gqa_ratio is not a multiple of the chosen factor, the kernel packs more query heads per KV head than actually share one, giving wrong KV indexing and corrupted output -- but only when the GQA optimization is active (padded K/V during decode), which is why prefill looked correct. E.g. 48 query / 8 KV heads (gqa_ratio = 6) hits 6 > 4 -> ncols2 = 8, packing 8 where only 6 share.

Switch the non-Volta branch to divisibility (% 8 == 0 -> 8, % 4 == 0 -> 4, % 2 == 0 -> 2), matching the Volta branch above; gqa_ratio = 6 now correctly selects ncols2 = 2. Power-of-two ratios are unchanged.

(cherry picked from commit c4aa4af)
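The difference between the two selection rules is easy to see in a small Python model. This is a simplified illustration of the logic described in the commit message, not the CUDA code itself; both function names are hypothetical, and any other conditions the real dispatcher checks are omitted.

```python
def ncols2_by_threshold(gqa_ratio: int) -> int:
    """Old non-Volta rule: pick the packing factor by magnitude."""
    if gqa_ratio > 4:
        return 8
    if gqa_ratio > 2:
        return 4
    if gqa_ratio > 1:
        return 2
    return 1

def ncols2_by_divisibility(gqa_ratio: int) -> int:
    """New rule: pick the largest factor that evenly divides the ratio."""
    for ncols2 in (8, 4, 2):
        if gqa_ratio % ncols2 == 0:
            return ncols2
    return 1

# 48 query heads over 8 KV heads: the threshold rule over-packs.
gqa_ratio = 48 // 8
print(ncols2_by_threshold(gqa_ratio), ncols2_by_divisibility(gqa_ratio))
```

For power-of-two ratios such as 2, 4, 8 and 16 the two functions agree, which matches the commit's note that those cases are unchanged.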