* perf(laguna): verify-path optimizations — bonus fold, fused domino, draft padding, AUTO width
- chain greedy: fold the bonus token into the next verify batch as the
  seed (DDTree next_token contract); one target forward per step instead of two
- fused domino head: one GPU graph (lm_head proj + unrolled GRU +
  in-graph argmax->get_rows), one-time f16 embedding table (392 MiB),
  runs on the draft backend stream, async token readbacks
- draft graph: pad ctx to 64-aligned topology + persistent metadata
  arena so the ggml-cuda graph cache can replay (the cache keys on tensor
  addresses); full-layer pad mask; positions plumbed for padding
- AUTO verify width: round(EWMA)+1 capped at 3 (DFLASH_LAGUNA_VERIFY_WIDTH_MAX);
  the old formula drifted to widths 4-5 on high-AL workloads (HumanEval 172 -> 188 tok/s)
- all changes env-gated; outputs byte-identical vs base (19-output harness)
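
A minimal sketch of the two pieces of arithmetic named above: rounding the draft context up to a 64-aligned bucket, and deriving the verify width as round(EWMA)+1 with a cap. The names `pad_to_multiple` and `VerifyWidthTracker`, the smoothing factor, the initial estimate, and the choice of "accepted tokens per step" as the averaged quantity are all assumptions for illustration; the commit only states the formula and the cap of 3.

```cpp
#include <algorithm>
#include <cmath>

// Round n up to the next multiple of `align` (align > 0).
// With align = 64, every context length in 1..64 maps to 64, so tensor
// shapes stay constant while the context grows inside one bucket.
inline int pad_to_multiple(int n, int align = 64) {
    return ((n + align - 1) / align) * align;
}

// Hypothetical tracker (not the project's actual type): keeps an
// exponentially weighted moving average and derives a verify width from it.
struct VerifyWidthTracker {
    double ewma      = 1.0;  // assumed initial estimate
    double alpha     = 0.2;  // assumed smoothing factor, not from the commit
    int    max_width = 3;    // cap stated in the commit message

    // Fold one observation (assumed: tokens accepted in the last step).
    void observe(int accepted) {
        ewma = alpha * accepted + (1.0 - alpha) * ewma;
    }

    // round(EWMA) + 1, clamped to [1, max_width].
    int width() const {
        const int w = static_cast<int>(std::lround(ewma)) + 1;
        return std::max(1, std::min(w, max_width));
    }
};
```

The cap is what keeps the width from following the average upward on workloads where many draft tokens are accepted per step.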
* perf(laguna): fused adjacent QK and shexp gate/up weights for decode-width matmuls
The loader lays attn_q|attn_k and ffn_gate_shexp|ffn_up_shexp adjacently in
the weight buffer and binds a fused tensor over each pair (zero extra VRAM;
the split tensors remain valid views for all other paths). At decode widths
(n_tokens <= 8, MMVQ) the attention builder then runs:
- ONE matmul for Q|K, ONE rms_norm+mul with a fused per-head norm weight,
  and ONE rope over all heads (Q and K use identical rope params), splitting
  with views afterwards
- ONE matmul + ggml_swiglu for the shared expert instead of gate+up+glu
Bit-identical by construction at MMVQ widths: matmul rows, rms_norm rows,
the norm mul and rope are all per-row/per-head independent (verified with a
standalone concat-vs-split bitwise test, /tmp/test_concat_mm.cpp, and the
19-output e2e hash harness). MMQ (prefill, n_tokens > 8) partitions work by
total row count and is NOT bit-stable under row-concat, so prefill keeps the
split weights - that path is unchanged and byte-identical too.
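
The row-independence argument can be illustrated with a plain CPU matrix-vector product. This is only a sketch of the property, not the MMVQ kernel and not the standalone test mentioned above; `matvec` and `concat_matches_split` are hypothetical helpers. Each output row is computed from its own weight row with the same instruction sequence, so stacking Wq on top of Wk cannot change any individual row.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// y[r] = dot(W[r, :], x) for a row-major matrix with `cols` columns.
// Every output row depends only on its own weight row and on x.
inline std::vector<float> matvec(const std::vector<float>& W, std::size_t cols,
                                 const std::vector<float>& x) {
    const std::size_t rows = W.size() / cols;
    std::vector<float> y(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            acc += W[r * cols + c] * x[c];
        }
        y[r] = acc;
    }
    return y;
}

// Compare one product over the row-concatenation [Wq; Wk] against two
// separate products, bit for bit.
inline bool concat_matches_split(const std::vector<float>& Wq,
                                 const std::vector<float>& Wk,
                                 std::size_t cols,
                                 const std::vector<float>& x) {
    std::vector<float> fused_w(Wq);
    fused_w.insert(fused_w.end(), Wk.begin(), Wk.end());
    const std::vector<float> fused = matvec(fused_w, cols, x);

    std::vector<float> split = matvec(Wq, cols, x);
    const std::vector<float> k = matvec(Wk, cols, x);
    split.insert(split.end(), k.begin(), k.end());

    if (fused.size() != split.size()) return false;
    if (fused.empty()) return true;
    return std::memcmp(fused.data(), split.data(),
                       fused.size() * sizeof(float)) == 0;
}
```

A kernel that partitions work by total row count, as described for MMQ, would break this guarantee because the reduction order inside a row could then depend on how many rows are present.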
Kill switch: DFLASH_LAGUNA_FUSED_QK=0 (loader). Debug: LUCE_QK_FUSE_MODE,
LUCE_QK_FUSE_LAYERS.
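
One way such a kill switch could be read, as a sketch only: the commit names the variable DFLASH_LAGUNA_FUSED_QK and the value 0, but not the parsing rules, so treating unset or empty as "use the default" and any value other than "0" as enabled is an assumption. `flag_enabled` and `fused_qk_enabled` are hypothetical helper names.

```cpp
#include <cstdlib>
#include <cstring>

// Interpret a raw environment value:
//   unset or empty -> default_on
//   "0"            -> disabled
//   anything else  -> enabled (assumed; the real loader may be stricter)
inline bool flag_enabled(const char* value, bool default_on = true) {
    if (value == nullptr || value[0] == '\0') return default_on;
    return std::strcmp(value, "0") != 0;
}

// Reads the kill switch named in the commit message from the environment.
inline bool fused_qk_enabled() {
    return flag_enabled(std::getenv("DFLASH_LAGUNA_FUSED_QK"));
}
```

Separating the string interpretation from the `std::getenv` call keeps the policy testable without mutating the process environment.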
Measured (laguna-xs2 Q4_K_M + v23 drafter, RTX 3090, greedy):
HumanEval 187.5 -> 192.6 tok/s, GSM8K ~177 -> 181.6, all 19 harness
outputs byte-identical to the pre-change reference.
* deps: bump llama.cpp to perf/luce-verify-kernels (graph stats, q8 memo, diagnostics)
* fix(laguna): address review findings on the verify-path PR
- guard the full-layer pad-mask write (null on all-SWA drafts; the
  causal SWA mask covers those layers)
- release fused-domino resources (f16 embedding table buffer/ctx,
  dedicated backend instance) in ~LagunaDFlashTarget
- release the persistent draft metadata arena and reserve state in
  step_graph_destroy (park/unpark kept ~32 MiB of host memory per graph)
- fused domino: same vocab-compatibility guard as the legacy path,
  clean fallback + one-shot warning on mismatch
Unit suite 2022/0; spec-decode output hash unchanged.
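
The "clean fallback + one-shot warning" pattern from the last bullet might look like the following sketch. The commit does not describe the actual guard or what the fallback target is, so comparing vocabulary sizes, the `DominoPath` enum, and the function names are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cstdio>

enum class DominoPath { Fused, Fallback };

// Counts emitted warnings so the one-shot behaviour can be observed.
inline std::atomic<int>& warnings_emitted() {
    static std::atomic<int> n{0};
    return n;
}

// Hypothetical guard: use the fused head only when the vocab sizes match
// (an assumed stand-in for the real compatibility check). On mismatch,
// fall back and warn exactly once per process.
inline DominoPath select_domino_path(int target_vocab, int draft_vocab) {
    if (target_vocab == draft_vocab) return DominoPath::Fused;

    static std::atomic<bool> warned{false};
    if (!warned.exchange(true)) {  // true only for the first caller
        std::fprintf(stderr,
                     "fused domino disabled: vocab mismatch (%d vs %d)\n",
                     target_vocab, draft_vocab);
        warnings_emitted().fetch_add(1);
    }
    return DominoPath::Fallback;
}
```

Using an atomic exchange keeps the warning single-shot even if several threads hit the mismatch at the same time.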
---------
Co-authored-by: mrciffa <davide@cifarelli.tech>