[Spec] Hoist quant packing + row-dequant out of sk.ainet.models.gemma into shared layers
Repos: SKaiNET (engine) + SKaiNET-transformers · Type: refactor / generalization · Unblocks: Gemma function-calling on the SL2610 (the remaining half of #178) and whisper int8 on the
box. Builds on: `transformer-core` extraction (must land first).
Why
Quantization generalizes along a layered stack:
- Engine (`skainet-lang-core`), already shared, all targets: packed dtypes (`Q8_0TensorData`, `Q4_0`/`Q5_0`/`Q5_1`/`Q4_K`/`Q5_K`/`Q6_K`), `QuantizedMatmul` kernels, `ops.transpose` for all packed dtypes (#178, #736/#737, merged).
- `transformer-core`: dtype-agnostic (`LinearProjection = ops.matmul(x, ops.transpose(w))`).
- Model (`llm-inference/*`): here the duplication lives. gemma `GemmaPackedWeights`/`GemmaQuantLayout` (`packGemmaKQuant`) + `RowDequantSource` and llama `QuantizedTensorFactory` each re-implement GGUF-block → engine-`BlockTensorData` packing and (gemma only) row-dequant.

Result today: only Gemma keeps the embedding Q8_0/Q6_K (via its model-local `RowDequantSource`), and only because its own `PerLayerEmbedding.compute` checks the marker. The engine `ops.gather` does not know `RowDequantSource`, so a normal gather on a packed tensor materialises the full FP32 table (the ~0.67 GB `token_embd` that still OOMs, #178's remaining item). Whisper (tied 51864-vocab head + embedding) would hit the identical wall.
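To make the footprint argument concrete, here is a minimal arithmetic sketch. The helper names are hypothetical, and the table shape used in the usage note (262,144 rows by 640 columns) is an assumption chosen only because it reproduces the ~0.67 GB figure; the spec does not state the dimensions. The Q8_0 block geometry (32 int8 values plus a 2-byte fp16 scale, 34 bytes per block) is the ggml layout.

```kotlin
// Hypothetical helpers, not part of SKaiNET: byte counts for a [rows, cols] table.

// Full FP32 materialisation: 4 bytes per element.
fun fp32Bytes(rows: Long, cols: Long): Long = rows * cols * 4L

// ggml Q8_0: each block holds 32 int8 values plus one 2-byte fp16 scale (34 bytes).
fun q80Bytes(rows: Long, cols: Long): Long {
    require(cols % 32L == 0L) { "cols must be a multiple of the Q8_0 block size (32)" }
    return rows * (cols / 32L) * 34L
}

// FP32 bytes produced when only `tokens` rows are dequantized during a gather.
fun gatheredBytes(tokens: Long, cols: Long): Long = tokens * cols * 4L
```

Under the assumed shape, `fp32Bytes(262_144, 640)` is 671,088,640 bytes, `q80Bytes(262_144, 640)` is 178,257,920 bytes, and gathering 8 tokens via row-dequant produces only 20,480 bytes of FP32.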
The three hoists (each independent; do (1) first)
(1) `RowDequantSource` → engine (`skainet-lang-core`), wired into `ops.gather` ⟵ highest leverage
- Move the `RowDequantSource` interface (`fun dequantRow(rowIdx: Int): FloatArray`) to `sk.ainet.lang.tensor.data` (next to `TensorData`).
- Teach `ops.gather` (and any embedding-lookup op): if `input.data is RowDequantSource`, gather rows via `dequantRow(idx)` instead of `copyToFloatArray()`. This is the generic version of gemma's model-level trick.
- Effect: the embedding table (gemma `token_embd`, whisper tied embedding, llama) stays packed; the gather dequantizes only the touched rows. Closes #178's last ~0.67 GB.
- Acceptance: gemma `token_embd` stays Q-packed end-to-end (footprint drop, byte-identical decode via `GemmaQ5KPackedParityTest`); a new engine test gathers from a `RowDequantSource` tensor without a full materialise.
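The gather fast path in hoist (1) can be sketched as follows. Only the `dequantRow(rowIdx: Int): FloatArray` signature comes from the spec. The `TensorData` stand-in, `PackedQ8Table`, the `gather` function, and the `fullMaterialisations` counter are hypothetical, and the table stores Float scales for brevity where real Q8_0 stores fp16.

```kotlin
// Stand-in for the engine's TensorData (assumption: the real interface is richer).
interface TensorData {
    val rows: Int
    val cols: Int
    fun copyToFloatArray(): FloatArray
}

// Signature as given in the spec.
interface RowDequantSource {
    fun dequantRow(rowIdx: Int): FloatArray
}

// Simplified Q8_0-style table: 32 int8 values per block, one Float scale per block.
class PackedQ8Table(
    override val rows: Int,
    override val cols: Int,
    private val quants: ByteArray,  // rows * cols int8 values
    private val scales: FloatArray, // rows * (cols / 32) scales
) : TensorData, RowDequantSource {
    // Counts full-table dequantizations so a test can assert none happened.
    var fullMaterialisations = 0
        private set
    private val blocksPerRow = cols / BLOCK

    init {
        require(cols % BLOCK == 0) { "cols must be a multiple of $BLOCK" }
        require(quants.size == rows * cols)
        require(scales.size == rows * blocksPerRow)
    }

    // Dequantize exactly one row: value = int8 * scale of its block.
    override fun dequantRow(rowIdx: Int): FloatArray {
        require(rowIdx in 0 until rows)
        val out = FloatArray(cols)
        for (c in 0 until cols) {
            val scale = scales[rowIdx * blocksPerRow + c / BLOCK]
            out[c] = quants[rowIdx * cols + c].toInt() * scale
        }
        return out
    }

    // Slow path: dequantize the whole table.
    override fun copyToFloatArray(): FloatArray {
        fullMaterialisations++
        val out = FloatArray(rows * cols)
        for (r in 0 until rows) dequantRow(r).copyInto(out, r * cols)
        return out
    }

    companion object { const val BLOCK = 32 }
}

// Hypothetical gather: row-dequant when available, full materialisation otherwise.
fun gather(data: TensorData, indices: IntArray): FloatArray {
    val out = FloatArray(indices.size * data.cols)
    if (data is RowDequantSource) {
        indices.forEachIndexed { i, idx -> data.dequantRow(idx).copyInto(out, i * data.cols) }
    } else {
        val full = data.copyToFloatArray()
        indices.forEachIndexed { i, idx ->
            full.copyInto(out, i * data.cols, idx * data.cols, (idx + 1) * data.cols)
        }
    }
    return out
}
```

An engine test in the spirit of the acceptance criterion would gather a few rows from a `PackedQ8Table` and assert that `fullMaterialisations` is still zero afterwards.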
(2) Generic block-relayout packing → shared `quant` util (transformers, depends on `lang-core`)
- Hoist the GGUF-block → `*BlockTensorData` relayout (the `blockSize`-parameterised packer that #178 generalized) from gemma `GemmaQuantLayout.packGemmaKQuant` + llama `QuantizedTensorFactory` into one shared packer keyed by ggml quant type → engine `BlockTensorData`.
- Effect: enables packed loading via a `Prefer(Q8_0)` `DTypePolicy`.
- Acceptance: round-trip test per dtype.
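One possible shape for a packer keyed by ggml quant type is sketched below. The spec does not describe the per-dtype relayout itself, so this only shows the keyed block geometry and size validation that such a packer would share. `GgmlQuantType`, `PackedBlocks`, and `packBlocks` are hypothetical names, not the engine's `BlockTensorData` API; the elements-per-block and bytes-per-block values are the standard ggml block sizes.

```kotlin
// Block geometry of the ggml quant formats named in the spec.
enum class GgmlQuantType(val elemsPerBlock: Int, val bytesPerBlock: Int) {
    Q4_0(32, 18), Q5_0(32, 22), Q5_1(32, 24), Q8_0(32, 34),
    Q4_K(256, 144), Q5_K(256, 176), Q6_K(256, 210)
}

// Hypothetical stand-in for the engine's *BlockTensorData classes.
class PackedBlocks(
    val type: GgmlQuantType,
    val rows: Int,
    val cols: Int,
    val bytes: ByteArray,
) {
    val blocksPerRow: Int get() = cols / type.elemsPerBlock
    val rowStrideBytes: Int get() = blocksPerRow * type.bytesPerBlock

    // Bytes of one row: the unit a row-dequant or matmul kernel would consume.
    fun rowBytes(row: Int): ByteArray {
        require(row in 0 until rows)
        return bytes.copyOfRange(row * rowStrideBytes, (row + 1) * rowStrideBytes)
    }
}

// One entry point for every model: validate the raw GGUF payload against the
// block geometry of `type`, then wrap it. A real packer would relayout here.
fun packBlocks(type: GgmlQuantType, raw: ByteArray, rows: Int, cols: Int): PackedBlocks {
    require(cols % type.elemsPerBlock == 0) {
        "cols=$cols is not a multiple of the ${type.name} block size ${type.elemsPerBlock}"
    }
    val expected = rows.toLong() * (cols / type.elemsPerBlock) * type.bytesPerBlock
    require(raw.size.toLong() == expected) {
        "expected $expected bytes for ${type.name} [$rows, $cols], got ${raw.size}"
    }
    return PackedBlocks(type, rows, cols, raw.copyOf())
}
```

Keying on the quant type keeps the block size a parameter rather than a per-model constant, which is what lets gemma, llama, and whisper share one code path.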
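(3) Pre-transpose marker (#178 "Solution C") → engine tensor-data flag + `transformer-core` `LinearProjection`
- Mark a packed weight as already `[out, in]` block-major (the kernel's order).
- `LinearProjection` (now in `transformer-core`) reads the marker and skips `ops.transpose`, dispatching straight to `chooseQuantizedMatmulHeap`. (Avoids even the lazy transpose; the engine `ops.transpose` packed support stays as the fallback.)
- Effect: packed matmul weights (lm_head, attn projections) never round-trip through transpose; whisper's `WhisperLinear` (identical shape) benefits the same way.
- Acceptance: packed lm_head matmul with the marker takes the quant kernel with no transpose allocation.

Sequencing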
`transformer-core` lands → (1) (unblocks Gemma `token_embd` + the function-calling footprint, and whisper embedding) → (2) (de-dup, enables whisper packed loading) → (3) (perf, no-transpose packed matmul).
(1) alone closes #178's remaining board-fit item.
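For the last step in that sequence, hoist (3), the marker check in a linear projection might look like the sketch below. Everything here is hypothetical: the spec does not name the flag (`preTransposed` is invented for illustration), `quantMatVec` merely stands in for whatever `chooseQuantizedMatmulHeap` selects, and the fallback dequantizes and transposes eagerly, whereas the spec's fallback is the engine's lazy packed `ops.transpose`.

```kotlin
const val PROJ_BLOCK = 32

// Packed weight stored row-major as [out, in]; `preTransposed` is the hypothetical marker.
class PackedWeight(
    val outDim: Int,
    val inDim: Int,
    val quants: ByteArray,  // outDim * inDim int8 values
    val scales: FloatArray, // one Float scale per 32-value block
    val preTransposed: Boolean,
)

// Counters so a test can assert which path ran.
class Counters {
    var transposes = 0
    var quantKernelCalls = 0
}

// Stand-in quant kernel: consumes the [out, in] blocks directly, no transpose.
fun quantMatVec(x: FloatArray, w: PackedWeight): FloatArray {
    val blocksPerRow = w.inDim / PROJ_BLOCK
    return FloatArray(w.outDim) { o ->
        var acc = 0f
        for (i in 0 until w.inDim) {
            val scale = w.scales[o * blocksPerRow + i / PROJ_BLOCK]
            acc += x[i] * (w.quants[o * w.inDim + i].toInt() * scale)
        }
        acc
    }
}

// Simplified fallback: dequantize into a transposed [in, out] buffer, then multiply.
fun transposeThenMatVec(x: FloatArray, w: PackedWeight, counters: Counters): FloatArray {
    val blocksPerRow = w.inDim / PROJ_BLOCK
    counters.transposes++
    val wT = FloatArray(w.inDim * w.outDim) // the allocation the marker avoids
    for (o in 0 until w.outDim) for (i in 0 until w.inDim) {
        val scale = w.scales[o * blocksPerRow + i / PROJ_BLOCK]
        wT[i * w.outDim + o] = w.quants[o * w.inDim + i].toInt() * scale
    }
    return FloatArray(w.outDim) { o ->
        var acc = 0f
        for (i in 0 until w.inDim) acc += x[i] * wT[i * w.outDim + o]
        acc
    }
}

// Projection entry point: the marker decides whether a transpose ever happens.
fun linearProjection(x: FloatArray, w: PackedWeight, counters: Counters): FloatArray {
    require(x.size == w.inDim)
    return if (w.preTransposed) {
        counters.quantKernelCalls++
        quantMatVec(x, w)
    } else {
        transposeThenMatVec(x, w, counters)
    }
}
```

Both paths return the same values; the marked path simply never allocates the transposed buffer, which is the property the acceptance test for (3) checks.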
Payback for whisper
With (1)+(2): a whisper-io `Prefer(Q8_0)` policy + a `RowDequantSource` token-embedding → whisper runs int8 on the box (the Phase-1 accuracy-ceiling question), reusing the exact engine kernels + row-dequant Gemma uses, with no whisper-specific quant code.
Notes
- `RowDequantSource` becomes public engine API: coordinate the release pin like #178 did (engine first, then bump the transformers `skainet` pin).
- Keep the eager FP32 path as the always-correct oracle (parity tests gate every step).