Hoist quant packing + RowDequantSource out of sk.ainet.models.gemma into shared layers #184

@michalharakal

[Spec] Hoist quant packing + row-dequant out of sk.ainet.models.gemma into shared layers

Repos: SKaiNET (engine) + SKaiNET-transformers · Type: refactor / generalization ·
Unblocks: Gemma function-calling on the SL2610 (the remaining half of #178) and whisper int8 on the
box. Builds on: transformer-core extraction (must land first).

Why

Quantization generalizes along a layered stack:

  • Engine (skainet-lang-core) — already shared, all targets: packed dtypes (Q8_0TensorData,
    Q4_0/Q5_0/Q5_1/Q4_K/Q5_K/Q6_K), QuantizedMatmul kernels, ops.transpose for all packed dtypes
    (#178, #736/#737, merged).
  • Primitives (transformer-core) — dtype-agnostic (LinearProjection = ops.matmul(x, ops.transpose(w))).
  • Model (llm-inference/*) — here the duplication lives: gemma GemmaPackedWeights/GemmaQuantLayout
    (packGemmaKQuant) + RowDequantSource, llama QuantizedTensorFactory — each re-implements
    GGUF-block → engine-BlockTensorData packing and (gemma only) row-dequant.

Result today: only Gemma keeps the embedding Q8_0/Q6_K (via its model-local RowDequantSource), and only
because its own PerLayerEmbedding.compute checks the marker — the engine ops.gather does not know
RowDequantSource, so a normal gather on a packed tensor materialises the full FP32 table (the ~0.67 GB
token_embd that still OOMs, #178's remaining item). Whisper (tied 51864-vocab head + embedding) would hit
the identical wall.

The three hoists (each independent; do (1) first)

(1) RowDequantSource → engine (skainet-lang-core), wired into ops.gather ⟵ highest leverage

  • Move the RowDequantSource interface (fun dequantRow(rowIdx: Int): FloatArray) to
    sk.ainet.lang.tensor.data (next to TensorData).
  • Teach ops.gather (and any embedding-lookup op): if (input.data is RowDequantSource) → gather rows
    via dequantRow(idx) instead of copyToFloatArray(). This is the generic version of gemma's model-level
    trick.
  • Effect: any packed embedding (gemma token_embd, whisper tied embedding, llama) stays packed; the
    gather dequants only the touched rows. Closes #178's last ~0.67 GB.
  • Acceptance: gemma token_embd stays Q-packed end-to-end (footprint drop, byte-identical decode via
    GemmaQ5KPackedParityTest); a new engine test gathers from a RowDequantSource tensor without full
    materialise.
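
A minimal sketch of hoist (1), assuming toy types: only the RowDequantSource interface and its dequantRow signature come from this spec. ToyInt8Rows (one float scale per row, not the real Q8_0 block layout) and gatherRows are hypothetical stand-ins for a packed TensorData and for ops.gather.

```kotlin
// From the spec: the only contract the engine gather needs to know about.
interface RowDequantSource {
    fun dequantRow(rowIdx: Int): FloatArray
}

// Hypothetical packed table: int8 codes plus one scale per row.
// This is NOT the real Q8_0 layout, just enough to show the access pattern.
class ToyInt8Rows(
    private val rows: Int,
    private val cols: Int,
    private val scales: FloatArray,
    private val codes: ByteArray,
) : RowDequantSource {
    // Counts how often the whole table was expanded to FP32.
    var fullMaterialisations = 0
        private set

    override fun dequantRow(rowIdx: Int): FloatArray {
        require(rowIdx in 0 until rows) { "row $rowIdx out of range" }
        val scale = scales[rowIdx]
        return FloatArray(cols) { c -> codes[rowIdx * cols + c] * scale }
    }

    // The path the spec wants gather to avoid for packed embeddings.
    fun copyToFloatArray(): FloatArray {
        fullMaterialisations++
        return FloatArray(rows * cols) { i -> codes[i] * scales[i / cols] }
    }
}

// Hypothetical stand-in for ops.gather over the row axis.
// `data` is either a RowDequantSource or a dense row-major FloatArray.
fun gatherRows(data: Any, cols: Int, indices: IntArray): FloatArray {
    val out = FloatArray(indices.size * cols)
    if (data is RowDequantSource) {
        // Packed path: dequantise only the rows that were asked for.
        indices.forEachIndexed { i, idx ->
            data.dequantRow(idx).copyInto(out, destinationOffset = i * cols)
        }
    } else {
        // Dense path: plain row copies.
        val dense = data as FloatArray
        indices.forEachIndexed { i, idx ->
            dense.copyInto(out, i * cols, idx * cols, (idx + 1) * cols)
        }
    }
    return out
}
```

Gathering rows 2 and 0 from a 3x2 ToyInt8Rows allocates two row-sized arrays and leaves fullMaterialisations at 0, which is the property the new engine test in the acceptance criterion would check.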

(2) Generic block-relayout packing → shared quant util (transformers, depends on lang-core)

  • Extract the common GGUF-block → *BlockTensorData relayout (the blockSize-parameterised packer
    #178 generalized) from gemma GemmaQuantLayout.packGemmaKQuant + llama QuantizedTensorFactory into one
    shared packer keyed by ggml quant type → engine BlockTensorData.
  • Per-model keeps only weight selection + naming (which tensors to pack, ggml-name map).
  • Effect: de-dups gemma+llama; whisper-io gets the same packer for a Prefer(Q8_0) DTypePolicy.
  • Acceptance: gemma + llama pack via the shared util (their tests stay green); the util has its own
    round-trip test per dtype.
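
A sketch of what hoist (2) could look like, under stated assumptions: GgmlQuantType, PackedBlocks, packBlocks and unpackBlocks are hypothetical names, and the target layout (block headers and codes split into two contiguous arrays) is invented for illustration; the real *BlockTensorData layout may differ. The block geometry for Q8_0 (32 elements, 34 bytes) and Q4_0 (32 elements, 18 bytes) follows the ggml definitions.

```kotlin
// Hypothetical registry keyed by ggml quant type. Each entry carries the
// block geometry the generic packer needs; nothing else is type-specific.
enum class GgmlQuantType(val elementsPerBlock: Int, val bytesPerBlock: Int, val headerBytes: Int) {
    Q4_0(32, 18, 2), // fp16 scale + 16 bytes of packed 4-bit codes
    Q8_0(32, 34, 2), // fp16 scale + 32 int8 codes
}

// Hypothetical engine-side container: headers and codes stored separately.
class PackedBlocks(
    val type: GgmlQuantType,
    val blockCount: Int,
    val headers: ByteArray,
    val codes: ByteArray,
)

// Generic relayout: interleaved GGUF blocks -> split headers/codes arrays.
fun packBlocks(type: GgmlQuantType, raw: ByteArray, elementCount: Int): PackedBlocks {
    require(elementCount % type.elementsPerBlock == 0) { "partial block" }
    val blocks = elementCount / type.elementsPerBlock
    require(raw.size == blocks * type.bytesPerBlock) { "unexpected byte count" }
    val payload = type.bytesPerBlock - type.headerBytes
    val headers = ByteArray(blocks * type.headerBytes)
    val codes = ByteArray(blocks * payload)
    for (b in 0 until blocks) {
        val src = b * type.bytesPerBlock
        raw.copyInto(headers, b * type.headerBytes, src, src + type.headerBytes)
        raw.copyInto(codes, b * payload, src + type.headerBytes, src + type.bytesPerBlock)
    }
    return PackedBlocks(type, blocks, headers, codes)
}

// Inverse relayout, used for the per-dtype round-trip test.
fun unpackBlocks(p: PackedBlocks): ByteArray {
    val t = p.type
    val payload = t.bytesPerBlock - t.headerBytes
    val raw = ByteArray(p.blockCount * t.bytesPerBlock)
    for (b in 0 until p.blockCount) {
        val dst = b * t.bytesPerBlock
        p.headers.copyInto(raw, dst, b * t.headerBytes, (b + 1) * t.headerBytes)
        p.codes.copyInto(raw, dst + t.headerBytes, b * payload, (b + 1) * payload)
    }
    return raw
}
```

With this shape, the per-model code would only decide which tensors to pass to packBlocks and under which quant type, and a round-trip check (pack, unpack, compare bytes) can be written once and run for every enum entry.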

(3) Pre-transpose marker (#178 "Solution C") → engine tensor-data flag + transformer-core LinearProjection

  • Mark a packed weight as already [out, in] block-major (the kernel's order).
  • LinearProjection (now in transformer-core) reads the marker and skips ops.transpose,
    dispatching straight to chooseQuantizedMatmulHeap. (Avoids even the lazy transpose; the engine
    ops.transpose packed support stays as the fallback.)
  • Effect: packed matmul weights (lm_head, attn projections) never round-trip through transpose;
    whisper's WhisperLinear (identical shape) benefits the same way.
  • Acceptance: packed lm_head matmul with the marker takes the quant kernel with no transpose allocation.
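
The dispatch in hoist (3) can be sketched with dense floats, assuming toy types: Weight, its preTransposed flag, ToyOps and linearProjection are hypothetical, and matmulOutIn merely stands in for the quant kernel that chooseQuantizedMatmulHeap would select. The point is only the branch: a marked weight goes straight to the [out, in] kernel, an unmarked one takes the transpose fallback.

```kotlin
// Hypothetical row-major weight. `preTransposed` is the marker from the spec:
// the data is already [out, in] in the order the kernel consumes.
class Weight(
    val rows: Int,
    val cols: Int,
    val values: FloatArray,
    val preTransposed: Boolean = false,
)

object ToyOps {
    // Counts transpose allocations so a test can assert none happened.
    var transposeCalls = 0
        private set

    fun transpose(w: Weight): Weight {
        transposeCalls++
        val out = FloatArray(w.values.size)
        for (r in 0 until w.rows) {
            for (c in 0 until w.cols) out[c * w.rows + r] = w.values[r * w.cols + c]
        }
        return Weight(w.cols, w.rows, out)
    }

    // x: [in], w: [in, out]. The generic matmul used after a transpose.
    fun matmul(x: FloatArray, w: Weight): FloatArray {
        require(x.size == w.rows)
        return FloatArray(w.cols) { o ->
            var acc = 0f
            for (i in 0 until w.rows) acc += x[i] * w.values[i * w.cols + o]
            acc
        }
    }

    // x: [in], w: [out, in]. Stand-in for the quantized kernel's native order.
    fun matmulOutIn(x: FloatArray, w: Weight): FloatArray {
        require(x.size == w.cols)
        return FloatArray(w.rows) { o ->
            var acc = 0f
            for (i in 0 until w.cols) acc += x[i] * w.values[o * w.cols + i]
            acc
        }
    }
}

// LinearProjection = matmul(x, transpose(w)), unless the marker says the
// weight is already in kernel order, in which case the transpose is skipped.
fun linearProjection(x: FloatArray, w: Weight): FloatArray =
    if (w.preTransposed) ToyOps.matmulOutIn(x, w)
    else ToyOps.matmul(x, ToyOps.transpose(w))
```

Both branches produce the same output for the same [out, in] data; only the unmarked branch increments transposeCalls, which mirrors the "no transpose allocation" acceptance check.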

Sequencing

transformer-core lands → (1) (unblocks Gemma token_embd + the function-calling footprint, and whisper
embedding) → (2) (de-dup, enables whisper packed loading) → (3) (perf, no-transpose packed matmul).
(1) alone closes #178's remaining board-fit item.

Payback for whisper

With (1)+(2): a whisper-io Prefer(Q8_0) policy + a RowDequantSource token-embedding → whisper runs int8
on the box (the Phase-1 accuracy-ceiling question), reusing the exact engine kernels + row-dequant Gemma
uses — no whisper-specific quant code.

Notes