Hoist quant packing + RowDequantSource out of sk.ainet.models.gemma into shared layers #184

# [Spec] Hoist quant packing + row-dequant out of `sk.ainet.models.gemma` into shared layers

**Repos:** SKaiNET (engine) + SKaiNET-transformers · **Type:** refactor / generalization ·
**Unblocks:** Gemma function-calling on the SL2610 (the remaining half of #178) **and** whisper int8 on the
box. **Builds on:** `transformer-core` extraction (must land first).

## Why
Quantization generalizes along a layered stack:
- **Engine** (`skainet-lang-core`) — already shared, all targets: packed dtypes (`Q8_0TensorData`,
  `Q4_0/Q5_0/Q5_1/Q4_K/Q5_K/Q6_K`), `QuantizedMatmul` kernels, `ops.transpose` for all packed dtypes
  (#178 #736/#737, merged).
- **Primitives** (`transformer-core`) — dtype-agnostic (`LinearProjection = ops.matmul(x, ops.transpose(w))`).
- **Model** (`llm-inference/*`) — here the duplication lives: gemma `GemmaPackedWeights`/`GemmaQuantLayout`
  (`packGemmaKQuant`) + `RowDequantSource`, llama `QuantizedTensorFactory` — each re-implements
  GGUF-block → engine-`BlockTensorData` packing and (gemma only) row-dequant.

Result today: only Gemma keeps its *embedding* packed as Q8_0/Q6_K (via its model-local `RowDequantSource`),
and only because its **own** `PerLayerEmbedding.compute` checks the marker. The engine `ops.gather` does
**not** know `RowDequantSource`, so a normal gather on a packed tensor materialises the full FP32 table (the
~0.67 GB `token_embd` that still OOMs, #178's remaining item). Whisper (tied 51864-vocab head + embedding)
would hit the identical wall.

## The three hoists (each independent; do (1) first)

### (1) `RowDequantSource` → engine (`skainet-lang-core`), wired into `ops.gather`  ⟵ highest leverage
- **Move** the `RowDequantSource` interface (`fun dequantRow(rowIdx: Int): FloatArray`) to
  `sk.ainet.lang.tensor.data` (next to `TensorData`).
- **Teach `ops.gather`** (and any embedding-lookup op): when `input.data is RowDequantSource`, gather rows
  via `dequantRow(idx)` instead of calling `copyToFloatArray()`. This is the generic version of gemma's
  model-level trick.
- **Effect:** any packed embedding (gemma `token_embd`, whisper tied embedding, llama) stays packed; the
  gather dequants only the touched rows. Closes #178's last ~0.67 GB.
- **Acceptance:** gemma `token_embd` stays Q-packed end-to-end (footprint drops, decode stays byte-identical
  per `GemmaQ5KPackedParityTest`); a new engine test gathers from a `RowDequantSource` tensor without
  materialising the full table.
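
A minimal sketch of the dispatch described above, to make the intended behaviour concrete. Only `RowDequantSource` and `dequantRow(rowIdx: Int): FloatArray` come from this spec; `ToyPackedTable`, `DenseTable` and `gatherRows` are hypothetical stand-ins, not the engine's `TensorData` or `ops.gather` API, and the toy per-row scale is not the real Q8_0/Q6_K block layout.

```kotlin
// Interface as proposed in this spec (to live in sk.ainet.lang.tensor.data).
interface RowDequantSource {
    fun dequantRow(rowIdx: Int): FloatArray
}

// Hypothetical toy table: one Float scale per row plus int8 values.
// This is NOT the real engine block layout, only enough to show the dispatch.
class ToyPackedTable(
    private val scales: FloatArray,
    private val quants: Array<ByteArray>,
) : RowDequantSource {
    // Counts dequantised rows so a test can prove nothing else was touched.
    var rowsDequantized = 0
        private set

    override fun dequantRow(rowIdx: Int): FloatArray {
        rowsDequantized++
        val q = quants[rowIdx]
        return FloatArray(q.size) { i -> q[i] * scales[rowIdx] }
    }
}

// Hypothetical dense FP32 table, row-major, standing in for the eager path.
class DenseTable(val values: FloatArray, val cols: Int)

// Hypothetical gather: packed sources dequantise only the requested rows,
// dense tables are sliced directly.
fun gatherRows(table: Any, indices: IntArray): Array<FloatArray> = when (table) {
    is RowDequantSource -> Array(indices.size) { i -> table.dequantRow(indices[i]) }
    is DenseTable -> Array(indices.size) { i ->
        val start = indices[i] * table.cols
        table.values.copyOfRange(start, start + table.cols)
    }
    else -> error("unsupported table type")
}
```

The `rowsDequantized` counter is one way the new engine test could assert "no full materialise": after gathering `k` indices, exactly `k` rows should have been dequantised.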

### (2) Generic block-relayout packing → shared `quant` util (transformers, depends on `lang-core`)
- **Extract** the common GGUF-block → `*BlockTensorData` relayout (the `blockSize`-parameterised packer
  #178 generalized) from gemma `GemmaQuantLayout.packGemmaKQuant` + llama `QuantizedTensorFactory` into one
  shared packer keyed by ggml quant type → engine `BlockTensorData`.
- **Per-model code keeps** only weight *selection* + naming (which tensors to pack, the ggml-name map).
- **Effect:** de-dups gemma+llama; whisper-io gets the same packer for a `Prefer(Q8_0)` `DTypePolicy`.
- **Acceptance:** gemma + llama pack via the shared util (their tests stay green); the util has its own
  round-trip test per dtype.
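
As a rough illustration of "one shared packer keyed by ggml quant type", the sketch below keys block geometry on the quant type and validates a raw GGUF byte run against it. The block sizes and byte counts are the standard ggml ones for the formats this spec names; `GgmlQuant`, `PackedBlocks` and `packBlocks` are hypothetical names, and the actual relayout into the engine's `*BlockTensorData` is deliberately left out because it depends on the engine layout.

```kotlin
// Block geometry per ggml quant type: elements per block, bytes per block.
enum class GgmlQuant(val blockSize: Int, val bytesPerBlock: Int) {
    Q4_0(32, 18), Q5_0(32, 22), Q5_1(32, 24), Q8_0(32, 34),
    Q4_K(256, 144), Q5_K(256, 176), Q6_K(256, 210),
}

// Hypothetical stand-in for an engine BlockTensorData: raw blocks plus logical shape.
class PackedBlocks(val type: GgmlQuant, val rows: Int, val cols: Int, val bytes: ByteArray) {
    val blocksPerRow: Int get() = cols / type.blockSize

    // Byte offset of a row's first block, which is what a row-dequant needs.
    fun rowOffset(row: Int): Int = row * blocksPerRow * type.bytesPerBlock
}

// Hypothetical shared entry point: the same size checks for every model,
// parameterised only by the quant type.
fun packBlocks(type: GgmlQuant, rows: Int, cols: Int, raw: ByteArray): PackedBlocks {
    require(cols % type.blockSize == 0) {
        "cols=$cols is not a multiple of block size ${type.blockSize}"
    }
    val expected = rows * (cols / type.blockSize) * type.bytesPerBlock
    require(raw.size == expected) { "expected $expected bytes, got ${raw.size}" }
    return PackedBlocks(type, rows, cols, raw)
}
```

With a table like this, per-model code would only decide which tensors to hand to `packBlocks`; the geometry and validation live in one place.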

### (3) Pre-transpose marker (#178 "Solution C") → engine tensor-data flag + `transformer-core` `LinearProjection`
- **Mark** a packed weight as already `[out, in]` block-major (the kernel's order).
- **`LinearProjection`** (now in `transformer-core`) reads the marker and **skips `ops.transpose`**,
  dispatching straight to `chooseQuantizedMatmulHeap`. (Avoids even the lazy transpose; the engine
  `ops.transpose` packed support stays as the fallback.)
- **Effect:** packed matmul weights (lm_head, attn projections) never round-trip through transpose;
  whisper's `WhisperLinear` (identical shape) benefits the same way.
- **Acceptance:** packed lm_head matmul with the marker takes the quant kernel with no transpose allocation.
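
The marker dispatch can be sketched with plain float arrays. Everything here is a hypothetical stand-in: `Weight.outInMajor` plays the role of the proposed tensor-data flag, the direct row-dot loop stands in for the quant kernel reached via `chooseQuantizedMatmulHeap`, and `Counters.transposes` is only there so a test can observe that the marked path never transposes.

```kotlin
// Hypothetical weight holder; data is row-major [out, in].
// `outInMajor` stands in for the proposed "already in kernel order" marker.
class Weight(val data: FloatArray, val outDim: Int, val inDim: Int, val outInMajor: Boolean)

// Test-only instrumentation: how many fallback transposes were performed.
object Counters { var transposes = 0 }

// Fallback path: materialise w^T as row-major [in, out].
fun transposeToInOut(w: Weight): FloatArray {
    Counters.transposes++
    val t = FloatArray(w.data.size)
    for (o in 0 until w.outDim) {
        for (i in 0 until w.inDim) t[i * w.outDim + o] = w.data[o * w.inDim + i]
    }
    return t
}

// Sketch of LinearProjection: y = x * w^T, skipping the transpose when marked.
fun linearProjection(x: FloatArray, w: Weight): FloatArray {
    require(x.size == w.inDim) { "input size ${x.size} != inDim ${w.inDim}" }
    if (w.outInMajor) {
        // Marked weight: dot each [out] row with x directly, no transpose.
        return FloatArray(w.outDim) { o ->
            var acc = 0f
            for (i in 0 until w.inDim) acc += w.data[o * w.inDim + i] * x[i]
            acc
        }
    }
    // Unmarked weight: generic matmul(x, transpose(w)).
    val t = transposeToInOut(w)
    return FloatArray(w.outDim) { o ->
        var acc = 0f
        for (i in 0 until w.inDim) acc += x[i] * t[i * w.outDim + o]
        acc
    }
}
```

Both branches compute the same projection, which is what lets the eager path act as the parity oracle for the marked one.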

## Sequencing
`transformer-core` lands → **(1)** (unblocks Gemma `token_embd` + the function-calling footprint, and whisper
embedding) → **(2)** (de-dup, enables whisper packed loading) → **(3)** (perf, no-transpose packed matmul).
(1) alone closes #178's remaining board-fit item.

## Payback for whisper
With (1)+(2): a whisper-io `Prefer(Q8_0)` policy + a `RowDequantSource` token-embedding → whisper runs int8
on the box (the Phase-1 accuracy-ceiling question), reusing the *exact* engine kernels + row-dequant Gemma
uses — no whisper-specific quant code.

## Notes
- Keep the eager FP32 path as the always-correct oracle (parity tests gate every step).
- (1) is an **engine** API addition (`RowDequantSource` becomes public engine API) — coordinate the release
  pin like #178 did (engine first, then bump the transformers `skainet` pin).

