Apertus: NATIVE_OPTIMIZED Q4_K end-to-end inference broken — needs block-major tensor-data wrappers #100

@michalharakal

Description

Summary

ApertusNetworkLoader.fromGguf(quantPolicy = NATIVE_OPTIMIZED).load() succeeds (after PR #98), but a single forward pass through the loaded model fails because ApertusWeightLoader.streamingTensorToTensor stores quantized weights as raw byte-level rank-1 Int8 tensors. The standard transformer DSL forward path then hits two failures in sequence:

  1. Embedding gather on token_embd — fixed in PR fix(apertus): real-model loading — UInt metadata + quantized shape #98 by force-dequant'ing the token embedding to FP32 in loadStreamingTensor / loadReaderTensor regardless of quantPolicy. (Embedding lookup needs the logical [vocab, dim] shape; byte-shape doesn't work for gather.)

  2. Attention Q/K/V/O and FFN projections: linearProject(ops, x, W) = ops.matmul(x, ops.transpose(W)) (llm-core/.../transformer/LinearProjection.kt:30) doesn't know about quantized weights, so ops.transpose on the byte-shape Int8 tensor fails with "Transpose requires at least 2 dimensions" because the byte tensor is rank 1.

java.lang.IllegalArgumentException: Transpose requires at least 2 dimensions
  at sk.ainet.exec.tensor.ops.DefaultCpuOpsBase.transpose(DefaultCpuOps.kt:455)
  at sk.ainet.lang.nn.transformer.LinearProjectionKt.linearProject(LinearProjection.kt:34)
  at sk.ainet.lang.nn.transformer.MultiHeadAttention.onForward(MultiHeadAttention.kt:185)
  at sk.ainet.apps.llm.HybridTransformerBlock.directForward(HybridTransformerBlock.kt:172)
  ...
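
For intuition on why the transpose fails: Q4_K packs 256 logical elements into a 144-byte super-block, so a logical rank-2 [out, in] weight stored as raw bytes collapses into a single rank-1 run. A small illustrative sketch (the constants are the ggml Q4_K layout; the function names are hypothetical, not loader API):

```kotlin
// ggml Q4_K layout: 256 elements per super-block, 144 bytes per super-block
// (2 x fp16 scales + 12 scale bytes + 128 nibble bytes).
const val QK_K = 256
const val Q4_K_BLOCK_BYTES = 144

// Byte length of a Q4_K tensor holding the given logical element count.
fun q4kByteLength(elements: Long): Long = elements / QK_K * Q4_K_BLOCK_BYTES

fun main() {
    // A 4096 x 4096 Q projection: logically rank 2, but stored as one run of
    // bytes -- rank 1, so transpose() has no second dimension to swap.
    val logicalShape = longArrayOf(4096, 4096)
    val byteShape = longArrayOf(q4kByteLength(4096L * 4096L))
    println("logical rank = ${logicalShape.size}, byte rank = ${byteShape.size}")
}
```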

This is the same problem Gemma solved. Gemma's loader stores Q4_K weights as Q4_KBlockTensorData(logicalShape = Shape(rows, cols), blockMajorBytes) — a quant-aware TensorData that retains the logical rank-2 shape. transpose(Q4_KTensorData) is overridden to be lazy, and matmul dispatches via JvmQuantizedVectorKernels.matmulQ4_KVec. See GemmaDslQ4KTest, relayoutQ4_KRowMajorToBlockMajor, and GemmaMemSegConverter for the pattern.
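
The core of that pattern can be sketched in a few lines (illustrative only; the class and field names here do not match the actual Gemma / skainet-lang-core types):

```kotlin
// Hypothetical sketch of a quant-aware TensorData: it keeps the logical
// [rows, cols] shape alongside the quantized payload, and transpose() is
// lazy -- it flips a flag and swaps the logical dims without touching bytes.
class BlockQuantTensorData(
    val logicalShape: LongArray,      // e.g. [rows, cols], NOT the byte shape
    val blockMajorBytes: ByteArray,   // relayouted quantized payload
    val transposed: Boolean = false,  // lazy-transpose flag
) {
    fun transpose(): BlockQuantTensorData = BlockQuantTensorData(
        longArrayOf(logicalShape[1], logicalShape[0]),
        blockMajorBytes,   // no data movement; a quant-aware matmul kernel
        !transposed,       // consults the flag to pick the right layout
    )
}
```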

Repro (after PR #98)

// SKaiNET-transformers/llm-inference/apertus/src/jvmTest/.../ApertusRealGgufLoadingTest.kt
// Run with -PapertusTestMaxHeap=12g and unsloth/Apertus-8B-Instruct-2509-GGUF Q4_K_S in HF cache.
val ctx = DirectCpuExecutionContext.create()
val model = ApertusNetworkLoader.fromGguf(
    randomAccessProvider = { JvmRandomAccessSource.open(file) },
    quantPolicy = QuantPolicy.NATIVE_OPTIMIZED
).load<FP32, Float>(ctx)

OptimizedLLMRuntime(model, ctx, OptimizedLLMMode.DIRECT, FP32::class).forward(bos)
//                  ^^^ throws IllegalArgumentException at first MHA Q-projection

Why this matters

After cleanup commit 8a7e0ff removed ApertusQuantizedRuntime, the canonical path for running Apertus models is OptimizedLLMRuntime + apertusNetwork(). Combined with this bug, there is currently no working path to actually run an Apertus-8B Q4_K_S model end-to-end on a normal-sized JVM:

  • DEQUANTIZE_TO_FP32 → ~32 GB heap for Apertus-8B (won't fit on a 16 GB box).
  • NATIVE_OPTIMIZED → fails at the first projection in the forward pass (this issue).
  • RAW_BYTES → identical byte-shape problem.
  • loadQuantized() → returns ApertusQuantizedWeights, but the runtime that consumed them was deleted.
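
The ~32 GB figure is just parameter count times FP32 width, a back-of-the-envelope sketch before activations, KV cache, and JVM overhead:

```kotlin
// DEQUANTIZE_TO_FP32 heap estimate: ~8e9 parameters at 4 bytes each is
// 32e9 bytes of weights alone, hence "won't fit on a 16 GB box".
fun fp32HeapBytes(params: Long): Long = params * 4L  // FP32 = 4 bytes/param

fun main() {
    val bytes = fp32HeapBytes(8_000_000_000L)
    println("$bytes bytes = ${bytes / (1L shl 30)} GiB")
}
```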

PR #98 verifies loading, but inference remains blocked.

Proposed fix

Mirror Gemma's path. In ApertusWeightLoader.streamingTensorToTensor / readerTensorToTensor, when quantPolicy == NATIVE_OPTIMIZED and the tensor is a block-quantized type (Q4_K / Q5_K / Q6_K / Q8_0 / IQ4_NL / IQ4_XS / Q2_K / Q3_K / TQ1_0 / TQ2_0), wrap as the appropriate *BlockTensorData from skainet-lang-core with the logical [out, in] shape and pre-relayout to block-major. Each format needs:

  • Row-major → block-major relayout (relayoutQ4_KRowMajorToBlockMajor exists for Q4_K; the other formats need analogous helpers if they don't already have them)
  • A lazy transpose override on the TensorData (Gemma's Q4_KTensorData has this already)
  • matmul dispatch via the right native / Panama Vector kernel

Apertus-8B-Instruct-2509 Q4_K_S contains 185 Q4_K tensors, 8 Q5_K, 1 Q6_K, and 130 F32. Q4_K and F32 paths exist already in skainet-lang-core; Q5_K / Q6_K need parity work. (Or fall back to dequant for the Q5_K / Q6_K outliers — only 9 tensors total in this quant.)
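
Given those tensor counts, the dispatch plus Q5_K/Q6_K dequant fallback could look roughly like this (a sketch only; GgmlType, BlockQuant, DequantFp32, and the function signature are assumed names, not the current ApertusWeightLoader / skainet-lang-core API):

```kotlin
// Illustrative dispatch for the proposed wrapAsBlockTensorData switch.
enum class GgmlType { Q4_K, Q5_K, Q6_K }

sealed interface LoadedWeight
class BlockQuant(val type: GgmlType, val shape: LongArray, val blockMajorBytes: ByteArray) : LoadedWeight
class DequantFp32(val shape: LongArray, val values: FloatArray) : LoadedWeight

fun wrapAsBlockTensorData(
    type: GgmlType,
    shape: LongArray,
    bytes: ByteArray,
    dequant: (GgmlType, ByteArray) -> FloatArray,
): LoadedWeight = when (type) {
    // Q4_K already has relayout + matmul kernels: keep it quantized
    // (a real implementation would relayout row-major -> block-major here).
    GgmlType.Q4_K -> BlockQuant(type, shape, bytes)
    // Q5_K / Q6_K: only 9 tensors in this quant, so dequantizing them is a
    // cheap stopgap until the block-major wrappers reach parity.
    GgmlType.Q5_K, GgmlType.Q6_K -> DequantFp32(shape, dequant(type, bytes))
}
```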

Scope split

This is a multi-day chunk:

  1. ApertusWeightLoader gains a wrapAsBlockTensorData(tensorType, shape, bytes) switch that produces the right *BlockTensorData per quant type.
  2. Verify Q5_K / Q6_K wrappers exist in skainet-lang-core (or add them, mirroring Q4_K).
  3. The FFN down projection's output dim isn't a multiple of the K-quant block size in some Apertus quants — check whether Gemma's relayout handles padded blocks or panic-routes to dequant.
  4. End-to-end smoke test that the same ApertusRealGgufLoadingTest.fromGguf path now produces finite logits, matching dequantized FP32 within Q4_K tolerance.
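
For item 3, a minimal divisibility guard could decide between relayout and the dequant fallback (a sketch; QK_K = 256 is the ggml K-quant super-block size, and canRelayoutBlockMajor is a hypothetical helper):

```kotlin
// A K-quant row must hold a whole number of 256-element super-blocks for the
// block-major relayout to apply cleanly; otherwise route the tensor to dequant.
const val QK_K = 256

fun canRelayoutBlockMajor(rows: Long, cols: Long): Boolean =
    cols % QK_K == 0L  // ggml quantizes along the innermost (cols) dimension

fun main() {
    println(canRelayoutBlockMajor(4096, 11008))  // 11008 = 43 * 256 -> true
}
```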

Out of scope

  • Tool calling — the chat-template + parser are unit-tested and don't depend on this.
  • Numeric parity with llama.cpp — separate measurement task.
