Apertus: NATIVE_OPTIMIZED Q4_K end-to-end inference broken — needs block-major tensor-data wrappers #100

@michalharakal

Description

Summary

ApertusNetworkLoader.fromGguf(quantPolicy = NATIVE_OPTIMIZED).load() succeeds (after PR #98), but a single forward pass through the loaded model fails because ApertusWeightLoader.streamingTensorToTensor stores quantized weights as raw byte-level rank-1 Int8 tensors. The standard transformer DSL forward path then hits two failures in sequence:

  1. Embedding gather on token_embd — fixed in PR fix(apertus): real-model loading — UInt metadata + quantized shape #98 by force-dequant'ing the token embedding to FP32 in loadStreamingTensor / loadReaderTensor regardless of quantPolicy. (Embedding lookup needs the logical [vocab, dim] shape; byte-shape doesn't work for gather.)

  2. Attention Q/K/V/O and FFN projections: linearProject(ops, x, W) = ops.matmul(x, ops.transpose(W)) (llm-core/.../transformer/LinearProjection.kt:30) doesn't know about quantized weights, so ops.transpose on the byte-shape Int8 tensor fails with "Transpose requires at least 2 dimensions" because the byte tensor is rank 1.

java.lang.IllegalArgumentException: Transpose requires at least 2 dimensions
  at sk.ainet.exec.tensor.ops.DefaultCpuOpsBase.transpose(DefaultCpuOps.kt:455)
  at sk.ainet.lang.nn.transformer.LinearProjectionKt.linearProject(LinearProjection.kt:34)
  at sk.ainet.lang.nn.transformer.MultiHeadAttention.onForward(MultiHeadAttention.kt:185)
  at sk.ainet.apps.llm.HybridTransformerBlock.directForward(HybridTransformerBlock.kt:172)
  ...
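
For intuition on why the transpose fails: Q4_K packs 256 logical elements into a 144-byte super-block, so a logical rank-2 [out, in] weight stored as raw bytes collapses into a single rank-1 run. A small illustrative sketch (the constants are the ggml Q4_K layout; the function names are hypothetical, not loader API):

```kotlin
// ggml Q4_K layout: 256 elements per super-block, 144 bytes per super-block
// (2 x fp16 scales + 12 scale bytes + 128 nibble bytes).
const val QK_K = 256
const val Q4_K_BLOCK_BYTES = 144

// Byte length of a Q4_K tensor holding the given logical element count.
fun q4kByteLength(elements: Long): Long = elements / QK_K * Q4_K_BLOCK_BYTES

fun main() {
    // A 4096 x 4096 Q projection: logically rank 2, but stored as one run of
    // bytes -- rank 1, so transpose() has no second dimension to swap.
    val logicalShape = longArrayOf(4096, 4096)
    val byteShape = longArrayOf(q4kByteLength(4096L * 4096L))
    println("logical rank = ${logicalShape.size}, byte rank = ${byteShape.size}")
}
```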

This is the same problem Gemma solved. Gemma's loader stores Q4_K weights as Q4_KBlockTensorData(logicalShape = Shape(rows, cols), blockMajorBytes) — a quant-aware TensorData that retains the logical rank-2 shape. transpose(Q4_KTensorData) is overridden to be lazy, and matmul dispatches via JvmQuantizedVectorKernels.matmulQ4_KVec. See GemmaDslQ4KTest, relayoutQ4_KRowMajorToBlockMajor, and GemmaMemSegConverter for the pattern.
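
The core of that pattern can be sketched in a few lines (illustrative only; the class and field names here do not match the actual Gemma / skainet-lang-core types):

```kotlin
// Hypothetical sketch of a quant-aware TensorData: it keeps the logical
// [rows, cols] shape alongside the quantized payload, and transpose() is
// lazy -- it flips a flag and swaps the logical dims without touching bytes.
class BlockQuantTensorData(
    val logicalShape: LongArray,      // e.g. [rows, cols], NOT the byte shape
    val blockMajorBytes: ByteArray,   // relayouted quantized payload
    val transposed: Boolean = false,  // lazy-transpose flag
) {
    fun transpose(): BlockQuantTensorData = BlockQuantTensorData(
        longArrayOf(logicalShape[1], logicalShape[0]),
        blockMajorBytes,   // no data movement; a quant-aware matmul kernel
        !transposed,       // consults the flag to pick the right layout
    )
}
```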

Repro (after PR #98)

// SKaiNET-transformers/llm-inference/apertus/src/jvmTest/.../ApertusRealGgufLoadingTest.kt
// Run with -PapertusTestMaxHeap=12g and unsloth/Apertus-8B-Instruct-2509-GGUF Q4_K_S in HF cache.
val ctx = DirectCpuExecutionContext.create()
val model = ApertusNetworkLoader.fromGguf(
    randomAccessProvider = { JvmRandomAccessSource.open(file) },
    quantPolicy = QuantPolicy.NATIVE_OPTIMIZED
).load<FP32, Float>(ctx)

OptimizedLLMRuntime(model, ctx, OptimizedLLMMode.DIRECT, FP32::class).forward(bos)
//                  ^^^ throws IllegalArgumentException at first MHA Q-projection

Why this matters

After cleanup commit 8a7e0ff removed ApertusQuantizedRuntime, the canonical path for running Apertus models is OptimizedLLMRuntime + apertusNetwork(). Combined with this bug, there is currently no working path to actually run an Apertus-8B Q4_K_S model end-to-end on a normal-sized JVM:

  • DEQUANTIZE_TO_FP32 → ~32 GB heap for Apertus-8B (won't fit on a 16 GB box).
  • NATIVE_OPTIMIZED → fails at the first projection in the forward pass (this issue).
  • RAW_BYTES → identical byte-shape problem.
  • loadQuantized() → returns ApertusQuantizedWeights, but the runtime that consumed them was deleted.
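
The ~32 GB figure is just parameter count times FP32 width, a back-of-the-envelope sketch before activations, KV cache, and JVM overhead:

```kotlin
// DEQUANTIZE_TO_FP32 heap estimate: ~8e9 parameters at 4 bytes each is
// 32e9 bytes of weights alone, hence "won't fit on a 16 GB box".
fun fp32HeapBytes(params: Long): Long = params * 4L  // FP32 = 4 bytes/param

fun main() {
    val bytes = fp32HeapBytes(8_000_000_000L)
    println("$bytes bytes = ${bytes / (1L shl 30)} GiB")
}
```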

PR #98 verifies loading, but inference remains blocked.

Proposed fix

Mirror Gemma's path. In ApertusWeightLoader.streamingTensorToTensor / readerTensorToTensor, when quantPolicy == NATIVE_OPTIMIZED and the tensor is a block-quantized type (Q4_K / Q5_K / Q6_K / Q8_0 / IQ4_NL / IQ4_XS / Q2_K / Q3_K / TQ1_0 / TQ2_0), wrap as the appropriate *BlockTensorData from skainet-lang-core with the logical [out, in] shape and pre-relayout to block-major. Each format needs:

  • Row-major → block-major relayout (relayoutQ4_KRowMajorToBlockMajor exists for Q4_K; the other formats need analogous helpers if they don't already have them)
  • A lazy transpose override on the TensorData (Gemma's Q4_KTensorData has this already)
  • matmul dispatch via the right native / Panama Vector kernel

Apertus-8B-Instruct-2509 Q4_K_S contains 185 Q4_K tensors, 8 Q5_K, 1 Q6_K, and 130 F32. Q4_K and F32 paths exist already in skainet-lang-core; Q5_K / Q6_K need parity work. (Or fall back to dequant for the Q5_K / Q6_K outliers — only 9 tensors total in this quant.)
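
Given those tensor counts, the dispatch plus Q5_K/Q6_K dequant fallback could look roughly like this (a sketch only; GgmlType, BlockQuant, DequantFp32, and the function signature are assumed names, not the current ApertusWeightLoader / skainet-lang-core API):

```kotlin
// Illustrative dispatch for the proposed wrapAsBlockTensorData switch.
enum class GgmlType { Q4_K, Q5_K, Q6_K }

sealed interface LoadedWeight
class BlockQuant(val type: GgmlType, val shape: LongArray, val blockMajorBytes: ByteArray) : LoadedWeight
class DequantFp32(val shape: LongArray, val values: FloatArray) : LoadedWeight

fun wrapAsBlockTensorData(
    type: GgmlType,
    shape: LongArray,
    bytes: ByteArray,
    dequant: (GgmlType, ByteArray) -> FloatArray,
): LoadedWeight = when (type) {
    // Q4_K already has relayout + matmul kernels: keep it quantized
    // (a real implementation would relayout row-major -> block-major here).
    GgmlType.Q4_K -> BlockQuant(type, shape, bytes)
    // Q5_K / Q6_K: only 9 tensors in this quant, so dequantizing them is a
    // cheap stopgap until the block-major wrappers reach parity.
    GgmlType.Q5_K, GgmlType.Q6_K -> DequantFp32(shape, dequant(type, bytes))
}
```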

Scope split

This is a multi-day chunk:

  1. ApertusWeightLoader gains a wrapAsBlockTensorData(tensorType, shape, bytes) switch that produces the right *BlockTensorData per quant type.
  2. Verify Q5_K / Q6_K wrappers exist in skainet-lang-core (or add them, mirroring Q4_K).
  3. The FFN down projection's output dim isn't a multiple of the K-quant block size in some Apertus quants — check whether Gemma's relayout handles padded blocks or panic-routes to dequant.
  4. End-to-end smoke test that the same ApertusRealGgufLoadingTest.fromGguf path now produces finite logits, matching dequantized FP32 within Q4_K tolerance.
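
For item 3, a minimal divisibility guard could decide between relayout and the dequant fallback (a sketch; QK_K = 256 is the ggml K-quant super-block size, and canRelayoutBlockMajor is a hypothetical helper):

```kotlin
// A K-quant row must hold a whole number of 256-element super-blocks for the
// block-major relayout to apply cleanly; otherwise route the tensor to dequant.
const val QK_K = 256

fun canRelayoutBlockMajor(rows: Long, cols: Long): Boolean =
    cols % QK_K == 0L  // ggml quantizes along the innermost (cols) dimension

fun main() {
    println(canRelayoutBlockMajor(4096, 11008))  // 11008 = 43 * 256 -> true
}
```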

Out of scope

  • Tool calling — the chat-template + parser are unit-tested and don't depend on this.
  • Numeric parity with llama.cpp — separate measurement task.
