ApertusNetworkLoader.fromGguf(quantPolicy = NATIVE_OPTIMIZED).load() succeeds (after PR #98), but a single forward pass through the loaded model fails because ApertusWeightLoader.streamingTensorToTensor stores quantized weights as raw byte-level rank-1 Int8 tensors. The standard transformer DSL forward path then hits two failures in sequence:
1. Embedding gather on token_embd — fixed in PR #98 ("fix(apertus): real-model loading — UInt metadata + quantized shape") by force-dequant'ing the token embedding to FP32 in loadStreamingTensor / loadReaderTensor regardless of quantPolicy. (Embedding lookup needs the logical [vocab, dim] shape; byte-shape doesn't work for gather.)
2. Attention Q/K/V/O and FFN projections — linearProject(ops, x, W) = ops.matmul(x, ops.transpose(W)) (llm-core/.../transformer/LinearProjection.kt:30) doesn't know about quantized weights. ops.transpose on the byte-shaped Int8 tensor fails with "Transpose requires at least 2 dimensions" because the tensor is rank 1:
java.lang.IllegalArgumentException: Transpose requires at least 2 dimensions
at sk.ainet.exec.tensor.ops.DefaultCpuOpsBase.transpose(DefaultCpuOps.kt:455)
at sk.ainet.lang.nn.transformer.LinearProjectionKt.linearProject(LinearProjection.kt:34)
at sk.ainet.lang.nn.transformer.MultiHeadAttention.onForward(MultiHeadAttention.kt:185)
at sk.ainet.apps.llm.HybridTransformerBlock.directForward(HybridTransformerBlock.kt:172)
...
This is the same problem Gemma solved. Gemma's loader stores Q4_K weights as Q4_KBlockTensorData(logicalShape = Shape(rows, cols), blockMajorBytes) — a quant-aware TensorData that retains the logical rank-2 shape. transpose(Q4_KTensorData) is overridden to be lazy, and matmul dispatches via JvmQuantizedVectorKernels.matmulQ4_KVec. See GemmaDslQ4KTest, relayoutQ4_KRowMajorToBlockMajor, and GemmaMemSegConverter for the pattern.
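For orientation, here is a minimal sketch of the shape of that pattern. Everything in it is simplified and hypothetical; the real Q4_KBlockTensorData, Shape, and kernel APIs in skainet-lang-core will differ:

```kotlin
// Minimal sketch of a quant-aware TensorData, loosely following the Gemma
// pattern described above. All names and signatures are illustrative,
// not the actual skainet-lang-core API.
class QuantBlockData(
    val logicalShape: IntArray,      // logical [rows, cols], not the raw byte shape
    val blockMajorBytes: ByteArray,  // quantized payload, pre-relayouted to block-major
    val transposed: Boolean = false, // lazy transpose: a flag, not a data movement
)

// transpose() stays O(1): swap the logical dims and flip the flag.
fun QuantBlockData.transpose() = QuantBlockData(
    intArrayOf(logicalShape[1], logicalShape[0]),
    blockMajorBytes,
    !transposed,
)

// matmul dispatch: quantized operands route to a dedicated kernel instead of
// the dense FP32 path (Gemma routes via JvmQuantizedVectorKernels.matmulQ4_KVec).
fun matmul(x: FloatArray, w: QuantBlockData): FloatArray =
    quantizedMatmulKernel(x, w)      // hypothetical kernel entry point

fun quantizedMatmulKernel(x: FloatArray, w: QuantBlockData): FloatArray =
    TODO("per-block dequant + fused dot against x, honoring w.transposed")
```

The point is that transpose never touches the quantized bytes, so the rank-1 precondition in DefaultCpuOpsBase.transpose is never hit.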
Repro (after PR #98)

```kotlin
// SKaiNET-transformers/llm-inference/apertus/src/jvmTest/.../ApertusRealGgufLoadingTest.kt
// Run with -PapertusTestMaxHeap=12g and unsloth/Apertus-8B-Instruct-2509-GGUF Q4_K_S in HF cache.
val ctx = DirectCpuExecutionContext.create()
val model = ApertusNetworkLoader.fromGguf(
    randomAccessProvider = { JvmRandomAccessSource.open(file) },
    quantPolicy = QuantPolicy.NATIVE_OPTIMIZED
).load<FP32, Float>(ctx)
OptimizedLLMRuntime(model, ctx, OptimizedLLMMode.DIRECT, FP32::class).forward(bos)
// ^^^ throws IllegalArgumentException at first MHA Q-projection
```
Why this matters
After cleanup commit 8a7e0ff removed ApertusQuantizedRuntime, the canonical path for running Apertus models is OptimizedLLMRuntime + apertusNetwork(). Combined with this bug, there is currently no working path to actually run an Apertus-8B Q4_K_S model end-to-end on a normal-sized JVM:
- DEQUANTIZE_TO_FP32 → ~32 GB heap for Apertus-8B (8B params × 4 bytes FP32; won't fit on a 16 GB box).
- NATIVE_OPTIMIZED → fails at the first projection in the forward pass (this issue).
- RAW_BYTES → identical byte-shape problem.
- loadQuantized() → returns ApertusQuantizedWeights, but the runtime that consumed them was deleted.
PR #98 verifies loading, but inference is blocked.
Proposed fix
Mirror Gemma's path. In ApertusWeightLoader.streamingTensorToTensor / readerTensorToTensor, when quantPolicy == NATIVE_OPTIMIZED and the tensor is a block-quantized type (Q4_K / Q5_K / Q6_K / Q8_0 / IQ4_NL / IQ4_XS / Q2_K / Q3_K / TQ1_0 / TQ2_0), wrap as the appropriate *BlockTensorData from skainet-lang-core with the logical [out, in] shape and pre-relayout to block-major. Each format needs:
- Row-major → block-major relayout (relayoutQ4_KRowMajorToBlockMajor exists for Q4_K; the other formats need analogous helpers if they don't already exist)
- A lazy transpose override on the TensorData (Gemma's Q4_KTensorData has this already)
- matmul dispatch via the right native / Panama Vector kernel
Apertus-8B-Instruct-2509 Q4_K_S contains 185 Q4_K tensors, 8 Q5_K, 1 Q6_K, and 130 F32. Q4_K and F32 paths exist already in skainet-lang-core; Q5_K / Q6_K need parity work. (Or fall back to dequant for the Q5_K / Q6_K outliers — only 9 tensors total in this quant.)
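As a rough shape for that change (not the existing loader API; GgmlType, dequantToFp32, and the exact wrapper construction below are assumptions for illustration):

```kotlin
// Hypothetical sketch of the NATIVE_OPTIMIZED branch in streamingTensorToTensor.
// Only relayoutQ4_KRowMajorToBlockMajor and Q4_KBlockTensorData are named after
// existing Gemma-side code; every other identifier is illustrative.
fun materialize(type: GgmlType, shape: Shape, bytes: ByteArray, policy: QuantPolicy): TensorData =
    when {
        policy != QuantPolicy.NATIVE_OPTIMIZED ->
            dequantToFp32(type, shape, bytes)
        type == GgmlType.Q4_K ->
            Q4_KBlockTensorData(
                logicalShape = shape,  // keep the logical [out, in] shape
                blockMajorBytes = relayoutQ4_KRowMajorToBlockMajor(bytes, shape),
            )
        type == GgmlType.Q5_K || type == GgmlType.Q6_K ->
            // Parity work pending; falling back to dequant here only costs the
            // 9 outlier tensors in this quant (8 Q5_K + 1 Q6_K).
            dequantToFp32(type, shape, bytes)
        else ->
            dequantToFp32(type, shape, bytes)
    }
```

The Q5_K / Q6_K fallback branch keeps the heap cost negligible while the kernel parity work lands separately.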
Scope split
This is a multi-day chunk:
- ApertusWeightLoader gains a wrapAsBlockTensorData(tensorType, shape, bytes) switch that produces the right *BlockTensorData per quant type.
- The FFN down projection's output dim isn't a multiple of the K-quant block size in some Apertus quants — check whether Gemma's relayout handles padded blocks or panic-routes to dequant.
- End-to-end smoke test that the same ApertusRealGgufLoadingTest.fromGguf path now produces finite logits, matching dequantized FP32 within Q4_K tolerance (sketched below).
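A sketch of what the smoke test's core assertion could look like; the runForward helper, the tolerance value, and the test setup are assumptions, not existing test utilities:

```kotlin
import kotlin.test.Test
import kotlin.test.assertEquals

// Illustrative test shape only: model/runtime setup is elided, and the FP32
// reference leg needs the big-heap configuration from the repro above.
class ApertusQuantizedForwardSmokeTest {
    @Test
    fun quantizedForwardMatchesDequantizedFp32() {
        // runForward(policy, token): FloatArray of logits — hypothetical helper
        val quant = runForward(QuantPolicy.NATIVE_OPTIMIZED, bos)
        val fp32 = runForward(QuantPolicy.DEQUANTIZE_TO_FP32, bos)
        quant.forEach { require(it.isFinite()) }          // no NaN/Inf logits
        for (i in quant.indices) {
            assertEquals(fp32[i], quant[i], 5e-2f)        // assumed Q4_K tolerance
        }
    }
}
```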
Out of scope
- Tool calling — the chat-template + parser are unit-tested and don't depend on this.
- Numeric parity with llama.cpp — separate measurement task.