
feat(apertus): block-major TensorData wiring for NATIVE_OPTIMIZED Q4_K E2E #102

Merged
michalharakal merged 1 commit into develop from feature/apertus-q4k-e2e on May 2, 2026

Conversation

@michalharakal (Contributor) commented on May 2, 2026

Summary

Towards #100. Three connected fixes that mirror Gemma's Q4_K end-to-end path so the standard transformer DSL forward can consume Apertus quantized weights instead of crashing at the first matmul.

What's in

  1. Per-tensor quant sidecar on ApertusWeights — new quantTypes / logicalShapes / quantBytes fields. Default empty; non-NATIVE_OPTIMIZED callers see no change.

  2. New jvmMain ApertusMemSegConverter mirroring GemmaMemSegConverter (relayout sketch after this list):

    • Q4_K → Q4_KBlockTensorData with row-major → block-major relayout (144 B/block)
    • Q6_K → Q6_KBlockTensorData (210 B/block)
    • Q4_0 / Q8_0 → Q4MemorySegmentTensorData / Q8MemorySegmentTensorData
    • Q5_K → fallback to FP32 dequant (no packed kernel yet; Apertus-8B Q4_K_S has only 8 of these)
      Drains the loader's quantBytes as it goes and drops the sidecar maps from the result so we don't carry 5 GB of one-shot byte data through the runtime.
  3. Loader produces logical [out, in] shapes, not GGUF [in, out] — fixes two long-standing latent bugs:

    • GGUF lists dims fastest-varying-first; the rest of this codebase follows PyTorch [out, in]. Added logicalShape() helper that reverses (mirrors Gemma's reversedShape); replaced every loader callsite.
    • createTensor was running transposeColumnMajorToRowMajor on every 2D tensor and flipping its shape on top of the un-reversed input. After the shape fix, that transpose was actively scrambling data. Simplified to a direct fromFloatArray with the logical shape.
      Neither tripped a unit test because Apertus DSL pipeline tests use synthetic FP32 weights via fromWeights, bypassing this whole loading path.
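
A minimal sketch of the row-major → block-major relayout idea from item 2, assuming 144-byte Q4_K super-blocks covering 256 weights each; the function name and signature here are hypothetical, not the actual ApertusMemSegConverter API:

```kotlin
// Hypothetical sketch, not the real converter code.
// GGUF stores Q4_K super-blocks row-major: all blocks of row 0, then row 1, ...
// The vectorized kernel wants input-block-major: block 0 of every row first,
// then block 1 of every row, so each input chunk streams contiguously.
const val Q4_K_BLOCK_BYTES = 144   // one 256-weight Q4_K super-block
const val QK_K = 256               // weights per super-block

fun relayoutRowMajorToBlockMajor(src: ByteArray, rows: Int, cols: Int): ByteArray {
    require(cols % QK_K == 0) { "cols must be a multiple of $QK_K" }
    val blocksPerRow = cols / QK_K
    val dst = ByteArray(src.size)
    for (row in 0 until rows) {
        for (block in 0 until blocksPerRow) {
            val from = (row * blocksPerRow + block) * Q4_K_BLOCK_BYTES   // [row, block]
            val to = (block * rows + row) * Q4_K_BLOCK_BYTES             // [block, row]
            src.copyInto(dst, to, from, from + Q4_K_BLOCK_BYTES)
        }
    }
    return dst
}
```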

End-to-end status against unsloth/Apertus-8B-Instruct-2509-GGUF Q4_K_S

| Stage | Result |
| --- | --- |
| peek() / tensor presence / loadQuantized() | ✅ |
| ApertusNetworkLoader.fromGguf().load() (module build) | ✅ 12 GB heap |
| Embedding lookup (token_embd dequant + gather) | ✅ produces correct [1, dim] per-token vector |
| Q4_K transpose + matmul through attention projections | ✅ no longer throws |
| Full forward(BOS) runs to completion | ✅ ~30 s, 18 GB heap |
| Logits are finite | ❌ all-NaN; xIELU formula / param-space mismatch, tracked in #103 |

Why merge as-is

The activation (#103) is an independent thread — formula audit against the HF Apertus reference, not loader work. Everything in this PR is necessary and correct for the loader/converter side; punting the merge until #103 is solved would block any other Apertus loader work that depends on the corrected shapes and the converter API.

Once #103 lands, the existing forward pass on real Apertus should produce finite logits, and the greedy-generate smoke tests in ApertusRealGgufLoadingTest should pass without additional loader changes.

Test plan

🤖 Generated with Claude Code

feat(apertus): block-major TensorData wiring for NATIVE_OPTIMIZED Q4_K E2E

Mirrors Gemma's Q4_K end-to-end pattern (Issue #100). Three connected
fixes that together let the standard transformer DSL forward path
consume Apertus quantized weights instead of crashing at the first
matmul.

1. Per-tensor quant sidecar on ApertusWeights
   New `quantTypes` / `logicalShapes` / `quantBytes` fields on
   ApertusWeights. The loader populates them under
   QuantPolicy.NATIVE_OPTIMIZED so a JVM-side converter can re-wrap
   each tensor in the right block-major TensorData. Default empty so
   non-NATIVE_OPTIMIZED callers are unaffected.

2. New jvmMain ApertusMemSegConverter
   Mirrors GemmaMemSegConverter:
   - Q4_K → Q4_KBlockTensorData with relayout from GGUF row-major
     [row, block] to the input-block-major [block, row] layout that
     JvmQuantizedVectorKernels.matmulQ4_KVec expects
   - Q6_K → Q6_KBlockTensorData (same relayout, 210B/block)
   - Q4_0 / Q8_0 → Q4MemorySegmentTensorData / Q8MemorySegmentTensorData
   - Q5_K → fallback to FP32 dequant (no packed kernel yet; Apertus-8B
     Q4_K_S has only 8 Q5_K tensors so the cost is small)
   Drains the loader's quantBytes map as it goes; drops the sidecar
   maps from the result. Without this the forward pass blew up
   immediately with "Transpose requires at least 2 dimensions" because
   NATIVE_OPTIMIZED was storing 5 GB of byte-shape rank-1 Int8 tensors.

3. Loader produces logical [out, in] shapes, not GGUF [in, out]
   Two long-standing latent bugs that only surfaced once weights
   actually flowed through to a forward pass:

   - GGUF stores tensor dims fastest-varying-first ([4096, 131072] for
     a [vocab=131072, dim=4096] embedding); the rest of this codebase
     follows PyTorch [out, in] convention. Apertus loader was
     constructing tensors with the GGUF order. Added `logicalShape()`
     helper that reverses the dim list, mirroring Gemma's
     `reversedShape`. Replaced every callsite.

   - `createTensor` was running `transposeColumnMajorToRowMajor` on
     every 2D tensor and flipping its shape, on top of the
     un-reversed input. After the shape fix, that transpose was
     actively scrambling tensor data. Simplified createTensor to a
     direct fromFloatArray with the (now-correct) logical shape.

   Neither bug ever tripped a unit test because Apertus DSL pipeline
   tests use ApertusNetworkLoader.fromWeights with synthetic FP32
   tensors, bypassing this whole loading path.
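
A minimal sketch of the shape fix in item 3, assuming the helper simply reverses the GGUF dim list (mirroring Gemma's reversedShape); the name is taken from the text above, but the exact signature in the Apertus loader may differ:

```kotlin
// Hypothetical sketch of logicalShape(): GGUF lists dims fastest-varying-first,
// while the codebase follows the PyTorch [out, in] convention, so the logical
// shape is just the GGUF dim list reversed.
fun logicalShape(ggufDims: List<Int>): List<Int> = ggufDims.reversed()

fun main() {
    // Token embedding example from above: GGUF reports [4096, 131072] for the
    // [vocab = 131072, dim = 4096] tensor.
    check(logicalShape(listOf(4096, 131072)) == listOf(131072, 4096))
    // GGUF's flat data is already row-major with respect to the reversed shape,
    // so createTensor can call fromFloatArray directly; no transpose is needed.
}
```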

End-to-end status against unsloth/Apertus-8B-Instruct-2509-GGUF
(Q4_K_S, 4.7 GB on disk, -PapertusTestMaxHeap=18g):

   load → convert → fromWeights → OptimizedLLMRuntime.forward(BOS)
   - Module construction: ✅ 35 top-level modules, ~13 s
   - Loader populates 32 xIELU layers, 131 FP32 tensors,
     193 quantized tensors with correct logical shapes
   - Forward pass: completes (~30 s) but produces NaN logits

The NaN points at xIELU. Stored alpha_p values for early layers are
extreme (alpha_p[0]=166, alpha_p[1]=174, …), and ApertusXIELU's
softplus(alpha_p) reduces to ~alpha_p for any value > 20, so the
activation evaluates to alpha_p_eff * x² ~= 166 * x² and overflows
once x grows past a few. Either the unsloth GGUF stores these in a
different transformed space than ApertusXIELU expects, or the
formula in ApertusXIELU.kt diverges from the Apertus-8B reference
implementation. Tracking as follow-up — the wiring change in this
commit is what unblocks getting that far in the first place; a
matching xIELU formula audit / unsloth-vs-HF-source comparison is
its own thread.
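
To make the overflow mechanism concrete, here is a standalone numeric check of the softplus saturation described above (an illustration only, not ApertusXIELU.kt; it assumes the positive branch behaves like softplus(alpha_p) * x²):

```kotlin
import kotlin.math.exp
import kotlin.math.ln

// Standalone illustration of why softplus(alpha_p) ~ alpha_p for large inputs.
fun softplus(a: Double): Double = ln(1.0 + exp(a))

fun main() {
    println(softplus(20.0))    // ~20.000000002: already numerically ~identity
    println(softplus(166.0))   // exactly 166.0 in Double (e^166 ~ 1.3e72 still fits)
    // With alpha_p_eff ~ 166 the quadratic term compounds doubly exponentially:
    var x = 1.0
    repeat(8) {
        x = 166.0 * x * x
        println(x)             // 166, 4.6e6, 3.5e15, ... reaches Infinity by step 8
    }
}
```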

Verified loader-only smoke tests still pass (peek, tensor presence,
loadQuantized, fromGguf module build).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
michalharakal merged commit 52ff395 into develop on May 2, 2026
2 checks passed
michalharakal deleted the feature/apertus-q4k-e2e branch on May 2, 2026 at 19:41