feat(apertus): block-major TensorData wiring for NATIVE_OPTIMIZED Q4_K E2E #102
Merged
michalharakal merged 1 commit into develop on May 2, 2026
Conversation
…K E2E

Mirrors Gemma's Q4_K end-to-end pattern (Issue #100). Three connected fixes that together let the standard transformer DSL forward path consume Apertus quantized weights instead of crashing at the first matmul.

1. Per-tensor quant sidecar on ApertusWeights

New `quantTypes` / `logicalShapes` / `quantBytes` fields on ApertusWeights. The loader populates them under QuantPolicy.NATIVE_OPTIMIZED so a JVM-side converter can re-wrap each tensor in the right block-major TensorData. Default empty, so non-NATIVE_OPTIMIZED callers are unaffected.

2. New jvmMain ApertusMemSegConverter

Mirrors GemmaMemSegConverter:
- Q4_K → Q4_KBlockTensorData with relayout from GGUF row-major [row, block] to the input-block-major [block, row] layout that JvmQuantizedVectorKernels.matmulQ4_KVec expects
- Q6_K → Q6_KBlockTensorData (same relayout, 210 B/block)
- Q4_0 / Q8_0 → Q4MemorySegmentTensorData / Q8MemorySegmentTensorData
- Q5_K → fallback to FP32 dequant (no packed kernel yet; Apertus-8B Q4_K_S has only 8 Q5_K tensors, so the cost is small)

Drains the loader's quantBytes map as it goes and drops the sidecar maps from the result. Without this, the forward pass blew up immediately with "Transpose requires at least 2 dimensions" because NATIVE_OPTIMIZED was storing 5 GB of byte-shaped rank-1 Int8 tensors.

3. Loader produces logical [out, in] shapes, not GGUF [in, out]

Two long-standing latent bugs that only surfaced once weights actually flowed through to a forward pass:
- GGUF stores tensor dims fastest-varying-first ([4096, 131072] for a [vocab=131072, dim=4096] embedding); the rest of this codebase follows the PyTorch [out, in] convention. The Apertus loader was constructing tensors in GGUF order. Added a `logicalShape()` helper that reverses the dim list, mirroring Gemma's `reversedShape`, and replaced every callsite.
- `createTensor` was running `transposeColumnMajorToRowMajor` on every 2D tensor and flipping its shape, on top of the un-reversed input. After the shape fix, that transpose was actively scrambling tensor data. Simplified createTensor to a direct fromFloatArray with the (now-correct) logical shape.

Neither bug ever tripped a unit test because Apertus DSL pipeline tests use ApertusNetworkLoader.fromWeights with synthetic FP32 tensors, bypassing this whole loading path.

End-to-end status against unsloth/Apertus-8B-Instruct-2509-GGUF (Q4_K_S, 4.7 GB on disk, -PapertusTestMaxHeap=18g): load → convert → fromWeights → OptimizedLLMRuntime.forward(BOS)
- Module construction: ✅ 35 top-level modules, ~13 s
- Loader populates 32 xIELU layers, 131 FP32 tensors, 193 quantized tensors with correct logical shapes
- Forward pass: completes (~30 s) but produces NaN logits

The NaN points at xIELU. Stored alpha_p values for early layers are extreme (alpha_p[0]=166, alpha_p[1]=174, …), and ApertusXIELU's softplus(alpha_p) reduces to ~alpha_p for any value > 20, so the activation evaluates to alpha_p_eff * x² ~= 166 * x² and overflows once x grows past a few. Either the unsloth GGUF stores these values in a different transformed space than ApertusXIELU expects, or the formula in ApertusXIELU.kt diverges from the Apertus-8B reference implementation. Tracking as a follow-up — the wiring change in this commit is what unblocks getting that far in the first place; a matching xIELU formula audit / unsloth-vs-HF-source comparison is its own thread.

Verified loader-only smoke tests still pass (peek, tensor presence, loadQuantized, fromGguf module build).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
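For illustration, a minimal sketch of the relayout described in item 2 above, assuming the standard GGML Q4_K super-block of 256 weights packed into 144 bytes; the function name and the ByteArray-based signature are placeholders, not the actual ApertusMemSegConverter API.

```kotlin
// Illustrative sketch (not the converter's real code): re-order Q4_K blocks from
// GGUF row-major [row, block] into the input-block-major [block, row] layout
// the JVM matmul kernel expects. Constants assume a Q4_K super-block of
// 256 weights packed into 144 bytes.
private const val Q4_K_BLOCK_BYTES = 144
private const val Q4_K_BLOCK_WEIGHTS = 256

fun relayoutQ4KToBlockMajor(src: ByteArray, rows: Int, cols: Int): ByteArray {
    val blocksPerRow = cols / Q4_K_BLOCK_WEIGHTS
    val dst = ByteArray(src.size)
    for (row in 0 until rows) {
        for (block in 0 until blocksPerRow) {
            val srcOff = (row * blocksPerRow + block) * Q4_K_BLOCK_BYTES // [row, block]
            val dstOff = (block * rows + row) * Q4_K_BLOCK_BYTES         // [block, row]
            src.copyInto(dst, dstOff, srcOff, srcOff + Q4_K_BLOCK_BYTES)
        }
    }
    return dst
}
```

The same index swap, with 210-byte blocks, would apply to the Q6_K path.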
Summary
Towards #100. Three connected fixes that mirror Gemma's Q4_K end-to-end path so the standard transformer DSL forward path can consume Apertus quantized weights instead of crashing at the first matmul.
What's in
- Per-tensor quant sidecar on `ApertusWeights` — new `quantTypes` / `logicalShapes` / `quantBytes` fields. Default empty; non-`NATIVE_OPTIMIZED` callers see no change.
- New jvmMain `ApertusMemSegConverter` mirroring `GemmaMemSegConverter`:
  - `Q4_K` → `Q4_KBlockTensorData` with row-major → block-major relayout (144 B/block)
  - `Q6_K` → `Q6_KBlockTensorData` (210 B/block)
  - `Q4_0` / `Q8_0` → `Q4MemorySegmentTensorData` / `Q8MemorySegmentTensorData`
  - `Q5_K` → fallback to FP32 dequant (no packed kernel yet; Apertus-8B Q4_K_S has only 8 of these)

  Drains the loader's `quantBytes` as it goes and drops the sidecar maps from the result so we don't carry 5 GB of one-shot byte data through the runtime.
- Loader produces logical `[out, in]` shapes, not GGUF `[in, out]` — fixes two long-standing latent bugs (see the sketch below):
  - GGUF stores dims fastest-varying-first, while the rest of the codebase follows `[out, in]`. Added a `logicalShape()` helper that reverses (mirrors Gemma's `reversedShape`); replaced every loader callsite.
  - `createTensor` was running `transposeColumnMajorToRowMajor` on every 2D tensor and flipping its shape on top of the un-reversed input. After the shape fix, that transpose was actively scrambling data. Simplified to a direct `fromFloatArray` with the logical shape.

Neither tripped a unit test because Apertus DSL pipeline tests use synthetic FP32 weights via `fromWeights`, bypassing this whole loading path.
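To make the dim-reversal concrete, a minimal sketch assuming the loader receives GGUF dims as a `List<Long>`; the real `logicalShape()` in the Apertus loader may differ in signature.

```kotlin
// Hypothetical sketch of the dim-reversal helper; not the loader's actual code.
// GGUF lists dims fastest-varying-first, e.g. [4096, 131072] for an embedding
// that is logically [vocab = 131072, dim = 4096].
fun logicalShape(ggufDims: List<Long>): IntArray =
    ggufDims.asReversed().map { it.toInt() }.toIntArray()

fun main() {
    // [4096, 131072] from GGUF becomes the logical [131072, 4096]
    println(logicalShape(listOf(4096L, 131072L)).contentToString())
}
```

With the reversal in place, `createTensor` can hand the GGUF float data straight to `fromFloatArray` with this shape; the old `transposeColumnMajorToRowMajor` call would now be reordering already-correct row-major data, which is why it was dropped.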
End-to-end status against `unsloth/Apertus-8B-Instruct-2509-GGUF` (Q4_K_S):
- `peek()` / tensor presence / `loadQuantized()`
- `ApertusNetworkLoader.fromGguf().load()` (module build)
- `[1, dim]` per-token vector
- `forward(BOS)` runs to completion

Why merge as-is
The activation issue (#103) is an independent thread — a formula audit against the HF Apertus reference, not loader work. Everything in this PR is necessary and correct for the loader/converter side; punting the merge until #103 is solved would block any other Apertus loader work that depends on the corrected shapes and the converter API.
When #103 lands, the existing `forward pass on real Apertus produces finite logits` and `greedy generate` smoke tests in `ApertusRealGgufLoadingTest` should pass without additional loader changes.

Test plan
- `peek detects apertus architecture and reads metadata fields` — ✅
- `streaming reader exposes every tensor required by the apertus loader` — ✅
- `loadQuantized fully populates ApertusQuantizedWeights from real GGUF` — ✅
- `ApertusNetworkLoader fromGguf builds module from real Q4_K_S GGUF` — ✅ (12 GB heap)
- `forward pass on real Apertus produces finite logits of vocab size` — ❌ NaN, blocked on Apertus xIELU formula / param-space mismatch (forward pass NaN's on real Apertus-8B Q4_K_S, #103)
- `greedy generate on real Apertus produces in-vocab token sequence` — ❌ blocked on Apertus xIELU formula / param-space mismatch (forward pass NaN's on real Apertus-8B Q4_K_S, #103)

🤖 Generated with Claude Code