
feat(apertus): block-major TensorData wiring for NATIVE_OPTIMIZED Q4_K E2E #102

Merged
michalharakal merged 1 commit into develop from feature/apertus-q4k-e2e on May 2, 2026

Conversation

@michalharakal (Contributor) commented on May 2, 2026

Summary

Towards #100. Three connected fixes that mirror Gemma's Q4_K end-to-end path so the standard transformer DSL forward can consume Apertus quantized weights instead of crashing at the first matmul.

What's in

  1. Per-tensor quant sidecar on ApertusWeights — new quantTypes / logicalShapes / quantBytes fields. Default empty; non-NATIVE_OPTIMIZED callers see no change.

  2. New jvmMain ApertusMemSegConverter mirroring GemmaMemSegConverter (relayout sketch after this list):

    • Q4_K → Q4_KBlockTensorData with row-major → block-major relayout (144 B/block)
    • Q6_K → Q6_KBlockTensorData (210 B/block)
    • Q4_0 / Q8_0 → Q4MemorySegmentTensorData / Q8MemorySegmentTensorData
    • Q5_K → fallback to FP32 dequant (no packed kernel yet; Apertus-8B Q4_K_S has only 8 of these)
      Drains the loader's quantBytes as it goes and drops the sidecar maps from the result so we don't carry 5 GB of one-shot byte data through the runtime.
  3. Loader produces logical [out, in] shapes, not GGUF [in, out] — fixes two long-standing latent bugs:

    • GGUF lists dims fastest-varying-first; the rest of this codebase follows PyTorch [out, in]. Added logicalShape() helper that reverses (mirrors Gemma's reversedShape); replaced every loader callsite.
    • createTensor was running transposeColumnMajorToRowMajor on every 2D tensor and flipping its shape on top of the un-reversed input. After the shape fix, that transpose was actively scrambling data. Simplified to a direct fromFloatArray with the logical shape.
      Neither tripped a unit test because Apertus DSL pipeline tests use synthetic FP32 weights via fromWeights, bypassing this whole loading path.
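
A minimal sketch of the row-major → block-major relayout idea from item 2, assuming 144-byte Q4_K super-blocks covering 256 weights each; the function name and signature here are hypothetical, not the actual ApertusMemSegConverter API:

```kotlin
// Hypothetical sketch, not the real converter code.
// GGUF stores Q4_K super-blocks row-major: all blocks of row 0, then row 1, ...
// The vectorized kernel wants input-block-major: block 0 of every row first,
// then block 1 of every row, so each input chunk streams contiguously.
const val Q4_K_BLOCK_BYTES = 144   // one 256-weight Q4_K super-block
const val QK_K = 256               // weights per super-block

fun relayoutRowMajorToBlockMajor(src: ByteArray, rows: Int, cols: Int): ByteArray {
    require(cols % QK_K == 0) { "cols must be a multiple of $QK_K" }
    val blocksPerRow = cols / QK_K
    val dst = ByteArray(src.size)
    for (row in 0 until rows) {
        for (block in 0 until blocksPerRow) {
            val from = (row * blocksPerRow + block) * Q4_K_BLOCK_BYTES   // [row, block]
            val to = (block * rows + row) * Q4_K_BLOCK_BYTES             // [block, row]
            src.copyInto(dst, to, from, from + Q4_K_BLOCK_BYTES)
        }
    }
    return dst
}
```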

End-to-end status against unsloth/Apertus-8B-Instruct-2509-GGUF Q4_K_S

| Stage | Result |
| --- | --- |
| peek() / tensor presence / loadQuantized() | ✅ |
| ApertusNetworkLoader.fromGguf().load() (module build) | ✅ 12 GB heap |
| Embedding lookup (token_embd dequant + gather) | ✅ produces correct [1, dim] per-token vector |
| Q4_K transpose + matmul through attention projections | ✅ no longer throws |
| Full forward(BOS) runs to completion | ✅ ~30 s, 18 GB heap |
| Logits are finite | ❌ all-NaN; xIELU formula / param-space mismatch, tracked in #103 |

Why merge as-is

The activation (#103) is an independent thread — formula audit against the HF Apertus reference, not loader work. Everything in this PR is necessary and correct for the loader/converter side; punting the merge until #103 is solved would block any other Apertus loader work that depends on the corrected shapes and the converter API.

Once #103 lands, the existing forward pass on real Apertus should produce finite logits, and the greedy-generate smoke tests in ApertusRealGgufLoadingTest should pass without additional loader changes.

Test plan

🤖 Generated with Claude Code

feat(apertus): block-major TensorData wiring for NATIVE_OPTIMIZED Q4_K E2E

Mirrors Gemma's Q4_K end-to-end pattern (Issue #100). Three connected
fixes that together let the standard transformer DSL forward path
consume Apertus quantized weights instead of crashing at the first
matmul.

1. Per-tensor quant sidecar on ApertusWeights
   New `quantTypes` / `logicalShapes` / `quantBytes` fields on
   ApertusWeights. The loader populates them under
   QuantPolicy.NATIVE_OPTIMIZED so a JVM-side converter can re-wrap
   each tensor in the right block-major TensorData. Default empty so
   non-NATIVE_OPTIMIZED callers are unaffected.

2. New jvmMain ApertusMemSegConverter
   Mirrors GemmaMemSegConverter:
   - Q4_K → Q4_KBlockTensorData with relayout from GGUF row-major
     [row, block] to the input-block-major [block, row] layout that
     JvmQuantizedVectorKernels.matmulQ4_KVec expects
   - Q6_K → Q6_KBlockTensorData (same relayout, 210B/block)
   - Q4_0 / Q8_0 → Q4MemorySegmentTensorData / Q8MemorySegmentTensorData
   - Q5_K → fallback to FP32 dequant (no packed kernel yet; Apertus-8B
     Q4_K_S has only 8 Q5_K tensors so the cost is small)
   Drains the loader's quantBytes map as it goes; drops the sidecar
   maps from the result. Without this the forward pass blew up
   immediately with "Transpose requires at least 2 dimensions" because
   NATIVE_OPTIMIZED was storing 5 GB of byte-shape rank-1 Int8 tensors.

3. Loader produces logical [out, in] shapes, not GGUF [in, out]
   Two long-standing latent bugs that only surfaced once weights
   actually flowed through to a forward pass:

   - GGUF stores tensor dims fastest-varying-first ([4096, 131072] for
     a [vocab=131072, dim=4096] embedding); the rest of this codebase
     follows PyTorch [out, in] convention. Apertus loader was
     constructing tensors with the GGUF order. Added `logicalShape()`
     helper that reverses the dim list, mirroring Gemma's
     `reversedShape`. Replaced every callsite.

   - `createTensor` was running `transposeColumnMajorToRowMajor` on
     every 2D tensor and flipping its shape, on top of the
     un-reversed input. After the shape fix, that transpose was
     actively scrambling tensor data. Simplified createTensor to a
     direct fromFloatArray with the (now-correct) logical shape.

   Neither bug ever tripped a unit test because Apertus DSL pipeline
   tests use ApertusNetworkLoader.fromWeights with synthetic FP32
   tensors, bypassing this whole loading path.
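
A minimal sketch of the shape fix in item 3, assuming the helper simply reverses the GGUF dim list (mirroring Gemma's reversedShape); the name is taken from the text above, but the exact signature in the Apertus loader may differ:

```kotlin
// Hypothetical sketch of logicalShape(): GGUF lists dims fastest-varying-first,
// while the codebase follows the PyTorch [out, in] convention, so the logical
// shape is just the GGUF dim list reversed.
fun logicalShape(ggufDims: List<Int>): List<Int> = ggufDims.reversed()

fun main() {
    // Token embedding example from above: GGUF reports [4096, 131072] for the
    // [vocab = 131072, dim = 4096] tensor.
    check(logicalShape(listOf(4096, 131072)) == listOf(131072, 4096))
    // GGUF's flat data is already row-major with respect to the reversed shape,
    // so createTensor can call fromFloatArray directly; no transpose is needed.
}
```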

End-to-end status against unsloth/Apertus-8B-Instruct-2509-GGUF
(Q4_K_S, 4.7 GB on disk, -PapertusTestMaxHeap=18g):

   load → convert → fromWeights → OptimizedLLMRuntime.forward(BOS)
   - Module construction: ✅ 35 top-level modules, ~13 s
   - Loader populates 32 xIELU layers, 131 FP32 tensors,
     193 quantized tensors with correct logical shapes
   - Forward pass: completes (~30 s) but produces NaN logits

The NaN points at xIELU. Stored alpha_p values for early layers are
extreme (alpha_p[0]=166, alpha_p[1]=174, …), and ApertusXIELU's
softplus(alpha_p) reduces to ~alpha_p for any value > 20, so the
activation evaluates to alpha_p_eff * x² ~= 166 * x² and overflows
once x grows past a few. Either the unsloth GGUF stores these in a
different transformed space than ApertusXIELU expects, or the
formula in ApertusXIELU.kt diverges from the Apertus-8B reference
implementation. Tracking as follow-up — the wiring change in this
commit is what unblocks getting that far in the first place; a
matching xIELU formula audit / unsloth-vs-HF-source comparison is
its own thread.
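
To make the overflow mechanism concrete, here is a standalone numeric check of the softplus saturation described above (an illustration only, not ApertusXIELU.kt; it assumes the positive branch behaves like softplus(alpha_p) * x²):

```kotlin
import kotlin.math.exp
import kotlin.math.ln

// Standalone illustration of why softplus(alpha_p) ~ alpha_p for large inputs.
fun softplus(a: Double): Double = ln(1.0 + exp(a))

fun main() {
    println(softplus(20.0))    // ~20.000000002: already numerically ~identity
    println(softplus(166.0))   // exactly 166.0 in Double (e^166 ~ 1.3e72 still fits)
    // With alpha_p_eff ~ 166 the quadratic term compounds doubly exponentially:
    var x = 1.0
    repeat(8) {
        x = 166.0 * x * x
        println(x)             // 166, 4.6e6, 3.5e15, ... reaches Infinity by step 8
    }
}
```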

Verified loader-only smoke tests still pass (peek, tensor presence,
loadQuantized, fromGguf module build).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
michalharakal merged commit 52ff395 into develop on May 2, 2026
2 checks passed
michalharakal deleted the feature/apertus-q4k-e2e branch on May 2, 2026 at 19:41