
fix(apertus): real-model loading — UInt metadata + quantized shape #98

Merged
michalharakal merged 7 commits into develop from fix/apertus-real-loading on May 2, 2026
Conversation

@michalharakal
Contributor

Summary

Two fixes that together unblock loading real-world Apertus GGUFs end-to-end, plus regression tests pinning both.

1. UnifiedModelLoader.peek reads UInt/ULong GGUF metadata

GGUFs emitted by recent llama.cpp converters store dimensions and counts as uint32. The reader preserves them as UInt, and Kotlin's UInt does not extend kotlin.Number. The previous (value as? Number)?.toInt() pattern therefore silently returned null for every modern GGUF, so peek() fell back to its defaults contextLength=4096, blockCount=0, embeddingLength=0, i.e. a transformer with zero layers.

Local stopgap: an Any?.toIntValue() helper that handles Int / UInt / Long / ULong / Short / UShort / Byte / UByte. The same bug exists upstream in GgufModelMetadata.from() and is being addressed in SKaiNET-developers/SKaiNET#586 (target: 0.22.2). Once that lands, toIntValue here can be deleted.
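
A minimal sketch of what such a helper can look like (the name toIntValue comes from this PR; the exact body is an assumption):

```kotlin
// Sketch only: the old (value as? Number)?.toInt() pattern misses Kotlin's
// unsigned types, which are inline classes that do not implement kotlin.Number.
fun Any?.toIntValue(): Int? = when (this) {
    is Int -> this
    is UInt -> toInt()
    is Long -> toInt()
    is ULong -> toInt()
    is Short -> toInt()
    is UShort -> toInt()
    is Byte -> toInt()
    is UByte -> toInt()
    else -> null
}
```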

2. ApertusWeightLoader.streamingTensorToTensor uses a byte-level shape for quantized tensors

The NATIVE_OPTIMIZED branch shared a body with RAW_BYTES and used the GGUF tensor's logical shape when wrapping the raw byte buffer. For block-quantized formats (Q4_K, Q8_0, …) the byte count differs from the logical element count (e.g. Q8_0 stores 34 bytes per 32 elements), so ctx.fromByteArray failed with a size mismatch. The fix mirrors the LlamaWeightLoader pattern: wrap with Shape(bytes.size).
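
To make the mismatch concrete, a rough back-of-the-envelope in Kotlin (Shape and ctx.fromByteArray are the names used above; the tensor dimensions are illustrative):

```kotlin
// Q8_0 stores each block of 32 logical elements as 34 bytes
// (2-byte fp16 scale + 32 int8 quants).
val logicalShape = longArrayOf(131_072, 4_096)      // e.g. token_embd [vocab, dim]
val elementCount = logicalShape.reduce(Long::times) // 536_870_912 elements
val byteCount = elementCount / 32 * 34              // 570_425_344 bytes

// Wrapping the raw payload with the logical shape asks ctx.fromByteArray for
// `elementCount` entries but hands it only `byteCount` bytes -> size mismatch.
// The fix (mirroring LlamaWeightLoader) wraps the payload as a rank-1 byte tensor:
//   ctx.fromByteArray(bytes, Shape(bytes.size))
```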

Commits

  • 632d10c Fixing Apertus weight loaders — the two fixes themselves.
  • 42b2210 test(apertus): pin weight-loader fixes with regression tests. Adds UnifiedModelLoaderUIntMetadataTest (4 cases: uint32 / uint64 / int32 scalars + missing-fields default) and ApertusWeightLoaderQuantizedShapeTest (Q8_0 + Q4_K). Wires jvmTest deps into llm-core. Promotes ApertusWeightLoader.streamingTensorToTensor to internal so the regression test can drive it.

Test plan

  • :llm-core:jvmTest — all green (UnifiedModelLoaderUIntMetadataTest + existing suite)
  • :llm-inference:apertus:jvmTest — all green (ApertusWeightLoaderQuantizedShapeTest + existing suite)
  • Combined: 96 tests, 0 failures across both modules
  • End-to-end load of a real Apertus GGUF on JVM (manual)

Follow-up

After SKaiNET#586 ships in 0.22.2, drop the Any?.toIntValue() helper in UnifiedModelLoader and switch to upstream getInt.

🤖 Generated with Claude Code

michalharakal and others added 4 commits May 2, 2026 12:56
Add jvmTest coverage for the two fixes in 632d10c:

- UnifiedModelLoaderUIntMetadataTest — peek() must read uint32/uint64
  metadata fields, not silently fall back to defaults. Drives the
  toIntValue() helper through every numeric type GGUFs emit. Stopgap
  workaround until skainet-io-gguf 0.22.2 (SKaiNET#585) is consumed.

- ApertusWeightLoaderQuantizedShapeTest — NATIVE_OPTIMIZED branch must
  wrap quantized payloads with Shape(bytes.size), not the GGUF logical
  shape, since block-quantized formats (Q8_0, Q4_K, …) have a different
  byte count than element count.

Wires jvmTest deps into llm-core (kotlin-test, junit, skainet io-gguf,
skainet io-core). Promotes ApertusWeightLoader.streamingTensorToTensor
to internal so the regression test can drive it directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the local Any?.toIntValue() stopgap now that skainet-io-gguf 0.22.2
ships the public Map<String, Any?>.getInt(...) extension that handles
every signed and unsigned integer type the GGUF reader emits
(SKaiNET-developers/SKaiNET#586).

Bump skainet pin to 0.22.2. Composite build (settings.gradle.kts already
includes ../SKaiNET) substitutes the pin with the local sibling project
until the artifacts land on Maven Central.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4_K_S

End-to-end loader verification against unsloth/Apertus-8B-Instruct-2509-GGUF
(Q4_K_S quant). Tests skip cleanly when the GGUF is not on disk so CI without
the model stays green; resolves the path from APERTUS_GGUF_PATH or the HF
cache.

Three coverage levels:
  1. peek() — architecture/family/dims via UnifiedModelLoader on the real file
  2. tensor-presence — every required tensor name from ApertusTensorNames is
     present in the GGUF (catches name-mapping drift from upstream tooling)
  3. loadQuantized() — full ApertusQuantizedWeights round-trip, asserting
     fp32Tensors + quantizedTensors + per-layer xIELU params

Verified against the real model: 32 layers, dim=4096, ctx=65536, vocab=131072,
131 FP32 small tensors + 193 quantized tensors + 32 xIELU param sets — all
populated.

Apertus-8B's token_embedding alone dequants to ~2 GB FP32; the loadQuantized
test self-skips when JVM heap is below 8 GB and prints a hint to rerun with
-PapertusTestMaxHeap=12g (mirroring the gemma module's override pattern).
build.gradle.kts gains the same `apertusTestMaxHeap` Gradle property so the
default 6g stays CI-friendly.
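
A minimal sketch of that self-skip guard, assuming JUnit 4's Assume (the 8 GB threshold and the -PapertusTestMaxHeap hint are from the commit text; everything else is illustrative):

```kotlin
import org.junit.Assume.assumeTrue

// Skip (not fail) when the JVM heap is too small for the dequant-heavy test.
private const val MIN_HEAP_BYTES = 8L * 1024 * 1024 * 1024

fun assumeLargeHeap() {
    assumeTrue(
        "Needs >= 8 GB heap; rerun with -PapertusTestMaxHeap=12g",
        Runtime.getRuntime().maxMemory() >= MIN_HEAP_BYTES
    )
}
```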

Known limitation surfaced (out of scope for this PR): ApertusNetworkLoader
fromGguf().load() runs apertusNetwork(metadata) which pre-allocates FP32 zero
tensors for every Linear layer at construction time — for Apertus-8B that's
~27 GB before WeightMapper substitutes the loaded tensors, OOMing under 32 GB
heap. Combined with the cleanup commit 8a7e0ff removing ApertusQuantizedRuntime
(the memory-efficient runtime path), there is no path to run Apertus-8B Q4_K_S
end-to-end on a normal-sized JVM. Documented as a follow-up in the test class
kdoc; the fix lives in the SKaiNET DSL (NetworkBuilder.kt zeros() at line ~652).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps the skainet pin to 0.23.0, which carries the lazy parameter-init
fix from SKaiNET-developers/SKaiNET#588 (Issue #587). Real-model
loading via ApertusNetworkLoader.fromGguf().load() now succeeds in
≤12 GB heap on Apertus-8B Q4_K_S; the previous eager FP32 zero
allocation in NetworkBuilder consumed ~27 GB before WeightMapper had a
chance to substitute the loaded tensors.

The integration test gains the previously-skipped heavy assertion:
- ApertusNetworkLoader fromGguf builds module from real Q4_K_S GGUF —
  loads the actual unsloth/Apertus-8B-Instruct-2509 GGUF (4.7 GB) and
  asserts the produced module has the expected top-level structure
  (token_embd, output_norm, output, plus 32 transformer blocks).

Self-skips when JVM heap is below 8 GB; rerun with
-PapertusTestMaxHeap=12g for default-CI users.

Composite-build gate added in settings.gradle.kts: pass
-PuseLocalSkaiNet=false to opt out of `includeBuild("../SKaiNET")` and
resolve sk.ainet.core:* from mavenLocal / mavenCentral instead.
Useful for testing a published-to-mavenLocal SKaiNET version end-to-end
without renaming the sibling checkout. Default behavior unchanged
(composite still auto-enables when ../SKaiNET exists).
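
A sketch of how such a gate can look in settings.gradle.kts (the property name, default, and sibling path come from the commit text; the exact wiring is an assumption):

```kotlin
// settings.gradle.kts (sketch)
val useLocalSkaiNet = providers.gradleProperty("useLocalSkaiNet")
    .map { it.toBoolean() }
    .getOrElse(true)

// Default: composite build auto-enables when ../SKaiNET exists;
// -PuseLocalSkaiNet=false opts out and resolves sk.ainet.core:* from repositories.
if (useLocalSkaiNet && rootDir.resolve("../SKaiNET").exists()) {
    includeBuild("../SKaiNET")
}
```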

Verified end-to-end against published-to-mavenLocal skainet 0.23.0:
:llm-core:jvmTest + :llm-inference:apertus:jvmTest with
-PuseLocalSkaiNet=false → 100 tests, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
michalharakal and others added 2 commits May 2, 2026 19:46
ApertusWeightLoader.streamingTensorToTensor / readerTensorToTensor wrap
quantized weights with byte-level rank-1 shape under
QuantPolicy.NATIVE_OPTIMIZED so the native FFM kernels can address the
block layout directly. That works for matmul (the kernel knows the
logical shape from metadata) but breaks Embedding.gather, which requires
the logical rank-2 [vocab, dim] shape — a rank-1 weight tensor errors
with "gather: unsupported input rank 1".

Surfaced by ApertusNetworkLoader.fromGguf().load() on real
unsloth/Apertus-8B-Instruct-2509 Q4_K_S: token_embd is stored as Q4_K
in the GGUF and gets the byte-level shape, so the very first forward
pass through the embedding layer dies before any logit math.

Add loadStreamingTensor / loadReaderTensor wrappers around the existing
*ToTensor helpers. They route token_embd.weight through the dequant
path (DequantOps.dequantFromBytes → createTensor with the logical
[vocab, dim] shape) when quantPolicy is NATIVE_OPTIMIZED and the
tensor is a quantized type. Other tensors keep their NATIVE_OPTIMIZED
byte-level layout for kernel dispatch.
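
A minimal sketch of the routing predicate (only token_embd.weight, QuantPolicy.NATIVE_OPTIMIZED / RAW_BYTES, and the rank-2 requirement come from the commit text; the rest is illustrative):

```kotlin
// Sketch only: names not mentioned in the commit text are assumptions.
enum class QuantPolicy { RAW_BYTES, NATIVE_OPTIMIZED }

fun keepsLogicalShape(tensorName: String, isQuantizedType: Boolean, policy: QuantPolicy): Boolean =
    tensorName == "token_embd.weight" &&     // Embedding.gather needs rank-2 [vocab, dim]
        policy == QuantPolicy.NATIVE_OPTIMIZED &&
        isQuantizedType                      // FP32 tensors already carry their logical shape
```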

The integration test class kdoc documents the next blocker that
prevents end-to-end inference (linearProject in MultiHeadAttention
calls ops.transpose on byte-shape weights for Q/K/V/O and FFN
projections, which Gemma solves via Q4_KBlockTensorData but Apertus
doesn't yet implement). Tracked as #100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
skainet 0.23.0 (PR SKaiNET-developers/SKaiNET#590) is now live on
Maven Central, unblocking dependency resolution for the 0.23.0 pin
in libs.versions.toml. Empty commit to force a CI re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 4cd1da9 into develop May 2, 2026
2 checks passed
@michalharakal michalharakal deleted the fix/apertus-real-loading branch May 2, 2026 18:42