fix(apertus): real-model loading — UInt metadata + quantized shape #98
Merged
michalharakal merged 7 commits into develop on May 2, 2026
Conversation
Add jvmTest coverage for the two fixes in 632d10c:
- UnifiedModelLoaderUIntMetadataTest — peek() must read uint32/uint64 metadata fields, not silently fall back to defaults. Drives the toIntValue() helper through every numeric type GGUFs emit. Stopgap workaround until skainet-io-gguf 0.22.2 (SKaiNET#585) is consumed.
- ApertusWeightLoaderQuantizedShapeTest — NATIVE_OPTIMIZED branch must wrap quantized payloads with Shape(bytes.size), not the GGUF logical shape, since block-quantized formats (Q8_0, Q4_K, …) have a different byte count than element count.

Wires jvmTest deps into llm-core (kotlin-test, junit, skainet io-gguf, skainet io-core). Promotes ApertusWeightLoader.streamingTensorToTensor to internal so the regression test can drive it directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
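The byte-vs-element mismatch this commit pins down can be made concrete. A sketch with illustrative numbers (the weight dimensions are assumed, not from the PR):

```kotlin
// Q8_0 packs 32 elements into one 34-byte block: a 2-byte fp16 scale
// followed by 32 int8 quants. So byte count != element count, and wrapping
// the raw byte buffer with the logical shape size-mismatches.
val logicalElements = 4096L * 4096L   // e.g. a hypothetical [4096, 4096] weight
val blockElems = 32
val blockBytes = 34                   // 2 (fp16 scale) + 32 (int8 quants)
val byteCount = logicalElements / blockElems * blockBytes
// byteCount = 17,825,792 bytes vs 16,777,216 logical elements —
// hence Shape(bytes.size) over the raw buffer, not the GGUF logical shape.
```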
Drop the local Any?.toIntValue() stopgap now that skainet-io-gguf 0.22.2 ships the public Map<String, Any?>.getInt(...) extension that handles every signed and unsigned integer type the GGUF reader emits (SKaiNET-developers/SKaiNET#586). Bump skainet pin to 0.22.2. Composite build (settings.gradle.kts already includes ../SKaiNET) substitutes the pin with the local sibling project until the artifacts land on Maven Central. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4_K_S
End-to-end loader verification against unsloth/Apertus-8B-Instruct-2509-GGUF
(Q4_K_S quant). Tests skip cleanly when the GGUF is not on disk so CI without
the model stays green; resolves the path from APERTUS_GGUF_PATH or the HF
cache.
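The skip-when-absent behavior implies path resolution along these lines — a hedged sketch, not the PR's actual code; the function name and the HF cache directory layout are assumptions (the env var APERTUS_GGUF_PATH and the repo name are from the PR text):

```kotlin
import java.io.File

// Illustrative resolver: explicit env override first, then the standard
// huggingface_hub cache layout. Returns null so callers can self-skip.
fun resolveApertusGguf(): File? {
    System.getenv("APERTUS_GGUF_PATH")?.let { path ->
        File(path).takeIf { it.isFile }?.let { return it }
    }
    val hfHome = System.getenv("HF_HOME")
        ?: "${System.getProperty("user.home")}/.cache/huggingface"
    val snapshots = File("$hfHome/hub/models--unsloth--Apertus-8B-Instruct-2509-GGUF/snapshots")
    return snapshots.walkTopDown().firstOrNull { it.isFile && it.extension == "gguf" }
}
```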
Three coverage levels:
1. peek() — architecture/family/dims via UnifiedModelLoader on the real file
2. tensor-presence — every required tensor name from ApertusTensorNames is
present in the GGUF (catches name-mapping drift from upstream tooling)
3. loadQuantized() — full ApertusQuantizedWeights round-trip, asserting
fp32Tensors + quantizedTensors + per-layer xIELU params
Verified against the real model: 32 layers, dim=4096, ctx=65536, vocab=131072,
131 FP32 small tensors + 193 quantized tensors + 32 xIELU param sets — all
populated.
Apertus-8B's token_embedding alone dequants to ~2 GB FP32; the loadQuantized
test self-skips when JVM heap is below 8 GB and prints a hint to rerun with
-PapertusTestMaxHeap=12g (mirroring the gemma module's override pattern).
build.gradle.kts gains the same `apertusTestMaxHeap` Gradle property so the
default 6g stays CI-friendly.
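The heap gate described above likely amounts to a check against `Runtime.maxMemory()` (which reports the -Xmx ceiling). A minimal sketch; the real test's wording and skip mechanism may differ:

```kotlin
// Self-skip when the configured max heap cannot hold the ~2 GB dequantized
// token_embedding plus working set. Threshold from the PR text: 8 GB.
val maxHeapGb = Runtime.getRuntime().maxMemory() / (1024.0 * 1024.0 * 1024.0)
if (maxHeapGb < 8.0) {
    println(
        "Skipping loadQuantized test: max heap %.1f GB < 8 GB. ".format(maxHeapGb) +
        "Rerun with -PapertusTestMaxHeap=12g."
    )
    return  // or JUnit's assumption mechanism, to report the test as skipped
}
```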
Known limitation surfaced (out of scope for this PR): ApertusNetworkLoader
fromGguf().load() runs apertusNetwork(metadata) which pre-allocates FP32 zero
tensors for every Linear layer at construction time — for Apertus-8B that's
~27 GB before WeightMapper substitutes the loaded tensors, OOMing under 32 GB
heap. Combined with the cleanup commit 8a7e0ff removing ApertusQuantizedRuntime
(the memory-efficient runtime path), there is no path to run Apertus-8B Q4_K_S
end-to-end on a normal-sized JVM. Documented as a follow-up in the test class
kdoc; the fix lives in the SKaiNET DSL (NetworkBuilder.kt zeros() at line ~652).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps the skainet pin to 0.23.0, which carries the lazy parameter-init fix from SKaiNET-developers/SKaiNET#588 (Issue #587). Real-model loading via ApertusNetworkLoader.fromGguf().load() now succeeds in ≤12 GB heap on Apertus-8B Q4_K_S; the previous eager FP32 zero allocation in NetworkBuilder consumed ~27 GB before WeightMapper had a chance to substitute the loaded tensors.

The integration test gains the previously-skipped heavy assertion:
- ApertusNetworkLoader fromGguf builds module from real Q4_K_S GGUF — loads the actual unsloth/Apertus-8B-Instruct-2509 GGUF (4.7 GB) and asserts the produced module has the expected top-level structure (token_embd, output_norm, output, plus 32 transformer blocks). Self-skips when JVM heap is below 8 GB; rerun with -PapertusTestMaxHeap=12g for default-CI users.

Composite-build gate added in settings.gradle.kts: pass -PuseLocalSkaiNet=false to opt out of `includeBuild("../SKaiNET")` and resolve sk.ainet.core:* from mavenLocal / mavenCentral instead. Useful for testing a published-to-mavenLocal SKaiNET version end-to-end without renaming the sibling checkout. Default behavior unchanged (composite still auto-enables when ../SKaiNET exists).

Verified end-to-end against published-to-mavenLocal skainet 0.23.0: :llm-core:jvmTest + :llm-inference:apertus:jvmTest with -PuseLocalSkaiNet=false → 100 tests, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
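The composite-build gate described in that commit plausibly looks like the following settings.gradle.kts fragment (a sketch; the property name `useLocalSkaiNet` and the `../SKaiNET` path are from the PR text, the exact wiring is an assumption):

```kotlin
// settings.gradle.kts — opt-out gate for the SKaiNET composite build.
// Default: composite auto-enables when the sibling checkout exists.
val useLocalSkaiNet = providers.gradleProperty("useLocalSkaiNet")
    .map(String::toBoolean)
    .getOrElse(true)

if (useLocalSkaiNet && file("../SKaiNET").exists()) {
    // Substitutes the sk.ainet.core:* version pin with the local project.
    includeBuild("../SKaiNET")
}
// With -PuseLocalSkaiNet=false, sk.ainet.core:* resolves from
// mavenLocal / mavenCentral instead.
```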
ApertusWeightLoader.streamingTensorToTensor / readerTensorToTensor wrap quantized weights with a byte-level rank-1 shape under QuantPolicy.NATIVE_OPTIMIZED so the native FFM kernels can address the block layout directly. That works for matmul (the kernel knows the logical shape from metadata) but breaks Embedding.gather, which requires the logical rank-2 [vocab, dim] shape — a rank-1 weight tensor errors with "gather: unsupported input rank 1".

Surfaced by ApertusNetworkLoader.fromGguf().load() on the real unsloth/Apertus-8B-Instruct-2509 Q4_K_S: token_embd is stored as Q4_K in the GGUF and gets the byte-level shape, so the very first forward pass through the embedding layer dies before any logit math.

Add loadStreamingTensor / loadReaderTensor wrappers around the existing *ToTensor helpers. They route token_embd.weight through the dequant path (DequantOps.dequantFromBytes → createTensor with the logical [vocab, dim] shape) when quantPolicy is NATIVE_OPTIMIZED and the tensor is a quantized type. Other tensors keep their NATIVE_OPTIMIZED byte-level layout for kernel dispatch.

The integration test class kdoc documents the next blocker that prevents end-to-end inference (linearProject in MultiHeadAttention calls ops.transpose on byte-shape weights for Q/K/V/O and FFN projections, which Gemma solves via Q4_KBlockTensorData but Apertus doesn't yet implement). Tracked as #100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
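The routing that commit describes might be sketched as follows. Heavily hedged: the wrapper names `loadStreamingTensor` / `streamingTensorToTensor`, the tensor name, and `DequantOps.dequantFromBytes` are from the PR text, but the parameter types and signatures here are assumptions, not the actual SKaiNET API:

```kotlin
// Hypothetical shape of the dispatch: token_embd.weight must keep its
// logical [vocab, dim] shape for Embedding.gather; every other quantized
// tensor keeps the byte-level rank-1 layout for native kernel dispatch.
fun loadStreamingTensor(
    name: String,
    tensor: StreamingTensor,       // assumed type from the streaming reader
    quantPolicy: QuantPolicy,
): Tensor {
    val needsLogicalShape = name == "token_embd.weight"
    return if (quantPolicy == QuantPolicy.NATIVE_OPTIMIZED &&
        tensor.isQuantized && needsLogicalShape
    ) {
        // Dequant path: FP32 data under the logical [vocab, dim] shape.
        val fp32 = DequantOps.dequantFromBytes(tensor.bytes, tensor.type, tensor.shape)
        createTensor(fp32, tensor.shape)
    } else {
        // Existing behavior: byte-level Shape(bytes.size) wrapping.
        streamingTensorToTensor(tensor, quantPolicy)
    }
}
```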
skainet 0.23.0 (PR SKaiNET-developers/SKaiNET#590) is now live on Maven Central, unblocking dependency resolution for the 0.23.0 pin in libs.versions.toml. Empty commit to force a CI re-run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two fixes that together unblock loading real-world Apertus GGUFs end-to-end, plus regression tests pinning both.
1. UnifiedModelLoader.peek reads UInt/ULong GGUF metadata

GGUFs emitted by recent llama.cpp converters store dimensions and counts as uint32. The reader preserves them as UInt, which does not extend kotlin.Number in Kotlin. The previous (value as? Number)?.toInt() pattern silently returned null for every modern GGUF, so peek() defaulted contextLength=4096, blockCount=0, embeddingLength=0 — i.e. a transformer with zero layers.

Local stopgap: an Any?.toIntValue() helper that handles Int/UInt/Long/ULong/Short/UShort/Byte/UByte. The same bug exists upstream in GgufModelMetadata.from() and is being addressed in SKaiNET-developers/SKaiNET#586 (target: 0.22.2). Once that lands, toIntValue here can be deleted.
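A plausible shape of that stopgap helper — the name and the handled types are from this PR, but the body is our reconstruction, not the merged code:

```kotlin
// Kotlin's unsigned types (UInt, ULong, UShort, UByte) do not extend
// kotlin.Number, so `(value as? Number)?.toInt()` returns null for them.
// An exhaustive when-over-types covers everything a GGUF reader can emit.
fun Any?.toIntValue(): Int? = when (this) {
    is Int -> this
    is UInt -> toInt()
    is Long -> toInt()
    is ULong -> toLong().toInt()
    is Short -> toInt()
    is UShort -> toInt()
    is Byte -> toInt()
    is UByte -> toInt()
    else -> null
}
```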
2. ApertusWeightLoader.streamingTensorToTensor byte-level shape for quantized

The NATIVE_OPTIMIZED branch shared a body with RAW_BYTES and used the GGUF tensor's logical shape when wrapping the raw byte buffer. For block-quantized formats (Q4_K, Q8_0, …) the byte count differs from the logical element count (e.g. Q8_0 stores 34 bytes per 32 elements), so ctx.fromByteArray size-mismatched. Fix mirrors the LlamaWeightLoader pattern: wrap with Shape(bytes.size).

Commits
632d10c Fixing apertung weigth loaders — the two fixes themselves.
42b2210 test(apertus): pin weight-loader fixes with regression tests — UnifiedModelLoaderUIntMetadataTest (4 cases: uint32 / uint64 / int32 scalars + missing-fields default), ApertusWeightLoaderQuantizedShapeTest (Q8_0 + Q4_K). Wires jvmTest deps into llm-core. Promotes ApertusWeightLoader.streamingTensorToTensor to internal so the regression test can drive it.

Test plan
:llm-core:jvmTest — all green (UnifiedModelLoaderUIntMetadataTest + existing suite)
:llm-inference:apertus:jvmTest — all green (ApertusWeightLoaderQuantizedShapeTest + existing suite)

Follow-up
After SKaiNET#586 ships in 0.22.2, drop the Any?.toIntValue() helper in UnifiedModelLoader and switch to upstream getInt.

🤖 Generated with Claude Code