
fix(apertus): real-model loading — UInt metadata + quantized shape #98

Merged
michalharakal merged 7 commits into develop from fix/apertus-real-loading on May 2, 2026
Conversation

@michalharakal
Contributor

Summary

Two fixes that together unblock loading real-world Apertus GGUFs end-to-end, plus regression tests pinning both.

1. UnifiedModelLoader.peek reads UInt/ULong GGUF metadata

GGUFs emitted by recent llama.cpp converters store dimensions and counts as uint32. The reader preserves them as UInt, and Kotlin's UInt does not extend kotlin.Number. The previous (value as? Number)?.toInt() pattern therefore silently returned null for every modern GGUF, so peek() fell back to its defaults contextLength=4096, blockCount=0, embeddingLength=0, i.e. a transformer with zero layers.

Local stopgap: an Any?.toIntValue() helper that handles Int / UInt / Long / ULong / Short / UShort / Byte / UByte. The same bug exists upstream in GgufModelMetadata.from() and is being addressed in SKaiNET-developers/SKaiNET#586 (target: 0.22.2). Once that lands, toIntValue here can be deleted.
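
A minimal sketch of what such a helper can look like (the name toIntValue comes from this PR; the exact body is an assumption):

```kotlin
// Sketch only: the old (value as? Number)?.toInt() pattern misses Kotlin's
// unsigned types, which are inline classes that do not implement kotlin.Number.
fun Any?.toIntValue(): Int? = when (this) {
    is Int -> this
    is UInt -> toInt()
    is Long -> toInt()
    is ULong -> toInt()
    is Short -> toInt()
    is UShort -> toInt()
    is Byte -> toInt()
    is UByte -> toInt()
    else -> null
}
```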

2. ApertusWeightLoader.streamingTensorToTensor uses a byte-level shape for quantized tensors

The NATIVE_OPTIMIZED branch shared a body with RAW_BYTES and used the GGUF tensor's logical shape when wrapping the raw byte buffer. For block-quantized formats (Q4_K, Q8_0, …) the byte count differs from the logical element count (e.g. Q8_0 stores 34 bytes per 32 elements), so ctx.fromByteArray failed with a size mismatch. The fix mirrors the LlamaWeightLoader pattern: wrap with Shape(bytes.size).
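
To make the mismatch concrete, a rough back-of-the-envelope in Kotlin (Shape and ctx.fromByteArray are the names used above; the tensor dimensions are illustrative):

```kotlin
// Q8_0 stores each block of 32 logical elements as 34 bytes
// (2-byte fp16 scale + 32 int8 quants).
val logicalShape = longArrayOf(131_072, 4_096)      // e.g. token_embd [vocab, dim]
val elementCount = logicalShape.reduce(Long::times) // 536_870_912 elements
val byteCount = elementCount / 32 * 34              // 570_425_344 bytes

// Wrapping the raw payload with the logical shape asks ctx.fromByteArray for
// `elementCount` entries but hands it only `byteCount` bytes -> size mismatch.
// The fix (mirroring LlamaWeightLoader) wraps the payload as a rank-1 byte tensor:
//   ctx.fromByteArray(bytes, Shape(bytes.size))
```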

Commits

  • 632d10c Fixing Apertus weight loaders — the two fixes themselves.
  • 42b2210 test(apertus): pin weight-loader fixes with regression tests. Adds UnifiedModelLoaderUIntMetadataTest (4 cases: uint32 / uint64 / int32 scalars + missing-fields default) and ApertusWeightLoaderQuantizedShapeTest (Q8_0 + Q4_K). Wires jvmTest deps into llm-core. Promotes ApertusWeightLoader.streamingTensorToTensor to internal so the regression test can drive it.

Test plan

  • :llm-core:jvmTest — all green (UnifiedModelLoaderUIntMetadataTest + existing suite)
  • :llm-inference:apertus:jvmTest — all green (ApertusWeightLoaderQuantizedShapeTest + existing suite)
  • Combined: 96 tests, 0 failures across both modules
  • End-to-end load of a real Apertus GGUF on JVM (manual)

Follow-up

After SKaiNET#586 ships in 0.22.2, drop the Any?.toIntValue() helper in UnifiedModelLoader and switch to upstream getInt.

🤖 Generated with Claude Code

michalharakal and others added 4 commits May 2, 2026 12:56
Add jvmTest coverage for the two fixes in 632d10c:

- UnifiedModelLoaderUIntMetadataTest — peek() must read uint32/uint64
  metadata fields, not silently fall back to defaults. Drives the
  toIntValue() helper through every numeric type GGUFs emit. Stopgap
  workaround until skainet-io-gguf 0.22.2 (SKaiNET#585) is consumed.

- ApertusWeightLoaderQuantizedShapeTest — NATIVE_OPTIMIZED branch must
  wrap quantized payloads with Shape(bytes.size), not the GGUF logical
  shape, since block-quantized formats (Q8_0, Q4_K, …) have a different
  byte count than element count.

Wires jvmTest deps into llm-core (kotlin-test, junit, skainet io-gguf,
skainet io-core). Promotes ApertusWeightLoader.streamingTensorToTensor
to internal so the regression test can drive it directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the local Any?.toIntValue() stopgap now that skainet-io-gguf 0.22.2
ships the public Map<String, Any?>.getInt(...) extension that handles
every signed and unsigned integer type the GGUF reader emits
(SKaiNET-developers/SKaiNET#586).

Bump skainet pin to 0.22.2. Composite build (settings.gradle.kts already
includes ../SKaiNET) substitutes the pin with the local sibling project
until the artifacts land on Maven Central.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4_K_S

End-to-end loader verification against unsloth/Apertus-8B-Instruct-2509-GGUF
(Q4_K_S quant). Tests skip cleanly when the GGUF is not on disk so CI without
the model stays green; resolves the path from APERTUS_GGUF_PATH or the HF
cache.

Three coverage levels:
  1. peek() — architecture/family/dims via UnifiedModelLoader on the real file
  2. tensor-presence — every required tensor name from ApertusTensorNames is
     present in the GGUF (catches name-mapping drift from upstream tooling)
  3. loadQuantized() — full ApertusQuantizedWeights round-trip, asserting
     fp32Tensors + quantizedTensors + per-layer xIELU params

Verified against the real model: 32 layers, dim=4096, ctx=65536, vocab=131072,
131 FP32 small tensors + 193 quantized tensors + 32 xIELU param sets — all
populated.

Apertus-8B's token_embedding alone dequants to ~2 GB FP32; the loadQuantized
test self-skips when JVM heap is below 8 GB and prints a hint to rerun with
-PapertusTestMaxHeap=12g (mirroring the gemma module's override pattern).
build.gradle.kts gains the same `apertusTestMaxHeap` Gradle property so the
default 6g stays CI-friendly.
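
A minimal sketch of that self-skip guard, assuming JUnit 4's Assume (the 8 GB threshold and the -PapertusTestMaxHeap hint are from the commit text; everything else is illustrative):

```kotlin
import org.junit.Assume.assumeTrue

// Skip (not fail) when the JVM heap is too small for the dequant-heavy test.
private const val MIN_HEAP_BYTES = 8L * 1024 * 1024 * 1024

fun assumeLargeHeap() {
    assumeTrue(
        "Needs >= 8 GB heap; rerun with -PapertusTestMaxHeap=12g",
        Runtime.getRuntime().maxMemory() >= MIN_HEAP_BYTES
    )
}
```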

Known limitation surfaced (out of scope for this PR): ApertusNetworkLoader
fromGguf().load() runs apertusNetwork(metadata) which pre-allocates FP32 zero
tensors for every Linear layer at construction time — for Apertus-8B that's
~27 GB before WeightMapper substitutes the loaded tensors, OOMing under 32 GB
heap. Combined with the cleanup commit 8a7e0ff removing ApertusQuantizedRuntime
(the memory-efficient runtime path), there is no path to run Apertus-8B Q4_K_S
end-to-end on a normal-sized JVM. Documented as a follow-up in the test class
kdoc; the fix lives in the SKaiNET DSL (NetworkBuilder.kt zeros() at line ~652).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps the skainet pin to 0.23.0, which carries the lazy parameter-init
fix from SKaiNET-developers/SKaiNET#588 (Issue #587). Real-model
loading via ApertusNetworkLoader.fromGguf().load() now succeeds in
≤12 GB heap on Apertus-8B Q4_K_S; the previous eager FP32 zero
allocation in NetworkBuilder consumed ~27 GB before WeightMapper had a
chance to substitute the loaded tensors.

The integration test gains the previously-skipped heavy assertion:
- ApertusNetworkLoader fromGguf builds module from real Q4_K_S GGUF —
  loads the actual unsloth/Apertus-8B-Instruct-2509 GGUF (4.7 GB) and
  asserts the produced module has the expected top-level structure
  (token_embd, output_norm, output, plus 32 transformer blocks).

Self-skips when JVM heap is below 8 GB; rerun with
-PapertusTestMaxHeap=12g for default-CI users.

Composite-build gate added in settings.gradle.kts: pass
-PuseLocalSkaiNet=false to opt out of `includeBuild("../SKaiNET")` and
resolve sk.ainet.core:* from mavenLocal / mavenCentral instead.
Useful for testing a published-to-mavenLocal SKaiNET version end-to-end
without renaming the sibling checkout. Default behavior unchanged
(composite still auto-enables when ../SKaiNET exists).
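
A sketch of how such a gate can look in settings.gradle.kts (the property name, default, and sibling path come from the commit text; the exact wiring is an assumption):

```kotlin
// settings.gradle.kts (sketch)
val useLocalSkaiNet = providers.gradleProperty("useLocalSkaiNet")
    .map { it.toBoolean() }
    .getOrElse(true)

// Default: composite build auto-enables when ../SKaiNET exists;
// -PuseLocalSkaiNet=false opts out and resolves sk.ainet.core:* from repositories.
if (useLocalSkaiNet && rootDir.resolve("../SKaiNET").exists()) {
    includeBuild("../SKaiNET")
}
```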

Verified end-to-end against published-to-mavenLocal skainet 0.23.0:
:llm-core:jvmTest + :llm-inference:apertus:jvmTest with
-PuseLocalSkaiNet=false → 100 tests, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
michalharakal and others added 2 commits May 2, 2026 19:46
ApertusWeightLoader.streamingTensorToTensor / readerTensorToTensor wrap
quantized weights with byte-level rank-1 shape under
QuantPolicy.NATIVE_OPTIMIZED so the native FFM kernels can address the
block layout directly. That works for matmul (the kernel knows the
logical shape from metadata) but breaks Embedding.gather, which requires
the logical rank-2 [vocab, dim] shape — a rank-1 weight tensor errors
with "gather: unsupported input rank 1".

Surfaced by ApertusNetworkLoader.fromGguf().load() on real
unsloth/Apertus-8B-Instruct-2509 Q4_K_S: token_embd is stored as Q4_K
in the GGUF and gets the byte-level shape, so the very first forward
pass through the embedding layer dies before any logit math.

Add loadStreamingTensor / loadReaderTensor wrappers around the existing
*ToTensor helpers. They route token_embd.weight through the dequant
path (DequantOps.dequantFromBytes → createTensor with the logical
[vocab, dim] shape) when quantPolicy is NATIVE_OPTIMIZED and the
tensor is a quantized type. Other tensors keep their NATIVE_OPTIMIZED
byte-level layout for kernel dispatch.
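
A minimal sketch of the routing predicate (only token_embd.weight, QuantPolicy.NATIVE_OPTIMIZED / RAW_BYTES, and the rank-2 requirement come from the commit text; the rest is illustrative):

```kotlin
// Sketch only: names not mentioned in the commit text are assumptions.
enum class QuantPolicy { RAW_BYTES, NATIVE_OPTIMIZED }

fun keepsLogicalShape(tensorName: String, isQuantizedType: Boolean, policy: QuantPolicy): Boolean =
    tensorName == "token_embd.weight" &&     // Embedding.gather needs rank-2 [vocab, dim]
        policy == QuantPolicy.NATIVE_OPTIMIZED &&
        isQuantizedType                      // FP32 tensors already carry their logical shape
```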

The integration test class kdoc documents the next blocker that
prevents end-to-end inference (linearProject in MultiHeadAttention
calls ops.transpose on byte-shape weights for Q/K/V/O and FFN
projections, which Gemma solves via Q4_KBlockTensorData but Apertus
doesn't yet implement). Tracked as #100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
skainet 0.23.0 (PR SKaiNET-developers/SKaiNET#590) is now live on
Maven Central, unblocking dependency resolution for the 0.23.0 pin
in libs.versions.toml. Empty commit to force a CI re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 4cd1da9 into develop May 2, 2026
2 checks passed
@michalharakal michalharakal deleted the fix/apertus-real-loading branch May 2, 2026 18:42