Q4_K/Q5_K canonical ggml layout + FP32 MemSeg arena leak fix by michalharakal · Pull Request #556 · SKaiNET-developers/SKaiNET

michalharakal · 2026-04-28T11:39:10Z

Summary

Fix Q4_KBlockTensorData.fromRawBytes to follow ggml's strided 4-bit layout (low nibble of byte i = element i, high nibble = element i + 32); decoded weights now match llama.cpp / HF reference instead of producing reshuffled noise on Gemma 4 E2B Q4_K_M.
Fix dequantQ5KFromBytes to index the high-bit byte plane as qh[l] (per-byte within the 32-element group), not qh[idx/8] (output position) — corrupts PLE residual via per_layer_token_embd otherwise.
Stop the FP32 MemorySegment arena leak on the per-forward path: Arena.ofAuto for per-op outputs, liveness-based freeing in ComputeGraphExecutor, and GC-reclaimable transpose/matmul output arenas. Direct memory no longer grows during sustained Gemma 4 decode.
Bonus on the same hot path: vectorize + parallelize CPU matmul kernels (0.124 → 0.66 tok/s on Gemma 4 E2B Q4_K_M, 6 perf cores).

Closes #555.

Test plan

./gradlew clean assemble allTests green on JDK 25 (incl. JS/Wasm browser tests under headless Chromium)
Q4_KTensorDataTest covers strided-layout reads, set, 2D indexing, ggml get_scale_min_k4 formula, sub-block boundaries
Q4KCanonicalLayoutTest / Q5KCanonicalLayoutTest — GGUF round-trip parity vs llama.cpp output
MemSegArenaLeakTest — sustained-forward direct-memory non-growth
Downstream Gemma 4 E2B Q4_K_M smoke test (SKaiNET-transformers): Hi greedy now produces coherent English

🤖 Generated with Claude Code

DefaultCpuOpsJvm.transpose (FP32 MemSeg path, line 113) and the FP32 x FP32 MemSeg matmul fast-path (line 753) each allocated a fresh Arena.ofConfined() per call without ever closing it. Every subsequent FP32 op leaked the entire result tensor's direct memory. Switch both sites to Arena.ofAuto() so the result segment is reclaimed by GC once the wrapping Tensor is unreachable. Strictly safer than the previous behavior (which leaked forever) at the cost of GC-tied (rather than scope-tied) reclamation. Note: this alone is not sufficient to unblock long-running inference like a full Gemma 4 forward pass — when the compute graph holds references to intermediate tensors through the pass, the ofAuto segments stay reachable. A scratch-arena pattern owned by the runtime and reset at forward-pass boundaries is needed for that. This commit fixes the immediate leak in the ops layer; the runtime-level lifetime fix is tracked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…xecutor Compute the last-use topological index for each node once at construction. During execute(), after dispatching each node, drop nodeOutputs entries whose last consumer has been processed. Output nodes and external-input nodes are pinned (mapped to topoOrder.size) so they survive the loop. Why: a single execute() call previously held *every* intermediate tensor in nodeOutputs from creation until function return. For a Gemma 4 E2B forward pass with ~35 transformer layers and ~10 ops per layer, that working set hit ~22 GB of FP32 MemSeg activations on a 4096-token graph trace, blowing past the JVM direct-memory cap before the first token. With per-op freeing, working set drops to O(simultaneously-live intermediates) — typically a few × layer-width. Pairs with the prior commit (Arena.ofAuto in DefaultCpuOpsJvm): the executor drops the Tensor reference; ofAuto lets GC reclaim the underlying MemorySegment direct memory. Either fix alone is insufficient — both are needed. Existing ComputeGraphExecutor tests pass unchanged (./gradlew :skainet-compile:skainet-compile-dag:jvmTest). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the factory's single shared Arena.ofShared() with per-tensor Arena.ofAuto(). Every dataFactory.zeros() / .full() / .init() / .fromFloatArray() call from DefaultCpuOpsBase (~20+ call sites — i.e. nearly every op output for FP32/FP16 tensors) was pinned in the factory's single shared arena for the factory's entire lifetime. On a 30-layer Gemma 4 forward pass this piled monotonically until -XX:MaxDirectMemorySize was hit, regardless of cap value (12, 24, 32, 44 GB all OOM'd identically). Why the prior two fixes (58060ec + 7c7b424) were insufficient on their own: ofConfined → ofAuto in DefaultCpuOpsJvm only covered the fast-path matmul/transpose results, and the executor liveness-tracking only drops references — the Arena still pins memory until reclaimed. Per-op outputs allocated via the factory bypass both fixes and never become eligible for GC. With Arena.ofAuto the GC Cleaner reclaims the segment as soon as the wrapping tensor is unreachable. Loaders that need explicit lifetime control should allocate their own Arena.ofShared() and use the slice constructor of MemorySegmentTensorData (callers already do this for quantized weights via convertGemmaWeightsToMemSeg). Adds MemSegArenaLeakTest to demonstrate that 200 transpose+matmul iterations on 1024×1024 FP32 MemSeg tensors keep direct memory bounded under GC pressure (was unbounded growth pre-fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JFR-driven optimizations on a 12-core x86 / Java Vector API path, measured against Gemma 4 E2B Q4_K_M end-to-end: baseline 0.0186 tok/s + Q4_K kernel rewrite 0.0401 tok/s (2.16x) + multi-thread Q4_K + Q6_K matmul 0.0964 tok/s (5.18x total) + bulk MemSeg I/O in FP32 matmul/transpose 0.124 tok/s (6.66x total) Changes: JvmQuantizedVectorKernels.dotQ4_KSubBlock (and MemSeg twin): - Hoist the per-iteration FloatArray(floatStep) allocation; caller passes a reused 32-element scratch buffer. - Unpack 16 packed bytes -> 32 float codes in a single tight scalar loop with sequential writes (auto-vectorizable by the JIT). Was inside the inner SIMD loop with a scattered write + per-iter alloc. - SIMD multiply via FMA, accumulate into FloatVector, reduce once at the end. Was per-iter horizontal reduction (reduceLanes inside the loop kills the SIMD pipelining). JvmQuantizedVectorKernels.matmulQ6_KVec: - Same FMA + accumulate-then-reduce treatment. JvmVectorKernels.matmulFloatBlockedMemSeg (FP32 x FP32 attention math): - Replace per-element MemSeg.get transpose loop with bulk MemorySegment.copy + scalar scatter (the scatter auto-vectorizes). The transpose loop was O(m*k + n*k) per-element VarHandle.get and dominated attention QK^T / AV matmul wall time. - Local FloatArray result accumulator, single bulk-write at end. Replaces per-tile get + add + set on rSeg. DefaultCpuOpsJvm.transpose (FP32 MemSeg fast-path): - Same bulk-copy treatment: bulk-load src into FloatArray, scatter, bulk-write dst. Was O(rows*cols) per-element get/set. ParallelFor.kt (new): - Adds parallelChunks(outputDim, block) using kotlinx-coroutines (Dispatchers.Default backed by a CPU-sized worker pool). Below PARALLEL_MATMUL_MIN_OUTPUT (256) it runs sequentially in the calling thread, avoiding coroutine launch overhead on small attention matmuls. - Wraps matmulQ4_KVec, matmulQ6_KVec, and matmulFloatBlockedMemSeg over the m-dim. Each task owns its own scratch buffer to avoid cross-thread contention. Adds kotlinx-coroutines dependency to skainet-backend-cpu jvmMain. JVM-only for now (matches the JVM-only Vector API kernels); native backends will ship their own SIMD + parallelism when they land. Tests: skainet-backends:skainet-backend-cpu:jvmTest passes (Q4_K, Q6_K, Q8_0 numeric tests). Downstream GemmaDslQ4KTest and GemmaDslQuantizedTest still pass (Q4_K/Q8 weights match FP32 reference within tolerance). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Q4_K tensor reader (`Q4_KBlockTensorData`) and the Vector-API matmul kernels (`matmulQ4_KVec`, `matmulF32Q4_KMemSeg`) were decoding real GGUF Q4_K_M files with four independent layout disagreements vs ggml: 1. **Codes (qs bytes)** — were unpacked per-byte interleaved (lo→elem 2i, hi→elem 2i+1). ggml is *strided per 32-byte group*: byte j*32+i lo nibble decodes to element 2j*32+i, hi nibble to (2j+1)*32+i. 2. **Scale-index packing** — was a flat 12-bits-per-sub-block sequential layout. ggml uses `get_scale_min_k4` bit-mixing where sub-blocks 4..7 reuse top-2-bits of the bytes belonging to sub-blocks 0..3. 3. **Normalisation** — was `scale = d * (scaleIdx / 63.0)`. ggml is `scale = d * scaleIdx` (no /63). 4. **Sign of min term** — was `output = code * scale + min`. ggml is `output = code * scale - offset`. The bugs round-tripped correctly against each other (the scalar `DequantOps.dequantQ4KFromBytes` is canonical-ggml, but the kernel's encoder-decoder pair was internally consistent at the wrong layout) and the existing `GemmaDslQ4KTest` used `inDim=256` — exactly one Q4_K block per row, making `relayoutKSeriesRowMajorToBlockMajor` an identity transform — plus uniform sub-block scales, which masked all four bugs together. Adds `Q4KCanonicalLayoutTest` in `skainet-io-gguf` as a tight regression guard: builds a single Q4_K block in canonical ggml encoding (varying per-sub-block scales/mins, varying codes), runs both the scalar `DequantOps.dequantFromBytes` and `Q4_KBlockTensorData.toFloatArray()`, and asserts they agree within FP16-rehydration tolerance. Pre-fix divergence: maxAbs=117.5 at element 0. Post-fix: maxAbs<1e-3. Existing `Q4_KTensorDataTest` and `Q4KDequantizationTest` updated to match the canonical layout (the old tests baked in the wrong one as positive assertions). Effect on downstream Gemma 4 E2B Q4_K_M generation: BOS-loop on reference token sequence `[2, 10979]` resolved; throughput rises 0.124 → 0.66 tok/s (the old kernel had cache-unfriendly access patterns from the wrong byte layout). End-to-end output correctness still requires the kv-share fix in SKaiNET-transformers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`dequantQ5KFromBytes` indexed the 32-byte qh region by output position (`qh[idx / 8]`, with `idx` advancing through the 256-element block), selecting bit `idx % 8`. Per ggml-quants.c `dequantize_row_q5_K`, qh is indexed by `l` (0..31, the same per-byte position used for qs in each 32-byte group), and a single bit per (outer-iter, low/high nibble) is selected: outer 0: low→bit0, hi→bit1 outer 1: low→bit2, hi→bit3 outer 2: low→bit4, hi→bit5 outer 3: low→bit6, hi→bit7 `qh[idx/8]` only happens to equal `qh[l]` for l < 8 in the first outer iter; everything else reads the wrong byte. On real Gemma 4 E2B Q4_K_M, this corrupts every 5th bit of the 1.6 GB Q5_K `per_layer_token_embd` tensor, which feeds the PLE residual into all 35 decoder layers. Adds `Q5KCanonicalLayoutTest` (mirrors `Q4KCanonicalLayoutTest`): builds a single Q5_K block in canonical ggml encoding (varying scales + varying 5-bit codes per sub-block) and asserts `DequantOps.dequantFromBytes` agrees with the analytic ggml formula. Pre-fix: maxAbs=126 at i=2. Post-fix: <1e-3. End-to-end effect on Gemma 4 E2B Q4_K_M for prompt `[2, 10979]` (BOS+Hi): llama.cpp top-1 = '\n' (107) at +5.30 post-softcap HF top-1 = '\n' (107) at +5.74 post-softcap Pre-fix SKaiNET top-1 = 'ل' (236857) at -6.92 (all logits massively negative) Post-fix SKaiNET top-1 = '\n' (107) at +5.42 (matches llama.cpp/HF top-1) Top-5 alignment with llama.cpp now near-exact: '\n' (107) ✓ (rank 1) '<turn|>' (106) ✓ (rank 2 vs llama.cpp rank 2) '\n\n' (108) ✓ (rank 3 vs llama.cpp rank 1, swapped within ε) '"' (236775) ✓ '$' (236795) ✓ Greedy 8-token continuation: '\n** 1. `1.' (markdown structure with `**`, `1.`, backticks — recognisable text vs the pre-fix multilingual gibberish). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Backtick-quoted function names with `()` are illegal on Kotlin/Native targets, so commonTest compilation failed at compileTestKotlinIosArm64 and friends. Replace `(canonical strided layout)` and `(canonical ggml formula)` with hyphenated equivalents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-28T11:42:29Z

📖 Documentation Preview

The documentation has been built successfully for this PR.

Generated Files:

Operator documentation: docs/modules/operators/_generated_/
JSON schema output: operators.json

Artifacts:

Download the documentation-preview-556 artifact to view the complete documentation locally.

This comment will be updated automatically when the PR is updated.

michalharakal and others added 7 commits April 28, 2026 13:28

michalharakal merged commit a453045 into develop Apr 28, 2026
9 checks passed

michalharakal deleted the feature/ISSUE-555-q4k-q5k-arena-leak branch April 28, 2026 11:41

michalharakal mentioned this pull request Apr 28, 2026

chore(release): prepare 0.21.0 #566

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Q4_K/Q5_K canonical ggml layout + FP32 MemSeg arena leak fix#556

Q4_K/Q5_K canonical ggml layout + FP32 MemSeg arena leak fix#556
michalharakal merged 7 commits intodevelopfrom
feature/ISSUE-555-q4k-q5k-arena-leak

michalharakal commented Apr 28, 2026

Uh oh!

Uh oh!

github-actions Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

michalharakal commented Apr 28, 2026

Summary

Test plan

Uh oh!

Uh oh!

github-actions Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant