Q4_K/Q5_K canonical ggml layout + FP32 MemSeg arena leak fix#556
Merged
michalharakal merged 7 commits intodevelopfrom Apr 28, 2026
Merged
Q4_K/Q5_K canonical ggml layout + FP32 MemSeg arena leak fix#556michalharakal merged 7 commits intodevelopfrom
michalharakal merged 7 commits intodevelopfrom
Conversation
DefaultCpuOpsJvm.transpose (FP32 MemSeg path, line 113) and the FP32 x FP32 MemSeg matmul fast-path (line 753) each allocated a fresh Arena.ofConfined() per call without ever closing it. Every subsequent FP32 op leaked the entire result tensor's direct memory. Switch both sites to Arena.ofAuto() so the result segment is reclaimed by GC once the wrapping Tensor is unreachable. Strictly safer than the previous behavior (which leaked forever) at the cost of GC-tied (rather than scope-tied) reclamation. Note: this alone is not sufficient to unblock long-running inference like a full Gemma 4 forward pass — when the compute graph holds references to intermediate tensors through the pass, the ofAuto segments stay reachable. A scratch-arena pattern owned by the runtime and reset at forward-pass boundaries is needed for that. This commit fixes the immediate leak in the ops layer; the runtime-level lifetime fix is tracked separately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xecutor Compute the last-use topological index for each node once at construction. During execute(), after dispatching each node, drop nodeOutputs entries whose last consumer has been processed. Output nodes and external-input nodes are pinned (mapped to topoOrder.size) so they survive the loop. Why: a single execute() call previously held *every* intermediate tensor in nodeOutputs from creation until function return. For a Gemma 4 E2B forward pass with ~35 transformer layers and ~10 ops per layer, that working set hit ~22 GB of FP32 MemSeg activations on a 4096-token graph trace, blowing past the JVM direct-memory cap before the first token. With per-op freeing, working set drops to O(simultaneously-live intermediates) — typically a few × layer-width. Pairs with the prior commit (Arena.ofAuto in DefaultCpuOpsJvm): the executor drops the Tensor reference; ofAuto lets GC reclaim the underlying MemorySegment direct memory. Either fix alone is insufficient — both are needed. Existing ComputeGraphExecutor tests pass unchanged (./gradlew :skainet-compile:skainet-compile-dag:jvmTest). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the factory's single shared Arena.ofShared() with per-tensor Arena.ofAuto(). Every dataFactory.zeros() / .full() / .init() / .fromFloatArray() call from DefaultCpuOpsBase (~20+ call sites — i.e. nearly every op output for FP32/FP16 tensors) was pinned in the factory's single shared arena for the factory's entire lifetime. On a 30-layer Gemma 4 forward pass this piled monotonically until -XX:MaxDirectMemorySize was hit, regardless of cap value (12, 24, 32, 44 GB all OOM'd identically). Why the prior two fixes (58060ec + 7c7b424) were insufficient on their own: ofConfined → ofAuto in DefaultCpuOpsJvm only covered the fast-path matmul/transpose results, and the executor liveness-tracking only drops references — the Arena still pins memory until reclaimed. Per-op outputs allocated via the factory bypass both fixes and never become eligible for GC. With Arena.ofAuto the GC Cleaner reclaims the segment as soon as the wrapping tensor is unreachable. Loaders that need explicit lifetime control should allocate their own Arena.ofShared() and use the slice constructor of MemorySegmentTensorData (callers already do this for quantized weights via convertGemmaWeightsToMemSeg). Adds MemSegArenaLeakTest to demonstrate that 200 transpose+matmul iterations on 1024×1024 FP32 MemSeg tensors keep direct memory bounded under GC pressure (was unbounded growth pre-fix). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JFR-driven optimizations on a 12-core x86 / Java Vector API path,
measured against Gemma 4 E2B Q4_K_M end-to-end:
baseline 0.0186 tok/s
+ Q4_K kernel rewrite 0.0401 tok/s (2.16x)
+ multi-thread Q4_K + Q6_K matmul 0.0964 tok/s (5.18x total)
+ bulk MemSeg I/O in FP32 matmul/transpose 0.124 tok/s (6.66x total)
Changes:
JvmQuantizedVectorKernels.dotQ4_KSubBlock (and MemSeg twin):
- Hoist the per-iteration FloatArray(floatStep) allocation; caller
passes a reused 32-element scratch buffer.
- Unpack 16 packed bytes -> 32 float codes in a single tight scalar
loop with sequential writes (auto-vectorizable by the JIT). Was
inside the inner SIMD loop with a scattered write + per-iter alloc.
- SIMD multiply via FMA, accumulate into FloatVector, reduce once at
the end. Was per-iter horizontal reduction (reduceLanes inside the
loop kills the SIMD pipelining).
JvmQuantizedVectorKernels.matmulQ6_KVec:
- Same FMA + accumulate-then-reduce treatment.
JvmVectorKernels.matmulFloatBlockedMemSeg (FP32 x FP32 attention math):
- Replace per-element MemSeg.get transpose loop with bulk
MemorySegment.copy + scalar scatter (the scatter auto-vectorizes).
The transpose loop was O(m*k + n*k) per-element VarHandle.get and
dominated attention QK^T / AV matmul wall time.
- Local FloatArray result accumulator, single bulk-write at end.
Replaces per-tile get + add + set on rSeg.
DefaultCpuOpsJvm.transpose (FP32 MemSeg fast-path):
- Same bulk-copy treatment: bulk-load src into FloatArray, scatter,
bulk-write dst. Was O(rows*cols) per-element get/set.
ParallelFor.kt (new):
- Adds parallelChunks(outputDim, block) using kotlinx-coroutines
(Dispatchers.Default backed by a CPU-sized worker pool). Below
PARALLEL_MATMUL_MIN_OUTPUT (256) it runs sequentially in the
calling thread, avoiding coroutine launch overhead on small
attention matmuls.
- Wraps matmulQ4_KVec, matmulQ6_KVec, and matmulFloatBlockedMemSeg
over the m-dim. Each task owns its own scratch buffer to avoid
cross-thread contention.
Adds kotlinx-coroutines dependency to skainet-backend-cpu jvmMain.
JVM-only for now (matches the JVM-only Vector API kernels); native
backends will ship their own SIMD + parallelism when they land.
Tests: skainet-backends:skainet-backend-cpu:jvmTest passes (Q4_K, Q6_K,
Q8_0 numeric tests). Downstream GemmaDslQ4KTest and GemmaDslQuantizedTest
still pass (Q4_K/Q8 weights match FP32 reference within tolerance).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Q4_K tensor reader (`Q4_KBlockTensorData`) and the Vector-API matmul kernels (`matmulQ4_KVec`, `matmulF32Q4_KMemSeg`) were decoding real GGUF Q4_K_M files with four independent layout disagreements vs ggml: 1. **Codes (qs bytes)** — were unpacked per-byte interleaved (lo→elem 2i, hi→elem 2i+1). ggml is *strided per 32-byte group*: byte j*32+i lo nibble decodes to element 2j*32+i, hi nibble to (2j+1)*32+i. 2. **Scale-index packing** — was a flat 12-bits-per-sub-block sequential layout. ggml uses `get_scale_min_k4` bit-mixing where sub-blocks 4..7 reuse top-2-bits of the bytes belonging to sub-blocks 0..3. 3. **Normalisation** — was `scale = d * (scaleIdx / 63.0)`. ggml is `scale = d * scaleIdx` (no /63). 4. **Sign of min term** — was `output = code * scale + min`. ggml is `output = code * scale - offset`. The bugs round-tripped correctly against each other (the scalar `DequantOps.dequantQ4KFromBytes` is canonical-ggml, but the kernel's encoder-decoder pair was internally consistent at the wrong layout) and the existing `GemmaDslQ4KTest` used `inDim=256` — exactly one Q4_K block per row, making `relayoutKSeriesRowMajorToBlockMajor` an identity transform — plus uniform sub-block scales, which masked all four bugs together. Adds `Q4KCanonicalLayoutTest` in `skainet-io-gguf` as a tight regression guard: builds a single Q4_K block in canonical ggml encoding (varying per-sub-block scales/mins, varying codes), runs both the scalar `DequantOps.dequantFromBytes` and `Q4_KBlockTensorData.toFloatArray()`, and asserts they agree within FP16-rehydration tolerance. Pre-fix divergence: maxAbs=117.5 at element 0. Post-fix: maxAbs<1e-3. Existing `Q4_KTensorDataTest` and `Q4KDequantizationTest` updated to match the canonical layout (the old tests baked in the wrong one as positive assertions). Effect on downstream Gemma 4 E2B Q4_K_M generation: BOS-loop on reference token sequence `[2, 10979]` resolved; throughput rises 0.124 → 0.66 tok/s (the old kernel had cache-unfriendly access patterns from the wrong byte layout). End-to-end output correctness still requires the kv-share fix in SKaiNET-transformers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dequantQ5KFromBytes` indexed the 32-byte qh region by output position (`qh[idx / 8]`, with `idx` advancing through the 256-element block), selecting bit `idx % 8`. Per ggml-quants.c `dequantize_row_q5_K`, qh is indexed by `l` (0..31, the same per-byte position used for qs in each 32-byte group), and a single bit per (outer-iter, low/high nibble) is selected: outer 0: low→bit0, hi→bit1 outer 1: low→bit2, hi→bit3 outer 2: low→bit4, hi→bit5 outer 3: low→bit6, hi→bit7 `qh[idx/8]` only happens to equal `qh[l]` for l < 8 in the first outer iter; everything else reads the wrong byte. On real Gemma 4 E2B Q4_K_M, this corrupts every 5th bit of the 1.6 GB Q5_K `per_layer_token_embd` tensor, which feeds the PLE residual into all 35 decoder layers. Adds `Q5KCanonicalLayoutTest` (mirrors `Q4KCanonicalLayoutTest`): builds a single Q5_K block in canonical ggml encoding (varying scales + varying 5-bit codes per sub-block) and asserts `DequantOps.dequantFromBytes` agrees with the analytic ggml formula. Pre-fix: maxAbs=126 at i=2. Post-fix: <1e-3. End-to-end effect on Gemma 4 E2B Q4_K_M for prompt `[2, 10979]` (BOS+Hi): llama.cpp top-1 = '\n' (107) at +5.30 post-softcap HF top-1 = '\n' (107) at +5.74 post-softcap Pre-fix SKaiNET top-1 = 'ل' (236857) at -6.92 (all logits massively negative) Post-fix SKaiNET top-1 = '\n' (107) at +5.42 (matches llama.cpp/HF top-1) Top-5 alignment with llama.cpp now near-exact: '\n' (107) ✓ (rank 1) '<turn|>' (106) ✓ (rank 2 vs llama.cpp rank 2) '\n\n' (108) ✓ (rank 3 vs llama.cpp rank 1, swapped within ε) '"' (236775) ✓ '$' (236795) ✓ Greedy 8-token continuation: '\n** 1. `1.' (markdown structure with `**`, `1.`, backticks — recognisable text vs the pre-fix multilingual gibberish). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backtick-quoted function names with `()` are illegal on Kotlin/Native targets, so commonTest compilation failed at compileTestKotlinIosArm64 and friends. Replace `(canonical strided layout)` and `(canonical ggml formula)` with hyphenated equivalents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
📖 Documentation Preview The documentation has been built successfully for this PR. Generated Files:
Artifacts:
This comment will be updated automatically when the PR is updated. |
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Q4_KBlockTensorData.fromRawBytesto follow ggml's strided 4-bit layout (low nibble of bytei= elementi, high nibble = elementi + 32); decoded weights now match llama.cpp / HF reference instead of producing reshuffled noise on Gemma 4 E2BQ4_K_M.dequantQ5KFromBytesto index the high-bit byte plane asqh[l](per-byte within the 32-element group), notqh[idx/8](output position) — corrupts PLE residual viaper_layer_token_embdotherwise.MemorySegmentarena leak on the per-forward path:Arena.ofAutofor per-op outputs, liveness-based freeing inComputeGraphExecutor, and GC-reclaimable transpose/matmul output arenas. Direct memory no longer grows during sustained Gemma 4 decode.Closes #555.
Test plan
./gradlew clean assemble allTestsgreen on JDK 25 (incl. JS/Wasm browser tests under headless Chromium)Q4_KTensorDataTestcovers strided-layout reads, set, 2D indexing, ggmlget_scale_min_k4formula, sub-block boundariesQ4KCanonicalLayoutTest/Q5KCanonicalLayoutTest— GGUF round-trip parity vs llama.cpp outputMemSegArenaLeakTest— sustained-forward direct-memory non-growthHigreedy now produces coherent English🤖 Generated with Claude Code