Skip to content

Q4_K/Q5_K canonical ggml layout + FP32 MemSeg arena leak fix#556

Merged
michalharakal merged 7 commits intodevelopfrom
feature/ISSUE-555-q4k-q5k-arena-leak
Apr 28, 2026
Merged

Q4_K/Q5_K canonical ggml layout + FP32 MemSeg arena leak fix#556
michalharakal merged 7 commits intodevelopfrom
feature/ISSUE-555-q4k-q5k-arena-leak

Conversation

@michalharakal
Copy link
Copy Markdown
Contributor

Summary

  • Fix Q4_KBlockTensorData.fromRawBytes to follow ggml's strided 4-bit layout (low nibble of byte i = element i, high nibble = element i + 32); decoded weights now match llama.cpp / HF reference instead of producing reshuffled noise on Gemma 4 E2B Q4_K_M.
  • Fix dequantQ5KFromBytes to index the high-bit byte plane as qh[l] (per-byte within the 32-element group), not qh[idx/8] (output position) — corrupts PLE residual via per_layer_token_embd otherwise.
  • Stop the FP32 MemorySegment arena leak on the per-forward path: Arena.ofAuto for per-op outputs, liveness-based freeing in ComputeGraphExecutor, and GC-reclaimable transpose/matmul output arenas. Direct memory no longer grows during sustained Gemma 4 decode.
  • Bonus on the same hot path: vectorize + parallelize CPU matmul kernels (0.124 → 0.66 tok/s on Gemma 4 E2B Q4_K_M, 6 perf cores).

Closes #555.

Test plan

  • ./gradlew clean assemble allTests green on JDK 25 (incl. JS/Wasm browser tests under headless Chromium)
  • Q4_KTensorDataTest covers strided-layout reads, set, 2D indexing, ggml get_scale_min_k4 formula, sub-block boundaries
  • Q4KCanonicalLayoutTest / Q5KCanonicalLayoutTest — GGUF round-trip parity vs llama.cpp output
  • MemSegArenaLeakTest — sustained-forward direct-memory non-growth
  • Downstream Gemma 4 E2B Q4_K_M smoke test (SKaiNET-transformers): Hi greedy now produces coherent English

🤖 Generated with Claude Code

michalharakal and others added 7 commits April 28, 2026 13:28
DefaultCpuOpsJvm.transpose (FP32 MemSeg path, line 113) and the
FP32 x FP32 MemSeg matmul fast-path (line 753) each allocated a
fresh Arena.ofConfined() per call without ever closing it. Every
subsequent FP32 op leaked the entire result tensor's direct memory.

Switch both sites to Arena.ofAuto() so the result segment is
reclaimed by GC once the wrapping Tensor is unreachable. Strictly
safer than the previous behavior (which leaked forever) at the
cost of GC-tied (rather than scope-tied) reclamation.

Note: this alone is not sufficient to unblock long-running
inference like a full Gemma 4 forward pass — when the compute graph
holds references to intermediate tensors through the pass, the
ofAuto segments stay reachable. A scratch-arena pattern owned by
the runtime and reset at forward-pass boundaries is needed for
that. This commit fixes the immediate leak in the ops layer; the
runtime-level lifetime fix is tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xecutor

Compute the last-use topological index for each node once at construction.
During execute(), after dispatching each node, drop nodeOutputs entries
whose last consumer has been processed. Output nodes and external-input
nodes are pinned (mapped to topoOrder.size) so they survive the loop.

Why: a single execute() call previously held *every* intermediate tensor
in nodeOutputs from creation until function return. For a Gemma 4 E2B
forward pass with ~35 transformer layers and ~10 ops per layer, that
working set hit ~22 GB of FP32 MemSeg activations on a 4096-token
graph trace, blowing past the JVM direct-memory cap before the first
token. With per-op freeing, working set drops to O(simultaneously-live
intermediates) — typically a few × layer-width.

Pairs with the prior commit (Arena.ofAuto in DefaultCpuOpsJvm): the
executor drops the Tensor reference; ofAuto lets GC reclaim the
underlying MemorySegment direct memory. Either fix alone is
insufficient — both are needed.

Existing ComputeGraphExecutor tests pass unchanged
(./gradlew :skainet-compile:skainet-compile-dag:jvmTest).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the factory's single shared Arena.ofShared() with per-tensor
Arena.ofAuto(). Every dataFactory.zeros() / .full() / .init() /
.fromFloatArray() call from DefaultCpuOpsBase (~20+ call sites — i.e.
nearly every op output for FP32/FP16 tensors) was pinned in the factory's
single shared arena for the factory's entire lifetime. On a 30-layer
Gemma 4 forward pass this piled monotonically until -XX:MaxDirectMemorySize
was hit, regardless of cap value (12, 24, 32, 44 GB all OOM'd identically).

Why the prior two fixes (58060ec + 7c7b424) were insufficient on their
own: ofConfined → ofAuto in DefaultCpuOpsJvm only covered the fast-path
matmul/transpose results, and the executor liveness-tracking only drops
references — the Arena still pins memory until reclaimed. Per-op outputs
allocated via the factory bypass both fixes and never become eligible
for GC. With Arena.ofAuto the GC Cleaner reclaims the segment as soon as
the wrapping tensor is unreachable.

Loaders that need explicit lifetime control should allocate their own
Arena.ofShared() and use the slice constructor of MemorySegmentTensorData
(callers already do this for quantized weights via convertGemmaWeightsToMemSeg).

Adds MemSegArenaLeakTest to demonstrate that 200 transpose+matmul
iterations on 1024×1024 FP32 MemSeg tensors keep direct memory bounded
under GC pressure (was unbounded growth pre-fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JFR-driven optimizations on a 12-core x86 / Java Vector API path,
measured against Gemma 4 E2B Q4_K_M end-to-end:

  baseline                                   0.0186 tok/s
  + Q4_K kernel rewrite                      0.0401 tok/s  (2.16x)
  + multi-thread Q4_K + Q6_K matmul          0.0964 tok/s  (5.18x total)
  + bulk MemSeg I/O in FP32 matmul/transpose 0.124  tok/s  (6.66x total)

Changes:

JvmQuantizedVectorKernels.dotQ4_KSubBlock (and MemSeg twin):
  - Hoist the per-iteration FloatArray(floatStep) allocation; caller
    passes a reused 32-element scratch buffer.
  - Unpack 16 packed bytes -> 32 float codes in a single tight scalar
    loop with sequential writes (auto-vectorizable by the JIT). Was
    inside the inner SIMD loop with a scattered write + per-iter alloc.
  - SIMD multiply via FMA, accumulate into FloatVector, reduce once at
    the end. Was per-iter horizontal reduction (reduceLanes inside the
    loop kills the SIMD pipelining).

JvmQuantizedVectorKernels.matmulQ6_KVec:
  - Same FMA + accumulate-then-reduce treatment.

JvmVectorKernels.matmulFloatBlockedMemSeg (FP32 x FP32 attention math):
  - Replace per-element MemSeg.get transpose loop with bulk
    MemorySegment.copy + scalar scatter (the scatter auto-vectorizes).
    The transpose loop was O(m*k + n*k) per-element VarHandle.get and
    dominated attention QK^T / AV matmul wall time.
  - Local FloatArray result accumulator, single bulk-write at end.
    Replaces per-tile get + add + set on rSeg.

DefaultCpuOpsJvm.transpose (FP32 MemSeg fast-path):
  - Same bulk-copy treatment: bulk-load src into FloatArray, scatter,
    bulk-write dst. Was O(rows*cols) per-element get/set.

ParallelFor.kt (new):
  - Adds parallelChunks(outputDim, block) using kotlinx-coroutines
    (Dispatchers.Default backed by a CPU-sized worker pool). Below
    PARALLEL_MATMUL_MIN_OUTPUT (256) it runs sequentially in the
    calling thread, avoiding coroutine launch overhead on small
    attention matmuls.
  - Wraps matmulQ4_KVec, matmulQ6_KVec, and matmulFloatBlockedMemSeg
    over the m-dim. Each task owns its own scratch buffer to avoid
    cross-thread contention.

Adds kotlinx-coroutines dependency to skainet-backend-cpu jvmMain.
JVM-only for now (matches the JVM-only Vector API kernels); native
backends will ship their own SIMD + parallelism when they land.

Tests: skainet-backends:skainet-backend-cpu:jvmTest passes (Q4_K, Q6_K,
Q8_0 numeric tests). Downstream GemmaDslQ4KTest and GemmaDslQuantizedTest
still pass (Q4_K/Q8 weights match FP32 reference within tolerance).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Q4_K tensor reader (`Q4_KBlockTensorData`) and the Vector-API matmul
kernels (`matmulQ4_KVec`, `matmulF32Q4_KMemSeg`) were decoding real
GGUF Q4_K_M files with four independent layout disagreements vs ggml:

1. **Codes (qs bytes)** — were unpacked per-byte interleaved (lo→elem 2i,
   hi→elem 2i+1). ggml is *strided per 32-byte group*: byte j*32+i lo
   nibble decodes to element 2j*32+i, hi nibble to (2j+1)*32+i.
2. **Scale-index packing** — was a flat 12-bits-per-sub-block sequential
   layout. ggml uses `get_scale_min_k4` bit-mixing where sub-blocks 4..7
   reuse top-2-bits of the bytes belonging to sub-blocks 0..3.
3. **Normalisation** — was `scale = d * (scaleIdx / 63.0)`. ggml is
   `scale = d * scaleIdx` (no /63).
4. **Sign of min term** — was `output = code * scale + min`. ggml is
   `output = code * scale - offset`.

The bugs round-tripped correctly against each other (the scalar
`DequantOps.dequantQ4KFromBytes` is canonical-ggml, but the kernel's
encoder-decoder pair was internally consistent at the wrong layout) and
the existing `GemmaDslQ4KTest` used `inDim=256` — exactly one Q4_K
block per row, making `relayoutKSeriesRowMajorToBlockMajor` an identity
transform — plus uniform sub-block scales, which masked all four bugs
together.

Adds `Q4KCanonicalLayoutTest` in `skainet-io-gguf` as a tight regression
guard: builds a single Q4_K block in canonical ggml encoding (varying
per-sub-block scales/mins, varying codes), runs both the scalar
`DequantOps.dequantFromBytes` and `Q4_KBlockTensorData.toFloatArray()`,
and asserts they agree within FP16-rehydration tolerance. Pre-fix
divergence: maxAbs=117.5 at element 0. Post-fix: maxAbs<1e-3.

Existing `Q4_KTensorDataTest` and `Q4KDequantizationTest` updated to
match the canonical layout (the old tests baked in the wrong one as
positive assertions).

Effect on downstream Gemma 4 E2B Q4_K_M generation: BOS-loop on
reference token sequence `[2, 10979]` resolved; throughput rises 0.124
→ 0.66 tok/s (the old kernel had cache-unfriendly access patterns from
the wrong byte layout). End-to-end output correctness still requires
the kv-share fix in SKaiNET-transformers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dequantQ5KFromBytes` indexed the 32-byte qh region by output position
(`qh[idx / 8]`, with `idx` advancing through the 256-element block),
selecting bit `idx % 8`. Per ggml-quants.c `dequantize_row_q5_K`, qh is
indexed by `l` (0..31, the same per-byte position used for qs in each
32-byte group), and a single bit per (outer-iter, low/high nibble) is
selected:

   outer 0: low→bit0, hi→bit1
   outer 1: low→bit2, hi→bit3
   outer 2: low→bit4, hi→bit5
   outer 3: low→bit6, hi→bit7

`qh[idx/8]` only happens to equal `qh[l]` for l < 8 in the first outer
iter; everything else reads the wrong byte. On real Gemma 4 E2B Q4_K_M,
this corrupts every 5th bit of the 1.6 GB Q5_K `per_layer_token_embd`
tensor, which feeds the PLE residual into all 35 decoder layers.

Adds `Q5KCanonicalLayoutTest` (mirrors `Q4KCanonicalLayoutTest`):
builds a single Q5_K block in canonical ggml encoding (varying scales
+ varying 5-bit codes per sub-block) and asserts
`DequantOps.dequantFromBytes` agrees with the analytic ggml formula.
Pre-fix: maxAbs=126 at i=2. Post-fix: <1e-3.

End-to-end effect on Gemma 4 E2B Q4_K_M for prompt `[2, 10979]` (BOS+Hi):

  llama.cpp top-1 = '\n' (107) at +5.30 post-softcap
  HF       top-1 = '\n' (107) at +5.74 post-softcap

Pre-fix SKaiNET top-1 = 'ل' (236857) at -6.92 (all logits massively negative)
Post-fix SKaiNET top-1 = '\n' (107) at +5.42 (matches llama.cpp/HF top-1)

Top-5 alignment with llama.cpp now near-exact:
  '\n' (107) ✓ (rank 1)
  '<turn|>' (106) ✓ (rank 2 vs llama.cpp rank 2)
  '\n\n' (108) ✓ (rank 3 vs llama.cpp rank 1, swapped within ε)
  '"' (236775) ✓
  '$' (236795) ✓

Greedy 8-token continuation: '\n** 1. `1.' (markdown structure with `**`,
`1.`, backticks — recognisable text vs the pre-fix multilingual gibberish).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backtick-quoted function names with `()` are illegal on Kotlin/Native
targets, so commonTest compilation failed at compileTestKotlinIosArm64
and friends. Replace `(canonical strided layout)` and `(canonical ggml
formula)` with hyphenated equivalents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit a453045 into develop Apr 28, 2026
9 checks passed
@michalharakal michalharakal deleted the feature/ISSUE-555-q4k-q5k-arena-leak branch April 28, 2026 11:41
@github-actions
Copy link
Copy Markdown

📖 Documentation Preview

The documentation has been built successfully for this PR.

Generated Files:

  • Operator documentation: docs/modules/operators/_generated_/
  • JSON schema output: operators.json

Artifacts:

  • Download the documentation-preview-556 artifact to view the complete documentation locally.

This comment will be updated automatically when the PR is updated.

@michalharakal michalharakal mentioned this pull request Apr 28, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gemma 4 / Q4_K_M correctness: canonical ggml layout for Q4_K + Q5_K, FP32 MemSeg arena leak fix

1 participant