perf(q4_0): partial-vec dotQ4_0BlockMemSeg via scratch + SIMD FMA (#565)
Merged
michalharakal merged 1 commit into develop on Apr 28, 2026
Conversation
Replaces the fully-scalar inner loop in `dotQ4_0BlockMemSeg` with the two-stage pattern Q4_K used pre-#562: a scalar byte-pair unpack writes 32 sign-corrected floats into a caller-supplied scratch `FloatArray` (16 byte loads, two nibbles each — half the byte traffic of the prior scalar dot product), then a `FloatVector` FMA reduction takes the dot product against the matching input slice.

Per-block scratch is hoisted out of `matmulF32Q4_0MemSeg`'s loops to amortize allocation. A backwards-compatible no-`codeBuf` overload preserves the prior signature for any external callers (it creates a fresh scratch on each invocation).

Q4_0 isn't dominant in modern quantized weights (Q4_K_M / Q4_K_S cover Gemma 4, Llama, Qwen), but it was the last fully-scalar kernel in `JvmQuantizedVectorKernels` — closing it brings the JVM Vector path to parity across Q4_0, Q4_K, Q6_K, Q8_0.

A fully-fused `ByteVector` pipeline (à la `PanamaVectorQ4KMatmulKernel`) is awkward for Q4_0's interleaved nibble layout — adjacent elements share a byte, so getting codes in natural order needs a strided gather or a lane-interleave shuffle. Out of scope; if Q4_0 becomes a hot path again, that's the next move.

Tests: cpu `jvmTest` 218/218 pass, including `QuantizedMemSegMatmulTest`'s parity coverage of the Q4_0 MemSeg matmul.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the fully-scalar inner loop in `dotQ4_0BlockMemSeg` with the two-stage pattern Q4_K used pre-#562:

- Stage 1: a scalar byte-pair unpack writes 32 sign-corrected floats into a caller-supplied scratch `FloatArray` (16 byte loads, two nibbles each — half the byte traffic of the prior scalar dot product, which loaded a byte per element).
- Stage 2: a `FloatVector.fma` + `reduceLanes(ADD)` reduction takes the dot product against the matching input slice.

Per-block scratch is hoisted out of `matmulF32Q4_0MemSeg`'s loops to amortize allocation. A backwards-compatible no-`codeBuf` overload preserves the prior signature for any external callers.

Why partial, not fully fused?
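The two-stage shape can be sketched in plain Kotlin. This is an illustrative model, not the actual kernel: the block layout constants and function names are assumptions, and the real stage 2 runs on `FloatVector.fma` + `reduceLanes(ADD)`, for which a scalar loop stands in here.

```kotlin
// Illustrative sketch of the two-stage Q4_0 block dot product (not the real kernel).
// A Q4_0 block holds 32 elements: one float scale plus 16 bytes of packed 4-bit codes.
const val QK4_0 = 32

// Stage 1: scalar byte-pair unpack — 16 byte loads, two nibbles each.
// Writes 32 sign-corrected floats (nibble - 8, times scale) into the caller's scratch.
fun unpackQ4_0Block(codes: ByteArray, scale: Float, scratch: FloatArray) {
    for (k in 0 until QK4_0 / 2) {
        val b = codes[k].toInt() and 0xFF
        scratch[2 * k]     = scale * ((b and 0x0F) - 8)   // low nibble  -> even element
        scratch[2 * k + 1] = scale * ((b ushr 4) - 8)     // high nibble -> odd element
    }
}

// Stage 2: dot product of scratch against the input slice. The real kernel does this
// with FloatVector.fma + reduceLanes(ADD); a scalar loop stands in for the SIMD part.
fun dotQ4_0Block(
    codes: ByteArray, scale: Float,
    input: FloatArray, offset: Int,
    scratch: FloatArray,
): Float {
    unpackQ4_0Block(codes, scale, scratch)
    var acc = 0.0f
    for (i in 0 until QK4_0) acc += scratch[i] * input[offset + i]
    return acc
}
```

Passing `scratch` in from the caller is what makes the hoisting in `matmulF32Q4_0MemSeg` possible: one allocation is reused across every block of the matmul instead of one per block.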
Q4_0's nibble layout has adjacent elements sharing a byte (`code[2k] = byte[k] & 0x0F`, `code[2k+1] = byte[k] >>> 4`), unlike Q4_K's strided layout, where lo/hi nibbles map to separate sub-blocks. Getting Q4_0 codes in natural element order from a `ByteVector` would need a lane-interleave shuffle or a strided gather — significantly more code, and on NEON gather has no native instruction, so it falls back to scalar.

The two-stage pattern is the path Q4_K took before #562 promoted it to a fully-fused pipeline. If Q4_0 ever becomes a hot path again (it's rarely used — Q4_K_M / Q4_K_S cover Gemma 4, Llama, Qwen), the same upgrade can follow.
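To make the layout contrast concrete, here is a small plain-Kotlin illustration (byte values and helper names are made up). With a strided layout, masking and shifting a vector of bytes already yields codes in element order; with Q4_0's interleaved layout, the same operations produce codes out of order, which is exactly what forces the shuffle or gather:

```kotlin
// Hypothetical decode of two packed bytes under each layout (values illustrative).
fun interleavedDecode(bytes: ByteArray): IntArray {
    // Q4_0: adjacent elements share a byte -> order is low0, high0, low1, high1, ...
    val out = IntArray(bytes.size * 2)
    for (k in bytes.indices) {
        val b = bytes[k].toInt() and 0xFF
        out[2 * k] = b and 0x0F
        out[2 * k + 1] = b ushr 4
    }
    return out
}

fun stridedDecode(bytes: ByteArray): IntArray {
    // Q4_K-style: all low nibbles form the first sub-block, all highs the second.
    val n = bytes.size
    val out = IntArray(n * 2)
    for (k in bytes.indices) {
        val b = bytes[k].toInt() and 0xFF
        out[k] = b and 0x0F       // contiguous run of low-nibble codes
        out[n + k] = b ushr 4     // contiguous run of high-nibble codes
    }
    return out
}
```

In the strided case, a `ByteVector` AND/shift pair emits each sub-block's codes contiguously; in the interleaved case, the mask-lanes and shift-lanes land as (all evens, all odds) and must be re-interleaved before they line up with the input slice — the extra step this PR deems out of scope.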
Why now
Q4_0 was the last fully-scalar kernel in `JvmQuantizedVectorKernels`. Closing it brings the JVM Vector path to parity across Q4_0, Q4_K, Q6_K, Q8_0 — every quantized format we support is now SIMD'd to some degree.

Test plan

`./gradlew :skainet-backends:skainet-backend-cpu:jvmTest` — 218/218 pass; `QuantizedMemSegMatmulTest` exercises the Q4_0 MemSeg matmul end-to-end through `ctx.ops.matmul`.

🤖 Generated with Claude Code