perf(q4_0): partial-vec dotQ4_0BlockMemSeg via scratch + SIMD FMA (#565)
Merged
michalharakal merged 1 commit into develop on Apr 28, 2026
Conversation
Replaces the fully-scalar inner loop in `dotQ4_0BlockMemSeg` with the two-stage pattern Q4_K used pre-#562: a scalar byte-pair unpack writes 32 sign-corrected floats into a caller-supplied scratch `FloatArray` (16 byte loads, two nibbles each — half the byte traffic of the prior scalar dot product), then a `FloatVector` FMA reduction takes the dot product against the matching input slice.

Per-block scratch is hoisted out of `matmulF32Q4_0MemSeg`'s loops to amortize allocation. A backwards-compatible no-`codeBuf` overload preserves the prior signature for any external callers (it creates a fresh scratch on each invocation).

Q4_0 isn't dominant in modern quantized weights (Q4_K_M / Q4_K_S cover Gemma 4, Llama, Qwen), but it was the last fully-scalar kernel in `JvmQuantizedVectorKernels` — closing it brings the JVM Vector path to parity across Q4_0, Q4_K, Q6_K, Q8_0.

A fully-fused `ByteVector` pipeline (à la `PanamaVectorQ4KMatmulKernel`) is awkward for Q4_0's interleaved nibble layout — adjacent elements share a byte, so getting codes in natural order needs a strided gather or a lane-interleave shuffle. Out of scope; if Q4_0 becomes a hot path again, that's the next move.

Tests: cpu `jvmTest` 218/218 pass, including `QuantizedMemSegMatmulTest`'s parity coverage of the Q4_0 MemSeg matmul.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the fully-scalar inner loop in `dotQ4_0BlockMemSeg` with the two-stage pattern Q4_K used pre-#562:

- Stage 1: a scalar byte-pair unpack writes 32 sign-corrected floats into a caller-supplied scratch `FloatArray` (16 byte loads, two nibbles each — half the byte traffic of the prior scalar dot product, which loaded a byte per element).
- Stage 2: a `FloatVector.fma` + `reduceLanes(ADD)` reduction takes the dot product against the matching input slice.

Per-block scratch is hoisted out of `matmulF32Q4_0MemSeg`'s loops to amortize allocation. A backwards-compatible no-`codeBuf` overload preserves the prior signature for any external callers.

Why partial, not fully fused?
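The two-stage shape can be sketched in plain Kotlin. This is an illustrative model, not the actual kernel: the block layout constants and function names are assumptions, and the real stage 2 runs on `FloatVector.fma` + `reduceLanes(ADD)`, for which a scalar loop stands in here.

```kotlin
// Illustrative sketch of the two-stage Q4_0 block dot product (not the real kernel).
// A Q4_0 block holds 32 elements: one float scale plus 16 bytes of packed 4-bit codes.
const val QK4_0 = 32

// Stage 1: scalar byte-pair unpack — 16 byte loads, two nibbles each.
// Writes 32 sign-corrected floats (nibble - 8, times scale) into the caller's scratch.
fun unpackQ4_0Block(codes: ByteArray, scale: Float, scratch: FloatArray) {
    for (k in 0 until QK4_0 / 2) {
        val b = codes[k].toInt() and 0xFF
        scratch[2 * k]     = scale * ((b and 0x0F) - 8)   // low nibble  -> even element
        scratch[2 * k + 1] = scale * ((b ushr 4) - 8)     // high nibble -> odd element
    }
}

// Stage 2: dot product of scratch against the input slice. The real kernel does this
// with FloatVector.fma + reduceLanes(ADD); a scalar loop stands in for the SIMD part.
fun dotQ4_0Block(
    codes: ByteArray, scale: Float,
    input: FloatArray, offset: Int,
    scratch: FloatArray,
): Float {
    unpackQ4_0Block(codes, scale, scratch)
    var acc = 0.0f
    for (i in 0 until QK4_0) acc += scratch[i] * input[offset + i]
    return acc
}
```

Passing `scratch` in from the caller is what makes the hoisting in `matmulF32Q4_0MemSeg` possible: one allocation is reused across every block of the matmul instead of one per block.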
Q4_0's nibble layout has adjacent elements sharing a byte (`code[2k] = byte[k] & 0x0F`, `code[2k+1] = byte[k] >>> 4`), unlike Q4_K's strided layout, where lo/hi nibbles map to separate sub-blocks. Getting Q4_0 codes in natural element order from a `ByteVector` would need a lane-interleave shuffle or a strided gather — significantly more code, and on NEON gather has no native instruction, so it falls back to scalar.

The two-stage pattern is the path Q4_K took before #562 promoted it to a fully-fused pipeline. If Q4_0 ever becomes a hot path again (it's rarely used — Q4_K_M / Q4_K_S cover Gemma 4, Llama, Qwen), the same upgrade can follow.
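To make the layout contrast concrete, here is a small plain-Kotlin illustration (byte values and helper names are made up). With a strided layout, masking and shifting a vector of bytes already yields codes in element order; with Q4_0's interleaved layout, the same operations produce codes out of order, which is exactly what forces the shuffle or gather:

```kotlin
// Hypothetical decode of two packed bytes under each layout (values illustrative).
fun interleavedDecode(bytes: ByteArray): IntArray {
    // Q4_0: adjacent elements share a byte -> order is low0, high0, low1, high1, ...
    val out = IntArray(bytes.size * 2)
    for (k in bytes.indices) {
        val b = bytes[k].toInt() and 0xFF
        out[2 * k] = b and 0x0F
        out[2 * k + 1] = b ushr 4
    }
    return out
}

fun stridedDecode(bytes: ByteArray): IntArray {
    // Q4_K-style: all low nibbles form the first sub-block, all highs the second.
    val n = bytes.size
    val out = IntArray(n * 2)
    for (k in bytes.indices) {
        val b = bytes[k].toInt() and 0xFF
        out[k] = b and 0x0F       // contiguous run of low-nibble codes
        out[n + k] = b ushr 4     // contiguous run of high-nibble codes
    }
    return out
}
```

In the strided case, a `ByteVector` AND/shift pair emits each sub-block's codes contiguously; in the interleaved case, the mask-lanes and shift-lanes land as (all evens, all odds) and must be re-interleaved before they line up with the input slice — the extra step this PR deems out of scope.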
Why now
Q4_0 was the last fully-scalar kernel in `JvmQuantizedVectorKernels`. Closing it brings the JVM Vector path to parity across Q4_0, Q4_K, Q6_K, Q8_0 — every quantized format we support is now SIMD'd to some degree.

Test plan

`./gradlew :skainet-backends:skainet-backend-cpu:jvmTest` — 218/218 pass; `QuantizedMemSegMatmulTest` exercises the Q4_0 MemSeg matmul end-to-end through `ctx.ops.matmul`.

🤖 Generated with Claude Code