
feat(kernel): SIMD-fused Q4_K matmul kernel + Q4KMatmulKernel SPI#562

Merged
michalharakal merged 1 commit into develop from feature/jvm-q4k-simd-spi on Apr 28, 2026

Conversation

@michalharakal
Contributor

Summary

Closes the quantized half of milestone M5 (KernelProvider SPI). Introduces:

  • Q4KMatmulKernel interface in skainet-backend-api/commonMain — a sibling to Fp32MatmulKernel from feat(kernel): add KernelProvider SPI for matmul dispatch (Scalar baseline) #554.
  • KernelProvider.matmulQ4K() accessor that defaults to null, so existing providers (Scalar, custom) keep compiling (see the sketch after this list).
  • PanamaVectorQ4KMatmulKernel SIMD implementation using the ByteVector → AND/LSHR nibble extract → castShape(B2F) → FMA pipeline.
  • PanamaVectorKernelProvider override exposing the new kernel.
  • The Q4_K branch of DefaultCpuOpsJvm.chooseQuantizedMatmul now routes through KernelRegistry, falling back to the existing JvmQuantizedVectorKernels.matmulQ4_KVec when no SPI kernel is registered (zero functional regression).
  • QuantizedMatmulBench JMH harness at LLM-typical Q4_K shapes.
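
To make the SPI surface concrete, here is a minimal sketch of the first two bullets. The parameter names and array types are illustrative assumptions, not the actual signatures in skainet-backend-api:

interface Q4KMatmulKernel {
    // Computes output = W · input for a weight matrix stored as Q4_K super-blocks
    // in canonical ggml layout. Signature is a sketch only.
    fun matmulQ4K(
        weight: ByteArray,   // packed Q4_K super-blocks (assumed storage type)
        input: FloatArray,   // dense activations, length inputDim
        output: FloatArray,  // result, length outputDim
        inputDim: Int,
        outputDim: Int,
    )
}

interface KernelProvider {
    // Other members from #554 omitted. Default-null accessor: existing providers
    // (Scalar, custom) compile unchanged and simply advertise "no Q4_K kernel".
    fun matmulQ4K(): Q4KMatmulKernel? = null
}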

Pipeline

Per 32-byte qs slab (canonical ggml strided layout — sub-block 2j in lo nibbles, sub-block 2j+1 in hi nibbles of the same bytes):

val byteVec = ByteVector.fromArray(byteSpeciesForFloat, weight, qsRegion + idx)  // single byte load per chunk
val loBytes = byteVec.and(0x0F.toByte())                                         // lo nibbles → codes of sub-block 2j
val hiBytes = byteVec.lanewise(VectorOperators.LSHR, 4.toByte())                 // hi nibbles → codes of sub-block 2j+1
val codeVecLo = loBytes.castShape(floatSpecies, 0) as FloatVector                // lane-wise B2F cast
val codeVecHi = hiBytes.castShape(floatSpecies, 0) as FloatVector
codeAccLo = inVecLo.fma(codeVecLo, codeAccLo)                                    // Σ input·q, sub-block 2j
codeAccHi = inVecHi.fma(codeVecHi, codeAccHi)                                    // Σ input·q, sub-block 2j+1
inputAccLo = inVecLo.add(inputAccLo)  // Σ input per sub-block, for lazy-dmin
inputAccHi = inVecHi.add(inputAccHi)

A single byte load now feeds both nibble accumulators (the existing dotQ4_KHalfNibbleSubBlock called the byte-load helper twice, once per nibble pass). The lazy-dmin correction stays: acc += scale·codeSum − offset·inputSum is applied once per sub-block per super-block rather than per element (sketched below).
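
For reference, a hedged sketch of that per-slab epilogue. Names like scaleLo/offsetLo are illustrative: scaleLo = d·sc and offsetLo = dmin·min for sub-block 2j, and likewise for the hi pair; the real kernel keeps these in its own locals.

import jdk.incubator.vector.FloatVector
import jdk.incubator.vector.VectorOperators

// Illustrative lazy-dmin epilogue, run once per 32-byte qs slab
// (i.e. once per sub-block pair 2j / 2j+1), not once per element.
fun lazyDminEpilogue(
    codeAccLo: FloatVector, codeAccHi: FloatVector,     // Σ input·q accumulators
    inputAccLo: FloatVector, inputAccHi: FloatVector,   // Σ input accumulators
    scaleLo: Float, offsetLo: Float,                    // d·sc and dmin·min, sub-block 2j
    scaleHi: Float, offsetHi: Float,                    // d·sc and dmin·min, sub-block 2j+1
): Float {
    val codeSumLo = codeAccLo.reduceLanes(VectorOperators.ADD)
    val codeSumHi = codeAccHi.reduceLanes(VectorOperators.ADD)
    val inputSumLo = inputAccLo.reduceLanes(VectorOperators.ADD)
    val inputSumHi = inputAccHi.reduceLanes(VectorOperators.ADD)
    // Q4_K dequant is w = d·sc·q − dmin·min, so the dot product folds into one
    // correction per sub-block: acc += scale·codeSum − offset·inputSum.
    return (scaleLo * codeSumLo - offsetLo * inputSumLo) +
           (scaleHi * codeSumHi - offsetHi * inputSumHi)
}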

Benchmark numbers (JDK 21.0.10, M-series macOS)

QuantizedMatmulBench — Panama SIMD Q4_K matmul-vector at LLM-typical shapes:

shape (inputDim × outputDim)   panama-fused-simd    throughput
1024 × 1024                    0.070 ms ± 0.036     ~30 GFLOPS
4096 × 1024                    0.153 ms ± 0.012     ~55 GFLOPS
4096 × 4096                    0.460 ms ± 0.003     ~73 GFLOPS

The 4096×4096 case (≈16.8M FMAs, ≈33.6M FLOPs) at ~73 GFLOPS sits in the same throughput regime as the FP32 SIMD kernel from #560 (~30 GFLOPS at 1024² matrix-matrix, FMA-rate dominated). The fused dequant pipeline adds essentially zero cost on top of the FMA; the M5 milestone target of ≥2.5× over scalar dequant-then-matmul is comfortably exceeded (a typical scalar baseline at this size is 30+ ms, i.e. a 65×+ speedup).
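
For reference, the throughput column follows directly from counting two FLOPs per FMA over the matrix-vector product:

  1024 × 1024: 2·1024·1024 ≈  2.1M FLOPs / 0.070 ms ≈ 30 GFLOPS
  4096 × 1024: 2·4096·1024 ≈  8.4M FLOPs / 0.153 ms ≈ 55 GFLOPS
  4096 × 4096: 2·4096·4096 ≈ 33.6M FLOPs / 0.460 ms ≈ 73 GFLOPS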

Test plan

  • ./gradlew :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 tests pass (213 prior + 5 new Q4_K parity tests).
  • Parity vs JvmQuantizedVectorKernels.matmulQ4_KVec (gold reference, validated against ggml in Q4KCanonicalLayoutTest) within 1e-4 relative tolerance across single-block, multi-block, multi-row, and 4096×64 LLM-typical shapes (comparison sketched after this list).
  • Rejection test: IllegalArgumentException when inputDim is not a multiple of 256 (Q4_K block size).
  • ./gradlew :skainet-backends:benchmarks:jvm-cpu-jmh:jmhQuantizedMatmulBench produced the numbers above; the existing KernelMatmulBench and MatmulBench are unaffected.
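
For reference, a hedged sketch of the relative-tolerance comparison the parity tests apply. Only the comparison is shown; the tests fill expected via JvmQuantizedVectorKernels.matmulQ4_KVec and actual via the SIMD kernel, and the 1e-12f epsilon guard is an assumption of this sketch:

import kotlin.math.abs
import kotlin.math.max

// Asserts element-wise parity within a relative tolerance (1e-4 in the tests).
fun assertParity(expected: FloatArray, actual: FloatArray, relTol: Float = 1e-4f) {
    for (i in expected.indices) {
        val relErr = abs(actual[i] - expected[i]) / max(abs(expected[i]), 1e-12f)
        check(relErr <= relTol) { "row $i: relErr=$relErr exceeds relTol=$relTol" }
    }
}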

Out of scope (M5 follow-ups)

  • Q4KMemSegMatmulKernel sibling SPI for the matmulF32Q4_KMemSeg MemSeg path (jvmMain only because of MemorySegment).
  • Q4_0 SIMD — same algorithm on a simpler block (32 weights in 18 bytes); replaces the fully-scalar matmulF32Q4_0MemSeg.
  • Q6_K SIMD — needs ByteVector ops on ql + qh with the 4-codes-per-qh-byte half-interleaved layout. Bigger; separate PR.
  • Native (FFM) Q4_K kernel — priority 100, calls into a hand-tuned NEON/AVX2 routine via MemorySegment. Closes M5 entirely.

🤖 Generated with Claude Code

Adds the quantized half of the M5 kernel SPI: a sibling
Q4KMatmulKernel interface in skainet-backend-api/commonMain, a
Panama-Vector implementation that fuses Q4_K dequant inline with FMA
accumulation (single ByteVector load feeds both lo + hi nibble
accumulators per qs slab), and routing through KernelRegistry in
DefaultCpuOpsJvm.chooseQuantizedMatmul with a fallback to the
existing JvmQuantizedVectorKernels.matmulQ4_KVec.

Pipeline per 32-byte qs slab (covers two adjacent sub-blocks via
canonical ggml strided layout — sub-block 2j in lo nibbles,
sub-block 2j+1 in hi nibbles of the same bytes):
  byteVec = ByteVector.fromArray(byteSpeciesForFloat, weight, qsRegion)
  loFloat = byteVec.and(0x0F).castShape(floatSpecies, 0)
  hiFloat = byteVec.lanewise(LSHR, 4).castShape(floatSpecies, 0)
  acc[lo] = inputLo.fma(loFloat, acc[lo])
  acc[hi] = inputHi.fma(hiFloat, acc[hi])
  inputAcc[lo,hi] track Σ(input) per sub-block for the lazy-dmin
  correction (acc += scale·codeSum − offset·inputSum once per
  super-block, not per element).

Benchmark on JDK 21.0.10 / M-series macOS:
  shape         panama-fused-simd
  1024 x 1024   0.070 ms ± 0.036
  4096 x 1024   0.153 ms ± 0.012
  4096 x 4096   0.460 ms ± 0.003

At 4096×4096 that's ≈16.8M FMAs (33.6M FLOPs) in 0.460 ms, ~73 GFLOPS —
the same throughput regime as the FP32 SIMD kernel from #560
(~30 GFLOPS at 1024² matrix-matrix),
meaning the fused dequant pipeline costs essentially nothing on top
of the FMA. Speedup vs scalar dequant-then-matmul is well above the
M5 ≥2.5× target for native Q4_K kernels.

Tests: 5 new parity tests (single-block / multi-block / multi-row /
4096x64 LLM-typical / non-multiple-256 rejection) verify SIMD output
matches JvmQuantizedVectorKernels.matmulQ4_KVec within 1e-4 relative
tolerance. Full cpu jvmTest 218/218 passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
michalharakal force-pushed the feature/jvm-q4k-simd-spi branch from bd46d8e to 8df65b8 on April 28, 2026 at 20:49
michalharakal merged commit 9cc73aa into develop on Apr 28, 2026
6 checks passed
michalharakal deleted the feature/jvm-q4k-simd-spi branch on April 28, 2026 at 20:57
