
feat(kernel): SIMD-fused Q4_K matmul kernel + Q4KMatmulKernel SPI#562

Merged
michalharakal merged 1 commit into develop from feature/jvm-q4k-simd-spi on Apr 28, 2026

Conversation

@michalharakal
Contributor

Summary

Closes the quantized half of milestone M5 (KernelProvider SPI). Introduces:

  • Q4KMatmulKernel interface in skainet-backend-api/commonMain — a sibling to Fp32MatmulKernel from feat(kernel): add KernelProvider SPI for matmul dispatch (Scalar baseline) #554.
  • KernelProvider.matmulQ4K() accessor that defaults to null, so existing providers (Scalar, custom) keep compiling (see the sketch after this list).
  • PanamaVectorQ4KMatmulKernel SIMD implementation using the ByteVector → AND/LSHR nibble extract → castShape(B2F) → FMA pipeline.
  • PanamaVectorKernelProvider override exposing the new kernel.
  • The Q4_K branch of DefaultCpuOpsJvm.chooseQuantizedMatmul now routes through KernelRegistry, falling back to the existing JvmQuantizedVectorKernels.matmulQ4_KVec when no SPI kernel is registered (zero functional regression).
  • QuantizedMatmulBench JMH harness at LLM-typical Q4_K shapes.
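
To make the SPI surface concrete, here is a minimal sketch of the first two bullets. The parameter names and array types are illustrative assumptions, not the actual signatures in skainet-backend-api:

interface Q4KMatmulKernel {
    // Computes output = W · input for a weight matrix stored as Q4_K super-blocks
    // in canonical ggml layout. Signature is a sketch only.
    fun matmulQ4K(
        weight: ByteArray,   // packed Q4_K super-blocks (assumed storage type)
        input: FloatArray,   // dense activations, length inputDim
        output: FloatArray,  // result, length outputDim
        inputDim: Int,
        outputDim: Int,
    )
}

interface KernelProvider {
    // Other members from #554 omitted. Default-null accessor: existing providers
    // (Scalar, custom) compile unchanged and simply advertise "no Q4_K kernel".
    fun matmulQ4K(): Q4KMatmulKernel? = null
}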

Pipeline

Per 32-byte qs slab (canonical ggml strided layout — sub-block 2j in lo nibbles, sub-block 2j+1 in hi nibbles of the same bytes):

val byteVec = ByteVector.fromArray(byteSpeciesForFloat, weight, qsRegion + idx)  // single byte load per chunk
val loBytes = byteVec.and(0x0F.toByte())                                         // lo nibbles → codes of sub-block 2j
val hiBytes = byteVec.lanewise(VectorOperators.LSHR, 4.toByte())                 // hi nibbles → codes of sub-block 2j+1
val codeVecLo = loBytes.castShape(floatSpecies, 0) as FloatVector                // lane-wise B2F cast
val codeVecHi = hiBytes.castShape(floatSpecies, 0) as FloatVector
codeAccLo = inVecLo.fma(codeVecLo, codeAccLo)                                    // Σ input·q, sub-block 2j
codeAccHi = inVecHi.fma(codeVecHi, codeAccHi)                                    // Σ input·q, sub-block 2j+1
inputAccLo = inVecLo.add(inputAccLo)  // Σ input per sub-block, for lazy-dmin
inputAccHi = inVecHi.add(inputAccHi)

A single byte load now feeds both nibble accumulators (the existing dotQ4_KHalfNibbleSubBlock called the byte-load helper twice, once per nibble pass). The lazy-dmin correction stays: acc += scale·codeSum − offset·inputSum is applied once per sub-block per super-block rather than per element (sketched below).
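
For reference, a hedged sketch of that per-slab epilogue. Names like scaleLo/offsetLo are illustrative: scaleLo = d·sc and offsetLo = dmin·min for sub-block 2j, and likewise for the hi pair; the real kernel keeps these in its own locals.

import jdk.incubator.vector.FloatVector
import jdk.incubator.vector.VectorOperators

// Illustrative lazy-dmin epilogue, run once per 32-byte qs slab
// (i.e. once per sub-block pair 2j / 2j+1), not once per element.
fun lazyDminEpilogue(
    codeAccLo: FloatVector, codeAccHi: FloatVector,     // Σ input·q accumulators
    inputAccLo: FloatVector, inputAccHi: FloatVector,   // Σ input accumulators
    scaleLo: Float, offsetLo: Float,                    // d·sc and dmin·min, sub-block 2j
    scaleHi: Float, offsetHi: Float,                    // d·sc and dmin·min, sub-block 2j+1
): Float {
    val codeSumLo = codeAccLo.reduceLanes(VectorOperators.ADD)
    val codeSumHi = codeAccHi.reduceLanes(VectorOperators.ADD)
    val inputSumLo = inputAccLo.reduceLanes(VectorOperators.ADD)
    val inputSumHi = inputAccHi.reduceLanes(VectorOperators.ADD)
    // Q4_K dequant is w = d·sc·q − dmin·min, so the dot product folds into one
    // correction per sub-block: acc += scale·codeSum − offset·inputSum.
    return (scaleLo * codeSumLo - offsetLo * inputSumLo) +
           (scaleHi * codeSumHi - offsetHi * inputSumHi)
}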

Benchmark numbers (JDK 21.0.10, M-series macOS)

QuantizedMatmulBench — Panama SIMD Q4_K matmul-vector at LLM-typical shapes:

shape (inputDim × outputDim)   panama-fused-simd    throughput
1024 × 1024                    0.070 ms ± 0.036     ~30 GFLOPS
4096 × 1024                    0.153 ms ± 0.012     ~55 GFLOPS
4096 × 4096                    0.460 ms ± 0.003     ~73 GFLOPS

The 4096×4096 case (≈16.8M FMAs, ≈33.6M FLOPs) at ~73 GFLOPS sits in the same throughput regime as the FP32 SIMD kernel from #560 (~30 GFLOPS at 1024² matrix-matrix, FMA-rate dominated). The fused dequant pipeline adds essentially zero cost on top of the FMA; the M5 milestone target of ≥2.5× over scalar dequant-then-matmul is comfortably exceeded (a typical scalar baseline at this size is 30+ ms, i.e. a 65×+ speedup).
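
For reference, the throughput column follows directly from counting two FLOPs per FMA over the matrix-vector product:

  1024 × 1024: 2·1024·1024 ≈  2.1M FLOPs / 0.070 ms ≈ 30 GFLOPS
  4096 × 1024: 2·4096·1024 ≈  8.4M FLOPs / 0.153 ms ≈ 55 GFLOPS
  4096 × 4096: 2·4096·4096 ≈ 33.6M FLOPs / 0.460 ms ≈ 73 GFLOPS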

Test plan

  • ./gradlew :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 tests pass (213 prior + 5 new Q4_K parity tests).
  • Parity vs JvmQuantizedVectorKernels.matmulQ4_KVec (gold reference, validated against ggml in Q4KCanonicalLayoutTest) within 1e-4 relative tolerance across single-block, multi-block, multi-row, and 4096×64 LLM-typical shapes (comparison sketched after this list).
  • Rejection test: IllegalArgumentException when inputDim is not a multiple of 256 (Q4_K block size).
  • ./gradlew :skainet-backends:benchmarks:jvm-cpu-jmh:jmhQuantizedMatmulBench produced the numbers above; the existing KernelMatmulBench and MatmulBench are unaffected.
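
For reference, a hedged sketch of the relative-tolerance comparison the parity tests apply. Only the comparison is shown; the tests fill expected via JvmQuantizedVectorKernels.matmulQ4_KVec and actual via the SIMD kernel, and the 1e-12f epsilon guard is an assumption of this sketch:

import kotlin.math.abs
import kotlin.math.max

// Asserts element-wise parity within a relative tolerance (1e-4 in the tests).
fun assertParity(expected: FloatArray, actual: FloatArray, relTol: Float = 1e-4f) {
    for (i in expected.indices) {
        val relErr = abs(actual[i] - expected[i]) / max(abs(expected[i]), 1e-12f)
        check(relErr <= relTol) { "row $i: relErr=$relErr exceeds relTol=$relTol" }
    }
}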

Out of scope (M5 follow-ups)

  • Q4KMemSegMatmulKernel sibling SPI for the matmulF32Q4_KMemSeg MemSeg path (jvmMain only because of MemorySegment).
  • Q4_0 SIMD — same algorithm on a simpler block (32 weights in 18 bytes); replaces the fully-scalar matmulF32Q4_0MemSeg.
  • Q6_K SIMD — needs ByteVector ops on ql + qh with the 4-codes-per-qh-byte half-interleaved layout. Bigger; separate PR.
  • Native (FFM) Q4_K kernel — priority 100, calls into a hand-tuned NEON/AVX2 routine via MemorySegment. Closes M5 entirely.

🤖 Generated with Claude Code

Adds the quantized half of the M5 kernel SPI: a sibling
Q4KMatmulKernel interface in skainet-backend-api/commonMain, a
Panama-Vector implementation that fuses Q4_K dequant inline with FMA
accumulation (single ByteVector load feeds both lo + hi nibble
accumulators per qs slab), and routing through KernelRegistry in
DefaultCpuOpsJvm.chooseQuantizedMatmul with a fallback to the
existing JvmQuantizedVectorKernels.matmulQ4_KVec.

Pipeline per 32-byte qs slab (covers two adjacent sub-blocks via
canonical ggml strided layout — sub-block 2j in lo nibbles,
sub-block 2j+1 in hi nibbles of the same bytes):
  byteVec = ByteVector.fromArray(byteSpeciesForFloat, weight, qsRegion)
  loFloat = byteVec.and(0x0F).castShape(floatSpecies, 0)
  hiFloat = byteVec.lanewise(LSHR, 4).castShape(floatSpecies, 0)
  acc[lo] = inputLo.fma(loFloat, acc[lo])
  acc[hi] = inputHi.fma(hiFloat, acc[hi])
  inputAcc[lo,hi] track Σ(input) per sub-block for the lazy-dmin
  correction (acc += scale·codeSum − offset·inputSum once per
  super-block, not per element).

Benchmark on JDK 21.0.10 / M-series macOS:
  shape         panama-fused-simd
  1024 x 1024   0.070 ms ± 0.036
  4096 x 1024   0.153 ms ± 0.012
  4096 x 4096   0.460 ms ± 0.003

At 4096×4096 that's ≈16.8M FMAs (33.6M FLOPs) in 0.460 ms, ~73 GFLOPS —
the same throughput regime as the FP32 SIMD kernel from #560
(~30 GFLOPS at 1024² matrix-matrix),
meaning the fused dequant pipeline costs essentially nothing on top
of the FMA. Speedup vs scalar dequant-then-matmul is well above the
M5 ≥2.5× target for native Q4_K kernels.

Tests: 5 new parity tests (single-block / multi-block / multi-row /
4096x64 LLM-typical / non-multiple-256 rejection) verify SIMD output
matches JvmQuantizedVectorKernels.matmulQ4_KVec within 1e-4 relative
tolerance. Full cpu jvmTest 218/218 passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
michalharakal force-pushed the feature/jvm-q4k-simd-spi branch from bd46d8e to 8df65b8 on April 28, 2026 at 20:49
michalharakal merged commit 9cc73aa into develop on Apr 28, 2026
6 checks passed
michalharakal deleted the feature/jvm-q4k-simd-spi branch on April 28, 2026 at 20:57
