feat(kernel): SIMD-fused Q4_K matmul kernel + Q4KMatmulKernel SPI #562
Merged
michalharakal merged 1 commit into develop on Apr 28, 2026
Conversation
Adds the quantized half of the M5 kernel SPI: a sibling `Q4KMatmulKernel` interface in `skainet-backend-api/commonMain`, a Panama-Vector implementation that fuses Q4_K dequantization inline with FMA accumulation (a single `ByteVector` load feeds both the lo- and hi-nibble accumulators per `qs` slab), and routing through `KernelRegistry` in `DefaultCpuOpsJvm.chooseQuantizedMatmul`, with a fallback to the existing `JvmQuantizedVectorKernels.matmulQ4_KVec`.

Pipeline per 32-byte `qs` slab (covers two adjacent sub-blocks via the canonical ggml strided layout — sub-block 2j in the lo nibbles, sub-block 2j+1 in the hi nibbles of the same bytes):

```
byteVec = ByteVector.fromArray(byteSpeciesForFloat, weight, qsRegion)
loFloat = byteVec.and(0x0F).castShape(floatSpecies, 0)
hiFloat = byteVec.lanewise(LSHR, 4).castShape(floatSpecies, 0)
acc[lo] = inputLo.fma(loFloat, acc[lo])
acc[hi] = inputHi.fma(hiFloat, acc[hi])
```

`inputAcc[lo,hi]` track Σ(input) per sub-block for the lazy-dmin correction (`acc += scale·codeSum − offset·inputSum` once per super-block, not per element).

Benchmark on JDK 21.0.10 / M-series macOS:

| shape       | panama-fused-simd |
|-------------|-------------------|
| 1024 × 1024 | 0.070 ms ± 0.036  |
| 4096 × 1024 | 0.153 ms ± 0.012  |
| 4096 × 4096 | 0.460 ms ± 0.003  |

At 4096×4096 (≈16.8M FMAs, i.e. 33.6M FLOPs) that is ~73 GFLOPS — the same throughput regime as the FP32 SIMD kernel from #560 (~30 GFLOPS at 1024² matrix-matrix), meaning the fused dequant pipeline costs essentially nothing on top of the FMA. The speedup vs scalar dequant-then-matmul is well above the M5 ≥2.5× target for native Q4_K kernels.

Tests: 5 new parity tests (single-block / multi-block / multi-row / 4096×64 LLM-typical / non-multiple-of-256 rejection) verify that the SIMD output matches `JvmQuantizedVectorKernels.matmulQ4_KVec` within 1e-4 relative tolerance. Full cpu jvmTest: 218/218 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes the quantized half of milestone M5 (KernelProvider SPI). Introduces:

- `Q4KMatmulKernel` interface in `skainet-backend-api/commonMain` (sibling to `Fp32MatmulKernel` from "feat(kernel): add KernelProvider SPI for matmul dispatch (Scalar baseline)" #554).
- `KernelProvider.matmulQ4K()` accessor with a default of `null`, so existing providers (Scalar, custom) keep compiling.
- `PanamaVectorQ4KMatmulKernel` SIMD implementation using the `ByteVector` → AND/LSHR nibble extract → `castShape` (B2F) → FMA pipeline.
- `PanamaVectorKernelProvider` override exposing the new kernel.
- `DefaultCpuOpsJvm.chooseQuantizedMatmul` Q4_K branch routes through `KernelRegistry`, falling back to the existing `JvmQuantizedVectorKernels.matmulQ4_KVec` when no SPI kernel is registered (zero functional regression).
- `QuantizedMatmulBench` JMH harness at LLM-typical Q4_K shapes.

Pipeline
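The default-null accessor plus registry fallback described above can be sketched in plain Kotlin. Names follow the PR (`Q4KMatmulKernel`, `KernelProvider.matmulQ4K`, `chooseQuantizedMatmul`), but the simplified signatures, the `ScalarProvider`/`SimdProvider` stand-ins, and the lambda bodies here are illustrative, not the actual skainet code:

```kotlin
// Minimal sketch of the default-null SPI dispatch pattern (illustrative signatures).
fun interface Q4KMatmulKernel {
    fun matmul(weightBlocks: ByteArray, input: FloatArray, rows: Int, inputDim: Int): FloatArray
}

interface KernelProvider {
    // Default null keeps existing providers (Scalar, custom) compiling unchanged.
    fun matmulQ4K(): Q4KMatmulKernel? = null
}

// Inherits matmulQ4K() == null: no SIMD kernel registered.
object ScalarProvider : KernelProvider

// A provider that does expose a Q4_K kernel.
class SimdProvider(private val kernel: Q4KMatmulKernel) : KernelProvider {
    override fun matmulQ4K(): Q4KMatmulKernel? = kernel
}

// Dispatch: use the SPI kernel when registered, else the existing fallback path.
fun chooseQuantizedMatmul(provider: KernelProvider, fallback: Q4KMatmulKernel): Q4KMatmulKernel =
    provider.matmulQ4K() ?: fallback

fun main() {
    val fallback = Q4KMatmulKernel { _, _, rows, _ -> FloatArray(rows) }
    val simd = Q4KMatmulKernel { _, _, rows, _ -> FloatArray(rows) }
    println(chooseQuantizedMatmul(ScalarProvider, fallback) === fallback)  // no kernel: fallback wins
    println(chooseQuantizedMatmul(SimdProvider(simd), fallback) === simd)  // kernel registered: SPI wins
}
```

The `?:` elvis on a nullable accessor is what makes the "zero functional regression" claim cheap to uphold: providers that predate the interface change never see the new code path.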
Per 32-byte `qs` slab (canonical ggml strided layout — sub-block 2j in lo nibbles, sub-block 2j+1 in hi nibbles of the same bytes):

- A single byte load feeds both nibble accumulators (the existing `dotQ4_KHalfNibbleSubBlock` called the byte-load helper twice, once per nibble pass).
- The lazy `dmin` correction stays: `acc += scale·codeSum − offset·inputSum`, once per sub-block × super-block.

Benchmark numbers (JDK 21.0.10, M-series macOS)
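A scalar Kotlin sketch of the same per-slab logic may make the fusion concrete. This is a reference model only (the real kernel works on Panama `ByteVector` lanes); `scale`/`offset` stand for the precombined Q4_K super-block factors, and the helper name is hypothetical:

```kotlin
// Scalar reference for the fused pipeline: one pass over a 32-byte qs slab
// accumulates both nibble streams, then applies the lazy dmin correction once.
// Per the ggml strided layout, sub-block 2j lives in the lo nibbles and
// sub-block 2j+1 in the hi nibbles of the same 32 bytes.
fun dotSlabFused(
    qs: ByteArray,                    // 32 packed bytes = two 32-code sub-blocks
    inputLo: FloatArray,              // 32 inputs for sub-block 2j
    inputHi: FloatArray,              // 32 inputs for sub-block 2j+1
    scaleLo: Float, offsetLo: Float,  // precombined scale / dmin offset, sub-block 2j
    scaleHi: Float, offsetHi: Float,  // same for sub-block 2j+1
): Float {
    var codeDotLo = 0f; var codeDotHi = 0f
    var inputSumLo = 0f; var inputSumHi = 0f
    for (i in 0 until 32) {
        val b = qs[i].toInt() and 0xFF
        val lo = (b and 0x0F).toFloat()   // lo-nibble code (sub-block 2j)
        val hi = (b ushr 4).toFloat()     // hi-nibble code (sub-block 2j+1)
        codeDotLo += inputLo[i] * lo      // FMA stream 1 (acc[lo])
        codeDotHi += inputHi[i] * hi      // FMA stream 2 (acc[hi])
        inputSumLo += inputLo[i]          // Σ(input) for the lazy correction
        inputSumHi += inputHi[i]
    }
    // Lazy-dmin correction, applied once per sub-block instead of per element:
    // Σ input·(scale·code − offset) = scale·Σ(input·code) − offset·Σ(input)
    return (scaleLo * codeDotLo - offsetLo * inputSumLo) +
           (scaleHi * codeDotHi - offsetHi * inputSumHi)
}
```

Checking this against a naive per-element dequant-then-dot loop is exactly what the parity tests in the test plan do at kernel scale.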
`QuantizedMatmulBench` — Panama SIMD Q4_K matmul-vector at LLM-typical shapes:

| shape       | panama-fused-simd |
|-------------|-------------------|
| 1024 × 1024 | 0.070 ms ± 0.036  |
| 4096 × 1024 | 0.153 ms ± 0.012  |
| 4096 × 4096 | 0.460 ms ± 0.003  |

The 4096×4096 case (≈16.8M FMAs, i.e. 33.6M FLOPs) at ~73 GFLOPS is in the same throughput regime as the FP32 SIMD kernel from #560 (~30 GFLOPS at 1024² matrix-matrix, FMA-rate dominated). The fused dequant pipeline adds essentially zero cost on top of the FMA, and the M5 milestone target of ≥2.5× over scalar dequant-then-matmul is comfortably exceeded (a typical scalar baseline at this size is 30+ ms, a 65×+ speedup).
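The ~73 GFLOPS figure follows directly from the shape and the measured time, counting each fused multiply-add as 2 FLOPs:

```kotlin
// Throughput back-of-envelope for a rows×cols matmul-vector:
// rows·cols FMAs, 2 FLOPs each, divided by the measured wall time.
fun gflops(rows: Int, cols: Int, millis: Double): Double {
    val flops = 2.0 * rows * cols          // 2 FLOPs per fused multiply-add
    return flops / (millis / 1000.0) / 1e9
}

fun main() {
    // 4096×4096 at the benchmarked 0.460 ms ≈ 72.9 GFLOPS, matching the ~73 above.
    println(gflops(4096, 4096, 0.460))
}
```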
Test plan
- `./gradlew :skainet-backends:skainet-backend-cpu:jvmTest` — 218/218 tests pass (213 prior + 5 new Q4_K parity tests).
- Parity tests compare against `JvmQuantizedVectorKernels.matmulQ4_KVec` (gold reference, validated against ggml in `Q4KCanonicalLayoutTest`) within 1e-4 relative tolerance across single-block, multi-block, multi-row, and 4096×64 LLM-typical shapes.
- `IllegalArgumentException` is thrown when `inputDim` is not a multiple of 256 (the Q4_K block size).
- `./gradlew :skainet-backends:benchmarks:jvm-cpu-jmh:jmh` — `QuantizedMatmulBench` numbers above; existing `KernelMatmulBench` and `MatmulBench` unaffected.

Out of scope (M5 follow-ups)
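The non-multiple-of-256 rejection in the test plan amounts to a guard at the kernel entry point. A sketch (the `requireQ4KDim` helper name is hypothetical; Kotlin's `require` throws `IllegalArgumentException`, matching the tested behaviour):

```kotlin
// Q4_K packs weights in 256-element blocks, so any input dimension that is
// not a multiple of 256 cannot be decoded; reject it up front with a clear message.
const val Q4K_BLOCK_SIZE = 256

fun requireQ4KDim(inputDim: Int) {
    require(inputDim % Q4K_BLOCK_SIZE == 0) {
        "inputDim=$inputDim is not a multiple of $Q4K_BLOCK_SIZE (Q4_K block size)"
    }
}

fun main() {
    requireQ4KDim(4096)                     // LLM-typical dim: fine
    try {
        requireQ4KDim(4095)                 // not block-aligned
    } catch (e: IllegalArgumentException) {
        println("rejected: ${e.message}")
    }
}
```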
- `Q4KMemSegMatmulKernel` sibling SPI for the `matmulF32Q4_KMemSeg` MemSeg path (jvmMain only, because of `MemorySegment`).
- `matmulF32Q4_0MemSeg`.
- `ByteVector` ops on `ql + qh` with the 4-codes-per-qh-byte half-interleaved layout. Bigger; separate PR.
- `MemorySegment`. Closes M5 entirely.

🤖 Generated with Claude Code