feat(native-cpu): native FFM Q4_K matmul kernel (PR 2 of 5) by michalharakal · Pull Request #572 · SKaiNET-developers/SKaiNET

michalharakal · 2026-04-29T19:53:53Z

Summary

PR 2 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Wires a real Q4_K matmul into the public Q4KMatmulKernel SPI; NativeKernelProvider now wins KernelRegistry.bestAvailable() over Panama for Q4_K on JVM hosts where the bundled libskainet_kernels resolves and skainet_q4k_matmul links.

C kernel (native/src/q4k_matmul.c): single-source, scalar, -O3 -ffast-math -funroll-loops. Mirrors PanamaVectorQ4KMatmulKernel byte-for-byte on the canonical ggml Q4_K layout (256-element / 144-byte super-blocks; FP16 d/dMin; 12-byte get_scale_min_k4 packed sub-scales; 128 bytes of strided 4-bit codes; lazy-dmin accumulation).
Kotlin wrapper (NativeQ4KMatmulKernel): FFM Linker.downcallHandle on FunctionDescriptor.ofVoid with 8 args; heap arrays copied through Arena.ofConfined segments. The MemSeg-input zero-copy variant for mmap'd weights ships in PR 3.
Provider wiring: NativeKernelProvider.isAvailable() now lib+symbol-gated; matmulQ4K() returns the native kernel when available, cleanly cascades to Panama otherwise.
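
For orientation, the 144-byte super-block named above can be written out as a C struct. This is a sketch based on the public ggml Q4_K layout, not the contents of native/src/q4k_matmul.c. The names `q4k_block` and `q4k_scale_min` are hypothetical; the unpacking logic follows ggml's get_scale_min_k4.

```c
#include <assert.h>
#include <stdint.h>

#define Q4K_ELEMS      256  /* elements per super-block */
#define Q4K_SUB_BLOCKS 8    /* 8 sub-blocks of 32 elements each */

/* Hypothetical type name; field order follows the ggml Q4_K block. */
typedef struct {
    uint16_t d;          /* FP16 super-block scale applied to the sub-scales */
    uint16_t dmin;       /* FP16 super-block scale applied to the sub-mins */
    uint8_t  scales[12]; /* eight 6-bit scales + eight 6-bit mins, packed */
    uint8_t  qs[128];    /* 256 4-bit codes, two per byte */
} q4k_block;

_Static_assert(sizeof(q4k_block) == 144, "Q4_K super-block must be 144 bytes");

/* Unpack the 6-bit scale and 6-bit min of sub-block j (0..7) from the
 * 12 packed bytes. Sub-blocks 0..3 live in the low 6 bits of bytes 0..7;
 * sub-blocks 4..7 take their low 4 bits from bytes 8..11 and their high
 * 2 bits from the top 2 bits of bytes 0..7. */
void q4k_scale_min(int j, const uint8_t *q, uint8_t *sc, uint8_t *mn) {
    assert(j >= 0 && j < Q4K_SUB_BLOCKS);
    if (j < 4) {
        *sc = q[j] & 63;
        *mn = q[j + 4] & 63;
    } else {
        *sc = (uint8_t)((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *mn = (uint8_t)((q[j + 4] >> 4)   | ((q[j]     >> 6) << 4));
    }
}
```

The static assertion documents the 2 + 2 + 12 + 128 = 144 byte budget, which is what lets a kernel step through a weight buffer in fixed 144-byte strides.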

Microbench (Linux x86_64, JDK 21.0.10, gcc 13.3, warmup=20, samples=21, median µs)

shape	native (µs)	panama (µs)	ratio (panama / native)
1024²	379	2225	5.87×
2048²	1393	6558	4.71×
4096²	5958	24865	4.17×
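
The warmup=20 / samples=21 / median protocol can be sketched as a small harness. This is an illustration only, not the code in Q4KMatmulMicrobenchTest (which runs on the JVM); the function names `median_of`, `now_us`, and `bench_median_us` are hypothetical, and the timer here is C11 `timespec_get`, which is wall-clock but not monotonic.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
double median_of(double *v, int n) {
    assert(n > 0);
    qsort(v, (size_t)n, sizeof v[0], cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Wall-clock time in microseconds. */
double now_us(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Run fn `warmup` times untimed, then return the median duration in
 * microseconds over `samples` timed runs. */
double bench_median_us(void (*fn)(void *), void *ctx, int warmup, int samples) {
    assert(warmup >= 0 && samples > 0);
    double *t = malloc((size_t)samples * sizeof *t);
    assert(t != NULL);
    for (int i = 0; i < warmup; ++i) fn(ctx);
    for (int i = 0; i < samples; ++i) {
        double start = now_us();
        fn(ctx);
        t[i] = now_us() - start;
    }
    double m = median_of(t, samples);
    free(t);
    return m;
}
```

A call such as `bench_median_us(run_kernel, &args, 20, 21)` (with `run_kernel` a hypothetical wrapper around the kernel under test) would mirror the settings above. An odd sample count means the median is an actual observed run rather than an average of two.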

Crushes both PRD targets:

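≥2.5× over scalar Q4_K dequant baseline → exceeded by a wide margin. This is inferred rather than measured in the table above: Panama is already much faster than scalar Kotlin, and native is 4.17–5.87× faster than Panama.
≥1.5× over Panama Vector → exceeded by a factor of 2.7–3.9× (4.17–5.87× measured against the 1.5× target)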

Native is single-threaded; Panama uses parallelChunks across all cores. The fact native still wins everywhere suggests parallelChunks overhead dominates at these shapes — also a useful signal for follow-up work in the cpu module.

Test plan

:skainet-backends:skainet-backend-native-cpu:jvmTest — 8/8 (3 pipeline + 5 parity; microbench skipped by default)
:skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
Parity vs PanamaVectorQ4KMatmulKernel within 1e-4 relative tolerance across shapes 256×{1,16}, 1024×64, 4096×64
CI verifies on macOS arm64 / Linux arm64 (cross-arch matrix is PR 4)
To re-run the microbench locally: ./gradlew :skainet-backends:skainet-backend-native-cpu:jvmTest --tests '*Microbench*' -Dskainet.runBench=true
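
A parity check of the "absolute or relative" kind used here can be sketched as follows. The exact comparison inside NativeQ4KMatmulKernelParityTest is not shown in this PR text, so this is one common form under that assumption; the name `within_tolerance` is hypothetical.

```c
#include <assert.h>

/* Returns 1 when `actual` matches `expected` within either the absolute
 * tolerance or the relative tolerance (scaled by |expected|), else 0.
 * NaN inputs fail both comparisons and return 0. */
int within_tolerance(float expected, float actual, float abs_tol, float rel_tol) {
    assert(abs_tol >= 0.0f && rel_tol >= 0.0f);
    float diff = expected > actual ? expected - actual : actual - expected;
    float mag  = expected < 0.0f ? -expected : expected;
    return diff <= abs_tol || diff <= rel_tol * mag;
}
```

Accepting either bound keeps the check meaningful for outputs near zero, where a purely relative tolerance would reject harmless rounding differences from FMA or -ffast-math reassociation.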

Out of scope

PR 3: Q4KMemSegMatmulKernel SPI sibling + native variant for zero-copy mmap'd Q4_K weights (closes M4↔M5 synergy)
PR 4: hand-tuned NEON / AVX2 intrinsics + linuxX64 / linuxArm64 / macosArm64 cross-arch CI matrix
PR 5: native FP32 / Q6_K / Q8_0 kernels
JMH integration in :skainet-backends:benchmarks:jvm-cpu-jmh (the Q4KMatmulMicrobenchTest here is a stand-in)
Maven Central native classifier publishing (separate plan)

🤖 Generated with Claude Code

PR 2 of the staged native (FFM) kernel provider rollout described in docs/.../perf/native-ffm-plan.adoc. Wires a real Q4_K matmul into the public SPI: NativeKernelProvider now reports isAvailable() = true on hosts where the bundled libskainet_kernels resolves and skainet_q4k_matmul links, and matmulQ4K() returns NativeQ4KMatmulKernel at priority 100, winning KernelRegistry.bestAvailable() over Panama (50) for Q4_K on JVM.

Native side (native/):
- src/q4k_matmul.c implements skainet_q4k_matmul over the canonical ggml Q4_K super-block layout (256 elements / 144 bytes; FP16 d/dMin; 12-byte get_scale_min_k4 packed sub-scales; 128 bytes of strided 4-bit codes). Mirrors PanamaVectorQ4KMatmulKernel byte-for-byte, using the same lazy-dmin trick (codeSum + inputSum per sub-block; combine via d*scaleIdx*codeSum - dMin*minIdx*inputSum). Single-threaded, scalar C; the 32-iteration inner loop is straight-line FP arithmetic that -O3 -ffast-math auto-vectorizes on AVX2 / NEON.
- include/skainet_kernels.h declares the new export with the SKAINET_API visibility macro.
- CMakeLists.txt picks up q4k_matmul.c and adds -O3 -ffast-math -funroll-loops to the compile flags so the auto-vec actually fires.

Kotlin side (src/jvmMain):
- NativeQ4KMatmulKernel implements Q4KMatmulKernel via FFM downcall (Linker.downcallHandle on FunctionDescriptor.ofVoid with 8 args matching the C signature). Heap arrays are copied into Arena.ofConfined off-heap segments, the kernel runs, output bulk-copies back. The MemorySegment-input overload that avoids the heap copy for mmap'd Q4_K weights ships in PR 3.
- NativeKernelProvider.isAvailable() now returns NativeQ4KMatmulKernel.isAvailable() (lib loaded + symbol resolved). matmulQ4K() returns the native kernel when available; cascades to Panama otherwise. matmulFp32() still null pending a later PR.
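
The lazy-dmin combination works because each dequantized weight has the form d*scale*code - dMin*min, so the min term can be factored out of the 32-element inner loop. The sketch below compares an eager dequantize-then-multiply loop with the factored form for a single super-block. It is a minimal illustration, not the shipped kernel: function names are hypothetical, and d and dMin are passed as already-converted floats to keep FP16 decoding out of the example.

```c
#include <assert.h>
#include <stdint.h>

/* Unpack the 6-bit scale and min of sub-block j (same packing as ggml's
 * get_scale_min_k4). */
static void scale_min(int j, const uint8_t *q, uint8_t *sc, uint8_t *mn) {
    assert(j >= 0 && j < 8);
    if (j < 4) {
        *sc = q[j] & 63;
        *mn = q[j + 4] & 63;
    } else {
        *sc = (uint8_t)((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *mn = (uint8_t)((q[j + 4] >> 4)   | ((q[j]     >> 6) << 4));
    }
}

/* Reference: dequantize every weight, then multiply by the input. */
float q4k_dot_eager(float d, float dmin, const uint8_t scales[12],
                    const uint8_t qs[128], const float x[256]) {
    float acc = 0.0f;
    for (int sb = 0; sb < 8; ++sb) {
        uint8_t sc, mn;
        scale_min(sb, scales, &sc, &mn);
        /* Even sub-blocks use low nibbles, odd ones high nibbles, of the
         * same 32 bytes. */
        const uint8_t *q = qs + 32 * (sb / 2);
        const int shift = (sb & 1) ? 4 : 0;
        for (int l = 0; l < 32; ++l) {
            float code = (float)((q[l] >> shift) & 0x0F);
            float w = d * sc * code - dmin * mn;
            acc += w * x[32 * sb + l];
        }
    }
    return acc;
}

/* Lazy-dmin: accumulate code*x and x separately, apply the scale and the
 * min once per sub-block instead of once per element. */
float q4k_dot_lazy(float d, float dmin, const uint8_t scales[12],
                   const uint8_t qs[128], const float x[256]) {
    float acc = 0.0f;
    for (int sb = 0; sb < 8; ++sb) {
        uint8_t sc, mn;
        scale_min(sb, scales, &sc, &mn);
        const uint8_t *q = qs + 32 * (sb / 2);
        const int shift = (sb & 1) ? 4 : 0;
        float code_sum = 0.0f, input_sum = 0.0f;
        for (int l = 0; l < 32; ++l) {
            float xv = x[32 * sb + l];
            code_sum  += (float)((q[l] >> shift) & 0x0F) * xv;
            input_sum += xv;
        }
        acc += d * sc * code_sum - dmin * mn * input_sum;
    }
    return acc;
}
```

The two functions agree up to floating-point reassociation, which is the kind of difference the parity tolerance is meant to absorb. The lazy form leaves an inner loop of only multiply-adds and adds, which is the shape compilers tend to auto-vectorize.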

Tests (src/jvmTest):
- NativeQ4KMatmulKernelParityTest: 5 parity assertions vs PanamaVectorQ4KMatmulKernel (the existing priority-50 reference) across single-block / multi-block / LLM-typical (4096×64) shapes with the same fixture pattern as PanamaVectorQ4KMatmulKernelTest. Tolerance: 1e-2 to 5e-1 absolute or 1e-4 relative, the same bar Panama-vs-scalar parity uses, which already swallows FMA + native -ffast-math reassociation differences.
- Q4KMatmulMicrobenchTest: wall-clock comparison vs Panama at 1024² / 2048² / 4096². Skipped by default; activates with -Dskainet.runBench=true (forwarded from Gradle CLI through a new systemProperty bridge in build.gradle.kts).
- NativeFfmPipelineTest: stub-flip assertion updated to expect isAvailable() = true and matmulQ4K() != null.

build.gradle.kts:
- jvmTest dependencies pick up :skainet-backend-cpu (for the parity reference) and kotlinx-coroutines (transitive: PanamaVector uses parallelChunks).
- Test JVM args extended with --add-modules jdk.incubator.vector so the parity test can load Panama.

Microbench numbers (Linux x86_64, JDK 21.0.10, gcc 13.3, -O3 -ffast-math; warmup=20, samples=21, median µs):

shape	native	panama	ratio
1024²	379	2225	5.87×
2048²	1393	6558	4.71×
4096²	5958	24865	4.17×

Crushes both PRD targets:
- ≥2.5× over scalar Q4_K dequant baseline (Panama is already >> scalar; native is 4.17–5.87× faster than Panama)
- ≥1.5× over Panama Vector → exceeded by 2.7–3.9× margin

Verification (linux-x86_64, JDK 21.0.10, cmake 3.28.3):
- :skainet-backends:skainet-backend-native-cpu:jvmTest: 8/8 (3 pipeline + 5 parity, microbench skipped without -D)
- :skainet-backends:skainet-backend-cpu:jvmTest: 218/218 (cascade unchanged; with native registered the registry now hands out the native Q4_K kernel ahead of Panama)

Out of scope (deferred per asciidoc staging):
- PR 3: Q4KMemSegMatmulKernel SPI sibling for zero-copy mmap'd weights
- PR 4: linuxX64 AVX2 + NEON intrinsics + cross-arch CI matrix
- PR 5: native FP32 / Q6_K / Q8_0 kernels
- JMH integration in :skainet-backends:benchmarks:jvm-cpu-jmh (Q4KMatmulMicrobenchTest is a stand-in)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

michalharakal merged commit c928f71 into develop Apr 29, 2026
6 checks passed

michalharakal deleted the feature/native-q4k-matmul branch April 29, 2026 19:54

This was referenced Apr 29, 2026

feat(native-cpu): native FFM FP32 SGEMM kernel (PR 5 of 5) #575

Merged

Prepare 0.22.0 #580

Merged
