feat(native-cpu): native FFM Q4_K matmul kernel (PR 2 of 5)#572
Merged
michalharakal merged 1 commit intodevelopfrom Apr 29, 2026
Merged
feat(native-cpu): native FFM Q4_K matmul kernel (PR 2 of 5)#572michalharakal merged 1 commit intodevelopfrom
michalharakal merged 1 commit intodevelopfrom
Conversation
PR 2 of the staged native (FFM) kernel provider rollout described in
docs/.../perf/native-ffm-plan.adoc. Wires a real Q4_K matmul into the
public SPI: NativeKernelProvider now reports isAvailable() = true on
hosts where the bundled libskainet_kernels resolves and
skainet_q4k_matmul links, and matmulQ4K() returns NativeQ4KMatmulKernel
at priority 100 — winning KernelRegistry.bestAvailable() over Panama
(50) for Q4_K on JVM.
Native side (native/):
- src/q4k_matmul.c implements skainet_q4k_matmul over the canonical
ggml Q4_K super-block layout (256 elements / 144 bytes; FP16 d/dMin;
12-byte get_scale_min_k4 packed sub-scales; 128 bytes of strided
4-bit codes). Mirrors PanamaVectorQ4KMatmulKernel byte-for-byte —
same lazy-dmin trick (codeSum + inputSum per sub-block; combine via
d*scaleIdx*codeSum - dMin*minIdx*inputSum). Single-threaded, scalar
C; the 32-iteration inner loop is straight-line FP arithmetic that
-O3 -ffast-math auto-vectorizes on AVX2 / NEON.
- include/skainet_kernels.h declares the new export with the
SKAINET_API visibility macro.
- CMakeLists.txt picks up q4k_matmul.c and adds -O3 -ffast-math
-funroll-loops to the compile flags so the auto-vec actually fires.
Kotlin side (src/jvmMain):
- NativeQ4KMatmulKernel implements Q4KMatmulKernel via FFM downcall
(Linker.downcallHandle on FunctionDescriptor.ofVoid with 8 args
matching the C signature). Heap arrays are copied into Arena.
ofConfined off-heap segments, the kernel runs, output bulk-copies
back. The MemorySegment-input overload that avoids the heap copy
for mmap'd Q4_K weights ships in PR 3.
- NativeKernelProvider.isAvailable() now returns
NativeQ4KMatmulKernel.isAvailable() (lib loaded + symbol resolved).
matmulQ4K() returns the native kernel when available; cascades to
Panama otherwise. matmulFp32() still null pending a later PR.
Tests (src/jvmTest):
- NativeQ4KMatmulKernelParityTest: 5 parity assertions vs
PanamaVectorQ4KMatmulKernel (the existing priority-50 reference)
across single-block / multi-block / LLM-typical (4096×64) shapes
with the same fixture pattern as PanamaVectorQ4KMatmulKernelTest.
Tolerance: 1e-2 to 5e-1 absolute or 1e-4 relative — the same bar
Panama-vs-scalar parity uses, which already swallows FMA + native
-ffast-math reassociation differences.
- Q4KMatmulMicrobenchTest: wall-clock comparison vs Panama at
1024² / 2048² / 4096². Skipped by default; activates with
-Dskainet.runBench=true (forwarded from Gradle CLI through a new
systemProperty bridge in build.gradle.kts).
- NativeFfmPipelineTest: stub-flip assertion updated to expect
isAvailable() = true and matmulQ4K() != null.
build.gradle.kts:
- jvmTest dependencies pick up :skainet-backend-cpu (for the parity
reference) and kotlinx-coroutines (transitive: PanamaVector uses
parallelChunks).
- Test JVM args extended with --add-modules jdk.incubator.vector so
the parity test can load Panama.
Microbench numbers (Linux x86_64, JDK 21.0.10, gcc 13.3, -O3 -ffast-
math; warmup=20, samples=21, median µs):
shape native panama ratio
1024² 379 2225 5.87×
2048² 1393 6558 4.71×
4096² 5958 24865 4.17×
Crushes both PRD targets:
- ≥2.5× over scalar Q4_K dequant baseline (Panama is already >>
scalar; native is 4.17–5.87× faster than Panama)
- ≥1.5× over Panama Vector → exceeded by 2.7–3.9× margin
Verification (linux-x86_64, JDK 21.0.10, cmake 3.28.3):
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 8/8 (3
pipeline + 5 parity, microbench skipped without -D)
- :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (cascade
unchanged; with native registered the registry now hands out the
native Q4_K kernel ahead of Panama)
Out of scope (deferred per asciidoc staging):
- PR 3: Q4KMemSegMatmulKernel SPI sibling for zero-copy mmap'd weights
- PR 4: linuxX64 AVX2 + NEON intrinsics + cross-arch CI matrix
- PR 5: native FP32 / Q6_K / Q8_0 kernels
- JMH integration in :skainet-backends:benchmarks:jvm-cpu-jmh
(Q4KMatmulMicrobenchTest is a stand-in)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 29, 2026
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PR 2 of the staged native-FFM rollout per
docs/.../perf/native-ffm-plan.adoc. Wires a real Q4_K matmul into the publicQ4KMatmulKernelSPI;NativeKernelProvidernow winsKernelRegistry.bestAvailable()over Panama for Q4_K on JVM hosts where the bundledlibskainet_kernelsresolves andskainet_q4k_matmullinks.native/src/q4k_matmul.c): single-source, scalar,-O3 -ffast-math -funroll-loops. MirrorsPanamaVectorQ4KMatmulKernelbyte-for-byte on the canonical ggml Q4_K layout (256-element / 144-byte super-blocks; FP16 d/dMin; 12-byteget_scale_min_k4packed sub-scales; 128 bytes of strided 4-bit codes; lazy-dminaccumulation).NativeQ4KMatmulKernel): FFMLinker.downcallHandleonFunctionDescriptor.ofVoidwith 8 args; heap arrays copied throughArena.ofConfinedsegments. The MemSeg-input zero-copy variant for mmap'd weights ships in PR 3.NativeKernelProvider.isAvailable()now lib+symbol-gated;matmulQ4K()returns the native kernel when available, cleanly cascades to Panama otherwise.Microbench (Linux x86_64, JDK 21.0.10, gcc 13.3, warmup=20, samples=21, median µs)
Crushes both PRD targets:
≥2.5×over scalar Q4_K dequant baseline → exceeded by a wide margin (Panama is already much faster than scalar Kotlin; native is 4.17–5.87× faster than Panama)≥1.5×over Panama Vector → exceeded by 2.7–3.9× marginNative is single-threaded; Panama uses
parallelChunksacross all cores. The fact native still wins everywhere suggestsparallelChunksoverhead dominates at these shapes — also a useful signal for follow-up work in the cpu module.Test plan
:skainet-backends:skainet-backend-native-cpu:jvmTest— 8/8 (3 pipeline + 5 parity; microbench skipped by default):skainet-backends:skainet-backend-cpu:jvmTest— 218/218 (no regression)PanamaVectorQ4KMatmulKernelwithin1e-4relative tolerance across shapes 256×{1,16}, 1024×64, 4096×64./gradlew :skainet-backends:skainet-backend-native-cpu:jvmTest --tests '*Microbench*' -Dskainet.runBench=trueOut of scope
Q4KMemSegMatmulKernelSPI sibling + native variant for zero-copy mmap'd Q4_K weights (closes M4↔M5 synergy):skainet-backends:benchmarks:jvm-cpu-jmh(theQ4KMatmulMicrobenchTesthere is a stand-in)🤖 Generated with Claude Code