feat(native-cpu): native FFM Q4_K matmul kernel (PR 2 of 5) #572

Merged
michalharakal merged 1 commit into develop from feature/native-q4k-matmul
Apr 29, 2026
@michalharakal

Summary

PR 2 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Wires a real Q4_K matmul into the public Q4KMatmulKernel SPI; NativeKernelProvider now wins KernelRegistry.bestAvailable() over Panama for Q4_K on JVM hosts where the bundled libskainet_kernels resolves and skainet_q4k_matmul links.

  • C kernel (native/src/q4k_matmul.c): single-source, scalar, -O3 -ffast-math -funroll-loops. Mirrors PanamaVectorQ4KMatmulKernel byte-for-byte on the canonical ggml Q4_K layout (256-element / 144-byte super-blocks; FP16 d/dMin; 12-byte get_scale_min_k4 packed sub-scales; 128 bytes of strided 4-bit codes; lazy-dmin accumulation).
  • Kotlin wrapper (NativeQ4KMatmulKernel): FFM Linker.downcallHandle on FunctionDescriptor.ofVoid with 8 args; heap arrays are copied through Arena.ofConfined segments. The MemorySegment-input zero-copy variant for mmap'd weights ships in PR 3.
  • Provider wiring: NativeKernelProvider.isAvailable() now lib+symbol-gated; matmulQ4K() returns the native kernel when available, cleanly cascades to Panama otherwise.
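
The super-block layout named in the C kernel bullet can be sketched as a plain C struct. This is a minimal illustration, assuming the field order of ggml's Q4_K block (d, dmin, scales, codes); the type and helper names (`q4k_block`, `q4k_row_bytes`) are hypothetical and are not taken from native/src/q4k_matmul.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; the real source may differ. */
enum { Q4K_ELEMS = 256, Q4K_SUB_BLOCKS = 8, Q4K_SUB_ELEMS = 32 };

typedef struct {
    uint16_t d;          /* FP16 super-block scale, raw bits */
    uint16_t dmin;       /* FP16 super-block min scale, raw bits */
    uint8_t  scales[12]; /* 8 x (6-bit scale, 6-bit min), packed */
    uint8_t  qs[128];    /* 256 x 4-bit codes, two per byte */
} q4k_block;

/* 2 + 2 + 12 + 128 = 144 bytes, with no padding between fields. */
static_assert(sizeof(q4k_block) == 144, "Q4_K super-block must be 144 bytes");
static_assert(offsetof(q4k_block, scales) == 4, "scales follow d and dmin");
static_assert(offsetof(q4k_block, qs) == 16, "codes follow the packed scales");

/* Bytes occupied by one weight row of `cols` elements.
 * Assumes cols is a multiple of 256, as the layout requires. */
static size_t q4k_row_bytes(size_t cols) {
    assert(cols % Q4K_ELEMS == 0);
    return (cols / Q4K_ELEMS) * sizeof(q4k_block);
}
```

With this layout a 4096-element row occupies 16 super-blocks, so the per-row stride is a fixed multiple of 144 bytes.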

Microbench (Linux x86_64, JDK 21.0.10, gcc 13.3, warmup=20, samples=21, median µs)

shape    native (µs)   panama (µs)   ratio
1024²        379          2225       5.87×
2048²       1393          6558       4.71×
4096²       5958         24865       4.17×

Crushes both PRD targets:

  • ≥2.5× over scalar Q4_K dequant baseline → exceeded by a wide margin (inferred, since this run has no scalar column: Panama is already much faster than scalar Kotlin, and native is 4.17–5.87× faster than Panama)
  • ≥1.5× over Panama Vector → measured 4.17–5.87×, i.e. 2.8–3.9× the target
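
The ratio column and the margin over the 1.5× target are plain divisions of the medians in the table. A small sketch that reproduces them; the function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Speedup of native over Panama from median wall-clock times (µs). */
static double speedup(double panama_us, double native_us) {
    assert(native_us > 0.0);
    return panama_us / native_us;
}

/* How many times over a ">= target x" goal a measured speedup lands. */
static double margin_over_target(double measured, double target) {
    assert(target > 0.0);
    return measured / target;
}
```

For the 4096² row, `speedup(24865, 5958)` is about 4.17 and `margin_over_target(4.17, 1.5)` is about 2.78; for the 1024² row the same two steps give about 5.87 and 3.91.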

Native is single-threaded; Panama uses parallelChunks across all cores. The fact that native still wins at every measured shape suggests parallelChunks overhead dominates at these sizes, which is also a useful signal for follow-up work in the cpu module.

Test plan

  • :skainet-backends:skainet-backend-native-cpu:jvmTest — 8/8 (3 pipeline + 5 parity; microbench skipped by default)
  • :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
  • Parity vs PanamaVectorQ4KMatmulKernel within 1e-2 to 5e-1 absolute or 1e-4 relative tolerance across shapes 256×{1,16}, 1024×64, 4096×64
  • CI verifies on macOS arm64 / Linux arm64 (cross-arch matrix is PR 4)
  • To re-run the microbench locally: ./gradlew :skainet-backends:skainet-backend-native-cpu:jvmTest --tests '*Microbench*' -Dskainet.runBench=true

Out of scope

  • PR 3: Q4KMemSegMatmulKernel SPI sibling + native variant for zero-copy mmap'd Q4_K weights (closes M4↔M5 synergy)
  • PR 4: hand-tuned NEON / AVX2 intrinsics + linuxX64 / linuxArm64 / macosArm64 cross-arch CI matrix
  • PR 5: native FP32 / Q6_K / Q8_0 kernels
  • JMH integration in :skainet-backends:benchmarks:jvm-cpu-jmh (the Q4KMatmulMicrobenchTest here is a stand-in)
  • Maven Central native classifier publishing (separate plan)

🤖 Generated with Claude Code

PR 2 of the staged native (FFM) kernel provider rollout described in
docs/.../perf/native-ffm-plan.adoc. Wires a real Q4_K matmul into the
public SPI: NativeKernelProvider now reports isAvailable() = true on
hosts where the bundled libskainet_kernels resolves and
skainet_q4k_matmul links, and matmulQ4K() returns NativeQ4KMatmulKernel
at priority 100 — winning KernelRegistry.bestAvailable() over Panama
(50) for Q4_K on JVM.
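
The priority cascade described above (native at 100, Panama at 50, fall through when native is unavailable) can be pictured with a small C analogue. The actual registry is Kotlin and its internals are not shown in this PR; the `provider` struct and `best_available` function below are hypothetical stand-ins that only illustrate the selection rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel provider entry. */
typedef struct {
    const char *name;
    int priority;   /* higher wins */
    int available;  /* non-zero when the library loaded and the symbol resolved */
} provider;

/* Return the highest-priority available provider, or NULL if none is usable. */
static const provider *best_available(const provider *ps, size_t n) {
    assert(ps != NULL || n == 0);
    const provider *best = NULL;
    for (size_t i = 0; i < n; ++i) {
        if (!ps[i].available) continue;            /* skip unusable providers */
        if (best == NULL || ps[i].priority > best->priority) best = &ps[i];
    }
    return best;
}
```

With both entries available the priority-100 entry is returned; clearing its `available` flag makes the same call return the priority-50 entry, which mirrors the fallback to Panama.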

Native side (native/):

- src/q4k_matmul.c implements skainet_q4k_matmul over the canonical
  ggml Q4_K super-block layout (256 elements / 144 bytes; FP16 d/dMin;
  12-byte get_scale_min_k4 packed sub-scales; 128 bytes of strided
  4-bit codes). Mirrors PanamaVectorQ4KMatmulKernel byte-for-byte —
  same lazy-dmin trick (codeSum + inputSum per sub-block; combine via
  d*scaleIdx*codeSum - dMin*minIdx*inputSum). Single-threaded, scalar
  C; the 32-iteration inner loop is straight-line FP arithmetic that
  -O3 -ffast-math auto-vectorizes on AVX2 / NEON.

- include/skainet_kernels.h declares the new export with the
  SKAINET_API visibility macro.

- CMakeLists.txt picks up q4k_matmul.c and adds -O3 -ffast-math
  -funroll-loops to the compile flags so the auto-vec actually fires.
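
The sub-scale unpacking and the lazy-dmin combine named above can be sketched for a single 32-element sub-block. The unpacking follows ggml's `get_scale_min_k4`; the nibble selection assumes ggml's reference layout, where a pair of sub-blocks shares 32 code bytes (low nibbles for the first, high nibbles for the second). The `subblock_dot_*` helpers are hypothetical and not the kernel's own code; `d` and `dmin` are taken as already converted from FP16.

```c
#include <assert.h>
#include <stdint.h>

/* Unpack the 6-bit scale and 6-bit min of sub-block j (0..7) from the
 * 12 packed bytes, following ggml's get_scale_min_k4. */
static void get_scale_min_k4(int j, const uint8_t *q, uint8_t *sc, uint8_t *m) {
    assert(j >= 0 && j < 8);
    if (j < 4) {
        *sc = q[j] & 63;
        *m  = q[j + 4] & 63;
    } else {
        *sc = (uint8_t)((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *m  = (uint8_t)((q[j + 4] >> 4)   | ((q[j]     >> 6) << 4));
    }
}

/* Extract the i-th 4-bit code: low nibble when high == 0, else high nibble. */
static int code_at(const uint8_t *qs, int i, int high) {
    return high ? (qs[i] >> 4) : (qs[i] & 0x0F);
}

/* Lazy-dmin form: accumulate codeSum and inputSum, apply scales once. */
static float subblock_dot_lazy(const uint8_t *qs, int high, const float *x,
                               float d, float dmin, uint8_t sc, uint8_t m) {
    float code_sum = 0.0f, input_sum = 0.0f;
    for (int i = 0; i < 32; ++i) {
        code_sum  += x[i] * (float)code_at(qs, i, high);
        input_sum += x[i];
    }
    return d * (float)sc * code_sum - dmin * (float)m * input_sum;
}

/* Reference form: dequantize every weight, then multiply. */
static float subblock_dot_naive(const uint8_t *qs, int high, const float *x,
                                float d, float dmin, uint8_t sc, uint8_t m) {
    float acc = 0.0f;
    for (int i = 0; i < 32; ++i) {
        float w = d * (float)sc * (float)code_at(qs, i, high) - dmin * (float)m;
        acc += x[i] * w;
    }
    return acc;
}
```

The two forms are algebraically equal; the lazy one moves the scale and min multiplies out of the inner loop, leaving one multiply-add and one add per element.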

Kotlin side (src/jvmMain):

- NativeQ4KMatmulKernel implements Q4KMatmulKernel via FFM downcall
  (Linker.downcallHandle on FunctionDescriptor.ofVoid with 8 args
  matching the C signature). Heap arrays are copied into
  Arena.ofConfined off-heap segments, the kernel runs, and the output
  bulk-copies back. The MemorySegment-input overload that avoids the
  heap copy for mmap'd Q4_K weights ships in PR 3.

- NativeKernelProvider.isAvailable() now returns
  NativeQ4KMatmulKernel.isAvailable() (lib loaded + symbol resolved).
  matmulQ4K() returns the native kernel when available; cascades to
  Panama otherwise. matmulFp32() still returns null pending a later PR.

Tests (src/jvmTest):

- NativeQ4KMatmulKernelParityTest: 5 parity assertions vs
  PanamaVectorQ4KMatmulKernel (the existing priority-50 reference)
  across single-block / multi-block / LLM-typical (4096×64) shapes
  with the same fixture pattern as PanamaVectorQ4KMatmulKernelTest.
  Tolerance: 1e-2 to 5e-1 absolute or 1e-4 relative — the same bar
  Panama-vs-scalar parity uses, which already swallows FMA + native
  -ffast-math reassociation differences.

- Q4KMatmulMicrobenchTest: wall-clock comparison vs Panama at
  1024² / 2048² / 4096². Skipped by default; activates with
  -Dskainet.runBench=true (forwarded from Gradle CLI through a new
  systemProperty bridge in build.gradle.kts).

- NativeFfmPipelineTest: stub-flip assertion updated to expect
  isAvailable() = true and matmulQ4K() != null.
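
The "absolute or relative" parity bar can be sketched as a single predicate. This is an assumed formulation (relative error measured against the larger of the two magnitudes); the test's actual definition is not shown in this PR, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Pass when the absolute error is within abs_tol, OR the error relative
 * to the larger magnitude is within rel_tol. NaN inputs never pass. */
static int within_tolerance(float expected, float actual,
                            float abs_tol, float rel_tol) {
    assert(abs_tol >= 0.0f && rel_tol >= 0.0f);
    float err = absf(expected - actual);
    float mag = absf(expected) > absf(actual) ? absf(expected) : absf(actual);
    return err <= abs_tol || err <= rel_tol * mag;
}

/* Index of the first element outside tolerance, or -1 when all match. */
static long first_mismatch(const float *expected, const float *actual, size_t n,
                           float abs_tol, float rel_tol) {
    for (size_t i = 0; i < n; ++i) {
        if (!within_tolerance(expected[i], actual[i], abs_tol, rel_tol)) {
            return (long)i;
        }
    }
    return -1;
}
```

The "or" matters for large outputs: an error of 0.05 on a value near 1000 fails a 1e-2 absolute bound but passes the 1e-4 relative one, which is the kind of reassociation drift the commit message says the bar is meant to absorb.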

build.gradle.kts:

- jvmTest dependencies pick up :skainet-backend-cpu (for the parity
  reference) and kotlinx-coroutines (transitive: PanamaVector uses
  parallelChunks).

- Test JVM args extended with --add-modules jdk.incubator.vector so
  the parity test can load Panama.

Microbench numbers (Linux x86_64, JDK 21.0.10, gcc 13.3,
-O3 -ffast-math; warmup=20, samples=21, median µs):

  shape      native    panama    ratio
  1024²        379      2225     5.87×
  2048²       1393      6558     4.71×
  4096²       5958     24865     4.17×
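
Each cell in the table is a median over the timed samples, taken after warmup runs are discarded. A minimal sketch of that reduction, assuming the samples are already collected as doubles; the `median` helper is hypothetical and not the microbench's own code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Three-way comparison for qsort, avoiding overflow-prone subtraction. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples. Sorts a private copy so the caller's order is kept.
 * Returns -1.0 if the temporary buffer cannot be allocated. */
static double median(const double *samples, size_t n) {
    assert(n > 0);
    double *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) return -1.0;
    memcpy(tmp, samples, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double m = (n % 2) ? tmp[n / 2]
                       : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return m;
}
```

With an odd sample count such as 21, the median is one of the actual measurements rather than an average of two, and a single slow outlier does not move it.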

Crushes both PRD targets:
  - ≥2.5× over scalar Q4_K dequant baseline (Panama is already >>
    scalar; native is 4.17–5.87× faster than Panama)
  - ≥1.5× over Panama Vector → measured 4.17–5.87×, i.e. 2.8–3.9× the target

Verification (linux-x86_64, JDK 21.0.10, cmake 3.28.3):
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 8/8 (3
  pipeline + 5 parity, microbench skipped without -D)
- :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (cascade
  unchanged; with native registered the registry now hands out the
  native Q4_K kernel ahead of Panama)

Out of scope (deferred per asciidoc staging):
- PR 3: Q4KMemSegMatmulKernel SPI sibling for zero-copy mmap'd weights
- PR 4: linuxX64 AVX2 + NEON intrinsics + cross-arch CI matrix
- PR 5: native FP32 / Q6_K / Q8_0 kernels
- JMH integration in :skainet-backends:benchmarks:jvm-cpu-jmh
  (Q4KMatmulMicrobenchTest is a stand-in)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit c928f71 into develop Apr 29, 2026
6 checks passed
@michalharakal michalharakal deleted the feature/native-q4k-matmul branch April 29, 2026 19:54