
feat(native-cpu): zero-copy Q4_K MemSeg kernel + SPI sibling (PR 3 of 5)#573

Merged

michalharakal merged 1 commit into develop from feature/native-q4k-memseg on Apr 29, 2026

Conversation

@michalharakal
Contributor

Summary

PR 3 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Closes the M4↔M5 zero-copy story for mmap'd Q4_K weights.

  • New SPI in skainet-backend-api/jvmMain:

    • Q4KMemSegMatmulKernel — JVM-only sibling of Q4KMatmulKernel taking MemorySegment weights with Long byte offset. Same block layout / lazy-dmin math contract.
    • MemSegKernelProvider — JVM-only sibling of KernelProvider with a null-defaulting matmulQ4KMemSeg() accessor. Doesn't fork the registry; providers opt in by implementing both interfaces and callers smart-cast: (KernelRegistry.bestAvailable() as? MemSegKernelProvider)?.matmulQ4KMemSeg() ?: heapFallback(). Lives in jvmMain because adding MemorySegment to the commonMain KernelProvider would have broken Native / JS / Wasm targets.
  • Native impl in skainet-backend-native-cpu:

    • NativeQ4KMemSegMatmulKernel reuses PR 2's skainet_q4k_matmul C symbol — the kernel just sees const uint8_t* and is oblivious to whether bytes were staged through an arena or supplied directly. The weight pointer goes through; only input/output use small confined-arena copies (heap arrays from the surrounding forward pass).
    • Validates segment size ((inputDim/256) * outputDim * 144 bytes from offset) and rejects undersized segments with IllegalArgumentException — without that, an undersized segment would crash the JVM with SIGSEGV.
    • NativeKernelProvider now implements both KernelProvider and MemSegKernelProvider; NativeKernelProviderFactory delegates both via by NativeKernelProvider so the ServiceLoader-supplied factory passes the smart-cast.
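The dual-interface opt-in described above can be sketched in plain Java (the real SPI is Kotlin in skainet-backend-api/jvmMain; all names and signatures here are illustrative, not the actual API):

```java
// Hypothetical Java analogue of the opt-in SPI described above.
interface KernelProvider {
    String name();
}

interface MemSegKernelProvider {
    // null-defaulting accessor: providers without a MemSeg surface
    // simply inherit the null default and callers fall back
    default Runnable matmulQ4KMemSeg() { return null; }
}

// A provider opts in by implementing BOTH interfaces:
final class DemoProvider implements KernelProvider, MemSegKernelProvider {
    public String name() { return "demo"; }
    public Runnable matmulQ4KMemSeg() { return () -> {}; }
}

final class Lookup {
    // Caller-side cascade: cast, ask for the MemSeg kernel, else heap path.
    static String resolve(KernelProvider p) {
        if (p instanceof MemSegKernelProvider m && m.matmulQ4KMemSeg() != null)
            return "memseg";
        return "heap-fallback";
    }
}
```

The registry itself stays untouched; a provider that never implements the second interface fails the cast and callers transparently take the heap path.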

Microbench (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3 -ffast-math; warmup=20, samples=21, median µs)

  shape   native (heap)   native (memseg)   zero-copy speedup   memseg vs panama
  1024²   360             369               0.98×               5.05×
  2048²   1317            1284              1.03×               4.66×
  4096²   6206            5184              1.20×               4.48×

Honest read: zero-copy is noise at smaller shapes (sub-1MB staging copy, hidden by arena allocator + memcpy throughput) and a real +20% saving at 4096² where the 9 MB weight copy starts to dominate cache pressure. Production loads on real LLMs will be larger still and benefit more — plus they save resident memory since the heap path materializes a JVM-heap copy on top of the off-heap segment.

Test plan

  • :skainet-backends:skainet-backend-native-cpu:jvmTest — 15/15 (3 pipeline + 5 heap-parity + 7 memseg-parity; microbench gated)
  • :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
  • Bit-identical parity (Float.toRawBits equality, no tolerance) between heap and MemSeg paths across 256×{1,16}, 1024×64, 4096×64, plus a non-zero-weight-byte-offset case
  • Provider + factory smart-cast tests confirm the SPI plumbing
  • CI verifies on macOS arm64 / Linux arm64 (cross-arch matrix is PR 4)
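The "bit-identical parity, no tolerance" bar above amounts to comparing raw float bits. A minimal sketch of that comparison (helper name is illustrative, not the test suite's real API):

```java
// Sketch of a no-tolerance parity check: outputs match only if every
// element has identical raw IEEE-754 bits.
final class Parity {
    static boolean bitIdentical(float[] a, float[] b) {
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++) {
            // raw-bit equality distinguishes -0.0f from 0.0f and fails on
            // ANY rounding drift, unlike an epsilon comparison
            if (Float.floatToRawIntBits(a[i]) != Float.floatToRawIntBits(b[i]))
                return false;
        }
        return true;
    }
}
```

This is the right bar here because both paths drive the same C symbol with the same inputs; any drift would mean the wrapper added arithmetic.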

Out of scope

  • PR 4: NEON / AVX2 intrinsics + cross-arch CI matrix
  • PR 5: native FP32 / Q6_K / Q8_0 kernels
  • int64_t weight-offset C-symbol overload (current int32_t limit hits at 2 GB per single segment slice)
  • Panama priority-50 implementation of MemSegKernelProvider — Panama already has Q4_K MemSeg internals; exposing through the new SPI is a small follow-up and lets the smart-cast cascade work even when the native provider is unavailable

🤖 Generated with Claude Code

PR 3 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc.
Closes the M4↔M5 zero-copy story for mmap'd Q4_K weights: callers
that already hold off-heap weight bytes (mmap'd .gguf files, shared
arenas) skip the staging ByteArray → MemorySegment copy that
NativeQ4KMatmulKernel.matmul performs on every call.

SPI surface (skainet-backend-api/src/jvmMain):

- Q4KMemSegMatmulKernel — JVM-only sibling of Q4KMatmulKernel; same
  block layout / lazy-dmin contract, but `weight` is a
  java.lang.foreign.MemorySegment with a Long byte offset. KMP-safe
  positioning: lives in jvmMain (not commonMain) because
  java.lang.foreign isn't available on Native / JS / Wasm targets.

- MemSegKernelProvider — JVM-only sibling of KernelProvider that
  exposes a `matmulQ4KMemSeg(): Q4KMemSegMatmulKernel?` accessor with
  a `null`-defaulting body. Lookup pattern at the call site:

      val kernel = (KernelRegistry.bestAvailable() as? MemSegKernelProvider)
          ?.matmulQ4KMemSeg() ?: heapFallback()

  Doesn't fork the registry — providers opt into MemSeg surfaces by
  implementing both interfaces; smart-cast does the rest. Adding
  `matmulQ4KMemSeg` directly to KernelProvider would have broken
  commonMain (MemorySegment is JVM-only).

Native side (skainet-backend-native-cpu):

- NativeQ4KMemSegMatmulKernel reuses PR 2's skainet_q4k_matmul C
  symbol — the kernel just sees `const uint8_t*` and is oblivious to
  whether the bytes were staged through an arena or read directly
  from a caller-owned segment. The weight pointer is forwarded
  straight through; only input/output go through small confined-arena
  copies (those are usually a few KB and produced/consumed on the
  heap by the surrounding forward pass).

- Validates the segment is large enough for `(inputDim/256) *
  outputDim * 144` bytes from the given offset and rejects undersized
  segments with IllegalArgumentException — without it, an undersized
  segment would crash the JVM with SIGSEGV from the C side.

- weightByteOffset is a Long on the Kotlin side and narrows to
  int32_t at the FFM boundary; we require it to be <= Int.MAX_VALUE
  for now and document the eventual int64_t-offset overload as a
  follow-up. No current LLM single-tensor exceeds 2 GB.

- NativeKernelProvider now implements both KernelProvider and
  MemSegKernelProvider; NativeKernelProviderFactory delegates both
  via `by NativeKernelProvider`. Without the second `by`, the factory
  instance the registry hands out would fail the smart-cast even
  though the underlying singleton implements both interfaces.
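The two guards described above (segment-size validation and the Long → int32_t offset narrowing) can be sketched as follows; this is a hypothetical standalone helper, not the kernel's actual code, and the names are illustrative:

```java
// Sketch of the guards described above: (inputDim/256) Q4_K superblocks,
// 144 bytes each, per output row; plus the int32_t offset cap at the
// FFM boundary. Without the size check an undersized segment would
// SIGSEGV inside the C kernel instead of failing cleanly.
final class Q4KGuards {
    static long requiredBytes(int inputDim, int outputDim) {
        // one Q4_K superblock covers 256 weights and occupies 144 bytes
        return ((long) inputDim / 256) * outputDim * 144L;
    }

    static void validate(long segmentByteSize, long weightByteOffset,
                         int inputDim, int outputDim) {
        if (weightByteOffset > Integer.MAX_VALUE)
            throw new IllegalArgumentException(
                "weightByteOffset exceeds int32_t boundary: " + weightByteOffset);
        long need = requiredBytes(inputDim, outputDim);
        if (segmentByteSize - weightByteOffset < need)
            throw new IllegalArgumentException(
                "undersized segment: need " + need + " bytes from offset "
                + weightByteOffset);
    }
}
```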

Tests (skainet-backend-native-cpu/src/jvmTest):

- NativeQ4KMemSegMatmulKernelParityTest — 7 tests asserting
  bit-identical output (compared via Float.toRawBits, no tolerance)
  to NativeQ4KMatmulKernel across single-block / multi-block /
  LLM-typical shapes. The bit-identical contract is the right bar:
  same C symbol, same inputs ⇒ same outputs; any drift means the
  wrapper added arithmetic.

- Honors-non-zero-weight-byte-offset and rejects-undersized-segment
  cases for the new validation logic.

- Provider/factory smart-cast tests confirm the SPI plumbing works
  end-to-end (NativeKernelProvider as MemSegKernelProvider succeeds;
  factory ditto).

- Q4KMatmulMicrobenchTest extended: heap-copy vs zero-copy at LLM
  shapes. Weight segment pre-allocated in an Arena.ofShared outside
  the timed region — that's the realistic load profile (mmap once,
  reuse across forward passes).
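The warmup/sample/median protocol used for the numbers below can be sketched as a generic harness (not the benchmark's actual code; the measured workload is passed in as a Runnable):

```java
import java.util.Arrays;

// Generic sketch of the measurement protocol: 20 warmup iterations to
// let the JIT settle, 21 timed samples, report the median in µs.
final class Bench {
    static double medianMicros(Runnable work) {
        for (int i = 0; i < 20; i++) work.run();    // warmup
        long[] ns = new long[21];
        for (int i = 0; i < 21; i++) {
            long t0 = System.nanoTime();
            work.run();
            ns[i] = System.nanoTime() - t0;
        }
        Arrays.sort(ns);
        return ns[10] / 1e3;                        // middle of 21 sorted samples
    }
}
```

Taking the median of an odd sample count sidesteps GC and scheduler outliers without needing outlier rejection.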

Microbench numbers (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3 -ffast-
math; warmup=20, samples=21, median µs):

  shape   heap   memseg   zero-copy speedup   memseg vs panama
  1024²   360    369      0.98×                5.05×
  2048²  1317   1284      1.03×                4.66×
  4096²  6206   5184      1.20×                4.48×

Honest read: zero-copy is noise at small shapes (the staged copy is
sub-1MB; arena allocator + memcpy throughput hide it) and a real
+20% saving at 4096² (9 MB weight copy starts to dominate cache
pressure). Production loads on actual LLMs will be larger still and
will benefit more — plus they'll save on resident memory because
the heap path materializes a copy of every weight in JVM heap on
top of the off-heap segment.

Verification (linux-x86_64, JDK 21.0.10):
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 15/15
  (3 pipeline + 5 heap-parity + 7 memseg-parity, microbench skipped)
- :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
- :skainet-backends:skainet-backend-api:jvmTest — 0/0 (no tests yet)

Out of scope (deferred per asciidoc staging):
- PR 4: NEON / AVX2 intrinsics + cross-arch CI matrix
- PR 5: native FP32 / Q6_K / Q8_0 kernels
- int64_t weight offset overload (the current int32_t limit is hit at
  2 GB per single segment slice)
- Panama priority-50 implementation of MemSegKernelProvider — Panama
  already has Q4_K MemSeg internals; exposing through the new SPI is
  a small follow-up and lets the smart-cast cascade work even when
  the native provider is unavailable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 33a576c into develop Apr 29, 2026
5 of 6 checks passed
@michalharakal michalharakal deleted the feature/native-q4k-memseg branch May 2, 2026 17:34