
feat(native-cpu): zero-copy Q4_K MemSeg kernel + SPI sibling (PR 3 of 5)#573

Merged

michalharakal merged 1 commit into develop from feature/native-q4k-memseg on Apr 29, 2026

Conversation

@michalharakal
Contributor

Summary

PR 3 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc. Closes the M4↔M5 zero-copy story for mmap'd Q4_K weights.

  • New SPI in skainet-backend-api/jvmMain:

    • Q4KMemSegMatmulKernel — JVM-only sibling of Q4KMatmulKernel taking MemorySegment weights with Long byte offset. Same block layout / lazy-dmin math contract.
    • MemSegKernelProvider — JVM-only sibling of KernelProvider with a null-defaulting matmulQ4KMemSeg() accessor. Doesn't fork the registry; providers opt in by implementing both interfaces and callers smart-cast: (KernelRegistry.bestAvailable() as? MemSegKernelProvider)?.matmulQ4KMemSeg() ?: heapFallback(). Lives in jvmMain because adding MemorySegment to the commonMain KernelProvider would have broken Native / JS / Wasm targets.
  • Native impl in skainet-backend-native-cpu:

    • NativeQ4KMemSegMatmulKernel reuses PR 2's skainet_q4k_matmul C symbol — the kernel just sees const uint8_t* and is oblivious to whether bytes were staged through an arena or supplied directly. The weight pointer goes through; only input/output use small confined-arena copies (heap arrays from the surrounding forward pass).
    • Validates segment size ((inputDim/256) * outputDim * 144 bytes from offset) and rejects undersized segments with IllegalArgumentException — without that, an undersized segment would crash the JVM with SIGSEGV.
    • NativeKernelProvider now implements both KernelProvider and MemSegKernelProvider; NativeKernelProviderFactory delegates both via by NativeKernelProvider so the ServiceLoader-supplied factory passes the smart-cast.
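The dual-interface opt-in described above can be sketched in plain Java (the real SPI is Kotlin in skainet-backend-api/jvmMain; all names and signatures here are illustrative, not the actual API):

```java
// Hypothetical Java analogue of the opt-in SPI described above.
interface KernelProvider {
    String name();
}

interface MemSegKernelProvider {
    // null-defaulting accessor: providers without a MemSeg surface
    // simply inherit the null default and callers fall back
    default Runnable matmulQ4KMemSeg() { return null; }
}

// A provider opts in by implementing BOTH interfaces:
final class DemoProvider implements KernelProvider, MemSegKernelProvider {
    public String name() { return "demo"; }
    public Runnable matmulQ4KMemSeg() { return () -> {}; }
}

final class Lookup {
    // Caller-side cascade: cast, ask for the MemSeg kernel, else heap path.
    static String resolve(KernelProvider p) {
        if (p instanceof MemSegKernelProvider m && m.matmulQ4KMemSeg() != null)
            return "memseg";
        return "heap-fallback";
    }
}
```

The registry itself stays untouched; a provider that never implements the second interface fails the cast and callers transparently take the heap path.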

Microbench (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3 -ffast-math; warmup=20, samples=21, median µs)

  shape   native (heap)   native (memseg)   zero-copy speedup   memseg vs panama
  1024²   360             369               0.98×               5.05×
  2048²   1317            1284              1.03×               4.66×
  4096²   6206            5184              1.20×               4.48×

Honest read: zero-copy is noise at smaller shapes (sub-1MB staging copy, hidden by arena allocator + memcpy throughput) and a real +20% saving at 4096² where the 9 MB weight copy starts to dominate cache pressure. Production loads on real LLMs will be larger still and benefit more — plus they save resident memory since the heap path materializes a JVM-heap copy on top of the off-heap segment.

Test plan

  • :skainet-backends:skainet-backend-native-cpu:jvmTest — 15/15 (3 pipeline + 5 heap-parity + 7 memseg-parity; microbench gated)
  • :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
  • Bit-identical parity (Float.toRawBits equality, no tolerance) between heap and MemSeg paths across 256×{1,16}, 1024×64, 4096×64, plus a non-zero-weight-byte-offset case
  • Provider + factory smart-cast tests confirm the SPI plumbing
  • CI verifies on macOS arm64 / Linux arm64 (cross-arch matrix is PR 4)
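The "bit-identical parity, no tolerance" bar above amounts to comparing raw float bits. A minimal sketch of that comparison (helper name is illustrative, not the test suite's real API):

```java
// Sketch of a no-tolerance parity check: outputs match only if every
// element has identical raw IEEE-754 bits.
final class Parity {
    static boolean bitIdentical(float[] a, float[] b) {
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++) {
            // raw-bit equality distinguishes -0.0f from 0.0f and fails on
            // ANY rounding drift, unlike an epsilon comparison
            if (Float.floatToRawIntBits(a[i]) != Float.floatToRawIntBits(b[i]))
                return false;
        }
        return true;
    }
}
```

This is the right bar here because both paths drive the same C symbol with the same inputs; any drift would mean the wrapper added arithmetic.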

Out of scope

  • PR 4: NEON / AVX2 intrinsics + cross-arch CI matrix
  • PR 5: native FP32 / Q6_K / Q8_0 kernels
  • int64_t weight-offset C-symbol overload (current int32_t limit hits at 2 GB per single segment slice)
  • Panama priority-50 implementation of MemSegKernelProvider — Panama already has Q4_K MemSeg internals; exposing through the new SPI is a small follow-up and lets the smart-cast cascade work even when the native provider is unavailable

🤖 Generated with Claude Code

PR 3 of the staged native-FFM rollout per docs/.../perf/native-ffm-plan.adoc.
Closes the M4↔M5 zero-copy story for mmap'd Q4_K weights: callers
that already hold off-heap weight bytes (mmap'd .gguf files, shared
arenas) skip the staging ByteArray → MemorySegment copy that
NativeQ4KMatmulKernel.matmul performs on every call.

SPI surface (skainet-backend-api/src/jvmMain):

- Q4KMemSegMatmulKernel — JVM-only sibling of Q4KMatmulKernel; same
  block layout / lazy-dmin contract, but `weight` is a
  java.lang.foreign.MemorySegment with a Long byte offset. KMP-safe
  positioning: lives in jvmMain (not commonMain) because
  java.lang.foreign isn't available on Native / JS / Wasm targets.

- MemSegKernelProvider — JVM-only sibling of KernelProvider that
  exposes a `matmulQ4KMemSeg(): Q4KMemSegMatmulKernel?` accessor with
  a `null`-defaulting body. Lookup pattern at the call site:

      val kernel = (KernelRegistry.bestAvailable() as? MemSegKernelProvider)
          ?.matmulQ4KMemSeg() ?: heapFallback()

  Doesn't fork the registry — providers opt into MemSeg surfaces by
  implementing both interfaces; smart-cast does the rest. Adding
  `matmulQ4KMemSeg` directly to KernelProvider would have broken
  commonMain (MemorySegment is JVM-only).

Native side (skainet-backend-native-cpu):

- NativeQ4KMemSegMatmulKernel reuses PR 2's skainet_q4k_matmul C
  symbol — the kernel just sees `const uint8_t*` and is oblivious to
  whether the bytes were staged through an arena or read directly
  from a caller-owned segment. The weight pointer is forwarded
  straight through; only input/output go through small confined-arena
  copies (those are usually a few KB and produced/consumed on the
  heap by the surrounding forward pass).

- Validates the segment is large enough for `(inputDim/256) *
  outputDim * 144` bytes from the given offset and rejects undersized
  segments with IllegalArgumentException — without it, an undersized
  segment would crash the JVM with SIGSEGV from the C side.

- weightByteOffset is a Long on the Kotlin side and narrows to
  int32_t at the FFM boundary; we require it to be <= Int.MAX_VALUE
  for now and document the eventual int64_t-offset overload as a
  follow-up. No current LLM single-tensor exceeds 2 GB.

- NativeKernelProvider now implements both KernelProvider and
  MemSegKernelProvider; NativeKernelProviderFactory delegates both
  via `by NativeKernelProvider`. Without the second `by`, the factory
  instance the registry hands out would fail the smart-cast even
  though the underlying singleton implements both interfaces.
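The two guards described above (segment-size validation and the Long → int32_t offset narrowing) can be sketched as follows; this is a hypothetical standalone helper, not the kernel's actual code, and the names are illustrative:

```java
// Sketch of the guards described above: (inputDim/256) Q4_K superblocks,
// 144 bytes each, per output row; plus the int32_t offset cap at the
// FFM boundary. Without the size check an undersized segment would
// SIGSEGV inside the C kernel instead of failing cleanly.
final class Q4KGuards {
    static long requiredBytes(int inputDim, int outputDim) {
        // one Q4_K superblock covers 256 weights and occupies 144 bytes
        return ((long) inputDim / 256) * outputDim * 144L;
    }

    static void validate(long segmentByteSize, long weightByteOffset,
                         int inputDim, int outputDim) {
        if (weightByteOffset > Integer.MAX_VALUE)
            throw new IllegalArgumentException(
                "weightByteOffset exceeds int32_t boundary: " + weightByteOffset);
        long need = requiredBytes(inputDim, outputDim);
        if (segmentByteSize - weightByteOffset < need)
            throw new IllegalArgumentException(
                "undersized segment: need " + need + " bytes from offset "
                + weightByteOffset);
    }
}
```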

Tests (skainet-backend-native-cpu/src/jvmTest):

- NativeQ4KMemSegMatmulKernelParityTest — 7 tests asserting
  bit-identical output (compared via Float.toRawBits, no tolerance)
  to NativeQ4KMatmulKernel across single-block / multi-block /
  LLM-typical shapes. The bit-identical contract is the right bar:
  same C symbol, same inputs ⇒ same outputs; any drift means the
  wrapper added arithmetic.

- Honors-non-zero-weight-byte-offset and rejects-undersized-segment
  cases for the new validation logic.

- Provider/factory smart-cast tests confirm the SPI plumbing works
  end-to-end (NativeKernelProvider as MemSegKernelProvider succeeds;
  factory ditto).

- Q4KMatmulMicrobenchTest extended: heap-copy vs zero-copy at LLM
  shapes. Weight segment pre-allocated in an Arena.ofShared outside
  the timed region — that's the realistic load profile (mmap once,
  reuse across forward passes).
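The warmup/sample/median protocol used for the numbers below can be sketched as a generic harness (not the benchmark's actual code; the measured workload is passed in as a Runnable):

```java
import java.util.Arrays;

// Generic sketch of the measurement protocol: 20 warmup iterations to
// let the JIT settle, 21 timed samples, report the median in µs.
final class Bench {
    static double medianMicros(Runnable work) {
        for (int i = 0; i < 20; i++) work.run();    // warmup
        long[] ns = new long[21];
        for (int i = 0; i < 21; i++) {
            long t0 = System.nanoTime();
            work.run();
            ns[i] = System.nanoTime() - t0;
        }
        Arrays.sort(ns);
        return ns[10] / 1e3;                        // middle of 21 sorted samples
    }
}
```

Taking the median of an odd sample count sidesteps GC and scheduler outliers without needing outlier rejection.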

Microbench numbers (Linux x86_64, JDK 21.0.10, gcc 13.3 -O3 -ffast-
math; warmup=20, samples=21, median µs):

  shape   heap   memseg   zero-copy speedup   memseg vs panama
  1024²   360    369      0.98×                5.05×
  2048²  1317   1284      1.03×                4.66×
  4096²  6206   5184      1.20×                4.48×

Honest read: zero-copy is noise at small shapes (the staged copy is
sub-1MB; arena allocator + memcpy throughput hide it) and a real
+20% saving at 4096² (9 MB weight copy starts to dominate cache
pressure). Production loads on actual LLMs will be larger still and
will benefit more — plus they'll save on resident memory because
the heap path materializes a copy of every weight in JVM heap on
top of the off-heap segment.

Verification (linux-x86_64, JDK 21.0.10):
- :skainet-backends:skainet-backend-native-cpu:jvmTest — 15/15
  (3 pipeline + 5 heap-parity + 7 memseg-parity, microbench skipped)
- :skainet-backends:skainet-backend-cpu:jvmTest — 218/218 (no regression)
- :skainet-backends:skainet-backend-api:jvmTest — 0/0 (no tests yet)

Out of scope (deferred per asciidoc staging):
- PR 4: NEON / AVX2 intrinsics + cross-arch CI matrix
- PR 5: native FP32 / Q6_K / Q8_0 kernels
- int64_t weight offset overload (the current int32_t limit is hit at
  2 GB per single segment slice)
- Panama priority-50 implementation of MemSegKernelProvider — Panama
  already has Q4_K MemSeg internals; exposing through the new SPI is
  a small follow-up and lets the smart-cast cascade work even when
  the native provider is unavailable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 33a576c into develop Apr 29, 2026
5 of 6 checks passed
@michalharakal michalharakal deleted the feature/native-q4k-memseg branch May 2, 2026 17:34