# PRD — Native (FFM) Kernel Provider

**Status:** Deferred (post-0.21.0). Captured here so the design doesn't drift.
**Owner:** unassigned.
**Milestone:** M5 (CPU backend dispatch) — final piece. The roadmap's M5 success metric `native ≥2.5× for Q4_K` requires this provider; the JVM Vector half (PRs #554, #557, #560, #562, #563, #564) closed the Panama story but not the native one.

## Context

The kernel SPI shipped in PR #554 (`KernelProvider`, `Fp32MatmulKernel`, `KernelRegistry`) was designed to host **three** providers, ordered by priority:

| priority | provider | status |
|---------:|---|---|
| 0 | `ScalarKernelProvider` | shipped (#554) |
| 50 | `PanamaVectorKernelProvider` | shipped (#557, plus tile-blocking #560, ServiceLoader #559) |
| 100 | `NativeKernelProvider` (FFM) | **this PRD** |

The Panama provider runs the FP32 matmul at ~73 GFLOPS for square 4096² shapes on Apple Silicon (per #558's JMH bench), and the Q4_K SIMD kernel runs in the same throughput regime as Panama FP32 (#562 numbers). That's already in the ggml NEON ballpark — but ggml's hand-tuned NEON / AVX2 still outruns Panama on dense per-cycle FLOPs, and on Q4_K specifically, where 4-bit nibble unpacking maps cleanly to dedicated SIMD shuffles that the Vector API can't always emit.
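
To make the nibble-unpacking point concrete, here is a scalar sketch in the Q4_0 style (subtract-8 bias); real Q4_K adds super-block scales and mins, so this is illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Unpack one byte of packed 4-bit weights into two signed values.
 * Q4_0-style (nibble - 8 bias); Q4_K proper uses per-sub-block
 * scales and mins instead. A NEON/AVX2 kernel does this for 16/32
 * bytes at once with a mask + shift or a table shuffle -- exactly
 * the idiom the Vector API can't always emit as one instruction. */
void unpack_q4_byte(uint8_t b, int8_t *lo, int8_t *hi) {
    *lo = (int8_t)(b & 0x0F) - 8; /* low nibble  */
    *hi = (int8_t)(b >> 4) - 8;   /* high nibble */
}
```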

A native provider closes that gap and unlocks two follow-ons:
- **M4 ↔ M5 synergy.** Mmap'd Q4_K weights stay as `MemorySegment` views; a native kernel reads the same pages with zero copy through FFM. No staging buffer, no `ByteArray` round-trip.
- **Hardware-specific lanes.** AVX-512 VNNI fused INT8 dot products, NEON int8/`bf16` dot-product instructions (`SDOT`/`BFDOT`), future SVE — the Vector API exposes none of these portably today.

## Goals

1. **A `NativeKernelProvider` registered at priority 100** that on JDK 21+ wins `KernelRegistry.bestAvailable()` over Panama whenever the native lib is loaded successfully.
2. **A first concrete kernel: native Q4_K matmul.** It must:
   - take a `MemorySegment` for both input (FP32) and packed Q4_K weights (canonical ggml layout — same as `Q4_KBlockTensorData` and the existing `matmulF32Q4_KMemSeg`);
   - produce numerically equivalent output to `PanamaVectorQ4KMatmulKernel` within `1e-4` relative tolerance (same parity bar `PanamaVectorQ4KMatmulKernelTest` uses);
   - clear **≥2.5× over the prior Q4_K scalar dequant baseline** on the bench shapes from `QuantizedMatmulBench` (1024², 4096×1024, 4096²).
3. **Optional follow-on kernels** — Q6_K, Q8_0, FP32 matmul — share the build system, but each ships as a separate small PR.
4. **One supported architecture for the first PR** (likely Apple Silicon NEON, since that's the development hardware in use), with a clear extension path for `linuxX64` AVX2 / `linuxArm64` NEON.

## Non-goals

- **JNI.** The roadmap explicitly says "FFM not JNI". JNI's per-call overhead and array pinning/copy semantics are wrong for hot per-token kernels; FFM (stable in Java 22, preview in Java 21) gives near-zero-overhead native calls and a direct `MemorySegment` ABI.
- **Cross-compilation matrix on day one.** The first PR can ship just one (host-arch) variant; CI cross-arch builds come later.
- **Replacing Panama.** Panama remains the priority-50 fallback for environments that can't load native libs (sandboxes, Wasm, Native targets, JDK without `jdk.incubator.vector`).
- **Distribution via Maven Central pre-built native artifacts.** Out of scope for the first PR — local build only. A separate "publish native classifier JARs" PRD comes later.

## Architecture

### Module layout

```
skainet-backends/
  skainet-backend-native-cpu/                        # NEW
    src/
      jvmMain/kotlin/sk/ainet/exec/kernel/           # Kotlin side
        NativeKernelProvider.kt                      # priority=100, isAvailable() = libLoaded
        NativeQ4KMatmulKernel.kt                     # implements Q4KMatmulKernel, calls FFM
        NativeLibraryLoader.kt                       # loadLibrary, locate, check API version
      jvmMain/resources/META-INF/services/
        sk.ainet.backend.api.kernel.KernelProvider   # appends NativeKernelProviderFactory
      jvmTest/kotlin/sk/ainet/exec/kernel/
        NativeQ4KMatmulKernelTest.kt                 # parity vs PanamaVectorQ4KMatmulKernel
    native/                                          # native source tree
      c/
        q4k_matmul.c                                 # ggml-style hand-tuned kernel
        q4k_matmul.h
      CMakeLists.txt                                 # or Bazel BUILD
    build.gradle.kts                                 # Gradle wrapper that invokes CMake
```

The native library compiles to a shared object (`libskainet_kernels.dylib` on macOS, `.so` on Linux, `.dll` on Windows) and is packaged into the module's resources. At runtime the loader extracts it to a temp directory and calls `System.load` (classpath resources aren't on `java.library.path`, so plain `System.loadLibrary` can't see them).
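
One way to make the loader's "check API version" step (from the module layout above) concrete: export a version function and refuse the library on mismatch. The symbol name and versioning scheme here are assumptions, not an existing API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ABI handshake: the Kotlin loader resolves and calls
 * this symbol first, and treats the library as unavailable (falling
 * back to Panama) if the version doesn't match what the bindings
 * were compiled against. Guards against a stale .dylib/.so in the
 * resources path silently mis-calling newer entry points. */
#define SKAINET_KERNELS_ABI_VERSION 1

int32_t skainet_kernels_abi_version(void) {
    return SKAINET_KERNELS_ABI_VERSION;
}
```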

### FFM binding pattern

Single C entry point per kernel:

```c
// q4k_matmul.h
void skainet_q4k_matmul(
    const float* input,      // FP32 input vector, length input_dim
    const uint8_t* weight,   // packed Q4_K bytes (canonical ggml layout)
    int32_t weight_byte_offset,
    int32_t input_dim,
    int32_t output_dim,
    float* output,           // FP32 output, length output_dim
    int32_t output_offset
);
```
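
For parity testing, a scalar reference with the same signature makes a useful oracle. The sketch below uses a deliberately simplified layout (32-weight blocks, one FP32 scale per block, Q4_0-style nibble - 8, `input_dim` divisible by 32) rather than the real Q4_K super-block format, so it illustrates the call contract only:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference for the entry point above, against a SIMPLIFIED
 * layout: each block is one float scale followed by 16 bytes of
 * interleaved nibbles covering 32 weights. The real Q4_K super-block
 * layout (256 weights, 6-bit sub-scales, mins) is more involved. */
void skainet_q4k_matmul(
    const float *input, const uint8_t *weight, int32_t weight_byte_offset,
    int32_t input_dim, int32_t output_dim,
    float *output, int32_t output_offset)
{
    const int block = 32;
    const int block_bytes = 4 + block / 2;    /* scale + nibbles */
    const int blocks_per_row = input_dim / block;
    const uint8_t *w = weight + weight_byte_offset;
    for (int32_t row = 0; row < output_dim; row++) {
        float acc = 0.0f;
        for (int32_t b = 0; b < blocks_per_row; b++) {
            const uint8_t *blk =
                w + (size_t)(row * blocks_per_row + b) * block_bytes;
            float scale;
            memcpy(&scale, blk, sizeof scale);  /* avoid unaligned-read UB */
            const uint8_t *q = blk + 4;
            for (int i = 0; i < block / 2; i++) {
                int lo = (q[i] & 0x0F) - 8;
                int hi = (q[i] >> 4) - 8;
                acc += scale * lo * input[b * block + 2 * i];
                acc += scale * hi * input[b * block + 2 * i + 1];
            }
        }
        output[output_offset + row] = acc;
    }
}
```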

Kotlin side:

```kotlin
internal object NativeQ4KMatmulKernel : Q4KMatmulKernel {
    private val handle: MethodHandle = run {
        val symbol = NativeLibraryLoader.lib.find("skainet_q4k_matmul").orElseThrow()
        Linker.nativeLinker().downcallHandle(
            symbol,
            FunctionDescriptor.ofVoid(
                ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
                ValueLayout.JAVA_INT, ValueLayout.JAVA_INT,
                ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
            ),
        )
    }

    override fun matmul(
        input: FloatArray, inputOffset: Int,
        weight: ByteArray, weightByteOffset: Int,
        inputDim: Int, outputDim: Int,
        output: FloatArray, outputOffset: Int,
    ) {
        // Heap arrays: pass via temporary off-heap MemorySegment + bulk copy,
        // OR (preferred) overload with a MemorySegment-input variant for
        // mmap'd weights to avoid the copy.
        ...
    }
}
```

The cleaner path is to introduce a sibling **`Q4KMemSegMatmulKernel`** SPI (mentioned as out-of-scope in #563) that takes `MemorySegment` directly, and have the native provider implement *that* — no heap copy. The `Q4KMatmulKernel` (ByteArray) variant can wrap the MemSeg one with a temporary `Arena.ofConfined()` copy if needed for legacy callers.

### Build system

**Gradle + CMake** is the path of least resistance:
- A new Gradle plugin (or hand-rolled `Exec` tasks) invokes CMake for the native module's `build` task.
- Native artifacts land in `build/native/<arch>/` and are copied into `src/jvmMain/resources/native/<os>-<arch>/` so the library loader can locate, extract, and `System.load` them.
- The Kotlin compile depends on the native artifact being built first.
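
For orientation, a minimal `CMakeLists.txt` sketch under the assumptions above (library name and flags are placeholders); the xnnpack backend's actual build files remain the template to copy:

```cmake
cmake_minimum_required(VERSION 3.22)
project(skainet_kernels C)

# Single shared library, host arch only (first-PR scope).
add_library(skainet_kernels SHARED c/q4k_matmul.c)
target_compile_options(skainet_kernels PRIVATE -O3)

# Tune for the build host; the cross-arch CI matrix replaces this later.
if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm64|aarch64")
  target_compile_options(skainet_kernels PRIVATE -mcpu=native)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "x86_64|AMD64")
  target_compile_options(skainet_kernels PRIVATE -march=native)
endif()
```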

The `xnnpack` backend already in the repo (`skainet-backends/skainet-backend-xnnpack/`) demonstrates a similar pattern — Gradle invokes CMake to build a native lib via cinterop. **Reuse that template** rather than reinventing.

**Architecture detection**: at native module build time, query the host arch and build only for it (first PR scope). The CI cross-arch matrix follows.

### Provider class

```kotlin
public object NativeKernelProvider : KernelProvider {
    override val name: String = "native-ffm"
    override val priority: Int = 100

    private val available: Boolean by lazy {
        runCatching { NativeLibraryLoader.load() }.isSuccess
    }

    override fun isAvailable(): Boolean = available

    override fun matmulFp32(): Fp32MatmulKernel? = null // future PR
    override fun matmulQ4K(): Q4KMatmulKernel? =
        if (isAvailable()) NativeQ4KMatmulKernel else null
}
```

Registered via the existing ServiceLoader mechanism (`META-INF/services/sk.ainet.backend.api.kernel.KernelProvider`); since ServiceLoader needs an instantiable class rather than a Kotlin `object`, the entry points at a thin factory class declared as `KernelProvider by NativeKernelProvider`. When unavailable, the cascade falls through to Panama (priority 50), preserving the M5 metric on environments without native code.

## Staged delivery

PRs in order, each independently mergeable:

1. **`skainet-backend-native-cpu` module scaffolding** — Gradle module, build.gradle.kts wired to invoke CMake, a *trivial* C kernel (e.g. one that just multiplies its first input by 2.0 and writes to output) to prove the FFM pipeline end-to-end, and a `NativeKernelProvider` stub whose `isAvailable()` returns `false` until the real kernel lands. Sets up the CI artifact path on the host arch.
2. **First real native kernel: Q4_K matmul (Apple Silicon NEON)** — hand-tuned kernel, parity tests vs `PanamaVectorQ4KMatmulKernel`, JMH bench variant added to `QuantizedMatmulBench`.
3. **`Q4KMemSegMatmulKernel` SPI sibling + native variant** — closes the M4↔M5 zero-copy story for mmap'd weights.
4. **linuxX64 AVX2 variant + cross-arch CI build** — the cross-compilation matrix story.
5. **Optional: native FP32 matmul, native Q6_K, native Q8_0** — same shape as PRs 2–3, one per format.

The first PR is the largest in *scaffolding* terms (~500–800 LoC of build glue plus one trivial kernel), but every subsequent PR is small and template-able.

## Success metrics

- **PR 2 sign-off**: native Q4_K matmul on Apple Silicon clears **≥2.5×** over the scalar Q4_K dequant-then-matmul baseline at 4096² (the M5 milestone target). For reference: Panama Q4_K SIMD already exceeds this metric (see #562 PR body, ~73 GFLOPS), so the real bar is "beats Panama by a meaningful margin", probably ≥1.5× over Panama.
- **PR 3 sign-off**: the Q4_K MemSeg native path is faster than the Panama Q4_K MemSeg path from #563, with no heap copy in the timed region.
- **No regression on JVM-only environments** — when the native lib fails to load (sandbox, missing arch, etc.), `KernelRegistry.bestAvailable()` cleanly falls through to Panama, and existing tests / benches show the same numbers as today.

## Risks & open questions

1. **JDK 21 preview vs JDK 22 stable.** FFM left preview in Java 22. The repo currently builds on JDK 21 with `--enable-preview --add-modules jdk.incubator.vector`. We need to decide: stay on JDK 21 preview FFM (smaller blast radius, matches the Vector API's status) or bump to JDK 22+ for stable FFM. **Recommendation**: stay on 21 preview; flip to 22 in a separate toolchain-bump PR.
2. **`MethodHandle` invocation overhead.** Even with FFM, each native call has a small fixed cost (sub-microsecond, but nonzero once argument marshalling is counted). For the smallest matmul shapes (e.g. 256² FP32) this could swamp the FLOPs win. Mitigation: route small inputs to Panama and large inputs to native at the registry/provider level, OR accept that the win is sized for production-relevant shapes (4096²+).
3. **Native code quality and maintenance.** Hand-tuned NEON / AVX2 in C is harder to audit than Kotlin Vector API code. Mitigation: keep kernels small (<300 LoC each), parity-test exhaustively, and prefer porting from ggml's reference (which is MIT-licensed and well-vetted) over writing from scratch.
4. **Distribution.** Native artifacts complicate Maven Central publication (a `<classifier>` per OS/arch is needed). For the first internal-use PR this isn't a blocker, but a separate "publish native classifier JARs" PRD will be needed before community use.
5. **Cross-arch CI cost.** Building NEON natively on Apple Silicon CI plus AVX2 on linuxX64 plus Android NDK doubles or triples build time. The xnnpack backend's existing CI matrix is a precedent — reuse the same approach.
6. **Native `MemorySegment` lifetime.** The Kotlin caller owns the `Arena` for arrays it copies in. The native kernel must NOT retain pointers past the FFM call return. Document this contract in the `NativeQ4KMatmulKernel.matmul` KDoc.
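
Risk 2 is easy to sanity-check with back-of-envelope arithmetic, assuming the ~73 GFLOPS figure from #558, treating the n² shapes as n-dim matvecs, and a pessimistic 1 µs fixed call cost:

```c
#include <assert.h>

/* FLOPs for an n-dim FP32 matvec: one multiply + one add per weight. */
double matvec_flops(int n) { return 2.0 * n * n; }

/* Fraction of total time eaten by a fixed per-call overhead, given a
 * kernel throughput in GFLOPS and an assumed call cost in microseconds. */
double overhead_fraction(int n, double gflops, double call_us) {
    double kernel_us = matvec_flops(n) / (gflops * 1e9) * 1e6;
    return call_us / (kernel_us + call_us);
}
```

At 256 the assumed 1 µs overhead is roughly a third of the runtime; at 4096 it is well under 1%, which is the case for a size-based Panama/native cutoff.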

## When to start

Trigger conditions (any one):
- A real workload demands the native ≥2.5× target (Panama Q4_K stops being fast enough on a customer machine).
- A community contributor offers a hand-tuned NEON / AVX2 Q4_K kernel that's measurably faster than Panama.
- A second M5 metric (e.g. SDPA throughput, training-loop throughput) needs hand-tuned native code.

Until then: **pause**. The Panama provider already delivers milestone-level throughput in absolute terms, and adding a native build system is a meaningful complexity tax to take on speculatively.