= Plan: Native (FFM) Kernel Provider
:description: Where the JVM Vector kernels stop, what a native priority-100 provider would look like, and when to build it.

This page is a *plan*, not shipped code. The intent is to capture
enough detail that the design doesn't drift between the time someone
decides to start the work and the moment a PR is opened. An earlier
version of this page lived briefly in `NATIVE_FFM_KERNEL_PROVIDER.md`
at the repo root and was removed on advice of "ship the release first,
keep the plan in docs"; this is its permanent home.

== Where the JVM Vector kernels run out

After the M5 milestone work landed (PRs #554–#565 across the 0.21.0
release), every CPU matmul path goes through the kernel SPI — see
xref:explanation/perf/simd-kernels.adoc[] and
xref:explanation/perf/quantized-simd-kernels.adoc[]. The Panama Vector
provider runs at:

* ~73 GFLOPS on FP32 4096² matmul (Apple Silicon NEON)
* ~73 GFLOPS on Q4_K 4096² matmul-vector (same regime; fused dequant
adds essentially zero cost on top of the FMA)

That's already in the ggml NEON ballpark in absolute terms. But
ggml's hand-tuned NEON / AVX2 still outruns the JVM Vector API on:

* dense FLOPs/cycle on shapes the Vector API can't tile-block
optimally (the 8×8×128 default is heuristic)
* AVX-512 VNNI fused INT8 dot products
* NEON `bf16` / `fp16` dot-product instructions
* future SVE / SME — none of which the Vector API exposes portably
today

A native provider closes that gap and unlocks two follow-ons that
*can't* be built on the Vector API alone:

. *M4 ↔ M5 zero-copy.* Mmap'd Q4_K weights stay as `MemorySegment`
views; a native kernel reads the same pages with no heap copy and
no staging buffer.
. *Hardware-specific lanes* unreachable from portable Vector code.

== Provider shape

[cols="1,1,1",options="header"]
|===
| Priority | Provider | Status

| 0 | `ScalarKernelProvider` | shipped (PR #554)
| 50 | `PanamaVectorKernelProvider` | shipped (PRs #557, #560 + ServiceLoader #559)
| *100* | *`NativeKernelProvider` (FFM)* | *this plan*
|===

The `KernelRegistry.bestAvailable()` cascade means: when the native
lib loads, native wins; when it doesn't (sandbox, missing arch, JDK
without FFM, kill-switch flipped), Panama wins; and on Native targets
and JS / Wasm, where neither is available, scalar wins. No code change
is needed above the registry layer.
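
In miniature, the cascade is just a priority-ordered filter. The
sketch below assumes a simplified provider SPI; the real
`KernelRegistry` / `KernelProvider` signatures may differ:

[source,kotlin]
----
// Simplified sketch of the priority cascade; the interface and function
// shown here are illustrative, not the shipped SPI.
interface KernelProvider {
    val priority: Int
    fun isAvailable(): Boolean
}

fun bestAvailable(providers: List<KernelProvider>): KernelProvider =
    providers.filter { it.isAvailable() }.maxByOrNull { it.priority }
        ?: error("no provider available (scalar should always be)")

fun main() {
    fun provider(p: Int, up: Boolean) = object : KernelProvider {
        override val priority = p
        override fun isAvailable() = up
    }
    // Native lib failed to load: the cascade falls through to Panama (50).
    val picked = bestAvailable(
        listOf(provider(0, true), provider(50, true), provider(100, false)),
    )
    println(picked.priority) // prints 50
}
----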

== Goals

. *A `NativeKernelProvider` registered at priority 100* that on JDK
21+ wins `KernelRegistry.bestAvailable()` over Panama whenever the
native lib loads successfully.
. *A first concrete kernel: native Q4_K matmul.* It must:
.. take a `MemorySegment` for both the input (FP32) and the packed Q4_K
weights (canonical ggml layout — same as `Q4_KBlockTensorData`
and `matmulF32Q4_KMemSeg`);
.. produce output numerically equivalent to
`PanamaVectorQ4KMatmulKernel` within `1e-4` relative tolerance
(the same parity bar `PanamaVectorQ4KMatmulKernelTest` uses);
.. clear *≥2.5×* over the prior Q4_K scalar dequant baseline — the
M5 success metric — on the bench shapes from
`QuantizedMatmulBench` (1024², 4096×1024, 4096²).
. *Optional follow-on kernels* — Q6_K, Q8_0, FP32 — sharing the build
system, with each shipping as a separate small PR.
. *One supported architecture for the first PR* (likely Apple
Silicon NEON, since that's the development hardware in use), with a
clear extension path for `linuxX64` AVX2 / `linuxArm64` NEON.
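
The parity bar in the second goal can be made concrete with a small
helper. This is an illustrative sketch, not the actual
`PanamaVectorQ4KMatmulKernelTest` code:

[source,kotlin]
----
import kotlin.math.abs
import kotlin.math.max

// Elementwise relative error against a reference output; the epsilon guard
// keeps near-zero reference values from blowing up the ratio.
fun maxRelError(reference: FloatArray, candidate: FloatArray): Float {
    require(reference.size == candidate.size)
    var worst = 0f
    for (i in reference.indices) {
        val denom = max(abs(reference[i]), 1e-6f)
        worst = max(worst, abs(candidate[i] - reference[i]) / denom)
    }
    return worst
}

// Sign-off condition for the native kernel, mirroring the 1e-4 bar:
//   check(maxRelError(panamaOutput, nativeOutput) <= 1e-4f)
----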

== Non-goals

* *JNI.* The roadmap explicitly says "FFM, not JNI". JNI's per-call
overhead and the global JNI lock are wrong for hot per-token
kernels; FFM (stable in Java 22, preview in Java 21) gives
near-zero-overhead native calls and a direct `MemorySegment` ABI.
* *A cross-compilation matrix on day one.* The first PR can ship just
one (host-arch) variant; CI cross-arch builds come later.
* *Replacing Panama.* Panama remains the priority-50 fallback for
environments that can't load native libs (sandboxes, Wasm, Native
targets, JDKs without `jdk.incubator.vector`).
* *Distribution via pre-built native artifacts on Maven Central.*
Out of scope for the first PR — local build only. Publishing
classifier JARs comes in a separate plan.

== Architecture

=== Module layout

[source]
----
skainet-backends/
  skainet-backend-native-cpu/                      # NEW
    src/
      jvmMain/kotlin/sk/ainet/exec/kernel/         # Kotlin side
        NativeKernelProvider.kt                    # priority=100, isAvailable()=libLoaded
        NativeQ4KMatmulKernel.kt                   # implements Q4KMatmulKernel via FFM
        NativeLibraryLoader.kt                     # locate, load, version
      jvmMain/resources/META-INF/services/
        sk.ainet.backend.api.kernel.KernelProvider # appends NativeKernelProviderFactory
      jvmTest/kotlin/sk/ainet/exec/kernel/
        NativeQ4KMatmulKernelTest.kt               # parity vs PanamaVectorQ4KMatmulKernel
    native/                                        # native source tree
      c/
        q4k_matmul.c                               # ggml-style hand-tuned kernel
        q4k_matmul.h
      CMakeLists.txt                               # or Bazel BUILD
    build.gradle.kts                               # Gradle wrapper that invokes CMake
----

The native library compiles to a shared object (`libskainet_kernels.dylib`
on macOS, `.so` on Linux, `.dll` on Windows) and is packaged into the
module's resources; at runtime it is extracted to a file and loaded
with `System.load`, since `System.loadLibrary` only searches
`java.library.path`.
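
A loading sketch, assuming the resource layout above; the
`NativeLibraryLoader` shape here (path convention, temp-file
extraction, boolean fallback signal) is illustrative, not a committed
design:

[source,kotlin]
----
import java.nio.file.Files
import java.nio.file.StandardCopyOption

object NativeLibraryLoader {
    // Pure helper: platform-specific file name for the shared library.
    fun libFileName(osName: String): String = when {
        "mac" in osName -> "libskainet_kernels.dylib"
        "win" in osName -> "skainet_kernels.dll"
        else -> "libskainet_kernels.so"
    }

    /** Returns false on any failure so the registry can fall back to Panama. */
    fun tryLoad(): Boolean = runCatching {
        val os = System.getProperty("os.name").lowercase()
        val arch = System.getProperty("os.arch").lowercase()
        val resource = "/native/$os-$arch/${libFileName(os)}"
        val stream = javaClass.getResourceAsStream(resource)
            ?: error("no packaged native lib at $resource")
        // System.loadLibrary only searches java.library.path; a lib shipped
        // inside a JAR must be extracted to a real file and System.load'ed.
        val tmp = Files.createTempFile("skainet_kernels", null)
        stream.use { Files.copy(it, tmp, StandardCopyOption.REPLACE_EXISTING) }
        System.load(tmp.toAbsolutePath().toString())
    }.isSuccess
}
----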

=== FFM binding pattern

A single C entry point per kernel:

[source,c]
----
// q4k_matmul.h
void skainet_q4k_matmul(
    const float* input,          // FP32 input vector, length input_dim
    const uint8_t* weight,       // packed Q4_K bytes (canonical ggml layout)
    int32_t weight_byte_offset,
    int32_t input_dim,
    int32_t output_dim,
    float* output,               // FP32 output, length output_dim
    int32_t output_offset
);
----

The Kotlin side:

[source,kotlin]
----
import java.lang.foreign.*
import java.lang.invoke.MethodHandle

internal object NativeQ4KMatmulKernel : Q4KMatmulKernel {
    private val handle: MethodHandle = run {
        val symbol = NativeLibraryLoader.lib.find("skainet_q4k_matmul").orElseThrow()
        Linker.nativeLinker().downcallHandle(
            symbol,
            FunctionDescriptor.ofVoid(
                ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
                ValueLayout.JAVA_INT, ValueLayout.JAVA_INT,
                ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
            ),
        )
    }

    override fun matmul(
        input: FloatArray, inputOffset: Int,
        weight: ByteArray, weightByteOffset: Int,
        inputDim: Int, outputDim: Int,
        output: FloatArray, outputOffset: Int,
    ) {
        // Heap arrays can't cross the FFM boundary directly: stage them in a
        // confined off-heap arena for the duration of the call. The preferred
        // path is a MemorySegment-input overload for mmap'd weights, which
        // avoids this copy entirely.
        Arena.ofConfined().use { arena ->
            val weightLen = weight.size - weightByteOffset
            val inSeg = arena.allocate(ValueLayout.JAVA_FLOAT, inputDim.toLong())
            val wSeg = arena.allocate(weightLen.toLong())
            val outSeg = arena.allocate(ValueLayout.JAVA_FLOAT, outputDim.toLong())
            MemorySegment.copy(input, inputOffset, inSeg, ValueLayout.JAVA_FLOAT, 0L, inputDim)
            MemorySegment.copy(weight, weightByteOffset, wSeg, ValueLayout.JAVA_BYTE, 0L, weightLen)
            handle.invoke(inSeg, wSeg, 0, inputDim, outputDim, outSeg, 0)
            MemorySegment.copy(outSeg, ValueLayout.JAVA_FLOAT, 0L, output, outputOffset, outputDim)
        }
    }
}
----

The cleaner path is to introduce a sibling `Q4KMemSegMatmulKernel`
SPI (mentioned as out of scope in PR #563) that takes `MemorySegment`
directly, and have the native provider implement *that* — no heap
copy. The `Q4KMatmulKernel` (`ByteArray`) variant can then wrap the
MemSeg one with a temporary `Arena.ofConfined()` copy where legacy
callers need it.
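
A sketch of what that sibling SPI could look like. The name
`Q4KMemSegMatmulKernel` comes from the plan above; the exact signature
below is an assumption:

[source,kotlin]
----
import java.lang.foreign.MemorySegment

// Hypothetical sibling SPI: every buffer is a MemorySegment, so mmap'd
// Q4_K weights flow to the native kernel without touching the heap.
interface Q4KMemSegMatmulKernel {
    fun matmul(
        input: MemorySegment,    // FP32, inputDim floats
        weight: MemorySegment,   // packed Q4_K bytes (canonical ggml layout)
        weightByteOffset: Long,
        inputDim: Int,
        outputDim: Int,
        output: MemorySegment,   // FP32, outputDim floats
        outputOffset: Long,
    )
}
----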

=== Build system

*Gradle + CMake* is the path of least resistance:

* A new Gradle module (or hand-rolled `Exec` tasks) invokes CMake
from the native module's `build` task.
* Native artifacts land in `build/native/<arch>/` and are copied
into `src/jvmMain/resources/native/<os>-<arch>/` so the resource-based
loader finds them.
* The Kotlin compile depends on the native artifact being built first.
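
In `build.gradle.kts` the wiring could look roughly like this; the
task names, paths, and single-configuration CMake invocation are
illustrative assumptions, not the final build script:

[source,kotlin]
----
// build.gradle.kts (sketch): configure + build via CMake, then stage the
// artifact where the resource-based loader expects to find it.
val nativeBuildDir = layout.buildDirectory.dir("native")

val cmakeConfigure by tasks.registering(Exec::class) {
    commandLine(
        "cmake", "-S", "native", "-B", nativeBuildDir.get().asFile.path,
        "-DCMAKE_BUILD_TYPE=Release",
    )
}

val cmakeBuild by tasks.registering(Exec::class) {
    dependsOn(cmakeConfigure)
    commandLine("cmake", "--build", nativeBuildDir.get().asFile.path)
}

val stageNativeLib by tasks.registering(Copy::class) {
    dependsOn(cmakeBuild)
    from(nativeBuildDir) { include("**/*.dylib", "**/*.so", "**/*.dll") }
    val os = System.getProperty("os.name").lowercase()
    val arch = System.getProperty("os.arch").lowercase()
    into(layout.projectDirectory.dir("src/jvmMain/resources/native/$os-$arch"))
}

// Resource processing (and hence the JVM compile) waits for the artifact.
tasks.named("jvmProcessResources") { dependsOn(stageNativeLib) }
----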

The xnnpack backend already in the repo
(`skainet-backends/skainet-backend-xnnpack/`) demonstrates a similar
pattern — Gradle invokes CMake to build a native lib via cinterop.
*Reuse that template* rather than reinventing.

== Staged delivery

PRs in order, each independently mergeable:

. *`skainet-backend-native-cpu` module scaffolding.* The Gradle module,
`build.gradle.kts` wired to invoke CMake, and a *trivial* C kernel
(e.g. one that just multiplies its first input by 2.0) to prove the FFM
pipeline end-to-end. A `NativeKernelProvider` whose `isAvailable()`
returns `false` until the real kernel lands. Sets up the CI artifact
path on the host arch.
. *First real native kernel: Q4_K matmul (Apple Silicon NEON).* A
hand-tuned kernel, parity tests vs `PanamaVectorQ4KMatmulKernel`, and
a JMH bench variant added to `QuantizedMatmulBench`.
. *`Q4KMemSegMatmulKernel` SPI sibling + native variant.* Closes
the M4↔M5 zero-copy story for mmap'd weights.
. *`linuxX64` AVX2 variant + cross-arch CI build.* The
cross-compilation matrix story.
. *Optional: native FP32 matmul, native Q6_K, native Q8_0.* Same
shape as PRs 2–3, one per format.

The first PR is the largest in scaffolding terms (~500–800 LoC of
build glue + 1 trivial kernel), but every subsequent PR is small and
templatable.

== Success metrics

* *PR 2 sign-off*: native Q4_K matmul on Apple Silicon clears *≥2.5×*
over the scalar Q4_K dequant-then-matmul baseline at 4096² (the M5
milestone target). For reference: Panama Q4_K SIMD already exceeds
this metric (~73 GFLOPS, see
xref:explanation/perf/quantized-simd-kernels.adoc[]), so the bar is
"beats Panama by a meaningful margin", probably ≥1.5× over Panama.
* *PR 3 sign-off*: the Q4_K MemSeg native path is faster than the
Panama Q4_K MemSeg path from PR #563, with no heap copy in the timed
region.
* *No regression in JVM-only environments* — when the native lib
fails to load (sandbox, missing arch, kill-switch), `bestAvailable()`
cleanly falls through to Panama, and existing tests / benches show
the same numbers as today.

== Risks & open questions

. *JDK 21 preview FFM vs JDK 22 stable.* FFM left preview in Java 22.
The repo currently builds on JDK 21 with `--enable-preview
--add-modules jdk.incubator.vector`. Recommendation: stay on 21
preview; flip to 22 in a separate toolchain-bump PR.
. *`MethodHandle` invocation overhead.* Even with FFM, each native
call has a small fixed cost (nanoseconds to a microsecond). For the
smallest matmul shapes (e.g. 256² FP32) this could swamp the FLOPs
win. Mitigation: route small inputs to Panama and large inputs to
native at the registry/provider level, or accept that the win is
sized for production-relevant shapes (4096²+).
. *Native code quality and maintenance.* Hand-tuned NEON / AVX2 in C
is harder to audit than Kotlin Vector API code. Mitigation: keep
kernels small (<300 LoC each), parity-test exhaustively, and prefer
porting from ggml's reference kernels (MIT-licensed, well-vetted)
over writing from scratch.
. *Distribution.* Native artifacts complicate Maven Central
publication (a `<classifier>` is needed per OS/arch pair). Not a
blocker for the first internal-use PR; a separate "publish native
classifier JARs" plan will be needed before community use.
. *Cross-arch CI cost.* Building NEON natively on Apple Silicon CI
plus AVX2 on linuxX64 plus Android NDK doubles or triples build
time. The xnnpack backend's existing CI matrix is a precedent —
reuse the same approach.
. *Native `MemorySegment` lifetime.* The Kotlin caller owns the
`Arena` for arrays it copies in. The native kernel must NOT retain
pointers past the FFM call's return. Document this contract in the
`NativeQ4KMatmulKernel.matmul` kdoc.
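
The small-shape mitigation from the second item could be a thin
routing wrapper at the provider level. Everything below (the names,
the FLOP threshold) is an illustrative assumption:

[source,kotlin]
----
// Hypothetical size-routed kernel: below a FLOP threshold the fixed FFM
// call cost dominates, so stay on Panama; above it, go native.
interface Q4KMatmulKernel {
    fun matmul(
        input: FloatArray, inputOffset: Int,
        weight: ByteArray, weightByteOffset: Int,
        inputDim: Int, outputDim: Int,
        output: FloatArray, outputOffset: Int,
    )
}

class SizeRoutedKernel(
    private val small: Q4KMatmulKernel,      // e.g. the Panama kernel
    private val large: Q4KMatmulKernel,      // e.g. the native kernel
    private val minFlops: Long = 1L shl 24,  // threshold is a guess; tune via JMH
) : Q4KMatmulKernel {
    fun routesToLarge(inputDim: Int, outputDim: Int): Boolean =
        2L * inputDim * outputDim >= minFlops  // matmul-vector is ~2*K*N FLOPs

    override fun matmul(
        input: FloatArray, inputOffset: Int,
        weight: ByteArray, weightByteOffset: Int,
        inputDim: Int, outputDim: Int,
        output: FloatArray, outputOffset: Int,
    ) {
        val kernel = if (routesToLarge(inputDim, outputDim)) large else small
        kernel.matmul(input, inputOffset, weight, weightByteOffset,
            inputDim, outputDim, output, outputOffset)
    }
}
----

Under this default a 256² shape (~131 kFLOP) stays on Panama, while
4096² (~33.5 MFLOP) goes native.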

== When to start

Trigger conditions (any one):

* A real workload demands the native ≥2.5× target (Panama Q4_K stops
being fast enough on a customer machine).
* A community contributor offers a hand-tuned NEON / AVX2 Q4_K
kernel that's measurably faster than Panama.
* A second M5 metric (e.g. SDPA throughput, training-loop
throughput) needs hand-tuned native code.

Until then: *pause.* The Panama provider is doing the
milestone-equivalent work in absolute terms, and adding a native
build system is a meaningful complexity tax to take on
speculatively.