Commit 90bcf1f

Merge pull request #567 from SKaiNET-developers/feature/docs-simd-kernels-and-arc42
docs: SIMD kernels, quantized SIMD, native FFM plan; arc42 architecture
2 parents 4a3758a + 75aa4f6 commit 90bcf1f

5 files changed

Lines changed: 1043 additions & 6 deletions

File tree

docs/modules/ROOT/nav.adoc (3 additions, 0 deletions)

----
@@ -25,5 +25,8 @@
 * xref:explanation/examples/index.adoc[Worked examples]
 ** xref:explanation/examples/matmul.adoc[Matrix multiplication examples]
 * xref:explanation/perf/jvm-cpu.adoc[JVM CPU performance]
+* xref:explanation/perf/simd-kernels.adoc[How SIMD kernels are built]
+* xref:explanation/perf/quantized-simd-kernels.adoc[How quantized SIMD kernels are built]
+* xref:explanation/perf/native-ffm-plan.adoc[Plan: native FFM kernel provider]
 * xref:explanation/perf/java-25-cpu-backend.adoc[Java 25 CPU backend notes]
 * xref:explanation/issues/native-macos-accelerate-simd.adoc[Native macOS Accelerate SIMD issues]
----
Lines changed: 278 additions & 0 deletions
@@ -0,0 +1,278 @@
= Plan: Native (FFM) Kernel Provider
:description: Where the JVM Vector kernels stop, what a native priority-100 provider would look like, and when to build it.

This page is a *plan*, not shipped code. The intent is to capture enough detail that the design doesn't drift between the time someone decides to start the work and the moment a PR is opened. The earlier content of this page lived briefly in `NATIVE_FFM_KERNEL_PROVIDER.md` at the repo root and was removed on the advice of "ship the release first, keep the plan in docs"; this is its permanent home.

== Where the JVM Vector kernels run out

After the M5 milestone work landed (PRs #554–#565 across the 0.21.0 release), every CPU matmul path goes through the kernel SPI — see xref:explanation/perf/simd-kernels.adoc[] and xref:explanation/perf/quantized-simd-kernels.adoc[]. The Panama Vector provider runs at:

* ~73 GFLOPS on FP32 4096² matmul (Apple Silicon NEON)
* ~73 GFLOPS on Q4_K 4096² matmul-vector (same regime; fused dequant adds essentially zero cost on top of the FMA)

That's already in the ggml NEON ballpark in absolute terms. But ggml's hand-tuned NEON / AVX2 still outruns the JVM Vector API on:

* dense FLOPs/cycle on shapes the Vector API can't tile-block optimally (the 8×8×128 default is heuristic)
* AVX-512 VNNI fused INT8 dot products
* NEON `bf16` / `fp16` dot-product and widening-FMA instructions (`BFDOT`, `FMLAL`)
* future SVE / SME — none of which the Vector API exposes portably today

A native provider closes that gap and unlocks two follow-ons that *can't* be built on the Vector API alone:

. *M4 ↔ M5 zero-copy.* Mmap'd Q4_K weights stay as `MemorySegment` views; a native kernel reads the same pages with no heap copy and no staging buffer.
. *Hardware-specific lanes* unreachable from portable Vector code.
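As a back-of-envelope check on the ~73 GFLOPS figure (my arithmetic, not a number from the repo's benches): a dense N×N matmul costs 2·N³ FLOPs, so at 4096² one matmul is ~137 GFLOP and takes just under two seconds at that rate.

```java
// Back-of-envelope check on the ~73 GFLOPS headline for a 4096^2 FP32 matmul.
public class MatmulFlops {
    public static void main(String[] args) {
        long n = 4096;
        long flops = 2L * n * n * n;       // one multiply-add = 2 FLOPs per inner-loop step
        double gflop = flops / 1e9;        // total work in GFLOP
        double seconds = flops / 73e9;     // wall time at 73 GFLOPS
        System.out.printf("%.1f GFLOP, %.2f s per matmul%n", gflop, seconds);
        // prints: 137.4 GFLOP, 1.88 s per matmul
    }
}
```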
== Provider shape

[cols="1,1,1",options="header"]
|===
| Priority | Provider | Status

| 0 | `ScalarKernelProvider` | shipped (PR #554)
| 50 | `PanamaVectorKernelProvider` | shipped (PRs #557, #560 + ServiceLoader #559)
| *100* | *`NativeKernelProvider` (FFM)* | *this plan*
|===

The `KernelRegistry.bestAvailable()` cascade means: when the native lib loads, native wins; when it doesn't (sandbox, missing arch, JDK without FFM, kill-switch flipped), Panama wins; on Native targets and JS / Wasm where neither is available, scalar wins. No code change above the registry layer.
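The cascade amounts to a single priority scan over the registered providers. The `Provider` shape below is purely illustrative, sketched in Java; the repo's actual `KernelProvider` SPI and `KernelRegistry` differ in detail.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative sketch of the bestAvailable() cascade; the real
// KernelProvider/KernelRegistry signatures in the repo may differ.
public class Cascade {
    record Provider(String name, int priority, boolean available) {}

    static Optional<Provider> bestAvailable(List<Provider> providers) {
        // Highest-priority provider that reports itself available wins.
        return providers.stream()
                .filter(Provider::available)
                .max(Comparator.comparingInt(Provider::priority));
    }

    public static void main(String[] args) {
        var providers = List.of(
                new Provider("scalar", 0, true),          // always available
                new Provider("panama", 50, true),         // Vector API present
                new Provider("native-ffm", 100, false));  // lib failed to load
        // Native is registered but unavailable, so the cascade falls to Panama.
        System.out.println(bestAvailable(providers).orElseThrow().name());
        // prints: panama
    }
}
```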
== Goals

. *A `NativeKernelProvider` registered at priority 100* that on JDK 21+ wins `KernelRegistry.bestAvailable()` over Panama whenever the native lib loads successfully.
. *A first concrete kernel: native Q4_K matmul.* It must:
.. take a `MemorySegment` for both input (FP32) and packed Q4_K weights (canonical ggml layout — same as `Q4_KBlockTensorData` and `matmulF32Q4_KMemSeg`);
.. produce numerically equivalent output to `PanamaVectorQ4KMatmulKernel` within `1e-4` relative tolerance (the same parity bar `PanamaVectorQ4KMatmulKernelTest` uses);
.. clear *≥2.5×* over the prior Q4_K scalar dequant baseline — the M5 success metric — on the bench shapes from `QuantizedMatmulBench` (1024², 4096×1024, 4096²).
. *Optional follow-on kernels* — Q6_K, Q8_0, FP32 — share the build system but each ships as a separate small PR.
. *One supported architecture for the first PR* (likely Apple Silicon NEON, since that's the development hardware in use), with a clear extension path for `linuxX64` AVX2 / `linuxArm64` NEON.
== Non-goals

* *JNI.* The roadmap explicitly says "FFM not JNI". JNI's per-call overhead and its locking behavior are wrong for hot per-token kernels; FFM (stable in Java 22, preview in Java 21) gives near-zero-overhead native calls and a direct `MemorySegment` ABI.
* *A cross-compilation matrix on day one.* The first PR can ship just one (host-arch) variant; CI cross-arch builds come later.
* *Replacing Panama.* Panama remains the priority-50 fallback for environments that can't load native libs (sandboxes, Wasm, Native targets, JDKs without `jdk.incubator.vector`).
* *Distribution via pre-built native artifacts on Maven Central.* Out of scope for the first PR — local build only. Publishing classifier JARs comes in a separate plan.
== Architecture

=== Module layout

[source]
----
skainet-backends/
  skainet-backend-native-cpu/                    # NEW
    src/
      jvmMain/kotlin/sk/ainet/exec/kernel/       # Kotlin side
        NativeKernelProvider.kt                  # priority=100, isAvailable()=libLoaded
        NativeQ4KMatmulKernel.kt                 # implements Q4KMatmulKernel via FFM
        NativeLibraryLoader.kt                   # System.loadLibrary, locate, version
      jvmMain/resources/META-INF/services/
        sk.ainet.backend.api.kernel.KernelProvider   # appends NativeKernelProviderFactory
      jvmTest/kotlin/sk/ainet/exec/kernel/
        NativeQ4KMatmulKernelTest.kt             # parity vs PanamaVectorQ4KMatmulKernel
    native/                                      # native source tree
      c/
        q4k_matmul.c                             # ggml-style hand-tuned kernel
        q4k_matmul.h
      CMakeLists.txt                             # or Bazel BUILD
    build.gradle.kts                             # Gradle wrapper that invokes CMake
----
The native library compiles to a shared object (`libskainet_kernels.dylib` on macOS, `.so` on Linux, `.dll` on Windows) and is packaged into the module's resources; `NativeLibraryLoader` locates it there and hands it to `System.load` at first use.
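The resource-based packaging above implies an extract-and-load step at runtime. A sketch of that step, in Java for illustration; the class and method names here are mine, not the repo's `NativeLibraryLoader` API, which per the layout above also tracks versions:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the lookup-and-extract step behind the library
// loader; any failure simply means the provider reports isAvailable()=false.
public class LoaderSketch {
    // Map OS name + arch to the bundled resource path,
    // e.g. "native/macos-aarch64/libskainet_kernels.dylib".
    static String resourcePathFor(String osName, String arch) {
        String lower = osName.toLowerCase();
        String os = lower.contains("mac") ? "macos" : lower.contains("win") ? "windows" : "linux";
        String prefix = os.equals("windows") ? "" : "lib";
        String ext = os.equals("macos") ? "dylib" : os.equals("windows") ? "dll" : "so";
        return "native/" + os + "-" + arch + "/" + prefix + "skainet_kernels." + ext;
    }

    // Extract the library from the JAR to a temp file and System.load it.
    static boolean tryLoad() {
        String path = resourcePathFor(System.getProperty("os.name"), System.getProperty("os.arch"));
        try (InputStream in = LoaderSketch.class.getClassLoader().getResourceAsStream(path)) {
            if (in == null) return false;               // no lib bundled for this OS/arch
            Path tmp = Files.createTempFile("skainet_kernels", null);
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.toAbsolutePath().toString());
            return true;
        } catch (IOException | UnsatisfiedLinkError e) {
            return false;                               // fall back to Panama
        }
    }

    public static void main(String[] args) {
        System.out.println(resourcePathFor("Mac OS X", "aarch64"));
        // prints: native/macos-aarch64/libskainet_kernels.dylib
    }
}
```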
=== FFM binding pattern

Single C entry point per kernel:

[source,c]
----
// q4k_matmul.h
void skainet_q4k_matmul(
    const float* input,          // FP32 input vector, length input_dim
    const uint8_t* weight,       // packed Q4_K bytes (canonical ggml layout)
    int32_t weight_byte_offset,
    int32_t input_dim,
    int32_t output_dim,
    float* output,               // FP32 output, length output_dim
    int32_t output_offset
);
----
Kotlin side:

[source,kotlin]
----
internal object NativeQ4KMatmulKernel : Q4KMatmulKernel {
    private val handle: MethodHandle = run {
        val symbol = NativeLibraryLoader.lib.find("skainet_q4k_matmul").orElseThrow()
        Linker.nativeLinker().downcallHandle(
            symbol,
            FunctionDescriptor.ofVoid(
                ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
                ValueLayout.JAVA_INT, ValueLayout.JAVA_INT,
                ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
            ),
        )
    }

    override fun matmul(
        input: FloatArray, inputOffset: Int,
        weight: ByteArray, weightByteOffset: Int,
        inputDim: Int, outputDim: Int,
        output: FloatArray, outputOffset: Int,
    ) {
        // Heap arrays can't cross the FFM boundary directly: stage them in a
        // confined off-heap arena for the duration of the call.
        Arena.ofConfined().use { arena ->
            val inSeg = arena.allocate(ValueLayout.JAVA_FLOAT, inputDim.toLong())
            MemorySegment.copy(input, inputOffset, inSeg, ValueLayout.JAVA_FLOAT, 0L, inputDim)
            val wSeg = arena.allocate(weight.size.toLong())
            MemorySegment.copy(weight, 0, wSeg, ValueLayout.JAVA_BYTE, 0L, weight.size)
            val outSeg = arena.allocate(ValueLayout.JAVA_FLOAT, outputDim.toLong())
            handle.invokeExact(inSeg, wSeg, weightByteOffset, inputDim, outputDim, outSeg, 0)
            // Copy the result back into the caller's heap array.
            MemorySegment.copy(outSeg, ValueLayout.JAVA_FLOAT, 0L, output, outputOffset, outputDim)
        }
        // Preferred alternative: a MemorySegment-input overload for mmap'd
        // weights, which avoids all three staging copies.
    }
}
----
The cleaner path is to introduce a sibling `Q4KMemSegMatmulKernel` SPI (mentioned as out-of-scope in PR #563) that takes `MemorySegment` directly, and have the native provider implement *that* — no heap copy. The `Q4KMatmulKernel` (`ByteArray`) variant can wrap the MemSeg one with a temporary `Arena.ofConfined()` copy if needed for legacy callers.
=== Build system

*Gradle + CMake* is the path of least resistance:

* A new Gradle module (or hand-rolled `Exec` tasks) invokes CMake from the native module's `build` task.
* Native artifacts land in `build/native/<arch>/` and are copied into `src/jvmMain/resources/native/<os>-<arch>/` so the resource-based loader finds them.
* The Kotlin compile depends on the native artifact being built first.

The xnnpack backend already in the repo (`skainet-backends/skainet-backend-xnnpack/`) demonstrates a similar pattern — Gradle invokes CMake to build a native lib via cinterop. *Reuse that template* rather than reinventing.
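A minimal sketch of what that Gradle wiring could look like, assuming a `cmake` binary on `PATH`; the task names, the hard-coded host-arch directory, and the `jvmProcessResources` hook are all assumptions of mine, and the repo's xnnpack template should win over this sketch wherever they disagree.

```kotlin
// build.gradle.kts (illustrative sketch, not the repo's actual build wiring)

// 1. Configure + build the native tree with CMake.
val cmakeConfigure by tasks.registering(Exec::class) {
    commandLine("cmake", "-S", "native", "-B", "build/native", "-DCMAKE_BUILD_TYPE=Release")
}
val cmakeBuild by tasks.registering(Exec::class) {
    dependsOn(cmakeConfigure)
    commandLine("cmake", "--build", "build/native")
}

// 2. Copy the artifact to where the resource-based loader expects it.
val copyNativeLib by tasks.registering(Copy::class) {
    dependsOn(cmakeBuild)
    from("build/native") { include("*skainet_kernels.*") }
    into("src/jvmMain/resources/native/macos-aarch64")  // host arch only for PR 1
}

// 3. Make the JVM resources (and thus the Kotlin build) wait for the lib.
tasks.named("jvmProcessResources") { dependsOn(copyNativeLib) }
```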
== Staged delivery

PRs in order, each independently mergeable:

. *`skainet-backend-native-cpu` module scaffolding.* Gradle module, `build.gradle.kts` wired to invoke CMake, and a *trivial* C kernel (e.g. one that just multiplies its input by 2.0) to prove the FFM pipeline end-to-end. A `NativeKernelProvider` that reports `isAvailable() = false` until the real kernel lands. Sets up the CI artifact path on the host arch.
. *First real native kernel: Q4_K matmul (Apple Silicon NEON).* Hand-tuned kernel, parity tests vs `PanamaVectorQ4KMatmulKernel`, and a JMH bench variant added to `QuantizedMatmulBench`.
. *`Q4KMemSegMatmulKernel` SPI sibling + native variant.* Closes the M4↔M5 zero-copy story for mmap'd weights.
. *`linuxX64` AVX2 variant + cross-arch CI build.* The cross-compilation matrix story.
. *Optional: native FP32 matmul, native Q6_K, native Q8_0.* Same shape as PRs 2–3, one per format.

The first PR is the largest in scaffolding terms (~500–800 LoC of build glue + 1 trivial kernel), but every subsequent PR is small and template-able.
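The "trivial C kernel" in PR 1 could be as small as the following; the symbol name `skainet_smoke_scale2` is a placeholder of mine, not a decision the scaffolding PR has made:

```c
#include <stdint.h>

// Hypothetical smoke-test kernel for PR 1: no SIMD, no interesting math,
// just enough to prove the FFM downcall and memory passing end-to-end.
void skainet_smoke_scale2(const float* input, float* output, int32_t n) {
    for (int32_t i = 0; i < n; i++) {
        output[i] = input[i] * 2.0f;
    }
}
```

A Kotlin-side parity test against this kernel exercises the entire downcall, staging, and copy-back path before any hand-tuned NEON exists.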
== Success metrics

* *PR 2 sign-off*: native Q4_K matmul on Apple Silicon clears *≥2.5×* over the scalar Q4_K dequant-then-matmul baseline at 4096² (the M5 milestone target). For reference: Panama Q4_K SIMD already exceeds this metric (~73 GFLOPS, see xref:explanation/perf/quantized-simd-kernels.adoc[]), so the bar is "beats Panama by a meaningful margin", probably ≥1.5× over Panama.
* *PR 3 sign-off*: the Q4_K MemSeg native path is faster than the Panama Q4_K MemSeg path from PR #563, with no heap copy in the timed region.
* *No regression on JVM-only environments*: when the native lib fails to load (sandbox, missing arch, kill-switch), `bestAvailable()` cleanly falls through to Panama, and existing tests / benches show the same numbers as today.
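The `1e-4` relative-tolerance parity bar from the Goals section amounts to a max-relative-error check over the output lanes. A sketch in Java; the helper names are mine, not the repo's test API:

```java
// Sketch of the numeric parity bar: the worst relative error between the
// reference (Panama) output and the candidate (native) output must be <= 1e-4.
public class ParityCheck {
    static double maxRelativeError(float[] expected, float[] actual) {
        double worst = 0.0;
        for (int i = 0; i < expected.length; i++) {
            // Guard the denominator so exact-zero reference lanes don't divide by 0.
            double denom = Math.max(Math.abs((double) expected[i]), 1e-12);
            worst = Math.max(worst, Math.abs((double) expected[i] - actual[i]) / denom);
        }
        return worst;
    }

    static boolean withinParity(float[] expected, float[] actual) {
        return maxRelativeError(expected, actual) <= 1e-4;
    }

    public static void main(String[] args) {
        float[] ref = {1.0f, -250.0f, 0.5f};
        float[] ok  = {1.00005f, -250.01f, 0.5f};  // worst relative error ~5e-5
        float[] bad = {1.01f, -250.0f, 0.5f};      // relative error 1e-2 on lane 0
        System.out.println(withinParity(ref, ok) + " " + withinParity(ref, bad));
        // prints: true false
    }
}
```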
== Risks & open questions

. *JDK 21 preview FFM vs JDK 22 stable.* FFM left preview in Java 22. The repo currently builds on JDK 21 with `--enable-preview --add-modules jdk.incubator.vector`. Recommendation: stay on 21 preview; flip to 22 in a separate toolchain-bump PR.
. *`MethodHandle` invocation overhead.* Even with FFM, each native call has a small fixed cost. For the smallest matmul shapes (e.g. 256² FP32) this could swamp the FLOPs win. Mitigation: route small inputs to Panama and large inputs to native at the registry/provider level, or accept that the win is sized for production-relevant shapes (4096²+).
. *Native code quality and maintenance.* Hand-tuned NEON / AVX2 in C is harder to audit than Kotlin Vector API code. Mitigation: keep kernels small (<300 LoC each), parity-test exhaustively, and prefer porting from ggml's reference (MIT-licensed, well-vetted) over writing from scratch.
. *Distribution.* Native artifacts complicate Maven Central publication (they need a `<classifier>` per OS/arch). Not a blocker for the first internal-use PR; a separate "publish native classifier JARs" plan will be needed before community use.
. *Cross-arch CI cost.* Building NEON natively on Apple Silicon CI, plus AVX2 on linuxX64, plus the Android NDK doubles or triples build time. The xnnpack backend's existing CI matrix is a precedent — reuse the same approach.
. *Native `MemorySegment` lifetime.* The Kotlin caller owns the `Arena` for arrays it copies in. The native kernel must NOT retain pointers past the FFM call return. Document this contract in the `NativeQ4KMatmulKernel.matmul` kdoc.
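The size-routing mitigation from the invocation-overhead risk is a one-line dispatch once a crossover point is known. A sketch in Java; the threshold below is illustrative, and a real value would come from `QuantizedMatmulBench` measurements:

```java
// Sketch of the size-routing mitigation: below some measured crossover,
// per-call FFM overhead outweighs the native kernel's FLOPs advantage,
// so small shapes stay on the Panama path. The threshold is illustrative.
public class SizeRouter {
    static final long NATIVE_MIN_FLOPS = 1L << 24;   // hypothetical crossover, ~16.8 MFLOP

    enum Path { PANAMA, NATIVE }

    static Path choose(int inputDim, int outputDim, boolean nativeLoaded) {
        if (!nativeLoaded) return Path.PANAMA;       // cascade fallback
        long flops = 2L * inputDim * outputDim;      // matmul-vector cost estimate
        return flops >= NATIVE_MIN_FLOPS ? Path.NATIVE : Path.PANAMA;
    }

    public static void main(String[] args) {
        System.out.println(choose(256, 256, true));     // small shape stays on Panama
        System.out.println(choose(4096, 4096, true));   // large shape goes native
    }
}
```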
== When to start

Trigger conditions (any one):

* A real workload demands the native ≥2.5× target (Panama Q4_K stops being fast enough on a customer machine).
* A community contributor offers a hand-tuned NEON / AVX2 Q4_K kernel that's measurably faster than Panama.
* A second M5 metric (e.g. SDPA throughput, training-loop throughput) needs hand-tuned native code.

Until then: *pause.* The Panama provider is doing the milestone-equivalent work in absolute terms, and adding a native build system is a meaningful complexity tax to take on speculatively.
