
Commit 61962de

michalharakal and claude committed
chore(release): prepare 0.21.0

- gradle.properties: drop -SNAPSHOT; RELEASE_SIGNING_ENABLED stays true.
- CHANGELOG: add 0.21.0 section covering the JVM Vector half of M5 (kernel SPI + Panama FP32 + tile-blocking + production routing + ServiceLoader auto-discovery + Q4_K SIMD + sibling SPI + Q4_K MemSeg + Q6_K SIMD + Q4_0 partial SIMD), plus ScratchPool SPI, TensorOps.permute, and Q4_K/Q5_K canonical layout fix.
- README: bump Quickstart coordinates to 0.21.0; compact "What's New" section.
- NATIVE_FFM_KERNEL_PROVIDER.md: PRD for the deferred priority-100 native FFM kernel provider — module layout, FFM binding pattern, staged delivery plan, success metrics, risks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent d48f172 commit 61962de

4 files changed

Lines changed: 214 additions & 9 deletions

File tree

CHANGELOG.md

Lines changed: 25 additions & 0 deletions
@@ -2,6 +2,31 @@
## [Unreleased]

## [0.21.0] - 2026-04-28

### Added

#### CPU kernel SPI (M5 — JVM Vector half complete)

This release lands the JVM Vector half of milestone M5 from the JVM inference performance roadmap — a pluggable kernel SPI parallel to `BackendProvider`, plus a Panama Vector provider that matches or beats the prior production path on every shape we measure. The native (FFM) priority-100 provider that closes the milestone metric is captured as a PRD ([NATIVE_FFM_KERNEL_PROVIDER.md](NATIVE_FFM_KERNEL_PROVIDER.md)) and deferred.

- **`KernelProvider` SPI** — `skainet-backend-api` now exposes a `KernelProvider` interface with `name`, `priority`, `isAvailable()`, and per-kernel accessors (`matmulFp32()`, `matmulQ4K()`). `KernelRegistry` does priority-ordered `bestAvailable()` lookup; a JVM-only `KernelServiceLoader.installAll()` auto-discovers providers via `META-INF/services/sk.ainet.backend.api.kernel.KernelProvider`. Manual `register(...)` still works for tests and non-JVM platforms. (PRs #554, #559)
- **`Fp32MatmulKernel` + `PanamaVectorMatmulKernel`** — JDK Vector API implementation using `FloatVector.SPECIES_PREFERRED` + `fma` + `reduceLanes`, cache-blocked with 8×8×128 tiles. `KernelMatmulBench` measures **8.61× / 8.62× / 10.83×** speedups over scalar at sizes 256/512/1024 (JDK 21.0.10, M-series macOS). Within JMH noise of — and often slightly faster than — the prior `JvmVectorKernels.matmulFloatBlocked` production path, so routing introduced no regression. (PRs #557, #558, #560)
- **Production matmul routes through `KernelRegistry`** — `DefaultCpuOpsJvm.matmul` now resolves the FP32 kernel via `KernelRegistry.bestAvailable()` instead of calling `JvmVectorKernels.matmulFloat*` directly. Production `MatmulBench` numbers post-routing match pre-routing within JMH noise. (PR #561)
- **`Q4KMatmulKernel` SPI + SIMD-fused Panama implementation** — Sibling kernel interface in `skainet-backend-api/commonMain`, with a `KernelProvider.matmulQ4K()` accessor (default `null` for backward compatibility). `PanamaVectorQ4KMatmulKernel` fuses Q4_K dequant inline with the FMA accumulator: a single `ByteVector` load feeds both the lo and hi sub-block accumulators per qs slab via AND/LSHR nibble extract → `castShape(B2F)` → FMA, with the lazy-`dmin` correction (`acc += scale·codeSum − offset·inputSum` once per sub-block). `QuantizedMatmulBench` measures 0.07/0.15/0.46 ms at 1024×1024 / 4096×1024 / 4096×4096 (≈30/55/73 GFLOPS — the same throughput regime as the FP32 SIMD kernel, meaning fused dequant adds essentially zero cost on top of the FMA). `DefaultCpuOpsJvm.chooseQuantizedMatmul`'s `Q4_KTensorData` branch routes through the SPI, falling through to the legacy kernel when no provider resolves. (PR #562)
- **Q4_K MemSeg SIMD** — The same fused-pipeline algorithm applied inline to `JvmQuantizedVectorKernels.matmulF32Q4_KMemSeg` (the path mmap'd weights take), using `ByteVector.fromMemorySegment` instead of `ByteVector.fromArray` — no heap copy. (PR #563)
- **Q6_K SIMD dequant** — `dequantQ6_KBlock` replaces its scalar 32-iteration loop with a `ByteVector`-based ql + qh extraction pipeline: per `floatStep`-wide chunk of `l`, it loads ql + qh slices, assembles `q1..q4 = (ql nibble) | ((qh slice) << 4) − 32` per lane, multiplies by the per-sub-block `d·scale`, and stores to four 32-element regions of the scratch FloatArray. (PR #564)
- **Q4_0 partial SIMD** — `dotQ4_0BlockMemSeg` uses a two-stage pattern: a scalar byte-pair unpack into a caller-supplied scratch FloatArray (16 byte loads, two nibbles each — half the byte traffic) followed by a `FloatVector` FMA reduction. This closes the last fully-scalar quantized kernel; every quantized format in `JvmQuantizedVectorKernels` (Q4_0, Q4_K, Q4_K MemSeg, Q6_K, Q8_0) is now SIMD'd to some degree. (PR #565)
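The lazy-`dmin` correction in the Q4_K entry above relies on factoring the constant offset out of the dot product, so the dequant never has to materialize `scale·q − min` per element. A scalar Java sketch of the nibble extract and that identity (illustrative values only; this is not the production Panama kernel):

```java
public class Q4KSketch {
    public static void main(String[] args) {
        // Nibble extract: one packed byte carries a lo and a hi 4-bit code.
        int packed = 0xA7;               // example byte: 1010 0111
        int lo = packed & 0x0F;          // 0111 -> 7
        int hi = (packed >>> 4) & 0x0F;  // 1010 -> 10
        assert lo == 7 && hi == 10;

        // Lazy-dmin identity: sum_i x_i * (scale*q_i - min)
        //                  == scale * sum_i (x_i * q_i)  -  min * sum_i x_i
        // All constants below are exactly representable floats, so the two
        // sides match exactly.
        float scale = 0.5f, min = 0.25f;
        float[] x = {1.0f, -2.0f, 0.5f, 3.0f};
        int[] q = {7, 10, 3, 15};
        float direct = 0f, codeDot = 0f, xSum = 0f;
        for (int i = 0; i < x.length; i++) {
            direct += x[i] * (scale * q[i] - min);  // naive per-element dequant
            codeDot += x[i] * q[i];                 // what the SIMD loop accumulates
            xSum += x[i];                           // precomputed once per sub-block
        }
        float factored = scale * codeDot - min * xSum;
        assert Math.abs(direct - factored) < 1e-5f; // needs -ea to check
    }
}
```

The factored form is what lets the SIMD loop accumulate raw codes with FMA and apply the `min` correction once per sub-block instead of per lane.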
#### Other

- **`ScratchPool` SPI** — Runtime workspace allocation for transient tensor scratch buffers: per-runtime size-classed slabs with scoped acquire/release. Closes the framework-side primitive for milestone M1 of the JVM perf roadmap. (PR #550)
- **`TensorOps.permute(axes)`** — Arbitrary-axis permutation (generalizes the existing `transpose` to N-D). (PR #552)
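The shape bookkeeping behind an arbitrary-axis permute can be sketched in a few lines of Java (a hypothetical helper for illustration; the actual SKaiNET API is Kotlin and also remaps strides/data):

```java
import java.util.Arrays;

public class PermuteSketch {
    // Reorder a shape by the given axis permutation: out[i] = shape[axes[i]].
    static int[] permuteShape(int[] shape, int[] axes) {
        int[] out = new int[shape.length];
        for (int i = 0; i < axes.length; i++) out[i] = shape[axes[i]];
        return out;
    }

    public static void main(String[] args) {
        int[] shape = {2, 3, 4};
        // axes = {2, 0, 1} moves the last axis to the front.
        int[] permuted = permuteShape(shape, new int[] {2, 0, 1});
        assert Arrays.equals(permuted, new int[] {4, 2, 3});
        // The 2-D case {1, 0} reduces to the existing transpose.
        assert Arrays.equals(
                permuteShape(new int[] {3, 5}, new int[] {1, 0}),
                new int[] {5, 3});
        System.out.println(Arrays.toString(permuted)); // [4, 2, 3]
    }
}
```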
### Fixed

- **Q4_K / Q5_K canonical ggml layout + FP32 MemSeg arena leak** — `Q4_KTensorData` and Q5_K dequant now apply the canonical ggml layout (super-block scale + per-sub-block scaleIdx/minIdx via `get_scale_min_k4` mixing, strided 4-bit codes layout). `MemorySegmentTensorDataFactory` uses `Arena.ofAuto()` for per-op outputs so the matmul / transpose output segments are GC-reclaimable; the prior `ofConfined()` builds leaked tens of MB per matmul, which over a 35-layer Gemma 4 forward pass exhausted the JVM direct-memory cap. Intermediate tensors in `ComputeGraphExecutor` are now freed based on liveness. (PR #556)
## [0.20.0] - 2026-04-24

### Added

NATIVE_FFM_KERNEL_PROVIDER.md

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
# PRD — Native (FFM) Kernel Provider

**Status:** Deferred (post-0.21.0). Captured here so the design doesn't drift.
**Owner:** unassigned.
**Milestone:** M5 (CPU backend dispatch) — final piece. The roadmap's M5 success metric `native ≥2.5× for Q4_K` requires this provider; the JVM Vector half (PRs #554, #557, #560, #562, #563, #564) closed the Panama story but not the native one.

## Context

The kernel SPI shipped in PR #554 (`KernelProvider`, `Fp32MatmulKernel`, `KernelRegistry`) was designed to host **three** providers, ordered by priority:

| priority | provider | status |
|---------:|---|---|
| 0 | `ScalarKernelProvider` | shipped (#554) |
| 50 | `PanamaVectorKernelProvider` | shipped (#557, plus tile-blocking #560, ServiceLoader #559) |
| 100 | `NativeKernelProvider` (FFM) | **this PRD** |

The Panama provider runs the FP32 matmul at ~73 GFLOPS for square 4096² shapes on Apple Silicon (per #558's JMH bench), and the Q4_K SIMD kernel runs in the same throughput regime (#562 numbers). That is already in the ggml NEON ballpark — but ggml's hand-tuned NEON / AVX2 still outruns Panama on dense per-cycle FLOPs, and on Q4_K specifically, where 4-bit nibble unpacking maps cleanly to dedicated SIMD shuffles that the Vector API can't always emit.
A native provider closes that gap and unlocks two follow-ons:

- **M4 ↔ M5 synergy.** Mmap'd Q4_K weights stay as `MemorySegment` views; a native kernel reads the same pages with zero copy via FFI. No staging buffer, no `ByteArray` round-trip.
- **Hardware-specific lanes.** AVX-512 VNNI fused INT8 dot products, NEON `bf16`/`fp16` SDOT instructions, future SVE — the Vector API exposes none of these portably today.
## Goals

1. **A `NativeKernelProvider` registered at priority 100** that, on JDK 21+, wins `KernelRegistry.bestAvailable()` over Panama whenever the native lib loads successfully.
2. **A first concrete kernel: native Q4_K matmul.** It must:
   - take a `MemorySegment` for both the input (FP32) and the packed Q4_K weights (canonical ggml layout — the same as `Q4_KBlockTensorData` and the existing `matmulF32Q4_KMemSeg`);
   - produce numerically equivalent output to `PanamaVectorQ4KMatmulKernel` within `1e-4` relative tolerance (the same parity bar `PanamaVectorQ4KMatmulKernelTest` uses);
   - clear **≥2.5× over the prior Q4_K scalar dequant baseline** on the bench shapes from `QuantizedMatmulBench` (1024², 4096×1024, 4096²).
3. **Optional follow-on kernels** — Q6_K, Q8_0, FP32 matmul — share the build system but each ship as a separate small PR.
4. **One supported architecture for the first PR** (likely Apple Silicon NEON, since that's the development hardware in use), with a clear extension path for `linuxX64` AVX2 / `linuxArm64` NEON.
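Goal 2's parity bar amounts to a per-element relative-error check. A hedged Java sketch of what such a check typically looks like (the helper name and the absolute floor are assumptions for illustration, not the actual `PanamaVectorQ4KMatmulKernelTest` code):

```java
public class ParitySketch {
    // Relative error with an absolute floor so near-zero expected values
    // don't blow up the denominator.
    static boolean withinRelTol(float expected, float actual, float relTol) {
        float denom = Math.max(Math.abs(expected), 1e-6f);
        return Math.abs(expected - actual) / denom <= relTol;
    }

    public static void main(String[] args) {
        // 5e-5 relative error: inside the 1e-4 bar.
        assert withinRelTol(100.0f, 100.005f, 1e-4f);
        // 5e-4 relative error: outside the bar, must fail.
        assert !withinRelTol(100.0f, 100.05f, 1e-4f);
    }
}
```

Applied element-wise over the native and Panama output vectors, this is the sign-off gate for the kernel's numerics.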
## Non-goals

- **JNI.** The roadmap explicitly says "FFM, not JNI". JNI's per-call overhead and the global JNI lock are wrong for hot per-token kernels; FFM (stable in Java 22, preview in Java 21) gives near-zero-overhead native calls and a direct `MemorySegment` ABI.
- **Cross-compilation matrix on day one.** The first PR can ship just one (host-arch) variant; CI cross-arch builds come later.
- **Replacing Panama.** Panama remains the priority-50 fallback for environments that can't load native libs (sandboxes, Wasm, Native targets, JDKs without `jdk.incubator.vector`).
- **Distribution via pre-built native artifacts on Maven Central.** Out of scope for the first PR — local build only. A separate "publish native classifier JARs" PRD comes later.
## Architecture

### Module layout

```
skainet-backends/
  skainet-backend-native-cpu/                      # NEW
    src/
      jvmMain/kotlin/sk/ainet/exec/kernel/         # Kotlin side
        NativeKernelProvider.kt                    # priority=100, isAvailable() = libLoaded
        NativeQ4KMatmulKernel.kt                   # implements Q4KMatmulKernel, calls FFM
        NativeLibraryLoader.kt                     # loadLibrary, locate, check API version
      jvmMain/resources/META-INF/services/
        sk.ainet.backend.api.kernel.KernelProvider # appends NativeKernelProviderFactory
      jvmTest/kotlin/sk/ainet/exec/kernel/
        NativeQ4KMatmulKernelTest.kt               # parity vs PanamaVectorQ4KMatmulKernel
    native/                                        # native source tree
      c/
        q4k_matmul.c                               # ggml-style hand-tuned kernel
        q4k_matmul.h
      CMakeLists.txt                               # or Bazel BUILD
    build.gradle.kts                               # Gradle wrapper that invokes CMake
```

The native library compiles to a shared object (`libskainet_kernels.dylib` on macOS, `.so` on Linux, `.dll` on Windows) and is packaged into the module's resources for `System.loadLibrary` discovery.
### FFM binding pattern

Single C entry point per kernel:

```c
// q4k_matmul.h
void skainet_q4k_matmul(
    const float* input,     // FP32 input vector, length input_dim
    const uint8_t* weight,  // packed Q4_K bytes (canonical ggml layout)
    int32_t weight_byte_offset,
    int32_t input_dim,
    int32_t output_dim,
    float* output,          // FP32 output, length output_dim
    int32_t output_offset
);
```

Kotlin side:

```kotlin
internal object NativeQ4KMatmulKernel : Q4KMatmulKernel {
    private val handle: MethodHandle = run {
        val symbol = NativeLibraryLoader.lib.find("skainet_q4k_matmul").orElseThrow()
        Linker.nativeLinker().downcallHandle(
            symbol,
            FunctionDescriptor.ofVoid(
                ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
                ValueLayout.JAVA_INT, ValueLayout.JAVA_INT,
                ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
            ),
        )
    }

    override fun matmul(
        input: FloatArray, inputOffset: Int,
        weight: ByteArray, weightByteOffset: Int,
        inputDim: Int, outputDim: Int,
        output: FloatArray, outputOffset: Int,
    ) {
        // Heap arrays: pass via a temporary off-heap MemorySegment + bulk copy,
        // OR (preferred) add a MemorySegment-input overload for mmap'd
        // weights to avoid the copy.
        ...
    }
}
```

The cleaner path is to introduce a sibling **`Q4KMemSegMatmulKernel`** SPI (mentioned as out-of-scope in #563) that takes `MemorySegment` directly, and have the native provider implement *that* — no heap copy. The `Q4KMatmulKernel` (ByteArray) variant can wrap the MemSeg one with a temporary `Arena.ofConfined()` copy if needed for legacy callers.
### Build system

**Gradle + CMake** is the path of least resistance:

- A new Gradle plugin (or hand-rolled `Exec` tasks) invokes CMake from the native module's `build` task.
- Native artifacts land in `build/native/<arch>/` and are copied into `src/jvmMain/resources/native/<os>-<arch>/` so `System.loadLibrary` finds them.
- The Kotlin compile depends on the native artifact being built first.

The `xnnpack` backend already in the repo (`skainet-backends/skainet-backend-xnnpack/`) demonstrates a similar pattern — Gradle invokes CMake to build a native lib via cinterop. **Reuse that template** rather than reinventing it.

**Architecture detection**: at native-module build time, query the host arch and build only for it (first-PR scope). A CI cross-arch matrix follows.
### Provider class

```kotlin
public object NativeKernelProvider : KernelProvider {
    override val name: String = "native-ffm"
    override val priority: Int = 100

    private val available: Boolean by lazy {
        runCatching { NativeLibraryLoader.load() }.isSuccess
    }

    override fun isAvailable(): Boolean = available

    override fun matmulFp32(): Fp32MatmulKernel? = null // future PR
    override fun matmulQ4K(): Q4KMatmulKernel? =
        if (isAvailable()) NativeQ4KMatmulKernel else null
}
```

Registered via the existing ServiceLoader mechanism (`META-INF/services/sk.ainet.backend.api.kernel.KernelProvider`; a factory wrapper class with a no-arg constructor delegates to the object via `KernelProvider by NativeKernelProvider`, since `ServiceLoader` can't instantiate a Kotlin `object` directly). When unavailable, the cascade falls through to Panama (priority 50), preserving the M5 metric on environments without native code.
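The priority cascade this enables can be illustrated with a minimal stand-in (hypothetical Java types sketching what `KernelRegistry.bestAvailable()` does; the real registry lives in Kotlin and differs in detail):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class RegistrySketch {
    // Minimal stand-in for KernelProvider: name, priority, availability probe.
    record Provider(String name, int priority, boolean available) {}

    // Highest-priority provider whose availability probe succeeds.
    static Optional<Provider> bestAvailable(List<Provider> providers) {
        return providers.stream()
                .filter(Provider::available)
                .max(Comparator.comparingInt(Provider::priority));
    }

    public static void main(String[] args) {
        var providers = List.of(
                new Provider("scalar", 0, true),
                new Provider("panama-vector", 50, true),
                new Provider("native-ffm", 100, false)); // native lib failed to load
        // With the native provider unavailable, the cascade lands on Panama.
        assert bestAvailable(providers).orElseThrow().name().equals("panama-vector");
    }
}
```

Flipping the native provider's probe to `true` would make the same lookup return it instead, with no caller changes — which is the whole point of the priority-100 registration.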
## Staged delivery

PRs in order, each independently mergeable:

1. **`skainet-backend-native-cpu` module scaffolding** — Gradle module, build.gradle.kts wired to invoke CMake, a *trivial* C kernel (e.g. one that just multiplies its first input by 2.0 and writes to output) to prove the FFM pipeline end-to-end. A `NativeKernelProvider` that reports `isAvailable() = false` until the real kernel lands. Sets up the CI artifact path on the host arch.
2. **First real native kernel: Q4_K matmul (Apple Silicon NEON)** — hand-tuned kernel, parity tests vs `PanamaVectorQ4KMatmulKernel`, JMH bench variant added to `QuantizedMatmulBench`.
3. **`Q4KMemSegMatmulKernel` SPI sibling + native variant** — closes the M4 ↔ M5 zero-copy story for mmap'd weights.
4. **linuxX64 AVX2 variant + cross-arch CI build** — the cross-compilation matrix story.
5. **Optional: native FP32 matmul, native Q6_K, native Q8_0** — same shape as PRs 2–3, one per format.

The first PR (1) is the largest in *scaffolding* terms (~500–800 LoC of build glue + 1 trivial kernel), but every subsequent PR is small and template-able.
## Success metrics

- **PR 2 sign-off**: native Q4_K matmul on Apple Silicon clears **≥2.5×** over the scalar Q4_K dequant-then-matmul baseline at 4096² (the M5 milestone target). For reference: Panama Q4_K SIMD already exceeds this metric (see the #562 PR body, ~73 GFLOPS), so the real bar is "beats Panama by a meaningful margin", probably ≥1.5× over Panama.
- **PR 3 sign-off**: the Q4_K MemSeg native path is faster than the Panama Q4_K MemSeg path from #563, with no heap copy in the timed region.
- **No regression on JVM-only environments** — when the native lib fails to load (sandbox, missing arch, etc.), `KernelRegistry.bestAvailable()` cleanly falls through to Panama, and existing tests / benches show the same numbers as today.
## Risks & open questions

1. **JDK 21 preview vs JDK 22 stable.** FFM left preview in Java 22. The repo currently builds on JDK 21 with `--enable-preview --add-modules jdk.incubator.vector`. We need to decide: stay on JDK 21 preview FFM (smaller blast radius, matches the Vector API's status), or bump to JDK 22+ for stable FFM. **Recommendation**: stay on 21 preview; flip to 22 in a separate toolchain-bump PR.
2. **`MethodHandle` invocation overhead.** Even with FFM, each native call has a small fixed cost. For the smallest matmul shapes (e.g. 256² FP32) this could swamp the FLOPs win. Mitigation: route small inputs to Panama and large inputs to native at the registry/provider level, OR accept that the win is sized for production-relevant shapes (4096²+).
3. **Native code quality and maintenance.** Hand-tuned NEON / AVX2 in C is harder to audit than Kotlin Vector API code. Mitigation: keep kernels small (<300 LoC each), parity-test exhaustively, and prefer porting from ggml's reference (which is MIT-licensed and well-vetted) over writing from scratch.
4. **Distribution.** Native artifacts complicate Maven Central publication (they need a `<classifier>` per OS/arch). For the first internal-use PR this isn't a blocker, but a separate "publish native classifier JARs" PRD will be needed before community use.
5. **Cross-arch CI cost.** Building NEON natively on Apple Silicon CI plus AVX2 on linuxX64 plus the Android NDK doubles or triples build time. The xnnpack backend's existing CI matrix is a precedent — reuse the same approach.
6. **Native `MemorySegment` lifetime.** The Kotlin caller owns the `Arena` for arrays it copies in. The native kernel must NOT retain pointers past the FFM call return. Document this contract in the `NativeQ4KMatmulKernel.matmul` kdoc.
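The size-based routing mitigation in risk 2 is just a threshold dispatch. A sketch with a purely illustrative cutoff (no measured threshold exists yet; the enum and helper are hypothetical):

```java
public class RoutingSketch {
    enum Kernel { PANAMA, NATIVE }

    // Route small problems to the in-JVM kernel so the fixed native-call
    // cost can't dominate; larger shapes take the native path.
    static Kernel choose(int inputDim, int outputDim, boolean nativeAvailable) {
        long flops = 2L * inputDim * outputDim;  // matvec FLOP count
        long threshold = 2L * 1024 * 1024;       // illustrative cutoff, not measured
        if (!nativeAvailable || flops < threshold) return Kernel.PANAMA;
        return Kernel.NATIVE;
    }

    public static void main(String[] args) {
        assert choose(256, 256, true) == Kernel.PANAMA;    // 131072 FLOPs: below cutoff
        assert choose(4096, 4096, true) == Kernel.NATIVE;  // well above cutoff
        assert choose(4096, 4096, false) == Kernel.PANAMA; // native lib not loaded
    }
}
```

Whether this lives in the registry or in the provider's own kernel accessor is exactly the open question the risk item raises.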
## When to start

Trigger conditions (any one):

- A real workload demands the native ≥2.5× target (Panama Q4_K stops being fast enough on a customer machine).
- A community contributor offers a hand-tuned NEON / AVX2 Q4_K kernel that's measurably faster than Panama.
- A second M5 metric (e.g. SDPA throughput, training-loop throughput) needs hand-tuned native code.

Until then: **pause**. The Panama provider is doing the milestone-equivalent work in absolute terms, and adding a native build system is a meaningful complexity tax to take on speculatively.

README.md

Lines changed: 6 additions & 8 deletions
@@ -19,8 +19,8 @@ Add the core dependencies (Gradle Kotlin DSL):
 ```kotlin
 dependencies {
-    implementation("sk.ainet.core:SKaiNET-lang-core:0.20.0")
-    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.20.0")
+    implementation("sk.ainet.core:SKaiNET-lang-core:0.21.0")
+    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.21.0")
 }
 ```

@@ -137,13 +137,11 @@ SKaiNET is a modular ecosystem. While this repository contains the core engine,
 ---

-## What's New in 0.20.0
+## What's New in 0.21.0

-- **Q6_K Native Matmul** — New `Q6_KTensorData` stores 210-byte ggml blocks verbatim and a Vector-API SIMD kernel (`matmulQ6_KVec`) dispatches from `DefaultCpuOpsJvm.chooseQuantizedMatmul`. Together with the existing Q4_K infra, this unblocks running Gemma 4 E2B Q4_K_M (and any mostly-Q4_K + Q6_K checkpoint) through the DSL path without a ~12 GB FP32 dequant blow-up at load.
-- **Q4_K / Q6_K Lazy Shape-Swap Transpose** — `ops.transpose` on `Q4_KTensorData` / `Q6_KTensorData` now returns a new tensor wrapping the *same* packed byte array with swapped shape, matching the existing Q4/Q8 MemorySegment path. `linearProject(x, W)` can run `matmul(x, transpose(W))` on Q4_K/Q6_K weights without round-tripping through FP32 (Δ logits = 4.29e-6 vs FP32 baseline on Gemma).
-- **SDPA → StableHLO / IREE** — `scaledDotProductAttention` is now recorded by `RecordingExecution` and lowered to StableHLO as `dot_general(Q, K.T)` → scale → optional mask → softmax → `dot_general(weights, V)`, so attention blocks compile end-to-end through the SKaiNET → StableHLO → IREE path. (#543)
-- **SDPA Q/K/V Shape Validation** — Mismatched `head_dim` between Q/K or Q/V (seen in real Gemma 4 E2B with mixed-head-dim layers sharing a KV cache) used to surface as an `ArrayIndexOutOfBoundsException` deep in the dot-product loop; `scaledDotProductAttention` now fails fast with `require()` messages naming the offending dimensions.
-- **Toolchain bumps** — Kotlin 2.3.21, AGP 9.2.0, Ktor client 3.4.3.
+- **JVM CPU performance — Vector API SIMD across the board.** Pluggable `KernelProvider` SPI with priority-ordered lookup; FP32 matmul tile-blocked at **8.6×–10.8× over scalar**, Q4_K matmul fully SIMD-fused with inline dequant at **~30–73 GFLOPS** on Apple Silicon. Every quantized format we support (Q4_0, Q4_K, Q4_K MemSeg, Q6_K, Q8_0) is now SIMD'd to some degree.
+- **`ScratchPool` SPI and `TensorOps.permute(axes)`** — runtime workspace allocator for transient tensors and arbitrary-axis permutation.
+- **Native (FFM) kernel provider** captured as PRD in [`NATIVE_FFM_KERNEL_PROVIDER.md`](NATIVE_FFM_KERNEL_PROVIDER.md), deferred.

 See [CHANGELOG.md](CHANGELOG.md) for the full release history.
149147

gradle.properties

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 GROUP=sk.ainet.core
-VERSION_NAME=0.21.0-SNAPSHOT
+VERSION_NAME=0.21.0
 POM_DESCRIPTION=SKaiNET
 POM_URL=https://github.com/SKaiNET-developers/skainet/

0 commit comments