
Commit 61962de

michalharakal and claude committed
chore(release): prepare 0.21.0

- gradle.properties: drop -SNAPSHOT; RELEASE_SIGNING_ENABLED stays true.
- CHANGELOG: add 0.21.0 section covering the JVM Vector half of M5 (kernel SPI + Panama FP32 + tile-blocking + production routing + ServiceLoader auto-discovery + Q4_K SIMD + sibling SPI + Q4_K MemSeg + Q6_K SIMD + Q4_0 partial SIMD), plus ScratchPool SPI, TensorOps.permute, and Q4_K/Q5_K canonical layout fix.
- README: bump Quickstart coordinates to 0.21.0; compact "What's New" section.
- NATIVE_FFM_KERNEL_PROVIDER.md: PRD for the deferred priority-100 native FFM kernel provider — module layout, FFM binding pattern, staged delivery plan, success metrics, risks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent d48f172 commit 61962de

4 files changed

Lines changed: 214 additions & 9 deletions

File tree

CHANGELOG.md

Lines changed: 25 additions & 0 deletions
@@ -2,6 +2,31 @@
## [Unreleased]

## [0.21.0] - 2026-04-28

### Added

#### CPU kernel SPI (M5 — JVM Vector half complete)

This release lands the JVM Vector half of milestone M5 from the JVM inference performance roadmap — a pluggable kernel SPI parallel to `BackendProvider`, plus a Panama Vector provider that matches or beats the prior production path on every shape we measure. The native (FFM) priority-100 provider that closes the milestone metric is captured as a PRD ([NATIVE_FFM_KERNEL_PROVIDER.md](NATIVE_FFM_KERNEL_PROVIDER.md)) and deferred.

- **`KernelProvider` SPI** — `skainet-backend-api` now exposes a `KernelProvider` interface with `name`, `priority`, `isAvailable()`, and per-kernel accessors (`matmulFp32()`, `matmulQ4K()`). `KernelRegistry` does priority-ordered `bestAvailable()` lookup; a JVM-only `KernelServiceLoader.installAll()` auto-discovers providers via `META-INF/services/sk.ainet.backend.api.kernel.KernelProvider`. Manual `register(...)` still works for tests and non-JVM platforms. (PRs #554, #559)
- **`Fp32MatmulKernel` + `PanamaVectorMatmulKernel`** — JDK Vector API implementation using `FloatVector.SPECIES_PREFERRED` + `fma` + `reduceLanes`, cache-blocked with 8×8×128 tiles. `KernelMatmulBench` measures **8.61× / 8.62× / 10.83×** speedups over scalar at sizes 256/512/1024 (JDK 21.0.10, M-series macOS). Within JMH noise of — and often slightly faster than — the prior `JvmVectorKernels.matmulFloatBlocked` production path, so routing introduced no regression. (PRs #557, #558, #560)
- **Production matmul routes through `KernelRegistry`** — `DefaultCpuOpsJvm.matmul` now resolves the FP32 kernel via `KernelRegistry.bestAvailable()` instead of calling `JvmVectorKernels.matmulFloat*` directly. Production `MatmulBench` numbers post-routing match pre-routing within JMH noise. (PR #561)
- **`Q4KMatmulKernel` SPI + SIMD-fused Panama implementation** — Sibling kernel interface in `skainet-backend-api/commonMain`, with a `KernelProvider.matmulQ4K()` accessor (default `null` for backward compatibility). `PanamaVectorQ4KMatmulKernel` fuses Q4_K dequant inline with the FMA accumulator: a single `ByteVector` load feeds both the lo and hi sub-block accumulators per qs slab via AND/LSHR nibble extract → `castShape(B2F)` → FMA, with the lazy-`dmin` correction (`acc += scale·codeSum − offset·inputSum` once per sub-block). `QuantizedMatmulBench` measures 0.07/0.15/0.46 ms at 1024×1024 / 4096×1024 / 4096×4096 (≈30/55/73 GFLOPS — the same throughput regime as the FP32 SIMD kernel, meaning fused dequant adds essentially zero cost on top of the FMA). `DefaultCpuOpsJvm.chooseQuantizedMatmul`'s `Q4_KTensorData` branch routes through the SPI, falling through to the legacy kernel when no provider resolves. (PR #562)
- **Q4_K MemSeg SIMD** — The same fused-pipeline algorithm applied inline to `JvmQuantizedVectorKernels.matmulF32Q4_KMemSeg` (the path mmap'd weights take), using `ByteVector.fromMemorySegment` instead of `ByteVector.fromArray` — no heap copy. (PR #563)
- **Q6_K SIMD dequant** — `dequantQ6_KBlock` replaces its scalar 32-iteration loop with a `ByteVector`-based ql + qh extraction pipeline: per `floatStep`-wide chunk of `l`, it loads ql + qh slices, assembles `q1..q4 = (ql nibble) | ((qh slice) << 4) − 32` per lane, multiplies by the per-sub-block `d·scale`, and stores to four 32-element regions of the scratch FloatArray. (PR #564)
- **Q4_0 partial SIMD** — `dotQ4_0BlockMemSeg` uses a two-stage pattern: a scalar byte-pair unpack into a caller-supplied scratch FloatArray (16 byte loads, two nibbles each — half the byte traffic) followed by a `FloatVector` FMA reduction. This closes the last fully-scalar quantized kernel; every quantized format in `JvmQuantizedVectorKernels` (Q4_0, Q4_K, Q4_K MemSeg, Q6_K, Q8_0) is now SIMD'd to some degree. (PR #565)
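The lazy-`dmin` correction in the Q4_K entry above relies on factoring the constant offset out of the dot product, so the dequant never has to materialize `scale·q − min` per element. A scalar Java sketch of the nibble extract and that identity (illustrative values only; this is not the production Panama kernel):

```java
public class Q4KSketch {
    public static void main(String[] args) {
        // Nibble extract: one packed byte carries a lo and a hi 4-bit code.
        int packed = 0xA7;               // example byte: 1010 0111
        int lo = packed & 0x0F;          // 0111 -> 7
        int hi = (packed >>> 4) & 0x0F;  // 1010 -> 10
        assert lo == 7 && hi == 10;

        // Lazy-dmin identity: sum_i x_i * (scale*q_i - min)
        //                  == scale * sum_i (x_i * q_i)  -  min * sum_i x_i
        // All constants below are exactly representable floats, so the two
        // sides match exactly.
        float scale = 0.5f, min = 0.25f;
        float[] x = {1.0f, -2.0f, 0.5f, 3.0f};
        int[] q = {7, 10, 3, 15};
        float direct = 0f, codeDot = 0f, xSum = 0f;
        for (int i = 0; i < x.length; i++) {
            direct += x[i] * (scale * q[i] - min);  // naive per-element dequant
            codeDot += x[i] * q[i];                 // what the SIMD loop accumulates
            xSum += x[i];                           // precomputed once per sub-block
        }
        float factored = scale * codeDot - min * xSum;
        assert Math.abs(direct - factored) < 1e-5f; // needs -ea to check
    }
}
```

The factored form is what lets the SIMD loop accumulate raw codes with FMA and apply the `min` correction once per sub-block instead of per lane.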
#### Other

- **`ScratchPool` SPI** — Runtime workspace allocation for transient tensor scratch buffers: per-runtime size-classed slabs with scoped acquire/release. Closes the framework-side primitive for milestone M1 of the JVM perf roadmap. (PR #550)
- **`TensorOps.permute(axes)`** — Arbitrary-axis permutation (generalizes the existing `transpose` to N-D). (PR #552)
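The shape bookkeeping behind an arbitrary-axis permute can be sketched in a few lines of Java (a hypothetical helper for illustration; the actual SKaiNET API is Kotlin and also remaps strides/data):

```java
import java.util.Arrays;

public class PermuteSketch {
    // Reorder a shape by the given axis permutation: out[i] = shape[axes[i]].
    static int[] permuteShape(int[] shape, int[] axes) {
        int[] out = new int[shape.length];
        for (int i = 0; i < axes.length; i++) out[i] = shape[axes[i]];
        return out;
    }

    public static void main(String[] args) {
        int[] shape = {2, 3, 4};
        // axes = {2, 0, 1} moves the last axis to the front.
        int[] permuted = permuteShape(shape, new int[] {2, 0, 1});
        assert Arrays.equals(permuted, new int[] {4, 2, 3});
        // The 2-D case {1, 0} reduces to the existing transpose.
        assert Arrays.equals(
                permuteShape(new int[] {3, 5}, new int[] {1, 0}),
                new int[] {5, 3});
        System.out.println(Arrays.toString(permuted)); // [4, 2, 3]
    }
}
```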
### Fixed

- **Q4_K / Q5_K canonical ggml layout + FP32 MemSeg arena leak** — `Q4_KTensorData` and Q5_K dequant now apply the canonical ggml layout (super-block scale + per-sub-block scaleIdx/minIdx via `get_scale_min_k4` mixing, strided 4-bit codes layout). `MemorySegmentTensorDataFactory` uses `Arena.ofAuto()` for per-op outputs so the matmul / transpose output segments are GC-reclaimable; the prior `ofConfined()` builds leaked tens of MB per matmul, which over a 35-layer Gemma 4 forward pass exhausted the JVM direct-memory cap. Intermediate tensors in `ComputeGraphExecutor` are now freed based on liveness. (PR #556)
## [0.20.0] - 2026-04-24

### Added

NATIVE_FFM_KERNEL_PROVIDER.md

Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
# PRD — Native (FFM) Kernel Provider

**Status:** Deferred (post-0.21.0). Captured here so the design doesn't drift.
**Owner:** unassigned.
**Milestone:** M5 (CPU backend dispatch) — final piece. The roadmap's M5 success metric `native ≥2.5× for Q4_K` requires this provider; the JVM Vector half (PRs #554, #557, #560, #562, #563, #564) closed the Panama story but not the native one.

## Context

The kernel SPI shipped in PR #554 (`KernelProvider`, `Fp32MatmulKernel`, `KernelRegistry`) was designed to host **three** providers, ordered by priority:

| priority | provider | status |
|---------:|---|---|
| 0 | `ScalarKernelProvider` | shipped (#554) |
| 50 | `PanamaVectorKernelProvider` | shipped (#557, plus tile-blocking #560, ServiceLoader #559) |
| 100 | `NativeKernelProvider` (FFM) | **this PRD** |

The Panama provider runs the FP32 matmul at ~73 GFLOPS for square 4096² shapes on Apple Silicon (per #558's JMH bench), and the Q4_K SIMD kernel runs in the same throughput regime (#562 numbers). That is already in the ggml NEON ballpark — but ggml's hand-tuned NEON / AVX2 still outruns Panama on dense per-cycle FLOPs, and on Q4_K specifically, where 4-bit nibble unpacking maps cleanly to dedicated SIMD shuffles that the Vector API can't always emit.
A native provider closes that gap and unlocks two follow-ons:

- **M4 ↔ M5 synergy.** Mmap'd Q4_K weights stay as `MemorySegment` views; a native kernel reads the same pages with zero copy via FFI. No staging buffer, no `ByteArray` round-trip.
- **Hardware-specific lanes.** AVX-512 VNNI fused INT8 dot products, NEON `bf16`/`fp16` SDOT instructions, future SVE — the Vector API exposes none of these portably today.
## Goals

1. **A `NativeKernelProvider` registered at priority 100** that, on JDK 21+, wins `KernelRegistry.bestAvailable()` over Panama whenever the native lib loads successfully.
2. **A first concrete kernel: native Q4_K matmul.** It must:
   - take a `MemorySegment` for both the input (FP32) and the packed Q4_K weights (canonical ggml layout — the same as `Q4_KBlockTensorData` and the existing `matmulF32Q4_KMemSeg`);
   - produce numerically equivalent output to `PanamaVectorQ4KMatmulKernel` within `1e-4` relative tolerance (the same parity bar `PanamaVectorQ4KMatmulKernelTest` uses);
   - clear **≥2.5× over the prior Q4_K scalar dequant baseline** on the bench shapes from `QuantizedMatmulBench` (1024², 4096×1024, 4096²).
3. **Optional follow-on kernels** — Q6_K, Q8_0, FP32 matmul — share the build system but each ship as a separate small PR.
4. **One supported architecture for the first PR** (likely Apple Silicon NEON, since that's the development hardware in use), with a clear extension path for `linuxX64` AVX2 / `linuxArm64` NEON.
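Goal 2's parity bar amounts to a per-element relative-error check. A hedged Java sketch of what such a check typically looks like (the helper name and the absolute floor are assumptions for illustration, not the actual `PanamaVectorQ4KMatmulKernelTest` code):

```java
public class ParitySketch {
    // Relative error with an absolute floor so near-zero expected values
    // don't blow up the denominator.
    static boolean withinRelTol(float expected, float actual, float relTol) {
        float denom = Math.max(Math.abs(expected), 1e-6f);
        return Math.abs(expected - actual) / denom <= relTol;
    }

    public static void main(String[] args) {
        // 5e-5 relative error: inside the 1e-4 bar.
        assert withinRelTol(100.0f, 100.005f, 1e-4f);
        // 5e-4 relative error: outside the bar, must fail.
        assert !withinRelTol(100.0f, 100.05f, 1e-4f);
    }
}
```

Applied element-wise over the native and Panama output vectors, this is the sign-off gate for the kernel's numerics.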
## Non-goals

- **JNI.** The roadmap explicitly says "FFM, not JNI". JNI's per-call overhead and the global JNI lock are wrong for hot per-token kernels; FFM (stable in Java 22, preview in Java 21) gives near-zero-overhead native calls and a direct `MemorySegment` ABI.
- **Cross-compilation matrix on day one.** The first PR can ship just one (host-arch) variant; CI cross-arch builds come later.
- **Replacing Panama.** Panama remains the priority-50 fallback for environments that can't load native libs (sandboxes, Wasm, Native targets, JDKs without `jdk.incubator.vector`).
- **Distribution via pre-built native artifacts on Maven Central.** Out of scope for the first PR — local build only. A separate "publish native classifier JARs" PRD comes later.
## Architecture

### Module layout

```
skainet-backends/
  skainet-backend-native-cpu/                      # NEW
    src/
      jvmMain/kotlin/sk/ainet/exec/kernel/         # Kotlin side
        NativeKernelProvider.kt                    # priority=100, isAvailable() = libLoaded
        NativeQ4KMatmulKernel.kt                   # implements Q4KMatmulKernel, calls FFM
        NativeLibraryLoader.kt                     # loadLibrary, locate, check API version
      jvmMain/resources/META-INF/services/
        sk.ainet.backend.api.kernel.KernelProvider # appends NativeKernelProviderFactory
      jvmTest/kotlin/sk/ainet/exec/kernel/
        NativeQ4KMatmulKernelTest.kt               # parity vs PanamaVectorQ4KMatmulKernel
    native/                                        # native source tree
      c/
        q4k_matmul.c                               # ggml-style hand-tuned kernel
        q4k_matmul.h
      CMakeLists.txt                               # or Bazel BUILD
    build.gradle.kts                               # Gradle wrapper that invokes CMake
```

The native library compiles to a shared object (`libskainet_kernels.dylib` on macOS, `.so` on Linux, `.dll` on Windows) and is packaged into the module's resources for `System.loadLibrary` discovery.
### FFM binding pattern

Single C entry point per kernel:

```c
// q4k_matmul.h
void skainet_q4k_matmul(
    const float* input,     // FP32 input vector, length input_dim
    const uint8_t* weight,  // packed Q4_K bytes (canonical ggml layout)
    int32_t weight_byte_offset,
    int32_t input_dim,
    int32_t output_dim,
    float* output,          // FP32 output, length output_dim
    int32_t output_offset
);
```

Kotlin side:

```kotlin
internal object NativeQ4KMatmulKernel : Q4KMatmulKernel {
    private val handle: MethodHandle = run {
        val symbol = NativeLibraryLoader.lib.find("skainet_q4k_matmul").orElseThrow()
        Linker.nativeLinker().downcallHandle(
            symbol,
            FunctionDescriptor.ofVoid(
                ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
                ValueLayout.JAVA_INT, ValueLayout.JAVA_INT,
                ValueLayout.ADDRESS, ValueLayout.JAVA_INT,
            ),
        )
    }

    override fun matmul(
        input: FloatArray, inputOffset: Int,
        weight: ByteArray, weightByteOffset: Int,
        inputDim: Int, outputDim: Int,
        output: FloatArray, outputOffset: Int,
    ) {
        // Heap arrays: pass via a temporary off-heap MemorySegment + bulk copy,
        // OR (preferred) add a MemorySegment-input overload for mmap'd
        // weights to avoid the copy.
        ...
    }
}
```

The cleaner path is to introduce a sibling **`Q4KMemSegMatmulKernel`** SPI (mentioned as out-of-scope in #563) that takes `MemorySegment` directly, and have the native provider implement *that* — no heap copy. The `Q4KMatmulKernel` (ByteArray) variant can wrap the MemSeg one with a temporary `Arena.ofConfined()` copy if needed for legacy callers.
### Build system

**Gradle + CMake** is the path of least resistance:

- A new Gradle plugin (or hand-rolled `Exec` tasks) invokes CMake from the native module's `build` task.
- Native artifacts land in `build/native/<arch>/` and are copied into `src/jvmMain/resources/native/<os>-<arch>/` so `System.loadLibrary` finds them.
- The Kotlin compile depends on the native artifact being built first.

The `xnnpack` backend already in the repo (`skainet-backends/skainet-backend-xnnpack/`) demonstrates a similar pattern — Gradle invokes CMake to build a native lib via cinterop. **Reuse that template** rather than reinventing it.

**Architecture detection**: at native-module build time, query the host arch and build only for it (first-PR scope). A CI cross-arch matrix follows.
### Provider class

```kotlin
public object NativeKernelProvider : KernelProvider {
    override val name: String = "native-ffm"
    override val priority: Int = 100

    private val available: Boolean by lazy {
        runCatching { NativeLibraryLoader.load() }.isSuccess
    }

    override fun isAvailable(): Boolean = available

    override fun matmulFp32(): Fp32MatmulKernel? = null // future PR
    override fun matmulQ4K(): Q4KMatmulKernel? =
        if (isAvailable()) NativeQ4KMatmulKernel else null
}
```

Registered via the existing ServiceLoader mechanism (`META-INF/services/sk.ainet.backend.api.kernel.KernelProvider`; a factory wrapper class with a no-arg constructor delegates to the object via `KernelProvider by NativeKernelProvider`, since `ServiceLoader` can't instantiate a Kotlin `object` directly). When unavailable, the cascade falls through to Panama (priority 50), preserving the M5 metric on environments without native code.
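The priority cascade this enables can be illustrated with a minimal stand-in (hypothetical Java types sketching what `KernelRegistry.bestAvailable()` does; the real registry lives in Kotlin and differs in detail):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class RegistrySketch {
    // Minimal stand-in for KernelProvider: name, priority, availability probe.
    record Provider(String name, int priority, boolean available) {}

    // Highest-priority provider whose availability probe succeeds.
    static Optional<Provider> bestAvailable(List<Provider> providers) {
        return providers.stream()
                .filter(Provider::available)
                .max(Comparator.comparingInt(Provider::priority));
    }

    public static void main(String[] args) {
        var providers = List.of(
                new Provider("scalar", 0, true),
                new Provider("panama-vector", 50, true),
                new Provider("native-ffm", 100, false)); // native lib failed to load
        // With the native provider unavailable, the cascade lands on Panama.
        assert bestAvailable(providers).orElseThrow().name().equals("panama-vector");
    }
}
```

Flipping the native provider's probe to `true` would make the same lookup return it instead, with no caller changes — which is the whole point of the priority-100 registration.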
## Staged delivery

PRs in order, each independently mergeable:

1. **`skainet-backend-native-cpu` module scaffolding** — Gradle module, build.gradle.kts wired to invoke CMake, a *trivial* C kernel (e.g. one that just multiplies its first input by 2.0 and writes to output) to prove the FFM pipeline end-to-end. A `NativeKernelProvider` that reports `isAvailable() = false` until the real kernel lands. Sets up the CI artifact path on the host arch.
2. **First real native kernel: Q4_K matmul (Apple Silicon NEON)** — hand-tuned kernel, parity tests vs `PanamaVectorQ4KMatmulKernel`, JMH bench variant added to `QuantizedMatmulBench`.
3. **`Q4KMemSegMatmulKernel` SPI sibling + native variant** — closes the M4 ↔ M5 zero-copy story for mmap'd weights.
4. **linuxX64 AVX2 variant + cross-arch CI build** — the cross-compilation matrix story.
5. **Optional: native FP32 matmul, native Q6_K, native Q8_0** — same shape as PRs 2–3, one per format.

The first PR (1) is the largest in *scaffolding* terms (~500–800 LoC of build glue + 1 trivial kernel), but every subsequent PR is small and template-able.
## Success metrics

- **PR 2 sign-off**: native Q4_K matmul on Apple Silicon clears **≥2.5×** over the scalar Q4_K dequant-then-matmul baseline at 4096² (the M5 milestone target). For reference: Panama Q4_K SIMD already exceeds this metric (see the #562 PR body, ~73 GFLOPS), so the real bar is "beats Panama by a meaningful margin", probably ≥1.5× over Panama.
- **PR 3 sign-off**: the Q4_K MemSeg native path is faster than the Panama Q4_K MemSeg path from #563, with no heap copy in the timed region.
- **No regression on JVM-only environments** — when the native lib fails to load (sandbox, missing arch, etc.), `KernelRegistry.bestAvailable()` cleanly falls through to Panama, and existing tests / benches show the same numbers as today.
## Risks & open questions

1. **JDK 21 preview vs JDK 22 stable.** FFM left preview in Java 22. The repo currently builds on JDK 21 with `--enable-preview --add-modules jdk.incubator.vector`. We need to decide: stay on JDK 21 preview FFM (smaller blast radius, matches the Vector API's status), or bump to JDK 22+ for stable FFM. **Recommendation**: stay on 21 preview; flip to 22 in a separate toolchain-bump PR.
2. **`MethodHandle` invocation overhead.** Even with FFM, each native call has a small fixed cost. For the smallest matmul shapes (e.g. 256² FP32) this could swamp the FLOPs win. Mitigation: route small inputs to Panama and large inputs to native at the registry/provider level, OR accept that the win is sized for production-relevant shapes (4096²+).
3. **Native code quality and maintenance.** Hand-tuned NEON / AVX2 in C is harder to audit than Kotlin Vector API code. Mitigation: keep kernels small (<300 LoC each), parity-test exhaustively, and prefer porting from ggml's reference (which is MIT-licensed and well-vetted) over writing from scratch.
4. **Distribution.** Native artifacts complicate Maven Central publication (they need a `<classifier>` per OS/arch). For the first internal-use PR this isn't a blocker, but a separate "publish native classifier JARs" PRD will be needed before community use.
5. **Cross-arch CI cost.** Building NEON natively on Apple Silicon CI plus AVX2 on linuxX64 plus the Android NDK doubles or triples build time. The xnnpack backend's existing CI matrix is a precedent — reuse the same approach.
6. **Native `MemorySegment` lifetime.** The Kotlin caller owns the `Arena` for arrays it copies in. The native kernel must NOT retain pointers past the FFM call return. Document this contract in the `NativeQ4KMatmulKernel.matmul` kdoc.
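The size-based routing mitigation in risk 2 is just a threshold dispatch. A sketch with a purely illustrative cutoff (no measured threshold exists yet; the enum and helper are hypothetical):

```java
public class RoutingSketch {
    enum Kernel { PANAMA, NATIVE }

    // Route small problems to the in-JVM kernel so the fixed native-call
    // cost can't dominate; larger shapes take the native path.
    static Kernel choose(int inputDim, int outputDim, boolean nativeAvailable) {
        long flops = 2L * inputDim * outputDim;  // matvec FLOP count
        long threshold = 2L * 1024 * 1024;       // illustrative cutoff, not measured
        if (!nativeAvailable || flops < threshold) return Kernel.PANAMA;
        return Kernel.NATIVE;
    }

    public static void main(String[] args) {
        assert choose(256, 256, true) == Kernel.PANAMA;    // 131072 FLOPs: below cutoff
        assert choose(4096, 4096, true) == Kernel.NATIVE;  // well above cutoff
        assert choose(4096, 4096, false) == Kernel.PANAMA; // native lib not loaded
    }
}
```

Whether this lives in the registry or in the provider's own kernel accessor is exactly the open question the risk item raises.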
## When to start

Trigger conditions (any one):

- A real workload demands the native ≥2.5× target (Panama Q4_K stops being fast enough on a customer machine).
- A community contributor offers a hand-tuned NEON / AVX2 Q4_K kernel that's measurably faster than Panama.
- A second M5 metric (e.g. SDPA throughput, training-loop throughput) needs hand-tuned native code.

Until then: **pause**. The Panama provider is doing the milestone-equivalent work in absolute terms, and adding a native build system is a meaningful complexity tax to take on speculatively.

README.md

Lines changed: 6 additions & 8 deletions
@@ -19,8 +19,8 @@ Add the core dependencies (Gradle Kotlin DSL):
 ```kotlin
 dependencies {
-    implementation("sk.ainet.core:SKaiNET-lang-core:0.20.0")
-    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.20.0")
+    implementation("sk.ainet.core:SKaiNET-lang-core:0.21.0")
+    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.21.0")
 }
 ```

@@ -137,13 +137,11 @@ SKaiNET is a modular ecosystem. While this repository contains the core engine,
 ---

-## What's New in 0.20.0
+## What's New in 0.21.0

-- **Q6_K Native Matmul** — New `Q6_KTensorData` stores 210-byte ggml blocks verbatim and a Vector-API SIMD kernel (`matmulQ6_KVec`) dispatches from `DefaultCpuOpsJvm.chooseQuantizedMatmul`. Together with the existing Q4_K infra, this unblocks running Gemma 4 E2B Q4_K_M (and any mostly-Q4_K + Q6_K checkpoint) through the DSL path without a ~12 GB FP32 dequant blow-up at load.
-- **Q4_K / Q6_K Lazy Shape-Swap Transpose** — `ops.transpose` on `Q4_KTensorData` / `Q6_KTensorData` now returns a new tensor wrapping the *same* packed byte array with swapped shape, matching the existing Q4/Q8 MemorySegment path. `linearProject(x, W)` can run `matmul(x, transpose(W))` on Q4_K/Q6_K weights without round-tripping through FP32 (Δ logits = 4.29e-6 vs FP32 baseline on Gemma).
-- **SDPA → StableHLO / IREE** — `scaledDotProductAttention` is now recorded by `RecordingExecution` and lowered to StableHLO as `dot_general(Q, K.T)` → scale → optional mask → softmax → `dot_general(weights, V)`, so attention blocks compile end-to-end through the SKaiNET → StableHLO → IREE path. (#543)
-- **SDPA Q/K/V Shape Validation** — Mismatched `head_dim` between Q/K or Q/V (seen in real Gemma 4 E2B with mixed-head-dim layers sharing a KV cache) used to surface as an `ArrayIndexOutOfBoundsException` deep in the dot-product loop; `scaledDotProductAttention` now fails fast with `require()` messages naming the offending dimensions.
-- **Toolchain bumps** — Kotlin 2.3.21, AGP 9.2.0, Ktor client 3.4.3.
+- **JVM CPU performance — Vector API SIMD across the board.** Pluggable `KernelProvider` SPI with priority-ordered lookup; FP32 matmul tile-blocked at **8.6×–10.8× over scalar**, Q4_K matmul fully SIMD-fused with inline dequant at **~30–73 GFLOPS** on Apple Silicon. Every quantized format we support (Q4_0, Q4_K, Q4_K MemSeg, Q6_K, Q8_0) is now SIMD'd to some degree.
+- **`ScratchPool` SPI and `TensorOps.permute(axes)`** — runtime workspace allocator for transient tensors and arbitrary-axis permutation.
+- **Native (FFM) kernel provider** captured as PRD in [`NATIVE_FFM_KERNEL_PROVIDER.md`](NATIVE_FFM_KERNEL_PROVIDER.md), deferred.

 See [CHANGELOG.md](CHANGELOG.md) for the full release history.
149147

gradle.properties

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 GROUP=sk.ainet.core
-VERSION_NAME=0.21.0-SNAPSHOT
+VERSION_NAME=0.21.0
 POM_DESCRIPTION=SKaiNET
 POM_URL=https://github.com/SKaiNET-developers/skainet/

0 commit comments