
Commit 5839da9

Merge pull request #566 from SKaiNET-developers/release/0.21.0
chore(release): prepare 0.21.0
2 parents d48f172 + 0289bb6 commit 5839da9

3 files changed

Lines changed: 31 additions & 9 deletions


CHANGELOG.md

Lines changed: 25 additions & 0 deletions
@@ -2,6 +2,31 @@

## [Unreleased]

## [0.21.0] - 2026-04-28
### Added
#### CPU kernel SPI (M5 — JVM Vector half complete)
This release lands the JVM Vector half of milestone M5 from the JVM inference performance roadmap: a pluggable kernel SPI parallel to `BackendProvider`, plus a Panama Vector provider that matches or beats the prior production path on every shape we measure. A native (FFM) priority-100 provider, which would close out the milestone's headline metric, is deferred.
- **`KernelProvider` SPI** — `skainet-backend-api` now exposes a `KernelProvider` interface with `name`, `priority`, `isAvailable()`, and per-kernel accessors (`matmulFp32()`, `matmulQ4K()`). `KernelRegistry` performs priority-ordered `bestAvailable()` lookup; a JVM-only `KernelServiceLoader.installAll()` auto-discovers providers via `META-INF/services/sk.ainet.backend.api.kernel.KernelProvider`. Manual `register(...)` still works for tests and non-JVM platforms. See the sketch after this list. (PRs #554, #559)
- **`Fp32MatmulKernel` + `PanamaVectorMatmulKernel`** — JDK Vector API implementation using `FloatVector.SPECIES_PREFERRED` + `fma` + `reduceLanes`, cache-blocked with 8×8×128 tiles (a minimal FMA sketch follows this list). `KernelMatmulBench` measures **8.61× / 8.62× / 10.83×** speedup over scalar at square sizes 256/512/1024 (JDK 21.0.10, M-series macOS). It is within JMH noise of — and often slightly faster than — the prior `JvmVectorKernels.matmulFloatBlocked` production path, so routing introduced no regression. (PRs #557, #558, #560)
- **Production matmul routes through `KernelRegistry`** — `DefaultCpuOpsJvm.matmul` now resolves the FP32 kernel via `KernelRegistry.bestAvailable()` instead of calling `JvmVectorKernels.matmulFloat*` directly. Production `MatmulBench` numbers post-routing match pre-routing within JMH noise. (PR #561)
- **`Q4KMatmulKernel` SPI + SIMD-fused Panama implementation** — Sibling kernel interface in `skainet-backend-api/commonMain`, with a `KernelProvider.matmulQ4K()` accessor that defaults to `null` for backward compatibility. `PanamaVectorQ4KMatmulKernel` fuses Q4_K dequant inline with the FMA accumulator: a single `ByteVector` load feeds both lo and hi sub-block accumulators per qs slab via AND/LSHR nibble extract → `castShape(B2F)` → FMA, with the lazy-`dmin` correction (`acc += scale·codeSum − offset·inputSum` once per sub-block; see the scalar sketch after this list). `QuantizedMatmulBench` measures 0.07/0.15/0.46 ms at 1024×1024 / 4096×1024 / 4096×4096 (≈30/55/73 GFLOPS — the same throughput regime as the FP32 SIMD kernel, meaning fused dequant adds essentially zero cost on top of the FMA). `DefaultCpuOpsJvm.chooseQuantizedMatmul`'s `Q4_KTensorData` branch routes through the SPI, falling through to the legacy kernel when no provider resolves. (PR #562)
- **Q4_K MemSeg SIMD** — The same fused-pipeline algorithm applied inline to `JvmQuantizedVectorKernels.matmulF32Q4_KMemSeg` (the path mmap'd weights take), using `ByteVector.fromMemorySegment` instead of `ByteVector.fromArray`, so no heap copy is needed. (PR #563)
- **Q6_K SIMD dequant** — `dequantQ6_KBlock` replaces its scalar 32-iteration loop with a `ByteVector`-based ql + qh extraction pipeline: for each `floatStep`-wide chunk of `l`, it loads ql + qh slices, assembles `q1..q4 = (ql nibble) | ((qh slice) << 4) − 32` per lane, multiplies by the per-sub-block `d·scale`, and stores to four 32-element regions of the scratch FloatArray. (PR #564)
- **Q4_0 partial SIMD** — `dotQ4_0BlockMemSeg` gets a two-stage pattern: a scalar byte-pair unpack into a caller-supplied scratch FloatArray (16 byte loads, two nibbles each, so half the byte traffic), followed by a `FloatVector` FMA reduction. This closes out the last fully-scalar quantized kernel; every quantized format in `JvmQuantizedVectorKernels` (Q4_0, Q4_K, Q4_K MemSeg, Q6_K, Q8_0) is now SIMD'd to some degree. (PR #565)
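
The SPI shape referenced above, in miniature. This is an illustrative sketch, not the shipped interfaces: the names (`KernelProvider`, `KernelRegistry`, the accessors) come from the entries above, but exact signatures in `skainet-backend-api` may differ, and the changelog's `bestAvailable()` is rendered here as a per-kernel-type lookup.

```kotlin
// Illustrative sketch only; real signatures in skainet-backend-api may differ.
interface Fp32MatmulKernel {
    fun matmul(a: FloatArray, b: FloatArray, out: FloatArray, m: Int, k: Int, n: Int)
}
interface Q4KMatmulKernel // packed-Q4_K sibling, elided here

interface KernelProvider {
    val name: String
    val priority: Int                        // higher wins in the registry lookup
    fun isAvailable(): Boolean               // e.g. probe the Vector API at runtime
    fun matmulFp32(): Fp32MatmulKernel?
    fun matmulQ4K(): Q4KMatmulKernel? = null // default null keeps older providers compatible
}

object KernelRegistry {
    private val providers = mutableListOf<KernelProvider>()

    fun register(p: KernelProvider) { providers += p }

    // Priority-ordered lookup: the highest-priority available provider
    // that actually supplies the requested kernel wins.
    fun bestAvailableFp32(): Fp32MatmulKernel? = providers
        .sortedByDescending { it.priority }
        .firstOrNull { it.isAvailable() && it.matmulFp32() != null }
        ?.matmulFp32()
}
```

On the JVM, a service loader would walk the `META-INF/services` entries and call `register(...)` for each provider; on other platforms the same `register(...)` call is made by hand, which is what keeps the registry usable in tests.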
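The heart of the Panama FP32 path is an FMA-accumulated dot product with a scalar tail. A minimal sketch, assuming plain row-major `FloatArray` operands and `--add-modules jdk.incubator.vector`; the shipped kernel layers the 8×8×128 cache blocking on top of this loop:

```kotlin
import jdk.incubator.vector.FloatVector
import jdk.incubator.vector.VectorOperators

private val SPECIES = FloatVector.SPECIES_PREFERRED

// Vectorized dot product: lane-wise FMA accumulation, one horizontal
// reduction at the end, scalar loop for the remainder.
fun dotFma(a: FloatArray, aOff: Int, b: FloatArray, bOff: Int, len: Int): Float {
    var acc = FloatVector.zero(SPECIES)
    var i = 0
    val upper = SPECIES.loopBound(len)
    while (i < upper) {
        val va = FloatVector.fromArray(SPECIES, a, aOff + i)
        val vb = FloatVector.fromArray(SPECIES, b, bOff + i)
        acc = va.fma(vb, acc)            // acc = va * vb + acc, per lane
        i += SPECIES.length()
    }
    var sum = acc.reduceLanes(VectorOperators.ADD)
    while (i < len) { sum += a[aOff + i] * b[bOff + i]; i++ }  // scalar tail
    return sum
}
```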
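And the lazy-`dmin` trick behind the Q4_K entries, written scalar for clarity. Field names here are illustrative; the shipped kernel vectorizes the inner loop with `ByteVector`/`FloatVector` as described above.

```kotlin
// Scalar sketch of the fused Q4_K accumulation: one super-block holds
// 8 sub-blocks of 32 four-bit codes q in 0..15, with super-block scales
// d/dmin and 6-bit per-sub-block scale/min indices.
fun dotQ4KBlock(
    x: FloatArray, xOff: Int,            // activations for this block
    q: IntArray,                         // 256 nibble-extracted codes
    d: Float, dmin: Float,               // super-block scales
    subScale: IntArray, subMin: IntArray // per-sub-block scale/min indices
): Float {
    var acc = 0f
    for (j in 0 until 8) {
        var codeSum = 0f                 // Σ x_i · q_i over the sub-block
        var inputSum = 0f                // Σ x_i over the sub-block
        for (i in 0 until 32) {
            val xi = x[xOff + 32 * j + i]
            codeSum += xi * q[32 * j + i]
            inputSum += xi
        }
        // lazy-dmin correction: applied once per sub-block, not per element
        acc += d * subScale[j] * codeSum - dmin * subMin[j] * inputSum
    }
    return acc
}
```

Because the per-sub-block minimum factors out of the element loop, the correction costs two multiplies per 32 codes instead of a subtract per element, which is why fused dequant rides along with the FMA essentially for free.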
#### Other
- **`ScratchPool` SPI** — Runtime workspace allocation for transient tensor scratch buffers: per-runtime size-classed slabs with scoped acquire/release (sketched below this list). Closes out the framework-side primitive for milestone M1 of the JVM perf roadmap. (PR #550)
- **`TensorOps.permute(axes)`** — Arbitrary-axis permutation, generalizing the existing `transpose` to N-D (index-math sketch below). (PR #552)
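
A hypothetical sketch of the size-classed, scope-released pattern the `ScratchPool` entry describes; SKaiNET's actual API may differ, and per-runtime/thread scoping is omitted here.

```kotlin
// Illustrative pool: fixed size classes, free-list per class,
// scoped borrow so the slab always returns even on exception.
class ScratchPool(private val sizeClasses: IntArray = intArrayOf(4_096, 65_536, 1_048_576)) {
    private val free = Array(sizeClasses.size) { ArrayDeque<FloatArray>() }

    /** Borrow a scratch buffer of at least [n] floats for the duration of [block]. */
    fun <T> withScratch(n: Int, block: (FloatArray) -> T): T {
        val c = sizeClasses.indexOfFirst { it >= n }
        require(c >= 0) { "request of $n floats exceeds the largest size class" }
        val buf = free[c].removeFirstOrNull() ?: FloatArray(sizeClasses[c])
        try {
            return block(buf)        // scoped acquire: caller must not retain buf
        } finally {
            free[c].addFirst(buf)    // scoped release: slab returns to its class
        }
    }
}
```

Usage is `pool.withScratch(m * n) { tmp -> ... }`; the `finally` guarantees the release half of the acquire/release pair.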
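For `permute(axes)`, the index math generalizes `transpose`: output axis `i` reads input axis `axes[i]`, so an output multi-index `idx` maps to input offset `Σ idx[d] · inStride[axes[d]]`. An illustrative free-standing sketch over a flat row-major buffer (not SKaiNET's implementation):

```kotlin
// Arbitrary-axis permutation over a flat row-major buffer.
fun permute(data: FloatArray, shape: IntArray, axes: IntArray): Pair<FloatArray, IntArray> {
    val outShape = IntArray(shape.size) { shape[axes[it]] }
    // row-major strides of the input
    val inStrides = IntArray(shape.size)
    var s = 1
    for (d in shape.indices.reversed()) { inStrides[d] = s; s *= shape[d] }
    val out = FloatArray(data.size)
    val idx = IntArray(shape.size)       // multi-index in output coordinates
    for (flat in data.indices) {
        // out[idx] = in at the permuted multi-index
        var src = 0
        for (d in idx.indices) src += idx[d] * inStrides[axes[d]]
        out[flat] = data[src]
        // increment idx in row-major order
        var d = idx.size - 1
        while (d >= 0) { idx[d]++; if (idx[d] < outShape[d]) break; idx[d] = 0; d-- }
    }
    return out to outShape
}
```

On a 2-D shape, `permute(data, shape, intArrayOf(1, 0))` reproduces plain `transpose`.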
### Fixed
- **Q4_K / Q5_K canonical ggml layout + FP32 MemSeg arena leak** — `Q4_KTensorData` and Q5_K dequant now apply the canonical ggml layout (super-block scale plus per-sub-block scaleIdx/minIdx via `get_scale_min_k4` mixing, and the strided 4-bit codes layout). `MemorySegmentTensorDataFactory` now uses `Arena.ofAuto()` for per-op outputs so the matmul / transpose output segments are GC-reclaimable; the prior `ofConfined()` builds leaked tens of MB per matmul, which over a 35-layer Gemma 4 forward pass exhausted the JVM direct-memory cap (see the sketch below). The fix also adds liveness-based freeing of intermediate tensors in `ComputeGraphExecutor`. (PR #556)
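
The arena half of the fix is easiest to see in miniature. A sketch of the two lifetimes using the `java.lang.foreign` API (finalized in JDK 22, preview in 21); the function names are illustrative, not SKaiNET's factory:

```kotlin
import java.lang.foreign.Arena
import java.lang.foreign.ValueLayout

// Confined arena: segments live until arena.close(). A factory that allocates
// per-op outputs from a confined arena it never closes pins native memory forever.
fun leakyOutput(elements: Long) =
    Arena.ofConfined().allocate(ValueLayout.JAVA_FLOAT, elements)

// Automatic arena: segment lifetime follows reachability, so once the tensor
// wrapping the segment becomes garbage the GC can reclaim the native memory too.
fun reclaimableOutput(elements: Long) =
    Arena.ofAuto().allocate(ValueLayout.JAVA_FLOAT, elements)
```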
## [0.20.0] - 2026-04-24

### Added

README.md

Lines changed: 5 additions & 8 deletions
@@ -19,8 +19,8 @@ Add the core dependencies (Gradle Kotlin DSL):
```kotlin
dependencies {
-    implementation("sk.ainet.core:SKaiNET-lang-core:0.20.0")
-    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.20.0")
+    implementation("sk.ainet.core:SKaiNET-lang-core:0.21.0")
+    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.21.0")
}
```

@@ -137,13 +137,10 @@ SKaiNET is a modular ecosystem. While this repository contains the core engine,

---

-## What's New in 0.20.0
+## What's New in 0.21.0

-- **Q6_K Native Matmul** — New `Q6_KTensorData` stores 210-byte ggml blocks verbatim and a Vector-API SIMD kernel (`matmulQ6_KVec`) dispatches from `DefaultCpuOpsJvm.chooseQuantizedMatmul`. Together with the existing Q4_K infra, this unblocks running Gemma 4 E2B Q4_K_M (and any mostly-Q4_K + Q6_K checkpoint) through the DSL path without a ~12 GB FP32 dequant blow-up at load.
-- **Q4_K / Q6_K Lazy Shape-Swap Transpose** — `ops.transpose` on `Q4_KTensorData` / `Q6_KTensorData` now returns a new tensor wrapping the *same* packed byte array with swapped shape, matching the existing Q4/Q8 MemorySegment path. `linearProject(x, W)` can run `matmul(x, transpose(W))` on Q4_K/Q6_K weights without round-tripping through FP32 (Δ logits = 4.29e-6 vs FP32 baseline on Gemma).
-- **SDPA → StableHLO / IREE** — `scaledDotProductAttention` is now recorded by `RecordingExecution` and lowered to StableHLO as `dot_general(Q, K.T)` → scale → optional mask → softmax → `dot_general(weights, V)`, so attention blocks compile end-to-end through the SKaiNET → StableHLO → IREE path. (#543)
-- **SDPA Q/K/V Shape Validation** — Mismatched `head_dim` between Q/K or Q/V (seen in real Gemma 4 E2B with mixed-head-dim layers sharing a KV cache) used to surface as an `ArrayIndexOutOfBoundsException` deep in the dot-product loop; `scaledDotProductAttention` now fails fast with `require()` messages naming the offending dimensions.
-- **Toolchain bumps** — Kotlin 2.3.21, AGP 9.2.0, Ktor client 3.4.3.
+- **JVM CPU performance — Vector API SIMD across the board.** Pluggable `KernelProvider` SPI with priority-ordered lookup; FP32 matmul tile-blocked at **8.6×–10.8× over scalar**, Q4_K matmul fully SIMD-fused with inline dequant at **~30–73 GFLOPS** on Apple Silicon. Every quantized format we support (Q4_0, Q4_K, Q4_K MemSeg, Q6_K, Q8_0) is now SIMD'd to some degree.
+- **`ScratchPool` SPI and `TensorOps.permute(axes)`** — runtime workspace allocator for transient tensors and arbitrary-axis permutation.

See [CHANGELOG.md](CHANGELOG.md) for the full release history.

gradle.properties

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
GROUP=sk.ainet.core
-VERSION_NAME=0.21.0-SNAPSHOT
+VERSION_NAME=0.21.0
POM_DESCRIPTION=SKaiNET

POM_URL=https://github.com/SKaiNET-developers/skainet/
