CHANGELOG.md (10 additions & 0 deletions)
@@ -2,6 +2,16 @@
## [Unreleased]
+## [0.31.0] - 2026-06-15
+
+### Fixed
+
+- **`ops.transpose` now lazily handles every packed matmul dtype.** The CPU backend's 2-D transpose rewraps the packed bytes with a flipped shape (a metadata-only "lazy transpose") for the K-series (Q4_K/Q5_K/Q6_K) and Q5_0/Q5_1, but **Q8_0 and Q4_0 fell through to the generic FP32 path**, which cast the Byte-backed buffer to Float and threw `ClassCastException`. Added the `Q8_0TensorData` and `Q4_0TensorData` cases so a packed Q8_0/Q4_0 matmul weight (e.g. a model's tied Q8_0 `lm_head`) survives `linearProject`'s `matmul(x, ops.transpose(W))` and dispatches to its packed kernel instead of crashing. `ops.transpose` now covers the full `chooseQuantizedMatmulHeap` dispatch set (Q4_K/Q5_K/Q6_K/Q5_0/Q5_1/Q8_0/Q4_0). Adds `transpose_preserves_every_packed_quant_type` (commonTest, jvm + linuxX64) as a regression guard. (PRs #736, #737)
+
+### Changed
+
+- **Bumped `com.networknt:json-schema-validator` to 3.0.4.** (PR #733)
-- **First-class Q5_K packed in-kernel dequant-matmul** across the CPU backends: a `Q5_KBlockTensorData` packed type and a `Q5KMatmulKernel` SPI with scalar (commonMain / Kotlin-Native), JVM Panama Vector, and native-C implementations, wired into `DefaultCpuOps` matmul dispatch + lazy transpose and the GGUF streaming loader. Q5_K weights now stay packed (no FP32 inflation) and dequantize inside the matmul, like Q4_K/Q6_K.
-- **Hand-written ARM NEON kernels** for the native CPU backend (fp32, q8_0, q4k, q5k), guarded by `__ARM_NEON` so x86 keeps its scalar / auto-vectorized path. The native CMake build gains an aarch64 branch (`-march=armv8.2-a+fp16+dotprod`, dotprod for Cortex-A55) plus an opt-in cross-compile.
-- **Kotlin/Native consumption of the C kernels via cinterop**: `skainet-backend-native-cpu` now also builds a static archive and exposes the kernels to Kotlin/Native (`linuxX64` + `linuxArm64`) through a `KernelProvider`, so on-device (non-JVM) binaries get the same hand-tuned kernels the JVM reaches via FFM. (PR #734)
+- **`ops.transpose` lazily handles every packed matmul dtype.** The CPU backend rewraps packed bytes with a flipped shape (metadata-only "lazy transpose") so a packed weight survives `linearProject`'s `matmul(x, transpose(W))` instead of inflating to FP32. Previously **Q8_0 and Q4_0** were missing from that path and threw a `ClassCastException` on the Byte → Float cast. Now the full dispatch set (Q4_K/Q5_K/Q6_K/Q5_0/Q5_1/Q8_0/Q4_0) transposes lazily, so a packed Q8_0/Q4_0 matmul weight (e.g. a tied Q8_0 `lm_head`) stays packed end-to-end on its NEON/SIMD kernel. Regression-tested across all seven packed types. (PRs #736, #737)
+- **0.30.0**: First-class **Q5_K packed in-kernel dequant-matmul** across the CPU backends (`Q5_KBlockTensorData` + `Q5KMatmulKernel` SPI: scalar / Panama Vector / native-C), **hand-written ARM NEON kernels** (fp32/q8_0/q4k/q5k, `-march=armv8.2-a+fp16+dotprod`), and **Kotlin/Native consumption of the C kernels via cinterop** (`skainet-backend-native-cpu` static archive + `linuxX64`/`linuxArm64` `KernelProvider`). (PR #734)
+
- **0.29.1**: `sk.ainet.core:skainet-compile-minerva` now publishes to Maven Central (packaging fix for the Minerva export module shipped in 0.29.0).
- **0.29.0**: **Minerva secure-MCU export module**, an end-to-end pipeline that lowers a SKaiNET model through shared graph-export contracts → Minerva IR → an `.npz` compiler input → a libminerva-packaged secure MCU project bundle, with host-side runtime verification and fingerprinted manifest artifacts (runnable sample, examples, ONNX workflow, getting-started docs). Plus **packed-quant matmul kernels with Kotlin/Native parity** (Q5_0/Q5_1/Q4_K/Q6_K: commonMain scalar + SPI, packed-quant dispatch in `DefaultCpuOpsBase`, Panama Vector for Q5_1/Q5_0 and Q6_K via the `KernelRegistry`), and an **auto-generated, CI-gated kernel × platform support matrix**. (PRs #697–#726)
- **0.28.1**: Kotlin DSL → StableHLO → IREE is green end-to-end for the whole conformance suite (7/7 models, 27/27 ops compile to a `vmfb`): `inferDagOutputSpecs` now infers correct output shapes for shape-changing ops, and `reduce_window` (pooling) emits IREE's generic region form. (PRs #674, #676)
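The "lazy transpose" described in the 0.31.0 entry can be illustrated with a minimal sketch. Everything here is hypothetical: `TensorData`, `Fp32Data`, `PackedData`, and `transpose` are stand-ins invented for illustration, not SKaiNET's real classes or its `ops.transpose` signature. The only outside fact assumed is the GGML Q8_0 layout (32 values per 34-byte block). The sketch shows the idea the entry states: a packed weight keeps its byte buffer and only its shape is flipped, while FP32 data takes a generic copying path.

```kotlin
// Hypothetical stand-ins for illustration only; not SKaiNET's actual types.
sealed interface TensorData {
    val rows: Int
    val cols: Int
}

// Plain row-major FP32 storage.
class Fp32Data(override val rows: Int, override val cols: Int, val values: FloatArray) : TensorData

// Packed block-quantized storage: the bytes are opaque to transpose.
class PackedData(
    val format: String,
    override val rows: Int,
    override val cols: Int,
    val bytes: ByteArray,
) : TensorData

fun transpose(t: TensorData): TensorData = when (t) {
    // Metadata-only path: reuse the same byte array, flip the shape.
    is PackedData -> PackedData(t.format, t.cols, t.rows, t.bytes)
    // Generic path: materialize a transposed FP32 copy.
    is Fp32Data -> {
        val out = FloatArray(t.values.size)
        for (r in 0 until t.rows) {
            for (c in 0 until t.cols) {
                out[c * t.rows + r] = t.values[r * t.cols + c]
            }
        }
        Fp32Data(t.cols, t.rows, out)
    }
}

fun main() {
    // Assumed Q8_0 layout: 4 rows x 32 cols = 4 blocks of 34 bytes each.
    val w = PackedData("Q8_0", rows = 4, cols = 32, bytes = ByteArray(4 * 34))
    val wt = transpose(w) as PackedData
    println("${wt.format} ${wt.rows}x${wt.cols} sharesBytes=${wt.bytes === w.bytes}")
}
```

Because the `when` is exhaustive over a sealed interface, adding a new packed type without a transpose case becomes a compile error in this sketch, which is one way to guard against a format silently falling through to the FP32 path.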
docs/modules/ROOT/pages/reference/kernel-support-matrix.adoc (1 addition & 1 deletion)
@@ -1,7 +1,7 @@
= Kernel × platform support matrix
:description: Which compute-kernel provider serves each weight format on each KMP target.
-Generated from `kernel-support.json` (version `0.30.0`) by `KernelSupportMatrixTest` (registry introspection of the registered `KernelProvider` implementations). Do not edit by hand; run `./gradlew generateKernelMatrix` to refresh.
+Generated from `kernel-support.json` (version `0.31.0`) by `KernelSupportMatrixTest` (registry introspection of the registered `KernelProvider` implementations). Do not edit by hand; run `./gradlew generateKernelMatrix` to refresh.
Each cell is the best (highest-priority) provider that serves `Float32 × format` `matmul` on that platform: *native-ffm* (100) → *panama-vector* (50) → *scalar* (0). An empty cell (`—`) means no provider carries a kernel there (the format is dequant-to-FP32 only).
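
The cell rule above can be sketched as a small selection function. This is an illustration under assumptions, not the project's `KernelProvider` API: the `Provider` class, the `bestProvider` function, and the format sets assigned to each provider are all made up for the example. Only the three provider names and their priorities (100, 50, 0) come from the text.

```kotlin
// Hypothetical model of the matrix rule; not the real KernelProvider interface.
data class Provider(val name: String, val priority: Int, val formats: Set<String>)

// Returns the highest-priority provider that serves the format,
// or null when no provider carries a kernel (an empty cell).
fun bestProvider(providers: List<Provider>, format: String): String? =
    providers
        .filter { format in it.formats }
        .maxByOrNull { it.priority }
        ?.name

fun main() {
    // Illustrative registrations only; the real per-platform coverage differs.
    val registered = listOf(
        Provider("scalar", 0, setOf("Q4_K", "Q8_0")),
        Provider("panama-vector", 50, setOf("Q4_K")),
        Provider("native-ffm", 100, setOf("Q8_0")),
    )
    for (format in listOf("Q4_K", "Q8_0", "Q2_K")) {
        println("$format -> ${bestProvider(registered, format) ?: "none"}")
    }
}
```

Returning `null` for the no-provider case keeps the "empty cell" distinct from any real provider name, so a caller has to decide explicitly how to render it.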