Commit b5c3fe1

Merge pull request #181 from SKaiNET-developers/release/0.31.0
chore(release): prepare SKaiNET-transformers 0.31.0
2 parents 19d62d4 + 9f4dde7 commit b5c3fe1

8 files changed

Lines changed: 95 additions & 18 deletions

CHANGELOG.md

Lines changed: 42 additions & 0 deletions
@@ -7,6 +7,47 @@ version line is kept in lock-step with the underlying SKaiNET engine
 The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.31.0] — 2026-06-15
+
+Version-aligned with **SKaiNET 0.31.0**. Completes the eager board-decode path
+for FunctionGemma: the tied **Q8_0 lm_head now stays packed** (paired with the
+engine's `ops.transpose` fix for all packed dtypes), and `load()` can cap the
+context to fit constrained devices.
+
+### Added
+
+- **`maxInferenceLen` on `GemmaNetworkLoader.load()`** — an optional cap on the
+  context length the eager network sizes its KV cache + RoPE tables for (default
+  `min(contextLength, 4096)`, threaded through `applyWeightsToNetwork`
+  `gemmaNetwork`). A constrained-device consumer (e.g. the 1.9 GB SL2610 board)
+  can pass a small value (e.g. `32` for a short tool-call prompt) to shrink the
+  KV cache ~100×, which otherwise allocates ~0.4 GB at the first forward and OOMs
+  the board after the weights load. Default `null` preserves existing behaviour. (#180)
+
+### Changed
+
+- **`gradle/libs.versions.toml` `skainet` pin: 0.30.0 → 0.31.0.** Picks up the
+  engine's `ops.transpose` lazy-rewrap fix for **all** packed matmul dtypes
+  (Q8_0/Q4_0 added) — required so the packed Q8_0 lm_head below transposes
+  through `linearProject` instead of throwing `ClassCastException`. Downstream
+  consumers get the upstream SKaiNET BOM transparently via `:llm-bom`.
+- **`gradle.properties` `VERSION_NAME=0.31.0`.** Lock-step with the engine.
+- **`com.networknt:json-schema-validator` → 3.0.4.** (#175)
+
+### Fixed
+
+- **Tied Q8_0 lm_head stays packed in the eager `NATIVE_OPTIMIZED` Gemma path.**
+  FunctionGemma's `token_embd` is Q8_0 and tied, so `convertGemmaWeightsPacked`
+  was dequantizing **both** `token_embd` and `output` to FP32 (2×~0.67 GB) —
+  OOM on the 1.9 GB SL2610. `output`/lm_head now packs as Q8_0
+  (`packGemmaKQuant` gained a Q8_0 case; the row-major→block-major relayout is
+  generalized with a `blockSize` param) and runs on the (NEON) Q8_0 kernel;
+  `token_embd` stays FP32 (it is gathered, not matmul'd) but is wrapped no-copy
+  via `DenseFloatArrayTensorData` instead of `ctx.fromFloatArray` (which
+  allocated a second ~0.67 GB buffer). Tied embed/lm_head footprint
+  ~1.34 GB → ~0.76 GB. Verified byte-identical decode parity
+  (`GemmaQ5KPackedParityTest`) and a stable ~1.06 GB load on the SL2610. (#179)
+
 ## [0.30.0] — 2026-06-14
 
 Version-aligned with **SKaiNET 0.30.0**. Skips 0.29.x — SKaiNET-transformers
@@ -489,6 +530,7 @@ Version-aligned with **SKaiNET 0.21.0**.
 Last published transformers release before the engine-aligned version line.
 See `git log v0.16.0..0.18.0` for details.
 
+[0.31.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.31.0
 [0.30.0]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.30.0
 [0.28.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.28.1
 [0.23.1]: https://github.com/SKaiNET-developers/SKaiNET-transformers/releases/tag/0.23.1
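
Note on the `maxInferenceLen` entry in the CHANGELOG diff above: a KV cache grows linearly with the sequence length it is sized for, which is why capping the length shrinks it so sharply. The sketch below is a minimal illustration, not the loader's code. `KvDims`, `effectiveLen`, and `kvCacheBytes` are hypothetical names, the dimensions are made-up placeholders rather than FunctionGemma's real configuration, and the resolution rule (an explicit cap wins, otherwise `min(contextLength, 4096)`) is an assumption read off the changelog wording.

```kotlin
import kotlin.math.min

/** Hypothetical model dimensions, for illustration only (not FunctionGemma's real config). */
data class KvDims(val layers: Int, val kvHeads: Int, val headDim: Int, val bytesPerElem: Int)

/** Assumed rule: an explicit cap wins, otherwise fall back to min(contextLength, 4096). */
fun effectiveLen(contextLength: Int, maxInferenceLen: Int? = null): Int =
    maxInferenceLen ?: min(contextLength, 4096)

/** K and V each hold layers * kvHeads * headDim * seqLen elements, hence the factor 2. */
fun kvCacheBytes(dims: KvDims, seqLen: Int): Long =
    2L * dims.layers * dims.kvHeads * dims.headDim * seqLen * dims.bytesPerElem

fun main() {
    // Placeholder dimensions; substitute the real model's values to get real sizes.
    val dims = KvDims(layers = 16, kvHeads = 2, headDim = 128, bytesPerElem = 4)

    val defaultLen = effectiveLen(contextLength = 32_768)
    val cappedLen = effectiveLen(contextLength = 32_768, maxInferenceLen = 32)

    val defaultBytes = kvCacheBytes(dims, defaultLen)
    val cappedBytes = kvCacheBytes(dims, cappedLen)

    println("default: $defaultLen tokens, $defaultBytes bytes")
    println("capped:  $cappedLen tokens, $cappedBytes bytes")
    println("ratio:   ${defaultBytes / cappedBytes}x")
}
```

Under these assumptions, going from the default 4096 tokens to a cap of 32 is a 4096 / 32 = 128-fold reduction, in line with the changelog's "~100×" figure. Absolute byte counts depend on the real model dimensions and element type, which this sketch does not attempt to reproduce.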

README.md

Lines changed: 25 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -103,21 +103,20 @@ Honest status — see the project-status note at the top of this README.
103103

104104
## Current release
105105

106-
The current release is **0.30.0** — version-aligned with **SKaiNET 0.30.0**.
107-
Skips 0.29.x: SKaiNET-transformers tracked the engine internally across that
108-
window without a tagged release. The headline is that **Q5_K weights now stay
109-
packed in the eager Gemma runtime** (SKaiNET 0.30.0 ships a first-class Q5_K
110-
packed matmul) and the Gemma `NATIVE_OPTIMIZED` packed-weight path is now
111-
**Kotlin/Native–ready** — the board binary can keep K-quant weights packed
112-
without the JVM's `java.lang.foreign` MemSeg path. FunctionGemma-270M (`Q5_K_M`)
113-
decodes byte-identically across the FP32 baseline and both packed paths
114-
(`GemmaQ5KPackedParityTest`).
106+
The current release is **0.31.0** — version-aligned with **SKaiNET 0.31.0**.
107+
The headline is that the eager `NATIVE_OPTIMIZED` Gemma path now keeps the
108+
**tied Q8_0 lm_head packed** (paired with SKaiNET 0.31.0's `ops.transpose` fix
109+
for all packed dtypes), and `GemmaNetworkLoader.load()` takes an optional
110+
`maxInferenceLen` to cap the KV cache for constrained devices — together
111+
dropping FunctionGemma-270M's footprint enough to load eagerly on the 1.9 GB
112+
Astra Machina SL2610. FunctionGemma (`Q5_K_M`) still decodes byte-identically
113+
across the FP32 baseline and both packed paths (`GemmaQ5KPackedParityTest`).
115114

116115
The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place.
117116

118117
```kotlin
119118
dependencies {
120-
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
119+
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
121120

122121
// Versions resolved from the BOM:
123122
implementation("sk.ainet.transformers:skainet-transformers-core")
@@ -194,6 +193,22 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n
194193

195194
See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.
196195

196+
## What's new in 0.31.0
197+
198+
- **Tied Q8_0 lm_head stays packed (eager `NATIVE_OPTIMIZED`).** FunctionGemma's
199+
`token_embd` is Q8_0 and tied, so `convertGemmaWeightsPacked` was dequantizing
200+
*both* `token_embd` and `output` to FP32 (2×~0.67 GB) — OOM on the 1.9 GB
201+
SL2610. `output`/lm_head now packs as Q8_0 (runs on the NEON Q8_0 kernel);
202+
`token_embd` stays FP32 (it's gathered) but is wrapped no-copy. Footprint
203+
~1.34 GB → ~0.76 GB; byte-identical decode (`GemmaQ5KPackedParityTest`),
204+
stable ~1.06 GB load on the SL2610.
205+
- **`GemmaNetworkLoader.load(maxInferenceLen = …)`** — cap the context so the KV
206+
cache + RoPE tables stay tiny on constrained devices (default
207+
`min(contextLength, 4096)`).
208+
- **Engine pin `skainet 0.30.0 → 0.31.0`** — picks up `ops.transpose`'s
209+
lazy-rewrap fix for all packed matmul dtypes (Q8_0/Q4_0), required so the
210+
packed lm_head transposes through `linearProject` instead of `ClassCastException`.
211+
197212
## What's new in 0.30.0
198213

199214
- **Q5_K stays packed in the eager Gemma runtime.** `GemmaMemSegConverter` used to

docs/modules/ROOT/pages/tutorials/getting-started-java.adoc

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@ In your `build.gradle.kts`:
2525
[source,kotlin]
2626
----
2727
dependencies {
28-
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
28+
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
2929
3030
implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
3131
implementation("sk.ainet.transformers:skainet-transformers-agent")
@@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts):
4141
<dependency>
4242
<groupId>sk.ainet.transformers</groupId>
4343
<artifactId>skainet-transformers-bom</artifactId>
44-
<version>0.30.0</version>
44+
<version>0.31.0</version>
4545
<type>pom</type>
4646
<scope>import</scope>
4747
</dependency>

docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -52,7 +52,7 @@ The pieces you need live in three modules:
5252
[source,kotlin]
5353
----
5454
dependencies {
55-
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.30.0"))
55+
implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
5656
5757
implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
5858
implementation("sk.ainet.transformers:skainet-transformers-agent")

gradle.properties

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
GROUP=sk.ainet.transformers
2-
VERSION_NAME=0.30.0
2+
VERSION_NAME=0.31.0
33

44
POM_DESCRIPTION=SKaiNET-transformers
55

@@ -33,6 +33,15 @@ kotlin.mpp.enableCInteropCommonization=true
3333
#Android
3434
android.useAndroidX=true
3535
android.nonTransitiveRClass=true
36+
# AGP's DependencyResolutionChecks fails the build when a configuration resolves
37+
# at configuration time. KGP's KotlinPackageJsonTask resolves the Kotlin/JS + Wasm
38+
# `*NpmAggregated` configs at config time (we have JS npm deps: ktor-client-js,
39+
# kotlinx-browser), so `assemble`/`allTests` throw `Configuration 'jsNpmAggregated'
40+
# was resolved during configuration time` (gradle#31483) — a false positive against
41+
# KGP's known behaviour. Downgrade AGP's check from fail to warn. NOTE: AGP reads
42+
# this option only from the project gradle.properties — NOT from -P or the CI's
43+
# ~/.gradle/gradle.properties.
44+
android.dependencyResolutionAtConfigurationTime.disallow=false
3645

3746
kotlin.mpp.stability.nowarn=true
3847

gradle/libs.versions.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
[versions]
2-
skainet = "0.30.0"
2+
skainet = "0.31.0"
33
agp = "9.2.1"
44
jacksonDatabind = "2.22.0"
55
jsonSchemaValidator = "3.0.4"

llm-inference/gemma/api/jvm/gemma.api

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -862,7 +862,8 @@ public final class sk/ainet/models/gemma/GemmaNetworkLoader$WeightsProvider$Safe
862862
}
863863

864864
public final class sk/ainet/models/gemma/GemmaNetworkLoaderKt {
865-
public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;Z)Lsk/ainet/lang/nn/Module;
865+
public static final fun applyWeightsToNetworkNonReified (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;ZLjava/lang/Integer;)Lsk/ainet/lang/nn/Module;
866+
public static synthetic fun applyWeightsToNetworkNonReified$default (Lsk/ainet/context/ExecutionContext;Lsk/ainet/models/gemma/Gemma4Weights;Lkotlin/reflect/KClass;ZLjava/lang/Integer;ILjava/lang/Object;)Lsk/ainet/lang/nn/Module;
866867
}
867868

868869
public final class sk/ainet/models/gemma/GemmaPackedWeightsKt {

llm-inference/gemma/src/commonTest/kotlin/sk/ainet/models/gemma/GemmaQuantLayoutTest.kt

Lines changed: 12 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ import sk.ainet.context.DirectCpuExecutionContext
88
import sk.ainet.io.gguf.GGMLQuantizationType
99
import sk.ainet.lang.tensor.Shape
1010
import sk.ainet.lang.tensor.data.Q5_KBlockTensorData
11+
import sk.ainet.lang.tensor.data.Q8_0BlockTensorData
1112
import sk.ainet.lang.types.FP32
1213
import sk.ainet.lang.types.Int8
1314

@@ -55,8 +56,17 @@ class GemmaQuantLayoutTest {
5556
}
5657

5758
@Test
58-
fun pack_non_kquant_returns_null() {
59-
assertNull(packGemmaKQuant<FP32>(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32)))
59+
fun pack_q8_0_produces_block_tensor() {
60+
// Q8_0 is now packed (32 elems / 34 B per block) so a tied Q8_0 lm_head
61+
// stays packed and runs on the Q8_0 kernel instead of dequanting to FP32.
62+
val td = packGemmaKQuant<FP32>(ByteArray(34), GGMLQuantizationType.Q8_0, Shape(1, 32))
63+
assertTrue(td is Q8_0BlockTensorData, "Q8_0 should pack to Q8_0BlockTensorData")
64+
}
65+
66+
@Test
67+
fun pack_unsupported_quant_returns_null() {
68+
// A quant type with no packed kernel (e.g. Q4_1) falls back to FP32 dequant.
69+
assertNull(packGemmaKQuant<FP32>(ByteArray(20), GGMLQuantizationType.Q4_1, Shape(1, 32)))
6070
}
6171

6272
@Test
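
Note on the Q8_0 test above: the `ByteArray(34)` and `ByteArray(20)` inputs are each exactly one block for a `Shape(1, 32)` tensor. The sketch below works through that block arithmetic. `BlockFormat`, `packedBytes`, and `fp32Bytes` are hypothetical helpers written for this note and are not part of SKaiNET; the 32-element / 34-byte Q8_0 geometry comes from the test comment, and the Q4_1 breakdown follows the GGML block layout.

```kotlin
/** Block geometry for two GGML quant formats: elements per block and bytes per block. */
enum class BlockFormat(val elems: Int, val bytes: Int) {
    Q8_0(32, 34), // 2-byte fp16 scale + 32 int8 values
    Q4_1(32, 20)  // 2-byte fp16 scale + 2-byte fp16 min + 16 bytes of packed 4-bit values
}

/** Bytes needed to store a rows x cols tensor in the given block format. */
fun packedBytes(rows: Int, cols: Int, fmt: BlockFormat): Long {
    require(cols % fmt.elems == 0) { "cols must be a multiple of ${fmt.elems}" }
    return rows.toLong() * (cols / fmt.elems) * fmt.bytes
}

/** Bytes needed for the same tensor after dequantizing to FP32 (4 bytes per element). */
fun fp32Bytes(rows: Int, cols: Int): Long = rows.toLong() * cols * 4

fun main() {
    // Same shape as the test: one row of 32 elements, i.e. a single block.
    println("Q8_0: ${packedBytes(1, 32, BlockFormat.Q8_0)} bytes")
    println("Q4_1: ${packedBytes(1, 32, BlockFormat.Q4_1)} bytes")
    println("FP32: ${fp32Bytes(1, 32)} bytes")
}
```

A single 32-element row takes 34 bytes as Q8_0 versus 128 bytes as FP32, so a tensor that stays packed is roughly 3.8 times smaller than its dequantized copy.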
