Commit 217c1cb

Merge pull request #197 from SKaiNET-developers/release/0.32.0
release: SKaiNET-transformers 0.32.0
2 parents 3179e9e + 068647f commit 217c1cb

10 files changed

Lines changed: 403 additions & 338 deletions

CHANGELOG.md

Lines changed: 33 additions & 0 deletions
@@ -7,6 +7,39 @@ version line is kept in lock-step with the underlying SKaiNET engine
 The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.32.0] — 2026-06-25
+
+Brings the real-GGUF **Llama** eager path up to the Gemma standard (packed
+`NATIVE_OPTIMIZED`) and **unblocks StableHLO/IREE export for Llama-family models**
+(traceable interleaved RoPE). Ships against engine **0.32.0**.
+
+### Added
+
+- **Eager `NATIVE_OPTIMIZED` packed path for Llama.** `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED)`
+  keeps `Q4_K`/`Q6_K` weights packed and runs them through `OptimizedLLMRuntime` — new `LlamaQuantLayout`
+  + `LlamaPackedWeights.convertLlamaWeightsPacked`, mirroring `convertGemmaWeightsPacked`. Coherent
+  output matching llama.cpp; the low-footprint path real-GGUF Llama inference on constrained ARM was
+  missing. (ccbd87e)
+
+### Changed
+
+- **Fused decode-attention fast path.** `MultiHeadAttention`'s decode step (`seqQ == 1`) now computes
+  scores → softmax → GQA-weighted-V directly from the cached K/V, bypassing the `repeatKVHeads` concat
+  and the `unsqueeze → SDPA → squeeze → permute` chain — ~1.5× decode throughput, bit-identical output.
+  Prefill (`seqLen > 1`) keeps the general SDPA path. (3791f88)
+- **Engine pin `skainet 0.31.0 → 0.32.0`.**
+
+### Fixed
+
+- **Packed token-embedding gather for Llama.** `fromGguf(NATIVE_OPTIMIZED)` no longer fails with
+  `gather: unsupported input rank 1`; the packed embedding is wired through the canonical loader. (ccbd87e)
+- **Interleaved RoPE is now traceable.** In `INTERLEAVED` mode (Llama / Mistral / most GGUF) the rotation
+  used a raw float-array path (`copyToFloatArray` / `fromFloatArray`) that, under graph tracing, baked the
+  rotated Q/K as a *disconnected constant* — severing them from the projection weights and crashing
+  `iree-compile` (null-deref in constant folding) on the exported graph. `RoPE` now records the rotation
+  as tensor ops when running under the tracing wrapper; eager execution keeps the byte-identical raw-array
+  fast path. Unblocks Llama/Mistral/GGUF StableHLO/IREE export. (019b049)
+
 ## [0.31.1] — 2026-06-17
 
 Adds **`transformer-core`** — the framework NN primitives (attention, the KV-cache family, embedding,
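
The fused decode step in the 0.32.0 `Changed` entry above (scores → softmax → GQA-weighted V read straight from the cached K/V, with no `repeatKVHeads` concat) can be illustrated with a minimal plain-Kotlin sketch. This is not the `MultiHeadAttention` implementation: the function name `decodeAttention`, the nested-array layouts, and the `1/sqrt(headDim)` scaling are assumptions chosen for illustration.

```kotlin
import kotlin.math.exp
import kotlin.math.sqrt

/**
 * Hypothetical single-token (seqQ == 1) grouped-query attention.
 * Assumed layouts (not the library's tensor types):
 *   q:      [nHeads][headDim]
 *   kCache: [nKvHeads][seqLen][headDim]
 *   vCache: [nKvHeads][seqLen][headDim]
 * Returns the attention output as [nHeads][headDim].
 */
fun decodeAttention(
    q: Array<FloatArray>,
    kCache: Array<Array<FloatArray>>,
    vCache: Array<Array<FloatArray>>,
): Array<FloatArray> {
    val nHeads = q.size
    val nKvHeads = kCache.size
    require(nKvHeads > 0 && nHeads % nKvHeads == 0) { "nHeads must be a multiple of nKvHeads" }
    val groupSize = nHeads / nKvHeads
    val headDim = q[0].size
    val scale = 1.0f / sqrt(headDim.toFloat())

    return Array(nHeads) { h ->
        // Index the shared KV head directly instead of materializing repeated heads.
        val kv = h / groupSize
        val keys = kCache[kv]
        val values = vCache[kv]
        require(keys.isNotEmpty()) { "KV cache must hold at least one position" }
        val seqLen = keys.size

        // scores[t] = (q_h . k_t) * scale
        val scores = FloatArray(seqLen)
        var maxScore = Float.NEGATIVE_INFINITY
        for (t in 0 until seqLen) {
            var dot = 0f
            for (d in 0 until headDim) dot += q[h][d] * keys[t][d]
            scores[t] = dot * scale
            if (scores[t] > maxScore) maxScore = scores[t]
        }

        // Numerically stable softmax: subtract the maximum before exponentiating.
        var sum = 0f
        for (t in 0 until seqLen) {
            scores[t] = exp(scores[t] - maxScore)
            sum += scores[t]
        }

        // Weighted sum of the cached V rows.
        val out = FloatArray(headDim)
        for (t in 0 until seqLen) {
            val w = scores[t] / sum
            for (d in 0 until headDim) out[d] += w * values[t][d]
        }
        out
    }
}
```

In this sketch each query head `h` reads its shared KV head as `h / groupSize`, which stands in for the concat-free access the entry describes; prefill with more than one query position would need the general path.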

README.md

Lines changed: 34 additions & 14 deletions
@@ -103,25 +103,27 @@ Honest status — see the project-status note at the top of this README.
 
 ## Current release
 
-The current release is **0.31.1** (against **SKaiNET 0.31.0**). It adds
-**`transformer-core`** — the framework NN primitives (attention, KV-cache family,
-embedding, norms, RoPE, FFNs, linear projection) extracted out of `llm-core` so they
-build on the **full target matrix including `androidNative`** (32-bit + 64-bit ARM);
-`llm-core` re-exports it, so nothing changes for existing consumers, and ARM-native
-downstreams (e.g. on-device whisper) can reuse the primitives instead of reimplementing
-them. The 0.31.0 highlights still apply: the eager `NATIVE_OPTIMIZED` Gemma path keeps the
-**tied Q8_0 lm_head packed** (paired with SKaiNET 0.31.0's `ops.transpose` fix
-for all packed dtypes), and `GemmaNetworkLoader.load()` takes an optional
-`maxInferenceLen` to cap the KV cache for constrained devices — together
-dropping FunctionGemma-270M's footprint enough to load eagerly on the 1.9 GB
-Astra Machina SL2610. FunctionGemma (`Q5_K_M`) still decodes byte-identically
-across the FP32 baseline and both packed paths (`GemmaQ5KPackedParityTest`).
+The current release is **0.32.0** (against **SKaiNET 0.32.0**). It brings the
+real-GGUF **Llama** eager path up to the Gemma standard and **unblocks StableHLO/IREE
+export for Llama-family models**:
+
+- The eager **`NATIVE_OPTIMIZED` path now works for Llama** (`Q4_K`/`Q6_K`): weights stay
+  packed and `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED) + OptimizedLLMRuntime` decodes
+  coherently, matching llama.cpp — fixing the packed token-embedding
+  `gather: unsupported input rank 1`.
+- **Fused decode-attention** (`seqQ == 1`) skips the `repeatKVHeads` concat + SDPA plumbing
+  for a faster decode loop (~1.5×), bit-identical output.
+- **Interleaved RoPE is now traceable**, so Llama/Mistral/GGUF graphs export to StableHLO
+  (and `iree-compile` to a `vmfb`) instead of baking a disconnected constant.
+
+The earlier `transformer-core` extraction (0.31.1) and the Gemma `NATIVE_OPTIMIZED`
+footprint work (0.31.0) still apply.
 
 The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place.
 
 ```kotlin
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.1"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.0"))
 
     // Versions resolved from the BOM:
     implementation("sk.ainet.transformers:skainet-transformers-core")
@@ -199,6 +201,24 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n
 
 See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.
 
+## What's new in 0.32.0
+
+- **Eager `NATIVE_OPTIMIZED` for real-GGUF Llama.** `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED)`
+  now keeps `Q4_K`/`Q6_K` weights packed and runs them through `OptimizedLLMRuntime`, mirroring the
+  Gemma path (new `LlamaQuantLayout` + `LlamaPackedWeights.convertLlamaWeightsPacked`). Output is
+  coherent and matches llama.cpp; fixes the packed token-embedding `gather: unsupported input rank 1`.
+  This is the low-footprint path real-GGUF Llama inference on constrained ARM was missing. (ccbd87e)
+- **Fused decode-attention fast path.** For the decode step (`seqQ == 1`), `MultiHeadAttention` runs
+  scores → softmax → GQA-weighted-V straight from the cached K/V, bypassing the `repeatKVHeads` concat
+  and the `unsqueeze → SDPA → squeeze → permute` chain. ~1.5× decode throughput on the JVM eager path;
+  bit-for-bit-equivalent output. Prefill keeps the general SDPA path. (3791f88)
+- **Traceable interleaved RoPE (graph export).** `RoPE` in `INTERLEAVED` mode (Llama / Mistral / most
+  GGUF) used a raw-array path (`copyToFloatArray` / `fromFloatArray`) that, under graph tracing, recorded
+  the rotated Q/K as a *disconnected constant* — severing them from the projection weights and crashing
+  `iree-compile` downstream. It now records the rotation as tensor ops when tracing (gated on the tracing
+  wrapper; eager keeps the fast raw-array path byte-identical). Unblocks TinyLlama → StableHLO → IREE. (019b049)
+- **Engine pin `skainet 0.31.0 → 0.32.0`.**
+
 ## What's new in 0.31.1
 
 - **`transformer-core` module — NN primitives reusable on all targets incl. `androidNative`.** The
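
For readers unfamiliar with the rotation that the traceable-RoPE entry above refers to, here is a minimal plain-Kotlin sketch of interleaved RoPE applied to one head vector. It assumes `INTERLEAVED` means rotating adjacent pairs `(x[2i], x[2i+1])` with the common base-10000 frequency schedule. The function name `ropeInterleaved` and its parameters are hypothetical and are not the library's `RoPE` API.

```kotlin
import kotlin.math.cos
import kotlin.math.pow
import kotlin.math.sin

/**
 * Hypothetical interleaved RoPE on a single head vector.
 * Pair i is (x[2i], x[2i+1]) and is rotated by the angle
 *   theta_i = position * base^(-2i / d),  where d = x.size.
 * The base of 10000 is an assumed default, not taken from the source.
 */
fun ropeInterleaved(x: FloatArray, position: Int, base: Float = 10000f): FloatArray {
    require(x.size % 2 == 0) { "head dimension must be even" }
    val d = x.size
    val out = FloatArray(d)
    for (i in 0 until d / 2) {
        val theta = position * base.pow(-2f * i / d)
        val c = cos(theta)
        val s = sin(theta)
        val a = x[2 * i]
        val b = x[2 * i + 1]
        // Standard 2D rotation of the pair (a, b) by theta.
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    }
    return out
}
```

Like the eager fast path described in the entry, this sketch does its arithmetic directly on a raw `FloatArray`. A tracer that records tensor ops has nothing to record for such arithmetic, which is consistent with the disconnected-constant failure the entry describes.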

docs/modules/ROOT/pages/tutorials/getting-started-java.adoc

Lines changed: 2 additions & 2 deletions
@@ -25,7 +25,7 @@ In your `build.gradle.kts`:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.1"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.0"))
 
     implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
     implementation("sk.ainet.transformers:skainet-transformers-agent")
@@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts):
 <dependency>
     <groupId>sk.ainet.transformers</groupId>
     <artifactId>skainet-transformers-bom</artifactId>
-    <version>0.31.1</version>
+    <version>0.32.0</version>
     <type>pom</type>
     <scope>import</scope>
 </dependency>

docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@ The pieces you need live in three modules:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.1"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.0"))
 
     implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
     implementation("sk.ainet.transformers:skainet-transformers-agent")

gradle.properties

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 GROUP=sk.ainet.transformers
-VERSION_NAME=0.31.1
+VERSION_NAME=0.32.0
 
 POM_DESCRIPTION=SKaiNET-transformers

gradle/libs.versions.toml

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 [versions]
-skainet = "0.31.0"
+skainet = "0.32.0"
 agp = "9.2.1"
 jacksonDatabind = "2.22.0"
 jsonSchemaValidator = "3.0.5"
