SKaiNET-developers
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 33 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 34 additions & 14 deletions
diff --git a/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc b/docs/modules/ROOT/pages/tutorials/getting-started-java.adoc
Lines changed: 2 additions & 2 deletions
diff --git a/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc b/docs/modules/ROOT/pages/tutorials/llama3-tool-calling.adoc
Lines changed: 1 addition & 1 deletion
diff --git a/gradle.properties b/gradle.properties
Lines changed: 1 addition & 1 deletion
diff --git a/gradle/libs.versions.toml b/gradle/libs.versions.toml
Lines changed: 1 addition & 1 deletion
@@ -7,6 +7,39 @@ version line is kept in lock-step with the underlying SKaiNET engine
 The format roughly follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.32.0] — 2026-06-25
+
+Brings the real-GGUF **Llama** eager path up to the Gemma standard (packed
+`NATIVE_OPTIMIZED`) and **unblocks StableHLO/IREE export for Llama-family models**
+(traceable interleaved RoPE). Ships against engine **0.32.0**.
+
+### Added
+
+- **Eager `NATIVE_OPTIMIZED` packed path for Llama.** `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED)`
+  keeps `Q4_K`/`Q6_K` weights packed and runs them through `OptimizedLLMRuntime`, using the new
+  `LlamaQuantLayout` and `LlamaPackedWeights.convertLlamaWeightsPacked` (mirroring
+  `convertGemmaWeightsPacked`). Output is coherent and matches llama.cpp. This is the low-footprint
+  path that real-GGUF Llama inference on constrained ARM was missing. (ccbd87e)
+
+### Changed
+
+- **Fused decode-attention fast path.** `MultiHeadAttention`'s decode step (`seqQ == 1`) now computes
+  scores → softmax → GQA-weighted-V directly from the cached K/V, bypassing the `repeatKVHeads` concat
+  and the `unsqueeze → SDPA → squeeze → permute` chain — ~1.5× decode throughput, bit-identical output.
+  Prefill (`seqLen > 1`) keeps the general SDPA path. (3791f88)
+- **Engine pin `skainet 0.31.0 → 0.32.0`.**
+
+### Fixed
+
+- **Packed token-embedding gather for Llama** — `fromGguf(NATIVE_OPTIMIZED)` no longer fails with
+  `gather: unsupported input rank 1`; the packed embedding is wired through the canonical loader. (ccbd87e)
+- **Interleaved RoPE is now traceable.** In `INTERLEAVED` mode (Llama / Mistral / most GGUF) the rotation
+  used a raw float-array path (`copyToFloatArray` / `fromFloatArray`) that, under graph tracing, baked the
+  rotated Q/K as a *disconnected constant* — severing them from the projection weights and crashing
+  `iree-compile` (null-deref in constant folding) on the exported graph. `RoPE` now records the rotation
+  as tensor ops when running under the tracing wrapper; eager execution keeps the byte-identical raw-array
+  fast path. Unblocks Llama/Mistral/GGUF StableHLO/IREE export. (019b049)
+
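The fused decode step described in the 0.32.0 entry above can be illustrated with a minimal sketch. It is not the `MultiHeadAttention` implementation: the function name `decodeAttention`, the flat `FloatArray` cache layout, and every parameter name are assumptions made for illustration. It only shows the idea of mapping each query head to its shared KV head by index, so no repeated-KV copy is needed when `seqQ == 1`.

```kotlin
import kotlin.math.exp
import kotlin.math.sqrt

/**
 * Hypothetical single-token (seqQ == 1) grouped-query attention.
 *
 * q:      [nHeads * headDim], query of the token being decoded
 * kCache: one array per KV head, each [seqLen * headDim] (assumed layout)
 * vCache: same layout as kCache
 */
fun decodeAttention(
    q: FloatArray,
    kCache: Array<FloatArray>,
    vCache: Array<FloatArray>,
    nHeads: Int,
    headDim: Int,
    seqLen: Int,
): FloatArray {
    val nKvHeads = kCache.size
    require(seqLen > 0) { "cache must hold at least one position" }
    require(nHeads % nKvHeads == 0) { "nHeads must be a multiple of nKvHeads" }
    val groupSize = nHeads / nKvHeads
    val scale = 1.0f / sqrt(headDim.toFloat())
    val out = FloatArray(nHeads * headDim)
    val scores = FloatArray(seqLen)

    for (h in 0 until nHeads) {
        // Query heads in the same group read the same cached K/V by index,
        // which is what makes a repeated-KV concat unnecessary.
        val k = kCache[h / groupSize]
        val v = vCache[h / groupSize]
        val qOff = h * headDim

        // Scaled dot-product scores against every cached position.
        var maxScore = Float.NEGATIVE_INFINITY
        for (t in 0 until seqLen) {
            var dot = 0.0f
            for (d in 0 until headDim) dot += q[qOff + d] * k[t * headDim + d]
            scores[t] = dot * scale
            if (scores[t] > maxScore) maxScore = scores[t]
        }

        // Numerically stable softmax over the cached positions.
        var sum = 0.0f
        for (t in 0 until seqLen) {
            scores[t] = exp(scores[t] - maxScore)
            sum += scores[t]
        }

        // Softmax-weighted sum of the cached values.
        for (t in 0 until seqLen) {
            val w = scores[t] / sum
            for (d in 0 until headDim) out[qOff + d] += w * v[t * headDim + d]
        }
    }
    return out
}
```

A general SDPA path would first materialise K/V once per query head; here the `h / groupSize` lookup replaces that copy. The sketch makes no claim about throughput or about how the real kernel orders its arithmetic.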
 ## [0.31.1] — 2026-06-17
 
 Adds **`transformer-core`** — the framework NN primitives (attention, the KV-cache family, embedding,
 
@@ -103,25 +103,27 @@ Honest status — see the project-status note at the top of this README.
 
 ## Current release
 
-The current release is **0.31.1** (against **SKaiNET 0.31.0**). It adds
-**`transformer-core`** — the framework NN primitives (attention, KV-cache family,
-embedding, norms, RoPE, FFNs, linear projection) extracted out of `llm-core` so they
-build on the **full target matrix including `androidNative`** (32-bit + 64-bit ARM);
-`llm-core` re-exports it, so nothing changes for existing consumers, and ARM-native
-downstreams (e.g. on-device whisper) can reuse the primitives instead of reimplementing
-them. The 0.31.0 highlights still apply: the eager `NATIVE_OPTIMIZED` Gemma path keeps the
-**tied Q8_0 lm_head packed** (paired with SKaiNET 0.31.0's `ops.transpose` fix
-for all packed dtypes), and `GemmaNetworkLoader.load()` takes an optional
-`maxInferenceLen` to cap the KV cache for constrained devices — together
-dropping FunctionGemma-270M's footprint enough to load eagerly on the 1.9 GB
-Astra Machina SL2610. FunctionGemma (`Q5_K_M`) still decodes byte-identically
-across the FP32 baseline and both packed paths (`GemmaQ5KPackedParityTest`).
+The current release is **0.32.0** (against **SKaiNET 0.32.0**). It brings the
+real-GGUF **Llama** eager path up to the Gemma standard and **unblocks StableHLO/IREE
+export for Llama-family models**:
+
+- The eager **`NATIVE_OPTIMIZED` path now works for Llama** (`Q4_K`/`Q6_K`): weights stay
+  packed, and `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED)` with `OptimizedLLMRuntime` decodes
+  coherently, matching llama.cpp. This also fixes the packed token-embedding failure
+  `gather: unsupported input rank 1`.
+- **Fused decode-attention** (`seqQ == 1`) skips the `repeatKVHeads` concat + SDPA plumbing
+  for a faster decode loop (~1.5×), bit-identical output.
+- **Interleaved RoPE is now traceable**, so Llama/Mistral/GGUF graphs export to StableHLO
+  (and compile with `iree-compile` to a `vmfb`) instead of baking a disconnected constant.
+
+The earlier `transformer-core` extraction (0.31.1) and the Gemma `NATIVE_OPTIMIZED`
+footprint work (0.31.0) still apply.
 
 The recommended way to consume is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too — you only need to declare the BOM version in one place.
 
 ```kotlin
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.1"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.0"))
 
     // Versions resolved from the BOM:
     implementation("sk.ainet.transformers:skainet-transformers-core")
@@ -199,6 +201,24 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n
 
 See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.
 
+## What's new in 0.32.0
+
+- **Eager `NATIVE_OPTIMIZED` for real-GGUF Llama.** `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED)`
+  now keeps `Q4_K`/`Q6_K` weights packed and runs them through `OptimizedLLMRuntime`, mirroring the
+  Gemma path (new `LlamaQuantLayout` + `LlamaPackedWeights.convertLlamaWeightsPacked`). Output is
+  coherent and matches llama.cpp; this fixes the packed token-embedding failure `gather: unsupported input rank 1`.
+  This is the low-footprint path that real-GGUF Llama inference on constrained ARM was missing. (ccbd87e)
+- **Fused decode-attention fast path.** For the decode step (`seqQ == 1`), `MultiHeadAttention` runs
+  scores → softmax → GQA-weighted-V straight from the cached K/V, bypassing the `repeatKVHeads` concat
+  and the `unsqueeze → SDPA → squeeze → permute` chain. ~1.5× decode throughput on the JVM eager path;
+  bit-for-bit-equivalent output. Prefill keeps the general SDPA path. (3791f88)
+- **Traceable interleaved RoPE (graph export).** `RoPE` in `INTERLEAVED` mode (Llama / Mistral / most
+  GGUF) used a raw-array path (`copyToFloatArray` / `fromFloatArray`) that, under graph tracing, recorded
+  the rotated Q/K as a *disconnected constant* — severing them from the projection weights and crashing
+  `iree-compile` downstream. It now records the rotation as tensor ops when tracing (gated on the tracing
+  wrapper; eager keeps the fast raw-array path byte-identical). Unblocks TinyLlama → StableHLO → IREE. (019b049)
+- **Engine pin `skainet 0.31.0 → 0.32.0`.**
+
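For readers unfamiliar with the interleaved layout mentioned in the RoPE item above, here is a minimal sketch of the rotation in the eager, raw-array style. It is not the SKaiNET `RoPE` class and does not model the tracing wrapper: the function name `ropeInterleaved`, its parameters, and the default base of 10000 are assumptions for illustration. Interleaved mode rotates adjacent pairs `(x[2i], x[2i+1])`.

```kotlin
import kotlin.math.cos
import kotlin.math.pow
import kotlin.math.sin

/**
 * Hypothetical interleaved RoPE for one head vector at position `pos`.
 * Pair i is rotated by the angle pos * base^(-2i / headDim).
 */
fun ropeInterleaved(x: FloatArray, pos: Int, base: Float = 10000f): FloatArray {
    val headDim = x.size
    require(headDim % 2 == 0) { "head dimension must be even" }
    val out = FloatArray(headDim)
    for (i in 0 until headDim / 2) {
        val theta = pos * base.pow(-2f * i / headDim)
        val c = cos(theta)
        val s = sin(theta)
        val a = x[2 * i]      // first element of the adjacent pair
        val b = x[2 * i + 1]  // second element of the adjacent pair
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    }
    return out
}
```

Because this version reads and writes plain float arrays, a graph tracer sees only its result, not the operations that produced it. That is the kind of path the item above says is now recorded as tensor ops when tracing.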
 ## What's new in 0.31.1
 
 - **`transformer-core` module — NN primitives reusable on all targets incl. `androidNative`.** The
 
@@ -25,7 +25,7 @@ In your `build.gradle.kts`:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.1"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.0"))
 
     implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
     implementation("sk.ainet.transformers:skainet-transformers-agent")
@@ -41,7 +41,7 @@ Or in Maven (Maven needs the `-jvm` classifier suffix on platform artifacts):
     <dependency>
       <groupId>sk.ainet.transformers</groupId>
       <artifactId>skainet-transformers-bom</artifactId>
-      <version>0.31.1</version>
+      <version>0.32.0</version>
       <type>pom</type>
       <scope>import</scope>
     </dependency>
 
@@ -52,7 +52,7 @@ The pieces you need live in three modules:
 [source,kotlin]
 ----
 dependencies {
-    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.1"))
+    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.0"))
 
     implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
     implementation("sk.ainet.transformers:skainet-transformers-agent")
 
@@ -1,5 +1,5 @@
 GROUP=sk.ainet.transformers
-VERSION_NAME=0.31.1
+VERSION_NAME=0.32.0
 
 POM_DESCRIPTION=SKaiNET-transformers
 
 
@@ -1,5 +1,5 @@
 [versions]
-skainet = "0.31.0"
+skainet = "0.32.0"
 agp = "9.2.1"
 jacksonDatabind = "2.22.0"
 jsonSchemaValidator = "3.0.5"