@@ -103,25 +103,27 @@ Honest status — see the project-status note at the top of this README.
103103
104104## Current release
105105
106- The current release is **0.31.1** (against **SKaiNET 0.31.0**). It adds
107- **`transformer-core`**: the framework NN primitives (attention, KV-cache family,
108- embedding, norms, RoPE, FFNs, linear projection) extracted out of `llm-core` so they
109- build on the **full target matrix including `androidNative`** (32-bit + 64-bit ARM);
110- `llm-core` re-exports it, so nothing changes for existing consumers, and ARM-native
111- downstreams (e.g. on-device whisper) can reuse the primitives instead of reimplementing
112- them. The 0.31.0 highlights still apply: the eager `NATIVE_OPTIMIZED` Gemma path keeps the
113- **tied Q8_0 lm_head packed** (paired with SKaiNET 0.31.0's `ops.transpose` fix
114- for all packed dtypes), and `GemmaNetworkLoader.load()` takes an optional
115- `maxInferenceLen` to cap the KV cache for constrained devices, together
116- dropping FunctionGemma-270M's footprint enough to load eagerly on the 1.9 GB
117- Astra Machina SL2610. FunctionGemma (`Q5_K_M`) still decodes byte-identically
118- across the FP32 baseline and both packed paths (`GemmaQ5KPackedParityTest`).
106+ The current release is **0.32.0** (against **SKaiNET 0.32.0**). It brings the
107+ real-GGUF **Llama** eager path up to the Gemma standard and **unblocks StableHLO/IREE
108+ export for Llama-family models**:
109+
110+ - The eager **`NATIVE_OPTIMIZED` path now works for Llama** (`Q4_K`/`Q6_K`): weights stay
111+ packed and `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED) + OptimizedLLMRuntime` decodes
112+ coherently, matching llama.cpp. This fixes the packed token-embedding
113+ `gather: unsupported input rank 1` error.
114+ - **Fused decode-attention** (`seqQ == 1`) skips the `repeatKVHeads` concat + SDPA plumbing
115+ for a faster decode loop (~1.5×) with bit-identical output.
116+ - **Interleaved RoPE is now traceable**, so Llama/Mistral/GGUF graphs export to StableHLO
117+ (and `iree-compile` to a `vmfb`) instead of baking a disconnected constant.
118+
119+ The earlier `transformer-core` extraction (0.31.1) and the Gemma `NATIVE_OPTIMIZED`
120+ footprint work (0.31.0) still apply.
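
To make the fused decode-attention item concrete, here is a minimal plain-Kotlin sketch of single-token attention with grouped-query heads. It is not the library's `MultiHeadAttention` code: the function name `fusedDecodeAttention`, the nested-array layouts, and the `1/sqrt(headDim)` scaling are assumptions chosen for illustration. The point it shows is that each query head can index its shared KV head directly, so no repeated K/V copy has to be built.

```kotlin
import kotlin.math.exp
import kotlin.math.sqrt

/**
 * Hypothetical sketch of decode-step (seqQ == 1) attention with grouped-query heads.
 * Assumed layouts: q is [nHeads][headDim]; kCache and vCache are [nKvHeads][seqLen][headDim].
 */
fun fusedDecodeAttention(
    q: Array<FloatArray>,
    kCache: Array<Array<FloatArray>>,
    vCache: Array<Array<FloatArray>>,
): Array<FloatArray> {
    val nHeads = q.size
    val nKvHeads = kCache.size
    require(nHeads % nKvHeads == 0) { "nHeads must be a multiple of nKvHeads" }
    val group = nHeads / nKvHeads
    val headDim = q[0].size
    val seqLen = kCache[0].size
    val scale = 1.0f / sqrt(headDim.toFloat())

    return Array(nHeads) { h ->
        // Index the shared KV head instead of materializing repeated K/V heads.
        val kv = h / group

        // Scaled dot-product scores against every cached position.
        val scores = FloatArray(seqLen) { t ->
            var dot = 0f
            for (d in 0 until headDim) dot += q[h][d] * kCache[kv][t][d]
            dot * scale
        }

        // Numerically stable softmax over the cached positions.
        var maxScore = Float.NEGATIVE_INFINITY
        for (s in scores) if (s > maxScore) maxScore = s
        var sum = 0f
        for (t in 0 until seqLen) {
            scores[t] = exp(scores[t] - maxScore)
            sum += scores[t]
        }

        // Weighted sum of the cached values for this head.
        val out = FloatArray(headDim)
        for (t in 0 until seqLen) {
            val w = scores[t] / sum
            for (d in 0 until headDim) out[d] += w * vCache[kv][t][d]
        }
        out
    }
}

fun main() {
    val q = arrayOf(floatArrayOf(1f, 0f), floatArrayOf(0f, 1f))          // 2 query heads
    val k = arrayOf(arrayOf(floatArrayOf(1f, 0f), floatArrayOf(0f, 1f))) // 1 KV head, 2 cached positions
    val v = arrayOf(arrayOf(floatArrayOf(1f, 2f), floatArrayOf(3f, 4f)))
    println(fusedDecodeAttention(q, k, v).joinToString { it.joinToString() })
}
```

Both query heads in the usage example read the same KV head, which is the grouped-query case the fast path targets. The sketch makes no claim about matching the library's output bit for bit.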
119121
120122The recommended way to consume the library is via the BOM. It pins every published `skainet-transformers-*` artifact and re-exports the upstream `sk.ainet:skainet-bom`, so the engine-side `sk.ainet.core:skainet-*` artifacts get the matching version too. You only need to declare the BOM version in one place.
121123
122124``` kotlin
123125dependencies {
124- implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.1"))
126+ implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.32.0"))
125127
126128 // Versions resolved from the BOM:
127129 implementation("sk.ainet.transformers:skainet-transformers-core")
@@ -199,6 +201,24 @@ try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ n
199201
200202See `llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java` for a runnable reference.
201203
204+ ## What's new in 0.32.0
205+
206+ - **Eager `NATIVE_OPTIMIZED` for real-GGUF Llama.** `LlamaNetworkLoader.fromGguf(NATIVE_OPTIMIZED)`
207+ now keeps `Q4_K`/`Q6_K` weights packed and runs them through `OptimizedLLMRuntime`, mirroring the
208+ Gemma path (new `LlamaQuantLayout` + `LlamaPackedWeights.convertLlamaWeightsPacked`). Output is
209+ coherent and matches llama.cpp; this fixes the packed token-embedding `gather: unsupported input rank 1` error.
210+ This is the low-footprint path that real-GGUF Llama inference on constrained ARM was missing. (ccbd87e)
211+ - **Fused decode-attention fast path.** For the decode step (`seqQ == 1`), `MultiHeadAttention` runs
212+ scores → softmax → GQA-weighted-V straight from the cached K/V, bypassing the `repeatKVHeads` concat
213+ and the `unsqueeze → SDPA → squeeze → permute` chain. This gives ~1.5× decode throughput on the JVM eager path
214+ with bit-for-bit-equivalent output. Prefill keeps the general SDPA path. (3791f88)
215+ - **Traceable interleaved RoPE (graph export).** `RoPE` in `INTERLEAVED` mode (Llama / Mistral / most
216+ GGUF) used a raw-array path (`copyToFloatArray` / `fromFloatArray`) that, under graph tracing, recorded
217+ the rotated Q/K as a *disconnected constant*, severing them from the projection weights and crashing
218+ `iree-compile` downstream. It now records the rotation as tensor ops when tracing (gated on the tracing
219+ wrapper; eager keeps the fast raw-array path byte-identical). This unblocks TinyLlama → StableHLO → IREE. (019b049)
220+ - **Engine pin `skainet 0.31.0 → 0.32.0`.**
221+
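For readers unfamiliar with the interleaved layout named in the RoPE item above, the rotation itself can be sketched on a plain float array. This is an illustration only, not the library's `RoPE` implementation: the function name `ropeInterleaved`, the default base of 10000, and the per-pair frequency formula are assumptions based on the common RoPE formulation. In the interleaved layout each adjacent pair `(x[2i], x[2i+1])` is rotated by an angle that depends on the token position and the pair index.

```kotlin
import kotlin.math.cos
import kotlin.math.pow
import kotlin.math.sin

/**
 * Hypothetical sketch of interleaved RoPE for one head vector.
 * Rotates each adjacent pair (x[2i], x[2i+1]) by position * base^(-2i/d).
 * The base value is an assumed default, not taken from the library.
 */
fun ropeInterleaved(x: FloatArray, position: Int, base: Float = 10000f): FloatArray {
    require(x.size % 2 == 0) { "head dimension must be even" }
    val d = x.size
    val out = FloatArray(d)
    for (i in 0 until d / 2) {
        // Per-pair rotation angle: lower pair indices rotate faster.
        val theta = position * base.pow(-2f * i / d)
        val c = cos(theta)
        val s = sin(theta)
        val a = x[2 * i]
        val b = x[2 * i + 1]
        // Standard 2D rotation applied to the pair.
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    }
    return out
}

fun main() {
    val x = floatArrayOf(1f, 2f, 3f, 4f)
    println(ropeInterleaved(x, position = 0).joinToString()) // position 0 leaves the vector unchanged
    println(ropeInterleaved(x, position = 5).joinToString())
}
```

A loop over raw arrays like this one is invisible to a graph tracer, which is the failure mode the changelog item describes; expressing the same arithmetic as tensor operations keeps the rotated Q/K connected to their inputs in the exported graph.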
202222## What's new in 0.31.1
203223
204224- **`transformer-core` module: NN primitives reusable on all targets including `androidNative`.** The