- `TensorDataFactory.placeholder(shape, dtype)` — returns a `TensorData` whose underlying primitive array materializes lazily on first read, instead of allocating a `FloatArray(shape.volume)` eagerly. The default interface implementation falls back to `zeros`, preserving behavior for any custom factory; `DenseTensorDataFactory` overrides with `LazyZeroFloatArrayTensorData` / `LazyZeroIntArrayTensorData`. `ExecutionContext.placeholder(...)` exposes the same path at the `Tensor` level. (PR #588)
- `PosixPreadRandomAccessSource` for Kotlin/Native — new public class in `skainet-io-core`'s `nativeMain` source set wrapping POSIX `pread(2)`. `pread` is positional and atomic, so concurrent reads from different positions are safe without locking. The companion `open(path)` returns `null` on open/stat failure to match the JVM `JvmRandomAccessSource.open(...)` behaviour, letting callers cleanly fall back to the legacy sequential reader if needed. Covers `macosArm64`, `linuxX64`, `linuxArm64`, `iosArm64`, and `iosSimulatorArm64` — every target in the default `nativeMain` source set on this module. 11 `nativeTest` cases pin the contract (size, partial reads, offset/length variants, EOF/argument validation, idempotent close, missing-file null return). (PR #591)
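The `pread(2)` contract the new source builds on — positional reads with no shared file cursor — reduces to something like the following minimal Kotlin/Native sketch (illustrative, not the actual `PosixPreadRandomAccessSource` implementation):

```kotlin
import kotlinx.cinterop.ExperimentalForeignApi
import kotlinx.cinterop.addressOf
import kotlinx.cinterop.convert
import kotlinx.cinterop.usePinned
import platform.posix.errno
import platform.posix.pread

@OptIn(ExperimentalForeignApi::class)
fun preadAt(fd: Int, offset: Long, length: Int): ByteArray {
    val buffer = ByteArray(length)
    val read = buffer.usePinned { pinned ->
        pread(fd, pinned.addressOf(0), length.convert(), offset) // offset is explicit per call
    }
    check(read >= 0) { "pread failed: errno=$errno" }
    // Short read near EOF returns fewer bytes than requested.
    return if (read.toInt() == length) buffer else buffer.copyOf(read.toInt())
}
```

Because every call carries its own offset, two coroutines reading different tensor ranges from the same descriptor never race on a seek position.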
- Kotlin/Native consumers couldn't load GGUFs larger than ~2 GiB — `sk.ainet.io.gguf.createRandomAccessSource(filePath)` on the native target was a placeholder `actual fun … = null`, forcing every K/N caller (`StreamingGGUFReader.open(...)` via the GGUF-specific factory, every `*NetworkLoader.fromGguf(...)` path, `LlamaWeightLoader`) to fall through to the legacy reader, which slurps the entire file into a single `ByteArray`. Kotlin arrays cap at `Int.MAX_VALUE` bytes (~2 GiB), so any GGUF over ~1.9 GiB threw `IllegalStateException: Can't create an array of size 2147483648`. Practical impact: macOS / Linux / iOS native builds couldn't open Q8 models above ~1B parameters or Q4 models above ~3B — the JVM target had no such cap because `JvmRandomAccessSource` was already implemented. The `skainet-io-gguf` factory's native actual now delegates to the new `PosixPreadRandomAccessSource` (see Added above) and returns the same `null` sentinel on open/stat failure, so existing fall-back code paths remain valid. Verified on macOS arm64 against `Qwen3-1.7B-Q8_0.gguf` (~1.8 GiB), which previously OOMed at construction time. (Issue #589, PR #591)
- DSL eagerly allocated zero tensors for every Linear / Conv1d / Conv2d, OOMing real-model loaders — `NetworkBuilder.kt`'s `createLinear`, `DenseImpl`, `Conv1dImpl`, and `Conv2dImpl` paths called `tensorDataFactory.zeros<T, V>(shape, kClass)` eagerly to satisfy each module's constructor whenever the user had not provided initial weights or bias. Downstream loaders always build the network first and only then substitute weights via `WeightMapper.applyWeights`, so the eager zeros were always immediately discarded — but they determined the JVM's peak heap footprint. For `unsloth/Apertus-8B-Instruct-2509-GGUF` (Q4_K_S, 4.7 GB on disk) that was ~27 GB of FP32 zeros allocated and thrown away. Switched every eager-init call site to the new `placeholder(...)` API; the lazy initialization fires only if a caller actually reads the tensor, which never happens on the substitution path because `parameter.value =` swaps the entire `Tensor`. Verified against the real Apertus-8B Q4_K_S GGUF: `ApertusNetworkLoader.fromGguf().load<FP32, Float>(ctx)` now succeeds in a 12 GB heap (previously OOMed at 12 GB), constructing all 35 top-level modules in 13 s. The same fix benefits the Gemma / Llama / Qwen / Voxtral DSL paths transparently. (Issue #587, PR #588)
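The lazy-materialization idea behind `placeholder(...)` boils down to deferring the backing allocation until first read. A minimal sketch (illustrative, not the `LazyZeroFloatArrayTensorData` source):

```kotlin
// The backing array is only allocated if something actually reads the tensor,
// so build-then-substitute loaders never pay for it.
class LazyZeroFloatData(private val volume: Int) {
    private val backing: FloatArray by lazy { FloatArray(volume) } // allocated on first read
    operator fun get(index: Int): Float = backing[index]
}
```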
- `skainet-bom` published at the wrong Maven coordinates — the umbrella BOM was being emitted as `sk.ainet.core:skainet-bom` because the engine-wide `GROUP=sk.ainet.core` from the root `gradle.properties` clobbered the per-module `group = "sk.ainet"` override picked up by `vanniktech.maven.publish`. Downstream BOMs (e.g. `sk.ainet.transformers:skainet-transformers-bom`) import this with `<groupId>sk.ainet</groupId>`, so they were unresolvable from a fresh `mavenCentral()`-only project. The fix uses vanniktech's explicit `mavenPublishing { coordinates("sk.ainet", "skainet-bom", VERSION_NAME) }` so the BOM publishes at `sk.ainet:skainet-bom:0.22.2`. `validate-published-poms.sh` was extended to assert the BOM landed at the expected path so the regression cannot ship again. (Issue #584)
- `GgufModelMetadata` silently dropped `UInt`/`ULong` numeric fields — modern GGUFs (recent llama.cpp converters) store dimensions and counts as `uint32`, which the reader preserves as Kotlin `UInt`. Kotlin's unsigned types do not extend `kotlin.Number`, so the previous private `(value as? Number)?.toInt()` helper returned `null` for every `UInt`/`ULong` field. Result: `contextLength`, `embeddingLength`, `layerCount`, `headCount`, `vocabSize` (fallback), `bosTokenId`, and `eosTokenId` all came back `null` on real-world GGUFs, and downstream loaders fell back to defaults (e.g. `blockCount=0` → zero-layer transformer). The new public file `GgufFieldAccessors.kt` exposes `Map<String, Any?>` extensions (`getInt`/`getLong`/`getString`/`getIntList`/`getStringList`) covering every signed and unsigned integer type the reader can emit, plus the matching primitive arrays for the list variants. `GgufModelMetadata.from()` now routes through these public accessors; the buggy private helpers are deleted. A new `GgufModelMetadataUnsignedTest` pins the contract. Non-breaking — only adds public API and fixes existing methods to return correct values. (Issue #585)
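The root cause in miniature: Kotlin's `UInt`/`ULong` are value classes outside the `kotlin.Number` hierarchy, so an accessor must branch on them explicitly. A hypothetical helper in the spirit of `GgufFieldAccessors.kt` (branch set illustrative):

```kotlin
fun Any?.toIntOrNullFlexible(): Int? = when (this) {
    is Int -> this
    is UInt -> toInt()
    is Long -> toInt()
    is ULong -> toLong().toInt()
    is Number -> toInt() // Byte, Short, Float, Double
    else -> null         // the old (value as? Number)?.toInt() hit this for UInt/ULong
}
```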
- `StreamingShardedSafeTensorsReader.loadTensorStorageMapped` — by-name and by-`ShardedTensorInfo` overloads that mirror the existing single-file `StreamingSafeTensorsReader.loadTensorStorageMapped(tensor, filePath)`. Both return a `TensorStorage` whose `BufferHandle.FileBacked` references the resolved shard file's tensor byte range, enabling zero-copy / memory-mapped reads of tensors that exceed the 2 GB JVM `ByteArray` limit. The new methods delegate internally to the per-shard reader; callers don't need to know which physical shard contains a given tensor. Unblocks downstream consumers (e.g. SKaiNET-transformers' Gemma 4 PLE token-embedding table at ~4.7 GB BF16 on E2B) that previously rolled their own `FileChannel.map`. (PR #582)
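A hedged usage sketch: the method name comes from this entry; the reader variable, tensor name, and the handle accessor are illustrative assumptions.

```kotlin
// Resolving a tensor by name across shards — the reader picks the physical shard.
val storage = shardedReader.loadTensorStorageMapped("model.embed_tokens.weight")
val mapped = storage.bufferHandle as BufferHandle.FileBacked // mmap-backed byte range in the resolved shard
```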
This release closes milestone M5 of the JVM inference performance roadmap with a priority-100 native kernel provider that wraps a bundled C shared library via Java's Foreign Function & Memory API. It plugs into the existing `KernelProvider` SPI, so `KernelRegistry.bestAvailable()` automatically routes Q4_K and FP32 matmul through native code when the library loads, falling back cleanly to the priority-50 Panama Vector kernels otherwise.
- `skainet-backend-native-cpu` module — new JVM-only KMP module wrapping a CMake-built shared library (`libskainet_kernels.{so,dylib,dll}`). Bundled into the JAR resources at `native/<os>-<arch>/`, extracted at runtime to a process-scoped temp dir, loaded via `System.load`, and accessed via `Linker.nativeLinker().downcallHandle(...)`. ServiceLoader auto-registers `NativeKernelProviderFactory` via `META-INF/services/sk.ainet.backend.api.kernel.KernelProvider`. (PR #571)
- Native Q4_K matmul — single-source scalar C kernel (`-O3 -ffast-math -funroll-loops`); the inner 32-iteration loop auto-vectorizes cleanly into `vfmadd231ps` (AVX2) / `fmla` (NEON). Mirrors `PanamaVectorQ4KMatmulKernel` byte-for-byte on the canonical ggml super-block layout (256 elements / 144 bytes, FP16 d/dMin, 12-byte `get_scale_min_k4` packed sub-scales, 128 bytes of strided 4-bit codes, lazy-`dmin` accumulation). Microbench (Linux x86_64, JDK 21.0.10): 5.87× / 4.71× / 4.17× faster than Panama Vector at 1024² / 2048² / 4096² Q4_K matmul shapes — single-threaded native beats Panama's multi-threaded `parallelChunks` path on every measured shape. Numerical parity with Panama within `1e-4` relative tolerance. (PR #572)
- `Q4KMemSegMatmulKernel` SPI sibling + zero-copy native variant — JVM-only sibling kernel interface in `skainet-backend-api`'s `jvmMain` taking weights as `MemorySegment` instead of `ByteArray`, plus a JVM-only `MemSegKernelProvider` provider interface that providers can implement alongside `KernelProvider` for the smart-cast lookup pattern at the call site. Reuses the same C symbol as the heap-input kernel — the bytes just don't round-trip through the JVM heap. +20% wall-clock at 4096² vs the heap-copy path (9 MB of weight transfer eliminated); noise-level at smaller shapes. Bit-identical output to the heap variant. (PR #573)
- Cross-arch CI matrix — new `.github/workflows/native-cpu-multiarch.yml` builds and tests the native module on `ubuntu-latest`, `macos-14` (Apple Silicon), and `windows-latest` for every push/PR that touches the native module. Catches portability regressions (linker, alignment, compiler-specific syntax) at PR time rather than after release. C portability tightened: the `SKAINET_RESTRICT` macro maps to `__restrict__` on GCC/Clang and `__restrict` on MSVC; CMake grows an MSVC compile-flag branch (`/O2 /fp:fast /W3`) alongside the existing GCC/Clang one. Linux ARM64 was attempted, but Kotlin/Native plugin 2.3.21 doesn't support Linux aarch64 as a HOST target ("Unknown host target") — left out for now. (PRs #574, #577)
- Native FP32 SGEMM — row-major `C(m,n) = A(m,k) * B(k,n)` with stride support, using i-p-j outer-product order so the inner `c[j] += a*b[j]` loop streams two contiguous arrays and auto-vectorizes into FMA. Wired into the existing `matmulFp32()` SPI accessor. Microbench at 256³ / 512³ / 1024³: 1.77× / 1.58× / 1.55× faster than `PanamaVectorMatmulKernel`. The narrower margin vs Q4_K reflects Panama's already-polished FP32 path (tile-blocking + B-pack + `parallelChunks`); native still wins on every measured shape. Numerical parity within `1e-5 * k` relative tolerance. (PR #575)
- Multi-arch fat JAR publishing — `.github/workflows/publish.yml` extended to a two-phase flow: a matrix `build-native` job builds `libskainet_kernels` on each supported host (linux-x86_64, macos-arm64, windows-x86_64), and the `publish` job downloads all three artifacts, stages them into the native module's resources tree, and publishes with every supported arch bundled. Consumers on any of the three arches get a working native path out of the box — no manual side-loading.
- `skainet-backend-native-cpu` registered in BOM — `skainet-bom` now constrains the new module alongside `skainet-backend-api` and `skainet-backend-cpu`. Consumers depending on the BOM get a constrained version without a separate pin. (PR #576)
- Publishing config wired — `vanniktech.mavenPublish` plugin + per-module `gradle.properties` (`POM_ARTIFACT_ID` + `POM_NAME`) on the new module. Composite-build consumers (e.g. SKaiNET-transformers via `includeBuild`) substitute the published coordinates with the local project ref through the same path every other SKaiNET module uses. (PR #576)
- `NativeKernelProvider` consumption KDoc — covers two gotchas downstream consumers hit on first wiring: (1) the module is JVM-only (FFM has no Native/JS/Wasm equivalent), so KMP consumers must add the dependency to `jvmMain.dependencies`, never `commonMain`; (2) `com.gradleup.shadow:9.4.x` `mergeServiceFiles()` silently drops the `NativeKernelProviderFactory` entry when both `skainet-backend-cpu` and `skainet-backend-native-cpu` are on a shadow JAR's classpath — with a workaround pointer to the `kllama-cli` `doLast` fix in SKaiNET-transformers PR #88. (PR #579)
- `docs/.../perf/native-ffm-plan.adoc` — design baseline for the native FFM provider (recovered from the 0.21.0-cycle PRD that was dropped from the repo root and rehomed as AsciiDoc). Documents module layout, FFM binding pattern, staged delivery, success metrics, and risks.
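The FFM binding pattern described above, reduced to its essentials with the standard `java.lang.foreign` API. The exported symbol name and C signature below are illustrative assumptions, not the module's actual binding code:

```kotlin
import java.lang.foreign.FunctionDescriptor
import java.lang.foreign.Linker
import java.lang.foreign.SymbolLookup
import java.lang.foreign.ValueLayout
import java.lang.invoke.MethodHandle

fun bindMatmul(extractedLibPath: String): MethodHandle {
    System.load(extractedLibPath) // lib extracted from JAR resources to a temp dir
    val symbol = SymbolLookup.loaderLookup().find("skainet_matmul_f32").orElseThrow()
    // Assumed C signature:
    // void skainet_matmul_f32(const float* a, const float* b, float* c, int m, int n, int k)
    val descriptor = FunctionDescriptor.ofVoid(
        ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.ADDRESS,
        ValueLayout.JAVA_INT, ValueLayout.JAVA_INT, ValueLayout.JAVA_INT,
    )
    return Linker.nativeLinker().downcallHandle(symbol, descriptor)
}
```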
- Linux ARM64 native lib is not in the published JAR. Kotlin/Native plugin 2.3.21 doesn't support Linux aarch64 as a HOST target on the runners GitHub provides, so the cross-arch CI matrix excludes it. Linux ARM64 consumers (Raspberry Pi, AWS Graviton) cleanly fall back to the priority-50 Panama Vector provider — no functional regression, just no native speedup. Re-add when either the Kotlin/Native plugin gains the host or a self-hosted ARM64 runner is wired in.
- Shadow-JAR consumers using `com.gradleup.shadow:9.4.x` still need a `doLast` workaround to merge the `META-INF/services/sk.ainet.backend.api.kernel.KernelProvider` entries — see SKaiNET-transformers PR #88's `kllama-cli`/`skainet-cli` fix for the canonical implementation. Spring Boot apps consuming via Maven (`BOOT-INF/lib/`) are unaffected.
This release lands the JVM Vector half of milestone M5 from the JVM inference performance roadmap — a pluggable kernel SPI parallel to `BackendProvider`, plus a Panama Vector provider that matches or beats the prior production path on every shape we measure. A native (FFM) priority-100 provider closing the milestone metric is deferred.
- `KernelProvider` SPI — `skainet-backend-api` now exposes a `KernelProvider` interface with `name`, `priority`, `isAvailable()`, and per-kernel accessors (`matmulFp32()`, `matmulQ4K()`). `KernelRegistry` does a priority-ordered `bestAvailable()` lookup; a JVM-only `KernelServiceLoader.installAll()` auto-discovers providers via `META-INF/services/sk.ainet.backend.api.kernel.KernelProvider`. Manual `register(...)` still works for tests and non-JVM platforms. (PRs #554, #559)
- `Fp32MatmulKernel` + `PanamaVectorMatmulKernel` — JDK Vector API implementation using `FloatVector.SPECIES_PREFERRED` + `fma` + `reduceLanes`, cache-blocked with 8×8×128 tiles. `KernelMatmulBench` measures 8.61× / 8.62× / 10.83× speedup over scalar at 256/512/1024 (JDK 21.0.10, M-series macOS). Within JMH noise of — and often slightly faster than — the prior `JvmVectorKernels.matmulFloatBlocked` production path, so the routing introduced no regression. (PRs #557, #558, #560)
- Production matmul routes through `KernelRegistry` — `DefaultCpuOpsJvm.matmul` now resolves the FP32 kernel via `KernelRegistry.bestAvailable()` instead of calling `JvmVectorKernels.matmulFloat*` directly. Production `MatmulBench` numbers post-routing match pre-routing within JMH noise. (PR #561)
- `Q4KMatmulKernel` SPI + SIMD-fused Panama implementation — sibling kernel interface in `skainet-backend-api`'s `commonMain`, with a `KernelProvider.matmulQ4K()` accessor (default `null` for backwards compat). `PanamaVectorQ4KMatmulKernel` fuses Q4_K dequant inline with the FMA accumulator: a single `ByteVector` load feeds both lo and hi sub-block accumulators per qs slab via AND/LSHR nibble extract → `castShape` (B2F) → FMA, with the lazy-`dmin` correction (`acc += scale·codeSum − offset·inputSum` once per sub-block). `QuantizedMatmulBench` measures 0.07 / 0.15 / 0.46 ms at 1024×1024 / 4096×1024 / 4096×4096 (≈30/55/73 GFLOPS — the same throughput regime as the FP32 SIMD kernel, meaning fused dequant adds essentially zero cost on top of the FMA). `DefaultCpuOpsJvm.chooseQuantizedMatmul`'s `Q4_KTensorData` branch routes through the SPI with a fall-through to the legacy kernel when no provider resolves. (PR #562)
- Q4_K MemSeg SIMD — the same fused-pipeline algorithm applied inline to `JvmQuantizedVectorKernels.matmulF32Q4_KMemSeg` (the path mmap'd weights take). `ByteVector.fromMemorySegment` instead of `ByteVector.fromArray` — no heap copy. (PR #563)
- Q6_K SIMD dequant — `dequantQ6_KBlock` replaces its scalar 32-iteration loop with a `ByteVector`-based ql + qh extraction pipeline: per `floatStep`-wide chunk of `l`, it loads ql + qh slices, assembles `q1..q4 = (ql nibble) | ((qh slice) << 4) − 32` per lane, multiplies by the per-sub-block `d·scale`, and stores to four 32-element regions of the scratch FloatArray. (PR #564)
- Q4_0 partial SIMD — `dotQ4_0BlockMemSeg` two-stage pattern: scalar byte-pair unpack into a caller-supplied scratch FloatArray (16 byte loads, two nibbles each — half the byte traffic) followed by a `FloatVector` FMA reduction. Closes the last fully-scalar quantized kernel; every quantized format in `JvmQuantizedVectorKernels` (Q4_0, Q4_K, Q4_K MemSeg, Q6_K, Q8_0) is now SIMD'd to some degree. (PR #565)
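For readers new to the Panama Vector API, the `SPECIES_PREFERRED` + `fma` + `reduceLanes` pattern named above looks like this in isolation (a generic SIMD dot product, not the kernel's tiled production code; requires `--add-modules jdk.incubator.vector`):

```kotlin
import jdk.incubator.vector.FloatVector
import jdk.incubator.vector.VectorOperators

private val SPECIES = FloatVector.SPECIES_PREFERRED

fun dot(a: FloatArray, b: FloatArray): Float {
    var acc = FloatVector.zero(SPECIES)
    var i = 0
    val upper = SPECIES.loopBound(a.size)
    while (i < upper) {
        val va = FloatVector.fromArray(SPECIES, a, i)
        val vb = FloatVector.fromArray(SPECIES, b, i)
        acc = va.fma(vb, acc)          // fused multiply-add per lane
        i += SPECIES.length()
    }
    var sum = acc.reduceLanes(VectorOperators.ADD)
    while (i < a.size) { sum += a[i] * b[i]; i++ } // scalar tail
    return sum
}
```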
- `ScratchPool` SPI — Runtime workspace allocation for transient tensor scratch buffers. Per-runtime size-classed slabs, scoped acquire/release. Closes the framework-side primitive for milestone M1 of the JVM perf roadmap. (PR #550)
- `TensorOps.permute(axes)` — Arbitrary-axis permutation (generalizes the existing `transpose` to N-D). (PR #552)
- Q4_K / Q5_K canonical ggml layout + FP32 MemSeg arena leak — `Q4_KTensorData` and Q5_K dequant now apply the canonical ggml layout (super-block scale + per-sub-block scaleIdx/minIdx via `get_scale_min_k4` mixing, strided 4-bit code layout). `MemorySegmentTensorDataFactory` uses `Arena.ofAuto()` for per-op outputs so the matmul / transpose output segments are GC-reclaimable; the prior `ofConfined()` builds leaked tens of MB per matmul, which over a 35-layer Gemma 4 forward pass exhausted the JVM direct-memory cap. Also adds liveness-based freeing of intermediate tensors in `ComputeGraphExecutor`. (PR #556)
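The ownership distinction behind the leak, in standard FFM terms:

```kotlin
import java.lang.foreign.Arena

fun arenaOwnershipDemo() {
    // Arena.ofAuto() segments are reclaimed by the GC once unreachable —
    // safe for per-op outputs whose lifetime nobody tracks explicitly.
    val auto = Arena.ofAuto()
    val output = auto.allocate(1L shl 20)
    // Arena.ofConfined() segments live until close() is called; an arena
    // opened per matmul and never closed is exactly the described leak.
    Arena.ofConfined().use { scoped ->
        val temp = scoped.allocate(1L shl 20) // freed deterministically at close()
    }
}
```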
- Q6_K Native Matmul: New `Q6_KTensorData`/`Q6_KBlockTensorData` in `skainet-lang-core` stores 210-byte ggml Q6_K blocks verbatim (128 `ql` + 64 `qh` + 16 scales + 2-byte f16 `d`), row-major by default, with a `dequantizeBlock` path matching the `DequantOps` reference line-for-line. `DefaultCpuOpsJvm.chooseQuantizedMatmul` dispatches to a new `JvmQuantizedVectorKernels.matmulQ6_KVec` SIMD kernel (Java Vector API, same `floatSpecies` as the Q4_K / Q8_0 kernels) using a dequant-one-block-to-scratch-then-SIMD-dot pattern. New `TensorEncoding.Q6_K` variant. Unblocks running Gemma 4 E2B Q4_K_M (and any mostly-Q4_K + Q6_K checkpoint) through the DSL path without a ~12 GB FP32 dequant blow-up at load.
- Q4_K Lazy Shape-Swap Transpose: `DefaultCpuOpsJvm.transpose(Q4_KTensorData)` now returns a new `Q4_KBlockTensorData` wrapping the same packed byte array with a swapped shape — mirroring the existing Q4/Q8 MemorySegment lazy-transpose path. `matmulQ4_KVec`'s input-block-major layout produces correct values under the swapped shape without any physical data reordering, so `linearProject(x, W)` can run `matmul(x, transpose(Q4_K_W))` without round-tripping through FP32. Validated at the DSL level by `GemmaDslQ4KTest` in the transformers repo (Δ logits = 4.29e-6 vs the FP32 baseline).
- Q6_K Lazy Transpose: The same shape-swap specialization extended to `Q6_KTensorData`, enabling the same DSL path for Q6_K weights.
- Lazy-Transpose Invariant Tests: New `QuantizedMemSegMatmulTest` cases pin the two load-bearing properties of the Q4_K and Q6_K transpose specializations — (1) the shape is swapped; (2) `packedData` is the SAME byte-array reference, not a copy — so the path cannot silently regress to the generic element-wise transpose (which would `ClassCastException` on packed nibbles).
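In `kotlin.test` terms, the pinned invariants look roughly like this (the type and field names are taken from this entry; how the tensors are produced is assumed):

```kotlin
import kotlin.test.assertContentEquals
import kotlin.test.assertSame

fun checkLazyTransposeInvariants(original: Q4_KBlockTensorData, transposed: Q4_KBlockTensorData) {
    assertContentEquals(original.shape.reversedArray(), transposed.shape) // (1) 2-D shape swapped
    assertSame(original.packedData, transposed.packedData)               // (2) same byte array — zero-copy
}
```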
- SDPA Recording + StableHLO Emission: `scaledDotProductAttention` is now recorded by `RecordingExecution` (it was silently delegating without recording, like `conv1d` before #532) and lowered to StableHLO by `NeuralNetOperationsConverter`. The decomposition is `dot_general(Q, K.T)` (batching dims `[0,1]`, contracting dims `[3]×[3]`) → scale → optional mask → softmax (max-subtract-exp-sum-div) → `dot_general(weights, V)` (contracting dims `[3]×[2]`). New `ScaledDotProductAttentionOperation` in `TensorOperations` with output-shape inference (output shape = query shape). New `SdpaHloExportTest` verifies tape → graph → MLIR with `dot_general`; `TapeAttentionPermuteBugTest` pins a regression around raw array permute producing zero constants. Also fixes the input-type annotation in `ShapeOperationsConverter.concatenate`. (#543)
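The softmax stage of that decomposition, written out as plain Kotlin for one row of attention scores (an illustrative reference, not the converter's emitted MLIR):

```kotlin
fun softmaxRow(scores: FloatArray): FloatArray {
    val max = scores.max()               // max…
    val exps = FloatArray(scores.size) { // …subtract, exp…
        kotlin.math.exp((scores[it] - max).toDouble()).toFloat()
    }
    val sum = exps.sum()                 // …sum…
    return FloatArray(exps.size) { exps[it] / sum } // …div
}
```

Subtracting the row max before exponentiating is what keeps the lowering numerically stable for large logits.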
- SDPA Q/K/V Shape Validation: `scaledDotProductAttention` previously required only rank-4 inputs, so a mismatch in `head_dim` (e.g. Q=512 vs K=256, as seen in real Gemma 4 E2B, where mixed-head-dim layers share a KV cache) surfaced as an `ArrayIndexOutOfBoundsException` buried 2000+ lines deep in the dot-product loop. Added `require()` preconditions on matching batch, head count, Q/K head_dim, Q/V head_dim, and K/V `seqKV`, each with a message naming the offending dimensions. New `SDPAShapeValidationTest` (5 cases, `commonTest`) pins the contract.
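Illustrative preconditions in the spirit of those checks, operating on rank-4 `[batch, heads, seq, headDim]` shapes (parameter handling is an assumption, not SKaiNET's exact signature):

```kotlin
fun validateSdpaShapes(q: IntArray, k: IntArray, v: IntArray) {
    require(q[0] == k[0] && q[0] == v[0]) { "SDPA batch mismatch: q=${q[0]}, k=${k[0]}, v=${v[0]}" }
    require(q[1] == k[1] && q[1] == v[1]) { "SDPA head-count mismatch: q=${q[1]}, k=${k[1]}, v=${v[1]}" }
    require(q[3] == k[3]) { "SDPA Q/K head_dim mismatch: q=${q[3]}, k=${k[3]}" }
    require(k[2] == v[2]) { "SDPA K/V seqKV mismatch: k=${k[2]}, v=${v[2]}" }
}
```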
- Kotlin: 2.3.20 → 2.3.21 (including JVM toolchain and `plugin.serialization`).
- Android Gradle Plugin: 9.1.1 → 9.2.0.
- io.ktor:ktor-client-core: 3.4.2 → 3.4.3.
- Broken POM for `skainet-backend-cpu`: The 0.19.0 POM for `sk.ainet.core:skainet-backend-cpu-*` declared a runtime dependency on `sk.ainet:skainet-backend-api-jvm:unspecified` — wrong group coordinate and no valid version — because `skainet-backend-api` was not configured to publish and the root `allprojects { group = "sk.ainet" }` disagreed with the `GROUP=sk.ainet.core` used by vanniktech's maven publish plugin. Consumers pulling 0.19.0 hit unresolved-dependency errors. Fixed by:
  - Applying `vanniktech.mavenPublish` and setting `POM_ARTIFACT_ID=skainet-backend-api` on `skainet-backend-api` so it is actually published alongside the BOM entry that already referenced it.
  - Aligning `allprojects { group = "sk.ainet.core" }` with the `GROUP` property and pinning `version` from `VERSION_NAME` so `project(...)` coordinates in generated POMs are consistent.
- CI guard: New `verify-published-poms` job publishes to the local Maven repository and fails the build if any generated `.pom` contains `<version>unspecified</version>` or references a project-local group outside `sk.ainet.core`, preventing a regression of this class of coordinate bug.
- Qwen / GPT-2 Byte-Level BPE Tokenizer: `QwenByteLevelBpeTokenizer` implements the full GPT-2-style pipeline — byte-to-unicode mapping, the GPT-2 pretokenization regex, merge-rank BPE, and atomic special-token splitting. Builds from either GGUF metadata (`fromGgufFields`) or a HuggingFace `tokenizer.json` (`fromTokenizerJson`). Verified against Qwen2.5-0.5B reference token IDs from HuggingFace `transformers`. (#463)
- LLaMA / SentencePiece Tokenizer: `SentencePieceTokenizer` implements the llama.cpp SPM pipeline — whitespace escape (▁), code-point symbol split, score-priority BPE (the SPM rule, opposite of the merge-rank rule used for GPT-2 BPE), and `<0xNN>` byte fallback for unknown characters. Builds from GGUF (`tokenizer.ggml.model == "llama"`) and HuggingFace `tokenizer.json` (`model.type == "Unigram"`). Verified against TinyLlama-1.1B reference token IDs from HuggingFace `transformers`. (#464)
- `TokenizerFactory` with Per-Architecture Dispatch: Tokenizer selection is now per-architecture, not per file format. `TokenizerFactory.fromGguf(fields)` and `.fromTokenizerJson(json)` inspect `tokenizer.ggml.model` / `model.type` and dispatch to the right implementation — Qwen/GPT-2 → byte-level BPE, LLaMA/Gemma/TinyLlama → SentencePiece — regardless of whether weights come from GGUF or SafeTensors. (#463)
- `Tokenizer` Interface: Common surface implemented by `TekkenTokenizer`, `QwenByteLevelBpeTokenizer`, and `SentencePieceTokenizer` (`encode`, `decode`, `vocabSize`, `bosTokenId`, `eosTokenId`).
- GGUF Tokenizer Metadata: `GgufModelMetadata` now exposes `tokenizerModel`, `tokenizerTokens`, `tokenizerMerges`, `tokenizerTokenTypes`, `bosTokenId`, and `eosTokenId` so callers can build a tokenizer without re-parsing the raw field map.
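The byte-to-unicode step that makes byte-level BPE work is a small, well-known table — here as a Kotlin port of the algorithm from OpenAI's GPT-2 `encoder.py` (illustrative, not `QwenByteLevelBpeTokenizer`'s exact code). Printable bytes map to themselves; the remaining 68 bytes are remapped to code points ≥ 256 so every byte has a visible, unambiguous character:

```kotlin
fun byteToUnicode(): Map<Int, Char> {
    val bytes = (('!'.code..'~'.code) + ('¡'.code..'¬'.code) + ('®'.code..'ÿ'.code)).toMutableList()
    val chars = bytes.map { it.toChar() }.toMutableList()
    var n = 0
    for (b in 0..255) {
        if (b !in bytes) {           // non-printable byte: remap past the Latin-1 range
            bytes.add(b)
            chars.add((256 + n).toChar())
            n++
        }
    }
    return bytes.zip(chars).toMap()
}
```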
- Whisper Encoder E2E: Whisper encoder now compiles end-to-end via SKaiNET → StableHLO → IREE.
- Real StableHLO Lowerings: `softmax`, `layerNorm`, and `rmsnorm` now lower to real StableHLO ops (reductions, `broadcast_in_dim`, standard ops) instead of `custom_call` stubs. (#467, #479, #480)
- New Op Converters: `gather`/embedding and `concat`/`slice`/`cast` StableHLO converters. (#483, #489)
- Activation Alias: `silu`/`SiLU` registered as an alias for `swish` in `ActivationOperationsConverter`. (#484)
- `ConstantMaterializationPolicy`: Seam for externalizing large weight tensors out of the StableHLO module (enables `.irpa` externalization). (#524)
- Splat Constant Folding: Uniform-value tensor constants collapsed to a `dense<v>` splat instead of fully materialized arrays. (#522)
- SSA Value Type Tracking: Tracks SSA value types so `reshape` emits the operand's declared type, producing valid MLIR. (#521)
- Tensor Encoding in Output: `tensor_encoding` comments in StableHLO output and a top-level `skainet.tensor_encodings` module attribute. (#473, #477)
- `skainet-io-iree-params` Module: New module with `IrpaWriter` for writing IREE Parameter Archive (`.irpa`) files. Accepts `FileBacked` handles via mmap on JVM / Android for zero-copy weight export. (#523, #525, #528, #529)
- `skainet-backend-api` Module: New module cleanly separating backend contracts; the CPU backend now depends on it. (#468)
- `TensorEncoding` Metadata: Accessor for `TensorSpec.metadata` and propagation through `TraceToGraphBuilder.finalize`, keeping quantization encoding visible end-to-end. (#469)
- Annotated `StableHloConverterFactory` and `TokenizerFactory` for idiomatic Java call sites. (#400)
- Renamed the `TensorSpecEncoding.kt` class for Java callers. (#400)
- Added `skainet-backend-api` to the BOM. (#400)
- New `ReleaseApiJavaTest` covering the 0.19.0 Java surface. (#400)
- Antora + Diátaxis: Migrated docs to Antora with Divio / Diátaxis layout (tutorials, how-tos, reference, explanation). (#494)
- `skainet-docs-ui` v1.1.1: Adopted the new theme with a Diátaxis card-grid landing page. (#501)
- Operator Coverage Matrix: Emits a cross-backend Operator Coverage Matrix generated from a `TensorOps` surface scan. (#494, #511)
- Ops Docs: KDoc `@param` extraction, real version stamps, LaTeX rendering, fixed partials, and dropped the void backend. (#511, #513)
- Dokka API Bundle: Wired into the Antora site build. (#494)
- Local Mermaid: Dropped kroki; Mermaid now renders locally via `mmdc`. (#496)
- `androidNativeArm32`: Added across core modules. (#503)
- Byte-Level BPE Broken for Qwen/GPT-2 Models: Previously there was no GPT-2-style byte-level BPE tokenizer in the repo, and `GgufModelMetadata` ignored `tokenizer.ggml.merges` entirely — so any Qwen / GPT-2 / Mistral-Nemo model encoded text into garbage tokens (byte-level chars instead of merged vocab IDs), blocking chat mode and tool calling. The new `QwenByteLevelBpeTokenizer` + `TokenizerFactory` dispatch fix the issue for both GGUF and SafeTensors sources. (#463)
- No SentencePiece Path for LLaMA-Family GGUF Models: `TokenizerFactory` previously threw `UnsupportedTokenizerException` for `tokenizer.ggml.model == "llama"`, leaving LLaMA / TinyLlama / Gemma / Mistral-v0.1 GGUFs untokenizable. The new `SentencePieceTokenizer` closes that gap. (#464)
- GGUF UInt Fields Silently Dropped: GGUF UINT32 fields (e.g. `tokenizer.ggml.bos_token_id`) arrive from `StreamingGGUFReader` as `kotlin.UInt`, which is a value class — not a subclass of `kotlin.Number` — so a plain `as? Number` cast was returning null. The new `toIntFlexible` helper handles every signed and unsigned numeric type GGUF can produce, restoring the BOS/EOS/UNK ids on the tokenizer builders.
- Graph Conv Output Shape Inference: `conv1d`/`conv2d`/`conv3d` operations in graph inference previously produced placeholder output shapes, breaking downstream shape-dependent passes. Graph ops now compute real output shapes. (#536, #537)
- Conv1d/Conv3d Not Recorded: `conv1d` and `conv3d` were not routed through the recording decorator, so they disappeared from traced computation graphs. (#532, #533)
- Static Conv1d HLO Shape Crash: Conv1d StableHLO lowering crashed when trace attributes were missing; it now falls back to the `TensorRef` shape / dtype. (#530, #531)
- Flatten Hardcoded to MNIST Shape: `NetworkBuilder.flatten()` returned a hardcoded `lastDimension = 1568` (the MNIST CNN value); any other architecture — e.g. a 64-channel CNN over 32×32 inputs — crashed with `ArrayIndexOutOfBoundsException` in the following `dense()` layer. The DSL now tracks per-sample shape through a new `input(IntArray)` overload, `conv1d`/`conv2d`/`conv3d`, `maxPool2d`, `avgPool2d`, and `upsample2d`, reusing the `ConvShapeUtils` arithmetic introduced in #537; `flatten()` reads the tracked shape and honors `startDim`/`endDim`, and `Conv*` layers can auto-infer `inChannels` from the declared input. (#535, #538)
- StableHLO `transpose`/`dot_general` MLIR Emission: Fixed malformed MLIR produced by `stablehlo.transpose` and `stablehlo.dot_general` that blocked IREE compilation. (#520)
- WasmJS / JS / Native Compile: Replaced JVM-only `putIfAbsent` with a common-stdlib idiom. (#485)
- Antora Container: `HOME=/tmp` so Chromium crashpad can launch during Mermaid rendering in CI. (#534)
- `bundleDokkaIntoSite` CI Permission Failure: Fixed docs pipeline permission error. (#496)
- Pandoc Artifacts in Docs: Stripped pandoc anchors and demoted heading levels in migrated pages. (#496)
- `compile-hlo` Dependencies: Dropped the vestigial `skainet-backend-cpu` dependency from `compile-hlo`'s `jvmMain`. (#472)
- Moved-LLM Docs: Replaced relocated LLM pages with redirect stubs pointing at the standalone repo. (#499)
- Maven Group / Version Refs: Bumped stale version references and fixed Maven group coordinates. (#499)
- Stale `TURBOQUANT_ISSUES.md` tracker at the repo root. (#490)
- agp: 9.1.0 → 9.1.1.
- com.networknt:json-schema-validator: 3.0.1 → 3.0.2.
- org.jetbrains.kotlinx:kotlinx-serialization-json: bumped to 1.11.0.
- actions/checkout: 4 → 6.
- actions/upload-pages-artifact: 3 → 5.
- actions/cache: 4 → 5.
- actions/setup-java: 4 → 5.
- actions/deploy-pages: 4 → 5.
- actions/github-script: 8 → 9.
- docker/build-push-action: 5 → 7.
- docker/setup-buildx-action: 3 → 4.
- TurboQuant KV-Cache Compression: Runtime KV-cache compression for LLM inference using rotation-based quantization (Google Research TurboQuant paper). Supports PolarOnly and PolarPlusQjl variants with 2/3/4/8-bit encoding.
  - `TurboQuantCodec`: End-to-end encode/decode pipeline (random rotation, scalar quantization, QJL residual, bit-packing).
  - `TurboQuantKvCacheStore`: Compressed KV cache with per-head TurboQuant blocks and asymmetric K/V policies.
  - `TurboQuantPresets`: Named presets — `safe-lowbit` (Q8_0-K + TQ4-V), `balanced` (TQ4/TQ4), `experimental-max` (TQ3/TQ3).
  - `KvCacheStore.turboQuant("balanced", ...)`: One-line factory for skainet-transformers integration.
  - `CompressedKvAttention`: SDPA bridge with FULL_TILE and RAW_STORAGE dequant strategies.
  - `@KvCache` and `@KvCacheBypass` DSL annotations for declarative KV cache configuration.
  - `KvCacheAnnotationResolver`: Resolves annotations to cache instances.
  - `TurboQuantUsage`: Documented integration guide with compilable examples.
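A hedged usage sketch of the one-line factory; further configuration arguments are elided, and `CompressedKvAttention`'s constructor shape is an assumption — see `TurboQuantUsage` for the real wiring:

```kotlin
val kvCache = KvCacheStore.turboQuant("balanced") // TQ4 keys / TQ4 values
val attention = CompressedKvAttention(kvCache)    // SDPA bridge with dequant strategies
```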
- Memory Architecture Hardening: First-class storage and placement abstractions for zero-copy, quantization-preserving tensor management.
  - `TensorStorage`: Runtime descriptor replacing ad-hoc array passing (logical type, physical encoding, buffer ownership, placement).
  - `TensorEncoding`: Sealed hierarchy — `Dense`, `Q4_K`, `Q8_0`, `TernaryPacked`, `TurboQuantPolar`, `TurboQuantPolarQjl`, `Opaque`.
  - `BufferHandle`: Five ownership modes — `Owned`, `Borrowed`, `Aliased`, `FileBacked`, `DeviceResident`.
  - `Placement`: Device/memory-domain intent with fallback policies (`CPU_HEAP`, `MMAP_WEIGHTS`, `GPU_PREFERRED`).
  - `LogicalDType`: Semantic numeric types separate from physical encoding.
  - `PackedBlockStorage`: Unified contract for all packed quantized formats.
  - `MemoryPlanner`, `MemoryTracker`, `ActiveMemoryTracker`: Placement resolution and copy diagnostics.
- KV-Cache Subsystem: `KvCacheStore` interface with append-by-token writes, layer/head addressing, eviction, and `DefaultKvCacheStore` (dense FP32 baseline).
- Quantization-Preserving Loaders:
  - `StreamingGGUFReader` and `StreamingSafeTensorsReader` produce `TensorStorage` with `FileBacked` or `Borrowed` handles (no forced densification).
  - `StorageAwareSafeTensorsLoader`: Zero-copy file-backed SafeTensors loading.
- Completed `Quants.kt` port: `byteShapeToQuantShape`, `quantByteSize`, `isBlockQuantized`, `validateQuantizedBytes`.
- Tekken Tokenizer: Mistral Tekken (tiktoken-based BPE) tokenizer support.
- CPU SIMD TurboQuant Kernels: `JvmTurboQuantKernels` with Java Vector API acceleration for abs-max, quantize, dequantize, and the Walsh-Hadamard butterfly.
- JMH Benchmarks: TurboQuant encode/decode throughput, bit-packing, rotation, and KV cache append/read benchmarks (`TurboQuantBenchmarks.kt`).
- Storage Benchmarks: Dequantization throughput (Q4_K, Q8_0, Ternary), buffer accessor, and TensorData bridge benchmarks (`StorageBenchmarks.kt`).
- New Ops: `sin`, `cos`, `tanh`, `convTranspose1d`.
- New Layers: `TransposedConv1d`, `Snake` activation, `LayerScale`.
- Streaming GGUF as Default: `StreamingGGUFReader` is now the recommended GGUF loading path (memory-efficient, supports quantized types).
- DSL Annotations: Extended `PlacementAnnotations.kt` with `@KvCache(preset=...)` and `@KvCacheBypass` for TurboQuant configuration.
- Int Overflow for Large Tensors: Changed `StreamingTensorInfo.nBytes` and `StreamingSafeTensorInfo.sizeInBytes` from `Int` to `Long`, preventing silent overflow for tensors > 2 GB. Fixes loading of Gemma 4 E4B and future large models. (#452)
- Legacy GGUFReader Overflow Guard: Added an explicit overflow check with an actionable error message for tensors > 2 GB in the legacy eager loader.
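A quick illustration of why the `Int` → `Long` change matters (illustrative shape, not a specific model's):

```kotlin
val elements = 262_144L * 9_216L // ≈ 2.42e9 elements
val bytes: Long = elements * 2L  // BF16 → 4_831_838_208 bytes: correct as Long
val wrapped: Int = bytes.toInt() // 536_870_912 — wrapped modulo 2^32, silently wrong
```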
- io.github.kotest:kotest: 6.1.9 → 6.1.11.
- com.squareup:kotlinpoet: 2.2.0 → 2.3.0.
- Core Engine Focus: Refactored the repository to focus on the core `ComputeGraph` framework, compiler, and backends.
- Standalone Ecosystem: Extracted high-level LLM and transformer implementations to dedicated repositories (SKaiNET-LLM and SKaiNET-transformers).
- LLM-as-DSL: High-level DSL for defining and running LLM architectures within the core `ComputeGraph` framework.
- `ComputeGraphExecutor`: New optimized executor with support for fusion passes and trace-to-DAG bridging.
- SDPA & Gather: Implementation of Scaled Dot-Product Attention (SDPA) and `gather`/`indexSelect` ops across backends.
- `EmbeddingAdapter`: Streamlined embedding layer integration for transformer models.
- Optimized LLM execution: Integrated fusion passes for faster inference on supported backends.
- Improved Tensor API: Refined the `Tensor` interface and updated `ComputeGraphExecutor` for better type safety and performance.
- Dependency Cleanups: Removed stale references to LLM and transformer code already moved to the standalone `skainet-transformers` repository.
- Embedding Padding: Fixed `paddingIdx` handling in embedding layers.
- Concatenation: Resolved rank-specific issues in tensor concatenation (rank > 1).
- Compilation: Fixed various build and compilation errors after module migrations.
- Deduplicated LLM Infrastructure: Unified `KvCache`, `softmax`, RoPE, and sampling logic across modules for improved maintainability.
- Updated skainet-bom: Refactored the Bill of Materials (BOM) to use local `project()` references for better build consistency.
- LLM Module Extraction: Extracted and moved core LLM modules to the standalone SKaiNET-LLM repository to reduce core codebase footprint.
- Transformer Code Cleanup: Removed redundant code that has been moved to the SKaiNET-transformers repository.
- Dependency Graph: Resolved inverted dependency issues in the LLM infrastructure.
- System Prompt Support (Java): Added `systemPrompt` support to `KLlamaJava` and `KLlamaSession` for prepending system instructions to conversations.
- Model Module Extraction: Extracted model-specific code into dedicated `skainet-models` modules for better separation of concerns and maintainability.
- Enhanced Smoke Tests: Refactored `smoke-test.sh` to support multiple runners via JSON configuration and improved LLM loading verification.
- Whisper HLO Generation: Fixed StableHLO MLIR generation for Whisper models.
- Compilation: Fixed various Kotlin/JVM compilation errors.
- First-Class Java 21+ Support: Complete Java API surface with the `SKaiNET` entry point, `TensorJavaOps`, builder-pattern model definition (`SequentialModelBuilder`), `KLlamaJava`/`KBertJava` facades, `JavaAgentLoop` for tool-calling agents, and a `TrainingLoop` builder.
- Maven BOM: New `sk.ainet:skainet-bom` artifact for one-line version management across all modules.
- Java Documentation: Added Getting Started, LLM Inference, and Model Training guides.
- Java 25 Performance Documentation: Added documentation for JVM CPU backend performance advantages.
- WasmWasi Target: Added `wasmWasi` target support across all KMP modules.
- StableHLO MLIR Streaming API: New `HloGenerator` public API with a generic Model + Tensor interface and streaming MLIR output.
- `ReductionOperationsConverter`: Added support for reduction operations in StableHLO export.
- JVM Performance (Jlama Techniques): MemorySegment-based tensors, SIMD GEMM kernels, paged KV cache, batch attention for prompt prefill, fused QKV projections, and cached quantized weights.
- Native RandomAccessSource: POSIX `pread()`-based source for memory-efficient GGUF parsing.
- MemorySegment Weight Conversion: New `NATIVE_OPTIMIZED` quant policy and `MemSegWeightConverter` pipeline with Arena lifecycle management.
- Lazy Transpose: Added lazy transpose for Q4/Q8 MemorySegment tensors and MemSeg FP32 transpose.
- Java CLI App: New Java-based KLlama CLI application.
- Android KMP Plugin Migration: Migrated Android subprojects to the `androidMultiplatformLibrary` plugin for AGP 9 compatibility.
- Refactored Model Loading: Extracted shared dequantization, registry, tensor naming, and decoder runtime into reusable components.
- JDK Requirement Relaxed: Allow JDK >= 21 instead of requiring exactly JDK 21.
- Gradle Upgrade: Updated to Gradle 9.3.1.
- Kotlin Upgrade: Bumped Kotlin from 2.2.21 to 2.3.10.
- Kotlin Compile Testing: Replaced the abandoned `kotlin-compile-testing` with `kctfork` for Kotlin 2.3.0 compatibility.
- StableHLO MLIR Export: Fixed MLIR export to produce valid IREE-compilable output.
- OOM in Dequantization Benchmark: Fixed out-of-memory in the `DEQUANTIZE_TO_FP32` E2E benchmark test.
- Quantized MatMul: Fixed block offset calculation in quantized matrix multiplication.
- CI Stability: Fixed AAPT2 daemon crashes and improved Android build stability.
- Documentation CI: Fixed workflow permissions for PR comments.
- Deprecated API Usage: Fixed `createTempDir()` deprecation in data-simple integration tests.
- com.gradleup.shadow: 9.3.1 → 9.3.2.
- com.fasterxml.jackson.core:jackson-databind: 2.21.0 → 2.21.1.
- ch.qos.logback:logback-classic: 1.5.27 → 1.5.32.
- io.github.kotest:kotest: 6.1.3 → 6.1.4.
- org.jetbrains.kotlinx:kotlinx-io-core: 0.8.2 → 0.9.0.
- com.vanniktech.maven.publish: → 0.36.0.
- org.jetbrains.kotlinx.kover: → 0.9.7.
- actions/setup-node: 4 → 6.
- actions/upload-artifact: 6 → 7.
- actions/download-artifact: 7 → 8.
- junit-platform-launcher added for CI test execution.
Thank you to the following contributors for their work on this release:
- Dhia Chemingui (@dhiaspaner) — Android KMP plugin migration (#385, #386)
- Tool Calling: Added support for tool calling in KLlama, including a new `skainet-kllama-agent` module.
- Gemma 3n Support: New `skainet-kgemma` module for Google's Gemma 3n E2B multimodal models.
- Extended SafeTensors Support: Added SafeTensors weight loading support for both the KLlama CLI and Gemma models.
- HuggingFace Tokenizer: Initial support for HuggingFace-style tokenizers in Gemma models.
- Named Arguments: Refactored various internal APIs to use named arguments for better optional parameter support.
- System Prompt Handling: Improved system prompt formatting and handling in agentic workflows.
- BERT Support: Full support for BERT-based models with SafeTensors weight loading.
- kbert-cli: New CLI tool for running BERT inference, supporting text encoding and cosine similarity computation.
- WordPiece Tokenizer: Implementation of WordPiece tokenizer for BERT models.
- TinyFoA Support: Implemented missing operators (`abs`, `sign`, `clamp`, `lt`, `ge`, `narrow`, `pad2d`, `unfold`) to support the TinyFoA (AAAI 2025) training pipeline for memory-efficient on-device learning.
- Multi-platform KLlama: Added macOS target support for the KLlama runtime.
- Custom Backends Documentation: Added detailed guide and examples for injecting custom backends into KLlama.
- Improved robustness of TinyFoA operations with comprehensive unit tests.
- Benchmarking DSL: New `BenchmarkDsl` and `BenchmarkRunner` for measuring model performance and latency.
- Execution Observers: Added an `ExecutionObserver` API with `LatencyExecutionObserver` and `MemorySnapshotObserver` for profiling.
- New Layers: Added `RMSNormalization` layer support.
- KLlama Enhancements: Improved weight loading and initial support for GPU-accelerated attention (experimental).
- Refactored `ExecutionContext` to support execution observers and better phase management.
- Updated the KLlama runtime with improved ingestion and benchmarking utilities.
- Generative AI Section: New README section with simple code for GGUF text generation.
- Tokenizer Strategies: Automatic detection of tokenizer type (SentencePiece, BPE, WordPiece) from GGUF metadata.
- Improved Token Decoding: Support for multi-byte UTF-8 character decoding from byte tokens.
- Llama Runtime: Rewrote `matmulNoBias` for better performance and row-major weight support.
- GGUF Loading: Improved dequantization for Q2_K, Q4_K, Q5_K, and Q6_K formats, matching llama.cpp logic.
- GGUF Storage Order: Fixed critical bug with column-major storage in GGUF files by implementing proper transposition during loading.
- Llama Attention: Fixed missing attention output projection (wo) in the runtime.
- Tokenizer: Fixed BOS token handling and multi-byte character reconstruction.
- SafeTensors Support: Initial implementation of `skainet-io-safetensors` for reading the SafeTensors format.
- Generalized I/O & Weight Mapping:
  - New `WeightMapper` and `WeightLoader` APIs for unified model parameter loading across formats.
  - `LoadingProgress` API for tracking model loading state.
  - `GgufModelMetadata` and `OnnxModelMetadata` for better inspection of model files.
- JVM Performance: Enhanced `DefaultCpuOpsJvm` with `JvmVectorKernels` for SIMD-accelerated tensor operations using the Java Vector API.
- Llama Enhancements:
  - Added `GGUFTokenizer` for better text processing.
  - Improved `LlamaIngestion` and ingestion pipelines.
- Improved GGUF/ONNX Loading: Robust weight loading and metadata parsing for GGUF and ONNX models.
- Streamlined CLI: Removed unfinished CLI samples and reorganized `skainet-tensor-tools`.
- Documentation Cleanup: Removed outdated technical docs and consolidated architecture information.
- Improved robustness of GGUF and ONNX streaming readers.
- Fixed various issues in WASM/JS weight parsing.
- Updated version to 0.8.3.
- KLlama (Llama 2 port): Initial version ported from `llama2-kmp`, supporting GGUF models.
- GGUF Enhancements:
  - Support for `mmap` for zero-copy GGUF tensor loading.
  - Embedded tokenizer support in GGUF.
  - New quantization formats: `Q8_0`, `Q4_K`, and BitNet/Ternary support (`TQ1_0`, `TQ2_0`).
  - Improved loading and bug fixes for quantization and mapping.
  - Added `int64` support for GGUF.
  - Improved GGUF metadata loading.
- Streaming Support: Added streaming support for GGUF and ONNX models.
- Advanced Operations:
  - New activations: `LeakyReLU`, `ELU`.
  - New pooling: `AvgPool2d`.
  - New convolutions: `Conv1d`, `Conv3d`.
- Optimizers & Training:
  - Added `Adam` and `AdamW` optimizers.
  - Comprehensive loss function library.
  - New `Metric` interface with an `Accuracy` implementation.
  - KSP-based DSL generator for Network activations.
- Data & Datasets:
  - Support for `CIFAR-10` and `Fashion-MNIST` datasets.
  - New `Data Transform API` and `Image Transform DSL`.
- Testing & Documentation:
  - `skainet-test-groundtruth` module for validation against PyTorch.
  - Integration tests for quantized inference and `KvCache`.
  - Shadow JAR support for JVM fat JAR builds.
  - New documentation for testing architecture with Mermaid diagrams.
- WASM/JS: Initial version of a simple WASM/JS sample.
- Simplified model support to GGUF-only (removed legacy Karpathy `.bin` format support).
- Improved KLlama loading and robustness.
- Updated roadmap with Phase 1 completion and multi-backend storage abstraction plans.
- Improved I/O system and overall robustness.
- Fixed various bugs in quantization and memory mapping.
- Resolved compilation errors and failing tests in CIFAR-10 support.
- Fixed KSP and TracingWrapperProcessor tests to match updated log messages.
- Fixed GGUF metadata loading issues.
- Initial release of 0.8.x series.
- Sine Approximation CLI (`skainet-sine-approx-cli`) as a new example application for training models.
- `TapeRecordingStrategy` to handle different recording behaviors for prediction and backpropagation.
- Comprehensive E2E tests for training sine-wave approximations.
- New documentation: `autograd-basic.md` explaining the autograd engine.
- Refined `Linear`, `Flatten`, and `Input` modules and the `relu` activation to better support gradient tracking and context propagation.
- Improved `DefaultExecutionTape` and `DefaultGraphExecutionContext` for more robust computation tracing.
- Optimized internal `OpSink` and `TraceSession` handling.
- Infinite-loop error during backpropagation tracing, fixed by implementing specialized tape-recording strategies.
- Context mismatch errors in backpropagation tracing.
- Broken tests in the sine sample application.
- Initial Autograd engine (`DefaultGradientTape`) for automatic differentiation and reverse-mode gradients.
- Optimizer API with an `SgdOptimizer` implementation for training neural networks.
- Loss functions module including `MSELoss` and `CrossEntropyLoss` with configurable reductions (MEAN, SUM, NONE).
- Training DSL and helper utilities for building training loops (`trainStep`, `evaluateLoss`).
- Improved Graph DSL with better context propagation and support for recording computation traces.
- Updated dependency versions and refined internal execution context APIs to support gradient tracking.
- Refactored `skainet-compile-dag` to support autograd and graph inversion.
- StableHLO implementation and E2E CLI app for compiling models to CUDA via IREE.
- `ArduinoCodegen` for exporting models to standalone C99 code with static memory allocation, optimized for Arduino.
- KSP-based generation of `TracingOps` for automated recording pipeline updates.
- Initial implementation of `skainet-compile-hlo` for high-level optimization.
- Improved CUDA backend strategy and IREE integration.
- Optimized long-running property tests for C code generation.
- Refactored `TracingTensorOps` to use the execution context for code generation.
- Common I/O abstraction with `ModelReader` and `TensorInfo` in `skainet-io-core` for unified model loading.
- Efficient memory handling with non-copying `slice` views in `MemoryChunk`.
- Unified `skainet-tensor-tools` CLI combining ONNX and GGUF utilities.
- `OnnxStatsCli` tool for analyzing ONNX model parameters and structure.
- Migrated the project to the `SKaiNET-developers` organization; updated repository URLs and deployment configurations.
- Standardized artifact naming in documentation (e.g., `SKaiNET-lang-core`).
- Improved `GGUFReader` with better alignment parsing and tensor data handling.
- Optimized test infrastructure: increased heap size to 8 GB for large model tests and added `ReadmeSnippetsTest` for documentation verification.
- Legacy standalone applications and tools: `skainet-KGPChat`, `skainet-mnist`, and separate ONNX/GGUF tool modules.
- ONNX import module (`skainet-io-onnx`) with a pbandk-generated proto surface, loader utilities, and an importer that maps ONNX graphs into SKaiNET compute graphs, plus docs and tests.
- CLI tooling: `skainet-onnx-tools` to export ONNX initializers to JSON and a `skainet-onnx-detect` CLI to run YOLO detections from ONNX weights.
- Image IO module now published with explicit API surface for bitmap <-> tensor conversions across platforms.
- BatchNorm now reshapes stats for broadcasting and exercises JVM/native tests; CPU backend implements
sqrtto support it.
- Added pbandk runtime 0.16.0 for ONNX protobuf decoding.
- Recording/tracing pipeline for tensor ops (`RecordingExecution`/`TracingTensorOps`) and a compute-graph DAG under `sk.ainet.lang.graph`, including tape-to-graph conversion and GraphViz export helpers/tests.
- JSON export proof of concept via a new `skainet-compile-json` module with serialization models, an `exportJson` CLI, and tiny-graph golden fixtures.
- Multiplatform image IO module to convert platform bitmaps <-> tensors and RGB byte arrays; includes macOS implementation fixes.
- Dedicated YOLOv8 model module (`skainet-models:skainet-model-yolo`) with graph assembly, config/pre/post-processing, and the missing upsample/concat ops required by the model.
- NN DSL additions: multi-input `Functional` wrapper, new `Upsample2d`/Softmax helpers, a scalar DSL builder plus tensor/number operator overloads, and extra tensor view/pprint utilities.
- Removed committed MNIST training assets; rely on download at runtime.
- Added scalar arithmetic support across backends and void ops to match new operator overloads.
- Corrected unsqueeze view handling and data DSL dtype reuse; stabilized tracing/JSON/tape tests.
- Fixed macOS image conversion path and cleaned duplicate files in the new IO/image pipeline.
- io.ktor client 3.3.3 (from 3.3.2).
- logback-classic 1.5.21 (from 1.5.20).
- Kolmogorov–Arnold Network (KAN/AKN) module and DSL support, including a public factory and aliases for direct construction. Introduces `Akn`/`AknConfig` and `createAkn` mirroring DSL defaults.
- Example KAN models and graphs (e.g., sine-function examples and a pretrained variant) with tests and Graphviz export.
- Additional NN DSL conveniences around initialization scopes (weights/basis/bias) and activation hooks used by KAN.
- Minor API refinements in lang/nn DSL to better align with execution context usage for new KAN modules.
- Stabilized integration tests for KAN modules and examples.
- Minor initialization performance tweaks for new modules.
- Updated docs and samples to include KAN usage and references.
- Initial support for model code sharing API (model definition, execution, loading). Implements #196, related to #169.
- Batch Normalization layer. Implements #193.
- Forward hooks and simple tape recording for NN. Implements #190, related to #104.
- Common traversal base for modules, with tests; Embedding implementation with dual value types; switched EEmbeddings to DualModule implementation.
- Dropout (initial implementation) and phase support (training/eval) in execution context so modules can behave differently by phase. Related to #5.
- `tril` op (initial version).
- MaxPool op with DSL support; Conv2D DSL support.
- Data API: initial version including MNIST data loader; JSON loading support (renamed loader classes from CSV to JSON) with tests. Implements #180, #181; related to #176, #179.
- GGUF model loading implementation (initial import and working version). Implements #178, #182; related to #176, #177.
- MatMul support in backends.
- Nested data blocks support in DSL (data block returns a tensor); contexts for creating and collecting tensors (returning last or all created tensors).
- JVM Ops using the Java Vector API (initial implementation) and SIMD Vector API acceleration.
- JMH benchmarks (JVM module) and additional benchmarks.
- Sample showing general tensor calculations (e.g., image color transformations).
- NN DSL refactored to use `ExecutionContext`; added an `ExecutionContext` parameter to `forward` functions.
- Default CPU compute used for JS target.
- JS and WASM Kotlin targets aligned for library packaging.
- Gradle updated to 9.0.0; Android target namespaces fixed.
- Crash in schema validation task; added Kotlin compiler plugin configuration for expect/actual.
- Activation not applied in Dense layer (fixed).
- JVM target issues; fixed failing JVM tests; added regression tests; stabilized platform matching test (temporarily ignored) and additional general test fixes.
- Miscellaneous build-signing validation added to avoid CI failures.
- SIMD/Java Vector API acceleration for JVM backend operations.
- com.vanniktech.maven.publish: 0.34.0 → 0.35.0.
- io.ktor (android, cio, content-negotiation, core, darwin, js, logging): 3.3.1 → 3.3.2.
- com.fasterxml.jackson.core:jackson-databind: 2.15.2 → 2.20.0 → 2.20.1.
- GitHub Actions: use Java 22.
- Bump actions/checkout from v4 to v5.
- Add Gradle local caches to .gitignore.
- Preparations for 0.2.0 release and ability to build local Maven version of the upcoming release.
- Added hint/reference on normalization layer paper. Related to #192.
- Initial public release of SKaiNET 0.1.0.