
SKaiNET-transformers


Group: sk.ainet.transformers

High-performance LLM application layer on top of the SKaiNET engine. Provides model-specific inference, agentic chat with tool calling, and a unified CLI for transformer-based models, all in Kotlin Multiplatform.

Key features

  • Multi-model support. Llama 3 / 3.1 / 3.2, Gemma 2 / 3 / 4, Qwen 2 / 3, Apertus (Swiss AI), Mistral, BERT.
  • Native CPU performance. Auto-discovers SKaiNET's priority-100 FFM (Foreign Function & Memory) native kernel provider when present: 4–6× faster Q4_K matmul and 1.5–1.8× faster FP32 SGEMM versus the priority-50 Panama Vector path. Prebuilt binaries for Linux x86_64, macOS ARM64, and Windows x86_64 ship in the published JAR; no manual setup required.
  • Native tool calling. Family-specific chat templates and tool-call parsers for Llama 3, Gemma 4, Qwen, Apertus, and ChatML/Hermes. Includes a Java surface (KLlamaJava, JavaTools.definition, JavaAgentLoop) for plain-Java consumers.
  • GGUF + SafeTensors loading. Streaming reader for any model size; NATIVE_OPTIMIZED quant policy keeps weights in their packed SIMD-friendly form.
  • Kotlin Multiplatform. JVM, Android, Kotlin/Native (Linux x64/ARM64, macOS ARM64, iOS arm64/sim arm64), JS, Wasm targets where applicable.

Current release

The current release is 0.23.1, version-aligned with the corresponding SKaiNET engine release. Coordinates:

dependencies {
    implementation("sk.ainet.transformers:llm-core:0.23.1")
    implementation("sk.ainet.transformers:llm-runtime-kllama:0.23.1") // or kgemma, kqwen, kapertus
    implementation("sk.ainet.transformers:llm-agent:0.23.1")          // chat templates + tool calling
}

To opt in to the native FFM CPU provider (recommended for JVM consumers):

dependencies {
    implementation("sk.ainet.core:skainet-backend-cpu:0.23.1")        // priority-50 Panama Vector
    implementation("sk.ainet.core:skainet-backend-native-cpu:0.23.1") // priority-100 FFM (auto-discovered)
}

KernelRegistry picks the highest-priority available provider; on hosts where the native lib doesn't load (sandboxed JDKs, unsupported arches), it cleanly falls back to Panama with no functional regression.
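The selection behavior can be sketched as follows. This is a minimal illustration of "highest available priority wins, with fallback": the `KernelProvider` interface, its `priority()`/`isAvailable()` methods, and the provider names are hypothetical stand-ins, not SKaiNET's actual SPI.

```java
import java.util.Comparator;
import java.util.List;

public class ProviderSelection {

    // Hypothetical stand-in for a kernel provider SPI.
    interface KernelProvider {
        String name();
        int priority();
        boolean isAvailable(); // e.g. the native library loaded successfully
    }

    record Simple(String name, int priority, boolean isAvailable) implements KernelProvider {}

    // Pick the highest-priority provider that actually works on this host.
    static KernelProvider select(List<KernelProvider> discovered) {
        return discovered.stream()
                .filter(KernelProvider::isAvailable)
                .max(Comparator.comparingInt(KernelProvider::priority))
                .orElseThrow(() -> new IllegalStateException("no kernel provider available"));
    }

    public static void main(String[] args) {
        // The native FFM provider (priority 100) wins when its library loads...
        KernelProvider picked = select(List.of(
                new Simple("panama-vector", 50, true),
                new Simple("native-cpu-ffm", 100, true)));
        System.out.println(picked.name()); // native-cpu-ffm

        // ...and selection cleanly falls back to Panama when it does not.
        KernelProvider fallback = select(List.of(
                new Simple("panama-vector", 50, true),
                new Simple("native-cpu-ffm", 100, false)));
        System.out.println(fallback.name()); // panama-vector
    }
}
```

In the real stack the discovery step is ServiceLoader-based, which is why the shadow-jar merge fix mentioned below matters: merged JARs must preserve the service descriptor for the native provider to be discoverable at all.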

Project structure

| Module | Purpose |
| --- | --- |
| llm-api | Framework-neutral interfaces (ChatModel, EmbeddingModel, ToolDefinition), Spring AI-shaped. |
| llm-core | OptimizedLLMRuntime, ModelRegistry, UnifiedModelLoader, shared abstractions. |
| llm-inference/<arch> | Per-architecture network DSLs and weight loaders (llama, gemma, qwen, apertus, bert). |
| llm-runtime/<arch> | Per-architecture runtime facades (kllama, kgemma, kqwen, kapertus). |
| llm-agent | Chat templates, tool-call parsers, agent loops; Java surface. |
| llm-apps | CLIs: skainet-cli (unified), kllama-cli, kbert-cli, plus kllama-java-sample. |
| llm-test/llm-test-java | JUnit 5 end-to-end tests for the Java surface (gated on TINYLLAMA_MODEL_PATH). |

Getting started

Prerequisites

  • JDK 21 or higher
  • Gradle 8.10+

CLI: unified skainet-cli

# Plain generation
./gradlew :llm-apps:skainet-cli:shadowJar
java -jar llm-apps/skainet-cli/build/libs/skainet-all.jar \
  -m /path/to/model.gguf "The capital of France is"

# Tool-calling demo (calculator + file-listing tools auto-registered)
java -jar skainet-all.jar -m model.gguf --demo --template=llama3 "What is 17 * 23?"

# Interactive agent
java -jar skainet-all.jar -m model.gguf --agent --template=apertus

--template accepts llama3, chatml, qwen, gemma, apertus (auto-detected from GGUF metadata if omitted).
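`general.architecture` is the standard GGUF metadata key naming the model family, so auto-detection amounts to mapping that value to a template name. The sketch below is an illustrative assumption, not the CLI's actual detection code: `detectTemplate` and the exact architecture-to-template mapping are hypothetical.

```java
import java.util.Map;

public class TemplateDetect {
    // Hypothetical mapping from the GGUF "general.architecture" metadata value
    // to a chat-template name; the real CLI's logic may differ.
    static String detectTemplate(Map<String, String> ggufMetadata) {
        String arch = ggufMetadata.getOrDefault("general.architecture", "");
        return switch (arch) {
            case "llama" -> "llama3";
            case "gemma", "gemma2" -> "gemma";
            case "qwen2", "qwen3" -> "qwen";
            case "apertus" -> "apertus";
            default -> "chatml"; // ChatML/Hermes as a generic fallback
        };
    }

    public static void main(String[] args) {
        System.out.println(detectTemplate(Map.of("general.architecture", "llama"))); // llama3
    }
}
```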

Java consumers

try (KLlamaSession session = KLlamaJava.loadGGUF(modelPath, /* systemPrompt */ null)) {
    JavaTool calc = new JavaTool() {
        @Override public ToolDefinition getDefinition() {
            return JavaTools.definition(
                "calculator", "Evaluate an arithmetic expression.",
                "{\"type\":\"object\",\"properties\":{\"expression\":{\"type\":\"string\"}},\"required\":[\"expression\"]}"
            );
        }
        @Override public String execute(Map<String, ?> args) { /* ... */ return ""; }
    };
    JavaAgentLoop agent = JavaAgentLoop.builder()
        .session(session).tool(calc).template("llama3").build();
    String response = agent.chat("What is 17 * 23?");
}

See llm-test/llm-test-java/src/test/java/.../KLlamaJavaToolCallingTest.java for a runnable reference.

What's new in 0.23.1

  • Apertus end-to-end. Routing fix (now goes through OptimizedLLMRuntime + apertusNetwork()), chat template + tool calling, and real-GGUF loading on top of SKaiNET 0.23.1's block-major Q4_K TensorData wiring. See APERTUS_ROLLOUT.md.
  • Gemma 4 chat-model JVM facade (Gemma4ChatModel) for embedded text-only deployments, with close() propagating to the mmap arena and the PLE mmap path consuming upstream loadTensorStorageMapped.
  • Multi-id EOS / stop-token support in the chat layer — required for templates that emit several end markers.
  • Tokenizer auto-detect for SentencePiece in fromTokenizerJson, fixing decoding for models that omit the explicit marker.
  • End-to-end smoke test in llm-test/llm-test-java that wires LEAF (KBertJava) and Llama 3 (KLlamaJava) in one JVM.
  • skainet-cli and kllama-cli shadow-jar ServiceLoader fix-up so the priority-100 native-cpu provider is picked up post-merge.
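The multi-id stop-token support listed above amounts to checking each sampled token against a set of stop ids rather than a single EOS id. A minimal sketch, with hypothetical names rather than the project's chat-layer API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MultiStop {
    // Collect sampled token ids until any of several stop ids appears.
    // Illustrative only; the actual chat layer streams tokens rather than
    // consuming a pre-sampled list.
    static List<Integer> takeUntilStop(List<Integer> sampled, Set<Integer> stopIds) {
        List<Integer> out = new ArrayList<>();
        for (int id : sampled) {
            if (stopIds.contains(id)) break; // any stop id ends generation
            out.add(id);
        }
        return out;
    }

    public static void main(String[] args) {
        // A template may emit several end markers (e.g. end-of-turn and
        // end-of-text), each with its own token id.
        System.out.println(takeUntilStop(List.of(5, 6, 128009, 7), Set.of(128001, 128009))); // [5, 6]
    }
}
```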

See CHANGELOG.md for the full set of changes.

Engine

This project uses SKaiNET as its underlying execution engine — tensor ops, neural-network DSL, kernel SPI, GGUF / SafeTensors I/O.

License

MIT — see LICENCE.