
Commit 3f28cc2

Merge pull request #377 from SKaiNET-developers/feature/374-tool-calling
Feature/374 tool calling
2 parents 56edafd + e8efb99 commit 3f28cc2

27 files changed

Lines changed: 1910 additions & 23 deletions

ARCHITECTURE.md

Lines changed: 204 additions & 0 deletions
# SKaiNET Architecture: Where Agentic AI Fits

## The Core Question

After implementing tool calling support for KLlama, the question arises: **how does the agentic/tool-calling layer relate to the deep learning foundation?** Is it "real ML" or a higher-level orchestration concern?

**Answer**: Agentic AI is **not a deep learning primitive** — it's a **higher-level architectural pattern** that *consumes* the ML inference layer. The LLM (transformer forward pass, attention, embeddings) is pure deep learning. The agent loop that wraps it (chat formatting, tool parsing, execution, re-prompting) is application-level orchestration. Both are essential — one without the other is either a raw token generator or a tool executor with no intelligence.

---

## Diagram 1 — Full SKaiNET Layer Cake

All modules organized by abstraction level, with the agentic layer at the top:

```mermaid
graph TB
    subgraph APP["Application Layer"]
        CLI["skainet-kllama-cli<br/>--chat / --agent"]
    end

    subgraph AGENTIC["Agentic AI Layer (skainet-kllama-agent, orchestration, not ML)"]
        IR["InferenceRuntime&lt;T&gt;"]
        AL["AgentLoop&lt;T&gt;"]
        CT["ChatTemplate<br/>Llama3ChatTemplate / ChatMLTemplate"]
        TR["ToolRegistry"]
        TCP["ToolCallParser"]
        GEN["generateUntilStop()"]
    end

    subgraph INFERENCE["Inference Runtime Layer (skainet-kllama, ML forward pass)"]
        LR["LlamaRuntime&lt;T&gt;"]
        AB["AttentionBackend&lt;T&gt;<br/>CpuAttentionBackend / GpuAttentionBackend"]
        KV["KvCache<br/>HeapKvCache"]
        TOK["GGUFTokenizer"]
    end

    subgraph IO["Model I/O Layer"]
        GGUF["skainet-io-gguf"]
        ST["skainet-io-safetensors"]
        ONNX["skainet-io-onnx"]
    end

    subgraph COMPILE["Compilation Layer"]
        CC["skainet-compile-core<br/>Tape Recording"]
        CD["skainet-compile-dag<br/>Graph Optimization"]
        HLO["skainet-compile-hlo<br/>StableHLO Lowering"]
        CGEN["skainet-compile-c<br/>C99 Codegen"]
    end

    subgraph LANG["Tensor & NN Primitives Layer"]
        LC["skainet-lang-core<br/>Tensor&lt;T,V&gt;, DType, Shape"]
        NN["NN Layers<br/>Embedding, RMSNormalization, Linear"]
        OPS["Operators<br/>matmul, silu, softmax"]
    end

    subgraph BACKEND["Backend Execution Layer"]
        CPU["skainet-backend-cpu<br/>DirectCpuExecutionContext<br/>JDK 21 Vector API / SIMD"]
    end

    CLI --> AL
    AL --> CT
    AL --> TR
    AL --> TCP
    AL --> GEN
    AL --> IR
    GEN --> IR
    LR -.->|implements| IR
    LR --> AB
    AB --> KV
    LR --> TOK
    LR --> GGUF
    GGUF --> LC
    CC --> LC
    CD --> CC
    HLO --> CD
    NN --> LC
    OPS --> LC
    LC --> CPU
```

---

## Diagram 2 — Agent Loop Data Flow

The generate-parse-execute cycle that makes the system "agentic":

```mermaid
sequenceDiagram
    participant User
    participant AgentLoop
    participant ChatTemplate
    participant LlamaRuntime
    participant ToolCallParser
    participant ToolRegistry
    participant Tool

    User->>AgentLoop: "What is 42 * 17?"

    loop Up to maxToolRounds
        AgentLoop->>ChatTemplate: apply(messages + toolDefs)
        ChatTemplate-->>AgentLoop: formatted prompt string

        AgentLoop->>LlamaRuntime: generateUntilStop(tokens)

        Note over LlamaRuntime: ML BOUNDARY<br/>Embedding → Transformer Layers<br/>→ RoPE + Attention + KV Cache<br/>→ FFN (SiLU) → RMSNorm → Logits → Sample

        LlamaRuntime-->>AgentLoop: "I'll calculate that.<br/>{\"name\":\"calculator\",\"arguments\":{\"expression\":\"42*17\"}}"

        AgentLoop->>ToolCallParser: parse(response)
        ToolCallParser-->>AgentLoop: [ToolCall("calculator", {expression: "42*17"})]

        AgentLoop->>ToolRegistry: execute(toolCall)
        ToolRegistry->>Tool: execute({expression: "42*17"})
        Tool-->>ToolRegistry: "714"
        ToolRegistry-->>AgentLoop: "714"

        Note over AgentLoop: Append tool result as ChatMessage<br/>with role=TOOL, continue loop
    end

    AgentLoop-->>User: "42 * 17 = 714"
```
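The parse step in the cycle above can be sketched as a standalone round trip. This is a deliberately simplified, hypothetical rendition — the real `ToolCallParser` in `skainet-kllama-agent` is the authoritative version, and the type and signatures below are illustrative assumptions, not its actual API:

```kotlin
// Hypothetical, simplified sketch of the parse step in the diagram above.
// The real ToolCallParser in skainet-kllama-agent is the authoritative version.

data class ToolCall(val name: String, val arguments: Map<String, String>)

// Pull a {"name":...,"arguments":{...}} call out of raw model output,
// ignoring any surrounding free text ("I'll calculate that. ...").
fun parseToolCall(response: String): ToolCall? {
    val name = Regex("\"name\"\\s*:\\s*\"([^\"]+)\"")
        .find(response)?.groupValues?.get(1) ?: return null
    val args = Regex("\"([A-Za-z_]\\w*)\"\\s*:\\s*\"([^\"]+)\"")
        .findAll(response.substringAfter("\"arguments\""))
        .associate { it.groupValues[1] to it.groupValues[2] }
    return ToolCall(name, args)
}

fun main() {
    val raw = """I'll calculate that. {"name":"calculator","arguments":{"expression":"42*17"}}"""
    println(parseToolCall(raw))  // ToolCall(name=calculator, arguments={expression=42*17})
}
```

A regex-based extractor like this tolerates the free-form text models emit around the JSON payload, which is why the diagram routes the raw response through a parser rather than treating it as pure JSON.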

---

## Diagram 3 — ML vs Orchestration Boundary

What is deep learning and what is application architecture:

```mermaid
graph LR
    subgraph ORCHESTRATION["Higher-Level: Orchestration"]
        direction TB
        A1["AgentLoop&lt;T&gt;<br/><i>control flow</i>"]
        A2["ChatTemplate<br/><i>string formatting</i>"]
        A3["ToolCallParser<br/><i>regex + JSON parsing</i>"]
        A4["ToolRegistry<br/><i>dispatch table</i>"]
        A5["ChatMessage / ChatRole<br/><i>data structures</i>"]
    end

    subgraph ML["Deep Learning: Math"]
        direction TB
        M1["LlamaRuntime.forward()<br/><i>transformer decoder</i>"]
        M2["Embedding lookup"]
        M3["RoPE + Multi-Head Attention"]
        M4["SiLU-gated FFN"]
        M5["RMSNormalization"]
        M6["Softmax sampling"]
        M7["KvCache management"]
        M8["Tensor&lt;T,V&gt; operations<br/><i>matmul, add, silu</i>"]
        M9["SIMD kernels<br/><i>JDK 21 Vector API</i>"]
    end

    ORCHESTRATION -->|"calls"| ML
    ML -->|"returns tokens"| ORCHESTRATION

    style ORCHESTRATION fill:#ffe0e0,stroke:#cc0000
    style ML fill:#e0ffe0,stroke:#00aa00
```

---

## Key Design Insights

### The agent layer adds no trainable parameters

It's pure control flow. The "intelligence" comes entirely from the LLM weights loaded from GGUF files via `LlamaWeightLoader`. `AgentLoop` decides *when* to call the model, not *what* the model says. The orchestration layer is stateless in the ML sense — it holds conversation history (`List<ChatMessage>`) but no learned weights.
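That orchestration state can be sketched in a few lines. `ChatMessage` and `ChatRole` are the real type names from the agent module, but the exact fields shown here are assumptions for illustration:

```kotlin
// Sketch of the conversation state: plain data, no learned weights.
// ChatMessage/ChatRole mirror the types named above; their fields are assumed.
enum class ChatRole { SYSTEM, USER, ASSISTANT, TOOL }

data class ChatMessage(val role: ChatRole, val content: String)

fun main() {
    // The agent's entire "state": a growing history list.
    val history = mutableListOf(
        ChatMessage(ChatRole.SYSTEM, "You can call tools."),
        ChatMessage(ChatRole.USER, "What is 42 * 17?"),
    )
    history += ChatMessage(ChatRole.TOOL, "714")  // tool result appended with role=TOOL
    println(history.size)  // 3
}
```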

### Why it matters anyway

Without the agent loop, the model is a one-shot text completer — you feed it tokens, it predicts the next ones, done. With it, the model can reason over multiple steps, call external tools, and incorporate real-world data. The same `LlamaRuntime<T>` that powers `--chat` mode becomes an autonomous agent in `--agent` mode, simply by wrapping it in `AgentLoop<T>`.

### The clean boundary

`InferenceRuntime<T>.forward(tokenId: Int): Tensor<T, Float>` is the ML boundary. The agent module (`skainet-kllama-agent`) defines this interface, and concrete runtimes like `LlamaRuntimeInterface<T>` extend it. Everything below (tensors, attention, SIMD kernels in `skainet-backend-cpu`) is deep learning. Everything above (chat formatting in `ChatTemplate`, tool parsing in `ToolCallParser`, the agent loop in `AgentLoop`) is software engineering orchestration.

```
┌──────────────────────────────────┐
│ AgentLoop / ChatTemplate / CLI   │ ← orchestration (skainet-kllama-agent)
├──────────────────────────────────┤
│ InferenceRuntime<T>.forward()    │ ← THE BOUNDARY
├──────────────────────────────────┤
│ LlamaRuntimeInterface<T>         │ ← extends InferenceRuntime (skainet-kllama)
│ Attention / FFN / KvCache        │ ← deep learning
│ Tensor<T,V> / SIMD kernels       │
└──────────────────────────────────┘
```
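A minimal, runnable sketch of that boundary, with `FloatArray` standing in for `Tensor<T, Float>` so the example runs without the tensor stack — the real signature is the generic one quoted above:

```kotlin
// The ML boundary reduced to its essence: token in, next-token logits out.
// FloatArray stands in for Tensor<T, Float> so this sketch is self-contained.
interface SimpleInferenceRuntime {
    fun forward(tokenId: Int): FloatArray
}

// The orchestration side only ever consumes logits; it never sees attention,
// KV caches, or SIMD kernels.
fun greedyNext(runtime: SimpleInferenceRuntime, tokenId: Int): Int {
    val logits = runtime.forward(tokenId)
    return logits.indices.maxBy { logits[it] }
}

fun main() {
    // A fake runtime whose logits always prefer token 2 — enough to drive
    // the orchestration side without any deep learning underneath.
    val fake = object : SimpleInferenceRuntime {
        override fun forward(tokenId: Int) = floatArrayOf(0.1f, 0.2f, 5.0f)
    }
    println(greedyNext(fake, 0))  // 2
}
```

This is also why the boundary is easy to test: the agent layer can be exercised against a fake runtime like the one above, with no model weights loaded.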

### Both layers are in `commonMain`

The agent layer is multiplatform Kotlin, not JVM-specific. `AgentLoop`, `ChatTemplate`, `ToolRegistry`, `ToolCallParser`, and all supporting types live in `skainet-kllama-agent/src/commonMain/`. The same agent loop runs on JVM (with Vector API SIMD), Native, and WASM targets — the only platform-specific code is the backend execution layer (`skainet-backend-cpu`) and the CLI entry point (`skainet-kllama-cli`).

---

## Module Reference

| Layer | Module | Key Types |
|-------|--------|-----------|
| Application | `skainet-apps:skainet-kllama-cli` | `Main.kt` (`--chat`, `--agent`) |
| Agentic | `skainet-apps:skainet-kllama-agent` | `InferenceRuntime<T>`, `AgentLoop<T>`, `ChatTemplate`, `Llama3ChatTemplate`, `ChatMLTemplate`, `ToolRegistry`, `ToolCallParser`, `ToolCall`, `Tool`, `ToolDefinition`, `ChatMessage`, `ChatRole`, `GenerateResult`, `generateUntilStop()`, `sampleFromLogits()` |
| Inference | `skainet-apps:skainet-kllama` | `LlamaRuntime<T>`, `LlamaRuntimeInterface<T>` (extends `InferenceRuntime<T>`), `AttentionBackend<T>`, `CpuAttentionBackend<T>`, `GpuAttentionBackend<T>`, `KvCache`, `HeapKvCache`, `GGUFTokenizer` |
| Model I/O | `skainet-io:skainet-io-gguf`, `skainet-io:skainet-io-safetensors`, `skainet-io:skainet-io-onnx` | `LlamaWeightLoader`, `LlamaRuntimeWeights<T>` |
| Compilation | `skainet-compile:skainet-compile-core`, `skainet-compile-dag`, `skainet-compile-hlo`, `skainet-compile-c` | Tape recording, graph optimization, StableHLO lowering, C99 codegen |
| Tensor/NN | `skainet-lang:skainet-lang-core` | `Tensor<T,V>`, `Shape`, `DType`, `Embedding`, `Linear`, `RMSNormalization` |
| Backend | `skainet-backends:skainet-backend-cpu` | `DirectCpuExecutionContext`, `DefaultCpuOps` |
settings.gradle.kts

Lines changed: 1 addition & 0 deletions

```diff
@@ -75,6 +75,7 @@ include("skainet-apps:skainet-tensor-tools")
 include("skainet-apps:skainet-llm")
 include("skainet-apps:skainet-bert")
 include("skainet-apps:skainet-kllama")
+include("skainet-apps:skainet-kllama-agent")
 include("skainet-apps:skainet-kllama-cli")
 include("skainet-apps:skainet-kgemma")
 include("skainet-apps:skainet-kbert-cli")
```
Lines changed: 59 additions & 0 deletions

```kotlin
import org.jetbrains.kotlin.gradle.ExperimentalKotlinGradlePluginApi
import org.jetbrains.kotlin.gradle.ExperimentalWasmDsl
import org.jetbrains.kotlin.gradle.dsl.JvmTarget

plugins {
    alias(libs.plugins.kotlinMultiplatform)
    alias(libs.plugins.androidLibrary)
    alias(libs.plugins.kotlinSerialization)
}

kotlin {
    jvmToolchain(21)

    androidTarget {
        publishLibraryVariants("release")
        @OptIn(ExperimentalKotlinGradlePluginApi::class)
        compilerOptions {
            jvmTarget.set(JvmTarget.JVM_11)
        }
    }

    linuxX64()
    linuxArm64()
    macosArm64()
    jvm()

    js {
        browser()
    }

    @OptIn(ExperimentalWasmDsl::class)
    wasmJs {
        browser()
    }

    sourceSets {
        commonMain.dependencies {
            implementation(project(":skainet-lang:skainet-lang-core"))
            implementation(libs.kotlinx.serialization.json)
        }

        commonTest.dependencies {
            implementation(libs.kotlin.test)
        }
    }
}

android {
    namespace = "sk.ainet.apps.kllama.agent"
    compileSdk = libs.versions.android.compileSdk.get().toInt()

    defaultConfig {
        minSdk = libs.versions.android.minSdk.get().toInt()
    }
    compileOptions {
        sourceCompatibility = JavaVersion.VERSION_11
        targetCompatibility = JavaVersion.VERSION_11
    }
}
```
Lines changed: 125 additions & 0 deletions

```kotlin
package sk.ainet.apps.kllama.agent

import kotlin.math.exp
import kotlin.random.Random
import sk.ainet.lang.tensor.Tensor
import sk.ainet.lang.tensor.data.FloatArrayTensorData
import sk.ainet.lang.types.DType

/**
 * Generate tokens until an EOS token is produced or [maxTokens] is reached.
 *
 * Unlike batch generation, this function:
 * - Stops when the model emits [eosTokenId]
 * - Does NOT prepend BOS automatically (the caller is responsible for encoding the
 *   full prompt including special tokens via the chat template)
 * - Returns a [GenerateResult] with all generated tokens and decoded text
 *
 * @param prompt Encoded prompt token IDs (should include BOS if needed).
 * @param maxTokens Maximum number of tokens to generate.
 * @param eosTokenId The EOS token ID to stop on.
 * @param temperature Sampling temperature (0 = greedy).
 * @param random Random generator for sampling.
 * @param onToken Optional callback invoked for each generated token.
 * @param decode Optional function to decode a token ID to a string.
 */
public fun <T : DType> InferenceRuntime<T>.generateUntilStop(
    prompt: IntArray,
    maxTokens: Int,
    eosTokenId: Int,
    temperature: Float = 0.8f,
    random: Random = Random.Default,
    onToken: ((Int) -> Unit)? = null,
    decode: ((Int) -> String)? = null
): GenerateResult {
    // Feed prompt tokens through the model
    var lastLogits: Tensor<T, Float>? = null
    for (tokenId in prompt) {
        lastLogits = forward(tokenId)
    }

    if (lastLogits == null) {
        return GenerateResult(emptyList(), "", false)
    }

    val generated = mutableListOf<Int>()
    val textBuilder = StringBuilder()
    var stoppedByEos = false

    var logits: Tensor<T, Float> = lastLogits
    for (step in 0 until maxTokens) {
        val nextToken = sampleFromLogits<T>(logits, temperature, random)

        if (nextToken == eosTokenId) {
            stoppedByEos = true
            break
        }

        generated.add(nextToken)
        onToken?.invoke(nextToken)
        decode?.let { textBuilder.append(it(nextToken)) }

        logits = forward(nextToken)
    }

    return GenerateResult(generated, textBuilder.toString(), stoppedByEos)
}

/**
 * Sample a token ID from a logits tensor.
 *
 * @param logits The logits tensor (1D, vocabSize).
 * @param temperature Sampling temperature. Values <= 1e-6 use greedy (argmax).
 * @param random Random generator.
 * @return The sampled token ID.
 */
public fun <T : DType> sampleFromLogits(
    logits: Tensor<T, Float>,
    temperature: Float,
    random: Random = Random.Default
): Int {
    val buf = logits.toFloatArray()

    // Greedy (argmax) for near-zero temperature
    if (temperature <= 1e-6f) {
        var best = 0
        var bestVal = buf[0]
        for (i in 1 until buf.size) {
            if (buf[i] > bestVal) {
                bestVal = buf[i]
                best = i
            }
        }
        return best
    }

    // Temperature-scaled softmax sampling
    var maxLogit = Float.NEGATIVE_INFINITY
    for (i in buf.indices) {
        val v = buf[i] / temperature
        buf[i] = v
        if (v > maxLogit) maxLogit = v
    }
    var sum = 0f
    for (i in buf.indices) {
        val e = exp((buf[i] - maxLogit).toDouble()).toFloat()
        buf[i] = e
        sum += e
    }
    val r = random.nextFloat() * sum
    var acc = 0f
    for (i in buf.indices) {
        acc += buf[i]
        if (acc >= r) return i
    }
    return buf.lastIndex
}

/**
 * Extract a FloatArray from a tensor, using the fast path if available.
 */
private fun <T : DType> Tensor<T, Float>.toFloatArray(): FloatArray {
    val data = this.data
    if (data is FloatArrayTensorData<*>) return data.buffer.copyOf()
    return data.copyToFloatArray()
}
```
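The sampling math in `sampleFromLogits` can be restated on a plain `FloatArray`, so it runs without the tensor stack. Same algorithm as above — greedy argmax below the temperature epsilon, otherwise temperature-scaled softmax sampling — just with the tensor-typed wrapper stripped away:

```kotlin
import kotlin.math.exp
import kotlin.random.Random

// Plain-array restatement of the sampling algorithm above, for illustration.
fun sample(logits: FloatArray, temperature: Float, random: Random = Random.Default): Int {
    // Greedy (argmax) for near-zero temperature.
    if (temperature <= 1e-6f) return logits.indices.maxBy { logits[it] }

    // Temperature-scaled softmax, with max subtraction for numerical stability.
    val scaled = FloatArray(logits.size) { logits[it] / temperature }
    val max = scaled.max()
    val probs = FloatArray(scaled.size) { exp((scaled[it] - max).toDouble()).toFloat() }

    // Sample proportionally to the (unnormalized) probabilities.
    val r = random.nextFloat() * probs.sum()
    var acc = 0f
    for (i in probs.indices) {
        acc += probs[i]
        if (acc >= r) return i
    }
    return probs.lastIndex
}

fun main() {
    val logits = floatArrayOf(0.1f, 2.5f, 0.3f)
    println(sample(logits, 0f))  // 1 — greedy picks the largest logit
}
```

Subtracting the max before `exp` leaves the softmax result unchanged but keeps the exponentials from overflowing when logits are large, which is why the original does the same.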
