# SKaiNET Architecture: Where Agentic AI Fits

## The Core Question

After implementing tool-calling support for KLlama, a natural question arises: **how does the agentic/tool-calling layer relate to the deep learning foundation?** Is it "real ML" or a higher-level orchestration concern?

**Answer**: Agentic AI is **not a deep learning primitive** — it's a **higher-level architectural pattern** that *consumes* the ML inference layer. The LLM (transformer forward pass, attention, embeddings) is pure deep learning. The agent loop that wraps it (chat formatting, tool parsing, execution, re-prompting) is application-level orchestration. Both are essential — one without the other leaves either a raw token generator or a tool executor with no intelligence.

---

## Diagram 1 — Full SKaiNET Layer Cake

All modules, organized by abstraction level, with the agentic layer at the top:

```mermaid
graph TB
    subgraph APP["Application Layer"]
        CLI["skainet-kllama-cli<br/>--chat / --agent"]
    end

    subgraph AGENTIC["Agentic AI Layer (skainet-kllama-agent, orchestration, not ML)"]
        IR["InferenceRuntime<T>"]
        AL["AgentLoop<T>"]
        CT["ChatTemplate<br/>Llama3ChatTemplate / ChatMLTemplate"]
        TR["ToolRegistry"]
        TCP["ToolCallParser"]
        GEN["generateUntilStop()"]
    end

    subgraph INFERENCE["Inference Runtime Layer (skainet-kllama, ML forward pass)"]
        LR["LlamaRuntime<T>"]
        AB["AttentionBackend<T><br/>CpuAttentionBackend / GpuAttentionBackend"]
        KV["KvCache<br/>HeapKvCache"]
        TOK["GGUFTokenizer"]
    end

    subgraph IO["Model I/O Layer"]
        GGUF["skainet-io-gguf"]
        ST["skainet-io-safetensors"]
        ONNX["skainet-io-onnx"]
    end

    subgraph COMPILE["Compilation Layer"]
        CC["skainet-compile-core<br/>Tape Recording"]
        CD["skainet-compile-dag<br/>Graph Optimization"]
        HLO["skainet-compile-hlo<br/>StableHLO Lowering"]
        CGEN["skainet-compile-c<br/>C99 Codegen"]
    end

    subgraph LANG["Tensor & NN Primitives Layer"]
        LC["skainet-lang-core<br/>Tensor<T,V>, DType, Shape"]
        NN["NN Layers<br/>Embedding, RMSNormalization, Linear"]
        OPS["Operators<br/>matmul, silu, softmax"]
    end

    subgraph BACKEND["Backend Execution Layer"]
        CPU["skainet-backend-cpu<br/>DirectCpuExecutionContext<br/>JDK 21 Vector API / SIMD"]
    end

    CLI --> AL
    AL --> CT
    AL --> TR
    AL --> TCP
    AL --> GEN
    AL --> IR
    GEN --> IR
    LR -.->|implements| IR
    LR --> AB
    AB --> KV
    LR --> TOK
    LR --> GGUF
    GGUF --> LC
    CC --> LC
    CD --> CC
    HLO --> CD
    NN --> LC
    OPS --> LC
    LC --> CPU
```

---

## Diagram 2 — Agent Loop Data Flow

The generate-parse-execute cycle that makes the system "agentic":

```mermaid
sequenceDiagram
    participant User
    participant AgentLoop
    participant ChatTemplate
    participant LlamaRuntime
    participant ToolCallParser
    participant ToolRegistry
    participant Tool

    User->>AgentLoop: "What is 42 * 17?"

    loop Up to maxToolRounds
        AgentLoop->>ChatTemplate: apply(messages + toolDefs)
        ChatTemplate-->>AgentLoop: formatted prompt string

        AgentLoop->>LlamaRuntime: generateUntilStop(tokens)

        Note over LlamaRuntime: ML BOUNDARY<br/>Embedding → Transformer Layers<br/>→ RoPE + Attention + KV Cache<br/>→ FFN (SiLU) → RMSNorm → Logits → Sample

        LlamaRuntime-->>AgentLoop: "I'll calculate that.<br/>{\"name\":\"calculator\",\"arguments\":{\"expression\":\"42*17\"}}"

        AgentLoop->>ToolCallParser: parse(response)
        ToolCallParser-->>AgentLoop: [ToolCall("calculator", {expression: "42*17"})]

        AgentLoop->>ToolRegistry: execute(toolCall)
        ToolRegistry->>Tool: execute({expression: "42*17"})
        Tool-->>ToolRegistry: "714"
        ToolRegistry-->>AgentLoop: "714"

        Note over AgentLoop: Append tool result as ChatMessage<br/>with role=TOOL, continue loop
    end

    AgentLoop-->>User: "42 * 17 = 714"
```
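
The cycle above can be sketched in plain Kotlin. This is an illustrative simplification, not the actual `AgentLoop<T>` implementation: the function parameters stand in for `ChatTemplate`, `generateUntilStop()`, `ToolCallParser`, and `ToolRegistry`, and the real signatures in `skainet-kllama-agent` will differ.

```kotlin
// Simplified stand-ins for the agent module's types (names follow the document).
enum class ChatRole { SYSTEM, USER, ASSISTANT, TOOL }
data class ChatMessage(val role: ChatRole, val content: String)
data class ToolCall(val name: String, val arguments: Map<String, String>)

// The generate-parse-execute cycle: prompt the model, parse tool calls out of
// its response, execute them, append the results as TOOL messages, and loop
// until the model answers without calling a tool (or the round budget runs out).
fun agentLoop(
    userInput: String,
    maxToolRounds: Int,
    generate: (String) -> String,                // stands in for ChatTemplate + generateUntilStop()
    parseToolCalls: (String) -> List<ToolCall>,  // stands in for ToolCallParser
    executeTool: (ToolCall) -> String,           // stands in for ToolRegistry dispatch
): String {
    val messages = mutableListOf(ChatMessage(ChatRole.USER, userInput))
    repeat(maxToolRounds) {
        val prompt = messages.joinToString("\n") { "${it.role}: ${it.content}" }
        val response = generate(prompt)
        val calls = parseToolCalls(response)
        if (calls.isEmpty()) return response      // no tool call: this is the final answer
        messages += ChatMessage(ChatRole.ASSISTANT, response)
        for (call in calls) {
            messages += ChatMessage(ChatRole.TOOL, executeTool(call))
        }
    }
    return messages.last().content                // budget exhausted; surface what we have
}
```

Note that the loop itself contains no ML: the only "intelligent" step is hidden behind `generate`, which is where the transformer forward pass lives.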

---

## Diagram 3 — ML vs Orchestration Boundary

What counts as deep learning and what counts as application architecture:

```mermaid
graph LR
    subgraph ORCHESTRATION["Higher-Level: Orchestration"]
        direction TB
        A1["AgentLoop<T><br/><i>control flow</i>"]
        A2["ChatTemplate<br/><i>string formatting</i>"]
        A3["ToolCallParser<br/><i>regex + JSON parsing</i>"]
        A4["ToolRegistry<br/><i>dispatch table</i>"]
        A5["ChatMessage / ChatRole<br/><i>data structures</i>"]
    end

    subgraph ML["Deep Learning: Math"]
        direction TB
        M1["LlamaRuntime.forward()<br/><i>transformer decoder</i>"]
        M2["Embedding lookup"]
        M3["RoPE + Multi-Head Attention"]
        M4["SiLU-gated FFN"]
        M5["RMSNormalization"]
        M6["Softmax sampling"]
        M7["KvCache management"]
        M8["Tensor<T,V> operations<br/><i>matmul, add, silu</i>"]
        M9["SIMD kernels<br/><i>JDK 21 Vector API</i>"]
    end

    ORCHESTRATION -->|"calls"| ML
    ML -->|"returns tokens"| ORCHESTRATION

    style ORCHESTRATION fill:#ffe0e0,stroke:#cc0000
    style ML fill:#e0ffe0,stroke:#00aa00
```
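
To make the "regex + JSON parsing" characterization of `ToolCallParser` concrete, here is a minimal sketch. It assumes the model emits single-line, non-nested JSON objects of the shape shown in Diagram 2 (`{"name":...,"arguments":{...}}`); the real parser may be considerably more tolerant of whitespace, nesting, and surrounding prose.

```kotlin
// Hypothetical, simplified tool-call extraction: plain regexes, no ML involved.
data class ParsedToolCall(val name: String, val arguments: Map<String, String>)

// Matches {"name":"...","arguments":{...}} with string-valued arguments only.
val CALL_REGEX =
    Regex("""\{\s*"name"\s*:\s*"([^"]+)"\s*,\s*"arguments"\s*:\s*\{([^}]*)\}\s*\}""")
// Matches individual "key":"value" pairs inside the arguments object.
val ARG_REGEX = Regex(""""([^"]+)"\s*:\s*"([^"]*)"""")

fun parseToolCalls(response: String): List<ParsedToolCall> =
    CALL_REGEX.findAll(response).map { m ->
        val args = ARG_REGEX.findAll(m.groupValues[2])
            .associate { it.groupValues[1] to it.groupValues[2] }
        ParsedToolCall(m.groupValues[1], args)
    }.toList()
```

The point of the sketch is the boundary, not the regexes: everything here is deterministic string processing that runs *after* the model has finished generating tokens.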

---

## Key Design Insights

### The agent layer adds no trainable parameters

It's pure control flow. The "intelligence" comes entirely from the LLM weights loaded from GGUF files via `LlamaWeightLoader`. `AgentLoop` decides *when* to call the model, not *what* the model says. The orchestration layer is stateless in the ML sense — it holds conversation history (`List<ChatMessage>`) but no learned weights.

### Why it matters anyway

Without the agent loop, the model is a one-shot text completer — you feed it tokens, it predicts the next ones, done. With it, the model can reason over multiple steps, call external tools, and incorporate real-world data. The same `LlamaRuntime<T>` that powers `--chat` mode becomes an autonomous agent in `--agent` mode, simply by wrapping it in `AgentLoop<T>`.

### The clean boundary

`InferenceRuntime<T>.forward(tokenId: Int): Tensor<T, Float>` is the ML boundary. The agent module (`skainet-kllama-agent`) defines this interface, and concrete runtimes such as `LlamaRuntimeInterface<T>` extend it. Everything below (tensors, attention, SIMD kernels in `skainet-backend-cpu`) is deep learning. Everything above (chat formatting in `ChatTemplate`, tool parsing in `ToolCallParser`, the agent loop in `AgentLoop`) is software-engineering orchestration.

```
    ┌──────────────────────────────────┐
    │ AgentLoop / ChatTemplate / CLI   │ ← orchestration (skainet-kllama-agent)
    ├──────────────────────────────────┤
    │ InferenceRuntime<T>.forward()    │ ← THE BOUNDARY
    ├──────────────────────────────────┤
    │ LlamaRuntimeInterface<T>         │ ← extends InferenceRuntime (skainet-kllama)
    │ Attention / FFN / KvCache        │ ← deep learning
    │ Tensor<T,V> / SIMD kernels       │
    └──────────────────────────────────┘
```
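
The practical payoff of this boundary is testability: anything above the line can be exercised against a fake runtime. A minimal sketch, using the signature quoted above — note that the `Tensor` class here is a self-containment placeholder, not skainet-lang-core's real `Tensor<T,V>`:

```kotlin
// Placeholder tensor so the snippet compiles on its own; the real type lives
// in skainet-lang-core and is far richer than a bare FloatArray wrapper.
class Tensor<T, V>(val data: FloatArray)

interface InferenceRuntime<T> {
    // The ML boundary: one token id in, logits over the vocabulary out.
    fun forward(tokenId: Int): Tensor<T, Float>
}

// A stub runtime: no weights, no attention, no SIMD. Orchestration code that
// depends only on InferenceRuntime<T> can be unit-tested against this.
class FakeRuntime : InferenceRuntime<Unit> {
    override fun forward(tokenId: Int): Tensor<Unit, Float> =
        Tensor(FloatArray(32)) // pretend vocabulary of 32 uniform logits
}
```

Swapping `LlamaRuntime<T>` for a stub like this means the agent loop, chat templates, and tool parsing can all be verified without loading a GGUF file.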

### Both layers are in `commonMain`

The agent layer is multiplatform Kotlin, not JVM-specific. `AgentLoop`, `ChatTemplate`, `ToolRegistry`, `ToolCallParser`, and all supporting types live in `skainet-kllama-agent/src/commonMain/`. The same agent loop runs on JVM (with Vector API SIMD), Native, and WASM targets — the only platform-specific code is the backend execution layer (`skainet-backend-cpu`) and the CLI entry point (`skainet-kllama-cli`).

---

## Module Reference

| Layer | Module | Key Types |
|-------|--------|-----------|
| Application | `skainet-apps:skainet-kllama-cli` | `Main.kt` (`--chat`, `--agent`) |
| Agentic | `skainet-apps:skainet-kllama-agent` | `InferenceRuntime<T>`, `AgentLoop<T>`, `ChatTemplate`, `Llama3ChatTemplate`, `ChatMLTemplate`, `ToolRegistry`, `ToolCallParser`, `ToolCall`, `Tool`, `ToolDefinition`, `ChatMessage`, `ChatRole`, `GenerateResult`, `generateUntilStop()`, `sampleFromLogits()` |
| Inference | `skainet-apps:skainet-kllama` | `LlamaRuntime<T>`, `LlamaRuntimeInterface<T>` (extends `InferenceRuntime<T>`), `AttentionBackend<T>`, `CpuAttentionBackend<T>`, `GpuAttentionBackend<T>`, `KvCache`, `HeapKvCache`, `GGUFTokenizer` |
| Model I/O | `skainet-io:skainet-io-gguf`, `skainet-io:skainet-io-safetensors`, `skainet-io:skainet-io-onnx` | `LlamaWeightLoader`, `LlamaRuntimeWeights<T>` |
| Compilation | `skainet-compile:skainet-compile-core`, `skainet-compile-dag`, `skainet-compile-hlo`, `skainet-compile-c` | Tape recording, graph optimization, StableHLO lowering, C99 codegen |
| Tensor/NN | `skainet-lang:skainet-lang-core` | `Tensor<T,V>`, `Shape`, `DType`, `Embedding`, `Linear`, `RMSNormalization` |
| Backend | `skainet-backends:skainet-backend-cpu` | `DirectCpuExecutionContext`, `DefaultCpuOps` |