| 1 | +# Plan: Unified Model Pipeline with Decoupled Tool Calling |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +Currently SKaiNET-transformers has: |
| 6 | +- **5+ hand-coded runtimes** (LlamaRuntime, Qwen35Runtime, Gemma3nRuntime, ApertusRuntime, VoxtralRuntimes) — each reimplements the forward pass, weight loading, and layer execution |
| 7 | +- **Tool calling tightly coupled to kllama** — the AgentLoop, ToolCallingDemo, and chat modes only exist in the kllama runner. Other models (Gemma, Apertus) cannot use tool calling without duplicating code |
| 8 | +- **Two execution paths** — legacy hand-coded runtimes AND the newer `OptimizedLLMRuntime` with DSL/compute-graph/AOT. LlamaRuntime and ApertusRuntime are already marked deprecated |
| 9 | + |
| 10 | +The goal: converge on **one unified pipeline** where model definition, weight loading, tokenization, and tool calling are cleanly separated pipeline stages. |
| 11 | + |
| 12 | +## Architecture Overview |
| 13 | + |
| 14 | +``` |
| 15 | +GGUF/SafeTensors File |
| 16 | + | |
| 17 | +WeightLoader (parse metadata + tensors) |
| 18 | + | |
| 19 | +DSL Network Definition (model-specific, declarative) |
| 20 | + | |
| 21 | +ComputeGraph (DAG) |
| 22 | + | |
| 23 | +Optimization Pipeline (TransposeElim -> WeightDedup -> LLMFusion -> DCE) |
| 24 | + | |
| 25 | +ComputeGraphExecutor (fused kernels) |
| 26 | + | |
| 27 | +InferenceRuntime (unified: forward + generate) |
| 28 | + | |
| 29 | +TokenizationPipeline (encode/decode, special tokens, byte-level BPE) |
| 30 | + | |
| 31 | +ChatPipeline (template formatting, tool calling, agent loop) |
| 32 | +``` |
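The optimization stage in the diagram runs its passes in a fixed order, each consuming the previous pass's output. A minimal sketch of that ordering, assuming a toy flat-list graph: `ToyGraph`, `GraphPass`, and `optimize` are hypothetical names for illustration, and only two of the four listed passes are imitated. The real `ComputeGraph` is a DAG, and its pass API is not shown in this plan.

```kotlin
// Hypothetical stand-in: the real ComputeGraph is a DAG, not a flat op list.
typealias ToyGraph = List<String>

// Hypothetical pass abstraction: takes a graph, returns a rewritten graph.
fun interface GraphPass {
    fun run(graph: ToyGraph): ToyGraph
}

// Imitates TransposeElim: back-to-back transposes cancel each other out.
val transposeElim = GraphPass { graph ->
    val out = ArrayDeque<String>()
    for (op in graph) {
        if (op == "transpose" && out.lastOrNull() == "transpose") {
            out.removeLast()
        } else {
            out.addLast(op)
        }
    }
    out.toList()
}

// Imitates DCE: drops ops marked as unused (a stand-in for liveness analysis).
val deadCodeElim = GraphPass { graph -> graph.filterNot { it.startsWith("dead:") } }

// Applies passes strictly in the given order, feeding each result to the next.
fun optimize(graph: ToyGraph, passes: List<GraphPass>): ToyGraph =
    passes.fold(graph) { g, pass -> pass.run(g) }

fun main() {
    val graph = listOf("matmul", "transpose", "transpose", "dead:debug", "softmax")
    println(optimize(graph, listOf(transposeElim, deadCodeElim))) // prints [matmul, softmax]
}
```

Keeping each pass a pure graph-to-graph function makes the order explicit at the call site, which matters when a later pass (such as fusion) relies on an earlier one having already simplified the graph.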
| 33 | + |
| 34 | +## Phase 1: Decouple Tool Calling from kllama (immediate value) -- DONE |
| 35 | + |
| 36 | +**What was done:** |
| 37 | + |
| 38 | +1. **Enhanced `Tokenizer` interface** with `eosTokenId`, `bosTokenId`, `vocabSize` |
| 39 | + - Updated all implementations: `GGUFTokenizer`, `TokenizerImpl`, `HuggingFaceBPETokenizer`, `TekkenTokenizerAdapter`, `HuggingFaceTokenizer` (BERT) |
| 40 | + |
| 41 | +2. **Created `ChatSession` abstraction** in `llm-agent` |
| 42 | + - File: `llm-agent/.../chat/ChatSession.kt` |
| 43 | + - Bundles `InferenceRuntime` + `Tokenizer` + `ModelMetadata` |
| 44 | + - Provides `createAgentLoop()` and `runSingleTurn()` for any runner |
| 45 | + |
| 46 | +3. **Refactored `ToolCallingDemo` and `AgentCli`** to use `Tokenizer` interface instead of `GGUFTokenizer` |
| 47 | + - Both now accept any `Tokenizer`, not just `GGUFTokenizer` |
| 48 | + - Both use `ChatSession` internally for agent loop creation |
| 49 | + |
| 50 | +4. **Removed `GGUFTokenizer` cast from kllama Main.kt** dispatch |
| 51 | + - Chat/agent/demo modes now work with any `Tokenizer` |
| 52 | + |
| 53 | +5. **Fixed `JavaAgentLoop`** — replaced `GGUFTokenizer` instanceof hack with `tokenizer.eosTokenId` |
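The decoupling described above can be sketched as follows. Only the names `Tokenizer`, `eosTokenId`, `bosTokenId`, `vocabSize`, `InferenceRuntime`, `ChatSession`, and `runSingleTurn` come from this plan; every signature, the `nextToken` method, and the toy `ByteTokenizer` are assumptions made for illustration, and `ModelMetadata` and `createAgentLoop()` are omitted for brevity.

```kotlin
// Hypothetical shapes: only the property and method names come from the plan.
interface Tokenizer {
    val eosTokenId: Int
    val bosTokenId: Int
    val vocabSize: Int
    fun encode(text: String): List<Int>
    fun decode(tokens: List<Int>): String
}

// Assumed single-method runtime: returns the next token id for a context.
fun interface InferenceRuntime {
    fun nextToken(context: List<Int>): Int
}

class ChatSession(
    private val runtime: InferenceRuntime,
    private val tokenizer: Tokenizer,
    private val maxNewTokens: Int = 16,
) {
    // Stops on the tokenizer's own EOS id, so no tokenizer-specific cast is needed.
    fun runSingleTurn(prompt: String): String {
        val context = mutableListOf(tokenizer.bosTokenId)
        context += tokenizer.encode(prompt)
        val reply = mutableListOf<Int>()
        repeat(maxNewTokens) {
            val next = runtime.nextToken(context)
            if (next == tokenizer.eosTokenId) return tokenizer.decode(reply)
            reply += next
            context += next
        }
        return tokenizer.decode(reply)
    }
}

// Toy byte-level tokenizer used only to exercise the sketch.
object ByteTokenizer : Tokenizer {
    override val eosTokenId = 257
    override val bosTokenId = 256
    override val vocabSize = 258
    override fun encode(text: String) = text.encodeToByteArray().map { it.toInt() and 0xFF }
    override fun decode(tokens: List<Int>) =
        tokens.map { it.toByte() }.toByteArray().decodeToString()
}

fun main() {
    // Toy runtime: emits "ok" and then EOS, regardless of the prompt.
    val script = listOf('o'.code, 'k'.code, ByteTokenizer.eosTokenId).iterator()
    val session = ChatSession(InferenceRuntime { script.next() }, ByteTokenizer)
    println(session.runSingleTurn("hi")) // prints ok
}
```

Because the loop reads `eosTokenId` through the interface, swapping in a different tokenizer implementation requires no change to the session code.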
| 54 | + |
| 55 | +## Phase 2: Unified DSL-Based Model Definition (converge on OptimizedLLMRuntime) -- PARTIAL |
| 56 | + |
| 57 | +**What was done:** |
| 58 | + |
| 59 | +1. **Created `ModelRegistry`** in `llm-core/.../ModelRegistry.kt` |
| 60 | + - `ModelFamily` enum: LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL, UNKNOWN |
| 61 | + - `ModelRegistry.detect(architecture)` maps GGUF arch strings to families |
| 62 | + - Tracks capabilities (supportsToolCalling, chatTemplateFamily) |
| 63 | + |
| 64 | +2. **Created `UnifiedModelLoader`** in `llm-core/.../UnifiedModelLoader.kt` |
| 65 | + - `UnifiedModelLoader.peek(source)` extracts `GGUFModelInfo` from GGUF metadata |
| 66 | + - Returns architecture, family, dimensions without loading weights |
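Architecture detection of this kind might look like the sketch below. The `ModelFamily` values and the `detect(architecture)` name come from this plan; the prefix table and the matching rule are assumptions, and the actual mapping in `ModelRegistry.kt` may differ. Capability tracking is left out.

```kotlin
enum class ModelFamily { LLAMA, QWEN, GEMMA, APERTUS, BERT, VOXTRAL, UNKNOWN }

object ModelRegistry {
    // Assumed prefix table: the real mapping of GGUF arch strings may differ.
    private val prefixes = listOf(
        "llama" to ModelFamily.LLAMA,
        "qwen" to ModelFamily.QWEN,
        "gemma" to ModelFamily.GEMMA,
        "apertus" to ModelFamily.APERTUS,
        "bert" to ModelFamily.BERT,
        "voxtral" to ModelFamily.VOXTRAL,
    )

    // Matches case-insensitively on prefix so versioned names such as "qwen2" resolve.
    fun detect(architecture: String): ModelFamily {
        val arch = architecture.trim().lowercase()
        return prefixes.firstOrNull { (prefix, _) -> arch.startsWith(prefix) }?.second
            ?: ModelFamily.UNKNOWN
    }
}

fun main() {
    println(ModelRegistry.detect("qwen2")) // prints QWEN
    println(ModelRegistry.detect("mamba")) // prints UNKNOWN
}
```

Returning `UNKNOWN` instead of throwing lets a caller such as a CLI decide how to report an unsupported model.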
| 67 | + |
| 68 | +**Already existing (no changes needed):** |
| 69 | +- DSL networks: `llamaNetwork()`, `qwenNetwork()`, `apertusNetwork()`, `bertNetwork()`, `voxtralBackboneNetwork()`, `voxtralAcousticNetwork()` |
| 70 | +- `OptimizedLLMRuntime` with DIRECT/OPTIMIZED/HYBRID modes |
| 71 | +- Per-model `NetworkLoader` classes (LlamaNetworkLoader, ApertusNetworkLoader, etc.) |
| 72 | + |
| 73 | +**Remaining (future work):** |
| 74 | +- `gemmaNetwork()` DSL definition (Gemma3n has unique features: GELU, MatFormer variable FFN, sliding window) |
| 75 | +- Migrate CLI runners from deprecated runtimes to OptimizedLLMRuntime |
| 76 | +- Remove deprecated LlamaRuntime and ApertusRuntime |
| 77 | + |
| 78 | +## Phase 3: Tokenization as Pipeline Stage -- DONE |
| 79 | + |
| 80 | +**What was done:** |
| 81 | + |
| 82 | +1. **Enhanced `Tokenizer` interface** with `eosTokenId`, `bosTokenId`, `vocabSize` (done in Phase 1) |
| 83 | + |
| 84 | +2. **Moved `GGUFTokenizer` from kllama to `llm-core`** |
| 85 | + - New location: `llm-core/.../tokenizer/GGUFTokenizer.kt` |
| 86 | + - Old location has a typealias for backwards compatibility |
| 87 | + - Added `skainet-io-gguf` and `kotlinx-io-core` dependencies to `llm-core` |
| 88 | + |
| 89 | +3. **Created `TokenizerFactory`** in `llm-core/.../tokenizer/TokenizerFactory.kt` |
| 90 | + - `TokenizerFactory.fromGGUF(source)` — from GGUF file metadata |
| 91 | + - `TokenizerFactory.fromTokenizerJson(json)` — from HuggingFace tokenizer.json |
| 92 | + - `TokenizerFactory.fromHuggingFace(json, config)` — full HF BPE tokenizer |
| 93 | + |
| 94 | +4. All runners can now use `GGUFTokenizer` and `TokenizerFactory` directly from `llm-core` |
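The backwards-compatibility alias mentioned in item 2 relies on Kotlin's `typealias`, which introduces a second name for the same type, not a new type. A minimal single-file sketch, assuming a simplified constructor: in the real layout the alias would sit in the old kllama package under the same class name, so `LegacyGGUFTokenizer` is a hypothetical name used here only so both declarations fit in one file.

```kotlin
// Stand-in for the class after its move into llm-core (constructor is assumed).
class GGUFTokenizer(private val vocab: List<String>) {
    val vocabSize: Int get() = vocab.size
}

// Hypothetical alias name; existing call sites keep compiling against the old name.
typealias LegacyGGUFTokenizer = GGUFTokenizer

fun main() {
    val old: LegacyGGUFTokenizer = LegacyGGUFTokenizer(listOf("<s>", "</s>", "a"))
    val moved: GGUFTokenizer = old // compiles: alias and class are the same type
    println(moved.vocabSize) // prints 3
}
```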
| 95 | + |
| 96 | +## Phase 4: Unified Runner (single CLI entry point) -- DONE |
| 97 | + |
| 98 | +**What was done:** |
| 99 | + |
| 100 | +1. **Created `llm-apps/skainet-cli`** — new unified CLI module |
| 101 | + - Auto-detects architecture from GGUF metadata via `UnifiedModelLoader.peek()` |
| 102 | + - Loads any LLaMA-compatible model (LLaMA, Qwen, Mistral) |
| 103 | + - Supports `--chat`, `--agent`, `--demo` modes with tool calling |
| 104 | + - Uses `TokenizerFactory.fromGGUF()` for tokenizer loading |
| 105 | + - Registered as `skainet` runner in smoke test script |
| 106 | + |
| 107 | +2. **Usage:** |
| 108 | + ```bash |
| 109 | + skainet -m model.gguf "The capital of France is" # auto-detect, generate |
| 110 | + skainet -m model.gguf --chat # interactive chat |
| 111 | + skainet -m model.gguf --demo "What is 2+2?" # tool calling demo |
| 112 | + ``` |
| 113 | + |
| 114 | +3. **Existing per-model CLIs are preserved** — no breaking changes |
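The flag handling shown in the usage lines could be parsed roughly as follows. This is a sketch under assumptions: `Mode`, `CliOptions`, and `parseArgs` are hypothetical names, and the real `skainet-cli` may use an argument-parsing library and accept more options than the ones listed in this plan.

```kotlin
enum class Mode { GENERATE, CHAT, AGENT, DEMO }

data class CliOptions(val modelPath: String, val mode: Mode, val prompt: String?)

// Parses `-m <path> [--chat|--agent|--demo] [prompt]`; plain generation is the default.
fun parseArgs(args: List<String>): CliOptions {
    var modelPath: String? = null
    var mode = Mode.GENERATE
    var prompt: String? = null
    val iter = args.iterator()
    while (iter.hasNext()) {
        when (val arg = iter.next()) {
            "-m" -> modelPath = if (iter.hasNext()) iter.next() else error("-m requires a path")
            "--chat" -> mode = Mode.CHAT
            "--agent" -> mode = Mode.AGENT
            "--demo" -> mode = Mode.DEMO
            else -> prompt = arg // any other argument is treated as the prompt
        }
    }
    return CliOptions(modelPath ?: error("missing -m <model.gguf>"), mode, prompt)
}

fun main() {
    val options = parseArgs(listOf("-m", "model.gguf", "--demo", "What is 2+2?"))
    println(options.mode) // prints DEMO
}
```

Resolving the mode before any model is loaded keeps argument errors cheap: a missing `-m` fails immediately without touching the GGUF file.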
| 115 | + |
| 116 | +**Remaining (future work):** |
| 117 | +- Add Gemma3n loading path to unified CLI (requires gemmaNetwork() DSL) |
| 118 | +- Add Apertus loading path to unified CLI |
| 119 | +- Eventually deprecate per-model CLIs |
| 120 | + |
| 121 | +## Phase Status |
| 122 | + |
| 123 | +| Phase | Status | Summary |
| 124 | +|-------|--------|---------|
| 125 | +| 1. Decouple tool calling | DONE | ChatSession, Tokenizer interface, no GGUFTokenizer coupling |
| 126 | +| 2. Unified model definition | PARTIAL | ModelRegistry, UnifiedModelLoader, ModelFamily enum done; gemmaNetwork() and runtime migration remain |
| 127 | +| 3. Tokenization pipeline | DONE | GGUFTokenizer in llm-core, TokenizerFactory |
| 128 | +| 4. Unified runner | DONE | skainet-cli with auto-detection; Gemma3n and Apertus loading paths remain |
| 129 | +3. **Phase 2** then — biggest refactor, needs per-model validation |
| 130 | +4. **Phase 4** last — depends on all other phases |