| 1 | +# Feature Investigation — ideas from `llama-stack-client-kotlin` |
| 2 | + |
| 3 | +Comparison source: [ogx-ai/llama-stack-client-kotlin](https://github.com/ogx-ai/llama-stack-client-kotlin) |
| 4 | +(version 0.2.14, a Stainless-generated Kotlin SDK for the Llama Stack REST API |
| 5 | +with an optional ExecuTorch-backed local-inference path). |
| 6 | + |
| 7 | +This document inventories candidate features for `java-llama.cpp` derived from |
| 8 | +that comparison, with rough effort estimates. Effort is given in |
| 9 | +T-shirt sizes: |
| 10 | + |
| 11 | +| Size | Calendar effort (1 engineer) | Description | |
| 12 | +|------|------------------------------|-------------| |
| 13 | +| XS | < 0.5 day | Trivial Java-side change, no JNI | |
| 14 | +| S | 0.5 – 2 days | Java surface + minor JNI/JSON wiring | |
| 15 | +| M | 2 – 5 days | New JNI methods, native plumbing, tests | |
| 16 | +| L | 1 – 2 weeks | New native subsystem or large API surface | |
| 17 | +| XL | > 2 weeks | Architectural addition | |
| 18 | + |
| 19 | +## 1. What `java-llama.cpp` already covers |
| 20 | + |
| 21 | +| Capability | Status | |
| 22 | +|---------------------------------------------------------|--------| |
| 23 | +| Chat / completion (blocking + streaming via `LlamaIterable`) | ✅ | |
| 24 | +| Embeddings (`embed(String)`) | ✅ | |
| 25 | +| Rerank (`rerank(query, docs)`) | ✅ | |
| 26 | +| Grammar + JSON-schema constrained output | ✅ | |
| 27 | +| Rich sampling (DRY, mirostat, dyn-temp, XTC, top-n-σ) | ✅ | |
| 28 | +| Reasoning budget / reasoning format | ✅ | |
| 29 | +| Slot save / restore / erase | ✅ | |
| 30 | +| `continueFinalMessage` | ✅ | |
| 31 | +| Tokenize / decode / template apply | ✅ | |
| 32 | +| Metrics string (`getMetrics()`) | ✅ | |
| 33 | +| Speculative draft model wiring | ✅ | |
| 34 | + |
| 35 | +These do not need work — they already match or exceed the Kotlin client. |
| 36 | + |
| 37 | +## 2. Recommended additions (in priority order) |
| 38 | + |
| 39 | +### 2.1 Multimodal image input (mtmd) — **L** |
| 40 | + |
| 41 | +**Gap.** Upstream llama.cpp ships `mtmd` (vision + audio for some models) and |
| 42 | +the compiled-in server already pulls it in via `mtmd.h` / `mtmd-helper.h`. No |
| 43 | +Java method currently accepts image input. Kotlin examples show base64 image |
| 44 | +chat against vision models. |
| 45 | + |
| 46 | +**Proposal.** |
| 47 | +- `InferenceParameters.addImage(byte[] png)` / `addImage(Path)` / `addImageBase64(String)`. |
| 48 | +- `ModelParameters.setMmproj(Path)` to load the mmproj projector file. |
| 49 | +- JNI: feed images into the server task params (`mtmd_*` API). |
| 50 | + |
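As a rough illustration of what `addImageBase64(String)` might produce on the Java side, the sketch below encodes raw image bytes into a base64 data URI and wraps it in an OpenAI-style `image_url` content part. The class `ImageParts` and both method names are hypothetical, and it is an assumption that the native chat path would accept this JSON shape.

```java
import java.util.Base64;

/** Hypothetical helper for illustration; not part of java-llama.cpp today. */
final class ImageParts {
    private ImageParts() {}

    /** Encodes raw image bytes as a base64 data URI, e.g. data:image/png;base64,.... */
    static String toDataUri(byte[] imageBytes, String mimeType) {
        String b64 = Base64.getEncoder().encodeToString(imageBytes);
        return "data:" + mimeType + ";base64," + b64;
    }

    /**
     * Builds an OpenAI-style image content part as a JSON string.
     * Base64 output needs no JSON escaping; mimeType is assumed to be a
     * plain token such as "image/png" (no quotes or backslashes).
     */
    static String imageUrlPart(byte[] imageBytes, String mimeType) {
        return "{\"type\":\"image_url\",\"image_url\":{\"url\":\""
                + toDataUri(imageBytes, mimeType) + "\"}}";
    }
}
```

An `addImage(Path)` overload could read the file with `Files.readAllBytes` and delegate to the same helper.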
| 51 | +**Effort: L** — non-trivial JNI plumbing, lifecycle of `mtmd_context`, |
| 52 | +test fixtures for vision models, but most of the heavy lifting is already |
| 53 | +upstream. |
| 54 | + |
| 55 | +**Value.** Biggest user-visible capability missing today. Unlocks Qwen-VL, |
| 56 | +Gemma 3, MiniCPM-V, LLaVA, etc. |
| 57 | + |
| 58 | +--- |
| 59 | + |
| 60 | +### 2.2 Typed `ChatMessage` / `ChatResponse` model + tool calling — **M** |
| 61 | + |
| 62 | +**Gap.** Today the chat API is `setMessages(String system, List<Pair<String,String>>)` |
| 63 | +plus `chatComplete`, which returns a plain `String`. The server *parses* tool calls |
| 64 | +(`common_chat_*` infrastructure), but Java callers must extract them from the JSON |
| 65 | +themselves. Kotlin exposes typed `ChatCompletionRequest` / `ChatResponse` |
| 66 | +with `toolCalls`, `finishReason`, `usage`. |
| 67 | + |
| 68 | +**Proposal.** |
| 69 | +- `ChatMessage` record: `role` (enum: system/user/assistant/tool), `content` |
| 70 | + (text + optional image parts), `toolCalls`, `toolCallId`, `name`. |
| 71 | +- `ToolDefinition` record (name, description, JSON-schema parameters) and |
| 72 | + `InferenceParameters.setTools(List<ToolDefinition>)`. |
| 73 | +- `ChatResponse` record: `content`, `finishReason` (enum), `toolCalls`, |
| 74 | + `usage` (see §2.5), `timings`, optional `logprobs`. |
| 75 | +- New `LlamaModel.chat(ChatRequest)` / `chatStream(ChatRequest)`. |
| 76 | + |
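The data model above could look roughly like the following. This is a minimal sketch, not a settled design: it uses plain final classes rather than records so that it compiles on Java 8, and the `ToolCall` type, the factory method names, and the field layout are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.List;

/** Tool description handed to the model; parameters are raw JSON Schema text. */
final class ToolDefinition {
    final String name;
    final String description;
    final String parametersJsonSchema;

    ToolDefinition(String name, String description, String parametersJsonSchema) {
        this.name = name;
        this.description = description;
        this.parametersJsonSchema = parametersJsonSchema;
    }
}

/** One tool invocation requested by the model (hypothetical shape). */
final class ToolCall {
    final String id;
    final String name;
    final String argumentsJson;

    ToolCall(String id, String name, String argumentsJson) {
        this.id = id;
        this.name = name;
        this.argumentsJson = argumentsJson;
    }
}

/** Immutable chat message; factories keep role-specific fields consistent. */
final class ChatMessage {
    enum Role { SYSTEM, USER, ASSISTANT, TOOL }

    final Role role;
    final String content;           // may be null for pure tool-call messages
    final List<ToolCall> toolCalls; // never null, possibly empty
    final String toolCallId;        // set only for TOOL messages

    private ChatMessage(Role role, String content, List<ToolCall> toolCalls, String toolCallId) {
        this.role = role;
        this.content = content;
        this.toolCalls = Collections.unmodifiableList(toolCalls);
        this.toolCallId = toolCallId;
    }

    static ChatMessage user(String text) {
        return new ChatMessage(Role.USER, text, Collections.<ToolCall>emptyList(), null);
    }

    static ChatMessage assistantToolCalls(List<ToolCall> calls) {
        return new ChatMessage(Role.ASSISTANT, null, calls, null);
    }

    static ChatMessage toolResult(String toolCallId, String content) {
        return new ChatMessage(Role.TOOL, content, Collections.<ToolCall>emptyList(), toolCallId);
    }
}
```

Static factories are used here so that a `TOOL` message always carries a `toolCallId` and a `USER` message never does; a builder would work equally well.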
| 77 | +**Effort: M** — Java-side data model + JSON marshalling + a wrapper around |
| 78 | +the existing native chat path. No new native code needed. |
| 79 | + |
| 80 | +**Value.** Brings the library in line with every modern Java LLM SDK |
| 81 | +(LangChain4j, Spring AI). Removes the "parse the JSON string yourself" |
| 82 | +papercut. |
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +### 2.3 Async / non-blocking API — **S–M** |
| 87 | + |
| 88 | +**Gap.** All `LlamaModel` methods are blocking. Kotlin offers |
| 89 | +`suspend fun` + Flow variants. JVM callers must currently dedicate a |
| 90 | +platform thread to each in-flight inference. |
| 91 | + |
| 92 | +**Proposal.** |
| 93 | +- `CompletableFuture<String> completeAsync(InferenceParameters)` |
| 94 | +- `CompletableFuture<ChatResponse> chatAsync(ChatRequest)` |
| 95 | +- Reactive Streams `Publisher<String> generateReactive(...)` (or |
| 96 | + `Flow.Publisher` from `java.util.concurrent.Flow` to avoid a new |
| 97 | + dependency). |
| 98 | +- Backed by the existing native worker thread inside `jllama_context`; |
| 99 | + no extra Java thread pool. |
| 100 | + |
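One way to get a `CompletableFuture` without an extra Java thread pool is to complete the future from whatever thread delivers the native result. The sketch below assumes a callback-style submit entry point; `CallbackCompleter` and `AsyncBridge` are hypothetical names standing in for a JNI method that does not exist yet.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/** Hypothetical callback-style entry point; stands in for a native submit call. */
interface CallbackCompleter {
    void submit(String prompt, Consumer<String> onResult, Consumer<Throwable> onError);
}

final class AsyncBridge {
    private AsyncBridge() {}

    /**
     * Wraps a callback-based submit in a CompletableFuture. The future is
     * completed on whichever thread invokes the callback, so no Java-side
     * executor is created here.
     */
    static CompletableFuture<String> completeAsync(CallbackCompleter backend, String prompt) {
        CompletableFuture<String> future = new CompletableFuture<>();
        try {
            backend.submit(prompt, future::complete, future::completeExceptionally);
        } catch (RuntimeException e) {
            // Submission itself failed: surface the error through the future.
            future.completeExceptionally(e);
        }
        return future;
    }
}
```

Because dependent stages then run on the completing thread by default, callers that chain heavy work may want the `thenApplyAsync` variants with their own executor.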
| 101 | +**Effort: S** for `CompletableFuture` wrappers, **M** for a proper |
| 102 | +backpressure-aware `Flow.Publisher` streaming token producer. |
| 103 | + |
| 104 | +**Value.** Enables composition with virtual threads, Project Reactor, |
| 105 | +RxJava, and Kotlin coroutines on the consumer side. |
| 106 | + |
| 107 | +--- |
| 108 | + |
| 109 | +### 2.4 Batch inference across slots — **M** |
| 110 | + |
| 111 | +**Gap.** llama.cpp natively serves parallel slots; the compiled-in server |
| 112 | +handles concurrent tasks. `LlamaModel` exposes no batch entry point. |
| 113 | + |
| 114 | +**Proposal.** |
| 115 | +- `List<String> completeBatch(List<InferenceParameters>)`. |
| 116 | +- `List<ChatResponse> chatBatch(List<ChatRequest>)`. |
| 117 | +- `Stream<String> generateBatch(...)` returning results in submission order. |
| 118 | +- Configure parallelism via existing `ModelParameters` (e.g. `n_parallel`). |
| 119 | + |
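The dispatch-then-await pattern behind `completeBatch` can be sketched independently of the native layer. The example below assumes some asynchronous submit function already exists (for instance the `completeAsync` idea from §2.3); `BatchDispatch` and `runAll` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class BatchDispatch {
    private BatchDispatch() {}

    /**
     * Submits every request before waiting on any of them, then collects
     * results in submission order regardless of completion order.
     */
    static <P, R> List<R> runAll(List<P> requests, Function<P, CompletableFuture<R>> submit) {
        List<CompletableFuture<R>> pending = new ArrayList<>(requests.size());
        for (P request : requests) {
            pending.add(submit.apply(request)); // dispatch all tasks first
        }
        List<R> results = new ArrayList<>(pending.size());
        for (CompletableFuture<R> future : pending) {
            results.add(future.join()); // blocks; rethrows failures as CompletionException
        }
        return results;
    }
}
```

With this shape, one failing request fails the whole batch; a variant returning per-item results or errors would be needed if partial success matters.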
| 120 | +**Effort: M** — server already supports it; needs a Java entry that |
| 121 | +dispatches N tasks, awaits all, and maps results. |
| 122 | + |
| 123 | +**Value.** Throughput multiplier for embedding / classification / |
| 124 | +rerank pipelines; cheap to add because the server already schedules parallel slots. |
| 125 | + |
| 126 | +--- |
| 127 | + |
| 128 | +### 2.5 Typed `Usage` / `Timings` result — **XS–S** |
| 129 | + |
| 130 | +**Gap.** `getMetrics()` returns a raw JSON `String`. Kotlin exposes |
| 131 | +`Usage(promptTokens, completionTokens, totalTokens)` plus a richer |
| 132 | +`Timings` (`tokensPerSecond`, `promptMs`, `predictedMs`, `cacheHit`, |
| 133 | +`drafted`, `accepted`). |
| 134 | + |
| 135 | +**Proposal.** |
| 136 | +- `Usage` and `Timings` records. |
| 137 | +- Returned as part of `ChatResponse` (§2.2) and from a new |
| 138 | + `LlamaModel.getMetricsTyped()` accessor. |
| 139 | + |
| 140 | +**Effort: XS** if scoped to `Timings`/`Usage` parsing only, |
| 141 | +**S** when wired into the new `ChatResponse`. |
| 142 | + |
| 143 | +**Value.** Removes ad-hoc JSON parsing. Cheap. |
| 144 | + |
| 145 | +--- |
| 146 | + |
| 147 | +### 2.6 `Session` helper (multi-turn) — **S–M** |
| 148 | + |
| 149 | +**Gap.** Slots exist as a low-level primitive. Kotlin offers |
| 150 | +"agents/sessions/turns" with persistence and resume. |
| 151 | + |
| 152 | +**Proposal.** |
| 153 | +- `Session` (`AutoCloseable`) owning a slot id, an accumulating |
| 154 | + `List<ChatMessage>`, and `save(Path)` / `restore(Path)` that delegate |
| 155 | + to `saveSlot` / `restoreSlot`. |
| 156 | +- `session.send(userMessage)` → `ChatResponse`. |
| 157 | +- `session.stream(userMessage)` → `LlamaIterable`. |
| 158 | + |
| 159 | +**Effort: S** if it stays a thin wrapper over existing slot APIs, |
| 160 | +**M** if we add proper concurrency / per-session locking. |
| 161 | + |
| 162 | +**Value.** Common conversational use case becomes a one-liner. |
| 163 | + |
| 164 | +--- |
| 165 | + |
| 166 | +### 2.7 Stream cancellation & `AutoCloseable` iterator — **S** |
| 167 | + |
| 168 | +**Gap.** `LlamaIterable` / `LlamaIterator` cannot be cancelled mid-stream; |
| 169 | +the underlying slot task keeps running until natural stop. Kotlin marks |
| 170 | +streaming returns `@MustBeClosed`. |
| 171 | + |
| 172 | +**Proposal.** |
| 173 | +- Make `LlamaIterator` implement `AutoCloseable` with `close()` cancelling |
| 174 | + the server task (slot stop). |
| 175 | +- Document try-with-resources usage. |
| 176 | + |
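A closeable iterator along these lines might look as follows. This is a sketch under assumptions: the `Runnable` stands in for the proposed native `cancel_task(task_id)` call, cancelling an already-finished task is assumed to be a harmless no-op, and `CancellableTokenIterator` is a hypothetical name rather than the existing `LlamaIterator`.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicBoolean;

final class CancellableTokenIterator implements Iterator<String>, AutoCloseable {
    private final Iterator<String> source;
    private final Runnable cancelTask; // hypothetical hook wrapping the native cancel call
    private final AtomicBoolean closed = new AtomicBoolean(false);

    CancellableTokenIterator(Iterator<String> source, Runnable cancelTask) {
        this.source = source;
        this.cancelTask = cancelTask;
    }

    @Override
    public boolean hasNext() {
        // After close() the stream reports exhaustion instead of touching the source.
        return !closed.get() && source.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return source.next();
    }

    /** Idempotent: the cancel hook runs at most once, on the first close(). */
    @Override
    public void close() {
        if (closed.compareAndSet(false, true)) {
            cancelTask.run();
        }
    }
}
```

Used in a try-with-resources block, breaking out of the loop early then triggers the cancel hook automatically when the block exits.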
| 177 | +**Effort: S** — small JNI addition (`cancel_task(task_id)`), straightforward |
| 178 | +Java side. |
| 179 | + |
| 180 | +**Value.** Prevents wasted compute when the consumer gives up early. |
| 181 | + |
| 182 | +--- |
| 183 | + |
| 184 | +### 2.8 Structured-output convenience helpers — **S** |
| 185 | + |
| 186 | +**Gap.** `setJsonSchema` / `setGrammar` already exist on `ModelParameters` |
| 187 | +but not on `InferenceParameters`. No typed-result helper. |
| 188 | + |
| 189 | +**Proposal.** |
| 190 | +- Add `setJsonSchema(String)` / `setGrammar(String)` to |
| 191 | + `InferenceParameters` (per-request schema). |
| 192 | +- `<T> T completeAsJson(Class<T>, InferenceParameters)` — derive the |
| 193 | + JSON schema from the class via Jackson (`JacksonModule` already on |
| 194 | + classpath?) or document a manual-schema variant. |
| 195 | + |
| 196 | +**Effort: S**. |
| 197 | + |
| 198 | +**Value.** Matches the "structured outputs" UX of OpenAI / Llama Stack |
| 199 | +SDKs. |
| 200 | + |
| 201 | +--- |
| 202 | + |
| 203 | +### 2.9 Logprobs in the typed result — **S** |
| 204 | + |
| 205 | +**Gap.** `setNProbs` exists; the result type is a plain `String`, so |
| 206 | +per-token probabilities are not surfaced. |
| 207 | + |
| 208 | +**Proposal.** |
| 209 | +- Add `List<TokenLogprob> logprobs` to `ChatResponse` (§2.2) and to a |
| 210 | + new `LlamaOutput` field for non-chat completion. |
| 211 | +- `TokenLogprob { String token; int tokenId; float logprob; |
| 212 | + List<TokenLogprob> topLogprobs; }`. |
| 213 | + |
| 214 | +**Effort: S**. |
| 215 | + |
| 216 | +**Value.** Required for evaluation / uncertainty estimation use cases. |
| 217 | + |
| 218 | +--- |
| 219 | + |
| 220 | +### 2.10 Cancellation token / abort for blocking calls — **S** |
| 221 | + |
| 222 | +**Gap.** A blocking `complete(...)` cannot be aborted from another thread. |
| 223 | + |
| 224 | +**Proposal.** |
| 225 | +- `complete(params, CancellationToken token)` overload. |
| 226 | +- Token wraps a native task id; `token.cancel()` issues a stop. |
| 227 | + |
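The token has to cope with `cancel()` arriving before the native task id is known. The sketch below handles that ordering with a late-bound stop hook; the `bind` method and the idea that the library calls it once the task exists are assumptions, and the `Runnable` again stands in for the `cancel_task` primitive from §2.7.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

final class CancellationToken {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);
    private final AtomicReference<Runnable> stopHook = new AtomicReference<>();

    /** Called by the library (hypothetically) once the native task exists. */
    void bind(Runnable hook) {
        stopHook.set(hook);
        if (cancelled.get()) {
            fire(); // cancel() was requested before the task was bound
        }
    }

    /** Safe to call from any thread, any number of times. */
    void cancel() {
        if (cancelled.compareAndSet(false, true)) {
            fire();
        }
    }

    boolean isCancelled() {
        return cancelled.get();
    }

    private void fire() {
        // getAndSet(null) guarantees the hook runs at most once even if
        // bind() and cancel() race with each other.
        Runnable hook = stopHook.getAndSet(null);
        if (hook != null) {
            hook.run();
        }
    }
}
```

A UI timeout would then amount to scheduling `token.cancel()` on a timer and passing the same token to `complete(params, token)`.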
| 228 | +**Effort: S** — overlaps with §2.7's `cancel_task` primitive. |
| 229 | + |
| 230 | +**Value.** Enables UI timeouts and request-scoped cancellation in |
| 231 | +web frameworks. |
| 232 | + |
| 233 | +--- |
| 234 | + |
| 235 | +## 3. Considered but **not recommended** |
| 236 | + |
| 237 | +| Kotlin feature | Why we should not port it | |
| 238 | +|------------------------------|---------------------------| |
| 239 | +| Agents framework / Turns | Belongs in a higher-level "stack" layer (LangChain4j, Spring AI). Out of scope for a thin JNI binding. | |
| 240 | +| Shields / Safety service | Same — separate library concern. | |
| 241 | +| Eval / Benchmarks / Scoring | Separate test/eval tooling, not inference. | |
| 242 | +| SyntheticDataGeneration | Out of scope. | |
| 243 | +| Telemetry spans (OTEL) | Users can add their own around `LlamaModel` calls. | |
| 244 | +| VectorDB / VectorStore / RAG | Would duplicate ObjectBox / Lucene / pgvector / Chroma integrations. Higher-level libs do this better. | |
| 245 | +| Files / Datasets | REST-only concept; no on-device analogue. | |
| 246 | +| HTTP client conveniences (retries, proxy, timeouts) | N/A for in-process JNI. | |
| 247 | + |
| 248 | +## 4. Suggested rollout order |
| 249 | + |
| 250 | +1. **§2.1 Multimodal (L)** — biggest capability gap, isolated subsystem. |
| 251 | +2. **§2.2 Typed Chat model + tool calling (M)** — foundational; other |
| 252 | + features (usage, logprobs, async) all return / accept these types. |
| 253 | +3. **§2.5 Usage / Timings (XS–S)** — quick, lands inside §2.2. |
| 254 | +4. **§2.7 + §2.10 Cancellation primitive (S)** — small JNI add, unblocks §2.3. |
| 255 | +5. **§2.3 Async API (S–M)** — `CompletableFuture` first, reactive later. |
| 256 | +6. **§2.4 Batch inference (M)**. |
| 257 | +7. **§2.6 Session helper (S–M)**. |
| 258 | +8. **§2.8 Structured-output helpers (S)**. |
| 259 | +9. **§2.9 Logprobs in result (S)**. |
| 260 | + |
| 261 | +Total realistic effort for the full list: **~5–7 engineer-weeks**, with |
| 262 | +multimodal alone accounting for roughly a third of that. |
| 263 | + |
| 264 | +## 5. Open questions |
| 265 | + |
| 266 | +- **Multimodal scope** — vision only first, or also audio (Qwen-Audio, |
| 267 | + Whisper-style mmproj)? |
| 268 | +- **Reactive dep** — use `java.util.concurrent.Flow` (JDK 9+ but project |
| 269 | + targets bytecode 1.8) or add Reactive Streams as an optional Maven |
| 270 | + classifier? |
| 271 | +- **Tool calling** — expose the raw `common_chat_*` parser output, or |
| 272 | + normalize to an OpenAI-compatible shape? |
| 273 | +- **Session persistence format** — reuse llama.cpp's slot binary file, or |
| 274 | + also store the message history as JSON next to it? |