Commit f79fa05

docs: add feature investigation comparing with llama-stack-client-kotlin (#183)
Inventory of candidate features for java-llama.cpp derived from a comparison with ogx-ai/llama-stack-client-kotlin, with T-shirt effort estimates and a suggested rollout order. Co-authored-by: Claude <noreply@anthropic.com>
1 parent 9cbe98f commit f79fa05

1 file changed

Lines changed: 274 additions & 0 deletions

# Feature Investigation: ideas from `llama-stack-client-kotlin`

Comparison source: [ogx-ai/llama-stack-client-kotlin](https://github.com/ogx-ai/llama-stack-client-kotlin)
(version 0.2.14, a Stainless-generated Kotlin SDK for the Llama Stack REST API
with an optional ExecuTorch-backed local-inference path).

This document inventories candidate features for `java-llama.cpp` derived from
that comparison, with rough effort estimates. Effort is given in
T-shirt sizes:

| Size | Calendar effort (1 engineer) | Description |
|------|------------------------------|-------------|
| XS   | < 0.5 day                    | Trivial Java-side change, no JNI |
| S    | 0.5 – 2 days                 | Java surface + minor JNI/JSON wiring |
| M    | 2 – 5 days                   | New JNI methods, native plumbing, tests |
| L    | 1 – 2 weeks                  | New native subsystem or large API surface |
| XL   | > 2 weeks                    | Architectural addition |

## 1. What `java-llama.cpp` already covers

| Capability                                                   | Status  |
|--------------------------------------------------------------|---------|
| Chat / completion (blocking + streaming via `LlamaIterable`) | Covered |
| Embeddings (`embed(String)`)                                 | Covered |
| Rerank (`rerank(query, docs)`)                               | Covered |
| Grammar + JSON-schema constrained output                     | Covered |
| Rich sampling (DRY, mirostat, dyn-temp, XTC, top-n-σ)        | Covered |
| Reasoning budget / reasoning format                          | Covered |
| Slot save / restore / erase                                  | Covered |
| `continueFinalMessage`                                       | Covered |
| Tokenize / decode / template apply                           | Covered |
| Metrics string (`getMetrics()`)                              | Covered |
| Speculative draft model wiring                               | Covered |

These need no work: they already match or exceed the Kotlin client.

## 2. Recommended additions (in priority order)

### 2.1 Multimodal image input (mtmd): **L**

**Gap.** Upstream llama.cpp ships `mtmd` (vision + audio for some models) and
the compiled-in server already pulls it in via `mtmd.h` / `mtmd-helper.h`. No
Java method currently accepts image input. Kotlin examples show base64 image
chat against vision models.

**Proposal.**
- `InferenceParameters.addImage(byte[] png)` / `addImage(Path)` / `addImageBase64(String)`.
- `ModelParameters.setMmproj(Path)` to load the mmproj projector file.
- JNI: feed images into the server task params (`mtmd_*` API).

**Effort: L.** Non-trivial JNI plumbing, lifecycle of `mtmd_context`, and
test fixtures for vision models, but most of the heavy lifting is already
upstream.

**Value.** Biggest user-visible capability missing today. Unlocks Qwen-VL,
Gemma 3, MiniCPM-V, LLaVA, etc.
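
A minimal sketch of how the three proposed `addImage*` overloads could funnel into a single base64 representation before anything crosses JNI. `ImageInputs` and `asBase64()` are hypothetical names used only for illustration; the real methods would presumably live on `InferenceParameters`, and the hand-off to the `mtmd_*` API is not shown.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for per-request images; not part of the current API. */
final class ImageInputs {
    private final List<String> base64Images = new ArrayList<>();

    /** Accepts raw image bytes and stores them base64-encoded. */
    ImageInputs addImage(byte[] imageBytes) {
        base64Images.add(Base64.getEncoder().encodeToString(imageBytes));
        return this;
    }

    /** Reads the file eagerly so later changes on disk do not affect the request. */
    ImageInputs addImage(Path file) throws IOException {
        return addImage(Files.readAllBytes(file));
    }

    /** Accepts already-encoded data; decoding once rejects malformed input early. */
    ImageInputs addImageBase64(String base64) {
        Base64.getDecoder().decode(base64); // throws IllegalArgumentException if invalid
        base64Images.add(base64);
        return this;
    }

    /** Read-only view in insertion order, ready to be marshalled into the request. */
    List<String> asBase64() {
        return Collections.unmodifiableList(base64Images);
    }
}
```

Normalizing to base64 on the Java side would keep the native signature to one string form, at the cost of encoding overhead for large images. Whether that trade-off is acceptable is an open design choice, not something the proposal settles.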

---

### 2.2 Typed `ChatMessage` / `ChatResponse` model + tool calling: **M**

**Gap.** Today: `setMessages(String system, List<Pair<String,String>>)` and
`chatComplete → String`. The server *parses* tool calls
(`common_chat_*` infrastructure) but Java callers must scrape JSON
themselves. Kotlin exposes typed `ChatCompletionRequest` / `ChatResponse`
with `toolCalls`, `finishReason`, `usage`.

**Proposal.**
- `ChatMessage` record: `role` (enum: system/user/assistant/tool), `content`
  (text + optional image parts), `toolCalls`, `toolCallId`, `name`.
- `ToolDefinition` record (name, description, JSON-schema parameters) and
  `InferenceParameters.setTools(List<ToolDefinition>)`.
- `ChatResponse` record: `content`, `finishReason` (enum), `toolCalls`,
  `usage` (see §2.5), `timings`, optional `logprobs`.
- New `LlamaModel.chat(ChatRequest)` / `chatStream(ChatRequest)`.

**Effort: M.** Java-side data model + JSON marshalling + a wrapper around
the existing native chat path. No new native code needed.

**Value.** Brings the library in line with every modern Java LLM SDK
(LangChain4j, Spring AI). Removes the "parse the JSON string yourself"
papercut.
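
One possible shape for the message model, sketched with plain final classes rather than records because the project targets 1.8 bytecode (see the open questions). The factory-method names and the `argumentsJson` field are assumptions for illustration, not a settled design; `name` and image content parts are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Roles named in the proposal. */
enum Role { SYSTEM, USER, ASSISTANT, TOOL }

/** One tool invocation requested by the model; arguments stay as raw JSON text. */
final class ToolCall {
    final String id;
    final String name;
    final String argumentsJson;

    ToolCall(String id, String name, String argumentsJson) {
        this.id = id;
        this.name = name;
        this.argumentsJson = argumentsJson;
    }
}

/** Immutable chat message; factories keep invalid field combinations out. */
final class ChatMessage {
    final Role role;
    final String content;           // may be null when the assistant only emits tool calls
    final List<ToolCall> toolCalls; // empty unless role == ASSISTANT
    final String toolCallId;        // set only when role == TOOL

    private ChatMessage(Role role, String content, List<ToolCall> toolCalls, String toolCallId) {
        this.role = role;
        this.content = content;
        this.toolCalls = Collections.unmodifiableList(new ArrayList<>(toolCalls));
        this.toolCallId = toolCallId;
    }

    static ChatMessage system(String text) {
        return new ChatMessage(Role.SYSTEM, text, Collections.<ToolCall>emptyList(), null);
    }

    static ChatMessage user(String text) {
        return new ChatMessage(Role.USER, text, Collections.<ToolCall>emptyList(), null);
    }

    static ChatMessage assistant(String text, List<ToolCall> calls) {
        return new ChatMessage(Role.ASSISTANT, text, calls, null);
    }

    static ChatMessage toolResult(String toolCallId, String resultText) {
        return new ChatMessage(Role.TOOL, resultText, Collections.<ToolCall>emptyList(), toolCallId);
    }
}
```

A tool round trip would then read as `ChatMessage.assistant(null, calls)` followed by `ChatMessage.toolResult(call.id, result)`, with the marshalling to the server's JSON shape kept in one place.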

---

### 2.3 Async / non-blocking API: **S–M**

**Gap.** All `LlamaModel` methods are blocking. Kotlin offers
`suspend fun` + Flow variants. JVM users currently dedicate a platform
thread to each inference.

**Proposal.**
- `CompletableFuture<String> completeAsync(InferenceParameters)`
- `CompletableFuture<ChatResponse> chatAsync(ChatRequest)`
- Reactive Streams `Publisher<String> generateReactive(...)` (or
  `Flow.Publisher` from `java.util.concurrent.Flow` to avoid a new
  dependency).
- Backed by the existing native worker thread inside `jllama_context`;
  no extra Java thread pool.

**Effort: S** for `CompletableFuture` wrappers, **M** for a proper
backpressure-aware `Flow.Publisher` streaming token producer.

**Value.** Lets JVM consumers compose inference with virtual threads,
Project Reactor, RxJava, and Kotlin coroutines.
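
The `CompletableFuture` half of the proposal could be a thin adapter, assuming the native worker can report completion through a callback. `CallbackBackend` and `AsyncAdapter` are hypothetical names; how `jllama_context` would actually surface such a callback through JNI is not covered here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/** Hypothetical callback-style entry point; the worker invokes exactly one callback. */
interface CallbackBackend {
    void complete(String prompt, Consumer<String> onResult, Consumer<Throwable> onError);
}

final class AsyncAdapter {
    private AsyncAdapter() {}

    /** Bridges the callback API to a future without creating any Java thread. */
    static CompletableFuture<String> completeAsync(CallbackBackend backend, String prompt) {
        CompletableFuture<String> future = new CompletableFuture<>();
        try {
            backend.complete(prompt, future::complete, future::completeExceptionally);
        } catch (RuntimeException e) {
            // A failure while submitting the task is reported through the future as well.
            future.completeExceptionally(e);
        }
        return future;
    }
}
```

Because the future is completed on whichever thread runs the callback, dependent stages attached with `thenApply` would also run there unless the caller uses the `*Async` variants with its own executor.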

---

### 2.4 Batch inference across slots: **M**

**Gap.** llama.cpp natively serves parallel slots; the compiled-in server
handles concurrent tasks. `LlamaModel` exposes no batch entry point.

**Proposal.**
- `List<String> completeBatch(List<InferenceParameters>)`.
- `List<ChatResponse> chatBatch(List<ChatRequest>)`.
- `Stream<String> generateBatch(...)` returning results in submission order.
- Configure parallelism via existing `ModelParameters` (e.g. n_parallel).

**Effort: M.** The server already supports it; this needs a Java entry point
that dispatches N tasks, awaits all, and maps results.

**Value.** Throughput multiplier for embedding / classification /
rerank pipelines; close to a free win.
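
The "dispatch N tasks, await all, map results" step can be sketched generically. Here `submit` stands in for whatever call would enqueue one task on the server and return a future; it is an assumption, as is the `BatchDispatcher` name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class BatchDispatcher {
    private BatchDispatcher() {}

    /**
     * Submits every request before waiting on any of them, then collects
     * results in submission order regardless of completion order.
     */
    static <I, O> List<O> completeBatch(List<I> requests,
                                        Function<I, CompletableFuture<O>> submit) {
        List<CompletableFuture<O>> pending = new ArrayList<>(requests.size());
        for (I request : requests) {
            pending.add(submit.apply(request));
        }
        List<O> results = new ArrayList<>(pending.size());
        for (CompletableFuture<O> future : pending) {
            results.add(future.join()); // rethrows a task failure as CompletionException
        }
        return results;
    }
}
```

Submitting everything up front is what lets the server spread tasks across its parallel slots; waiting inside the first loop would serialize them.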

---

### 2.5 Typed `Usage` / `Timings` result: **XS–S**

**Gap.** `getMetrics()` returns a raw JSON `String`. Kotlin exposes
`Usage(promptTokens, completionTokens, totalTokens)` plus a richer
`Timings` (`tokensPerSecond`, `promptMs`, `predictedMs`, `cacheHit`,
`drafted`, `accepted`).

**Proposal.**
- `Usage` and `Timings` records.
- Returned as part of `ChatResponse` (§2.2) and from a new
  `LlamaModel.getMetricsTyped()` accessor.

**Effort: XS** if scoped to `Timings`/`Usage` parsing only,
**S** when wired into the new `ChatResponse`.

**Value.** Removes ad-hoc JSON parsing. Cheap.
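
A sketch of what the two value types might look like once the JSON has been parsed; the parsing itself is left out. The `predictedTokens` field and the derived `draftAcceptanceRate()` are assumptions added for illustration and are not among the fields listed above.

```java
/** Token accounting for one request. */
final class Usage {
    final int promptTokens;
    final int completionTokens;

    Usage(int promptTokens, int completionTokens) {
        this.promptTokens = promptTokens;
        this.completionTokens = completionTokens;
    }

    /** Derived rather than stored, so it cannot disagree with its parts. */
    int totalTokens() {
        return promptTokens + completionTokens;
    }
}

/** Timing and speculative-decoding counters for one request. */
final class Timings {
    final double promptMs;
    final double predictedMs;
    final int predictedTokens; // assumed field: number of generated tokens
    final int drafted;
    final int accepted;

    Timings(double promptMs, double predictedMs, int predictedTokens, int drafted, int accepted) {
        this.promptMs = promptMs;
        this.predictedMs = predictedMs;
        this.predictedTokens = predictedTokens;
        this.drafted = drafted;
        this.accepted = accepted;
    }

    /** Generation throughput; 0 when no time was recorded. */
    double tokensPerSecond() {
        return predictedMs <= 0 ? 0.0 : predictedTokens * 1000.0 / predictedMs;
    }

    /** Share of drafted tokens the target model accepted; 0 without a draft model. */
    double draftAcceptanceRate() {
        return drafted == 0 ? 0.0 : (double) accepted / drafted;
    }
}
```

As a worked case, 50 tokens generated in 500 ms gives 50 × 1000 / 500 = 100 tokens per second.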

---

### 2.6 `Session` helper (multi-turn): **S–M**

**Gap.** Slots exist as a low-level primitive. Kotlin offers
"agents/sessions/turns" with persistence and resume.

**Proposal.**
- `Session` (`AutoCloseable`) owning a slot id, an accumulating
  `List<ChatMessage>`, and `save(Path)` / `restore(Path)` that delegate
  to `saveSlot` / `restoreSlot`.
- `session.send(userMessage)` → `ChatResponse`.
- `session.stream(userMessage)` → `LlamaIterable`.

**Effort: S** if it stays a thin wrapper over existing slot APIs,
**M** if we add proper concurrency / per-session locking.

**Value.** Common conversational use case becomes a one-liner.
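
A sketch of the thin-wrapper variant (the **S** end of the estimate). `ChatBackend` is a hypothetical stand-in for the native chat path, messages are reduced to `{role, content}` pairs, and `save` / `restore` are omitted; none of this is thread-safe, which is exactly what the **M** variant would add.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the native chat path, addressed by slot id. */
interface ChatBackend {
    String chat(int slotId, List<String[]> history); // each entry: {role, content}
}

/** Accumulates the conversation for one slot; not thread-safe. */
final class Session implements AutoCloseable {
    private final ChatBackend backend;
    private final int slotId;
    private final List<String[]> history = new ArrayList<>();
    private boolean closed;

    Session(ChatBackend backend, int slotId) {
        this.backend = backend;
        this.slotId = slotId;
    }

    /** Sends one user turn and records both sides only if the call succeeds. */
    String send(String userMessage) {
        if (closed) {
            throw new IllegalStateException("session is closed");
        }
        String[] userTurn = {"user", userMessage};
        List<String[]> request = new ArrayList<>(history);
        request.add(userTurn);
        String reply = backend.chat(slotId, Collections.unmodifiableList(request));
        history.add(userTurn);
        history.add(new String[] {"assistant", reply});
        return reply;
    }

    /** Number of recorded messages (user and assistant). */
    int messageCount() {
        return history.size();
    }

    @Override
    public void close() {
        closed = true; // a real implementation would also release or erase the slot
    }
}
```

Building the request on a copy means a failed call leaves the recorded history unchanged, so the caller can simply retry the same turn.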

---

### 2.7 Stream cancellation & `AutoCloseable` iterator: **S**

**Gap.** `LlamaIterable` / `LlamaIterator` cannot be cancelled mid-stream;
the underlying slot task keeps running until natural stop. Kotlin marks
streaming returns `@MustBeClosed`.

**Proposal.**
- Make `LlamaIterator` implement `AutoCloseable` with `close()` cancelling
  the server task (slot stop).
- Document try-with-resources usage.

**Effort: S.** Small JNI addition (`cancel_task(task_id)`), straightforward
Java side.

**Value.** Prevents wasted compute when the consumer gives up early.
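
The close-cancels-the-task behaviour can be sketched as a wrapper around any iterator. `CancellableIterator` is a hypothetical name, and the `Runnable` stands in for the proposed native `cancel_task(task_id)` call, which does not exist yet.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Wraps a token stream so that closing it early cancels the producing task. */
final class CancellableIterator<T> implements Iterator<T>, AutoCloseable {
    private final Iterator<T> tokens;
    private final Runnable cancelTask; // stand-in for the native cancel call
    private boolean exhausted;
    private boolean closed;

    CancellableIterator(Iterator<T> tokens, Runnable cancelTask) {
        this.tokens = tokens;
        this.cancelTask = cancelTask;
    }

    @Override
    public boolean hasNext() {
        if (closed || exhausted) {
            return false;
        }
        if (!tokens.hasNext()) {
            exhausted = true; // natural stop: nothing left to cancel
            return false;
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return tokens.next();
    }

    /** Idempotent; cancels only if the stream was abandoned before its natural end. */
    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        if (!exhausted) {
            cancelTask.run();
        }
    }
}
```

With try-with-resources, a consumer that breaks out of its loop after the first few tokens triggers the cancel automatically, while a consumer that drains the stream does not.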

---

### 2.8 Structured-output convenience helpers: **S**

**Gap.** `setJsonSchema` / `setGrammar` already exist on `ModelParameters`
but not on `InferenceParameters`. No typed-result helper.

**Proposal.**
- Add `setJsonSchema(String)` / `setGrammar(String)` to
  `InferenceParameters` (per-request schema).
- `<T> T completeAsJson(Class<T>, InferenceParameters)`: derive the
  JSON schema from the class via Jackson (`JacksonModule` already on
  classpath?) or document a manual-schema variant.

**Effort: S**.

**Value.** Matches the "structured outputs" UX of OpenAI / Llama Stack
SDKs.
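
If Jackson turns out not to be available, the schema-derivation step could be approximated with plain reflection. The sketch below is deliberately narrow: it handles only flat classes with `String`, boolean, and numeric fields, and `SchemaSketch` and `City` are hypothetical names. It is not how Jackson derives schemas.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

final class SchemaSketch {
    private SchemaSketch() {}

    /** Maps a small set of Java field types to JSON-schema type names. */
    static String jsonType(Class<?> t) {
        if (t == String.class) return "string";
        if (t == boolean.class || t == Boolean.class) return "boolean";
        if (t == int.class || t == Integer.class || t == long.class || t == Long.class) return "integer";
        if (t == float.class || t == Float.class || t == double.class || t == Double.class) return "number";
        throw new IllegalArgumentException("unsupported field type: " + t.getName());
    }

    /** Builds an object schema; properties are sorted by name for stable output. */
    static String schemaFor(Class<?> type) {
        Map<String, String> properties = new TreeMap<>();
        for (Field field : type.getDeclaredFields()) {
            if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())) {
                continue;
            }
            properties.put(field.getName(), jsonType(field.getType()));
        }
        StringBuilder json = new StringBuilder("{\"type\":\"object\",\"properties\":{");
        String separator = "";
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            json.append(separator).append('"').append(entry.getKey())
                .append("\":{\"type\":\"").append(entry.getValue()).append("\"}");
            separator = ",";
        }
        json.append("},\"required\":[");
        separator = "";
        for (String name : properties.keySet()) {
            json.append(separator).append('"').append(name).append('"');
            separator = ",";
        }
        return json.append("]}").toString();
    }
}

/** Example target type for a structured completion. */
final class City {
    String name;
    int population;
}
```

`SchemaSketch.schemaFor(City.class)` yields an object schema with `name` as a string and `population` as an integer, both required; the resulting string is what a per-request `setJsonSchema(String)` would receive.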

---

### 2.9 Logprobs in the typed result: **S**

**Gap.** `setNProbs` exists; the result type is a plain `String`, so
per-token probabilities are not surfaced.

**Proposal.**
- Add `List<TokenLogprob> logprobs` to `ChatResponse` (§2.2) and to a
  new `LlamaOutput` field for non-chat completion.
- `TokenLogprob { String token; int tokenId; float logprob;
  List<TokenLogprob> topLogprobs; }`.

**Effort: S**.

**Value.** Required for evaluation / uncertainty estimation use cases.
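
A sketch of the proposed `TokenLogprob` type together with the kind of evaluation code it would enable. The `probability()` and `perplexity(...)` helpers are illustrative additions, not part of the proposal; perplexity here is the usual exponential of the negative mean log-probability.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Log-probability of one generated token plus its top alternatives. */
final class TokenLogprob {
    final String token;
    final int tokenId;
    final float logprob;
    final List<TokenLogprob> topLogprobs;

    TokenLogprob(String token, int tokenId, float logprob, List<TokenLogprob> topLogprobs) {
        this.token = token;
        this.tokenId = tokenId;
        this.logprob = logprob;
        this.topLogprobs = Collections.unmodifiableList(new ArrayList<>(topLogprobs));
    }

    /** Converts the natural-log probability back to a probability in [0, 1]. */
    double probability() {
        return Math.exp(logprob);
    }

    /** Perplexity of a sequence: exp of the negative mean log-probability. */
    static double perplexity(List<TokenLogprob> tokens) {
        if (tokens.isEmpty()) {
            throw new IllegalArgumentException("perplexity is undefined for an empty sequence");
        }
        double sum = 0.0;
        for (TokenLogprob t : tokens) {
            sum += t.logprob;
        }
        return Math.exp(-sum / tokens.size());
    }
}
```

For a sequence in which every token had probability 0.5, this gives a perplexity of about 2, which matches the intuition of a coin flip per token.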

---

### 2.10 Cancellation token / abort for blocking calls: **S**

**Gap.** A blocking `complete(...)` cannot be aborted from another thread.

**Proposal.**
- `complete(params, CancellationToken token)` overload.
- Token wraps a native task id; `token.cancel()` issues a stop.

**Effort: S.** Overlaps with §2.7's `cancel_task` primitive.

**Value.** Enables UI timeouts and request-scoped cancellation in
web frameworks.
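
One way the token could behave, sketched without any native code. The `bind(Runnable)` method is an assumption: it models the moment `complete(...)` learns its task id and can hand the token a stop action. The sketch also covers the case where `cancel()` arrives before the task has been bound.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

/** Thread-safe, one-shot cancellation handle for a single blocking call. */
final class CancellationToken {
    private final AtomicBoolean cancelled = new AtomicBoolean();
    private final AtomicReference<Runnable> stopAction = new AtomicReference<>();

    /** Called by the blocking method once the native task exists (hypothetical hook). */
    void bind(Runnable stopTask) {
        stopAction.set(stopTask);
        if (cancelled.get()) {
            runStopOnce(); // cancel() won the race, so stop the task right away
        }
    }

    /** May be called from any thread, any number of times. */
    void cancel() {
        if (cancelled.compareAndSet(false, true)) {
            runStopOnce();
        }
    }

    boolean isCancelled() {
        return cancelled.get();
    }

    /** getAndSet(null) guarantees the stop action runs at most once. */
    private void runStopOnce() {
        Runnable action = stopAction.getAndSet(null);
        if (action != null) {
            action.run();
        }
    }
}
```

A web handler could keep the token in request scope and call `cancel()` from a timeout thread while the request thread is still blocked in `complete(params, token)`.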

---

## 3. Considered but **not recommended**

| Kotlin feature               | Why we should not port it |
|------------------------------|---------------------------|
| Agents framework / Turns     | Belongs in a higher-level "stack" layer (LangChain4j, Spring AI). Out of scope for a thin JNI binding. |
| Shields / Safety service     | Same: a separate library concern. |
| Eval / Benchmarks / Scoring  | Separate test/eval tooling, not inference. |
| SyntheticDataGeneration      | Out of scope. |
| Telemetry spans (OTEL)       | Users can add their own around `LlamaModel` calls. |
| VectorDB / VectorStore / RAG | Would duplicate ObjectBox / Lucene / pgvector / Chroma integrations. Higher-level libs do this better. |
| Files / Datasets             | REST-only concept; no on-device analogue. |
| HTTP client conveniences (retries, proxy, timeouts) | N/A for in-process JNI. |

## 4. Suggested rollout order

1. **§2.1 Multimodal (L)**: biggest capability gap, isolated subsystem.
2. **§2.2 Typed Chat model + tool calling (M)**: foundational; other
   features (usage, logprobs, async) all return / accept these types.
3. **§2.5 Usage / Timings (XS)**: quick, lands inside §2.2.
4. **§2.7 + §2.10 Cancellation primitive (S)**: small JNI add, unblocks §2.3.
5. **§2.3 Async API (S–M)**: `CompletableFuture` first, reactive later.
6. **§2.4 Batch inference (M)**.
7. **§2.6 Session helper (S)**.
8. **§2.8 Structured-output helpers (S)**.
9. **§2.9 Logprobs in result (S)**.

Total realistic effort for the full list: **~5–7 engineer-weeks**, with
multimodal alone accounting for roughly a third of that.

## 5. Open questions

- **Multimodal scope**: vision only first, or also audio (Qwen-Audio,
  Whisper-style mmproj)?
- **Reactive dep**: use `java.util.concurrent.Flow` (JDK 9+ but project
  targets bytecode 1.8) or add Reactive Streams as an optional Maven
  classifier?
- **Tool calling**: expose the raw `common_chat_*` parser output, or
  normalize to an OpenAI-compatible shape?
- **Session persistence format**: reuse llama.cpp's slot binary file, or
  also store the message history as JSON next to it?
