Commit 72b9054

Annotate docs with shipped/open status for the §2 feature inventory
The kotlin-comparison doc and the open-issues doc were both stale after PR #188 shipped 11 features. Bring them up to date in place rather than introducing a separate changelog:

docs/feature-investigation-llama-stack-client-kotlin.md
- §1 capability matrix gets new rows for everything that landed (typed chat + tools, async wrappers, reactive Publisher, batch dispatch, Usage/Timings, Session, completeAsJson, TokenLogprob, CancellationToken, LoadProgressCallback).
- New §1.1 status legend (SHIPPED / PARTIAL / OPEN).
- Each §2.x section now starts with a Status: line summarising what shipped, with commit refs into this PR. §2.2/2.3/2.4/2.5/2.7/2.8/2.9/2.10 marked SHIPPED. §2.6 PARTIAL (locking deferred). §2.10 is SHIPPED as cooperative cancel only; immediate cancel needs a new server-side JNI primitive (M-effort follow-up). §2.1 OPEN (multimodal image API).

docs/history/49be664_open_issues.md
- #113 updated STILL POSSIBLE -> FIXED in PR #188 commit 70df324, with a note that the richer payload (file name, bytes, weights flag) is intentionally not exposed because the upstream llama_model_params.progress_callback emits only a float.

No code changes, no test impact.

https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy
1 parent afa4f65 commit 72b9054

2 files changed

Lines changed: 113 additions & 2 deletions

docs/feature-investigation-llama-stack-client-kotlin.md

Lines changed: 97 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -29,15 +29,42 @@ T-shirt sizes:
2929
| Slot save / restore / erase ||
3030
| `continueFinalMessage` ||
3131
| Tokenize / decode / template apply ||
32-
| Metrics string (`getMetrics()`) ||
32+
| Metrics string (`getMetrics()`) + typed `ServerMetrics` ||
3333
| Speculative draft model wiring ||
34+
| Typed `ChatRequest` / `ChatResponse` + tool calling | ✅ (§2.2) |
35+
| `CompletableFuture` async wrappers | ✅ (§2.3) |
36+
| Reactive Streams `Publisher<LlamaOutput>` token stream | ✅ (§2.3) |
37+
| `completeBatch` / `chatBatch` parallel dispatch | ✅ (§2.4) |
38+
| Typed `Usage` / `Timings` / `CompletionResult` | ✅ (§2.5) |
39+
| `Session` helper (single-threaded) | ✅ (§2.6) |
40+
| `AutoCloseable` iterator + cancel polish | ✅ (§2.7) |
41+
| Per-request `setJsonSchema` + `completeAsJson<T>` | ✅ (§2.8) |
42+
| Typed `TokenLogprob` in `LlamaOutput` | ✅ (§2.9) |
43+
| `CancellationToken` (cooperative) | ✅ (§2.10) |
44+
| `LoadProgressCallback` model-load progress | ✅ (#113) |
3445

3546
These do not need work — they already match or exceed the Kotlin client.
3647

48+
### 1.1 Status legend for §2
49+
50+
Each §2.x subsection below carries a **Status:** line at the top:
51+
52+
| Marker | Meaning |
53+
|--------|---------|
54+
| `SHIPPED` | Fully landed; commit refs follow. |
55+
| `PARTIAL` | Core landed; a documented follow-up remains (called out inline). |
56+
| `OPEN` | Not started. |
57+
58+
All references point to PR #188 on the `claude/upbeat-hypatia-wPdK5`
59+
branch unless noted.
60+
3761
## 2. Recommended additions (in priority order)
3862

3963
### 2.1 Multimodal image input (mtmd) — **L**
4064

65+
**Status: OPEN.** `ModelParameters.setMmproj` already wires the projector; no
66+
typed Java image API yet. Same gap as issues #103 / #34.
67+
4168
**Gap.** Upstream llama.cpp ships `mtmd` (vision + audio for some models) and
4269
the compiled-in server already pulls it in via `mtmd.h` / `mtmd-helper.h`. No
4370
Java method currently accepts image input. Kotlin examples show base64 image
@@ -59,6 +86,16 @@ Gemma 3, MiniCPM-V, LLaVA, etc.
5986

6087
### 2.2 Typed `ChatMessage` / `ChatResponse` model + tool calling — **M**
6188

89+
**Status: SHIPPED** (PR #188, commit `f2c7ed1`). New value types
90+
`ChatRequest`, `ChatResponse`, `ChatChoice`, `ChatMessage` (extended),
91+
`ToolCall`, `ToolDefinition`, plus `ToolHandler` functional interface.
92+
`LlamaModel.chat(ChatRequest)` returns a typed `ChatResponse`;
93+
`chatWithTools(ChatRequest, Map<String, ToolHandler>)` runs the agent
94+
auto-loop (capturing handler exceptions as `{"error":...}` tool results
95+
so the loop continues; cap via `ChatRequest.maxToolRounds`, default 8).
96+
The tier-1 (typed response only) and tier-2 (manual tool round-trip via
97+
`ChatMessage.toolResult`) APIs are equally usable.
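
The auto-loop behaviour described in the Status note above can be illustrated with a small, self-contained sketch. It only demonstrates the two documented properties: a handler exception is captured as an `{"error":...}` tool result so the loop keeps going, and the loop stops after a round cap (8 by default per the note). `ToolLoopSketch`, `Step`, and the `Function<List<String>, Step>` model stand-in are hypothetical names invented for this example, and returning `null` at the cap is an assumption; the real `chatWithTools` signature and cap behaviour may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only; none of these types are the library's real classes. */
final class ToolLoopSketch {

    /** Hypothetical stand-in for the ToolHandler functional interface. */
    @FunctionalInterface
    interface ToolHandler {
        String handle(String argumentsJson) throws Exception;
    }

    /** Hypothetical model turn: either a final answer or a single tool call. */
    static final class Step {
        final String finalText; // non-null when the model is done
        final String toolName;  // non-null when the model requests a tool
        final String arguments;

        private Step(String finalText, String toolName, String arguments) {
            this.finalText = finalText;
            this.toolName = toolName;
            this.arguments = arguments;
        }

        static Step answer(String text) {
            return new Step(text, null, null);
        }

        static Step call(String toolName, String arguments) {
            return new Step(null, toolName, arguments);
        }
    }

    /** Runs the loop; returns null if the cap is hit first (an assumption of this sketch). */
    static String run(Function<List<String>, Step> model,
                      Map<String, ToolHandler> handlers,
                      int maxToolRounds) {
        List<String> toolResults = new ArrayList<>();
        for (int round = 0; round < maxToolRounds; round++) {
            Step step = model.apply(toolResults);
            if (step.finalText != null) {
                return step.finalText;
            }
            String result;
            try {
                ToolHandler handler = handlers.get(step.toolName);
                if (handler == null) {
                    throw new IllegalArgumentException("unknown tool: " + step.toolName);
                }
                result = handler.handle(step.arguments);
            } catch (Exception e) {
                // Capture the failure as a tool result so the loop continues.
                result = "{\"error\":\"" + escape(String.valueOf(e.getMessage())) + "\"}";
            }
            toolResults.add(result);
        }
        return null;
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Feeding the error back as an ordinary tool result lets the model react to the failure (retry, pick another tool, or apologise) instead of the whole request failing.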
98+
6299
**Gap.** Today: `setMessages(String system, List<Pair<String,String>>)` and
63100
`chatComplete → String`. The server *parses* tool calls
64101
(`common_chat_*` infrastructure) but Java callers must scrape JSON
@@ -85,6 +122,15 @@ papercut.
85122

86123
### 2.3 Async / non-blocking API — **S–M**
87124

125+
**Status: SHIPPED.** `CompletableFuture` wrappers (`completeAsync`,
126+
`chatCompleteAsync`, `chatCompleteTextAsync`, plus a
127+
`completeAsync(params, CancellationToken)` bridge that propagates
128+
`future.cancel(true)` into the cooperative token) in commit `1e673a9`.
129+
The reactive `Publisher<LlamaOutput>` follow-up (backpressure via
130+
Reactive Streams, single-subscriber) shipped in commit `afa4f65` as
131+
`LlamaModel.streamPublisher(...)` and `streamChatPublisher(...)` backed
132+
by `LlamaPublisher`. New runtime dep: `org.reactivestreams:reactive-streams:1.0.4`.
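
One way the `future.cancel(true)` to token bridge mentioned above could work is sketched below using only the JDK. `AsyncBridgeSketch` and its nested `CancellationToken` are hypothetical stand-ins written for this example, not the library's classes; the blocking call is modelled as a plain `Function` that polls the token.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Illustrative sketch only; names and shapes are assumptions, not the library API. */
final class AsyncBridgeSketch {

    /** Hypothetical cooperative token: a volatile flag the worker polls. */
    static final class CancellationToken {
        private volatile boolean cancelled;

        void cancel() {
            cancelled = true;
        }

        boolean isCancelled() {
            return cancelled;
        }
    }

    /** Wraps a blocking call in a future whose cancellation trips the token. */
    static <T> CompletableFuture<T> completeAsync(Function<CancellationToken, T> blockingCall,
                                                  CancellationToken token) {
        CompletableFuture<T> future = new CompletableFuture<>();
        // CompletableFuture.cancel does not interrupt the worker thread, so the
        // cancellation is forwarded to the cooperative token instead.
        future.whenComplete((value, error) -> {
            if (future.isCancelled()) {
                token.cancel();
            }
        });
        CompletableFuture.runAsync(() -> {
            try {
                future.complete(blockingCall.apply(token));
            } catch (Throwable t) {
                future.completeExceptionally(t);
            }
        });
        return future;
    }
}
```

The worker still finishes its current step before it notices the flag, which matches the cooperative semantics described in §2.10.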
133+
88134
**Gap.** All `LlamaModel` methods are blocking. Kotlin offers
89135
`suspend fun` + Flow variants. JVM users currently dedicate platform
90136
threads per inference.
@@ -108,6 +154,13 @@ RxJava, Kotlin coroutines from Java consumers.
108154

109155
### 2.4 Batch inference across slots — **M**
110156

157+
**Status: SHIPPED** (PR #188, commit `de457b2`).
158+
`LlamaModel.completeBatch(List<InferenceParameters>)`,
159+
`completeBatchWithStats(...)`, and `chatBatch(List<ChatRequest>)` dispatch
160+
all requests at once via the existing async wrappers; results are returned in
161+
input order. Throughput scales with `ModelParameters.setParallel(N)`
162+
(default `N=1` runs sequentially across the single slot).
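
The "dispatch everything, then collect in input order" pattern from the Status note above can be sketched with plain `CompletableFuture`s. `BatchSketch.dispatchAll` is a hypothetical helper for illustration; the `async` function stands in for whichever async wrapper the real batch methods call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Illustrative sketch only; not the library's batch implementation. */
final class BatchSketch {

    /** Submits every request first, then joins the futures in input order. */
    static <I, O> List<O> dispatchAll(List<I> requests, Function<I, CompletableFuture<O>> async) {
        List<CompletableFuture<O>> futures = new ArrayList<>(requests.size());
        for (I request : requests) {
            futures.add(async.apply(request)); // all requests are in flight before any join
        }
        List<O> results = new ArrayList<>(futures.size());
        for (CompletableFuture<O> future : futures) {
            results.add(future.join()); // index i of the result matches index i of the input
        }
        return results;
    }
}
```

Because the results list is filled by walking the futures in submission order, completion order does not matter; a slow first request simply delays the return, not the ordering.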
163+
111164
**Gap.** llama.cpp natively serves parallel slots; the compiled-in server
112165
handles concurrent tasks. `LlamaModel` exposes no batch entry point.
113166

@@ -127,6 +180,12 @@ rerank pipelines; close to a free win.
127180

128181
### 2.5 Typed `Usage` / `Timings` result — **XS–S**
129182

183+
**Status: SHIPPED** (PR #188, commits `fe1cf3b` + `c529499`). New `Usage`,
184+
`Timings`, and `ServerMetrics` value classes; `LlamaModel.getMetricsTyped()`
185+
parses server-wide metrics. Per-completion `Usage`/`Timings` land in
186+
`ChatResponse` (§2.2) and in the new `CompletionResult` returned by
187+
`LlamaModel.completeWithStats(InferenceParameters)`.
188+
130189
**Gap.** `getMetrics()` returns a raw JSON `String`. Kotlin exposes
131190
`Usage(promptTokens, completionTokens, totalTokens)` plus a richer
132191
`Timings` (`tokensPerSecond`, `promptMs`, `predictedMs`, `cacheHit`,
@@ -146,6 +205,12 @@ rerank pipelines; close to a free win.
146205

147206
### 2.6 `Session` helper (multi-turn) — **S–M**
148207

208+
**Status: PARTIAL** (PR #188, commit `e4f531c`). `Session` ships as an
209+
`AutoCloseable` wrapper with `send(...)`, `stream(...)`,
210+
`commitStreamedReply(...)`, `save(Path)` / `restore(Path)`, and an
211+
optional `InferenceParameters` customizer. Single-thread only in this
212+
pass — per-session locking is the remaining M-effort follow-up.
213+
149214
**Gap.** Slots exist as a low-level primitive. Kotlin offers
150215
"agents/sessions/turns" with persistence and resume.
151216

@@ -165,6 +230,12 @@ rerank pipelines; close to a free win.
165230

166231
### 2.7 Stream cancellation & `AutoCloseable` iterator — **S**
167232

233+
**Status: SHIPPED** (PR #188, commit `d1c9fb0`). `LlamaIterator` already
234+
implemented `AutoCloseable` with `cancel()`/`close()`; this commit
235+
audited the path, documented the cancel-vs-stop nuance and idempotency
236+
in the javadoc, added a try-with-resources example on
237+
`LlamaModel.generate(...)`, and added `testIteratorCloseIdempotent`.
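
The idempotent-close contract mentioned above can be shown with a generic iterator wrapper. `ClosableTokenIterator` and its `releaseSlot` hook are hypothetical names for this sketch and do not reflect how `LlamaIterator` is actually implemented; the point is only that an explicit `close()` followed by the implicit try-with-resources close releases the resource once.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Illustrative sketch only; a stand-in for a closeable token stream. */
final class ClosableTokenIterator implements Iterator<String>, AutoCloseable {
    private final Iterator<String> source;
    private final Runnable releaseSlot; // hypothetical hook, e.g. cancelling the slot task
    private boolean closed;

    ClosableTokenIterator(Iterator<String> source, Runnable releaseSlot) {
        this.source = source;
        this.releaseSlot = releaseSlot;
    }

    @Override
    public boolean hasNext() {
        return !closed && source.hasNext();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return source.next();
    }

    /** Safe to call repeatedly; only the first call releases the resource. */
    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        releaseSlot.run();
    }
}
```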
238+
168239
**Gap.** `LlamaIterable` / `LlamaIterator` cannot be cancelled mid-stream;
169240
the underlying slot task keeps running until natural stop. Kotlin marks
170241
streaming returns `@MustBeClosed`.
@@ -183,6 +254,13 @@ Java side.
183254

184255
### 2.8 Structured-output convenience helpers — **S**
185256

257+
**Status: SHIPPED** (PR #188, commit `80e5c13`).
258+
`InferenceParameters.setJsonSchema(String)` mirrors the existing
259+
`setGrammar`. `LlamaModel.completeAsJson(Class<T>, String schema, InferenceParameters)`
260+
sets the schema and Jackson-deserializes the result. The
261+
single-argument overload `completeAsJson(Class<T>, InferenceParameters)`
262+
trusts that the caller already set schema/grammar.
263+
186264
**Gap.** `setJsonSchema` / `setGrammar` already exist on `ModelParameters`
187265
but not on `InferenceParameters`. No typed-result helper.
188266

@@ -202,6 +280,13 @@ SDKs.
202280

203281
### 2.9 Logprobs in the typed result — **S**
204282

283+
**Status: SHIPPED** (PR #188, commit `a8077b6`). `TokenLogprob` value
284+
type carries `token`, `tokenId`, `logprob`, and the nested
285+
`topLogprobs` alternatives. `LlamaOutput.logprobs` is populated by
286+
`CompletionResponseParser.parseLogprobs` (post-sampling `prob` or
287+
pre-sampling `logprob` mode auto-detected). Also surfaces in
288+
`CompletionResult.getLogprobs()` (§2.5).
289+
205290
**Gap.** `setNProbs` exists; the result type is a plain `String`, so
206291
per-token probabilities are not surfaced.
207292

@@ -219,6 +304,17 @@ per-token probabilities are not surfaced.
219304

220305
### 2.10 Cancellation token / abort for blocking calls — **S**
221306

307+
**Status: SHIPPED — cooperative only** (PR #188, commits `ad66e3a` +
308+
`e3b9043`). `CancellationToken.cancel()` sets a `volatile` flag observed
309+
between tokens by the inference loop in
310+
`LlamaModel.complete(InferenceParameters, CancellationToken)`. Effective
311+
latency is one token interval (the loop checks at each token boundary).
312+
Immediate cancel (requiring a new server-side stop-task JNI primitive)
313+
is the remaining M-effort follow-up. The initial impl tried to abort
314+
mid-token via a cross-thread JNI call; the resulting race caused
315+
a `std::system_error` JVM abort in CI, so the impl was reverted to the safe
316+
cooperative path.
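
A minimal sketch of the cooperative scheme described above: a `volatile` flag that the generation loop checks once per token boundary, which is why cancel latency is about one token interval. `CooperativeCancelSketch` and the iterator-based "inference loop" are hypothetical; returning the partial text on cancel is an assumption of this sketch, and the real `complete(...)` may behave differently.

```java
import java.util.Iterator;

/** Illustrative sketch only; the token source stands in for the native inference loop. */
final class CooperativeCancelSketch {

    /** Hypothetical token: cancel() may be called from any thread. */
    static final class CancellationToken {
        private volatile boolean cancelled;

        void cancel() {
            cancelled = true;
        }

        boolean isCancelled() {
            return cancelled;
        }
    }

    /** Appends tokens until the source is exhausted or the flag is observed. */
    static String complete(Iterator<String> tokens, CancellationToken token) {
        StringBuilder text = new StringBuilder();
        while (tokens.hasNext()) {
            if (token.isCancelled()) {
                break; // observed only between tokens, never mid-token
            }
            text.append(tokens.next());
        }
        return text.toString(); // partial output on cancel (assumption)
    }
}
```

The token being generated when `cancel()` is called still completes, so no cross-thread call into native code is needed.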
317+
222318
**Gap.** A blocking `complete(...)` cannot be aborted from another thread.
223319

224320
**Proposal.**

docs/history/49be664_open_issues.md

Lines changed: 16 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -188,7 +188,22 @@ vs. total, whether the file is the weights file, whether it is a download or
188188
disk load) via a `Consumer<LLamaLoadProgress>` callback passed to the
189189
`LlamaModel` constructor. Intended for showing a progress bar to end users.
190190

191-
**Status in fork:** STILL POSSIBLE. No `LLamaLoadProgress`, `Consumer<…>` or `setProgressCallback` exists in `LlamaModel.java` / `ModelParameters.java` (`grep -n "progress\|Progress"` returns only iterator cancellation comments). Next steps: expose `llama_model_params.progress_callback` through `ModelParameters` (Java side: a `Consumer<Float>` field; JNI side: wire a trampoline in `jllama.cpp` similar to `log_callback_trampoline`).
191+
**Status in fork:** FIXED in PR #188 (commit `70df324`). New
192+
`LoadProgressCallback` functional interface (single method
193+
`boolean onProgress(float progress)`; return `false` to abort).
194+
New constructor overload
195+
`LlamaModel(ModelParameters, LoadProgressCallback)` plumbs the
196+
callback through a new JNI entry point `loadModelWithProgress`,
197+
which installs a trampoline on `common_params.load_progress_callback`
198+
that forwards the float to `LoadProgressCallback.onProgress(float)Z`
199+
via `CallBooleanMethod`. The existing `loadModel` JNI symbol still
200+
exists; both entry points share a `load_model_impl` helper.
201+
Callback fires synchronously on the loader thread with progress in
202+
`[0.0, 1.0]`; returning `false` aborts and the constructor throws
203+
`LlamaException`. The original report's richer payload (file name,
204+
bytes, weights vs download flag) is NOT exposed — only the float —
205+
because `llama_model_params.progress_callback` itself only emits the
206+
float; richer fields would require an upstream API change.
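
The callback contract described above (a float in `[0.0, 1.0]`, return `false` to abort, abort surfaces as an exception) can be sketched without any native code. The `LoadProgressCallback` interface below mirrors the single-method shape given in the note; `LoadProgressSketch.load` and `LoadAbortedException` are hypothetical stand-ins for the real loader and `LlamaException`.

```java
/** Illustrative sketch only; simulates a loader that reports progress in fixed steps. */
final class LoadProgressSketch {

    /** Same shape as described in the note: return false to abort the load. */
    @FunctionalInterface
    interface LoadProgressCallback {
        boolean onProgress(float progress);
    }

    /** Hypothetical stand-in for the exception the real constructor throws on abort. */
    static final class LoadAbortedException extends RuntimeException {
        LoadAbortedException(String message) {
            super(message);
        }
    }

    /** Invokes the callback synchronously on the calling (loader) thread. */
    static void load(int steps, LoadProgressCallback callback) {
        for (int i = 1; i <= steps; i++) {
            float progress = (float) i / steps; // ends at exactly 1.0
            if (!callback.onProgress(progress)) {
                throw new LoadAbortedException("model load aborted at progress " + progress);
            }
        }
    }
}
```

A caller driving a progress bar would return `true` unconditionally; a caller with a cancel button would return `false` once the user has asked to stop.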
192207

193208
---
194209
