Skip to content

Commit 92a4e1f

Browse files
Add async APIs, cancellation support, and typed metrics accessors (#188)
* Add typed Usage / Timings / ServerMetrics accessors (§2.5) Introduces Usage, Timings, and ServerMetrics value classes plus LlamaModel.getMetricsTyped() so callers no longer need to parse the raw JSON from getMetrics() by hand. Mirrors the existing ModelMeta pattern. 15 unit tests, no native or JNI changes. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Add per-request setJsonSchema + completeAsJson<T> helpers (§2.8) InferenceParameters gains setJsonSchema(String) mirroring the existing ModelParameters setter. LlamaModel.completeAsJson<T> sets the schema, runs complete(), and deserializes the result via Jackson, throwing a LlamaException if the model output is not valid JSON for the target type. No JNI changes — the native server already accepts json_schema in slot params. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Add typed logprobs to LlamaOutput (§2.9) New TokenLogprob record carries token text, id, raw prob/logprob, and the nested top_probs/top_logprobs alternatives. LlamaOutput.logprobs is populated by CompletionResponseParser.parseLogprobs from the same completion_probabilities array that already feeds the flat probabilities map. Existing constructor stays as a delegator so all prior callers keep working with logprobs defaulting to empty. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Document iterator cancel semantics + idempotency regression (§2.7) LlamaIterator.cancel() / close() were already wired correctly via the existing JNI cancelCompletion → erase_reader path, so this is purely a docs + test pass: - Clarify in LlamaIterator javadoc that the underlying llama.cpp slot may continue to its natural stop after cancel(), while the reader is released immediately and next() stops yielding. - Document close() idempotency (post-natural-stop, post-cancel, double-close all safe). - Add try-with-resources example to LlamaModel.generate javadoc. - Add testIteratorCloseIdempotent in LlamaModelTest covering both the drained-then-closed and cancelled-then-closed paths and confirming the model is still usable afterwards. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Add CancellationToken + complete(params, token) overload (§2.10) CancellationToken wraps an AtomicInteger task id and a model reference. LlamaModel.complete(params, token) runs the streaming inference path internally, binds the token, accumulates text, and returns early when token.cancel() is invoked from another thread. The token is reset on return so it is reusable across calls. No JNI changes: reuses the existing cancelCompletion native method (which erases the JNI reader; the upstream slot completes naturally). https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Add CompletableFuture async wrappers for complete/chatComplete (§2.3) LlamaModel gains completeAsync, chatCompleteAsync, and chatCompleteTextAsync — thin wrappers that dispatch the existing blocking methods through ForkJoinPool.commonPool(). The completeAsync(params, token) overload bridges future.cancel(true) to CancellationToken.cancel() so cancellation propagates into the inference loop. Reactive Flow.Publisher streaming (M-effort) is intentionally deferred to a follow-up; this PR delivers only the S-effort portion of §2.3. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Add Session multi-turn helper + ChatMessage value type (§2.6) Session is a thin wrapper over LlamaModel: it owns a slot id, an accumulating user/assistant transcript, and an optional system message and parameter customizer. send(userMessage) appends both sides of the turn and runs chatCompleteText with the full history. stream(userMessage) returns a LlamaIterable for streamed replies; commitStreamedReply records the assistant turn once the caller has accumulated the text. save/restore delegate to existing LlamaModel.saveSlot/restoreSlot. close() erases the slot's KV cache. Single-threaded use only in this pass — per-session locking is the M-effort follow-up. ChatMessage is the minimal value type for the transcript; will be reused by ChatResponse when §2.2 lands. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Fix CI VM crash: make CancellationToken cooperative-only Cross-thread cancel raced with the JNI receive loop: cancel() called cancelCompletion() from another thread, which erased the underlying server_response_reader unique_ptr while the main thread held a raw pointer to it and was blocked inside rd->next(). On the next token this dereferenced freed memory and aborted with std::system_error, crashing the test JVM (exit 134). Fix: cancel() now sets a volatile flag only. The inference loop in complete(params, token) checks the flag between tokens and, when set, calls cancelCompletion from the same thread that just returned from receiveCompletionJson — safe because no concurrent access remains. Latency becomes one token interval (tens to a few hundred ms on CPU) instead of immediate. Documented in CancellationToken javadoc. Tests: - LlamaModelTest#testCompleteWithCancellationToken: budget relaxed from 5s to 30s (was tight even on the happy path). - LlamaModelTest#testCompleteAsyncCancelPropagates: drop the brittle poll on token.isCancelled() (the worker resets the token on return before the assertion sees it); sleep for cancel propagation and verify the model is still usable. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Fix javadoc build error + reduce warnings on new classes The release packaging job (mvn package, release profile) runs maven-javadoc-plugin's attach-javadocs which treats Javadoc tool errors as build failures. PR #188 introduced one such error: TokenLogprob.java had a </p> with no matching <p> (the prose was already enclosed by an outer <p>...</p>, and the inner </p> was stray). Fix the error and bring my new public APIs up to a clean shape: - TokenLogprob: rebalance the <p>/</p> HTML and add @return / @param to public getters and constructor. - Timings, Usage, ServerMetrics, ChatMessage, CancellationToken, Session, LlamaOutput: add @return / @param tags with a leading one-line description (the "no main description" warning fires on bare /** @return ... */ blocks). - LlamaModel: restore the doc comment for complete(params, token) that was accidentally stripped during an earlier edit, and add one for getMetricsTyped(); remove a stray orphan doc block. Local verification: mvn clean javadoc:jar -DskipTests=true -Dgpg.skip=true mvn -P release -Dmaven.test.skip=true -Dgpg.skip=true package Both: BUILD SUCCESS (was: BUILD FAILURE, 1 error, 100 warnings). 60 warnings remain, all from pre-existing files outside this PR. Document the verification command and the failure categories (errors vs warnings) in CLAUDE.md under "Javadoc — must build cleanly before mvn package". https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Add LoadProgressCallback for model-load progress (#113) Exposes llama.cpp's llama_model_params.progress_callback as a Java functional interface. New constructor: new LlamaModel(parameters, progress -> { ... return true; }); The callback receives a float in [0.0, 1.0] on the loader thread (same thread that called the constructor) and may return false to abort, in which case the constructor throws LlamaException. JNI: extracts the existing loadModel body into load_model_impl, adds a trampoline that forwards float progress to a Java LoadProgressCallback.onProgress(float)Z via CallBooleanMethod. Trampoline state lives on the loader stack — bounded lifetime is the single load call. Two native entry points share the implementation: loadModel(String[]) — unchanged signature loadModelWithProgress(String[], LoadProgressCallback) Tests in LoadProgressCallbackTest (model-gated): non-decreasing progress in [0,1] reaching ~1.0, returning false aborts with LlamaException, null callback overload delegates to plain loadModel. All 435 C++ unit tests still pass. mvn javadoc:jar BUILD SUCCESS. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Add typed ChatRequest/ChatResponse + tool calling + agent loop (§2.2) New typed chat API on top of the existing handleChatCompletions JNI path — no native changes. Value types: - ChatChoice, ChatResponse — choices array, Usage, Timings, raw JSON - ToolCall, ToolDefinition — OAI-shaped tool wire types - ChatMessage (extended) — tool_call_id + tool_calls support, with toolResult() and assistantToolCalls() factory methods (backwards- compatible 2-arg constructor kept for Session and existing tests) - ToolHandler — functional interface for tool callbacks - ChatRequest — builder with messages, tools, tool_choice, maxToolRounds, and an InferenceParameters customizer InferenceParameters: new setMessagesJson(String), setToolsJson(String), setToolChoice(String) for verbatim JSON injection from ChatRequest. LlamaModel: - chat(ChatRequest) → ChatResponse Serializes the request (auto-enables use_jinja when tools present), calls chatComplete, parses the OAI JSON into ChatResponse via the extended ChatResponseParser.parseResponse. - chatWithTools(ChatRequest, Map<String, ToolHandler>) → ChatResponse Agent loop: per round, calls chat(); if the assistant returned tool_calls, invokes each handler (capturing exceptions as {"error":...} tool results so the loop continues), appends the assistant turn and tool-result turns to the request, and loops up to ChatRequest.maxToolRounds (default 8). Unknown tool names produce a {"error":"unknown tool: <name>"} result. ChatResponseParser: new parseResponse() and tool-call/choice parsers; handles both string-shaped and object-shaped tool_calls.arguments (some upstream variants emit each shape). Tests: - ChatResponseTest (7 new unit tests, model-free): plain reply, tool calls with string arguments, object-shaped arguments, malformed input, ChatRequest serialization round-trip. - LlamaModelTest: testTypedChat and testChatWithToolsLoopShortCircuits (model-gated). mvn javadoc:jar BUILD SUCCESS (0 errors, 60 warnings — same as before, none from new files). https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Add completeWithStats() for typed Usage/Timings/logprobs on plain completion complete() returned only the generated text, while chat() already exposed Usage/Timings/TokenLogprob via ChatResponse. This commit parity-fills the plain completion path: - New CompletionResult value type (text + Usage + Timings + List<TokenLogprob> + StopReason + raw JSON). - New LlamaModel.completeWithStats(InferenceParameters) calling the existing non-streaming JNI path and parsing the response via a new CompletionResponseParser.parseCompletionResult. - Maps the non-OAI completion fields: content -> text, tokens_evaluated -> Usage.promptTokens, tokens_predicted -> Usage.completionTokens, timings sub-object -> Timings, completion_probabilities -> List<TokenLogprob>, stop_type -> StopReason. complete() (the String-returning overload) is unchanged for backwards compatibility. 5 unit tests in CompletionResultTest (model-free): full response, missing-fields defaults, stop reason mapping (eos / limit / word), malformed input. mvn javadoc:jar BUILD SUCCESS, no new warnings. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Add completeBatch / chatBatch parallel dispatch (§2.4) Three new methods on LlamaModel that hand a list of requests to the native scheduler at once and collect results in input order: - completeBatch(List<InferenceParameters>) -> List<String> - completeBatchWithStats(List<InferenceParameters>) -> List<CompletionResult> - chatBatch(List<ChatRequest>) -> List<ChatResponse> Implementation reuses the existing CompletableFuture wrappers (completeAsync, supplyAsync(() -> completeWithStats/chat)) and joins them all in input order. The native worker thread runs the upstream slot scheduler, which dispatches tasks across however many slots ModelParameters.setParallel(N) was configured with. With the default N=1 the batch still works correctly, just sequentially. No JNI changes — the upstream scheduler already supports parallel slot execution; this surfaces it as a typed Java API. Three model-gated tests in LlamaModelTest exercise the order-preserving contract and per-result Usage population. mvn javadoc:jar BUILD SUCCESS, no new warnings. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Add LlamaPublisher reactive-streams token publisher (§2.3 follow-up) Backpressure-aware Publisher<LlamaOutput> on top of the existing streaming iterator. Reactor / RxJava / Kotlin coroutines all bridge to the Reactive Streams interface natively, so consumers wrap with Flux.from(...) / Flowable.fromPublisher(...) / asFlow() in one line. LlamaPublisher: - Single-subscriber (second subscribe signals onError per RS spec). - Each subscribe starts a dedicated emitter daemon thread. - Demand honoured via AtomicLong + monitor: emitter blocks while demand == 0 and only calls iterator.next() when demand > 0. - request(n <= 0) signals onError with IllegalArgumentException per reactive-streams §3.9. - cancel() closes the underlying iterator (cooperative, same path as LlamaIterator.close); idempotent. - onComplete fires on stop token, onError on any throwable from the iterator path. LlamaModel: - streamPublisher(InferenceParameters) and streamChatPublisher(InferenceParameters) factories. Dependency: adds org.reactivestreams:reactive-streams 1.0.4 (~5 KB, Java 8 compatible) to pom.xml. Tests in LlamaPublisherTest: - nullSubscriberThrows (model-free). - backpressureAndCancel, singleSubscriberContract, invalidRequestSignalsError (model-gated). mvn javadoc:jar BUILD SUCCESS, no new warnings. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy * Annotate docs with shipped/open status for the §2 feature inventory The kotlin-comparison doc and the open-issues doc were both stale after PR #188 shipped 11 features. Bring them up to date in place rather than introducing a separate changelog: docs/feature-investigation-llama-stack-client-kotlin.md - §1 capability matrix gets new rows for everything that landed (typed chat + tools, async wrappers, reactive Publisher, batch dispatch, Usage/Timings, Session, completeAsJson, TokenLogprob, CancellationToken, LoadProgressCallback). - New §1.1 status legend (SHIPPED / PARTIAL / OPEN). - Each §2.x section now starts with a Status: line summarising what shipped, with commit refs into this PR. §2.2/2.3/2.4/2.5/2.7/2.8/ 2.9/2.10 marked SHIPPED. §2.6 PARTIAL (locking deferred). §2.10 PARTIAL — cooperative cancel shipped; immediate cancel needs a new server-side JNI primitive (M-effort follow-up). §2.1 OPEN (multimodal image API). docs/history/49be664_open_issues.md - #113 updated STILL POSSIBLE -> FIXED in PR #188 commit 70df324, with a note that the richer payload (file name, bytes, weights flag) is intentionally not exposed because the upstream llama_model_params.progress_callback emits only a float. No code changes, no test impact. https://claude.ai/code/session_01R4ZrEy3ptJDLuUgUKuM4Gy --------- Co-authored-by: Claude <noreply@anthropic.com>
1 parent e96984c commit 92a4e1f

39 files changed

Lines changed: 3527 additions & 9 deletions

CLAUDE.md

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -514,6 +514,52 @@ into `models/` out-of-band.
514514
clang-format -i src/main/cpp/*.cpp src/main/cpp/*.hpp # Format C++ code
515515
```
516516

517+
### Javadoc — must build cleanly before `mvn package`
518+
519+
The release packaging job runs `mvn package` with the `release` profile, which attaches
520+
a javadoc jar via `maven-javadoc-plugin`. The plugin treats Javadoc tool **errors** as
521+
build failures (warnings are tolerated). After changing any public/protected Java API,
522+
verify the javadoc build succeeds locally:
523+
524+
```bash
525+
mvn clean javadoc:jar -DskipTests=true -Dgpg.skip=true
526+
# expected: BUILD SUCCESS
527+
```
528+
529+
Common Javadoc errors that fail the build (not warnings):
530+
531+
- **Unbalanced HTML**: `</p>` without a matching `<p>`, mismatched `<ul>`/`<li>`, stray
532+
closing tags. Symptom: `error: unexpected end tag: </p>`.
533+
- **Invalid `{@link …}` targets**: typo'd class, method, or parameter name.
534+
- **Self-closing void HTML elements written as `<br>` inside `<pre>` blocks** in HTML5
535+
mode (rare but seen).
536+
537+
Common Javadoc *warnings* (do not fail the build, but should be cleaned up on new code):
538+
539+
- `no main description` — a doc comment containing only `@param`/`@return`/`@throws`
540+
tags with no leading prose. Fix: add a one-line description before the tags.
541+
- `no @return` / `no @param` — public method missing the tag. Fix: add it.
542+
- `no comment` — public method/field/enum constant has no doc comment at all.
543+
- `use of default constructor, which does not provide a comment` — public class with
544+
no explicit constructor (the synthetic default has no Javadoc). Fix: add an explicit
545+
no-arg constructor with a Javadoc comment.
546+
547+
Preferred doc-comment shapes for getters and small value types:
548+
549+
```java
550+
/**
551+
* Brief one-line description of the value.
552+
*
553+
* @return the value
554+
*/
555+
public T getThing() { ... }
556+
```
557+
558+
A bare `/** @return … */` triggers `no main description`; add a leading sentence.
559+
560+
If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
561+
`.github/workflows/publish.yml` will pass the `attach-javadocs` step.
562+
517563
## Architecture
518564

519565
### Two-Layer Design

docs/feature-investigation-llama-stack-client-kotlin.md

Lines changed: 97 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -29,15 +29,42 @@ T-shirt sizes:
2929
| Slot save / restore / erase ||
3030
| `continueFinalMessage` ||
3131
| Tokenize / decode / template apply ||
32-
| Metrics string (`getMetrics()`) ||
32+
| Metrics string (`getMetrics()`) + typed `ServerMetrics` ||
3333
| Speculative draft model wiring ||
34+
| Typed `ChatRequest` / `ChatResponse` + tool calling | ✅ (§2.2) |
35+
| `CompletableFuture` async wrappers | ✅ (§2.3) |
36+
| Reactive Streams `Publisher<LlamaOutput>` token stream | ✅ (§2.3) |
37+
| `completeBatch` / `chatBatch` parallel dispatch | ✅ (§2.4) |
38+
| Typed `Usage` / `Timings` / `CompletionResult` | ✅ (§2.5) |
39+
| `Session` helper (single-threaded) | ✅ (§2.6) |
40+
| `AutoCloseable` iterator + cancel polish | ✅ (§2.7) |
41+
| Per-request `setJsonSchema` + `completeAsJson<T>` | ✅ (§2.8) |
42+
| Typed `TokenLogprob` in `LlamaOutput` | ✅ (§2.9) |
43+
| `CancellationToken` (cooperative) | ✅ (§2.10) |
44+
| `LoadProgressCallback` model-load progress | ✅ (#113) |
3445

3546
These do not need work — they already match or exceed the Kotlin client.
3647

48+
### 1.1 Status legend for §2
49+
50+
Each §2.x subsection below carries a **Status:** line at the top:
51+
52+
| Marker | Meaning |
53+
|--------|---------|
54+
| `SHIPPED` | Fully landed; commit refs follow. |
55+
| `PARTIAL` | Core landed; a documented follow-up remains (called out inline). |
56+
| `OPEN` | Not started. |
57+
58+
All references point to PR #188 on the `claude/upbeat-hypatia-wPdK5`
59+
branch unless noted.
60+
3761
## 2. Recommended additions (in priority order)
3862

3963
### 2.1 Multimodal image input (mtmd) — **L**
4064

65+
**Status: OPEN.** `ModelParameters.setMmproj` already wires the projector; no
66+
typed Java image API yet. Same gap as issues #103 / #34.
67+
4168
**Gap.** Upstream llama.cpp ships `mtmd` (vision + audio for some models) and
4269
the compiled-in server already pulls it in via `mtmd.h` / `mtmd-helper.h`. No
4370
Java method currently accepts image input. Kotlin examples show base64 image
@@ -59,6 +86,16 @@ Gemma 3, MiniCPM-V, LLaVA, etc.
5986

6087
### 2.2 Typed `ChatMessage` / `ChatResponse` model + tool calling — **M**
6188

89+
**Status: SHIPPED** (PR #188, commit `f2c7ed1`). New value types
90+
`ChatRequest`, `ChatResponse`, `ChatChoice`, `ChatMessage` (extended),
91+
`ToolCall`, `ToolDefinition`, plus `ToolHandler` functional interface.
92+
`LlamaModel.chat(ChatRequest)` returns a typed `ChatResponse`;
93+
`chatWithTools(ChatRequest, Map<String, ToolHandler>)` runs the agent
94+
auto-loop (capturing handler exceptions as `{"error":...}` tool results
95+
so the loop continues; cap via `ChatRequest.maxToolRounds`, default 8).
96+
The tier-1 (typed response only) and tier-2 (manual tool round-trip via
97+
`ChatMessage.toolResult`) APIs are equally usable.
98+
6299
**Gap.** Today: `setMessages(String system, List<Pair<String,String>>)` and
63100
`chatComplete → String`. The server *parses* tool calls
64101
(`common_chat_*` infrastructure) but Java callers must scrape JSON
@@ -85,6 +122,15 @@ papercut.
85122

86123
### 2.3 Async / non-blocking API — **S–M**
87124

125+
**Status: SHIPPED.** `CompletableFuture` wrappers (`completeAsync`,
126+
`chatCompleteAsync`, `chatCompleteTextAsync`, plus a
127+
`completeAsync(params, CancellationToken)` bridge that propagates
128+
`future.cancel(true)` into the cooperative token) in commit `1e673a9`.
129+
The reactive `Publisher<LlamaOutput>` follow-up (backpressure via
130+
Reactive Streams, single-subscriber) shipped in commit `afa4f65` as
131+
`LlamaModel.streamPublisher(...)` and `streamChatPublisher(...)` backed
132+
by `LlamaPublisher`. New runtime dep: `org.reactivestreams:reactive-streams:1.0.4`.
133+
88134
**Gap.** All `LlamaModel` methods are blocking. Kotlin offers
89135
`suspend fun` + Flow variants. JVM users currently dedicate platform
90136
threads per inference.
@@ -108,6 +154,13 @@ RxJava, Kotlin coroutines from Java consumers.
108154

109155
### 2.4 Batch inference across slots — **M**
110156

157+
**Status: SHIPPED** (PR #188, commit `de457b2`).
158+
`LlamaModel.completeBatch(List<InferenceParameters>)`,
159+
`completeBatchWithStats(...)`, and `chatBatch(List<ChatRequest>)` dispatch
160+
all requests at once via the existing async wrappers; results returned in
161+
input order. Throughput scales with `ModelParameters.setParallel(N)`
162+
(default `N=1` runs sequentially across the single slot).
163+
111164
**Gap.** llama.cpp natively serves parallel slots; the compiled-in server
112165
handles concurrent tasks. `LlamaModel` exposes no batch entry point.
113166

@@ -127,6 +180,12 @@ rerank pipelines; close to a free win.
127180

128181
### 2.5 Typed `Usage` / `Timings` result — **XS–S**
129182

183+
**Status: SHIPPED** (PR #188, commits `fe1cf3b` + `c529499`). `Usage`,
184+
`Timings`, and `ServerMetrics` value classes + `LlamaModel.getMetricsTyped()`
185+
parse server-wide metrics. Per-completion `Usage`/`Timings` land in
186+
`ChatResponse` (§2.2) and in the new `CompletionResult` returned by
187+
`LlamaModel.completeWithStats(InferenceParameters)`.
188+
130189
**Gap.** `getMetrics()` returns a raw JSON `String`. Kotlin exposes
131190
`Usage(promptTokens, completionTokens, totalTokens)` plus a richer
132191
`Timings` (`tokensPerSecond`, `promptMs`, `predictedMs`, `cacheHit`,
@@ -146,6 +205,12 @@ rerank pipelines; close to a free win.
146205

147206
### 2.6 `Session` helper (multi-turn) — **S–M**
148207

208+
**Status: PARTIAL** (PR #188, commit `e4f531c`). `Session` ships as an
209+
`AutoCloseable` wrapper with `send(...)`, `stream(...)`,
210+
`commitStreamedReply(...)`, `save(Path)` / `restore(Path)`, and an
211+
optional `InferenceParameters` customizer. Single-thread only in this
212+
pass — per-session locking is the remaining M-effort follow-up.
213+
149214
**Gap.** Slots exist as a low-level primitive. Kotlin offers
150215
"agents/sessions/turns" with persistence and resume.
151216

@@ -165,6 +230,12 @@ rerank pipelines; close to a free win.
165230

166231
### 2.7 Stream cancellation & `AutoCloseable` iterator — **S**
167232

233+
**Status: SHIPPED** (PR #188, commit `d1c9fb0`). `LlamaIterator` already
234+
implemented `AutoCloseable` with `cancel()`/`close()`; this commit
235+
audited the path, documented the cancel-vs-stop nuance and idempotency
236+
in the javadoc, added a try-with-resources example on
237+
`LlamaModel.generate(...)`, and added `testIteratorCloseIdempotent`.
238+
168239
**Gap.** `LlamaIterable` / `LlamaIterator` cannot be cancelled mid-stream;
169240
the underlying slot task keeps running until natural stop. Kotlin marks
170241
streaming returns `@MustBeClosed`.
@@ -183,6 +254,13 @@ Java side.
183254

184255
### 2.8 Structured-output convenience helpers — **S**
185256

257+
**Status: SHIPPED** (PR #188, commit `80e5c13`).
258+
`InferenceParameters.setJsonSchema(String)` mirrors the existing
259+
`setGrammar`. `LlamaModel.completeAsJson(Class<T>, String schema, InferenceParameters)`
260+
sets the schema and Jackson-deserializes the result. The
261+
single-argument overload `completeAsJson(Class<T>, InferenceParameters)`
262+
trusts that the caller already set schema/grammar.
263+
186264
**Gap.** `setJsonSchema` / `setGrammar` already exist on `ModelParameters`
187265
but not on `InferenceParameters`. No typed-result helper.
188266

@@ -202,6 +280,13 @@ SDKs.
202280

203281
### 2.9 Logprobs in the typed result — **S**
204282

283+
**Status: SHIPPED** (PR #188, commit `a8077b6`). `TokenLogprob` value
284+
type carries `token`, `tokenId`, `logprob`, and the nested
285+
`topLogprobs` alternatives. `LlamaOutput.logprobs` is populated by
286+
`CompletionResponseParser.parseLogprobs` (post-sampling `prob` or
287+
pre-sampling `logprob` mode auto-detected). Also surfaces in
288+
`CompletionResult.getLogprobs()` (§2.5).
289+
205290
**Gap.** `setNProbs` exists; the result type is a plain `String`, so
206291
per-token probabilities are not surfaced.
207292

@@ -219,6 +304,17 @@ per-token probabilities are not surfaced.
219304

220305
### 2.10 Cancellation token / abort for blocking calls — **S**
221306

307+
**Status: SHIPPED — cooperative only** (PR #188, commits `ad66e3a` +
308+
`e3b9043`). `CancellationToken.cancel()` sets a `volatile` flag observed
309+
between tokens by the inference loop in
310+
`LlamaModel.complete(InferenceParameters, CancellationToken)`. Effective
311+
latency is one token interval (the loop checks at each token boundary).
312+
Immediate cancel (requiring a new server-side stop-task JNI primitive)
313+
is the remaining M-effort follow-up. The initial impl tried to abort
314+
mid-token via a cross-thread JNI call; that race was the root cause of
315+
a `std::system_error` JVM abort in CI and was reverted to the safe
316+
cooperative path.
317+
222318
**Gap.** A blocking `complete(...)` cannot be aborted from another thread.
223319

224320
**Proposal.**

docs/history/49be664_open_issues.md

Lines changed: 16 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -188,7 +188,22 @@ vs. total, whether the file is the weights file, whether it is a download or
188188
disk load) via a `Consumer<LLamaLoadProgress>` callback passed to the
189189
`LlamaModel` constructor. Intended for showing a progress bar to end users.
190190

191-
**Status in fork:** STILL POSSIBLE. No `LLamaLoadProgress`, `Consumer<…>` or `setProgressCallback` exists in `LlamaModel.java` / `ModelParameters.java` (`grep -n "progress\|Progress"` returns only iterator cancellation comments). Next steps: expose `llama_model_params.progress_callback` through `ModelParameters` (Java side: a `Consumer<Float>` field; JNI side: wire a trampoline in `jllama.cpp` similar to `log_callback_trampoline`).
191+
**Status in fork:** FIXED in PR #188 (commit `70df324`). New
192+
`LoadProgressCallback` functional interface (single method
193+
`boolean onProgress(float progress)`; return `false` to abort).
194+
New constructor overload
195+
`LlamaModel(ModelParameters, LoadProgressCallback)` plumbs the
196+
callback through a new JNI entry point `loadModelWithProgress`,
197+
which installs a trampoline on `common_params.load_progress_callback`
198+
that forwards the float to `LoadProgressCallback.onProgress(float)Z`
199+
via `CallBooleanMethod`. The existing `loadModel` JNI symbol still
200+
exists; both entry points share a `load_model_impl` helper.
201+
Callback fires synchronously on the loader thread with progress in
202+
`[0.0, 1.0]`; returning `false` aborts and the constructor throws
203+
`LlamaException`. The original report's richer payload (file name,
204+
bytes, weights vs download flag) is NOT exposed — only the float —
205+
because `llama_model_params.progress_callback` itself only emits the
206+
float; richer fields would require an upstream API change.
192207

193208
---
194209

pom.xml

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,14 @@ SPDX-License-Identifier: MIT
7373
<artifactId>jackson-databind</artifactId>
7474
<version>2.21.3</version>
7575
</dependency>
76+
<!-- Reactive Streams API used by LlamaPublisher to expose token streams as a
77+
Publisher<LlamaOutput>. Java 8 compatible, ~5 KB, supplies the standard
78+
interfaces that Reactor / RxJava / Kotlin coroutines bridge to. -->
79+
<dependency>
80+
<groupId>org.reactivestreams</groupId>
81+
<artifactId>reactive-streams</artifactId>
82+
<version>1.0.4</version>
83+
</dependency>
7684
<!-- Required by OSInfo (vendored from xerial/sqlite-jdbc) for log emission. -->
7785
<dependency>
7886
<groupId>org.slf4j</groupId>

src/main/cpp/jllama.cpp

Lines changed: 45 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -598,7 +598,26 @@ JNIEXPORT void JNICALL JNI_OnUnload(JavaVM *vm, void *reserved) {
598598
llama_backend_free();
599599
}
600600

601-
JNIEXPORT void JNICALL Java_net_ladenthin_llama_LlamaModel_loadModel(JNIEnv *env, jobject obj, jobjectArray jparams) {
601+
// Trampoline state for llama.cpp's load_progress_callback. The native loader runs
602+
// on the calling JNI thread so we can capture JNIEnv directly. Lifetime is bounded
603+
// by the single load_model_impl call.
604+
namespace {
605+
struct load_progress_ud {
606+
JNIEnv *env;
607+
jobject callback;
608+
jmethodID on_progress;
609+
};
610+
611+
bool jni_load_progress_trampoline(float progress, void *user_data) {
612+
auto *ud = static_cast<load_progress_ud *>(user_data);
613+
return ud->env->CallBooleanMethod(ud->callback, ud->on_progress, progress) == JNI_TRUE;
614+
}
615+
} // namespace
616+
617+
// Shared implementation of loadModel and loadModelWithProgress. When `progress` is
618+
// non-null, installs a load-progress trampoline; otherwise behaves identically to
619+
// the no-callback path.
620+
static void load_model_impl(JNIEnv *env, jobject obj, jobjectArray jparams, jobject progress) {
602621
common_params params;
603622

604623
const jsize argc = env->GetArrayLength(jparams);
@@ -662,6 +681,21 @@ JNIEXPORT void JNICALL Java_net_ladenthin_llama_LlamaModel_loadModel(JNIEnv *env
662681

663682
LOG_INF("%s: loading model\n", __func__);
664683

684+
// Install the load-progress trampoline if the caller supplied a callback.
685+
load_progress_ud progress_ud{};
686+
if (progress != nullptr) {
687+
jclass cb_cls = env->GetObjectClass(progress);
688+
progress_ud.env = env;
689+
progress_ud.callback = progress;
690+
progress_ud.on_progress = env->GetMethodID(cb_cls, "onProgress", "(F)Z");
691+
if (progress_ud.on_progress == nullptr) {
692+
fail_load("LoadProgressCallback.onProgress(float) not found");
693+
return;
694+
}
695+
params.load_progress_callback = jni_load_progress_trampoline;
696+
params.load_progress_callback_user_data = &progress_ud;
697+
}
698+
665699
if (!jctx->server.load_model(params)) {
666700
fail_load("could not load model from given file path");
667701
return;
@@ -706,6 +740,16 @@ JNIEXPORT void JNICALL Java_net_ladenthin_llama_LlamaModel_loadModel(JNIEnv *env
706740
env->SetLongField(obj, f_model_pointer, reinterpret_cast<jlong>(jctx));
707741
}
708742

743+
JNIEXPORT void JNICALL Java_net_ladenthin_llama_LlamaModel_loadModel(JNIEnv *env, jobject obj, jobjectArray jparams) {
744+
load_model_impl(env, obj, jparams, nullptr);
745+
}
746+
747+
JNIEXPORT void JNICALL Java_net_ladenthin_llama_LlamaModel_loadModelWithProgress(JNIEnv *env, jobject obj,
748+
jobjectArray jparams,
749+
jobject callback) {
750+
load_model_impl(env, obj, jparams, callback);
751+
}
752+
709753
JNIEXPORT jstring JNICALL Java_net_ladenthin_llama_LlamaModel_getModelMetaJson(JNIEnv *env, jobject obj) {
710754
REQUIRE_SERVER_CONTEXT(nullptr);
711755
if (jctx->vocab_only) {
Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,52 @@
1+
// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
2+
//
3+
// SPDX-License-Identifier: MIT
4+
5+
package net.ladenthin.llama;
6+
7+
/**
8+
* Cancellation handle for a blocking {@link LlamaModel} call. Pass an instance to
9+
* {@link LlamaModel#complete(InferenceParameters, CancellationToken)} and invoke
10+
* {@link #cancel()} from another thread to abort the inference loop.
11+
* <p>
12+
* Cancellation is cooperative: {@link #cancel()} only sets a flag, and the inference
13+
* loop checks that flag between generated tokens. Effective latency is therefore one
14+
* token interval (typically tens to a few hundred ms). The native task is <em>not</em>
15+
* unblocked mid-token because the underlying JNI reader cannot be safely freed while
16+
* another thread is blocked inside it.
17+
* </p>
18+
* <p>
19+
* A token may be reused across calls. {@link #cancel()} and {@link #isCancelled()} are
20+
* safe to invoke concurrently with the inference loop.
21+
* </p>
22+
*/
23+
public final class CancellationToken {
24+
25+
private volatile boolean cancelled;
26+
27+
/** Construct a fresh, not-cancelled token. */
28+
public CancellationToken() {
29+
// empty
30+
}
31+
32+
/**
33+
* Cancellation flag accessor.
34+
* @return {@code true} once {@link #cancel()} has been called and before {@link #reset()}
35+
*/
36+
public boolean isCancelled() {
37+
return cancelled;
38+
}
39+
40+
/**
41+
* Request cancellation. Sets the flag observed by the inference loop; the loop will
42+
* return at its next token boundary. Idempotent and safe to call from any thread.
43+
*/
44+
public void cancel() {
45+
cancelled = true;
46+
}
47+
48+
/** Clear the cancelled flag so the token can be reused. Package-private. */
49+
void reset() {
50+
cancelled = false;
51+
}
52+
}

0 commit comments

Comments
 (0)