bernardladenthin
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 46 additions & 0 deletions
diff --git a/docs/feature-investigation-llama-stack-client-kotlin.md b/docs/feature-investigation-llama-stack-client-kotlin.md
Lines changed: 97 additions & 1 deletion
diff --git a/docs/history/49be664_open_issues.md b/docs/history/49be664_open_issues.md
Lines changed: 16 additions & 1 deletion
diff --git a/pom.xml b/pom.xml
Lines changed: 8 additions & 0 deletions
diff --git a/src/main/cpp/jllama.cpp b/src/main/cpp/jllama.cpp
Lines changed: 45 additions & 1 deletion
diff --git a/src/main/java/net/ladenthin/llama/CancellationToken.java b/src/main/java/net/ladenthin/llama/CancellationToken.java
Lines changed: 52 additions & 0 deletions
@@ -514,6 +514,52 @@ into `models/` out-of-band.
 clang-format -i src/main/cpp/*.cpp src/main/cpp/*.hpp   # Format C++ code
 ```
 
+### Javadoc — must build cleanly before `mvn package`
+
+The release packaging job runs `mvn package` with the `release` profile, which attaches
+a javadoc jar via `maven-javadoc-plugin`. The plugin treats Javadoc tool **errors** as
+build failures (warnings are tolerated). After changing any public/protected Java API,
+verify the javadoc build succeeds locally:
+
+```bash
+mvn clean javadoc:jar -DskipTests=true -Dgpg.skip=true
+# expected: BUILD SUCCESS
+```
+
+Common Javadoc errors that fail the build (not warnings):
+
+- **Unbalanced HTML**: `</p>` without a matching `<p>`, mismatched `<ul>`/`<li>`, stray
+  closing tags. Symptom: `error: unexpected end tag: </p>`.
+- **Invalid `{@link …}` targets**: typo'd class, method, or parameter name.
+- **Self-closing void HTML elements** such as `<br/>` (instead of `<br>`) inside
+  `<pre>` blocks in HTML5 mode (rare but seen).
+
+Common Javadoc *warnings* (do not fail the build, but should be cleaned up on new code):
+
+- `no main description` — a doc comment containing only `@param`/`@return`/`@throws`
+  tags with no leading prose. Fix: add a one-line description before the tags.
+- `no @return` / `no @param` — public method missing the tag. Fix: add it.
+- `no comment` — public method/field/enum constant has no doc comment at all.
+- `use of default constructor, which does not provide a comment` — public class with
+  no explicit constructor (the synthetic default has no Javadoc). Fix: add an explicit
+  no-arg constructor with a Javadoc comment.
+
+Preferred doc-comment shape for getters and small value types:
+
+```java
+/**
+ * Brief one-line description of the value.
+ *
+ * @return the value
+ */
+public T getThing() { ... }
+```
+
+A bare `/** @return … */` triggers `no main description`; add a leading sentence.
+
+If the local check passes (`BUILD SUCCESS`), the `attach-javadocs` step of the
+`mvn package` job in `.github/workflows/publish.yml` should pass as well.
+
 ## Architecture
 
 ### Two-Layer Design
 
@@ -29,15 +29,42 @@ T-shirt sizes:
 | Slot save / restore / erase                              | ✅ |
 | `continueFinalMessage`                                   | ✅ |
 | Tokenize / decode / template apply                       | ✅ |
-| Metrics string (`getMetrics()`)                          | ✅ |
+| Metrics string (`getMetrics()`) + typed `ServerMetrics`  | ✅ |
 | Speculative draft model wiring                           | ✅ |
+| Typed `ChatRequest` / `ChatResponse` + tool calling      | ✅ (§2.2) |
+| `CompletableFuture` async wrappers                       | ✅ (§2.3) |
+| Reactive Streams `Publisher<LlamaOutput>` token stream   | ✅ (§2.3) |
+| `completeBatch` / `chatBatch` parallel dispatch          | ✅ (§2.4) |
+| Typed `Usage` / `Timings` / `CompletionResult`           | ✅ (§2.5) |
+| `Session` helper (single-threaded)                       | ✅ (§2.6) |
+| `AutoCloseable` iterator + cancel polish                 | ✅ (§2.7) |
+| Per-request `setJsonSchema` + `completeAsJson<T>`        | ✅ (§2.8) |
+| Typed `TokenLogprob` in `LlamaOutput`                    | ✅ (§2.9) |
+| `CancellationToken` (cooperative)                        | ✅ (§2.10) |
+| `LoadProgressCallback` model-load progress               | ✅ (#113) |
 
 These do not need work — they already match or exceed the Kotlin client.
 
+### 1.1 Status legend for §2
+
+Each §2.x subsection below carries a **Status:** line at the top:
+
+| Marker | Meaning |
+|--------|---------|
+| `SHIPPED` | Fully landed; commit refs follow. |
+| `PARTIAL` | Core landed; a documented follow-up remains (called out inline). |
+| `OPEN` | Not started. |
+
+All references point to PR #188 on the `claude/upbeat-hypatia-wPdK5`
+branch unless noted.
+
 ## 2. Recommended additions (in priority order)
 
 ### 2.1 Multimodal image input (mtmd) — **L**
 
+**Status: OPEN.** `ModelParameters.setMmproj` already wires the projector; no
+typed Java image API yet. Same gap as issues #103 / #34.
+
 **Gap.** Upstream llama.cpp ships `mtmd` (vision + audio for some models) and
 the compiled-in server already pulls it in via `mtmd.h` / `mtmd-helper.h`. No
 Java method currently accepts image input. Kotlin examples show base64 image
@@ -59,6 +86,16 @@ Gemma 3, MiniCPM-V, LLaVA, etc.
 
 ### 2.2 Typed `ChatMessage` / `ChatResponse` model + tool calling — **M**
 
+**Status: SHIPPED** (PR #188, commit `f2c7ed1`). New value types
+`ChatRequest`, `ChatResponse`, `ChatChoice`, `ChatMessage` (extended),
+`ToolCall`, `ToolDefinition`, plus `ToolHandler` functional interface.
+`LlamaModel.chat(ChatRequest)` returns a typed `ChatResponse`;
+`chatWithTools(ChatRequest, Map<String, ToolHandler>)` runs the agent
+auto-loop (capturing handler exceptions as `{"error":...}` tool results
+so the loop continues; cap via `ChatRequest.maxToolRounds`, default 8).
+The tier-1 (typed response only) and tier-2 (manual tool round-trip via
+`ChatMessage.toolResult`) APIs are equally usable.
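
The auto-loop described in this status note can be sketched in plain Java. This is a minimal illustration, not the code in PR #188: `Handler`, `Turn`, the scripted model function and the `null` return at the round cap are all hypothetical stand-ins. It shows only the two behaviours named above: a handler exception becomes an `{"error":...}` tool result so the loop continues, and the loop is capped by a round limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ToolLoopSketch {

    /** Hypothetical handler shape: JSON arguments in, JSON result out. */
    public interface Handler {
        String call(String argsJson) throws Exception;
    }

    /** Hypothetical model turn: either a tool request or a final answer. */
    public static final class Turn {
        public final String toolName;
        public final String args;
        public final String answer;

        private Turn(String toolName, String args, String answer) {
            this.toolName = toolName;
            this.args = args;
            this.answer = answer;
        }

        public static Turn tool(String name, String args) { return new Turn(name, args, null); }

        public static Turn done(String answer) { return new Turn(null, null, answer); }
    }

    /** Runs at most maxToolRounds rounds; returns null if no final answer arrived (assumption). */
    public static String run(Function<List<String>, Turn> model,
                             Map<String, Handler> handlers, int maxToolRounds) {
        List<String> toolResults = new ArrayList<>();
        for (int round = 0; round < maxToolRounds; round++) {
            Turn turn = model.apply(toolResults);
            if (turn.answer != null) {
                return turn.answer;
            }
            String result;
            try {
                Handler handler = handlers.get(turn.toolName);
                if (handler == null) {
                    throw new IllegalArgumentException("unknown tool " + turn.toolName);
                }
                result = handler.call(turn.args);
            } catch (Exception e) {
                // Capture the failure as a tool result so the loop keeps going.
                // No JSON escaping here; a real implementation would need it.
                result = "{\"error\":\"" + e.getMessage() + "\"}";
            }
            toolResults.add(result);
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Handler> handlers = new HashMap<>();
        handlers.put("boom", a -> { throw new IllegalStateException("disk full"); });
        // Scripted model: request the failing tool once, then answer with what it saw.
        Function<List<String>, Turn> model = results ->
                results.isEmpty() ? Turn.tool("boom", "{}") : Turn.done("saw " + results.get(0));
        System.out.println(run(model, handlers, 8));
    }
}
```

The scripted model receives the captured error as an ordinary tool result, so it can still produce a final answer after the handler failed.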
+
 **Gap.** Today: `setMessages(String system, List<Pair<String,String>>)` and
 `chatComplete → String`. The server *parses* tool calls
 (`common_chat_*` infrastructure) but Java callers must scrape JSON
@@ -85,6 +122,15 @@ papercut.
 
 ### 2.3 Async / non-blocking API — **S–M**
 
+**Status: SHIPPED.** `CompletableFuture` wrappers (`completeAsync`,
+`chatCompleteAsync`, `chatCompleteTextAsync`, plus a
+`completeAsync(params, CancellationToken)` bridge that propagates
+`future.cancel(true)` into the cooperative token) landed in commit `1e673a9`.
+The reactive `Publisher<LlamaOutput>` follow-up (backpressure via
+Reactive Streams, single-subscriber) shipped in commit `afa4f65` as
+`LlamaModel.streamPublisher(...)` and `streamChatPublisher(...)` backed
+by `LlamaPublisher`. New runtime dep: `org.reactivestreams:reactive-streams:1.0.4`.
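
The cancel bridge mentioned here is needed because `CompletableFuture.cancel(true)` does not interrupt the thread running the task. A minimal sketch of the idea follows, under stated assumptions: `Token` is a hypothetical stand-in for the cooperative token, the sleeping loop stands in for token generation, and the two latches exist only to make the demo observable. It is not the implementation from commit `1e673a9`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class AsyncCancelBridgeSketch {

    /** Hypothetical stand-in for the cooperative token: one volatile flag. */
    public static final class Token {
        private volatile boolean cancelled;

        public void cancel() { cancelled = true; }

        public boolean isCancelled() { return cancelled; }
    }

    /**
     * Starts a simulated token loop asynchronously. `started` is released once the
     * loop is running, `finished` once it has observed the flag and returned.
     */
    public static CompletableFuture<Integer> completeAsync(Token token,
                                                           CountDownLatch started,
                                                           CountDownLatch finished) {
        CompletableFuture<Integer> future = CompletableFuture.supplyAsync(() -> {
            int produced = 0;
            started.countDown();
            while (!token.isCancelled()) {   // checked once per simulated token
                produced++;
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            finished.countDown();
            return produced;
        });
        // cancel() only completes the future; forward it to the token so the loop stops.
        future.whenComplete((result, error) -> {
            if (future.isCancelled()) {
                token.cancel();
            }
        });
        return future;
    }

    public static void main(String[] args) throws InterruptedException {
        Token token = new Token();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        CompletableFuture<Integer> future = completeAsync(token, started, finished);
        started.await();
        future.cancel(true);
        boolean stopped = finished.await(2, TimeUnit.SECONDS);
        System.out.println("cancelled=" + token.isCancelled() + " loopStopped=" + stopped);
    }
}
```

Without the `whenComplete` hook the future would report `isCancelled()` while the worker kept generating tokens in the background.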
+
 **Gap.** All `LlamaModel` methods are blocking. Kotlin offers
 `suspend fun` + Flow variants. JVM users currently dedicate platform
 threads per inference.
@@ -108,6 +154,13 @@ RxJava, Kotlin coroutines from Java consumers.
 
 ### 2.4 Batch inference across slots — **M**
 
+**Status: SHIPPED** (PR #188, commit `de457b2`).
+`LlamaModel.completeBatch(List<InferenceParameters>)`,
+`completeBatchWithStats(...)`, and `chatBatch(List<ChatRequest>)` dispatch
+all requests at once via the existing async wrappers; results returned in
+input order. Throughput scales with `ModelParameters.setParallel(N)`
+(default `N=1` runs sequentially across the single slot).
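
Dispatching everything at once while returning results in input order can be sketched with plain `CompletableFuture`s. The generic `completeBatch` helper and the upper-casing "model" below are illustrative assumptions, not the signatures from commit `de457b2`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class BatchOrderSketch {

    /** Submits every request first, then joins the futures in input order. */
    public static <Q, R> List<R> completeBatch(List<Q> requests, Function<Q, R> complete) {
        List<CompletableFuture<R>> futures = new ArrayList<>();
        for (Q request : requests) {
            futures.add(CompletableFuture.supplyAsync(() -> complete.apply(request)));
        }
        List<R> results = new ArrayList<>();
        for (CompletableFuture<R> future : futures) {
            results.add(future.join());   // blocks until this position is ready
        }
        return results;
    }

    public static void main(String[] args) {
        // Shorter prompts may finish first, yet the output order follows the input.
        List<String> out = completeBatch(Arrays.asList("ccc", "bb", "a"), prompt -> {
            try {
                Thread.sleep(prompt.length() * 20L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return prompt.toUpperCase();
        });
        System.out.println(out);   // prints [CCC, BB, A]
    }
}
```

Joining by position rather than by completion is what keeps the result list aligned with the request list, regardless of how many requests actually run in parallel.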
+
 **Gap.** llama.cpp natively serves parallel slots; the compiled-in server
 handles concurrent tasks. `LlamaModel` exposes no batch entry point.
 
@@ -127,6 +180,12 @@ rerank pipelines; close to a free win.
 
 ### 2.5 Typed `Usage` / `Timings` result — **XS–S**
 
+**Status: SHIPPED** (PR #188, commits `fe1cf3b` + `c529499`). `Usage`,
+`Timings`, and `ServerMetrics` value classes + `LlamaModel.getMetricsTyped()`
+parse server-wide metrics. Per-completion `Usage`/`Timings` land in
+`ChatResponse` (§2.2) and in the new `CompletionResult` returned by
+`LlamaModel.completeWithStats(InferenceParameters)`.
+
 **Gap.** `getMetrics()` returns a raw JSON `String`. Kotlin exposes
 `Usage(promptTokens, completionTokens, totalTokens)` plus a richer
 `Timings` (`tokensPerSecond`, `promptMs`, `predictedMs`, `cacheHit`,
@@ -146,6 +205,12 @@ rerank pipelines; close to a free win.
 
 ### 2.6 `Session` helper (multi-turn) — **S–M**
 
+**Status: PARTIAL** (PR #188, commit `e4f531c`). `Session` ships as an
+`AutoCloseable` wrapper with `send(...)`, `stream(...)`,
+`commitStreamedReply(...)`, `save(Path)` / `restore(Path)`, and an
+optional `InferenceParameters` customizer. Single-thread only in this
+pass — per-session locking is the remaining M-effort follow-up.
+
 **Gap.** Slots exist as a low-level primitive. Kotlin offers
 "agents/sessions/turns" with persistence and resume.
 
@@ -165,6 +230,12 @@ rerank pipelines; close to a free win.
 
 ### 2.7 Stream cancellation & `AutoCloseable` iterator — **S**
 
+**Status: SHIPPED** (PR #188, commit `d1c9fb0`). `LlamaIterator` already
+implemented `AutoCloseable` with `cancel()`/`close()`; this commit
+audited the path, documented the cancel-vs-stop nuance and idempotency
+in the javadoc, added a try-with-resources example on
+`LlamaModel.generate(...)`, and added `testIteratorCloseIdempotent`.
+
 **Gap.** `LlamaIterable` / `LlamaIterator` cannot be cancelled mid-stream;
 the underlying slot task keeps running until natural stop. Kotlin marks
 streaming returns `@MustBeClosed`.
@@ -183,6 +254,13 @@ Java side.
 
 ### 2.8 Structured-output convenience helpers — **S**
 
+**Status: SHIPPED** (PR #188, commit `80e5c13`).
+`InferenceParameters.setJsonSchema(String)` mirrors the existing
+`setGrammar`. `LlamaModel.completeAsJson(Class<T>, String schema, InferenceParameters)`
+sets the schema and Jackson-deserializes the result. The
+single-argument overload `completeAsJson(Class<T>, InferenceParameters)`
+trusts that the caller already set schema/grammar.
+
 **Gap.** `setJsonSchema` / `setGrammar` already exist on `ModelParameters`
 but not on `InferenceParameters`. No typed-result helper.
 
@@ -202,6 +280,13 @@ SDKs.
 
 ### 2.9 Logprobs in the typed result — **S**
 
+**Status: SHIPPED** (PR #188, commit `a8077b6`). `TokenLogprob` value
+type carries `token`, `tokenId`, `logprob`, and the nested
+`topLogprobs` alternatives. `LlamaOutput.logprobs` is populated by
+`CompletionResponseParser.parseLogprobs` (post-sampling `prob` or
+pre-sampling `logprob` mode auto-detected). Also surfaces in
+`CompletionResult.getLogprobs()` (§2.5).
+
 **Gap.** `setNProbs` exists; the result type is a plain `String`, so
 per-token probabilities are not surfaced.
 
@@ -219,6 +304,17 @@ per-token probabilities are not surfaced.
 
 ### 2.10 Cancellation token / abort for blocking calls — **S**
 
+**Status: SHIPPED — cooperative only** (PR #188, commits `ad66e3a` +
+`e3b9043`). `CancellationToken.cancel()` sets a `volatile` flag observed
+between tokens by the inference loop in
+`LlamaModel.complete(InferenceParameters, CancellationToken)`. Effective
+latency is one token interval (the loop checks at each token boundary).
+Immediate cancel (requiring a new server-side stop-task JNI primitive)
+is the remaining M-effort follow-up. The initial impl tried to abort
+mid-token via a cross-thread JNI call; that race was the root cause of
+a `std::system_error` JVM abort in CI and was reverted to the safe
+cooperative path.
+
 **Gap.** A blocking `complete(...)` cannot be aborted from another thread.
 
 **Proposal.**
 
@@ -188,7 +188,22 @@ vs. total, whether the file is the weights file, whether it is a download or
 disk load) via a `Consumer<LLamaLoadProgress>` callback passed to the
 `LlamaModel` constructor. Intended for showing a progress bar to end users.
 
-**Status in fork:** STILL POSSIBLE. No `LLamaLoadProgress`, `Consumer<…>` or `setProgressCallback` exists in `LlamaModel.java` / `ModelParameters.java` (`grep -n "progress\|Progress"` returns only iterator cancellation comments). Next steps: expose `llama_model_params.progress_callback` through `ModelParameters` (Java side: a `Consumer<Float>` field; JNI side: wire a trampoline in `jllama.cpp` similar to `log_callback_trampoline`).
+**Status in fork:** FIXED in PR #188 (commit `70df324`). New
+`LoadProgressCallback` functional interface (single method
+`boolean onProgress(float progress)`; return `false` to abort).
+New constructor overload
+`LlamaModel(ModelParameters, LoadProgressCallback)` plumbs the
+callback through a new JNI entry point `loadModelWithProgress`,
+which installs a trampoline on `common_params.load_progress_callback`
+that forwards the float to `LoadProgressCallback.onProgress(float)Z`
+via `CallBooleanMethod`. The existing `loadModel` JNI symbol still
+exists; both entry points share a `load_model_impl` helper.
+Callback fires synchronously on the loader thread with progress in
+`[0.0, 1.0]`; returning `false` aborts and the constructor throws
+`LlamaException`. The original report's richer payload (file name,
+bytes, weights vs download flag) is NOT exposed — only the float —
+because `llama_model_params.progress_callback` itself only emits the
+float; richer fields would require an upstream API change.
 
 ---
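
The callback contract described in this entry (a float in `[0.0, 1.0]`, return `false` to abort) can be illustrated without a model file. In the sketch below, `ProgressCallback` mirrors the described single-method shape but is a local stand-in, `simulateLoad` is a hypothetical loader, and `IllegalStateException` stands in for the `LlamaException` the real constructor is said to throw.

```java
public class LoadProgressSketch {

    /** Local stand-in mirroring the described shape: return false to abort. */
    @FunctionalInterface
    public interface ProgressCallback {
        boolean onProgress(float progress);
    }

    /**
     * Hypothetical loader: reports progress after each step and aborts as soon as
     * the callback returns false. Returns the number of completed steps.
     */
    public static int simulateLoad(int steps, ProgressCallback callback) {
        for (int i = 1; i <= steps; i++) {
            float progress = (float) i / steps;   // ends at exactly 1.0
            if (!callback.onProgress(progress)) {
                throw new IllegalStateException("load aborted at " + progress);
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        // Progress-bar style consumer that never aborts.
        simulateLoad(4, p -> {
            System.out.printf("%3.0f%%%n", p * 100);
            return true;
        });

        // Consumer that aborts once half of the load is reported.
        try {
            simulateLoad(4, p -> p < 0.5f);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the callback runs synchronously on the loader thread, anything slow inside it delays the load itself; a UI consumer would typically hand the value off to another thread.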
 
 
@@ -73,6 +73,14 @@ SPDX-License-Identifier: MIT
 			<artifactId>jackson-databind</artifactId>
 			<version>2.21.3</version>
 		</dependency>
+		<!-- Reactive Streams API used by LlamaPublisher to expose token streams as a
+		     Publisher<LlamaOutput>. Java 8 compatible, ~5 KB, supplies the standard
+		     interfaces that Reactor / RxJava / Kotlin coroutines bridge to. -->
+		<dependency>
+			<groupId>org.reactivestreams</groupId>
+			<artifactId>reactive-streams</artifactId>
+			<version>1.0.4</version>
+		</dependency>
 		<!-- Required by OSInfo (vendored from xerial/sqlite-jdbc) for log emission. -->
 		<dependency>
 			<groupId>org.slf4j</groupId>
 
@@ -598,7 +598,26 @@ JNIEXPORT void JNICALL JNI_OnUnload(JavaVM *vm, void *reserved) {
     llama_backend_free();
 }
 
-JNIEXPORT void JNICALL Java_net_ladenthin_llama_LlamaModel_loadModel(JNIEnv *env, jobject obj, jobjectArray jparams) {
+// Trampoline state for llama.cpp's load_progress_callback. The native loader runs
+// on the calling JNI thread so we can capture JNIEnv directly. Lifetime is bounded
+// by the single load_model_impl call.
+namespace {
+struct load_progress_ud {
+    JNIEnv  *env;
+    jobject  callback;
+    jmethodID on_progress;
+};
+
+bool jni_load_progress_trampoline(float progress, void *user_data) {
+    auto *ud = static_cast<load_progress_ud *>(user_data);
+    return ud->env->CallBooleanMethod(ud->callback, ud->on_progress, progress) == JNI_TRUE;
+}
+} // namespace
+
+// Shared implementation of loadModel and loadModelWithProgress. When `progress` is
+// non-null, installs a load-progress trampoline; otherwise behaves identically to
+// the no-callback path.
+static void load_model_impl(JNIEnv *env, jobject obj, jobjectArray jparams, jobject progress) {
     common_params params;
 
     const jsize argc = env->GetArrayLength(jparams);
@@ -662,6 +681,21 @@ JNIEXPORT void JNICALL Java_net_ladenthin_llama_LlamaModel_loadModel(JNIEnv *env
 
     LOG_INF("%s: loading model\n", __func__);
 
+    // Install the load-progress trampoline if the caller supplied a callback.
+    load_progress_ud progress_ud{};
+    if (progress != nullptr) {
+        jclass cb_cls = env->GetObjectClass(progress);
+        progress_ud.env         = env;
+        progress_ud.callback    = progress;
+        progress_ud.on_progress = env->GetMethodID(cb_cls, "onProgress", "(F)Z");
+        if (progress_ud.on_progress == nullptr) {
+            fail_load("LoadProgressCallback.onProgress(float) not found");
+            return;
+        }
+        params.load_progress_callback           = jni_load_progress_trampoline;
+        params.load_progress_callback_user_data = &progress_ud;
+    }
+
     if (!jctx->server.load_model(params)) {
         fail_load("could not load model from given file path");
         return;
@@ -706,6 +740,16 @@ JNIEXPORT void JNICALL Java_net_ladenthin_llama_LlamaModel_loadModel(JNIEnv *env
     env->SetLongField(obj, f_model_pointer, reinterpret_cast<jlong>(jctx));
 }
 
+JNIEXPORT void JNICALL Java_net_ladenthin_llama_LlamaModel_loadModel(JNIEnv *env, jobject obj, jobjectArray jparams) {
+    load_model_impl(env, obj, jparams, nullptr);
+}
+
+JNIEXPORT void JNICALL Java_net_ladenthin_llama_LlamaModel_loadModelWithProgress(JNIEnv *env, jobject obj,
+                                                                              jobjectArray jparams,
+                                                                              jobject       callback) {
+    load_model_impl(env, obj, jparams, callback);
+}
+
 JNIEXPORT jstring JNICALL Java_net_ladenthin_llama_LlamaModel_getModelMetaJson(JNIEnv *env, jobject obj) {
     REQUIRE_SERVER_CONTEXT(nullptr);
     if (jctx->vocab_only) {
 
@@ -0,0 +1,52 @@
+// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+//
+// SPDX-License-Identifier: MIT
+
+package net.ladenthin.llama;
+
+/**
+ * Cancellation handle for a blocking {@link LlamaModel} call. Pass an instance to
+ * {@link LlamaModel#complete(InferenceParameters, CancellationToken)} and invoke
+ * {@link #cancel()} from another thread to abort the inference loop.
+ * <p>
+ * Cancellation is cooperative: {@link #cancel()} only sets a flag, and the inference
+ * loop checks that flag between generated tokens. Effective latency is therefore one
+ * token interval (typically tens to a few hundred ms). The native task is <em>not</em>
+ * unblocked mid-token because the underlying JNI reader cannot be safely freed while
+ * another thread is blocked inside it.
+ * </p>
+ * <p>
+ * A token may be reused across calls. {@link #cancel()} and {@link #isCancelled()} are
+ * safe to invoke concurrently with the inference loop.
+ * </p>
+ */
+public final class CancellationToken {
+
+    private volatile boolean cancelled;
+
+    /** Construct a fresh, not-cancelled token. */
+    public CancellationToken() {
+        // empty
+    }
+
+    /**
+     * Cancellation flag accessor.
+     * @return {@code true} once {@link #cancel()} has been called and before {@link #reset()}
+     */
+    public boolean isCancelled() {
+        return cancelled;
+    }
+
+    /**
+     * Request cancellation. Sets the flag observed by the inference loop; the loop will
+     * return at its next token boundary. Idempotent and safe to call from any thread.
+     */
+    public void cancel() {
+        cancelled = true;
+    }
+
+    /** Clear the cancelled flag so the token can be reused. Package-private. */
+    void reset() {
+        cancelled = false;
+    }
+}
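
The cooperative contract in the Javadoc above (the flag is only observed at token boundaries) can be shown with a deterministic sketch. `Token` and `generate` are hypothetical stand-ins for `CancellationToken` and the inference loop. In real use the cancel would come from another thread; here a per-token hook triggers it so the result is reproducible.

```java
import java.util.ArrayList;
import java.util.List;

public class CooperativeCancelSketch {

    /** Hypothetical stand-in for CancellationToken: a single volatile flag. */
    public static final class Token {
        private volatile boolean cancelled;

        public void cancel() { cancelled = true; }

        public boolean isCancelled() { return cancelled; }
    }

    /**
     * Simulated inference loop. The flag is read once per token boundary, so a
     * token that is already being produced is always finished first.
     */
    public static List<String> generate(int maxTokens, Token token, Runnable afterEachToken) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < maxTokens; i++) {
            if (token.isCancelled()) {
                break;
            }
            out.add("tok" + i);
            afterEachToken.run();
        }
        return out;
    }

    public static void main(String[] args) {
        Token token = new Token();
        int[] produced = {0};
        // Request cancellation right after the third token has been emitted.
        List<String> out = generate(100, token, () -> {
            if (++produced[0] == 3) {
                token.cancel();
            }
        });
        System.out.println(out);   // prints [tok0, tok1, tok2]
    }
}
```

The partial output is returned rather than discarded, which matches the idea that cancellation ends the loop at the next boundary instead of tearing down work in progress.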