bernardladenthin
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 7 additions & 0 deletions
diff --git a/CMakeLists.txt b/CMakeLists.txt
Lines changed: 1 addition & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 28 additions & 1 deletion
diff --git a/docs/feature-investigation-llama-stack-client-kotlin.md b/docs/feature-investigation-llama-stack-client-kotlin.md
Lines changed: 13 additions & 29 deletions
diff --git a/docs/history/49be664_open_issues.md b/docs/history/49be664_open_issues.md
Lines changed: 10 additions & 12 deletions
@@ -15,6 +15,8 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 - OpenSSF Best Practices badge (project 12862) on README.
 - OpenAI-compatible `parallel_tool_calls` support: `ChatRequest.withParallelToolCalls(Boolean)` / `getParallelToolCalls()`, `InferenceParameters.withParallelToolCalls(boolean)`, and pass-through in the `/v1/chat/completions` server mapper.
 - Real-model tool-calling integration tests for blocking and streaming required tool calls (`ToolCallingIntegrationTest`, Qwen2.5-1.5B-Instruct), wired into CI and `validate-models`.
+- End-to-end vision input across blocking, typed `ChatRequest`, streaming, and OpenAI-compatible request mapping; real-model tests verify that distinct red and blue images produce the correct semantic answers.
+- Explicit `setMmprojAuto(boolean)` and `setMmprojOffload(boolean)` controls, including the upstream `--no-mmproj-auto` and `--no-mmproj-offload` flags.
 
 ### Changed
 - Unified `CONTRIBUTING.md` and `SECURITY.md` structure with sibling repositories in the project family.
@@ -24,6 +26,11 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 - Upgraded llama.cpp from b9151 to b9172.
 - Extracted the `chatWithTools` agent loop into `ToolCallingAgent`; tool-result errors (unknown tool / handler exception) are now JSON-serialized so tool names containing special characters remain valid JSON.
 
+### Fixed
+- Preserved decoded image buffers across the JNI chat boundary and submitted media requests through llama.cpp's upstream multimodal task path instead of silently tokenizing them as text-only prompts.
+- Preserved multipart image content when using the typed `ChatRequest` serializer.
+- The standalone OpenAI-compatible server now advertises vision only when the loaded model confirms usable vision support.
+
 ### Added
 - Reasoning-budget tests (Qwen3-0.6B).
 
 
@@ -408,6 +408,7 @@ if(BUILD_TESTING)
         ${llama.cpp_SOURCE_DIR}/tools/server/server-context.cpp
         ${llama.cpp_SOURCE_DIR}/tools/server/server-queue.cpp
         ${llama.cpp_SOURCE_DIR}/tools/server/server-task.cpp
+        ${llama.cpp_SOURCE_DIR}/tools/server/server-schema.cpp
         ${llama.cpp_SOURCE_DIR}/tools/server/server-models.cpp
     )
 
 
@@ -373,6 +373,33 @@ try (LlamaModel model = new LlamaModel(modelParams)) {
 Reasoning/thinking models can receive custom Jinja template variables via
 `ModelParameters#setChatTemplateKwargs(Map)`.
 
+### Vision / Multimodal Chat
+
+Load a vision-capable GGUF with its matching projector, then place text and image parts in the
+same user message. Images may come from a file, raw bytes, a data URI, or an HTTP(S) URL:
+
+```java
+ModelParameters modelParams = new ModelParameters()
+        .setModel("models/SmolVLM-500M-Instruct-Q8_0.gguf")
+        .setMmproj("models/mmproj-SmolVLM-500M-Instruct-Q8_0.gguf");
+
+ChatMessage message = ChatMessage.userMultimodal(
+        ContentPart.text("Describe this image in one short sentence."),
+        ContentPart.imageFile(Paths.get("photo.jpg")));
+
+try (LlamaModel model = new LlamaModel(modelParams)) {
+    String answer = model.chatCompleteText(InferenceParameters.empty()
+            .withMessages(Collections.singletonList(message))
+            .withNPredict(64));
+    System.out.println(answer);
+}
+```
+
+The same multipart `messages[].content` shape works through `ChatRequest` and the embedded
+OpenAI-compatible `/v1/chat/completions` server. For a strictly CPU-only run, use
+`setDevices("none").setMmprojOffload(false)` in addition to `setGpuLayers(0)`; projector offload
+has its own upstream default.
+
 ### Tool Calling
 
 Use a tool-aware instruct model and enable Jinja when loading it. A typed request can either return
@@ -732,7 +759,7 @@ Forward-looking ideas being tracked for this fork:
 
 - **Adopt feature ideas from the Kotlin Llama Stack client.** Candidates (multimodal image input, typed chat messages, async API, batch inference, typed usage/timings) are inventoried with effort estimates in [`docs/feature-investigation-llama-stack-client-kotlin.md`](docs/feature-investigation-llama-stack-client-kotlin.md), derived from [`ogx-ai/llama-stack-client-kotlin`](https://github.com/ogx-ai/llama-stack-client-kotlin).
 - **Ship a directly Android-capable artifact.** Building on the existing [Importing in Android](#importing-in-android) flow and the `opencl-android-aarch64` classifier (see [Choosing the right classifier](#choosing-the-right-classifier)), the goal is a first-class Android Maven artifact — including a typed image-input helper for VLMs such as Qwen2.5-VL — so downstream Android projects can drop their dependency on [`ogx-ai/llama-stack-client-kotlin`](https://github.com/ogx-ai/llama-stack-client-kotlin) entirely.
-- **Resolve all upstream `kherud/java-llama.cpp` open issues.** All 37 open issues at fork time are catalogued with per-issue verdicts in [`docs/history/49be664_open_issues.md`](docs/history/49be664_open_issues.md); fixes land in this fork as they are completed. The remaining headline item is a typed Java image API for multimodal inputs (issues [#103](docs/history/49be664_open_issues.md#103--vlm-support--image-input-for-multimodal-models) and [#34](docs/history/49be664_open_issues.md#34--support-multimodal-inputs), both PARTIALLY FIXED) — the same work that closes §2.1 of the Kotlin feature inventory.
+- **Resolve all upstream `kherud/java-llama.cpp` open issues.** All 37 open issues at fork time are catalogued with per-issue verdicts in [`docs/history/49be664_open_issues.md`](docs/history/49be664_open_issues.md); fixes land in this fork as they are completed. Vision inputs (issues [#103](docs/history/49be664_open_issues.md#103--vlm-support--image-input-for-multimodal-models) and [#34](docs/history/49be664_open_issues.md#34--support-multimodal-inputs)) are now wired end to end through blocking, typed, streaming, and OpenAI-compatible request surfaces.
 
 ## Troubleshooting
 
 
@@ -62,13 +62,12 @@ branch unless noted.
 
 ### 2.1 Multimodal image input (mtmd) — **L**
 
-**Status: SHIPPED (typed Java surface).** The original L-effort scope assumed
-new JNI plumbing was required, but on inspection the upstream OAI chat path
-(`oaicompat_chat_params_parse` in `server-common.cpp`) already detects
-`{"type":"image_url","image_url":{"url":"data:..."}}` blocks and routes them
-through the compiled-in `mtmd` pipeline, and the project's
-`handleChatCompletions` JNI method forwards the request JSON intact. Only the
-Java-side convenience to emit the multipart-array `content` was missing.
+**Status: SHIPPED (end to end).** The upstream OAI parser detects
+`{"type":"image_url","image_url":{"url":"data:..."}}` blocks and decodes them,
+but the binding previously discarded those decoded buffers before creating the
+server task. The JNI bridge now preserves the media and submits it through the
+upstream CLI multimodal task path, which invokes `process_mtmd_prompt` with the
+loaded projector.
 
 This pass adds:
 - **`ContentPart`** value type (`TEXT` / `IMAGE_URL`) with static factories
@@ -79,13 +78,14 @@ This pass adds:
   `userMultimodal(ContentPart...)` factory, `getParts()`, and `hasParts()`.
   The legacy `ChatMessage(role, content)` ctor and existing serializer path
   are unchanged.
-- **`InferenceParameters.setMessages(List<ChatMessage>)`** overload that
+- **`InferenceParameters.withMessages(List<ChatMessage>)`** overload that
   routes through a new `ParameterJsonSerializer.buildMessages(List<ChatMessage>)`
   emitting array-form `content` only when a message has parts.
-- 25 unit tests in `ContentPartTest` and `MultimodalMessagesTest` cover the
+- Unit tests in `ContentPartTest`, `MultimodalMessagesTest`, `ChatRequestTest`,
+  and `OpenAiRequestMapperTest` cover the
   factory contracts, the parts/legacy split, and the OAI multipart JSON shape;
-  the 123 existing `ChatMessage` / `InferenceParameters` /
-  `ParameterJsonSerializer` tests still pass.
+- `MultimodalIntegrationTest` exercises blocking, typed, and streaming calls
+  with a real model and verifies semantic red/blue image discrimination.
 
 A multimodal call from Java now looks like:
 ```java
@@ -99,24 +99,8 @@ String reply = model.chatCompleteText(new InferenceParameters("")
             ContentPart.imageFile(java.nio.file.Paths.get("photo.jpg"))))));
 ```
 
-Zero new JNI symbols; zero risk to existing text-only chat callers.
-
-**Gap.** Upstream llama.cpp ships `mtmd` (vision + audio for some models) and
-the compiled-in server already pulls it in via `mtmd.h` / `mtmd-helper.h`. No
-Java method currently accepts image input. Kotlin examples show base64 image
-chat against vision models.
-
-**Proposal.**
-- `InferenceParameters.addImage(byte[] png)` / `addImage(Path)` / `addImageBase64(String)`.
-- `ModelParameters.setMmproj(Path)` to load the mmproj projector file.
-- JNI: feed images into the server task params (`mtmd_*` API).
-
-**Effort: L** — non-trivial JNI plumbing, lifecycle of `mtmd_context`,
-test fixtures for vision models, but most of the heavy lifting is already
-upstream.
-
-**Value.** Biggest user-visible capability missing today. Unlocks Qwen-VL,
-Gemma 3, MiniCPM-V, LLaVA, etc.
+No new exported JNI symbols are needed; existing text-only chat callers retain
+the ordinary tokenization path.
 
 ---
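
Reviewer note: the hunk above quotes the `{"type":"image_url","image_url":{"url":"data:..."}}` block that the upstream OAI parser detects. As a rough illustration of that wire shape only, the following self-contained sketch builds a data URI from raw bytes and places it next to a text part in one user message. The class `MultipartContentSketch` and its helper methods are hypothetical names introduced for this note; they are not part of this project's API, and the real binding produces this JSON through `ContentPart` and its serializer rather than by string concatenation.

```java
import java.util.Base64;

// Hypothetical illustration class; not part of the java-llama.cpp API.
final class MultipartContentSketch {

    // Encode raw image bytes as a base64 data URI, the form carried by an image_url part.
    static String toDataUri(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    // Minimal JSON string escaping, sufficient for this illustration only.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build one user message whose content is an array holding a text part and an image_url part.
    static String userMessageJson(String text, String imageUrl) {
        return "{\"role\":\"user\",\"content\":["
                + "{\"type\":\"text\",\"text\":\"" + escape(text) + "\"},"
                + "{\"type\":\"image_url\",\"image_url\":{\"url\":\"" + escape(imageUrl) + "\"}}"
                + "]}";
    }

    public static void main(String[] args) {
        byte[] placeholder = {1, 2, 3}; // placeholder bytes, not a real PNG
        System.out.println(userMessageJson("Describe this image.", toDataUri(placeholder, "image/png")));
    }
}
```

A message built this way has the array-form `content` that the typed API is described as emitting. Whether a given model can actually use the image still depends on loading a matching projector, as the README hunk states.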
 
 
@@ -21,7 +21,7 @@ After a second-pass analysis of every `LIKELY FIXED` and `PARTIALLY FIXED` issue
 
 - **Confirmable from code inspection alone (no runtime needed):**
   - #103, #34 — image input API: now FIXED via PR #189 (typed `ContentPart`
-    + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.setMessages(List<ChatMessage>)`
+    + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.withMessages(List<ChatMessage>)`
     emitting OAI multipart content; the upstream chat path already routes
     `image_url` blocks through the compiled-in `mtmd` pipeline, so zero new
     JNI was needed).
@@ -305,13 +305,11 @@ Java API.
 Feature request: support visual-language models such as Qwen2.5-VL (image
 inputs) on Android.
 
-**Status in fork:** FIXED in the same PR that closes the typed Java surface gap. The build still links the upstream `mtmd` multimodal library into `jllama` (`CMakeLists.txt:125-145, 253-255`) and `ModelParameters` still exposes `setMmproj`, `setMmprojUrl`, `enableMmprojAuto`, `enableMmprojOffload` (`ModelParameters.java:1250-1281`). The previously-missing typed image API now exists: `ContentPart.text(...)`, `ContentPart.imageUrl(...)`, `ContentPart.imageBytes(byte[], mime)`, `ContentPart.imageFile(Path)` (auto-detects png/jpeg/webp/gif), and `ChatMessage(role, List<ContentPart>)` + `ChatMessage.userMultimodal(ContentPart...)`. `InferenceParameters.setMessages(List<ChatMessage>)` serializes parts-bearing messages to the OAI array-form `content` that the upstream `oaicompat_chat_params_parse` already routes through the compiled-in `mtmd` pipeline &#x2014; zero new JNI required. See PR #189 (§2.1 of `docs/feature-investigation-llama-stack-client-kotlin.md`).
+**Status in fork:** FIXED end to end. The build links the upstream `mtmd` multimodal library into `jllama`, `ModelParameters` exposes projector loading and explicit auto/offload controls, and `ContentPart` + `ChatMessage.userMultimodal(...)` provide the typed image API. The native parser now preserves decoded media buffers and submits them through the upstream multimodal task path; previously those buffers were discarded and requests silently became text-only. Blocking, typed `ChatRequest`, streaming, and OpenAI-compatible mapping are covered, including a real-model semantic red/blue regression.
 
-**Deep-dive analysis:** Definitively confirmable from code inspection — no runtime test changes this verdict. Two distinct surfaces exist for VLM:
+**Deep-dive analysis:** Two distinct surfaces exist for VLM:
 1. **Model loading:** fully wired (mmproj path, auto-detect, GPU offload) — these flags reach the upstream server-context unchanged.
-2. **Request payload:** the only path is `LlamaModel.handleChatCompletions(json, oaiCompat=true)` with manually-constructed `messages[].content = [{type:"text",...},{type:"image_url",image_url:{url:"data:image/png;base64,..."}}]` JSON. No typed helper.
-
-This is genuinely PARTIALLY FIXED and only a Java-side enhancement closes the gap; no runtime investigation is required to confirm.
+2. **Request payload:** typed and raw OpenAI multipart content is decoded by the OAI parser, retained across JNI dispatch, and processed by `mtmd` before generation.
 
 ---
 
@@ -696,9 +694,9 @@ prebuilt artifact for Android targets.
 Feature request: add multimodal input support (referencing
 [ggerganov/llama.cpp#3436](https://github.com/ggerganov/llama.cpp/pull/3436)).
 
-**Status in fork:** FIXED. The upstream `mtmd` library is built and linked into `jllama` (`CMakeLists.txt:125-145, 253-255`), `ModelParameters` exposes `setMmproj`, `setMmprojUrl`, `enableMmprojAuto`, `enableMmprojOffload` (`ModelParameters.java:1250-1281`), and the typed image API now exists via `ContentPart` + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.setMessages(List<ChatMessage>)`. The serializer emits OAI array-form `content`; the upstream chat path already understands `image_url` blocks. See #103 for the parallel write-up and PR #189 for the implementation.
+**Status in fork:** FIXED. The upstream `mtmd` library is built and linked into `jllama`; projector controls and the typed `ContentPart` API are exposed; and decoded media is retained by the native bridge and processed through the upstream multimodal task path for blocking and streaming requests. See #103 for the full write-up.
 
-**Deep-dive analysis:** Same conclusion as #103 — confirmable from code, no runtime needed. The original 2023 feature request asked for "multimodal input support"; in 2025 terms this splits into model loading (DONE) and request payload (DONE: typed `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.setMessages(List<ChatMessage>)` emit the OAI multipart `content` the upstream chat path already consumes). Verdict is FIXED as of PR #189.
+**Deep-dive analysis:** Same conclusion as #103. Model loading and request execution are both verified; the real-model regression confirms image content affects the answer rather than merely producing a non-empty text-only response.
 
 ---
 
@@ -761,7 +759,7 @@ Feature request: add multimodal input support (referencing
 | 110 | FIXED | `handleEmbeddings` accepts batch JSON | `LlamaModel.java:316`, `json_helpers.hpp:137` |
 | 107 | FIXED | CMake matches both Mac and Darwin | `CMakeLists.txt:196` |
 | 104 | FIXED | `NO_KV_OFFLOAD` flag exposed | `args/ModelFlag.java:50` |
-| 103 | FIXED | mtmd linked; typed image API in PR #189 (`ContentPart`, `ChatMessage(role, List<ContentPart>)`, `InferenceParameters.setMessages(List<ChatMessage>)`) | `ContentPart.java`, `ChatMessage.java`, `InferenceParameters.java`, `ModelParameters.java:1250-1281` |
+| 103 | FIXED | mtmd linked; typed image API in PR #189 (`ContentPart`, `ChatMessage(role, List<ContentPart>)`, `InferenceParameters.withMessages(List<ChatMessage>)`) | `ContentPart.java`, `ChatMessage.java`, `InferenceParameters.java`, `ModelParameters.java:1250-1281` |
 | 102 | FIXED | Destructor drains workers and frees ctx; covered by `MemoryManagementTest#testOpenCloseLoopDoesNotLeak` (commit `cba693c`, PR #185) | `jllama.cpp:917-948` |
 | 101 | FIXED | Trampoline calls BiConsumer | `jllama.cpp:954-977` |
 | 98  | FIXED | `enableEmbedding` + `setPoolingType`; covered by `LlamaEmbeddingsTest#testNomicEmbedLoads` (commit `cba693c`, PR #185; CI downloads `nomic-embed-text-v1.5.f16.gguf`) | `ModelParameters.java:1040,606` |
@@ -804,7 +802,7 @@ or change the verdict.
 | 98 | Reporter's config was *literally* `new ModelParameters().setModel(...).setBatchSize(8192).setUbatchSize(8192)` — **no `enableEmbedding()` call**. The original "bug" was that the bindings did not forward `--embedding` at all; the upstream `result_output` assertion fired because the embedding pipeline was never initialised. | **DONE** — `LlamaEmbeddingsTest#testNomicEmbedLoads` (commit `713d426`) runs the reporter's exact config plus `enableEmbedding()`; gated on `net.ladenthin.llama.nomic.path`; CI downloads the model via `NOMIC_EMBED_MODEL_URL` in `publish.yml`. |
 | 95 | Reporter pastes the `next()` method and argues the design is wrong: when `output.stop=true`, the method returns that output and ends. No model, prompt or reproduction provided. | **DONE** — `LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt` (commit `713d426`) drives the iterator with a repetitive prompt at `nPredict=30`, `temperature=0.0f` and asserts termination within `nPredict+1` outputs. |
 | 80 | Exact repro: Kotlin-style 3 lines (`val params...`, `val model = new LlamaModel(params)`, `model.close()`) with `qwen2-0_5b-instruct-q4_0.gguf`. JDK 17.0.12+7, java-llama.cpp 3.4.1. SIGSEGV in `std::_Rb_tree` during `delete`. Reporter said they intended to follow up with a `-DLLAMA_DEBUG` build but never did. | **DONE** — `MemoryManagementTest#testOpenCloseWithoutGeneration` (commit `713d426`) maps the 3-line repro to 20 iterations of try-with-resources open + immediate close; a JVM crash exits the runner non-zero. |
-| 103 | Specifically asks about **Qwen2.5-VL on Android**. No code attempted. | **DONE (typed API)** — PR #189 ships `ContentPart` + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.setMessages(List<ChatMessage>)`. Android-sample tail tracked separately. |
+| 103 | Specifically asks about **Qwen2.5-VL on Android**. No code attempted. | **DONE (typed API)** — PR #189 ships `ContentPart` + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.withMessages(List<ChatMessage>)`. Android-sample tail tracked separately. |
 | 86 | Just a question: "does the CUDA jar handle CPU fallback?". No code. | Not unit-testable. Documentation task. |
 | 34 | One-line feature request linking upstream PR #3436 (LLaVA). No specifics. | **DONE** — subsumed by #103 (PR #189). |
 | 121 | (Not refetched — Android `aarch64` vs `arm64-v8a` mismatch; already analysed in deep-dive.) | Verified by code; needs an Android boot test, not a unit test. |
@@ -930,7 +928,7 @@ the same pattern as the existing CodeLlama / Jina-Reranker model downloads.
 
 | # | Why not unit-testable | Action |
 |---|---|---|
-| 103, 34 | (Historic — typed image API was missing at the time of this audit.) | **DONE in PR #189**: `ContentPart` + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.setMessages(List<ChatMessage>)` emit the OAI multipart `content` the upstream chat path already routes through `mtmd`. `MultimodalIntegrationTest` is added under model-gated `Assume`. |
+| 103, 34 | (Historic — typed image API and native media dispatch were incomplete at the time of this audit.) | **DONE**: `ContentPart` + `ChatMessage(role, List<ContentPart>)` emit OAI multipart content, and the JNI bridge retains decoded media for upstream `mtmd` processing. The model-gated integration test covers blocking, typed, streaming, and semantic image handling. |
 | 86 | Question about jar packaging behaviour, not code defect. | Documentation: add a README section "Choosing the right classifier" stating that the CUDA jar requires the CUDA runtime libraries at load time and does not auto-fall-back. |
 | 121, 50 | Android runtime / cross-host build path — needs an emulator boot or a macOS-M2 cross-compile, not a JVM test. | CI matrix expansion: add an Android emulator job that boots a stock `arm64-v8a` AVD and runs the existing `LlamaModelTest` against the dockcross-built `libjllama.so`. |
 
@@ -954,7 +952,7 @@ the same pattern as the existing CodeLlama / Jina-Reranker model downloads.
    the residual gap on #121.
 3. **Third PR (feature): SHIPPED as PR #189.** Adds the typed multimodal
    surface (`ContentPart`, `ChatMessage(role, List<ContentPart>)`,
-   `InferenceParameters.setMessages(List<ChatMessage>)`) plus a
+   `InferenceParameters.withMessages(List<ChatMessage>)`) plus a
    `MultimodalIntegrationTest` gated on `net.ladenthin.llama.vision.model`,
    `.vision.mmproj`, and `.vision.image` system properties. CI in
    `publish.yml` downloads a small vision model + mmproj + an author-