Commit 7a67d95

Merge pull request #249 from vaiju1981/feature/vision-update
feat: wire vision/multimodal input end to end
2 parents 2b0fcf5 + e5ca3e3 commit 7a67d95

21 files changed

Lines changed: 379 additions & 96 deletions

CHANGELOG.md

Lines changed: 7 additions & 0 deletions
@@ -15,6 +15,8 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 - OpenSSF Best Practices badge (project 12862) on README.
 - OpenAI-compatible `parallel_tool_calls` support: `ChatRequest.withParallelToolCalls(Boolean)` / `getParallelToolCalls()`, `InferenceParameters.withParallelToolCalls(boolean)`, and pass-through in the `/v1/chat/completions` server mapper.
 - Real-model tool-calling integration tests for blocking and streaming required tool calls (`ToolCallingIntegrationTest`, Qwen2.5-1.5B-Instruct), wired into CI and `validate-models`.
+- End-to-end vision input across blocking, typed `ChatRequest`, streaming, and OpenAI-compatible request mapping; real-model tests verify that distinct red and blue images produce the correct semantic answers.
+- Explicit `setMmprojAuto(boolean)` and `setMmprojOffload(boolean)` controls, including the upstream `--no-mmproj-auto` and `--no-mmproj-offload` flags.

 ### Changed
 - Unified `CONTRIBUTING.md` and `SECURITY.md` structure with sibling repositories in the project family.
@@ -24,6 +26,11 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 - Upgraded llama.cpp from b9151 to b9172.
 - Extracted the `chatWithTools` agent loop into `ToolCallingAgent`; tool-result errors (unknown tool / handler exception) are now JSON-serialized so tool names containing special characters remain valid JSON.

+### Fixed
+- Preserved decoded image buffers across the JNI chat boundary and submitted media requests through llama.cpp's upstream multimodal task path instead of silently tokenizing them as text-only prompts.
+- Preserved multipart image content when using the typed `ChatRequest` serializer.
+- The standalone OpenAI-compatible server now advertises vision only when the loaded model confirms usable vision support.
+
 ### Added
 - Reasoning-budget tests (Qwen3-0.6B).

README.md

Lines changed: 28 additions & 1 deletion
@@ -383,6 +383,33 @@ try (LlamaModel model = new LlamaModel(modelParams)) {
 Reasoning/thinking models can receive custom Jinja template variables via
 `ModelParameters#setChatTemplateKwargs(Map)`.

+### Vision / Multimodal Chat
+
+Load a vision-capable GGUF with its matching projector, then place text and image parts in the
+same user message. Images may come from a file, raw bytes, a data URI, or an HTTP(S) URL:
+
+```java
+ModelParameters modelParams = new ModelParameters()
+        .setModel("models/SmolVLM-500M-Instruct-Q8_0.gguf")
+        .setMmproj("models/mmproj-SmolVLM-500M-Instruct-Q8_0.gguf");
+
+ChatMessage message = ChatMessage.userMultimodal(
+        ContentPart.text("Describe this image in one short sentence."),
+        ContentPart.imageFile(Paths.get("photo.jpg")));
+
+try (LlamaModel model = new LlamaModel(modelParams)) {
+    String answer = model.chatCompleteText(InferenceParameters.empty()
+            .withMessages(Collections.singletonList(message))
+            .withNPredict(64));
+    System.out.println(answer);
+}
+```
+
+The same multipart `messages[].content` shape works through `ChatRequest` and the embedded
+OpenAI-compatible `/v1/chat/completions` server. For a strictly CPU-only run, use
+`setDevices("none").setMmprojOffload(false)` in addition to `setGpuLayers(0)`; projector offload
+has its own upstream default.
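For readers who want to see the multipart `messages[].content` shape on the wire, the following is a minimal JDK-only sketch. It does not use this project's API: the class `MultipartContentSketch` and its methods `toDataUri`, `escapeJson`, and `userMessageJson` are hypothetical names introduced only for illustration, and the JSON is assembled by hand rather than by the library's serializer.

```java
import java.util.Base64;

/** JDK-only sketch of an OpenAI-style multipart user message; all names are illustrative. */
public final class MultipartContentSketch {

    /** Encodes raw image bytes as a data URI of the form data:<mime>;base64,<payload>. */
    static String toDataUri(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    /** Escapes quotes, backslashes, and control characters for use inside a JSON string. */
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds one user message whose content is an array holding a text part and an image_url part. */
    static String userMessageJson(String text, String imageUrl) {
        return "{\"role\":\"user\",\"content\":["
                + "{\"type\":\"text\",\"text\":\"" + escapeJson(text) + "\"},"
                + "{\"type\":\"image_url\",\"image_url\":{\"url\":\"" + escapeJson(imageUrl) + "\"}}"
                + "]}";
    }

    public static void main(String[] args) {
        byte[] placeholder = {1, 2, 3}; // placeholder bytes, not a real image
        System.out.println(userMessageJson("Describe this image.", toDataUri(placeholder, "image/png")));
    }
}
```

A message built this way could be placed in the `messages` array of a raw `/v1/chat/completions` request body; the typed `ContentPart` factories in the diff above are the supported route and should be preferred.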
+
 ### Tool Calling

 Use a tool-aware instruct model and enable Jinja when loading it. A typed request can either return
@@ -742,7 +769,7 @@ Forward-looking ideas being tracked for this fork:

 - **Adopt feature ideas from the Kotlin Llama Stack client.** Candidates (multimodal image input, typed chat messages, async API, batch inference, typed usage/timings) are inventoried with effort estimates in [`docs/feature-investigation-llama-stack-client-kotlin.md`](docs/feature-investigation-llama-stack-client-kotlin.md), derived from [`ogx-ai/llama-stack-client-kotlin`](https://github.com/ogx-ai/llama-stack-client-kotlin).
 - **Ship a directly Android-capable artifact.** Building on the existing [Importing in Android](#importing-in-android) flow and the `opencl-android-aarch64` classifier (see [Choosing the right classifier](#choosing-the-right-classifier)), the goal is a first-class Android Maven artifact — including a typed image-input helper for VLMs such as Qwen2.5-VL — so downstream Android projects can drop their dependency on [`ogx-ai/llama-stack-client-kotlin`](https://github.com/ogx-ai/llama-stack-client-kotlin) entirely.
-- **Resolve all upstream `kherud/java-llama.cpp` open issues.** All 37 open issues at fork time are catalogued with per-issue verdicts in [`docs/history/49be664_open_issues.md`](docs/history/49be664_open_issues.md); fixes land in this fork as they are completed. The remaining headline item is a typed Java image API for multimodal inputs (issues [#103](docs/history/49be664_open_issues.md#103--vlm-support--image-input-for-multimodal-models) and [#34](docs/history/49be664_open_issues.md#34--support-multimodal-inputs), both PARTIALLY FIXED) — the same work that closes §2.1 of the Kotlin feature inventory.
+- **Resolve all upstream `kherud/java-llama.cpp` open issues.** All 37 open issues at fork time are catalogued with per-issue verdicts in [`docs/history/49be664_open_issues.md`](docs/history/49be664_open_issues.md); fixes land in this fork as they are completed. Vision inputs (issues [#103](docs/history/49be664_open_issues.md#103--vlm-support--image-input-for-multimodal-models) and [#34](docs/history/49be664_open_issues.md#34--support-multimodal-inputs)) are now wired end to end through blocking, typed, streaming, and OpenAI-compatible request surfaces.

 ## Troubleshooting

docs/feature-investigation-llama-stack-client-kotlin.md

Lines changed: 13 additions & 29 deletions
@@ -62,13 +62,12 @@ branch unless noted.

 ### 2.1 Multimodal image input (mtmd) — **L**

-**Status: SHIPPED (typed Java surface).** The original L-effort scope assumed
-new JNI plumbing was required, but on inspection the upstream OAI chat path
-(`oaicompat_chat_params_parse` in `server-common.cpp`) already detects
-`{"type":"image_url","image_url":{"url":"data:..."}}` blocks and routes them
-through the compiled-in `mtmd` pipeline, and the project's
-`handleChatCompletions` JNI method forwards the request JSON intact. Only the
-Java-side convenience to emit the multipart-array `content` was missing.
+**Status: SHIPPED (end to end).** The upstream OAI parser detects
+`{"type":"image_url","image_url":{"url":"data:..."}}` blocks and decodes them,
+but the binding previously discarded those decoded buffers before creating the
+server task. The JNI bridge now preserves the media and submits it through the
+upstream CLI multimodal task path, which invokes `process_mtmd_prompt` with the
+loaded projector.

 This pass adds:
 - **`ContentPart`** value type (`TEXT` / `IMAGE_URL`) with static factories
@@ -79,13 +78,14 @@ This pass adds:
   `userMultimodal(ContentPart...)` factory, `getParts()`, and `hasParts()`.
   The legacy `ChatMessage(role, content)` ctor and existing serializer path
   are unchanged.
-- **`InferenceParameters.setMessages(List<ChatMessage>)`** overload that
+- **`InferenceParameters.withMessages(List<ChatMessage>)`** overload that
   routes through a new `ParameterJsonSerializer.buildMessages(List<ChatMessage>)`
   emitting array-form `content` only when a message has parts.
-- 25 unit tests in `ContentPartTest` and `MultimodalMessagesTest` cover the
+- Unit tests in `ContentPartTest`, `MultimodalMessagesTest`, `ChatRequestTest`,
+  and `OpenAiRequestMapperTest` cover the
   factory contracts, the parts/legacy split, and the OAI multipart JSON shape;
-  the 123 existing `ChatMessage` / `InferenceParameters` /
-  `ParameterJsonSerializer` tests still pass.
+- `MultimodalIntegrationTest` exercises blocking, typed, and streaming calls
+  with a real model and verifies semantic red/blue image discrimination.
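The "array-form `content` only when a message has parts" rule from the list above can be pictured with a small JDK-only sketch. This is not the project's `ParameterJsonSerializer`: the class `ContentShapeSketch`, its nested `Message` type, and the `quote` and `toJson` helpers are hypothetical stand-ins, and parts are passed in as already-serialized JSON strings to keep the example short.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** JDK-only sketch of the parts/legacy split; every name here is illustrative. */
public final class ContentShapeSketch {

    /** Hypothetical message: plain text plus an optional list of pre-serialized JSON parts. */
    static final class Message {
        final String role;
        final String text;
        final List<String> partsJson;

        Message(String role, String text, List<String> partsJson) {
            this.role = role;
            this.text = text;
            this.partsJson = partsJson;
        }

        boolean hasParts() {
            return !partsJson.isEmpty();
        }
    }

    /** Minimal JSON string quoting: handles quotes and backslashes only (an assumption of this sketch). */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Emits string-form content for legacy messages and array-form content when parts exist. */
    static String toJson(Message m) {
        String content = m.hasParts()
                ? "[" + String.join(",", m.partsJson) + "]"
                : quote(m.text);
        return "{\"role\":" + quote(m.role) + ",\"content\":" + content + "}";
    }

    public static void main(String[] args) {
        Message legacy = new Message("user", "hi", Collections.<String>emptyList());
        Message multipart = new Message("user", "",
                Arrays.asList("{\"type\":\"text\",\"text\":\"hi\"}"));
        System.out.println(toJson(legacy));
        System.out.println(toJson(multipart));
    }
}
```

Keeping the string form for messages without parts is what lets existing text-only callers see byte-for-byte unchanged request JSON.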

 A multimodal call from Java now looks like:
 ```java
@@ -99,24 +99,8 @@ String reply = model.chatCompleteText(new InferenceParameters("")
 ContentPart.imageFile(java.nio.file.Paths.get("photo.jpg"))))));
 ```

-Zero new JNI symbols; zero risk to existing text-only chat callers.
-
-**Gap.** Upstream llama.cpp ships `mtmd` (vision + audio for some models) and
-the compiled-in server already pulls it in via `mtmd.h` / `mtmd-helper.h`. No
-Java method currently accepts image input. Kotlin examples show base64 image
-chat against vision models.
-
-**Proposal.**
-- `InferenceParameters.addImage(byte[] png)` / `addImage(Path)` / `addImageBase64(String)`.
-- `ModelParameters.setMmproj(Path)` to load the mmproj projector file.
-- JNI: feed images into the server task params (`mtmd_*` API).
-
-**Effort: L** — non-trivial JNI plumbing, lifecycle of `mtmd_context`,
-test fixtures for vision models, but most of the heavy lifting is already
-upstream.
-
-**Value.** Biggest user-visible capability missing today. Unlocks Qwen-VL,
-Gemma 3, MiniCPM-V, LLaVA, etc.
+No new exported JNI symbols are needed; existing text-only chat callers retain
+the ordinary tokenization path.

 ---

docs/history/49be664_open_issues.md

Lines changed: 10 additions & 12 deletions
@@ -21,7 +21,7 @@ After a second-pass analysis of every `LIKELY FIXED` and `PARTIALLY FIXED` issue

 - **Confirmable from code inspection alone (no runtime needed):**
   - #103, #34 — image input API: now FIXED via PR #189 (typed `ContentPart`
-    + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.setMessages(List<ChatMessage>)`
+    + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.withMessages(List<ChatMessage>)`
     emitting OAI multipart content; the upstream chat path already routes
     `image_url` blocks through the compiled-in `mtmd` pipeline, so zero new
     JNI was needed).
@@ -305,13 +305,11 @@ Java API.
 Feature request: support visual-language models such as Qwen2.5-VL (image
 inputs) on Android.

-**Status in fork:** FIXED in the same PR that closes the typed Java surface gap. The build still links the upstream `mtmd` multimodal library into `jllama` (`CMakeLists.txt:125-145, 253-255`) and `ModelParameters` still exposes `setMmproj`, `setMmprojUrl`, `enableMmprojAuto`, `enableMmprojOffload` (`ModelParameters.java:1250-1281`). The previously-missing typed image API now exists: `ContentPart.text(...)`, `ContentPart.imageUrl(...)`, `ContentPart.imageBytes(byte[], mime)`, `ContentPart.imageFile(Path)` (auto-detects png/jpeg/webp/gif), and `ChatMessage(role, List<ContentPart>)` + `ChatMessage.userMultimodal(ContentPart...)`. `InferenceParameters.setMessages(List<ChatMessage>)` serializes parts-bearing messages to the OAI array-form `content` that the upstream `oaicompat_chat_params_parse` already routes through the compiled-in `mtmd` pipeline &#x2014; zero new JNI required. See PR #189 (§2.1 of `docs/feature-investigation-llama-stack-client-kotlin.md`).
+**Status in fork:** FIXED end to end. The build links the upstream `mtmd` multimodal library into `jllama`, `ModelParameters` exposes projector loading and explicit auto/offload controls, and `ContentPart` + `ChatMessage.userMultimodal(...)` provide the typed image API. The native parser now preserves decoded media buffers and submits them through the upstream multimodal task path; previously those buffers were discarded and requests silently became text-only. Blocking, typed `ChatRequest`, streaming, and OpenAI-compatible mapping are covered, including a real-model semantic red/blue regression.

-**Deep-dive analysis:** Definitively confirmable from code inspection — no runtime test changes this verdict. Two distinct surfaces exist for VLM:
+**Deep-dive analysis:** Two distinct surfaces exist for VLM:
 1. **Model loading:** fully wired (mmproj path, auto-detect, GPU offload) — these flags reach the upstream server-context unchanged.
-2. **Request payload:** the only path is `LlamaModel.handleChatCompletions(json, oaiCompat=true)` with manually-constructed `messages[].content = [{type:"text",...},{type:"image_url",image_url:{url:"data:image/png;base64,..."}}]` JSON. No typed helper.
-
-This is genuinely PARTIALLY FIXED and only a Java-side enhancement closes the gap; no runtime investigation is required to confirm.
+2. **Request payload:** typed and raw OpenAI multipart content is decoded by the OAI parser, retained across JNI dispatch, and processed by `mtmd` before generation.
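To make "decoded by the OAI parser" concrete, here is a hedged JDK-only sketch of what decoding a base64 `data:` URI into a media type and raw bytes involves. It is not the upstream parser, which is C++ inside llama.cpp and is not reproduced here; `DataUriSketch`, `Decoded`, and `decode` are hypothetical names, and the sketch deliberately handles only the `;base64` form.

```java
import java.util.Base64;

/** JDK-only sketch of splitting a base64 data URI into media type and bytes; names are illustrative. */
public final class DataUriSketch {

    /** Holder for the decoded result. */
    static final class Decoded {
        final String mimeType;
        final byte[] bytes;

        Decoded(String mimeType, byte[] bytes) {
            this.mimeType = mimeType;
            this.bytes = bytes;
        }
    }

    /** Parses data:<mime>;base64,<payload>; other data URI variants are rejected in this sketch. */
    static Decoded decode(String uri) {
        if (!uri.startsWith("data:")) {
            throw new IllegalArgumentException("not a data URI");
        }
        int comma = uri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("missing ',' separator");
        }
        String header = uri.substring("data:".length(), comma); // e.g. image/png;base64
        String marker = ";base64";
        if (!header.endsWith(marker)) {
            throw new IllegalArgumentException("only base64 payloads are handled here");
        }
        String mimeType = header.substring(0, header.length() - marker.length());
        byte[] bytes = Base64.getDecoder().decode(uri.substring(comma + 1));
        return new Decoded(mimeType, bytes);
    }

    public static void main(String[] args) {
        Decoded d = decode("data:image/png;base64,AQID");
        System.out.println(d.mimeType + " (" + d.bytes.length + " bytes)");
    }
}
```

The decoded byte buffer is the kind of data that, per the status note above, must stay alive until the multimodal task consumes it; dropping it is what turned requests into text-only prompts before this change.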

 ---

@@ -696,9 +694,9 @@ prebuilt artifact for Android targets.
 Feature request: add multimodal input support (referencing
 [ggerganov/llama.cpp#3436](https://github.com/ggerganov/llama.cpp/pull/3436)).

-**Status in fork:** FIXED. The upstream `mtmd` library is built and linked into `jllama` (`CMakeLists.txt:125-145, 253-255`), `ModelParameters` exposes `setMmproj`, `setMmprojUrl`, `enableMmprojAuto`, `enableMmprojOffload` (`ModelParameters.java:1250-1281`), and the typed image API now exists via `ContentPart` + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.setMessages(List<ChatMessage>)`. The serializer emits OAI array-form `content`; the upstream chat path already understands `image_url` blocks. See #103 for the parallel write-up and PR #189 for the implementation.
+**Status in fork:** FIXED. The upstream `mtmd` library is built and linked into `jllama`; projector controls and the typed `ContentPart` API are exposed; and decoded media is retained by the native bridge and processed through the upstream multimodal task path for blocking and streaming requests. See #103 for the full write-up.

-**Deep-dive analysis:** Same conclusion as #103 — confirmable from code, no runtime needed. The original 2023 feature request asked for "multimodal input support"; in 2025 terms this splits into model loading (DONE) and request payload (DONE: typed `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.setMessages(List<ChatMessage>)` emit the OAI multipart `content` the upstream chat path already consumes). Verdict is FIXED as of PR #189.
+**Deep-dive analysis:** Same conclusion as #103. Model loading and request execution are both verified; the real-model regression confirms image content affects the answer rather than merely producing a non-empty text-only response.
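A red/blue regression of this kind needs two images that differ only in color. How the project produces its fixtures is not shown in this diff, so the following is only an assumed approach using the JDK's `BufferedImage` and `ImageIO`; `SolidColorFixtureSketch`, `solidPng`, and `centerRgb` are hypothetical names.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** JDK-only sketch of generating solid-color PNG fixtures; names are illustrative. */
public final class SolidColorFixtureSketch {

    /** Returns PNG bytes for a size x size image filled with one color. */
    static byte[] solidPng(Color color, int size) {
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                img.setRGB(x, y, color.getRGB());
            }
        }
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(img, "png", out)) {
                throw new IOException("no PNG writer available");
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes PNG bytes and returns the center pixel as 0xRRGGBB, a sanity check on the fixture. */
    static int centerRgb(byte[] png) {
        try {
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(png));
            return img.getRGB(img.getWidth() / 2, img.getHeight() / 2) & 0xFFFFFF;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] red = solidPng(Color.RED, 64);
        byte[] blue = solidPng(Color.BLUE, 64);
        System.out.printf("red=%06x blue=%06x%n", centerRgb(red), centerRgb(blue));
    }
}
```

Bytes produced this way could be handed to a byte-based image factory such as the `ContentPart.imageBytes(byte[], mime)` mentioned in the removed text above. Asserting that the model's answers differ between the two images is what separates a real vision path from a text-only fallback.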

 ---

@@ -761,7 +759,7 @@ Feature request: add multimodal input support (referencing
 | 110 | FIXED | `handleEmbeddings` accepts batch JSON | `LlamaModel.java:316`, `json_helpers.hpp:137` |
 | 107 | FIXED | CMake matches both Mac and Darwin | `CMakeLists.txt:196` |
 | 104 | FIXED | `NO_KV_OFFLOAD` flag exposed | `args/ModelFlag.java:50` |
-| 103 | FIXED | mtmd linked; typed image API in PR #189 (`ContentPart`, `ChatMessage(role, List<ContentPart>)`, `InferenceParameters.setMessages(List<ChatMessage>)`) | `ContentPart.java`, `ChatMessage.java`, `InferenceParameters.java`, `ModelParameters.java:1250-1281` |
+| 103 | FIXED | mtmd linked; typed image API in PR #189 (`ContentPart`, `ChatMessage(role, List<ContentPart>)`, `InferenceParameters.withMessages(List<ChatMessage>)`) | `ContentPart.java`, `ChatMessage.java`, `InferenceParameters.java`, `ModelParameters.java:1250-1281` |
 | 102 | FIXED | Destructor drains workers and frees ctx; covered by `MemoryManagementTest#testOpenCloseLoopDoesNotLeak` (commit `cba693c`, PR #185) | `jllama.cpp:917-948` |
 | 101 | FIXED | Trampoline calls BiConsumer | `jllama.cpp:954-977` |
 | 98 | FIXED | `enableEmbedding` + `setPoolingType`; covered by `LlamaEmbeddingsTest#testNomicEmbedLoads` (commit `cba693c`, PR #185; CI downloads `nomic-embed-text-v1.5.f16.gguf`) | `ModelParameters.java:1040,606` |
@@ -804,7 +802,7 @@ or change the verdict.
 | 98 | Reporter's config was *literally* `new ModelParameters().setModel(...).setBatchSize(8192).setUbatchSize(8192)` — **no `enableEmbedding()` call**. The original "bug" was that the bindings did not forward `--embedding` at all; the upstream `result_output` assertion fired because the embedding pipeline was never initialised. | **DONE** — `LlamaEmbeddingsTest#testNomicEmbedLoads` (commit `713d426`) runs the reporter's exact config plus `enableEmbedding()`; gated on `net.ladenthin.llama.nomic.path`; CI downloads the model via `NOMIC_EMBED_MODEL_URL` in `publish.yml`. |
 | 95 | Reporter pastes the `next()` method and argues the design is wrong: when `output.stop=true`, the method returns that output and ends. No model, prompt or reproduction provided. | **DONE** — `LlamaModelTest#testIteratorTerminatesOnRepetitivePrompt` (commit `713d426`) drives the iterator with a repetitive prompt at `nPredict=30`, `temperature=0.0f` and asserts termination within `nPredict+1` outputs. |
 | 80 | Exact repro: Kotlin-style 3 lines (`val params...`, `val model = new LlamaModel(params)`, `model.close()`) with `qwen2-0_5b-instruct-q4_0.gguf`. JDK 17.0.12+7, java-llama.cpp 3.4.1. SIGSEGV in `std::_Rb_tree` during `delete`. Reporter said they intended to follow up with a `-DLLAMA_DEBUG` build but never did. | **DONE** — `MemoryManagementTest#testOpenCloseWithoutGeneration` (commit `713d426`) maps the 3-line repro to 20 iterations of try-with-resources open + immediate close; a JVM crash exits the runner non-zero. |
-| 103 | Specifically asks about **Qwen2.5-VL on Android**. No code attempted. | **DONE (typed API)** — PR #189 ships `ContentPart` + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.setMessages(List<ChatMessage>)`. Android-sample tail tracked separately. |
+| 103 | Specifically asks about **Qwen2.5-VL on Android**. No code attempted. | **DONE (typed API)** — PR #189 ships `ContentPart` + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.withMessages(List<ChatMessage>)`. Android-sample tail tracked separately. |
 | 86 | Just a question: "does the CUDA jar handle CPU fallback?". No code. | Not unit-testable. Documentation task. |
 | 34 | One-line feature request linking upstream PR #3436 (LLaVA). No specifics. | **DONE** — subsumed by #103 (PR #189). |
 | 121 | (Not refetched — Android `aarch64` vs `arm64-v8a` mismatch; already analysed in deep-dive.) | Verified by code; needs an Android boot test, not a unit test. |
@@ -930,7 +928,7 @@ the same pattern as the existing CodeLlama / Jina-Reranker model downloads.

 | # | Why not unit-testable | Action |
 |---|---|---|
-| 103, 34 | (Historic — typed image API was missing at the time of this audit.) | **DONE in PR #189**: `ContentPart` + `ChatMessage(role, List<ContentPart>)` + `InferenceParameters.setMessages(List<ChatMessage>)` emit the OAI multipart `content` the upstream chat path already routes through `mtmd`. `MultimodalIntegrationTest` is added under model-gated `Assume`. |
+| 103, 34 | (Historic — typed image API and native media dispatch were incomplete at the time of this audit.) | **DONE**: `ContentPart` + `ChatMessage(role, List<ContentPart>)` emit OAI multipart content, and the JNI bridge retains decoded media for upstream `mtmd` processing. The model-gated integration test covers blocking, typed, streaming, and semantic image handling. |
 | 86 | Question about jar packaging behaviour, not code defect. | Documentation: add a README section "Choosing the right classifier" stating that the CUDA jar requires the CUDA runtime libraries at load time and does not auto-fall-back. |
 | 121, 50 | Android runtime / cross-host build path — needs an emulator boot or a macOS-M2 cross-compile, not a JVM test. | CI matrix expansion: add an Android emulator job that boots a stock `arm64-v8a` AVD and runs the existing `LlamaModelTest` against the dockcross-built `libjllama.so`. |

@@ -954,7 +952,7 @@ the same pattern as the existing CodeLlama / Jina-Reranker model downloads.
    the residual gap on #121.
 3. **Third PR (feature): SHIPPED as PR #189.** Adds the typed multimodal
    surface (`ContentPart`, `ChatMessage(role, List<ContentPart>)`,
-   `InferenceParameters.setMessages(List<ChatMessage>)`) plus a
+   `InferenceParameters.withMessages(List<ChatMessage>)`) plus a
    `MultimodalIntegrationTest` gated on `net.ladenthin.llama.vision.model`,
    `.vision.mmproj`, and `.vision.image` system properties. CI in
    `publish.yml` downloads a small vision model + mmproj + an author-
