Commit 871a700

Add typed multimodal chat surface (§2.1, Java-only)
The original §2.1 scope assumed new JNI plumbing was needed for image input.
On inspection the upstream OAI chat path already handles
{"type":"image_url","image_url":{"url":"data:..."}} blocks and routes them
through the compiled-in mtmd pipeline, and our handleChatCompletions JNI method
forwards the request JSON intact. The only gap was a typed Java surface to emit
the multipart-array form of the message content.

New API
- ContentPart value type (TEXT / IMAGE_URL) with factories text(), imageUrl(),
  imageBytes(bytes, mime), imageFile(Path). imageFile detects png/jpeg/webp/gif
  from the file extension and produces a base64 data URI.
- ChatMessage(String role, List<ContentPart> parts) constructor plus
  ChatMessage.userMultimodal(ContentPart...) factory. getParts() returns an
  unmodifiable list (null for legacy text-only messages); hasParts() is the
  boolean accessor. getContent() concatenates text parts so legacy readers stay
  functional.
- InferenceParameters.setMessages(List<ChatMessage>) overload routing through
  ParameterJsonSerializer.buildMessages(List<ChatMessage>) which emits
  array-form content only when a message has parts.

Backwards compat
- Existing ChatMessage(role, content) constructor unchanged. Existing
  setMessages(String, List<Pair>) unchanged. Existing serializer
  buildMessages(String, List<Pair>) unchanged. All 123 prior tests in
  ChatMessageTest / ParameterJsonSerializerTest / InferenceParametersTest
  still pass.

Tests
- ContentPartTest (14): factory contracts, MIME detection, data URI shape.
- MultimodalMessagesTest (11): hasParts/getParts split, concatText behaviour,
  OAI array-form serialisation, mixed legacy+parts message lists, and
  end-to-end round-trip into the InferenceParameters JSON.

All 25 new tests run without a model (pure JSON shape verification).

Zero new JNI symbols. Mmproj wiring (ModelParameters.setMmproj) already shipped
previously and is untouched.
1 parent 99d1c89 commit 871a700

7 files changed

Lines changed: 623 additions & 3 deletions

docs/feature-investigation-llama-stack-client-kotlin.md

Lines changed: 38 additions & 2 deletions
@@ -62,8 +62,44 @@ branch unless noted.
 
 ### 2.1 Multimodal image input (mtmd) — **L**
 
-**Status: OPEN.** `ModelParameters.setMmproj` already wires the projector; no
-typed Java image API yet. Same gap as issues #103 / #34.
+**Status: SHIPPED (typed Java surface).** The original L-effort scope assumed
+new JNI plumbing was required, but on inspection the upstream OAI chat path
+(`oaicompat_chat_params_parse` in `server-common.cpp`) already detects
+`{"type":"image_url","image_url":{"url":"data:..."}}` blocks and routes them
+through the compiled-in `mtmd` pipeline, and the project's
+`handleChatCompletions` JNI method forwards the request JSON intact. Only the
+Java-side convenience to emit the multipart-array `content` was missing.
+
+This pass adds:
+- **`ContentPart`** value type (`TEXT` / `IMAGE_URL`) with static factories
+  `text(...)`, `imageUrl(...)`, `imageBytes(byte[], mime)`, and
+  `imageFile(Path)` (auto-detects png/jpeg/webp/gif from the extension and
+  base64-encodes into a `data:` URI).
+- **`ChatMessage(String role, List<ContentPart> parts)`** constructor plus
+  `userMultimodal(ContentPart...)` factory, `getParts()`, and `hasParts()`.
+  The legacy `ChatMessage(role, content)` ctor and existing serializer path
+  are unchanged.
+- **`InferenceParameters.setMessages(List<ChatMessage>)`** overload that
+  routes through a new `ParameterJsonSerializer.buildMessages(List<ChatMessage>)`
+  emitting array-form `content` only when a message has parts.
+- 25 unit tests in `ContentPartTest` and `MultimodalMessagesTest` cover the
+  factory contracts, the parts/legacy split, and the OAI multipart JSON shape;
+  the 123 existing `ChatMessage` / `InferenceParameters` /
+  `ParameterJsonSerializer` tests still pass.
+
+A multimodal call from Java now looks like:
+```java
+LlamaModel model = new LlamaModel(new ModelParameters()
+    .setModel("vision-model.gguf")
+    .setMmproj("vision-projector.gguf"));
+String reply = model.chatCompleteText(new InferenceParameters("")
+    .setMessages(java.util.Collections.singletonList(
+        ChatMessage.userMultimodal(
+            ContentPart.text("What is in this image?"),
+            ContentPart.imageFile(java.nio.file.Paths.get("photo.jpg"))))));
+```
+
+Zero new JNI symbols; zero risk to existing text-only chat callers.
 
 **Gap.** Upstream llama.cpp ships `mtmd` (vision + audio for some models) and
 the compiled-in server already pulls it in via `mtmd.h` / `mtmd-helper.h`. No

src/main/java/net/ladenthin/llama/ChatMessage.java

Lines changed: 78 additions & 1 deletion
@@ -4,6 +4,7 @@
 
 package net.ladenthin.llama;
 
+import java.util.Arrays;
 import java.util.Collections;
 import java.util.List;
 
@@ -17,13 +18,21 @@
  * {@link #getToolCalls()} list. Tool-result turns have role {@code "tool"}, the tool's
  * output as content, and {@link #getToolCallId()} pointing back at the originating call.
  * </p>
+ * <p>
+ * Multimodal turns carry a non-null {@link #getParts()} list of {@link ContentPart}s
+ * (text and image references). When parts are present they take precedence over
+ * {@link #getContent()} during serialization; the upstream OAI chat path
+ * (see {@link InferenceParameters#setMessages(java.util.List)}) emits an array-form
+ * {@code content} field that the compiled-in {@code mtmd} pipeline understands.
+ * </p>
  */
 public final class ChatMessage {
 
     private final String role;
     private final String content;
     private final String toolCallId;
     private final List<ToolCall> toolCalls;
+    private final List<ContentPart> parts;
 
     /**
      * Plain user/assistant/system message.
@@ -32,7 +41,7 @@ public final class ChatMessage {
      * @param content the message text
      */
     public ChatMessage(String role, String content) {
-        this(role, content, null, Collections.<ToolCall>emptyList());
+        this(role, content, null, Collections.<ToolCall>emptyList(), null);
     }
 
     /**
@@ -44,10 +53,49 @@ public ChatMessage(String role, String content) {
      * @param toolCalls for assistant tool-call turns, the list of calls; empty otherwise
      */
     public ChatMessage(String role, String content, String toolCallId, List<ToolCall> toolCalls) {
+        this(role, content, toolCallId, toolCalls, null);
+    }
+
+    /**
+     * Multimodal constructor: build a message whose content is a list of
+     * {@link ContentPart}s (text and/or image references). The {@link #getContent()}
+     * accessor returns the concatenation of the text parts for legacy callers that
+     * cannot consume the array form.
+     *
+     * @param role the message role
+     * @param parts ordered list of content parts (must not be {@code null} or empty)
+     */
+    public ChatMessage(String role, List<ContentPart> parts) {
+        this(role, concatText(parts), null, Collections.<ToolCall>emptyList(),
+                Collections.unmodifiableList(new java.util.ArrayList<ContentPart>(requireNonEmpty(parts))));
+    }
+
+    private ChatMessage(String role, String content, String toolCallId,
+            List<ToolCall> toolCalls, List<ContentPart> parts) {
         this.role = role;
         this.content = content;
         this.toolCallId = toolCallId;
         this.toolCalls = toolCalls == null ? Collections.<ToolCall>emptyList() : toolCalls;
+        this.parts = parts;
+    }
+
+    private static List<ContentPart> requireNonEmpty(List<ContentPart> parts) {
+        if (parts == null || parts.isEmpty()) {
+            throw new IllegalArgumentException("parts must not be null or empty");
+        }
+        return parts;
+    }
+
+    private static String concatText(List<ContentPart> parts) {
+        if (parts == null) return "";
+        StringBuilder sb = new StringBuilder();
+        for (ContentPart p : parts) {
+            if (p.getType() == ContentPart.Type.TEXT) {
+                if (sb.length() > 0) sb.append('\n');
+                sb.append(p.getText());
+            }
+        }
+        return sb.toString();
     }
 
     /**
@@ -72,6 +120,17 @@ public static ChatMessage assistantToolCalls(String content, List<ToolCall> tool
         return new ChatMessage("assistant", content == null ? "" : content, null, toolCalls);
     }
 
+    /**
+     * Convenience factory for a {@code "user"} turn mixing text and one or more
+     * images. Equivalent to {@code new ChatMessage("user", parts)}.
+     *
+     * @param parts ordered text and image parts; at least one is required
+     * @return a multimodal user message
+     */
+    public static ChatMessage userMultimodal(ContentPart... parts) {
+        return new ChatMessage("user", Arrays.asList(parts));
+    }
+
     /**
      * Message role accessor.
      * @return the message role string
@@ -104,6 +163,24 @@ public List<ToolCall> getToolCalls() {
         return toolCalls;
     }
 
+    /**
+     * Multimodal content parts accessor.
+     * @return an unmodifiable list of text and image parts, or {@code null} for
+     *         legacy text-only messages built via {@link #ChatMessage(String, String)}
+     */
+    public List<ContentPart> getParts() {
+        return parts;
+    }
+
+    /**
+     * Whether this message carries multimodal parts (i.e. was constructed via
+     * {@link #ChatMessage(String, List)} or {@link #userMultimodal(ContentPart...)}).
+     * @return {@code true} when {@link #getParts()} is non-null
+     */
+    public boolean hasParts() {
+        return parts != null;
+    }
+
     @Override
     public String toString() {
         if (!toolCalls.isEmpty()) return role + " (tool_calls=" + toolCalls.size() + "): " + content;
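
The parts handling in the `ChatMessage` diff above combines three behaviours: empty input is rejected, the list is copied defensively and wrapped as unmodifiable, and text parts are joined with `'\n'` while image parts are skipped. A minimal standalone sketch of those behaviours follows. `PartsSketch`, `Part`, and `freeze` are hypothetical names introduced only for illustration; the project's real types are `ChatMessage` and `ContentPart`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the parts handling shown in the diff; not the project's class. */
final class PartsSketch {

    /** Minimal stand-in for ContentPart: exactly one of the two fields is non-null. */
    static final class Part {
        final String text;     // non-null for text parts
        final String imageUrl; // non-null for image parts
        Part(String text, String imageUrl) { this.text = text; this.imageUrl = imageUrl; }
        static Part text(String t) { return new Part(t, null); }
        static Part image(String url) { return new Part(null, url); }
    }

    /** Joins text parts with a newline and skips image parts, as concatText does in the diff. */
    static String concatText(List<Part> parts) {
        StringBuilder sb = new StringBuilder();
        for (Part p : parts) {
            if (p.text != null) {
                if (sb.length() > 0) sb.append('\n');
                sb.append(p.text);
            }
        }
        return sb.toString();
    }

    /** Rejects null or empty input, then takes a defensive, unmodifiable copy. */
    static List<Part> freeze(List<Part> parts) {
        if (parts == null || parts.isEmpty()) {
            throw new IllegalArgumentException("parts must not be null or empty");
        }
        return Collections.unmodifiableList(new ArrayList<Part>(parts));
    }

    public static void main(String[] args) {
        List<Part> source = new ArrayList<Part>(Arrays.asList(
                Part.text("Compare these."),
                Part.image("data:image/png;base64,AAAA"),
                Part.text("Which is sharper?")));
        List<Part> frozen = freeze(source);
        source.clear(); // later changes to the caller's list do not leak into the copy
        System.out.println(frozen.size());
        System.out.println(concatText(frozen));
    }
}
```

Because the copy is taken before wrapping, the stored list is isolated from the caller's list in both directions: the caller cannot mutate it through the original reference, and readers cannot mutate it through the accessor.
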
Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
+// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+//
+// SPDX-License-Identifier: MIT
+
+package net.ladenthin.llama;
+
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.Base64;
+import java.util.Locale;
+import java.util.Objects;
+
+/**
+ * One piece of a {@link ChatMessage}'s multimodal content array: either a text
+ * fragment or an image URL (typically a {@code data:image/...;base64,...} URI).
+ * Mirrors the OpenAI-compatible {@code content} part shape that the upstream
+ * {@code llama.cpp} server already understands, so no new JNI plumbing is
+ * required &#x2014; an image-bearing message is serialized to
+ * <pre>
+ * {"role":"user","content":[
+ *   {"type":"text","text":"What is in this image?"},
+ *   {"type":"image_url","image_url":{"url":"data:image/png;base64,..."}}
+ * ]}
+ * </pre>
+ * and the upstream {@code oaicompat_chat_params_parse} routes it through the
+ * compiled-in {@code mtmd} pipeline (requires
+ * {@link ModelParameters#setMmproj(String)} to be wired).
+ * <p>
+ * Instances are immutable and safe to share across threads. Use the static
+ * factories &#x2014; the constructor is private.
+ * </p>
+ */
+public final class ContentPart {
+
+    /** Discriminator for the two part kinds the OAI multipart schema supports. */
+    public enum Type {
+        /** A plain-text fragment. */
+        TEXT,
+        /** An image reference (data URI or remote URL). */
+        IMAGE_URL
+    }
+
+    private final Type type;
+    private final String text;
+    private final String imageUrl;
+
+    private ContentPart(Type type, String text, String imageUrl) {
+        this.type = type;
+        this.text = text;
+        this.imageUrl = imageUrl;
+    }
+
+    /**
+     * Build a text part.
+     *
+     * @param text the text fragment (must not be {@code null})
+     * @return a TEXT part wrapping {@code text}
+     */
+    public static ContentPart text(String text) {
+        Objects.requireNonNull(text, "text");
+        return new ContentPart(Type.TEXT, text, null);
+    }
+
+    /**
+     * Build an image part from a pre-formed URL or data URI. Pass either an
+     * HTTP(S) URL (if the server is configured to fetch it) or a complete
+     * {@code data:image/...;base64,...} string.
+     *
+     * @param url image URL or data URI (must not be {@code null})
+     * @return an IMAGE_URL part wrapping {@code url}
+     */
+    public static ContentPart imageUrl(String url) {
+        Objects.requireNonNull(url, "url");
+        return new ContentPart(Type.IMAGE_URL, null, url);
+    }
+
+    /**
+     * Build an image part from raw bytes plus an explicit MIME type. The bytes
+     * are base64-encoded and wrapped in a {@code data:} URI.
+     *
+     * @param bytes raw image bytes (must not be {@code null})
+     * @param mimeType MIME type, e.g. {@code "image/png"} (must not be {@code null} or empty)
+     * @return an IMAGE_URL part carrying the data URI
+     */
+    public static ContentPart imageBytes(byte[] bytes, String mimeType) {
+        Objects.requireNonNull(bytes, "bytes");
+        Objects.requireNonNull(mimeType, "mimeType");
+        if (mimeType.isEmpty()) {
+            throw new IllegalArgumentException("mimeType must not be empty");
+        }
+        String encoded = Base64.getEncoder().encodeToString(bytes);
+        return new ContentPart(Type.IMAGE_URL, null, "data:" + mimeType + ";base64," + encoded);
+    }
+
+    /**
+     * Build an image part by reading a file from disk and detecting its MIME
+     * type from the file extension. Recognised extensions: {@code .png},
+     * {@code .jpg}, {@code .jpeg}, {@code .webp}, {@code .gif}. Anything else
+     * throws {@link IllegalArgumentException}; use {@link #imageBytes(byte[], String)}
+     * to force a MIME type explicitly.
+     *
+     * @param imagePath path to the image file (must not be {@code null})
+     * @return an IMAGE_URL part carrying the data URI
+     * @throws IOException if the file cannot be read
+     */
+    public static ContentPart imageFile(Path imagePath) throws IOException {
+        Objects.requireNonNull(imagePath, "imagePath");
+        String name = imagePath.getFileName().toString().toLowerCase(Locale.ROOT);
+        String mimeType;
+        if (name.endsWith(".png")) {
+            mimeType = "image/png";
+        } else if (name.endsWith(".jpg") || name.endsWith(".jpeg")) {
+            mimeType = "image/jpeg";
+        } else if (name.endsWith(".webp")) {
+            mimeType = "image/webp";
+        } else if (name.endsWith(".gif")) {
+            mimeType = "image/gif";
+        } else {
+            throw new IllegalArgumentException(
+                    "Cannot infer MIME type from extension: " + imagePath
+                    + "; use ContentPart.imageBytes(bytes, mimeType) instead");
+        }
+        return imageBytes(Files.readAllBytes(imagePath), mimeType);
+    }
+
+    /**
+     * Part-kind accessor.
+     * @return the discriminator selecting {@link #getText()} or {@link #getImageUrl()}
+     */
+    public Type getType() {
+        return type;
+    }
+
+    /**
+     * Text accessor (only set for {@link Type#TEXT}).
+     * @return the text fragment, or {@code null} for {@link Type#IMAGE_URL} parts
+     */
+    public String getText() {
+        return text;
+    }
+
+    /**
+     * Image URL accessor (only set for {@link Type#IMAGE_URL}).
+     * @return the URL or data URI, or {@code null} for {@link Type#TEXT} parts
+     */
+    public String getImageUrl() {
+        return imageUrl;
+    }
+}
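
The data-URI construction in `imageBytes` and the extension table in `imageFile` can be exercised without touching the file system. The sketch below mirrors that logic in a standalone class. `DataUriSketch`, `mimeFromName`, `toDataUri`, and `payloadOf` are hypothetical names for illustration and are not part of the commit; only `java.util.Base64` and `java.util.Locale` from the JDK are used.

```java
import java.util.Base64;
import java.util.Locale;

/** Hypothetical helper mirroring the data-URI logic in the diff; not the project's class. */
final class DataUriSketch {

    /** Maps a file name to a MIME type using the same extension table as imageFile. */
    static String mimeFromName(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".png")) return "image/png";
        if (name.endsWith(".jpg") || name.endsWith(".jpeg")) return "image/jpeg";
        if (name.endsWith(".webp")) return "image/webp";
        if (name.endsWith(".gif")) return "image/gif";
        throw new IllegalArgumentException("Cannot infer MIME type from: " + fileName);
    }

    /** Builds data:<mime>;base64,<payload> with the standard (not URL-safe) alphabet. */
    static String toDataUri(byte[] bytes, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(bytes);
    }

    /** Decodes the payload after the first comma, to check the round trip is lossless. */
    static byte[] payloadOf(String dataUri) {
        int comma = dataUri.indexOf(',');
        return Base64.getDecoder().decode(dataUri.substring(comma + 1));
    }

    public static void main(String[] args) {
        // The eight-byte PNG signature stands in for real image bytes.
        byte[] pngSignature = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
        String uri = toDataUri(pngSignature, mimeFromName("Photo.PNG"));
        System.out.println(uri); // data:image/png;base64,iVBORw0KGgo=
        System.out.println(payloadOf(uri).length);
    }
}
```

Base64 turns every 3 input bytes into 4 output characters, so the data URI is roughly a third larger than the image file it carries. That matters for large images, since the whole URI travels inside the request JSON.
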

src/main/java/net/ladenthin/llama/InferenceParameters.java

Lines changed: 22 additions & 0 deletions
@@ -554,6 +554,28 @@ public InferenceParameters setMessages(String systemMessage, List<Pair<String, S
         return this;
     }
 
+    /**
+     * Multimodal-capable variant. Accepts {@link ChatMessage} objects so messages
+     * with non-null {@link ChatMessage#getParts()} are serialized as OAI array-form
+     * {@code content} (text + image_url parts). Plain text messages emit the legacy
+     * string-form {@code content}, so this overload is also a drop-in replacement
+     * for the {@code List<Pair>} variant when callers prefer the typed
+     * {@link ChatMessage} surface.
+     * <p>
+     * Image parts require the model to have a multimodal projector loaded via
+     * {@link ModelParameters#setMmproj(String)}. The upstream OAI chat parser
+     * routes {@code image_url} blocks through the compiled-in {@code mtmd}
+     * pipeline; no additional JNI configuration is needed on the Java side.
+     * </p>
+     *
+     * @param messages ordered messages, including any {@code "system"} prelude
+     * @return this builder
+     */
+    public InferenceParameters setMessages(List<ChatMessage> messages) {
+        parameters.put(PARAM_MESSAGES, serializer.buildMessages(messages).toString());
+        return this;
+    }
+
     /**
      * Set the {@code messages} array directly from a pre-built JSON string. Use this
      * for the typed chat API (see {@link ChatRequest#buildMessagesJson()}) when the
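
The body of `ParameterJsonSerializer.buildMessages(List<ChatMessage>)` is not part of the diff shown here, so the following is only a sketch of the rule the Javadoc describes: string-form `content` for plain messages, array-form `content` when parts are present. Every name in it (`MessagesJsonSketch`, `Part`, `Message`, `quote`) is hypothetical, and the hand-rolled JSON writer is an assumption made to keep the example dependency-free; the project may well use a JSON library instead.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the string-form vs array-form split; not ParameterJsonSerializer. */
final class MessagesJsonSketch {

    /** Stand-in for a content part: exactly one of text / imageUrl is non-null. */
    static final class Part {
        final String text;
        final String imageUrl;
        Part(String text, String imageUrl) { this.text = text; this.imageUrl = imageUrl; }
    }

    /** Stand-in for a chat message: parts == null marks a legacy text-only message. */
    static final class Message {
        final String role;
        final String content;
        final List<Part> parts;
        Message(String role, String content, List<Part> parts) {
            this.role = role; this.content = content; this.parts = parts;
        }
    }

    /** Wraps s in double quotes and escapes what JSON requires inside a string. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    // Remaining control characters become four-digit hex escapes.
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Emits one element of the OAI content array. */
    static String partJson(Part p) {
        if (p.text != null) {
            return "{\"type\":\"text\",\"text\":" + quote(p.text) + "}";
        }
        return "{\"type\":\"image_url\",\"image_url\":{\"url\":" + quote(p.imageUrl) + "}}";
    }

    /** String-form content for legacy messages, array-form when parts are present. */
    static String messageJson(Message m) {
        StringBuilder sb = new StringBuilder("{\"role\":").append(quote(m.role)).append(",\"content\":");
        if (m.parts == null) {
            sb.append(quote(m.content));
        } else {
            sb.append('[');
            for (int i = 0; i < m.parts.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(partJson(m.parts.get(i)));
            }
            sb.append(']');
        }
        return sb.append('}').toString();
    }

    /** Serializes the whole list as a JSON array of message objects. */
    static String messagesJson(List<Message> messages) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < messages.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(messageJson(messages.get(i)));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<Message> messages = Arrays.asList(
                new Message("system", "Be brief.", null),
                new Message("user", null, Arrays.asList(
                        new Part("What is this?", null),
                        new Part(null, "data:image/png;base64,AAAA"))));
        System.out.println(messagesJson(messages));
    }
}
```

Deciding per message rather than per list is what lets a mixed conversation keep its text-only turns in the legacy string form while only the image-bearing turn switches to the array form.
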
