Commit 55a6fa0

feat(multimodal): audio input via OpenAI input_audio content parts
Extends the typed multimodal API from vision to audio (llama.cpp discussion #13759). No native/JNI change is needed: upstream b9739 already decodes the OpenAI `input_audio` content part (server-common.cpp) into the same media buffer vision uses, which the JNI bridge already threads through to mtmd's audio pipeline; mtmd supports audio (mtmd_support_audio / mtmd_bitmap_init_from_audio).

- ContentPart: new INPUT_AUDIO kind + factories ContentPart.inputAudio(byte[], "wav"|"mp3") and audioFile(Path) (extension -> format), with base64 data + format accessors.
- ParameterJsonSerializer.buildMessages emits {"type":"input_audio","input_audio":{"data","format"}}; ChatMessage.concatText already skips non-text parts, so getContent() is unaffected.
- LlamaModel.supportsAudio() (parallel to supportsVision(); ModelMeta.supportsAudio already existed, fed by the native meta's modalities.audio).
- Tests: ContentPart audio factories + format validation, a ChatRequest serializer test asserting the input_audio JSON shape, and a gated AudioInputIntegrationTest (Ultravox / Qwen2.5-Omni) that self-skips without the audio model / mmproj / clip (3 new audio.* system properties).
- Docs: README "Vision / Multimodal Chat" audio example + system-property rows; CLAUDE.md property table + run command.

The OpenAI server needs no change: audio content parts already round-trip verbatim through /v1/chat/completions. The audio model download is intentionally NOT added to CI (Ultravox is large and the test self-skips); it's documented as locally/CI-runnable.

Verified: 50 affected unit tests + audio serializer test green, integration test self-skips, Spotless + Javadoc clean.
1 parent 90f95d0 commit 55a6fa0

9 files changed

Lines changed: 289 additions & 6 deletions

CLAUDE.md

Lines changed: 9 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -585,6 +585,9 @@ the README. The summary below covers only the optional-model bindings:
585585
| `net.ladenthin.llama.vision.model` | `MultimodalIntegrationTest` (upstream kherud/java-llama.cpp#103 / #34) | `SmolVLM-500M-Instruct-Q8_0.gguf` (any vision-capable GGUF works) |
586586
| `net.ladenthin.llama.vision.mmproj` | `MultimodalIntegrationTest` | matching mmproj for the vision model, e.g. `mmproj-SmolVLM-500M-Instruct-Q8_0.gguf` |
587587
| `net.ladenthin.llama.vision.image` | `MultimodalIntegrationTest` | committed default `src/test/resources/images/test-image.jpg`; override to any png/jpeg/webp/gif on disk |
588+
| `net.ladenthin.llama.audio.model` | `AudioInputIntegrationTest` (llama.cpp discussion #13759) | audio-input model GGUF, e.g. `ultravox-v0_5-llama-3_2-1b.gguf` |
589+
| `net.ladenthin.llama.audio.mmproj` | `AudioInputIntegrationTest` | matching audio mmproj/encoder, e.g. `mmproj-ultravox-v0_5-llama-3_2-1b-f16.gguf` |
590+
| `net.ladenthin.llama.audio.input` | `AudioInputIntegrationTest` | a `.wav`/`.mp3` clip on disk (no committed default — audio is not committed) |
588591

589592
Run those tests by setting the property:
590593
```bash
@@ -596,6 +599,12 @@ mvn test -Dtest=MultimodalIntegrationTest \
596599
# The vision.image property defaults to src/test/resources/images/test-image.jpg
597600
# (a CC-BY-4.0 / MIT-granted photo of flowers and bees by the project author);
598601
# override only if you want to test a different image.
602+
603+
# Audio input (Ultravox / Qwen2.5-Omni; the audio clip has no committed default):
604+
mvn test -Dtest=AudioInputIntegrationTest \
605+
-Dnet.ladenthin.llama.audio.model=models/ultravox-v0_5-llama-3_2-1b.gguf \
606+
-Dnet.ladenthin.llama.audio.mmproj=models/mmproj-ultravox-v0_5-llama-3_2-1b-f16.gguf \
607+
-Dnet.ladenthin.llama.audio.input=/path/to/speech.wav
599608
```
600609

601610
`MultimodalIntegrationTest` self-skips when any of the three vision properties

README.md

Lines changed: 28 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -278,8 +278,11 @@ Every `net.ladenthin.llama.*` system property recognised by the library, deep-sc
278278
| `net.ladenthin.llama.vision.model` | unset (test self-skips) | test | `MultimodalIntegrationTest` (upstream kherud/java-llama.cpp#103 / #34) | Path to a vision-capable model GGUF. Any vision-capable GGUF works; CI default is `SmolVLM-500M-Instruct-Q8_0.gguf`. |
279279
| `net.ladenthin.llama.vision.mmproj` | unset (test self-skips) | test | `MultimodalIntegrationTest` | Matching mmproj GGUF for the vision model. |
280280
| `net.ladenthin.llama.vision.image` | `src/test/resources/images/test-image.jpg` (a CC-BY-4.0 / MIT-granted photo committed to the repo) | test | `MultimodalIntegrationTest` | Visual prompt image. Any png/jpeg/webp/gif works; the extension drives MIME detection. |
281+
| `net.ladenthin.llama.audio.model` | unset (test self-skips) | test | `AudioInputIntegrationTest` (llama.cpp discussion #13759) | Path to an audio-input model GGUF (e.g. Ultravox, Qwen2.5-Omni). |
282+
| `net.ladenthin.llama.audio.mmproj` | unset (test self-skips) | test | `AudioInputIntegrationTest` | Matching audio mmproj (encoder) GGUF. |
283+
| `net.ladenthin.llama.audio.input` | unset (test self-skips) | test | `AudioInputIntegrationTest` | `.wav`/`.mp3` audio prompt clip; the extension drives format detection. |
281284

282-
`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring.
285+
`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring. `AudioInputIntegrationTest` self-skips the same way over the three `audio.*` properties.
283286

284287
## Documentation
285288

@@ -409,6 +412,30 @@ OpenAI-compatible `/v1/chat/completions` server. For a strictly CPU-only run, us
409412
`setDevices("none").setMmprojOffload(false)` in addition to `setGpuLayers(0)`; projector offload
410413
has its own upstream default.
411414

415+
**Audio input** works identically — load an audio-capable model (Ultravox, Qwen2.5-Omni, …) with its
416+
audio `--mmproj` and add a `ContentPart.audioFile(...)` (or `inputAudio(bytes, "wav"|"mp3")`) part. It
417+
serializes to the OpenAI `input_audio` content part and routes through the same `mtmd` pipeline:
418+
419+
```java
420+
ModelParameters modelParams = new ModelParameters()
421+
.setModel("models/ultravox-v0_5-llama-3_2-1b.gguf")
422+
.setMmproj("models/mmproj-ultravox-v0_5-llama-3_2-1b-f16.gguf");
423+
424+
ChatMessage message = ChatMessage.userMultimodal(
425+
ContentPart.text("Transcribe the audio."),
426+
ContentPart.audioFile(Paths.get("speech.wav")));
427+
428+
try (LlamaModel model = new LlamaModel(modelParams)) {
429+
System.out.println(model.supportsAudio()); // true
430+
String answer = model.chatCompleteText(InferenceParameters.empty()
431+
.withMessages(Collections.singletonList(message))
432+
.withNPredict(64));
433+
System.out.println(answer);
434+
}
435+
```
436+
437+
`LlamaModel.supportsVision()` / `supportsAudio()` report which modalities the loaded projector enables.
438+
412439
### Tool Calling
413440

414441
Use a tool-aware instruct model and enable Jinja when loading it. A typed request can either return

src/main/java/net/ladenthin/llama/LlamaModel.java

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -851,6 +851,17 @@ public boolean supportsVision() {
851851
return getModelMeta().supportsVision();
852852
}
853853

854+
/**
855+
* Reports whether the loaded model accepts audio input (an audio-capable {@code --mmproj},
856+
* e.g. Ultravox / Qwen2.5-Omni). Audio clips are supplied as
857+
* {@link net.ladenthin.llama.value.ContentPart#inputAudio(byte[], String)} parts.
858+
*
859+
* @return {@code true} when audio input is available
860+
*/
861+
public boolean supportsAudio() {
862+
return getModelMeta().supportsAudio();
863+
}
864+
854865
native String getModelMetaJson();
855866

856867
/**
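
Callers may want to check the modality before building an audio request, so that a text-only or vision-only model fails fast with a clear message. The following is a minimal sketch that assumes a stand-in interface rather than the real `LlamaModel`: `AudioGuardSketch`, `AudioCapable` and `requireAudio` are hypothetical names introduced for illustration, and only `supportsAudio()` comes from this commit.

```java
/** Hypothetical guard; not part of the library. */
final class AudioGuardSketch {

    /** Stand-in for the one LlamaModel method this sketch relies on. */
    interface AudioCapable {
        boolean supportsAudio();
    }

    /** Throws if the loaded model/projector pair does not advertise audio input. */
    static void requireAudio(AudioCapable model) {
        if (!model.supportsAudio()) {
            throw new IllegalStateException(
                    "loaded model does not report audio input; check that an audio-capable mmproj is set");
        }
    }

    public static void main(String[] args) {
        requireAudio(() -> true); // passes silently
        try {
            requireAudio(() -> false);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the real class, a method reference such as `model::supportsAudio` could presumably be passed where the lambda appears here.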

src/main/java/net/ladenthin/llama/parameters/ParameterJsonSerializer.java

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -126,6 +126,14 @@ public ArrayNode buildMessages(List<ChatMessage> messages) {
126126
part.put("type", "text");
127127
final String text = p.getText();
128128
part.put("text", text != null ? text : "");
129+
} else if (p.getType() == ContentPart.Type.INPUT_AUDIO) {
130+
part.put("type", "input_audio");
131+
ObjectNode inputAudio = OBJECT_MAPPER.createObjectNode();
132+
final String data = p.getAudioData();
133+
final String format = p.getAudioFormat();
134+
inputAudio.put("data", data != null ? data : "");
135+
inputAudio.put("format", format != null ? format : "wav");
136+
part.set("input_audio", inputAudio);
129137
} else {
130138
part.put("type", "image_url");
131139
ObjectNode imageUrl = OBJECT_MAPPER.createObjectNode();
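
The wire shape produced by the new `INPUT_AUDIO` branch can be reproduced without Jackson. The following is a minimal stand-alone sketch, not the library's serializer: `InputAudioJsonSketch` and `inputAudioPart` are hypothetical names, and the format check simply mirrors the validation in `ContentPart.inputAudio` from this commit.

```java
import java.util.Base64;
import java.util.Locale;

/** Hypothetical stand-alone sketch; not the library's ParameterJsonSerializer. */
final class InputAudioJsonSketch {

    /** Builds an OpenAI-style input_audio content part as a JSON string. */
    static String inputAudioPart(byte[] audioBytes, String format) {
        String normalized = format.toLowerCase(Locale.ROOT);
        if (!normalized.equals("wav") && !normalized.equals("mp3")) {
            throw new IllegalArgumentException("audio format must be 'wav' or 'mp3', was: " + format);
        }
        // Standard base64 output only contains [A-Za-z0-9+/=], so it needs no JSON escaping.
        String data = Base64.getEncoder().encodeToString(audioBytes);
        return "{\"type\":\"input_audio\",\"input_audio\":{\"data\":\"" + data
                + "\",\"format\":\"" + normalized + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println(inputAudioPart(new byte[] {1, 2, 3}, "WAV"));
    }
}
```

For the bytes `{1, 2, 3}` used in `ChatRequestTest`, the `data` field comes out as `AQID`.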

src/main/java/net/ladenthin/llama/value/ContentPart.java

Lines changed: 82 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -44,17 +44,28 @@ public enum Type {
4444
/** A plain-text fragment. */
4545
TEXT,
4646
/** An image reference (data URI or remote URL). */
47-
IMAGE_URL
47+
IMAGE_URL,
48+
/** An audio clip (base64 {@code data} + {@code format}), for audio-input models. */
49+
INPUT_AUDIO
4850
}
4951

5052
private final Type type;
5153
private final @Nullable String text;
5254
private final @Nullable String imageUrl;
55+
private final @Nullable String audioData;
56+
private final @Nullable String audioFormat;
5357

54-
private ContentPart(Type type, @Nullable String text, @Nullable String imageUrl) {
58+
private ContentPart(
59+
Type type,
60+
@Nullable String text,
61+
@Nullable String imageUrl,
62+
@Nullable String audioData,
63+
@Nullable String audioFormat) {
5564
this.type = type;
5665
this.text = text;
5766
this.imageUrl = imageUrl;
67+
this.audioData = audioData;
68+
this.audioFormat = audioFormat;
5869
}
5970

6071
/**
@@ -65,7 +76,7 @@ private ContentPart(Type type, @Nullable String text, @Nullable String imageUrl)
6576
*/
6677
public static ContentPart text(String text) {
6778
Objects.requireNonNull(text, "text");
68-
return new ContentPart(Type.TEXT, text, null);
79+
return new ContentPart(Type.TEXT, text, null, null, null);
6980
}
7081

7182
/**
@@ -78,7 +89,7 @@ public static ContentPart text(String text) {
7889
*/
7990
public static ContentPart imageUrl(String url) {
8091
Objects.requireNonNull(url, "url");
81-
return new ContentPart(Type.IMAGE_URL, null, url);
92+
return new ContentPart(Type.IMAGE_URL, null, url, null, null);
8293
}
8394

8495
/**
@@ -96,7 +107,7 @@ public static ContentPart imageBytes(byte[] bytes, String mimeType) {
96107
throw new IllegalArgumentException("mimeType must not be empty (bytes.length=" + bytes.length + ")");
97108
}
98109
String encoded = Base64.getEncoder().encodeToString(bytes);
99-
return new ContentPart(Type.IMAGE_URL, null, "data:" + mimeType + ";base64," + encoded);
110+
return new ContentPart(Type.IMAGE_URL, null, "data:" + mimeType + ";base64," + encoded, null, null);
100111
}
101112

102113
/**
@@ -133,6 +144,56 @@ public static ContentPart imageFile(Path imagePath) throws IOException {
133144
return imageBytes(Files.readAllBytes(imagePath), mimeType);
134145
}
135146

147+
/**
148+
* Build an audio part from raw bytes plus an explicit container format. Mirrors the OpenAI
149+
* {@code input_audio} content part the upstream {@code llama.cpp} server understands, routed
150+
* through the {@code mtmd} audio pipeline (requires an audio-capable {@code --mmproj}). The bytes
151+
* are base64-encoded.
152+
*
153+
* @param audioBytes raw audio bytes (must not be {@code null})
154+
* @param format container format, {@code "wav"} or {@code "mp3"} (case-insensitive)
155+
* @return an INPUT_AUDIO part carrying the base64 data and normalised format
156+
* @throws IllegalArgumentException if {@code format} is not {@code "wav"} or {@code "mp3"}
157+
*/
158+
public static ContentPart inputAudio(byte[] audioBytes, String format) {
159+
Objects.requireNonNull(audioBytes, "audioBytes");
160+
Objects.requireNonNull(format, "format");
161+
String normalized = format.toLowerCase(Locale.ROOT);
162+
if (!normalized.equals("wav") && !normalized.equals("mp3")) {
163+
throw new IllegalArgumentException("audio format must be 'wav' or 'mp3', was: " + format);
164+
}
165+
String encoded = Base64.getEncoder().encodeToString(audioBytes);
166+
return new ContentPart(Type.INPUT_AUDIO, null, null, encoded, normalized);
167+
}
168+
169+
/**
170+
* Build an audio part by reading a file from disk and detecting its format from the file
171+
* extension. Recognised extensions: {@code .wav}, {@code .mp3}. Anything else throws
172+
* {@link IllegalArgumentException}; use {@link #inputAudio(byte[], String)} to force a format.
173+
*
174+
* @param audioPath path to the audio file (must not be {@code null})
175+
* @return an INPUT_AUDIO part carrying the data
176+
* @throws IOException if the file cannot be read
177+
*/
178+
public static ContentPart audioFile(Path audioPath) throws IOException {
179+
Objects.requireNonNull(audioPath, "audioPath");
180+
Path fileNamePath = audioPath.getFileName();
181+
if (fileNamePath == null) {
182+
throw new IllegalArgumentException("audioPath has no file name component: " + audioPath);
183+
}
184+
String name = fileNamePath.toString().toLowerCase(Locale.ROOT);
185+
String format;
186+
if (name.endsWith(".wav")) {
187+
format = "wav";
188+
} else if (name.endsWith(".mp3")) {
189+
format = "mp3";
190+
} else {
191+
throw new IllegalArgumentException("Cannot infer audio format from extension: " + audioPath
192+
+ " — use ContentPart.inputAudio(bytes, format) instead");
193+
}
194+
return inputAudio(Files.readAllBytes(audioPath), format);
195+
}
196+
136197
/**
137198
* Part-kind accessor.
138199
* @return the discriminator selecting {@link #getText()} or {@link #getImageUrl()}
@@ -156,4 +217,20 @@ public Type getType() {
156217
public @Nullable String getImageUrl() {
157218
return imageUrl;
158219
}
220+
221+
/**
222+
* Base64 audio-data accessor (only set for {@link Type#INPUT_AUDIO}).
223+
* @return the base64-encoded audio bytes, or {@code null} for non-audio parts
224+
*/
225+
public @Nullable String getAudioData() {
226+
return audioData;
227+
}
228+
229+
/**
230+
* Audio container-format accessor (only set for {@link Type#INPUT_AUDIO}).
231+
* @return {@code "wav"} or {@code "mp3"}, or {@code null} for non-audio parts
232+
*/
233+
public @Nullable String getAudioFormat() {
234+
return audioFormat;
235+
}
159236
}
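
`audioFile(Path)` rejects any extension other than `.wav` or `.mp3` and points callers at `inputAudio(bytes, format)`, so for a clip with a missing or unusual extension the caller has to pick the format. One possible approach, sketched here under stated assumptions, is to inspect the leading bytes: a WAV file is a RIFF container whose bytes 8 to 11 read `WAVE`, and an MP3 file typically begins with an `ID3` tag or an MPEG frame sync (eleven set bits). `AudioFormatSniffer` and `sniff` are hypothetical helpers, not part of the library, and the frame-sync check is a heuristic that can misfire on other MPEG audio streams.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper; not part of the library. */
final class AudioFormatSniffer {

    /** Returns "wav" or "mp3" when the header looks like one of them, otherwise empty. */
    static Optional<String> sniff(byte[] bytes) {
        // RIFF container with a WAVE form type at offset 8.
        if (bytes.length >= 12 && ascii(bytes, 0, 4).equals("RIFF") && ascii(bytes, 8, 4).equals("WAVE")) {
            return Optional.of("wav");
        }
        // ID3v2 tag placed in front of the MPEG frames.
        if (bytes.length >= 3 && ascii(bytes, 0, 3).equals("ID3")) {
            return Optional.of("mp3");
        }
        // Bare MPEG audio frame: 11 sync bits set (0xFF followed by 0b111xxxxx). Heuristic only.
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xE0) == 0xE0) {
            return Optional.of("mp3");
        }
        return Optional.empty();
    }

    private static String ascii(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.US_ASCII);
    }
}
```

The detected value could then be passed as the `format` argument of `ContentPart.inputAudio(bytes, format)`, with an empty result treated as "unknown, ask the caller".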

src/test/java/net/ladenthin/llama/AudioInputIntegrationTest.java

Lines changed: 100 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,100 @@
1+
// SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
2+
//
3+
// SPDX-License-Identifier: MIT
4+
5+
package net.ladenthin.llama;
6+
7+
import static org.junit.jupiter.api.Assertions.assertFalse;
8+
import static org.junit.jupiter.api.Assertions.assertTrue;
9+
10+
import java.io.File;
11+
import java.io.IOException;
12+
import java.nio.file.Paths;
13+
import java.util.Collections;
14+
import java.util.concurrent.TimeUnit;
15+
import net.ladenthin.llama.parameters.InferenceParameters;
16+
import net.ladenthin.llama.parameters.ModelParameters;
17+
import net.ladenthin.llama.value.ChatMessage;
18+
import net.ladenthin.llama.value.ContentPart;
19+
import org.junit.jupiter.api.AfterAll;
20+
import org.junit.jupiter.api.Assumptions;
21+
import org.junit.jupiter.api.BeforeAll;
22+
import org.junit.jupiter.api.DisplayName;
23+
import org.junit.jupiter.api.Test;
24+
import org.junit.jupiter.api.Timeout;
25+
26+
/**
27+
* Real-model coverage for <b>audio input</b> (llama.cpp discussion #13759). Loads an audio-capable
28+
* model (Ultravox / Qwen2.5-Omni) with its audio {@code --mmproj} and sends a multipart message
29+
* carrying a {@link ContentPart#audioFile(java.nio.file.Path)} clip, exercising:
30+
* <ul>
31+
* <li>{@link ModelParameters#setMmproj(String)} wiring an audio encoder;</li>
32+
* <li>{@code ParameterJsonSerializer.buildMessages} emitting the OAI {@code input_audio} part;</li>
33+
* <li>the upstream {@code oaicompat_chat_params_parse} routing {@code input_audio} through the
34+
* compiled-in {@code mtmd} audio pipeline.</li>
35+
* </ul>
36+
*
37+
* <p>Self-skips when any of the three system properties
38+
* ({@link TestConstants#PROP_AUDIO_MODEL_PATH} / {@link TestConstants#PROP_AUDIO_MMPROJ_PATH} /
39+
* {@link TestConstants#PROP_AUDIO_PATH}) is unset or its file is missing, so it runs only in CI or on a
40+
* dev machine where the (large) audio model and a clip have been staged.
41+
*/
42+
public class AudioInputIntegrationTest {
43+
44+
private static LlamaModel model;
45+
private static String audioPath;
46+
47+
@BeforeAll
48+
public static void setup() {
49+
String modelPath = System.getProperty(TestConstants.PROP_AUDIO_MODEL_PATH);
50+
String mmprojPath = System.getProperty(TestConstants.PROP_AUDIO_MMPROJ_PATH);
51+
audioPath = System.getProperty(TestConstants.PROP_AUDIO_PATH);
52+
53+
Assumptions.assumeTrue(
54+
modelPath != null && !modelPath.isEmpty(),
55+
"Audio model path not set (-D" + TestConstants.PROP_AUDIO_MODEL_PATH + "=...)");
56+
Assumptions.assumeTrue(
57+
mmprojPath != null && !mmprojPath.isEmpty(),
58+
"Audio mmproj path not set (-D" + TestConstants.PROP_AUDIO_MMPROJ_PATH + "=...)");
59+
Assumptions.assumeTrue(
60+
audioPath != null && !audioPath.isEmpty(),
61+
"Audio clip path not set (-D" + TestConstants.PROP_AUDIO_PATH + "=...)");
62+
Assumptions.assumeTrue(new File(modelPath).exists(), "Audio model file missing: " + modelPath);
63+
Assumptions.assumeTrue(new File(mmprojPath).exists(), "Audio mmproj file missing: " + mmprojPath);
64+
Assumptions.assumeTrue(new File(audioPath).exists(), "Audio clip missing: " + audioPath);
65+
66+
int gpuLayers = Integer.getInteger(TestConstants.PROP_TEST_NGL, TestConstants.DEFAULT_TEST_NGL);
67+
ModelParameters parameters = new ModelParameters()
68+
.setCtxSize(4096)
69+
.setModel(modelPath)
70+
.setMmproj(mmprojPath)
71+
.setGpuLayers(gpuLayers)
72+
.setFit(false);
73+
if (gpuLayers == 0) {
74+
parameters.setDevices("none").setMmprojOffload(false);
75+
}
76+
model = new LlamaModel(parameters);
77+
assertTrue(model.supportsAudio(), "loaded model + mmproj must advertise audio input");
78+
}
79+
80+
@AfterAll
81+
public static void tearDown() {
82+
if (model != null) {
83+
model.close();
84+
}
85+
}
86+
87+
@Test
88+
@DisplayName("an input_audio content part reaches the model and yields a non-empty reply")
89+
@Timeout(value = 240_000, unit = TimeUnit.MILLISECONDS)
90+
public void audioInputProducesNonEmptyReply() throws IOException {
91+
ChatMessage message = ChatMessage.userMultimodal(
92+
ContentPart.text("Transcribe the audio."), ContentPart.audioFile(Paths.get(audioPath)));
93+
94+
String reply = model.chatCompleteText(InferenceParameters.empty()
95+
.withMessages(Collections.singletonList(message))
96+
.withNPredict(64));
97+
98+
assertFalse(reply.trim().isEmpty(), "reply must be non-empty for an audio prompt; got: \"" + reply + "\"");
99+
}
100+
}
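
The six `assumeTrue` calls in `setup()` all answer one question: is every required property set, and does it point at an existing file? The following is a minimal sketch of that check as a single predicate, assuming plain Java without JUnit. `SelfSkipSketch` and `allStaged` are hypothetical names, not helpers from this repository.

```java
import java.io.File;

/** Hypothetical helper; not part of the repository's test code. */
final class SelfSkipSketch {

    /** Returns true only when every named system property is set and points at an existing path. */
    static boolean allStaged(String... propertyNames) {
        for (String name : propertyNames) {
            String path = System.getProperty(name);
            if (path == null || path.isEmpty() || !new File(path).exists()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean ready = allStaged(
                "net.ladenthin.llama.audio.model",
                "net.ladenthin.llama.audio.mmproj",
                "net.ladenthin.llama.audio.input");
        System.out.println(ready ? "audio test can run" : "audio test would self-skip");
    }
}
```

In a JUnit 5 test the result could feed a single `Assumptions.assumeTrue(...)` call, at the cost of a less specific skip message than the per-property checks in the commit.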

src/test/java/net/ladenthin/llama/TestConstants.java

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -71,4 +71,21 @@ public class TestConstants {
7171
* resource so the test needs no network access for the visual prompt.
7272
*/
7373
public static final String DEFAULT_VISION_IMAGE_PATH = "src/test/resources/images/test-image.jpg";
74+
75+
/**
76+
* System property holding a path to an audio-input model GGUF (e.g. Ultravox / Qwen2.5-Omni).
77+
* Consumed by {@code AudioInputIntegrationTest} (llama.cpp discussion #13759). The test self-skips
78+
* when this, the mmproj, or the audio clip is unset/missing.
79+
*/
80+
public static final String PROP_AUDIO_MODEL_PATH = LlamaSystemProperties.PREFIX + ".audio.model";
81+
82+
/** System property holding a path to the matching audio mmproj (encoder) GGUF. */
83+
public static final String PROP_AUDIO_MMPROJ_PATH = LlamaSystemProperties.PREFIX + ".audio.mmproj";
84+
85+
/**
86+
* System property holding a path to a {@code .wav} or {@code .mp3} clip used as the audio prompt in
87+
* {@code AudioInputIntegrationTest}. The matching extension drives format detection in
88+
* {@code ContentPart.audioFile(Path)}.
89+
*/
90+
public static final String PROP_AUDIO_PATH = LlamaSystemProperties.PREFIX + ".audio.input";
7491
}

src/test/java/net/ladenthin/llama/parameters/ChatRequestTest.java

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -210,6 +210,19 @@ void buildMessagesJsonPreservesMultimodalParts() {
210210
assertThat(json, containsString("data:image/png;base64,AAAA"));
211211
}
212212

213+
@Test
214+
void buildMessagesJsonEmitsInputAudioParts() {
215+
ChatRequest req = ChatRequest.empty()
216+
.appendMessage(ChatMessage.userMultimodal(
217+
ContentPart.text("transcribe"), ContentPart.inputAudio(new byte[] {1, 2, 3}, "wav")));
218+
String json = req.buildMessagesJson();
219+
assertThat(json, containsString("\"type\":\"input_audio\""));
220+
assertThat(json, containsString("\"format\":\"wav\""));
221+
assertThat(
222+
json,
223+
containsString("\"data\":\"" + java.util.Base64.getEncoder().encodeToString(new byte[] {1, 2, 3})));
224+
}
225+
213226
@Test
214227
void buildToolsJsonEmptyWhenNoTools() {
215228
assertThat(ChatRequest.empty().buildToolsJson().isPresent(), is(false));
