feat(multimodal): audio input via OpenAI input_audio content parts
Extends the typed multimodal API from vision to audio (llama.cpp discussion #13759). No native/JNI
change is needed: upstream b9739 already decodes the OpenAI `input_audio` content part
(server-common.cpp) into the same media buffer vision uses, which the JNI bridge already threads
through to mtmd's audio pipeline; mtmd supports audio (mtmd_support_audio / mtmd_bitmap_init_from_audio).
- ContentPart: new INPUT_AUDIO kind + factories ContentPart.inputAudio(byte[], "wav"|"mp3") and
audioFile(Path) (extension -> format), with base64 data + format accessors.
- ParameterJsonSerializer.buildMessages emits {"type":"input_audio","input_audio":{"data","format"}};
ChatMessage.concatText already skips non-text parts, so getContent() is unaffected.
- LlamaModel.supportsAudio() (parallel to supportsVision(); ModelMeta.supportsAudio already existed,
fed by the native meta's modalities.audio).
- Tests: ContentPart audio factories + format validation, a ChatRequest serializer test asserting the
input_audio JSON shape, and a gated AudioInputIntegrationTest (Ultravox / Qwen2.5-Omni) that
self-skips without the audio model / mmproj / clip (3 new audio.* system properties).
- Docs: README "Vision / Multimodal Chat" audio example + system-property rows; CLAUDE.md property
table + run command.
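As a rough illustration of the `input_audio` shape named in the list above, the following JDK-only sketch maps a file extension to a format, base64-encodes the bytes, and emits one content part. `InputAudioSketch`, `formatFromFileName` and `toContentPartJson` are hypothetical names introduced here, not the library's `ContentPart` or `ParameterJsonSerializer`; the real validation and serialization may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of the library. */
final class InputAudioSketch {

    private InputAudioSketch() {}

    /** Derives the format from the file extension; only "wav" and "mp3" are accepted. */
    static String formatFromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (ext.equals("wav") || ext.equals("mp3")) {
            return ext;
        }
        throw new IllegalArgumentException("Unsupported audio extension: " + fileName);
    }

    /** Builds {"type":"input_audio","input_audio":{"data":"<base64>","format":"<format>"}}. */
    static String toContentPartJson(byte[] audio, String format) {
        if (!format.equals("wav") && !format.equals("mp3")) {
            throw new IllegalArgumentException("Unsupported audio format: " + format);
        }
        // The standard Base64 alphabet needs no JSON escaping.
        String data = Base64.getEncoder().encodeToString(audio);
        return "{\"type\":\"input_audio\",\"input_audio\":{\"data\":\"" + data
                + "\",\"format\":\"" + format + "\"}}";
    }

    public static void main(String[] args) {
        // Placeholder bytes stand in for a real clip.
        byte[] fakeClip = "RIFF".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toContentPartJson(fakeClip, formatFromFileName("clip.wav")));
    }
}
```

Hand-built JSON is used only to keep the sketch dependency-free; a real serializer would normally go through a JSON library.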
The OpenAI server needs no change — audio content parts already round-trip verbatim through
/v1/chat/completions. The audio model download is intentionally NOT added to CI (Ultravox is large and
the test self-skips); it's documented as locally/CI-runnable.
Verified: 50 affected unit tests + audio serializer test green, integration test self-skips,
Spotless + Javadoc clean.
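The self-skip behaviour mentioned above could be approximated by a prerequisite check like the one below. This is a sketch under assumptions: `AudioTestPrerequisites` and `firstMissing` are hypothetical names, and the actual `AudioInputIntegrationTest` may gate itself differently. Only the three `audio.*` property names come from this commit.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of the library. */
final class AudioTestPrerequisites {

    /** The three audio.* system properties documented in this commit. */
    static final List<String> KEYS = List.of(
            "net.ladenthin.llama.audio.model",
            "net.ladenthin.llama.audio.mmproj",
            "net.ladenthin.llama.audio.input");

    private AudioTestPrerequisites() {}

    /** Returns the first property that is unset, blank, or not an existing regular file. */
    static Optional<String> firstMissing() {
        for (String key : KEYS) {
            String value = System.getProperty(key);
            if (value == null || value.isBlank() || !Files.isRegularFile(Path.of(value))) {
                return Optional.of(key);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A test would skip (not fail) when this reports a missing prerequisite.
        System.out.println(firstMissing()
                .map(key -> "skip: " + key + " is unset or missing")
                .orElse("all audio prerequisites present"));
    }
}
```

In a JUnit test, the result would typically feed an assumption (for example `assumeTrue(firstMissing().isEmpty())`), so a partial setup is reported as skipped rather than failed.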
|`net.ladenthin.llama.vision.mmproj`|`MultimodalIntegrationTest`| matching mmproj for the vision model, e.g. `mmproj-SmolVLM-500M-Instruct-Q8_0.gguf`|
|`net.ladenthin.llama.vision.image`|`MultimodalIntegrationTest`| committed default `src/test/resources/images/test-image.jpg`; override to any png/jpeg/webp/gif on disk |
+|`net.ladenthin.llama.audio.model`|`AudioInputIntegrationTest` (llama.cpp discussion #13759) | audio-input model GGUF, e.g. `ultravox-v0_5-llama-3_2-1b.gguf`|
+|`net.ladenthin.llama.audio.mmproj`|`AudioInputIntegrationTest`| matching audio mmproj/encoder, e.g. `mmproj-ultravox-v0_5-llama-3_2-1b-f16.gguf`|
+|`net.ladenthin.llama.audio.input`|`AudioInputIntegrationTest`| a `.wav`/`.mp3` clip on disk (no committed default — audio is not committed) |
Run those tests by setting the property:
```bash
@@ -596,6 +599,12 @@ mvn test -Dtest=MultimodalIntegrationTest \
# The vision.image property defaults to src/test/resources/images/test-image.jpg
# (a CC-BY-4.0 / MIT-granted photo of flowers and bees by the project author);
# override only if you want to test a different image.
+
+# Audio input (Ultravox / Qwen2.5-Omni; the audio clip has no committed default):
README.md: 28 additions & 1 deletion
@@ -278,8 +278,11 @@ Every `net.ladenthin.llama.*` system property recognised by the library, deep-sc
|`net.ladenthin.llama.vision.model`| unset (test self-skips) | test |`MultimodalIntegrationTest` (upstream kherud/java-llama.cpp#103 / #34) | Path to a vision-capable model GGUF. Any vision-capable GGUF works; CI default is `SmolVLM-500M-Instruct-Q8_0.gguf`. |
|`net.ladenthin.llama.vision.mmproj`| unset (test self-skips) | test |`MultimodalIntegrationTest`| Matching mmproj GGUF for the vision model. |
|`net.ladenthin.llama.vision.image`|`src/test/resources/images/test-image.jpg` (a CC-BY-4.0 / MIT-granted photo committed to the repo) | test |`MultimodalIntegrationTest`| Visual prompt image. Any png/jpeg/webp/gif works; the extension drives MIME detection. |
+|`net.ladenthin.llama.audio.model`| unset (test self-skips) | test |`AudioInputIntegrationTest` (llama.cpp discussion #13759) | Path to an audio-input model GGUF (e.g. Ultravox, Qwen2.5-Omni). |
+|`net.ladenthin.llama.audio.input`| unset (test self-skips) | test |`AudioInputIntegrationTest`|`.wav`/`.mp3` audio prompt clip; the extension drives format detection. |
-`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring.
+`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring. `AudioInputIntegrationTest` self-skips the same way over the three `audio.*` properties.
## Documentation
@@ -409,6 +412,30 @@ OpenAI-compatible `/v1/chat/completions` server. For a strictly CPU-only run, us
`setDevices("none").setMmprojOffload(false)` in addition to `setGpuLayers(0)`; projector offload
has its own upstream default.
+**Audio input** works identically — load an audio-capable model (Ultravox, Qwen2.5-Omni, …) with its
+audio `--mmproj` and add a `ContentPart.audioFile(...)` (or `inputAudio(bytes, "wav"|"mp3")`) part. It
+serializes to the OpenAI `input_audio` content part and routes through the same `mtmd` pipeline: