docs(tts): document TextToSpeech API, test properties, and CI models
- README: Features bullet + "Text-to-Speech" usage section (TextToSpeech, the
two-model OuteTTS + WavTokenizer pipeline, WAV output, known number-drop
limitation, compatible GGUF links); two new rows in the System Properties
Reference (tts.ttc.model / tts.vocoder.model).
- CLAUDE.md: TextToSpeech in the Java-layer architecture list; jllama.cpp
method/line count refreshed (30 native methods incl. 3 TTS, ~1,516 lines);
TtsIntegrationTest property table + run example; test_tts_dsp.cpp added to the
C++ test-file table and the drifted counts reconciled to the actual 457
(test_server 188->189, test_jni_helpers 41->47, +5 TTS DSP).
Javadoc release gate verified (BUILD SUCCESS) with the new public TextToSpeech.
CLAUDE.md: 11 additions & 4 deletions
@@ -588,6 +588,8 @@ the README. The summary below covers only the optional-model bindings:
 | `net.ladenthin.llama.audio.model` | `AudioInputIntegrationTest` (llama.cpp discussion #13759) | audio-input model GGUF, e.g. `ultravox-v0_5-llama-3_2-1b.gguf` |
 | `net.ladenthin.llama.audio.mmproj` | `AudioInputIntegrationTest` | matching audio mmproj/encoder, e.g. `mmproj-ultravox-v0_5-llama-3_2-1b-f16.gguf` |
 | `net.ladenthin.llama.audio.input` | `AudioInputIntegrationTest` | a `.wav`/`.mp3` clip on disk (no committed default — audio is not committed) |
+| `net.ladenthin.llama.tts.ttc.model` | `TtsIntegrationTest` | OuteTTS text-to-codes model, e.g. `OuteTTS-0.2-500M-Q4_K_M.gguf` |
+| `net.ladenthin.llama.tts.vocoder.model` | `TtsIntegrationTest` | matching codes-to-speech vocoder, e.g. `WavTokenizer-Large-75-F16.gguf` |

 Run those tests by setting the property:
 ```bash
@@ -605,6 +607,9 @@ mvn test -Dtest=AudioInputIntegrationTest \
 - `LlamaModel` — Main API class (AutoCloseable). Wraps native context for inference, embeddings, re-ranking, and tokenization.
+- `TextToSpeech` — Separate AutoCloseable native type for speech synthesis over the two-model OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech vocoder) pipeline; `synthesize(text)` returns a 24 kHz mono 16-bit WAV byte stream. Native engine in `tts_engine.{h,cpp}`, output DSP in `tts_dsp.hpp`.
 - `ModelParameters` / `InferenceParameters` — Builder-pattern parameter classes that serialize to JSON (extend `JsonParameters`) for passing to native code.
 - `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
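Review note: the `TextToSpeech` bullet states that `synthesize(text)` returns a 24 kHz mono 16-bit WAV byte stream. A caller could check that claim on the returned bytes with a small header reader. The sketch below is illustrative only: `WavHeader` and its methods are hypothetical names, not part of the library, and it assumes the returned bytes carry a standard RIFF/WAVE header with a PCM `fmt ` chunk. The `canonical` method builds a stand-in WAV so the example runs without the native library or any model files.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of the library): reads the fmt chunk of a RIFF/WAVE byte array. */
public final class WavHeader {
    public final int channels;
    public final int sampleRate;
    public final int bitsPerSample;

    private WavHeader(int channels, int sampleRate, int bitsPerSample) {
        this.channels = channels;
        this.sampleRate = sampleRate;
        this.bitsPerSample = bitsPerSample;
    }

    /** Walks the RIFF chunks until it finds "fmt " and decodes its little-endian fields. */
    public static WavHeader parse(byte[] wav) {
        if (wav.length < 12 || !tag(wav, 0).equals("RIFF") || !tag(wav, 8).equals("WAVE")) {
            throw new IllegalArgumentException("not a RIFF/WAVE stream");
        }
        ByteBuffer buf = ByteBuffer.wrap(wav).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 12; // first chunk starts after "RIFF", the size field, and "WAVE"
        while (pos + 8 <= wav.length) {
            String id = tag(wav, pos);
            int size = buf.getInt(pos + 4);
            if (size < 0 || size > wav.length - pos - 8) {
                break; // chunk claims more bytes than the array holds
            }
            if (id.equals("fmt ")) {
                if (size < 16) {
                    throw new IllegalArgumentException("fmt chunk too short");
                }
                int channels = buf.getShort(pos + 10) & 0xFFFF;      // after the 2-byte audio format
                int sampleRate = buf.getInt(pos + 12);
                int bitsPerSample = buf.getShort(pos + 22) & 0xFFFF; // after byte rate and block align
                return new WavHeader(channels, sampleRate, bitsPerSample);
            }
            pos += 8 + size + (size & 1); // chunks are padded to an even length
        }
        throw new IllegalArgumentException("no usable fmt chunk found");
    }

    /** True when the header matches the format the documentation describes. */
    public boolean matchesDocumentedFormat() {
        return channels == 1 && sampleRate == 24000 && bitsPerSample == 16;
    }

    private static String tag(byte[] data, int offset) {
        return new String(data, offset, 4, StandardCharsets.US_ASCII);
    }

    /** Builds a canonical 44-byte-header PCM WAV; stands in for real synthesize() output. */
    public static byte[] canonical(int channels, int sampleRate, int bitsPerSample, byte[] pcm) {
        ByteBuffer b = ByteBuffer.allocate(44 + pcm.length).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII)).putInt(36 + pcm.length);
        b.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        b.put("fmt ".getBytes(StandardCharsets.US_ASCII)).putInt(16);
        b.putShort((short) 1).putShort((short) channels).putInt(sampleRate);
        b.putInt(sampleRate * channels * bitsPerSample / 8);
        b.putShort((short) (channels * bitsPerSample / 8)).putShort((short) bitsPerSample);
        b.put("data".getBytes(StandardCharsets.US_ASCII)).putInt(pcm.length).put(pcm);
        return b.array();
    }

    public static void main(String[] args) {
        byte[] wav = canonical(1, 24000, 16, new byte[48000]); // one second of silence
        WavHeader h = parse(wav);
        System.out.println(h.sampleRate + " Hz, " + h.channels + " channel(s), "
                + h.bitsPerSample + "-bit, matches=" + h.matchesDocumentedFormat());
    }
}
```

With real output, the same check would be `WavHeader.parse(tts.synthesize(...)).matchesDocumentedFormat()`, assuming the header layout described above holds.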
@@ -750,7 +756,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server".
README.md: 30 additions & 1 deletion
@@ -103,6 +103,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - Text completion (blocking and streaming) with full control over sampling parameters.
 - OpenAI-compatible **chat completion** with automatic chat-template application, including streaming and tool/function calling support via the upstream server.
 - **Embeddings** and **reranking** for retrieval pipelines.
+- **Text-to-speech** (`TextToSpeech`) over the two-model OuteTTS + WavTokenizer pipeline, returning WAV audio.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
@@ -287,8 +288,10 @@ Every `net.ladenthin.llama.*` system property recognised by the library, deep-sc
 | `net.ladenthin.llama.audio.model` | unset (test self-skips) | test | `AudioInputIntegrationTest` (llama.cpp discussion #13759) | Path to an audio-input model GGUF (e.g. Ultravox, Qwen2.5-Omni). |
 | `net.ladenthin.llama.audio.input` | unset (test self-skips) | test | `AudioInputIntegrationTest` | `.wav`/`.mp3` audio prompt clip; the extension drives format detection. |
+| `net.ladenthin.llama.tts.ttc.model` | unset (test self-skips) | test | `TtsIntegrationTest` | Path to the OuteTTS text-to-codes GGUF. CI default is `OuteTTS-0.2-500M-Q4_K_M.gguf`. |
+| `net.ladenthin.llama.tts.vocoder.model` | unset (test self-skips) | test | `TtsIntegrationTest` | Path to the matching codes-to-speech vocoder GGUF. CI default is `WavTokenizer-Large-75-F16.gguf`. |

-`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring. `AudioInputIntegrationTest` self-skips the same way over the three `audio.*` properties.
+`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring. `AudioInputIntegrationTest` self-skips the same way over the three `audio.*` properties. `TtsIntegrationTest` likewise self-skips unless both `tts.ttc.model` and `tts.vocoder.model` point at existing files.

 ## Documentation
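Review note: the self-skip rule described in this hunk (run only when both `tts.ttc.model` and `tts.vocoder.model` point at existing files) can be sketched as a plain precondition check. This is a minimal illustration, not the code in `TtsIntegrationTest`: the class `TtsModelGuard` and its method names are hypothetical, and only the two property names come from the table above. A real test would presumably feed the result into the test framework's assumption mechanism so the test is reported as skipped.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical guard (not the project's test code): decides whether the TTS test could run. */
public final class TtsModelGuard {
    // Property names as listed in the System Properties Reference rows above.
    public static final String TTC_PROPERTY = "net.ladenthin.llama.tts.ttc.model";
    public static final String VOCODER_PROPERTY = "net.ladenthin.llama.tts.vocoder.model";

    private TtsModelGuard() {
    }

    /** True only when the property is set, non-empty, and names an existing regular file. */
    public static boolean pointsAtExistingFile(String propertyName) {
        String value = System.getProperty(propertyName);
        return value != null && !value.isEmpty() && Files.isRegularFile(Paths.get(value));
    }

    /** Both models are required because TTS is a two-model pipeline. */
    public static boolean bothModelsPresent() {
        return pointsAtExistingFile(TTC_PROPERTY) && pointsAtExistingFile(VOCODER_PROPERTY);
    }

    public static void main(String[] args) {
        System.out.println(bothModelsPresent()
                ? "both TTS models found: the test would run"
                : "a TTS model property is unset or points at a missing file: the test would self-skip");
    }
}
```

Checking both paths together means that a partial setup, such as only the text-to-codes model being present, skips the test.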
@@ -494,6 +497,32 @@ try (LlamaModel model = new LlamaModel(modelParams)) {
 }
 ```

+### Text-to-Speech
+
+`TextToSpeech` synthesizes audio from text over llama.cpp's OuteTTS pipeline. It is a separate
+`AutoCloseable` native type (not a `LlamaModel`) because TTS is a **two-model** pipeline: a
+text-to-codes model (OuteTTS) and a codes-to-speech vocoder (WavTokenizer). `synthesize(String)`
+returns a 24 kHz mono 16-bit WAV byte stream.
+
+```java
+try (TextToSpeech tts = new TextToSpeech(
+        "models/OuteTTS-0.2-500M-Q4_K_M.gguf",
+        "models/WavTokenizer-Large-75-F16.gguf")) {
+    byte[] wav = tts.synthesize("Hello from llama dot c p p.");
+    Files.write(Paths.get("out.wav"), wav);
+}
+```
+
+Add `(ttcPath, vocoderPath, gpuLayers, threads)` to offload to the GPU, or
+`synthesize(text, maxCodeTokens, topK, seed)` for explicit sampling. As with `LlamaModel`, native
+memory is not GC-managed — use try-with-resources or call `close()`. **Known limitation:** numeric
+digits in the input are dropped (number-to-words romanization is not yet ported), so spell numbers
+out for now; synthesis uses the built-in default speaker profile.
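Review note: until number-to-words romanization is ported, callers can pre-process the input so no digits reach `synthesize`. The sketch below is one possible workaround, not something this commit provides: `DigitSpeller` is a hypothetical helper, it reads ASCII digits one at a time ("42" becomes "four two", not "forty-two"), and it makes no attempt at ordinals, decimals, or dates.

```java
/** Hypothetical pre-processing helper (not part of the library): spells out ASCII digits one by one. */
public final class DigitSpeller {
    private static final String[] WORDS = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"
    };

    private DigitSpeller() {
    }

    /** Replaces every ASCII digit with its English word, inserting spaces where words would touch. */
    public static String spellOutDigits(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        boolean lastWasDigit = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '0' && c <= '9') {
                // Separate the word from a preceding letter or from the previous digit word.
                if (out.length() > 0 && Character.isLetterOrDigit(out.charAt(out.length() - 1))) {
                    out.append(' ');
                }
                out.append(WORDS[c - '0']);
                lastWasDigit = true;
            } else {
                // A letter directly after a digit (as in "4k") needs a separating space too.
                if (lastWasDigit && Character.isLetter(c)) {
                    out.append(' ');
                }
                out.append(c);
                lastWasDigit = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(spellOutDigits("Gate 7 closes at 9."));
    }
}
```

The call site would then pass `DigitSpeller.spellOutDigits(text)` to `tts.synthesize(...)`. Digit-by-digit reading is a deliberate simplification; a full number-to-words conversion would need locale and grammar rules that are out of scope here.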