docs(tts): document TextToSpeech API, test properties, and CI models
- README: Features bullet + "Text-to-Speech" usage section (TextToSpeech, the
two-model OuteTTS + WavTokenizer pipeline, WAV output, known number-drop
limitation, compatible GGUF links); two new rows in the System Properties
Reference (tts.ttc.model / tts.vocoder.model).
- CLAUDE.md: TextToSpeech in the Java-layer architecture list; jllama.cpp
method/line count refreshed (30 native methods incl. 3 TTS, ~1,516 lines);
TtsIntegrationTest property table + run example; test_tts_dsp.cpp added to the
C++ test-file table and the drifted counts reconciled to the actual 457
(test_server 188->189, test_jni_helpers 41->47, +5 TTS DSP).
Javadoc release gate verified (BUILD SUCCESS) with the new public TextToSpeech.
 |`net.ladenthin.llama.vision.mmproj`|`MultimodalIntegrationTest`| matching mmproj for the vision model, e.g. `mmproj-SmolVLM-500M-Instruct-Q8_0.gguf`|
 |`net.ladenthin.llama.vision.image`|`MultimodalIntegrationTest`| committed default `src/test/resources/images/test-image.jpg`; override to any png/jpeg/webp/gif on disk |
+|`net.ladenthin.llama.tts.ttc.model`|`TtsIntegrationTest`| OuteTTS text-to-codes model, e.g. `OuteTTS-0.2-500M-Q4_K_M.gguf`|
+|`net.ladenthin.llama.tts.vocoder.model`|`TtsIntegrationTest`| matching codes-to-speech vocoder, e.g. `WavTokenizer-Large-75-F16.gguf`|

 Run those tests by setting the property:
```bash
@@ -596,6 +598,9 @@ mvn test -Dtest=MultimodalIntegrationTest \
# The vision.image property defaults to src/test/resources/images/test-image.jpg
# (a CC-BY-4.0 / MIT-granted photo of flowers and bees by the project author);
# override only if you want to test a different image.
-`LlamaModel` — Main API class (AutoCloseable). Wraps native context for inference, embeddings, re-ranking, and tokenization.
+
-`TextToSpeech` — Separate AutoCloseable native type for speech synthesis over the two-model OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech vocoder) pipeline; `synthesize(text)` returns a 24 kHz mono 16-bit WAV byte stream. Native engine in `tts_engine.{h,cpp}`, output DSP in `tts_dsp.hpp`.
-`ModelParameters` / `InferenceParameters` — Builder-pattern parameter classes that serialize to JSON (extend `JsonParameters`) for passing to native code.
-`LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
-`LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
@@ -741,7 +747,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server".
README.md (30 additions & 1 deletion)
@@ -97,6 +97,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - Text completion (blocking and streaming) with full control over sampling parameters.
 - OpenAI-compatible **chat completion** with automatic chat-template application, including streaming and tool/function calling support via the upstream server.
 - **Embeddings** and **reranking** for retrieval pipelines.
+- **Text-to-speech** (`TextToSpeech`) over the two-model OuteTTS + WavTokenizer pipeline, returning WAV audio.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
@@ -278,8 +279,10 @@ Every `net.ladenthin.llama.*` system property recognised by the library, deep-sc
 |`net.ladenthin.llama.vision.model`| unset (test self-skips) | test |`MultimodalIntegrationTest` (upstream kherud/java-llama.cpp#103 / #34) | Path to a vision-capable model GGUF. Any vision-capable GGUF works; CI default is `SmolVLM-500M-Instruct-Q8_0.gguf`. |
 |`net.ladenthin.llama.vision.mmproj`| unset (test self-skips) | test |`MultimodalIntegrationTest`| Matching mmproj GGUF for the vision model. |
 |`net.ladenthin.llama.vision.image`|`src/test/resources/images/test-image.jpg` (a CC-BY-4.0 / MIT-granted photo committed to the repo) | test |`MultimodalIntegrationTest`| Visual prompt image. Any png/jpeg/webp/gif works; the extension drives MIME detection. |
+|`net.ladenthin.llama.tts.ttc.model`| unset (test self-skips) | test |`TtsIntegrationTest`| Path to the OuteTTS text-to-codes GGUF. CI default is `OuteTTS-0.2-500M-Q4_K_M.gguf`. |
+|`net.ladenthin.llama.tts.vocoder.model`| unset (test self-skips) | test |`TtsIntegrationTest`| Path to the matching codes-to-speech vocoder GGUF. CI default is `WavTokenizer-Large-75-F16.gguf`. |

-`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring.
+`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring. `TtsIntegrationTest` likewise self-skips unless both `tts.ttc.model` and `tts.vocoder.model` point at existing files.

 ## Documentation

@@ -461,6 +464,32 @@ try (LlamaModel model = new LlamaModel(modelParams)) {
}
```

+### Text-to-Speech
+
+`TextToSpeech` synthesizes audio from text over llama.cpp's OuteTTS pipeline. It is a separate
+`AutoCloseable` native type (not a `LlamaModel`) because TTS is a **two-model** pipeline: a
+text-to-codes model (OuteTTS) and a codes-to-speech vocoder (WavTokenizer). `synthesize(String)`
+returns a 24 kHz mono 16-bit WAV byte stream.
+
+```java
+try (TextToSpeech tts = new TextToSpeech(
+        "models/OuteTTS-0.2-500M-Q4_K_M.gguf",
+        "models/WavTokenizer-Large-75-F16.gguf")) {
+    byte[] wav = tts.synthesize("Hello from llama dot c p p.");
+    Files.write(Paths.get("out.wav"), wav);
+}
+```
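Since `synthesize` is described as returning a 24 kHz mono 16-bit WAV byte stream, a caller may want to sanity-check the header before writing the bytes to disk. The following is a minimal sketch and not part of the library: `WavHeaderCheck`, `describe`, and `syntheticHeader` are hypothetical names, and the code assumes the bytes form a standard RIFF/WAVE container with a PCM `fmt ` chunk, which this commit does not spell out. It walks the chunk list instead of assuming a fixed 44-byte header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class WavHeaderCheck {

    /** Reads a 4-character chunk tag at the given offset. */
    private static String tag(byte[] data, int offset) {
        return new String(data, offset, 4, StandardCharsets.US_ASCII);
    }

    /** Returns "<rate> Hz, <channels> ch, <bits>-bit" read from the fmt chunk. */
    public static String describe(byte[] wav) {
        if (wav.length < 12 || !tag(wav, 0).equals("RIFF") || !tag(wav, 8).equals("WAVE")) {
            throw new IllegalArgumentException("not a RIFF/WAVE stream");
        }
        ByteBuffer buf = ByteBuffer.wrap(wav).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 12; // first chunk starts after "RIFF", size, "WAVE"
        while (pos + 8 <= wav.length) {
            String id = tag(wav, pos);
            int size = buf.getInt(pos + 4);
            if (size < 0) {
                throw new IllegalArgumentException("corrupt chunk size");
            }
            if (id.equals("fmt ")) {
                if (size < 16 || pos + 8 + 16 > wav.length) {
                    throw new IllegalArgumentException("truncated fmt chunk");
                }
                int channels = buf.getShort(pos + 10) & 0xFFFF;
                int sampleRate = buf.getInt(pos + 12);
                int bits = buf.getShort(pos + 22) & 0xFFFF;
                return sampleRate + " Hz, " + channels + " ch, " + bits + "-bit";
            }
            pos += 8 + size + (size & 1); // chunks are padded to even length
        }
        throw new IllegalArgumentException("no fmt chunk found");
    }

    /** Builds a 44-byte PCM header with an empty data chunk, for demonstration only. */
    public static byte[] syntheticHeader(int sampleRate, int channels, int bits) {
        ByteBuffer b = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        b.put("RIFF".getBytes(StandardCharsets.US_ASCII)).putInt(36)
         .put("WAVE".getBytes(StandardCharsets.US_ASCII))
         .put("fmt ".getBytes(StandardCharsets.US_ASCII)).putInt(16)
         .putShort((short) 1).putShort((short) channels)
         .putInt(sampleRate).putInt(sampleRate * channels * bits / 8)
         .putShort((short) (channels * bits / 8)).putShort((short) bits)
         .put("data".getBytes(StandardCharsets.US_ASCII)).putInt(0);
        return b.array();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes returned by tts.synthesize(...).
        byte[] wav = syntheticHeader(24000, 1, 16);
        System.out.println(describe(wav)); // prints "24000 Hz, 1 ch, 16-bit"
    }
}
```

In real use the `byte[]` from `tts.synthesize(...)` would replace the synthetic header, and a mismatch against the documented format would point at a model or build problem rather than a playback issue.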
+
+Pass `(ttcPath, vocoderPath, gpuLayers, threads)` to the constructor to offload to the GPU, or
+call `synthesize(text, maxCodeTokens, topK, seed)` for explicit sampling. As with `LlamaModel`,
+native memory is not GC-managed, so use try-with-resources or call `close()`. **Known limitation:**
+numeric digits in the input are dropped (number-to-words romanization is not yet ported), so spell
+numbers out for now. Synthesis uses the built-in default speaker profile.
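Because digits are dropped from the input, one stopgap is to rewrite them as words before calling `synthesize`. The sketch below is an illustration under stated assumptions, not the library's planned romanization: `SpellDigits` and `spellOut` are hypothetical names, and the helper reads numbers digit by digit (so `42` becomes "four two", not "forty two").

```java
public class SpellDigits {

    private static final String[] WORDS = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };

    /** Replaces every ASCII digit with its English word, read digit by digit. */
    public static String spellOut(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < '0' || c > '9') {
                out.append(c);
                continue;
            }
            // Separate the word from a directly preceding letter (e.g. "v2").
            if (out.length() > 0 && Character.isLetterOrDigit(out.charAt(out.length() - 1))) {
                out.append(' ');
            }
            out.append(WORDS[c - '0']);
            // Separate the word from a directly following letter or digit.
            if (i + 1 < text.length() && Character.isLetterOrDigit(text.charAt(i + 1))) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The result would be passed to tts.synthesize(...) instead of the raw text.
        System.out.println(spellOut("Room 42 opens at 9."));
        // prints "Room four two opens at nine."
    }
}
```

Digit-by-digit reading keeps the helper small and predictable; full cardinal or ordinal wording would need a proper number-to-words routine.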