Commit f009837

docs(tts): document TextToSpeech API, test properties, and CI models
- README: Features bullet + "Text-to-Speech" usage section (TextToSpeech, the two-model OuteTTS + WavTokenizer pipeline, WAV output, known number-drop limitation, compatible GGUF links); two new rows in the System Properties Reference (tts.ttc.model / tts.vocoder.model).
- CLAUDE.md: TextToSpeech in the Java-layer architecture list; jllama.cpp method/line count refreshed (30 native methods incl. 3 TTS, ~1,516 lines); TtsIntegrationTest property table + run example; test_tts_dsp.cpp added to the C++ test-file table and the drifted counts reconciled to the actual 457 (test_server 188->189, test_jni_helpers 41->47, +5 TTS DSP).

Javadoc release gate verified (BUILD SUCCESS) with the new public TextToSpeech.
1 parent f499122 commit f009837

2 files changed

Lines changed: 41 additions & 5 deletions

CLAUDE.md

Lines changed: 11 additions & 4 deletions
@@ -588,6 +588,8 @@ the README. The summary below covers only the optional-model bindings:
 | `net.ladenthin.llama.audio.model` | `AudioInputIntegrationTest` (llama.cpp discussion #13759) | audio-input model GGUF, e.g. `ultravox-v0_5-llama-3_2-1b.gguf` |
 | `net.ladenthin.llama.audio.mmproj` | `AudioInputIntegrationTest` | matching audio mmproj/encoder, e.g. `mmproj-ultravox-v0_5-llama-3_2-1b-f16.gguf` |
 | `net.ladenthin.llama.audio.input` | `AudioInputIntegrationTest` | a `.wav`/`.mp3` clip on disk (no committed default — audio is not committed) |
+| `net.ladenthin.llama.tts.ttc.model` | `TtsIntegrationTest` | OuteTTS text-to-codes model, e.g. `OuteTTS-0.2-500M-Q4_K_M.gguf` |
+| `net.ladenthin.llama.tts.vocoder.model` | `TtsIntegrationTest` | matching codes-to-speech vocoder, e.g. `WavTokenizer-Large-75-F16.gguf` |

 Run those tests by setting the property:
 ```bash
@@ -605,6 +607,9 @@ mvn test -Dtest=AudioInputIntegrationTest \
 -Dnet.ladenthin.llama.audio.model=models/ultravox-v0_5-llama-3_2-1b.gguf \
 -Dnet.ladenthin.llama.audio.mmproj=models/mmproj-ultravox-v0_5-llama-3_2-1b-f16.gguf \
 -Dnet.ladenthin.llama.audio.input=/path/to/speech.wav
+mvn test -Dtest=TtsIntegrationTest \
+-Dnet.ladenthin.llama.tts.ttc.model=models/OuteTTS-0.2-500M-Q4_K_M.gguf \
+-Dnet.ladenthin.llama.tts.vocoder.model=models/WavTokenizer-Large-75-F16.gguf
 ```

 `MultimodalIntegrationTest` self-skips when any of the three vision properties
@@ -739,6 +744,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in

 **Java layer** (`src/main/java/net/ladenthin/llama/`):
 - `LlamaModel` — Main API class (AutoCloseable). Wraps native context for inference, embeddings, re-ranking, and tokenization.
+- `TextToSpeech` — Separate AutoCloseable native type for speech synthesis over the two-model OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech vocoder) pipeline; `synthesize(text)` returns a 24 kHz mono 16-bit WAV byte stream. Native engine in `tts_engine.{h,cpp}`, output DSP in `tts_dsp.hpp`.
 - `ModelParameters` / `InferenceParameters` — Builder-pattern parameter classes that serialize to JSON (extend `JsonParameters`) for passing to native code.
 - `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
@@ -750,7 +756,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server".

 **Native layer** (`src/main/cpp/`):
-- `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,215 lines; 17 native methods.
+- `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,516 lines; 30 native methods (27 `LlamaModel` + 3 `TextToSpeech`).
 - `utils.hpp` — Helper utilities (format helpers, argv stripping, token-piece serialisation).
 - `json_helpers.hpp` — Pure JSON transformation helpers (no JNI, no llama state). Independently unit-testable.
 - `jni_helpers.hpp` — JNI bridge helpers (handle management + server orchestration). Includes `json_helpers.hpp`.
@@ -905,12 +911,13 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
 | File | Tests | Scope |
 |------|-------|-------|
 | `src/test/cpp/test_utils.cpp` | 156 | Upstream helpers: `server_tokens`, `server_grammar_trigger`, `gen_tool_call_id`, `json_value`, `json_get_nested_values`, UTF-8 helpers, `format_response_rerank`, `format_embeddings_response_oaicompat`, `oaicompat_completion_params_parse`, `oaicompat_chat_params_parse`, `are_lora_equal`, `strip_flag_from_argv`, `token_piece_value`, `json_is_array_and_contains_numbers`, `format_oai_sse`, `format_oai_resp_sse`, `format_anthropic_sse` |
-| `src/test/cpp/test_server.cpp` | 188 | Upstream result types: `result_timings`, `task_params::to_json()` (incl. `dry_sequence_breakers`, `preserved_tokens`, `timings_per_token`), `completion_token_output`, `server_task_result_cmpl_partial` (non-oaicompat + `to_json_oaicompat` + logprobs + `to_json_oaicompat_chat` + `to_json_anthropic` + dispatcher), `server_task_result_cmpl_final` (non-oaicompat + `to_json_oaicompat` + `to_json_oaicompat_chat` + `to_json_oaicompat_chat_stream` + `to_json_anthropic` + `to_json_anthropic_stream` + tool_calls + dispatcher), `server_task_result_embd`, `server_task_result_rerank`, `server_task_result_metrics`, `server_task_result_slot_save_load`, `server_task_result_slot_erase`, `server_task_result_apply_lora`, `server_task_result_error`, `format_error_response`, `server_task::need_sampling()`, `server_task::n_tokens()`, `server_schema::eval_llama_cmpl_schema()` (parsing pipeline + grammar routing + error paths), `response_fields` projection |
+| `src/test/cpp/test_server.cpp` | 189 | Upstream result types: `result_timings`, `task_params::to_json()` (incl. `dry_sequence_breakers`, `preserved_tokens`, `timings_per_token`), `completion_token_output`, `server_task_result_cmpl_partial` (non-oaicompat + `to_json_oaicompat` + logprobs + `to_json_oaicompat_chat` + `to_json_anthropic` + dispatcher), `server_task_result_cmpl_final` (non-oaicompat + `to_json_oaicompat` + `to_json_oaicompat_chat` + `to_json_oaicompat_chat_stream` + `to_json_anthropic` + `to_json_anthropic_stream` + tool_calls + dispatcher), `server_task_result_embd`, `server_task_result_rerank`, `server_task_result_metrics`, `server_task_result_slot_save_load`, `server_task_result_slot_erase`, `server_task_result_apply_lora`, `server_task_result_error`, `format_error_response`, `server_task::need_sampling()`, `server_task::n_tokens()`, `server_schema::eval_llama_cmpl_schema()` (parsing pipeline + grammar routing + error paths), `response_fields` projection |
 | `src/test/cpp/test_json_helpers.cpp` | 47 | All functions in `json_helpers.hpp`: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`, `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`, `parse_slot_prompt_similarity`, `parse_positive_int_config`, `wrap_stream_chunk` |
 | `src/test/cpp/test_log_helpers.cpp` | 13 | All functions in `log_helpers.hpp`: `log_level_name`, `format_log_as_json` |
-| `src/test/cpp/test_jni_helpers.cpp` | 41 | All functions in `jni_helpers.hpp` using a zero-filled `JNINativeInterface_` mock |
+| `src/test/cpp/test_jni_helpers.cpp` | 47 | All functions in `jni_helpers.hpp` using a zero-filled `JNINativeInterface_` mock |
+| `src/test/cpp/test_tts_dsp.cpp` | 5 | All functions in `tts_dsp.hpp` (OuteTTS output DSP): `pcm_to_wav16_bytes` (WAV header/payload + little-endian clamping), `fill_hann_window`, `fold`, `embd_to_audio` |

-**Current total: 445 tests (all passing).**
+**Current total: 457 tests (all passing).**

 #### Upstream source location (in CMake build tree)

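The `test_tts_dsp.cpp` row in the table above describes `pcm_to_wav16_bytes` as covering the WAV header/payload and little-endian clamping. As a rough illustration of what such a conversion involves, here is a minimal Java sketch that packs mono float samples into a 16-bit PCM WAV. It is not the project's C++ code: the class name `PcmToWav16Sketch`, the method `toWav16`, the canonical 44-byte header layout, and the `32767` scale factor are all assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Illustrative only: a Java analogue of float-PCM to 16-bit WAV packing, not the project's C++ code. */
public final class PcmToWav16Sketch {

    private PcmToWav16Sketch() {}

    /** Packs mono float samples (nominally in [-1, 1]) into a PCM WAV with an assumed 44-byte header. */
    public static byte[] toWav16(float[] samples, int sampleRate) {
        final int channels = 1;
        final int bytesPerSample = 2;
        final int dataSize = samples.length * bytesPerSample;

        // WAV stores all multi-byte integers little-endian.
        ByteBuffer buf = ByteBuffer.allocate(44 + dataSize).order(ByteOrder.LITTLE_ENDIAN);
        buf.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(36 + dataSize);                           // RIFF chunk size = file size - 8
        buf.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        buf.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(16);                                      // fmt chunk size for plain PCM
        buf.putShort((short) 1);                             // audio format 1 = integer PCM
        buf.putShort((short) channels);
        buf.putInt(sampleRate);
        buf.putInt(sampleRate * channels * bytesPerSample);  // byte rate
        buf.putShort((short) (channels * bytesPerSample));   // block align
        buf.putShort((short) (8 * bytesPerSample));          // bits per sample
        buf.put("data".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(dataSize);

        for (float s : samples) {
            // Clamp before scaling so out-of-range samples saturate instead of wrapping.
            float clamped = Math.max(-1.0f, Math.min(1.0f, s));
            buf.putShort((short) Math.round(clamped * 32767.0f));
        }
        return buf.array();
    }
}
```

Clamping before the cast is the design point here: without it, a sample slightly above 1.0 would overflow the `short` range and wrap to a large negative value, which is audible as a click.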
README.md

Lines changed: 30 additions & 1 deletion
@@ -103,6 +103,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - Text completion (blocking and streaming) with full control over sampling parameters.
 - OpenAI-compatible **chat completion** with automatic chat-template application, including streaming and tool/function calling support via the upstream server.
 - **Embeddings** and **reranking** for retrieval pipelines.
+- **Text-to-speech** (`TextToSpeech`) over the two-model OuteTTS + WavTokenizer pipeline, returning WAV audio.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
@@ -287,8 +288,10 @@ Every `net.ladenthin.llama.*` system property recognised by the library, deep-sc
 | `net.ladenthin.llama.audio.model` | unset (test self-skips) | test | `AudioInputIntegrationTest` (llama.cpp discussion #13759) | Path to an audio-input model GGUF (e.g. Ultravox, Qwen2.5-Omni). |
 | `net.ladenthin.llama.audio.mmproj` | unset (test self-skips) | test | `AudioInputIntegrationTest` | Matching audio mmproj (encoder) GGUF. |
 | `net.ladenthin.llama.audio.input` | unset (test self-skips) | test | `AudioInputIntegrationTest` | `.wav`/`.mp3` audio prompt clip; the extension drives format detection. |
+| `net.ladenthin.llama.tts.ttc.model` | unset (test self-skips) | test | `TtsIntegrationTest` | Path to the OuteTTS text-to-codes GGUF. CI default is `OuteTTS-0.2-500M-Q4_K_M.gguf`. |
+| `net.ladenthin.llama.tts.vocoder.model` | unset (test self-skips) | test | `TtsIntegrationTest` | Path to the matching codes-to-speech vocoder GGUF. CI default is `WavTokenizer-Large-75-F16.gguf`. |

-`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring. `AudioInputIntegrationTest` self-skips the same way over the three `audio.*` properties.
+`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring. `AudioInputIntegrationTest` self-skips the same way over the three `audio.*` properties. `TtsIntegrationTest` likewise self-skips unless both `tts.ttc.model` and `tts.vocoder.model` point at existing files.

 ## Documentation

@@ -494,6 +497,32 @@ try (LlamaModel model = new LlamaModel(modelParams)) {
 }
 ```

+### Text-to-Speech
+
+`TextToSpeech` synthesizes audio from text over llama.cpp's OuteTTS pipeline. It is a separate
+`AutoCloseable` native type (not a `LlamaModel`) because TTS is a **two-model** pipeline: a
+text-to-codes model (OuteTTS) and a codes-to-speech vocoder (WavTokenizer). `synthesize(String)`
+returns a 24 kHz mono 16-bit WAV byte stream.
+
+```java
+try (TextToSpeech tts = new TextToSpeech(
+"models/OuteTTS-0.2-500M-Q4_K_M.gguf",
+"models/WavTokenizer-Large-75-F16.gguf")) {
+byte[] wav = tts.synthesize("Hello from llama dot c p p.");
+Files.write(Paths.get("out.wav"), wav);
+}
+```
+
+Add `(ttcPath, vocoderPath, gpuLayers, threads)` to offload to the GPU, or
+`synthesize(text, maxCodeTokens, topK, seed)` for explicit sampling. As with `LlamaModel`, native
+memory is not GC-managed — use try-with-resources or call `close()`. **Known limitation:** numeric
+digits in the input are dropped (number-to-words romanization is not yet ported), so spell numbers
+out for now; synthesis uses the built-in default speaker profile.
+
+Compatible GGUFs (the CI test defaults): OuteTTS
+[`OuteTTS-0.2-500M-GGUF`](https://huggingface.co/second-state/OuteTTS-0.2-500M-GGUF) +
+[`WavTokenizer`](https://huggingface.co/ggml-org/WavTokenizer).
+
 ### Raw JSON Endpoints

 For direct access to the upstream llama.cpp server API, the following methods take a JSON request and return

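To sanity-check the bytes returned by `synthesize(String)` against the documented 24 kHz mono 16-bit format, one option is to read the header fields directly. The sketch below assumes a canonical 44-byte RIFF/WAVE header with the `fmt ` chunk first, which the README text does not confirm; `WavHeaderCheck`, `Format`, and `read` are hypothetical names introduced for this example and are not part of the library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Illustrative only: reads format fields from an assumed canonical 44-byte WAV header. */
public final class WavHeaderCheck {

    private WavHeaderCheck() {}

    /** Hypothetical value type holding the three header fields of interest. */
    public static final class Format {
        public final int channels;
        public final int sampleRate;
        public final int bitsPerSample;

        Format(int channels, int sampleRate, int bitsPerSample) {
            this.channels = channels;
            this.sampleRate = sampleRate;
            this.bitsPerSample = bitsPerSample;
        }
    }

    /** Parses the header; throws if the bytes do not match the assumed layout. */
    public static Format read(byte[] wav) {
        if (wav.length < 44) {
            throw new IllegalArgumentException("too short for a canonical WAV header");
        }
        if (!"RIFF".equals(tag(wav, 0)) || !"WAVE".equals(tag(wav, 8)) || !"fmt ".equals(tag(wav, 12))) {
            throw new IllegalArgumentException("not a RIFF/WAVE stream with a leading fmt chunk");
        }
        // Header integers are little-endian; offsets follow the canonical PCM layout.
        ByteBuffer b = ByteBuffer.wrap(wav).order(ByteOrder.LITTLE_ENDIAN);
        return new Format(b.getShort(22), b.getInt(24), b.getShort(34));
    }

    /** Decodes a four-character chunk tag at the given offset. */
    private static String tag(byte[] data, int offset) {
        return new String(data, offset, 4, StandardCharsets.US_ASCII);
    }
}
```

With a `TextToSpeech` instance opened as in the README example, `WavHeaderCheck.read(tts.synthesize("Hello"))` would be expected to report 1 channel, 24000 Hz, and 16 bits, provided the header-layout assumption holds for the library's output.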