Commit 2c0321b

docs(tts): document TextToSpeech API, test properties, and CI models
- README: Features bullet + "Text-to-Speech" usage section (TextToSpeech, the two-model OuteTTS + WavTokenizer pipeline, WAV output, known number-drop limitation, compatible GGUF links); two new rows in the System Properties Reference (tts.ttc.model / tts.vocoder.model).
- CLAUDE.md: TextToSpeech in the Java-layer architecture list; jllama.cpp method/line count refreshed (30 native methods incl. 3 TTS, ~1,516 lines); TtsIntegrationTest property table + run example; test_tts_dsp.cpp added to the C++ test-file table and the drifted counts reconciled to the actual 457 (test_server 188->189, test_jni_helpers 41->47, +5 TTS DSP).

Javadoc release gate verified (BUILD SUCCESS) with the new public TextToSpeech.
1 parent 92b83ed commit 2c0321b

2 files changed

Lines changed: 41 additions & 5 deletions

CLAUDE.md

Lines changed: 11 additions & 4 deletions
@@ -585,6 +585,8 @@ the README. The summary below covers only the optional-model bindings:
 | `net.ladenthin.llama.vision.model` | `MultimodalIntegrationTest` (upstream kherud/java-llama.cpp#103 / #34) | `SmolVLM-500M-Instruct-Q8_0.gguf` (any vision-capable GGUF works) |
 | `net.ladenthin.llama.vision.mmproj` | `MultimodalIntegrationTest` | matching mmproj for the vision model, e.g. `mmproj-SmolVLM-500M-Instruct-Q8_0.gguf` |
 | `net.ladenthin.llama.vision.image` | `MultimodalIntegrationTest` | committed default `src/test/resources/images/test-image.jpg`; override to any png/jpeg/webp/gif on disk |
+| `net.ladenthin.llama.tts.ttc.model` | `TtsIntegrationTest` | OuteTTS text-to-codes model, e.g. `OuteTTS-0.2-500M-Q4_K_M.gguf` |
+| `net.ladenthin.llama.tts.vocoder.model` | `TtsIntegrationTest` | matching codes-to-speech vocoder, e.g. `WavTokenizer-Large-75-F16.gguf` |

 Run those tests by setting the property:
```bash
@@ -596,6 +598,9 @@ mvn test -Dtest=MultimodalIntegrationTest \
 # The vision.image property defaults to src/test/resources/images/test-image.jpg
 # (a CC-BY-4.0 / MIT-granted photo of flowers and bees by the project author);
 # override only if you want to test a different image.
+mvn test -Dtest=TtsIntegrationTest \
+    -Dnet.ladenthin.llama.tts.ttc.model=models/OuteTTS-0.2-500M-Q4_K_M.gguf \
+    -Dnet.ladenthin.llama.tts.vocoder.model=models/WavTokenizer-Large-75-F16.gguf
```

 `MultimodalIntegrationTest` self-skips when any of the three vision properties
@@ -730,6 +735,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in

 **Java layer** (`src/main/java/net/ladenthin/llama/`):
 - `LlamaModel` — Main API class (AutoCloseable). Wraps native context for inference, embeddings, re-ranking, and tokenization.
+- `TextToSpeech` — Separate AutoCloseable native type for speech synthesis over the two-model OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech vocoder) pipeline; `synthesize(text)` returns a 24 kHz mono 16-bit WAV byte stream. Native engine in `tts_engine.{h,cpp}`, output DSP in `tts_dsp.hpp`.
 - `ModelParameters` / `InferenceParameters` — Builder-pattern parameter classes that serialize to JSON (extend `JsonParameters`) for passing to native code.
 - `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
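The `TextToSpeech` entry above states that `synthesize(text)` returns a 24 kHz mono 16-bit WAV byte stream. As a rough illustration of how a caller might sanity-check such bytes without loading any model, here is a minimal sketch. It is not part of the library: it assumes the stream starts with the canonical 44-byte RIFF/WAVE PCM header (no extra chunks before `fmt `), and `WavHeaderCheck`, `describe`, and `headerOnly` are hypothetical names introduced only for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the library: inspects a canonical 44-byte PCM WAV header.
final class WavHeaderCheck {

    /** Returns {sampleRate, channels, bitsPerSample} read from the header. */
    static int[] describe(byte[] wav) {
        if (wav.length < 44) {
            throw new IllegalArgumentException("too short for a canonical WAV header");
        }
        String riff = new String(wav, 0, 4, StandardCharsets.US_ASCII);
        String wave = new String(wav, 8, 4, StandardCharsets.US_ASCII);
        if (!"RIFF".equals(riff) || !"WAVE".equals(wave)) {
            throw new IllegalArgumentException("not a RIFF/WAVE stream");
        }
        // All multi-byte WAV header fields are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(wav).order(ByteOrder.LITTLE_ENDIAN);
        int channels = buf.getShort(22);
        int sampleRate = buf.getInt(24);
        int bitsPerSample = buf.getShort(34);
        return new int[] {sampleRate, channels, bitsPerSample};
    }

    /** Builds a header-only WAV (zero samples) so the check can run without any model. */
    static byte[] headerOnly(int sampleRate, int channels, int bitsPerSample) {
        int blockAlign = channels * bitsPerSample / 8;
        ByteBuffer buf = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        buf.put("RIFF".getBytes(StandardCharsets.US_ASCII)).putInt(36);
        buf.put("WAVE".getBytes(StandardCharsets.US_ASCII));
        buf.put("fmt ".getBytes(StandardCharsets.US_ASCII)).putInt(16);
        buf.putShort((short) 1);              // audio format 1 = PCM
        buf.putShort((short) channels);
        buf.putInt(sampleRate);
        buf.putInt(sampleRate * blockAlign);  // byte rate
        buf.putShort((short) blockAlign);
        buf.putShort((short) bitsPerSample);
        buf.put("data".getBytes(StandardCharsets.US_ASCII)).putInt(0);
        return buf.array();
    }

    public static void main(String[] args) {
        int[] info = describe(headerOnly(24_000, 1, 16));
        System.out.println(info[0] + " Hz, " + info[1] + " channel(s), " + info[2] + "-bit");
    }
}
```

In real use the argument to `describe` would be the array returned by `synthesize`. Whether the native engine emits exactly this header layout is an assumption worth verifying against `tts_dsp.hpp`.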
@@ -741,7 +747,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 - The `server` package is a dedicated top layer in the ArchUnit `layeredArchitecture` rule (the only layer allowed to access the root `Api`); `noInternalJdkImports` carries an explicit exception for the supported `com.sun.net.httpserver` (the exported `jdk.httpserver` module, which `module-info.java` `requires`). See README "OpenAI-compatible HTTP server".

 **Native layer** (`src/main/cpp/`):
-- `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,215 lines; 17 native methods.
+- `jllama.cpp` — JNI implementation bridging Java calls to llama.cpp. ~1,516 lines; 30 native methods (27 `LlamaModel` + 3 `TextToSpeech`).
 - `utils.hpp` — Helper utilities (format helpers, argv stripping, token-piece serialisation).
 - `json_helpers.hpp` — Pure JSON transformation helpers (no JNI, no llama state). Independently unit-testable.
 - `jni_helpers.hpp` — JNI bridge helpers (handle management + server orchestration). Includes `json_helpers.hpp`.
@@ -896,12 +902,13 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
 | File | Tests | Scope |
 |------|-------|-------|
 | `src/test/cpp/test_utils.cpp` | 156 | Upstream helpers: `server_tokens`, `server_grammar_trigger`, `gen_tool_call_id`, `json_value`, `json_get_nested_values`, UTF-8 helpers, `format_response_rerank`, `format_embeddings_response_oaicompat`, `oaicompat_completion_params_parse`, `oaicompat_chat_params_parse`, `are_lora_equal`, `strip_flag_from_argv`, `token_piece_value`, `json_is_array_and_contains_numbers`, `format_oai_sse`, `format_oai_resp_sse`, `format_anthropic_sse` |
-| `src/test/cpp/test_server.cpp` | 188 | Upstream result types: `result_timings`, `task_params::to_json()` (incl. `dry_sequence_breakers`, `preserved_tokens`, `timings_per_token`), `completion_token_output`, `server_task_result_cmpl_partial` (non-oaicompat + `to_json_oaicompat` + logprobs + `to_json_oaicompat_chat` + `to_json_anthropic` + dispatcher), `server_task_result_cmpl_final` (non-oaicompat + `to_json_oaicompat` + `to_json_oaicompat_chat` + `to_json_oaicompat_chat_stream` + `to_json_anthropic` + `to_json_anthropic_stream` + tool_calls + dispatcher), `server_task_result_embd`, `server_task_result_rerank`, `server_task_result_metrics`, `server_task_result_slot_save_load`, `server_task_result_slot_erase`, `server_task_result_apply_lora`, `server_task_result_error`, `format_error_response`, `server_task::need_sampling()`, `server_task::n_tokens()`, `server_schema::eval_llama_cmpl_schema()` (parsing pipeline + grammar routing + error paths), `response_fields` projection |
+| `src/test/cpp/test_server.cpp` | 189 | Upstream result types: `result_timings`, `task_params::to_json()` (incl. `dry_sequence_breakers`, `preserved_tokens`, `timings_per_token`), `completion_token_output`, `server_task_result_cmpl_partial` (non-oaicompat + `to_json_oaicompat` + logprobs + `to_json_oaicompat_chat` + `to_json_anthropic` + dispatcher), `server_task_result_cmpl_final` (non-oaicompat + `to_json_oaicompat` + `to_json_oaicompat_chat` + `to_json_oaicompat_chat_stream` + `to_json_anthropic` + `to_json_anthropic_stream` + tool_calls + dispatcher), `server_task_result_embd`, `server_task_result_rerank`, `server_task_result_metrics`, `server_task_result_slot_save_load`, `server_task_result_slot_erase`, `server_task_result_apply_lora`, `server_task_result_error`, `format_error_response`, `server_task::need_sampling()`, `server_task::n_tokens()`, `server_schema::eval_llama_cmpl_schema()` (parsing pipeline + grammar routing + error paths), `response_fields` projection |
 | `src/test/cpp/test_json_helpers.cpp` | 47 | All functions in `json_helpers.hpp`: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`, `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`, `parse_slot_prompt_similarity`, `parse_positive_int_config`, `wrap_stream_chunk` |
 | `src/test/cpp/test_log_helpers.cpp` | 13 | All functions in `log_helpers.hpp`: `log_level_name`, `format_log_as_json` |
-| `src/test/cpp/test_jni_helpers.cpp` | 41 | All functions in `jni_helpers.hpp` using a zero-filled `JNINativeInterface_` mock |
+| `src/test/cpp/test_jni_helpers.cpp` | 47 | All functions in `jni_helpers.hpp` using a zero-filled `JNINativeInterface_` mock |
+| `src/test/cpp/test_tts_dsp.cpp` | 5 | All functions in `tts_dsp.hpp` (OuteTTS output DSP): `pcm_to_wav16_bytes` (WAV header/payload + little-endian clamping), `fill_hann_window`, `fold`, `embd_to_audio` |

-**Current total: 445 tests (all passing).**
+**Current total: 457 tests (all passing).**
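The new `test_tts_dsp.cpp` row mentions "little-endian clamping" for `pcm_to_wav16_bytes`. For readers unfamiliar with that step, the sketch below shows the general idea in Java. It is an illustrative analogue only: the project's helper is C++ in `tts_dsp.hpp` and may scale or round differently. The class name `Pcm16Sketch`, the method name, and the scale factor 32767 are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustrative Java analogue only; the project's real helper is the C++ pcm_to_wav16_bytes.
final class Pcm16Sketch {

    /** Converts float samples to 16-bit little-endian PCM, clamping to [-1, 1] first. */
    static byte[] toPcm16LittleEndian(float[] samples) {
        ByteBuffer out = ByteBuffer.allocate(samples.length * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (float s : samples) {
            // Clamp so out-of-range samples saturate instead of wrapping around.
            float clamped = Math.max(-1.0f, Math.min(1.0f, s));
            // Scale factor 32767 is an assumption; some implementations use 32768.
            out.putShort((short) Math.round(clamped * 32767.0f));
        }
        return out.array();
    }

    public static void main(String[] args) {
        byte[] pcm = toPcm16LittleEndian(new float[] {0.0f, 1.0f, -1.0f, 2.0f});
        System.out.println(Arrays.toString(pcm));
    }
}
```

The out-of-range input `2.0f` saturates to the same bytes as `1.0f`, which is the behaviour a clamping test would typically pin down.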

 #### Upstream source location (in CMake build tree)

README.md

Lines changed: 30 additions & 1 deletion
@@ -97,6 +97,7 @@ Inference of Meta's LLaMA model (and others) in pure C/C++.
 - Text completion (blocking and streaming) with full control over sampling parameters.
 - OpenAI-compatible **chat completion** with automatic chat-template application, including streaming and tool/function calling support via the upstream server.
 - **Embeddings** and **reranking** for retrieval pipelines.
+- **Text-to-speech** (`TextToSpeech`) over the two-model OuteTTS + WavTokenizer pipeline, returning WAV audio.
 - **Infilling** (fill-in-the-middle) for code models.
 - **Tokenize / detokenize** and **JSON-schema → grammar** conversion.
 - **Raw JSON endpoint handlers** mirroring the upstream llama.cpp HTTP server (`/completions`, `/v1/completions`, `/embeddings`, `/infill`, `/tokenize`, `/detokenize`).
@@ -278,8 +279,10 @@ Every `net.ladenthin.llama.*` system property recognised by the library, deep-sc
 | `net.ladenthin.llama.vision.model` | unset (test self-skips) | test | `MultimodalIntegrationTest` (upstream kherud/java-llama.cpp#103 / #34) | Path to a vision-capable model GGUF. Any vision-capable GGUF works; CI default is `SmolVLM-500M-Instruct-Q8_0.gguf`. |
 | `net.ladenthin.llama.vision.mmproj` | unset (test self-skips) | test | `MultimodalIntegrationTest` | Matching mmproj GGUF for the vision model. |
 | `net.ladenthin.llama.vision.image` | `src/test/resources/images/test-image.jpg` (a CC-BY-4.0 / MIT-granted photo committed to the repo) | test | `MultimodalIntegrationTest` | Visual prompt image. Any png/jpeg/webp/gif works; the extension drives MIME detection. |
+| `net.ladenthin.llama.tts.ttc.model` | unset (test self-skips) | test | `TtsIntegrationTest` | Path to the OuteTTS text-to-codes GGUF. CI default is `OuteTTS-0.2-500M-Q4_K_M.gguf`. |
+| `net.ladenthin.llama.tts.vocoder.model` | unset (test self-skips) | test | `TtsIntegrationTest` | Path to the matching codes-to-speech vocoder GGUF. CI default is `WavTokenizer-Large-75-F16.gguf`. |

-`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring.
+`MultimodalIntegrationTest` self-skips when any of the three `vision.*` properties points at a missing path, so a partial setup (just the vision model + the committed image, no mmproj) lets the test class load without erroring. `TtsIntegrationTest` likewise self-skips unless both `tts.ttc.model` and `tts.vocoder.model` point at existing files.
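The self-skip rule for `TtsIntegrationTest` described above amounts to "both properties set and both paths present". A framework-free sketch of such a guard is shown here for illustration. The actual test may implement it differently (for example with JUnit assumptions), and `TtsModelGuard`, `pointsAtExistingPath`, and `bothModelsPresent` are hypothetical names, not identifiers from the repository.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical guard, not taken from the repository: decides whether the TTS test could run.
final class TtsModelGuard {
    static final String TTC = "net.ladenthin.llama.tts.ttc.model";
    static final String VOCODER = "net.ladenthin.llama.tts.vocoder.model";

    /** True when the system property is set, non-empty, and names a path that exists. */
    static boolean pointsAtExistingPath(String property) {
        String path = System.getProperty(property);
        return path != null && !path.isEmpty() && Files.exists(Paths.get(path));
    }

    /** Both models are required because synthesis is a two-model pipeline. */
    static boolean bothModelsPresent() {
        return pointsAtExistingPath(TTC) && pointsAtExistingPath(VOCODER);
    }

    public static void main(String[] args) {
        System.out.println(bothModelsPresent()
                ? "TTS models found, test would run"
                : "TTS models missing, test would self-skip");
    }
}
```

A test class could call `bothModelsPresent()` in its setup and skip when it returns `false`, which matches the "unset (test self-skips)" default in the table.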

 ## Documentation

@@ -461,6 +464,32 @@ try (LlamaModel model = new LlamaModel(modelParams)) {
 }
 ```

+### Text-to-Speech
+
+`TextToSpeech` synthesizes audio from text over llama.cpp's OuteTTS pipeline. It is a separate
+`AutoCloseable` native type (not a `LlamaModel`) because TTS is a **two-model** pipeline: a
+text-to-codes model (OuteTTS) and a codes-to-speech vocoder (WavTokenizer). `synthesize(String)`
+returns a 24 kHz mono 16-bit WAV byte stream.
+
+```java
+try (TextToSpeech tts = new TextToSpeech(
+        "models/OuteTTS-0.2-500M-Q4_K_M.gguf",
+        "models/WavTokenizer-Large-75-F16.gguf")) {
+    byte[] wav = tts.synthesize("Hello from llama dot c p p.");
+    Files.write(Paths.get("out.wav"), wav);
+}
+```
+
+Add `(ttcPath, vocoderPath, gpuLayers, threads)` to offload to the GPU, or
+`synthesize(text, maxCodeTokens, topK, seed)` for explicit sampling. As with `LlamaModel`, native
+memory is not GC-managed — use try-with-resources or call `close()`. **Known limitation:** numeric
+digits in the input are dropped (number-to-words romanization is not yet ported), so spell numbers
+out for now; synthesis uses the built-in default speaker profile.
+
+Compatible GGUFs (the CI test defaults): OuteTTS
+[`OuteTTS-0.2-500M-GGUF`](https://huggingface.co/second-state/OuteTTS-0.2-500M-GGUF) +
+[`WavTokenizer`](https://huggingface.co/ggml-org/WavTokenizer).
+
 ### Raw JSON Endpoints

 For direct access to the upstream llama.cpp server API, the following methods take a JSON request and return
