
Commit 7e04296

refactor(tts): derive OuteTTS helpers from upstream at build time (no hand-copy)
Addresses review feedback on PR #268: the TTS native pipeline reused llama.cpp's tools/tts/tts.cpp by hand-copying its DSP/prompt/text helpers and default-speaker strings into tts_dsp.hpp + tts_engine.cpp — a DRY/maintenance hazard that would silently diverge on every llama.cpp upgrade, and a missing-attribution concern.

tts.cpp cannot simply be added to target_sources: it defines its own main() (link clash, same reason server.cpp is excluded) and every helper is `static` (internal linkage — unreachable from another TU). So instead of copying, the helpers are now DERIVED MECHANICALLY from the pinned upstream source at configure time:

- cmake/generate-tts-upstream.cmake reads the pinned tools/tts/tts.cpp, keeps the pre-main() span, strips `static` from the helpers the engine calls (external linkage), and extracts the two default-speaker literals out of main() into `extern const` strings. Emits build/tts_generated/tts_upstream_gen.cpp (never committed; regenerated from whatever tts.cpp the GIT_TAG resolves to, so a version bump is picked up automatically).
- CMakeLists runs it after FetchContent_MakeAvailable(llama.cpp) and compiles the generated TU into jllama.
- tts_upstream.h: committed, hand-written declarations of the extracted symbols (interface only).

tts_engine.cpp keeps only our orchestration + the in-memory WAV writer (tts_wav.hpp, ours). tts_dsp.hpp and all copied helpers are removed.

Fail-loud on drift (same contract as patches/): the generator asserts the `int main(` anchor, every de-static signature, and both speaker literals; a rename aborts the configure, a type change fails the link. Silent divergence is impossible.

Bonus: using upstream's real process_text (which calls replace_numbers_with_words) fixes the previous digit-drop limitation — English numbers are now spoken.

Verified: jllama builds + links, 454 C++ tests pass, and TtsIntegrationTest synthesizes a valid 24 kHz WAV end-to-end against the real OuteTTS + WavTokenizer models.
test_tts_dsp.cpp -> test_tts_wav.cpp (now covers only our WAV writer; the DSP is upstream's, covered end-to-end by TtsIntegrationTest).
1 parent f009837 commit 7e04296
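
The first two generator steps described in the commit message (keep the pre-`main()` span, then strip `static` from a fixed list of signatures, failing if any anchor is missing) can be illustrated in C++. This is only a sketch of the string transformation, not the generator itself, which is a CMake script; the function name `derive_premain` is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of the commit): mirrors the generator's logic in C++.
// Keeps the text before "\nint main(" and removes the leading `static ` from each
// listed signature, throwing when an expected anchor is absent (fail-loud on drift).
std::string derive_premain(const std::string & src, const std::vector<std::string> & destatic) {
    const std::size_t main_pos = src.find("\nint main(");
    if (main_pos == std::string::npos) {
        throw std::runtime_error("'int main(' anchor not found");
    }
    std::string premain = src.substr(0, main_pos);

    for (const std::string & sig : destatic) {
        const std::string needle = "static " + sig;
        std::size_t pos = premain.find(needle);
        if (pos == std::string::npos) {
            throw std::runtime_error("expected '" + needle + "' but it is absent");
        }
        // Replace every occurrence, matching the behaviour of CMake's string(REPLACE).
        while (pos != std::string::npos) {
            premain.replace(pos, needle.size(), sig);
            pos = premain.find(needle, pos + sig.size());
        }
    }
    return premain;
}
```

A missing anchor surfaces as an exception here; in the CMake script the equivalent is `message(FATAL_ERROR ...)`, which aborts the configure step.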

10 files changed

Lines changed: 315 additions & 388 deletions


CLAUDE.md

Lines changed: 35 additions & 3 deletions
@@ -384,6 +384,38 @@ Current patches:
 |-------|-------|
 | `0001-win32-arg-parse-embed-guard.patch` | Windows JNI regression from llama.cpp **#24779** (b9739): `common_params_parse` unconditionally replaced the caller's argv with the process command line (`GetCommandLineW`), so an embedded/JNI caller (`java.exe`) lost its `--model …` args → "Failed to parse model parameters". The patch **drops the override for our build** (keeps the `make_utf8_argv()` call referenced so there's no `-Wunused-function`, but never adopts its result), so the caller's already-UTF-8 argv is always used. This is **deterministic** — an earlier count-guard variant (only override when the re-derived arg count equals `argc`) collided on the server-integration tests whose argv length happened to equal `java.exe`'s and kept them failing. The upstream PR can instead expose an opt-out / `common_params_parse_argv` that preserves the standalone tools' UTF-8 fix. |
 
+## OuteTTS build-time extraction (`cmake/generate-tts-upstream.cmake`)
+
+The `TextToSpeech` native pipeline reuses llama.cpp's OuteTTS helpers (`tools/tts/tts.cpp`)
+**without hand-copying them**. A verbatim copy would be a DRY/maintenance hazard that silently
+diverges on every upgrade, and `tts.cpp` cannot simply be added to `target_sources` — it defines its
+own `main()`, which would clash at link time (the same reason `tools/server/server.cpp` is excluded
+while `server-*.cpp` are compiled in), and all its helpers are `static` (internal linkage), so they
+are unreachable from another TU even if it were linked.
+
+Instead the helpers are **DERIVED mechanically at configure time** from the pinned upstream source:
+
+- **`cmake/generate-tts-upstream.cmake`** — reads `${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp`, keeps
+the pre-`main()` span (the DSP `fill_hann_window`/`irfft`/`fold`/`embd_to_audio`, the prompt/text
+helpers incl. `process_text`'s number-to-words, the `outetts_version` enum), strips `static` from
+the handful the JNI engine calls (giving them external linkage), and extracts the two hard-coded
+default-speaker literals out of `main()` into `extern const` strings. Writes
+`build/tts_generated/tts_upstream_gen.cpp`.
+- **`CMakeLists.txt`** — runs the generator via `execute_process` right after
+`FetchContent_MakeAvailable(llama.cpp)`, then compiles the generated TU into `jllama`. The file is
+**never committed** (build artifact, like the native libs / WebUI assets); it is regenerated from
+whatever `tts.cpp` the pinned `GIT_TAG` resolves to, so a version bump is picked up automatically.
+- **`src/main/cpp/tts_upstream.h`** — committed, hand-written declarations of the extracted symbols
+(interface facts, not the implementation). `tts_engine.cpp` includes it and links against the
+generated definitions. The in-memory WAV writer (`tts_wav.hpp`) is ours, not extracted.
+
+**Fail-loud on drift (same contract as `patches/`):** the generator asserts every anchor — the
+`int main(` split point, each `static <signature>` it de-statics, and both speaker literals. If an
+upgrade renames a helper or moves a literal, the **configure step aborts** with a pointer to the
+generator; if upstream changes a *type*, `tts_upstream.h` stops matching and the **link fails**.
+Either way a silent divergence is impossible. On a llama.cpp bump, re-verify the generator the same
+way you re-verify `patches/`.
+
 ## Upgrading/Downgrading llama.cpp Version
 
 To change the llama.cpp version, update the following **three** files (and re-verify `patches/`):
@@ -744,7 +776,7 @@ If the local check passes (`BUILD SUCCESS`), the `mvn package` job in
 
 **Java layer** (`src/main/java/net/ladenthin/llama/`):
 - `LlamaModel` — Main API class (AutoCloseable). Wraps native context for inference, embeddings, re-ranking, and tokenization.
-- `TextToSpeech` — Separate AutoCloseable native type for speech synthesis over the two-model OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech vocoder) pipeline; `synthesize(text)` returns a 24 kHz mono 16-bit WAV byte stream. Native engine in `tts_engine.{h,cpp}`, output DSP in `tts_dsp.hpp`.
+- `TextToSpeech` — Separate AutoCloseable native type for speech synthesis over the two-model OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech vocoder) pipeline; `synthesize(text)` returns a 24 kHz mono 16-bit WAV byte stream. Native orchestration in `tts_engine.{h,cpp}`; the OuteTTS DSP / prompt / text helpers + default speaker are **derived at build time from upstream `tts.cpp`** (see "OuteTTS build-time extraction" above), not hand-copied; the in-memory WAV writer is `tts_wav.hpp`.
 - `ModelParameters` / `InferenceParameters` — Builder-pattern parameter classes that serialize to JSON (extend `JsonParameters`) for passing to native code.
 - `LlamaIterator` / `LlamaIterable` — Streaming generation via Java `Iterator`/`Iterable`.
 - `LlamaLoader` — Extracts the platform-specific native library from the JAR to a temp directory, or finds it on `java.library.path`.
@@ -915,9 +947,9 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
 | `src/test/cpp/test_json_helpers.cpp` | 47 | All functions in `json_helpers.hpp`: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`, `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`, `parse_slot_prompt_similarity`, `parse_positive_int_config`, `wrap_stream_chunk` |
 | `src/test/cpp/test_log_helpers.cpp` | 13 | All functions in `log_helpers.hpp`: `log_level_name`, `format_log_as_json` |
 | `src/test/cpp/test_jni_helpers.cpp` | 47 | All functions in `jni_helpers.hpp` using a zero-filled `JNINativeInterface_` mock |
-| `src/test/cpp/test_tts_dsp.cpp` | 5 | All functions in `tts_dsp.hpp` (OuteTTS output DSP): `pcm_to_wav16_bytes` (WAV header/payload + little-endian clamping), `fill_hann_window`, `fold`, `embd_to_audio` |
+| `src/test/cpp/test_tts_wav.cpp` | 2 | The in-memory WAV writer `pcm_to_wav16_bytes` in `tts_wav.hpp` (WAV header/payload + little-endian clamping). The OuteTTS DSP it pairs with is derived from upstream `tts.cpp` and covered end-to-end by the Java `TtsIntegrationTest`, not unit-tested here. |
 
-**Current total: 457 tests (all passing).**
+**Current total: 454 tests (all passing).**
 
 #### Upstream source location (in CMake build tree)
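
The table row above describes the in-memory WAV writer only as "WAV header/payload + little-endian clamping". As an illustration of what such a writer involves, here is a minimal sketch of a 16-bit mono PCM WAV encoder. It is not the project's `pcm_to_wav16_bytes`: the name `wav16_sketch`, its signature, and the choice to clamp in the float domain are all assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

namespace {
// Append a 16-bit value in little-endian byte order.
void put_u16(std::vector<std::uint8_t> & out, std::uint16_t v) {
    out.push_back(static_cast<std::uint8_t>(v & 0xFF));
    out.push_back(static_cast<std::uint8_t>((v >> 8) & 0xFF));
}
// Append a 32-bit value in little-endian byte order.
void put_u32(std::vector<std::uint8_t> & out, std::uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>((v >> shift) & 0xFF));
    }
}
// Append a four-character chunk tag such as "RIFF".
void put_tag(std::vector<std::uint8_t> & out, const char (&tag)[5]) {
    out.insert(out.end(), tag, tag + 4);
}
} // namespace

// Hypothetical sketch: encode float samples in [-1, 1] as a mono 16-bit PCM WAV in memory.
std::vector<std::uint8_t> wav16_sketch(const std::vector<float> & pcm, std::uint32_t sample_rate) {
    const std::uint16_t channels    = 1;
    const std::uint16_t bits        = 16;
    const std::uint16_t block_align = static_cast<std::uint16_t>(channels * bits / 8);
    const std::uint32_t data_bytes  = static_cast<std::uint32_t>(pcm.size() * block_align);

    std::vector<std::uint8_t> out;
    out.reserve(44 + data_bytes);
    put_tag(out, "RIFF"); put_u32(out, 36 + data_bytes); put_tag(out, "WAVE");
    put_tag(out, "fmt "); put_u32(out, 16);              // PCM fmt chunk is 16 bytes
    put_u16(out, 1);                                      // audio format 1 = PCM
    put_u16(out, channels);
    put_u32(out, sample_rate);
    put_u32(out, sample_rate * block_align);              // byte rate
    put_u16(out, block_align);
    put_u16(out, bits);
    put_tag(out, "data"); put_u32(out, data_bytes);

    for (float s : pcm) {
        // Clamp out-of-range samples before scaling so they cannot overflow int16.
        const float clamped = std::max(-1.0f, std::min(1.0f, s));
        const std::int16_t v = static_cast<std::int16_t>(clamped * 32767.0f);
        put_u16(out, static_cast<std::uint16_t>(v));
    }
    return out;
}
```

With a 24000 Hz sample rate this yields the 44-byte canonical header followed by two bytes per sample, which is the layout a header/payload unit test would check.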

CMakeLists.txt

Lines changed: 32 additions & 1 deletion
@@ -151,6 +151,29 @@ FetchContent_Declare(
 )
 FetchContent_MakeAvailable(llama.cpp)
 
+# OuteTTS native pipeline: DERIVE the upstream tts.cpp helpers (DSP + prompt + text + the default
+# speaker profile) into a compilable translation unit at configure time, rather than hand-copying
+# them — a hand copy is a DRY/maintenance hazard that silently diverges on every llama.cpp upgrade.
+# tts.cpp cannot simply be added to target_sources because it defines its own main(); the generator
+# drops main() and gives the helpers external linkage. See cmake/generate-tts-upstream.cmake. The
+# generated file is never committed; it is regenerated from whatever tts.cpp the pinned GIT_TAG
+# resolves to, so a version bump is picked up automatically. The tag below is cosmetic provenance in
+# the generated banner — keep it in sync with the llama.cpp GIT_TAG above.
+set(JLLAMA_TTS_GEN_DIR ${CMAKE_BINARY_DIR}/tts_generated)
+set(JLLAMA_TTS_GEN_CPP ${JLLAMA_TTS_GEN_DIR}/tts_upstream_gen.cpp)
+file(MAKE_DIRECTORY ${JLLAMA_TTS_GEN_DIR})
+execute_process(
+    COMMAND ${CMAKE_COMMAND}
+        -DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
+        -DOUT_CPP=${JLLAMA_TTS_GEN_CPP}
+        -DLLAMA_TAG=b9739
+        -P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake
+    RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT
+)
+if(NOT JLLAMA_TTS_GEN_RESULT EQUAL 0)
+    message(FATAL_ERROR "OuteTTS extraction failed; see cmake/generate-tts-upstream.cmake")
+endif()
+
 # b8831 added ggml_graph_next_uid() which calls _InterlockedIncrement64 via
 # <intrin.h> on x86. The intrinsic only exists on x64; provide the
 # implementation in a compat TU so the linker resolves __InterlockedIncrement64.
@@ -264,10 +287,18 @@ endif()
 add_library(jllama SHARED
     src/main/cpp/jllama.cpp
     src/main/cpp/tts_engine.cpp
+    ${JLLAMA_TTS_GEN_CPP}
     src/main/cpp/utils.hpp
     ${llama.cpp_SOURCE_DIR}/tools/server/server-common.cpp
     ${llama.cpp_SOURCE_DIR}/tools/server/server-chat.cpp)
 
+# The generated TU keeps the whole pre-main() span of tts.cpp, so a few upstream CLI-only
+# helpers (print_usage, save_wav16, xterm colour) come along unused. Silence the resulting
+# unused-function warning on that one file (non-MSVC; MSVC's C4505 is off by default).
+if(NOT MSVC)
+    set_source_files_properties(${JLLAMA_TTS_GEN_CPP} PROPERTIES COMPILE_FLAGS "-Wno-unused-function")
+endif()
+
 # Phase 1 refactoring: compile upstream server library units directly into jllama
 # server.hpp has been replaced by direct upstream includes in jllama.cpp.
 # server-context.cpp, server-queue.cpp, server-task.cpp compile on all platforms
@@ -412,7 +443,7 @@ if(BUILD_TESTING)
     src/test/cpp/test_jni_helpers.cpp
     src/test/cpp/test_json_helpers.cpp
     src/test/cpp/test_log_helpers.cpp
-    src/test/cpp/test_tts_dsp.cpp
+    src/test/cpp/test_tts_wav.cpp
     ${llama.cpp_SOURCE_DIR}/tools/server/server-common.cpp
     ${llama.cpp_SOURCE_DIR}/tools/server/server-chat.cpp
     ${llama.cpp_SOURCE_DIR}/tools/server/server-context.cpp
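
The CMake comments above rely on two C++ linkage rules: a `static` function is visible only inside its own translation unit, and a namespace-scope `const` object has internal linkage unless it is declared `extern`. The sketch below shows the declaration/definition pairing those rules imply. All `demo_` names are hypothetical, the real signatures in `tts_upstream.h` are not shown in this commit page, and the string value is a truncated placeholder rather than the actual speaker profile.

```cpp
#include <string>

// --- what a hand-written interface header could contain (hypothetical names) ---
extern const std::string demo_default_audio_text;        // declaration only
std::string demo_process_text(const std::string & text); // declaration only

// --- what a generated translation unit could then define ---
// `extern` on the definition keeps external linkage; a plain namespace-scope
// `const std::string` would be internal and invisible to other translation units.
extern const std::string demo_default_audio_text = "<|text_start|>the<|text_sep|>";

// No `static` keyword: the function has external linkage, so a caller that only
// sees the declaration above can link against it.
std::string demo_process_text(const std::string & text) {
    return text; // placeholder body; the upstream helper does real text processing
}
```

If the declared type and the defined type disagree, the mismatch shows up when building or linking rather than at run time, which is the failure mode the "type change fails the link" contract describes.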

README.md

Lines changed: 3 additions & 3 deletions
@@ -515,9 +515,9 @@ try (TextToSpeech tts = new TextToSpeech(
 
 Add `(ttcPath, vocoderPath, gpuLayers, threads)` to offload to the GPU, or
 `synthesize(text, maxCodeTokens, topK, seed)` for explicit sampling. As with `LlamaModel`, native
-memory is not GC-managed — use try-with-resources or call `close()`. **Known limitation:** numeric
-digits in the input are dropped (number-to-words romanization is not yet ported), so spell numbers
-out for now; synthesis uses the built-in default speaker profile.
+memory is not GC-managed — use try-with-resources or call `close()`. Synthesis uses the built-in
+default speaker profile; English number words are expanded for speech (`3` → "three"), and
+non-English text is not romanized.
 
 Compatible GGUFs (the CI test defaults): OuteTTS
 [`OuteTTS-0.2-500M-GGUF`](https://huggingface.co/second-state/OuteTTS-0.2-500M-GGUF) +
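
To make the `3` → "three" behaviour in the README hunk concrete, here is a deliberately small sketch of digit-to-word expansion. It is not upstream's `replace_numbers_with_words`: it only handles runs of one or two digits, leaves longer runs untouched, and both function names are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: spell an integer in the range 0..99.
std::string small_number_to_words(int n) {
    static const char * const ones[] = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"};
    static const char * const tens[] = {
        "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
    if (n < 20) {
        return ones[n];
    }
    std::string out = tens[n / 10];
    if (n % 10 != 0) {
        out += "-";
        out += ones[n % 10];
    }
    return out;
}

// Hypothetical helper: replace each run of one or two digits with its English words.
std::string spell_out_digits(const std::string & text) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!std::isdigit(static_cast<unsigned char>(text[i]))) {
            out += text[i++];
            continue;
        }
        std::size_t j = i;
        while (j < text.size() && std::isdigit(static_cast<unsigned char>(text[j]))) {
            ++j;
        }
        const std::string run = text.substr(i, j - i);
        // Longer runs are outside this sketch's range and are copied unchanged.
        out += (run.size() <= 2) ? small_number_to_words(std::stoi(run)) : run;
        i = j;
    }
    return out;
}
```

The point of the commit is that the project no longer maintains logic like this itself; the generated translation unit carries upstream's own implementation.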

cmake/generate-tts-upstream.cmake

Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
1+
# SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
2+
#
3+
# SPDX-License-Identifier: MIT
4+
#
5+
# Build-time extractor for the OuteTTS native pipeline.
6+
#
7+
# Rather than hand-copying functions out of llama.cpp's tools/tts/tts.cpp (a maintenance
8+
# burden that silently diverges on every upgrade), this script DERIVES a compilable
9+
# translation unit MECHANICALLY from the pinned upstream source at configure time. The
10+
# generated file is never committed (it lives in the build tree) and is regenerated from
11+
# whatever tts.cpp the pinned GIT_TAG resolves to, so divergence is impossible and an
12+
# upstream bump is picked up automatically.
13+
#
14+
# What it does:
15+
# 1. Keeps everything in tts.cpp BEFORE `int main(` (includes, the outetts_version enum,
16+
# the number-to-words tables, and the DSP + prompt + text helpers). main() itself —
17+
# the standalone CLI entry point — is dropped (that is why tts.cpp cannot simply be
18+
# added to target_sources: its main() would clash at link time).
19+
# 2. Strips `static` from exactly the helpers the JNI engine calls, giving them external
20+
# linkage so tts_engine.cpp can link against them (they are `static`/internal upstream).
21+
# 3. Extracts the two hard-coded default-speaker literals (audio_text / audio_data), which
22+
# upstream embeds as locals inside main(), into two external constants.
23+
#
24+
# Every anchor is asserted: if upstream renames a function or moves the literals, the
25+
# configure step FAILS LOUDLY with a pointer here, the same fail-on-drift contract as
26+
# patches/. Inputs (via -D): TTS_SRC, OUT_CPP, LLAMA_TAG.
27+
28+
if(NOT EXISTS "${TTS_SRC}")
29+
message(FATAL_ERROR "generate-tts-upstream: upstream tts.cpp not found at '${TTS_SRC}'")
30+
endif()
31+
32+
file(READ "${TTS_SRC}" SRC)
33+
34+
# --- 1. keep the pre-main() portion (main() is the unbuildable-as-library CLI entry point) ---
35+
string(FIND "${SRC}" "\nint main(" MAIN_POS)
36+
if(MAIN_POS EQUAL -1)
37+
message(FATAL_ERROR "generate-tts-upstream: 'int main(' anchor not found in tts.cpp — upstream layout changed; update cmake/generate-tts-upstream.cmake")
38+
endif()
39+
string(SUBSTRING "${SRC}" 0 ${MAIN_POS} PREMAIN)
40+
41+
# --- 2. give external linkage to the helpers the JNI engine calls ---
42+
# Each entry is asserted present as `static <sig>` before stripping, so an upstream rename
43+
# fails the configure instead of silently dropping the symbol (caught later only at link).
44+
set(JLLAMA_TTS_DESTATIC
45+
"std::vector<float> embd_to_audio("
46+
"std::string process_text("
47+
"void prompt_add("
48+
"void prompt_init("
49+
"std::vector<llama_token> prepare_guide_tokens("
50+
"outetts_version get_tts_version(")
51+
foreach(sig IN LISTS JLLAMA_TTS_DESTATIC)
52+
string(FIND "${PREMAIN}" "static ${sig}" _pos)
53+
if(_pos EQUAL -1)
54+
message(FATAL_ERROR "generate-tts-upstream: expected 'static ${sig}' in upstream tts.cpp but it is absent — upstream changed; update the de-static list in cmake/generate-tts-upstream.cmake")
55+
endif()
56+
string(REPLACE "static ${sig}" "${sig}" PREMAIN "${PREMAIN}")
57+
endforeach()
58+
59+
# --- 3. extract the two default-speaker literals from inside main() ---
60+
# audio_text: a single-line std::string audio_text = "<|text_start|>the<|text_sep|>...";
61+
# The leading "<|text_start|>the<|text_sep|>" disambiguates it from the empty-seed literal
62+
# in audio_text_from_speaker(). Content runs to the next double-quote (it embeds none).
63+
set(_AT_DECL "std::string audio_text = \"")
64+
string(FIND "${SRC}" "${_AT_DECL}<|text_start|>the<|text_sep|>" _at_at)
65+
if(_at_at EQUAL -1)
66+
message(FATAL_ERROR "generate-tts-upstream: default audio_text literal not found in tts.cpp main() — upstream changed; update cmake/generate-tts-upstream.cmake")
67+
endif()
68+
string(LENGTH "${_AT_DECL}" _at_decl_len)
69+
math(EXPR _at_content "${_at_at} + ${_at_decl_len}")
70+
string(SUBSTRING "${SRC}" ${_at_content} -1 _at_rest)
71+
string(FIND "${_at_rest}" "\"" _at_len)
72+
string(SUBSTRING "${_at_rest}" 0 ${_at_len} AUDIO_TEXT)
73+
74+
# audio_data: a multi-line raw string std::string audio_data = R"(...)";
75+
# The R"( form disambiguates it from the empty-seed "..." literal in audio_data_from_speaker().
76+
# Content runs to the first )" (the body embeds none — only <|...|> tokens).
77+
set(_AD_DECL "std::string audio_data = R\"(")
78+
string(FIND "${SRC}" "${_AD_DECL}" _ad_at)
79+
if(_ad_at EQUAL -1)
80+
message(FATAL_ERROR "generate-tts-upstream: default audio_data raw-string literal not found in tts.cpp main() — upstream changed; update cmake/generate-tts-upstream.cmake")
81+
endif()
82+
string(LENGTH "${_AD_DECL}" _ad_decl_len)
83+
math(EXPR _ad_content "${_ad_at} + ${_ad_decl_len}")
84+
string(SUBSTRING "${SRC}" ${_ad_content} -1 _ad_rest)
85+
string(FIND "${_ad_rest}" ")\"" _ad_len)
86+
string(SUBSTRING "${_ad_rest}" 0 ${_ad_len} AUDIO_DATA)
87+
88+
# --- 4. emit the derived translation unit ---
89+
set(BANNER
90+
"// AUTO-GENERATED — DO NOT EDIT, DO NOT COMMIT.
91+
// Derived mechanically at build time by cmake/generate-tts-upstream.cmake from
92+
// llama.cpp tools/tts/tts.cpp @ ${LLAMA_TAG} (MIT-licensed, the llama.cpp authors).
93+
// Regenerated from the pinned upstream source on every configure; see CLAUDE.md.
94+
95+
")
96+
set(SPEAKER
97+
"
98+
// --- default OuteTTS speaker profile (en_male_1), extracted from upstream main() ---
99+
// `extern const` forces external linkage (a namespace-scope `const` is internal by default),
100+
// so tts_engine.cpp links against these via the `extern` declarations in tts_upstream.h.
101+
extern const std::string jllama_tts_default_audio_text = \"${AUDIO_TEXT}\";
102+
extern const std::string jllama_tts_default_audio_data = R\"(${AUDIO_DATA})\";
103+
")
104+
file(WRITE "${OUT_CPP}" "${BANNER}${PREMAIN}${SPEAKER}")
105+
message(STATUS "generate-tts-upstream: wrote ${OUT_CPP} (from tts.cpp @ ${LLAMA_TAG})")
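
Step 3 of the script locates a declaration prefix, then reads the literal's content up to its closing delimiter. The same idea can be sketched in C++ as below; `extract_literal` is a hypothetical name, not something in the commit. Unlike the script as shown, which checks the opening anchors but not the result of the closing-delimiter search, this sketch also fails when the closing delimiter is missing.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: find `decl` immediately followed by `content_prefix`, then
// return everything from the end of `decl` up to (not including) the next `close`.
// `content_prefix` only disambiguates the match; it stays part of the returned content.
std::string extract_literal(const std::string & src,
                            const std::string & decl,
                            const std::string & content_prefix,
                            const std::string & close) {
    const std::size_t at = src.find(decl + content_prefix);
    if (at == std::string::npos) {
        throw std::runtime_error("literal anchor not found: " + decl);
    }
    const std::size_t begin = at + decl.size();
    const std::size_t end   = src.find(close, begin);
    if (end == std::string::npos) {
        throw std::runtime_error("closing delimiter not found after: " + decl);
    }
    return src.substr(begin, end - begin);
}
```

For the two speaker literals the calls would take the form `extract_literal(src, "std::string audio_text = \"", "<|text_start|>the<|text_sep|>", "\"")` and `extract_literal(src, "std::string audio_data = R\"(", "", ")\"")`, matching the anchors the script uses.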
