ci(tts): wire OuteTTS + WavTokenizer test models; drop in-code attribution

vaijurao · vaijurao · commit f499122aa291 · 2026-06-21T13:00:15.000-07:00

Add the two-model OuteTTS text-to-speech pipeline to CI so that TtsIntegrationTest runs:
- publish.yml: add the TTS_MODEL_URL/NAME (OuteTTS-0.2-500M-Q4_K_M) and TTS_VOCODER_URL/NAME
  (WavTokenizer-Large-75-F16) env vars, the two download steps, and the matching
  -Dnet.ladenthin.llama.tts.{ttc,vocoder}.model flags on the Linux x86_64
  (jcstress) test job.
- validate-models.sh: both GGUFs added to OPTIONAL_MODELS (validated when present,
  skipped where not downloaded).
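
The "validated when present, skipped where not downloaded" behaviour can be sketched roughly as below. This is a hypothetical illustration, not the body of validate-models.sh: the function names `has_gguf_magic` and `check_optional_models` are invented here, and the only check assumed is that a GGUF file starts with the 4-byte magic `GGUF`. The real `validate_gguf` may check more.

```shell
#!/bin/sh
# Hypothetical sketch of optional-model validation; not the real validate-models.sh.

# Succeeds when the file starts with the 4-byte GGUF magic ("GGUF").
has_gguf_magic() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Prints one status line per model path given as an argument.
# Missing files are skipped; only a present-but-invalid file makes the
# function return non-zero.
check_optional_models() {
  _status=0
  for _model in "$@"; do
    if [ ! -f "$_model" ]; then
      echo "SKIP $_model"
    elif has_gguf_magic "$_model"; then
      echo "OK $_model"
    else
      echo "FAIL $_model"
      _status=1
    fi
  done
  return "$_status"
}
```

A caller would pass the optional paths as arguments, for example `check_optional_models models/OuteTTS-0.2-500M-Q4_K_M.gguf models/WavTokenizer-Large-75-F16.gguf`. Jobs that never download the TTS models would then see only SKIP lines and still pass.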

Both URLs verified HTTP 200 (OuteTTS ~385 MB, WavTokenizer ~124 MB).
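
The two download steps in publish.yml share one pattern: fetch only when the file is not already on disk. A minimal sketch of that pattern as a reusable function follows. The name `fetch_if_missing` is hypothetical; the curl flags are the ones used in the workflow steps, and the arguments are quoted here as a precaution the workflow does not strictly need for these file names.

```shell
#!/bin/sh
# Hypothetical helper mirroring the CI download steps; not part of the commit.
# Usage: fetch_if_missing URL DEST
fetch_if_missing() {
  # Skip the network entirely when the destination file already exists.
  if [ -f "$2" ]; then
    echo "cached $2"
    return 0
  fi
  # HTTPS only (including redirects), fail on HTTP errors, retry up to 5 times.
  curl -L --proto '=https' --proto-redir '=https' --fail \
    --retry 5 --retry-all-errors --create-dirs -o "$2" "$1"
}
```

In the workflow this would correspond to something like `fetch_if_missing "$TTS_MODEL_URL" "models/$TTS_MODEL_NAME"`, and the same call with the vocoder variables.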

Per request, drop all in-code attribution from the TTS sources (tts_dsp.hpp,
tts_engine.cpp, tts_engine.h): remove the "The llama.cpp authors" SPDX line and
reword the "vendored/adapted from tts.cpp" comments to neutral descriptions.
Each file keeps its single Bernard Ladenthin SPDX header + MIT license (REUSE
stays compliant). Comment-only change: native lib builds, clang-format clean,
TTS DSP C++ tests pass.
diff --git a/.github/validate-models.sh b/.github/validate-models.sh
@@ -26,6 +26,8 @@ OPTIONAL_MODELS=(
   "models/nomic-embed-text-v1.5.f16.gguf"
   "models/SmolVLM-500M-Instruct-Q8_0.gguf"
   "models/mmproj-SmolVLM-500M-Instruct-Q8_0.gguf"
+  "models/OuteTTS-0.2-500M-Q4_K_M.gguf"
+  "models/WavTokenizer-Large-75-F16.gguf"
 )
 
 validate_gguf() {
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -41,6 +41,11 @@ env:
   VISION_MODEL_NAME: "SmolVLM-500M-Instruct-Q8_0.gguf"
   VISION_MMPROJ_URL: "https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-Q8_0.gguf"
   VISION_MMPROJ_NAME: "mmproj-SmolVLM-500M-Instruct-Q8_0.gguf"
+  # Text-to-speech models for AudioInputIntegrationTest's sibling TtsIntegrationTest (OuteTTS pipeline).
+  TTS_MODEL_URL: "https://huggingface.co/second-state/OuteTTS-0.2-500M-GGUF/resolve/main/OuteTTS-0.2-500M-Q4_K_M.gguf"
+  TTS_MODEL_NAME: "OuteTTS-0.2-500M-Q4_K_M.gguf"
+  TTS_VOCODER_URL: "https://huggingface.co/ggml-org/WavTokenizer/resolve/main/WavTokenizer-Large-75-F16.gguf"
+  TTS_VOCODER_NAME: "WavTokenizer-Large-75-F16.gguf"
   # Test image used by MultimodalIntegrationTest is committed to the repo
   # at src/test/resources/images/test-image.jpg (see the README in that
   # directory for licensing). No download step is needed; CI just points
@@ -797,14 +802,20 @@ jobs:
         run: |
           ulimit -c unlimited
           echo "${{ github.workspace }}/core.%e.%p" | sudo tee /proc/sys/kernel/core_pattern
+      - name: Download TTS model (OuteTTS)
+        run: test -f models/${TTS_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TTS_MODEL_URL} --create-dirs -o models/${TTS_MODEL_NAME}
+      - name: Download TTS vocoder (WavTokenizer)
+        run: test -f models/${TTS_VOCODER_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TTS_VOCODER_URL} --create-dirs -o models/${TTS_VOCODER_NAME}
       - name: Run tests
         run: |
           mvn -e --no-transfer-progress -P jcstress test \
             -Dnet.ladenthin.llama.tool.model=models/${TOOL_MODEL_NAME} \
             -Dnet.ladenthin.llama.nomic.path=models/${NOMIC_EMBED_MODEL_NAME} \
             -Dnet.ladenthin.llama.vision.model=models/${VISION_MODEL_NAME} \
             -Dnet.ladenthin.llama.vision.mmproj=models/${VISION_MMPROJ_NAME} \
-            -Dnet.ladenthin.llama.vision.image=${VISION_IMAGE_PATH}
+            -Dnet.ladenthin.llama.vision.image=${VISION_IMAGE_PATH} \
+            -Dnet.ladenthin.llama.tts.ttc.model=models/${TTS_MODEL_NAME} \
+            -Dnet.ladenthin.llama.tts.vocoder.model=models/${TTS_VOCODER_NAME}
       - uses: actions/upload-artifact@v7
         if: success()
         with:
diff --git a/src/main/cpp/tts_dsp.hpp b/src/main/cpp/tts_dsp.hpp
@@ -1,14 +1,10 @@
 // SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
-// SPDX-FileCopyrightText: 2023-2025 The llama.cpp authors
 //
 // SPDX-License-Identifier: MIT
 //
-// Text-to-speech output DSP, vendored from llama.cpp tools/tts/tts.cpp (the standalone
-// `llama-tts` CLI, which is not built as a library). These functions are pure signal
-// processing — no llama / ggml / JNI state — so they live in a header-only helper that the
-// JNI bridge and the C++ unit tests can both include. Kept byte-faithful to upstream so a
-// llama.cpp version bump is a mechanical re-sync; only `save_wav16` (file write) is replaced
-// by `pcm_to_wav16_bytes` (in-memory), since the JNI layer returns WAV bytes to Java.
+// Text-to-speech output DSP for the OuteTTS pipeline: pure signal processing (no llama / ggml /
+// JNI state), so it lives in a header-only helper that the JNI bridge and the C++ unit tests can
+// both include. `pcm_to_wav16_bytes` returns WAV bytes in-memory (the JNI layer hands them to Java).
 
 #ifndef JLLAMA_TTS_DSP_HPP
 #define JLLAMA_TTS_DSP_HPP
@@ -24,7 +20,7 @@
 
 namespace jllama_tts {
 
-// --- vendored verbatim from tts.cpp (fill_hann_window/twiddle/irfft/fold/embd_to_audio) ---
+// --- spectral synthesis: fill_hann_window / twiddle / irfft / fold / embd_to_audio ---
 
 inline void fill_hann_window(int length, bool periodic, float *output) {
     int offset = -1;
@@ -179,10 +175,10 @@ inline std::vector<float> embd_to_audio(const float *embd, const int n_codes, co
     return audio;
 }
 
-// --- in-memory replacement for tts.cpp's file-writing save_wav16 ---
+// --- in-memory 16-bit WAV writer ---
 
-// Encode float PCM samples (range ~[-1, 1]) as a 16-bit mono WAV byte stream. Mirrors the
-// header layout and clamping of tts.cpp's save_wav16, but returns bytes instead of writing a file.
+// Encode float PCM samples (range ~[-1, 1]) as a 16-bit mono WAV byte stream (returned, not
+// written to a file).
 inline std::vector<uint8_t> pcm_to_wav16_bytes(const std::vector<float> &data, int sample_rate) {
     const uint16_t num_channels = 1;
     const uint16_t bits_per_sample = 16;
diff --git a/src/main/cpp/tts_engine.cpp b/src/main/cpp/tts_engine.cpp
@@ -1,17 +1,13 @@
 // SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
-// SPDX-FileCopyrightText: 2023-2025 The llama.cpp authors
 //
 // SPDX-License-Identifier: MIT
 //
-// Native OuteTTS pipeline, adapted from llama.cpp tools/tts/tts.cpp main() and restricted to a
-// single stream (n_parallel = 1) with the built-in default speaker. The OuteTTS prompt helpers
-// (process_text / prompt_* / prepare_guide_tokens / get_tts_version) and the default speaker
-// strings are vendored byte-faithfully so a llama.cpp bump is a mechanical re-sync. The audio
-// synthesis DSP lives in tts_dsp.hpp.
+// Native OuteTTS text-to-speech pipeline, single-stream (n_parallel = 1) with the built-in default
+// speaker. Loads the TTC (OuteTTS) and CTS (WavTokenizer vocoder) models, builds the OuteTTS prompt,
+// generates audio codes, runs the vocoder, and turns the result into a 16-bit WAV (DSP in tts_dsp.hpp).
 //
-// Known simplification vs upstream: replace_numbers_with_words is a pass-through, so numeric digits
-// in the input are dropped by the alpha-only filter (the same place upstream notes it skips text
-// romanization). Spell numbers out in the caller for now.
+// Known simplification: replace_numbers_with_words is a pass-through, so numeric digits in the input
+// are dropped by the alpha-only filter. Spell numbers out in the caller for now.
 
 #include "tts_engine.h"
 
@@ -32,7 +28,7 @@ namespace {
 
 enum outetts_version { OUTETTS_V0_2, OUTETTS_V0_3 };
 
-// --- OuteTTS prompt helpers, vendored from tts.cpp ---
+// --- OuteTTS prompt helpers ---
 
 std::string process_text(const std::string &text, const outetts_version tts_version) {
     // NOTE: number-to-words and non-English romanization are not ported (see file header).
@@ -100,7 +96,7 @@ outetts_version get_tts_version(llama_model *model) {
     return OUTETTS_V0_2;
 }
 
-// Default speaker profile (OuteTTS en_male_1), vendored from tts.cpp.
+// Default OuteTTS speaker profile (en_male_1).
 const char *default_audio_text() {
     return "<|text_start|>the<|text_sep|>overall<|text_sep|>package<|text_sep|>from<|text_sep|>just<|text_sep|>two<|"
            "text_sep|>people<|text_sep|>is<|text_sep|>pretty<|text_sep|>remarkable<|text_sep|>sure<|text_sep|>i<|"
@@ -311,7 +307,7 @@ bool engine_synthesize(tts_engine *engine, const std::string &text, int n_predic
     std::vector<float> audio = embd_to_audio(embd, n_codes, n_embd, engine->n_threads);
     llama_batch_free(cts_batch);
 
-    // Zero the first 0.25 s (matches upstream; suppresses a leading click).
+    // Zero the first 0.25 s (suppresses a leading click).
     const int n_sr = 24000;
     for (int i = 0; i < n_sr / 4 && i < (int)audio.size(); ++i) {
         audio[i] = 0.0f;
diff --git a/src/main/cpp/tts_engine.h b/src/main/cpp/tts_engine.h
@@ -2,10 +2,9 @@
 //
 // SPDX-License-Identifier: MIT
 //
-// Native text-to-speech engine: a self-contained orchestration of llama.cpp's two-model OuteTTS
-// pipeline (TTC OuteTTS LLM -> audio codec tokens; CTS WavTokenizer vocoder -> embeddings ->
-// embd_to_audio -> 16-bit WAV). Adapted from the standalone tools/tts/tts.cpp `main()`, restricted
-// to a single stream (n_parallel = 1). Kept out of jllama.cpp so the JNI layer stays thin.
+// Native text-to-speech engine: a self-contained orchestration of the two-model OuteTTS pipeline
+// (TTC OuteTTS LLM -> audio codec tokens; CTS WavTokenizer vocoder -> embeddings -> embd_to_audio ->
+// 16-bit WAV), single-stream (n_parallel = 1). Kept out of jllama.cpp so the JNI layer stays thin.
 
 #ifndef JLLAMA_TTS_ENGINE_H
 #define JLLAMA_TTS_ENGINE_H
