Commit f499122

ci(tts): wire OuteTTS + WavTokenizer test models; drop in-code attribution
Add the two-model OuteTTS TTS pipeline to CI so TtsIntegrationTest runs:

- publish.yml: TTS_MODEL_URL/NAME (OuteTTS-0.2-500M-Q4_K_M) + TTS_VOCODER_URL/NAME (WavTokenizer-Large-75-F16) env vars; download steps + the matching -Dnet.ladenthin.llama.tts.{ttc,vocoder}.model flags on the Linux x86_64 (jcstress) test job.
- validate-models.sh: both GGUFs added to OPTIONAL_MODELS (validated when present, skipped where not downloaded).

Both URLs verified HTTP 200 (OuteTTS ~385 MB, WavTokenizer ~124 MB).

Per request, drop all in-code attribution from the TTS sources (tts_dsp.hpp, tts_engine.cpp, tts_engine.h): remove the "The llama.cpp authors" SPDX line and reword the "vendored/adapted from tts.cpp" comments to neutral descriptions. Each file keeps its single Bernard Ladenthin SPDX header + MIT license (REUSE stays compliant).

Comment-only change: native lib builds, clang-format clean, TTS DSP C++ tests pass.
1 parent 4774d9d commit f499122

5 files changed

Lines changed: 32 additions & 28 deletions

.github/validate-models.sh

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ OPTIONAL_MODELS=(
  "models/nomic-embed-text-v1.5.f16.gguf"
  "models/SmolVLM-500M-Instruct-Q8_0.gguf"
  "models/mmproj-SmolVLM-500M-Instruct-Q8_0.gguf"
+ "models/OuteTTS-0.2-500M-Q4_K_M.gguf"
+ "models/WavTokenizer-Large-75-F16.gguf"
  )

  validate_gguf() {

.github/workflows/publish.yml

Lines changed: 12 additions & 1 deletion
@@ -41,6 +41,11 @@ env:
  VISION_MODEL_NAME: "SmolVLM-500M-Instruct-Q8_0.gguf"
  VISION_MMPROJ_URL: "https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-Q8_0.gguf"
  VISION_MMPROJ_NAME: "mmproj-SmolVLM-500M-Instruct-Q8_0.gguf"
+ # Text-to-speech models for AudioInputIntegrationTest's sibling TtsIntegrationTest (OuteTTS pipeline).
+ TTS_MODEL_URL: "https://huggingface.co/second-state/OuteTTS-0.2-500M-GGUF/resolve/main/OuteTTS-0.2-500M-Q4_K_M.gguf"
+ TTS_MODEL_NAME: "OuteTTS-0.2-500M-Q4_K_M.gguf"
+ TTS_VOCODER_URL: "https://huggingface.co/ggml-org/WavTokenizer/resolve/main/WavTokenizer-Large-75-F16.gguf"
+ TTS_VOCODER_NAME: "WavTokenizer-Large-75-F16.gguf"
  # Test image used by MultimodalIntegrationTest is committed to the repo
  # at src/test/resources/images/test-image.jpg (see the README in that
  # directory for licensing). No download step is needed; CI just points
@@ -797,14 +802,20 @@ jobs:
  run: |
  ulimit -c unlimited
  echo "${{ github.workspace }}/core.%e.%p" | sudo tee /proc/sys/kernel/core_pattern
+ - name: Download TTS model (OuteTTS)
+ run: test -f models/${TTS_MODEL_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TTS_MODEL_URL} --create-dirs -o models/${TTS_MODEL_NAME}
+ - name: Download TTS vocoder (WavTokenizer)
+ run: test -f models/${TTS_VOCODER_NAME} || curl -L --proto =https --proto-redir =https --fail --retry 5 --retry-all-errors ${TTS_VOCODER_URL} --create-dirs -o models/${TTS_VOCODER_NAME}
  - name: Run tests
  run: |
  mvn -e --no-transfer-progress -P jcstress test \
  -Dnet.ladenthin.llama.tool.model=models/${TOOL_MODEL_NAME} \
  -Dnet.ladenthin.llama.nomic.path=models/${NOMIC_EMBED_MODEL_NAME} \
  -Dnet.ladenthin.llama.vision.model=models/${VISION_MODEL_NAME} \
  -Dnet.ladenthin.llama.vision.mmproj=models/${VISION_MMPROJ_NAME} \
- -Dnet.ladenthin.llama.vision.image=${VISION_IMAGE_PATH}
+ -Dnet.ladenthin.llama.vision.image=${VISION_IMAGE_PATH} \
+ -Dnet.ladenthin.llama.tts.ttc.model=models/${TTS_MODEL_NAME} \
+ -Dnet.ladenthin.llama.tts.vocoder.model=models/${TTS_VOCODER_NAME}
  - uses: actions/upload-artifact@v7
  if: success()
  with:

src/main/cpp/tts_dsp.hpp

Lines changed: 7 additions & 11 deletions
@@ -1,14 +1,10 @@
  // SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
- // SPDX-FileCopyrightText: 2023-2025 The llama.cpp authors
  //
  // SPDX-License-Identifier: MIT
  //
- // Text-to-speech output DSP, vendored from llama.cpp tools/tts/tts.cpp (the standalone
- // `llama-tts` CLI, which is not built as a library). These functions are pure signal
- // processing — no llama / ggml / JNI state — so they live in a header-only helper that the
- // JNI bridge and the C++ unit tests can both include. Kept byte-faithful to upstream so a
- // llama.cpp version bump is a mechanical re-sync; only `save_wav16` (file write) is replaced
- // by `pcm_to_wav16_bytes` (in-memory), since the JNI layer returns WAV bytes to Java.
+ // Text-to-speech output DSP for the OuteTTS pipeline: pure signal processing (no llama / ggml /
+ // JNI state), so it lives in a header-only helper that the JNI bridge and the C++ unit tests can
+ // both include. `pcm_to_wav16_bytes` returns WAV bytes in-memory (the JNI layer hands them to Java).

  #ifndef JLLAMA_TTS_DSP_HPP
  #define JLLAMA_TTS_DSP_HPP
@@ -24,7 +20,7 @@

  namespace jllama_tts {

- // --- vendored verbatim from tts.cpp (fill_hann_window/twiddle/irfft/fold/embd_to_audio) ---
+ // --- spectral synthesis: fill_hann_window / twiddle / irfft / fold / embd_to_audio ---

  inline void fill_hann_window(int length, bool periodic, float *output) {
  int offset = -1;
@@ -179,10 +175,10 @@ inline std::vector<float> embd_to_audio(const float *embd, const int n_codes, co
  return audio;
  }

- // --- in-memory replacement for tts.cpp's file-writing save_wav16 ---
+ // --- in-memory 16-bit WAV writer ---

- // Encode float PCM samples (range ~[-1, 1]) as a 16-bit mono WAV byte stream. Mirrors the
- // header layout and clamping of tts.cpp's save_wav16, but returns bytes instead of writing a file.
+ // Encode float PCM samples (range ~[-1, 1]) as a 16-bit mono WAV byte stream (returned, not
+ // written to a file).
  inline std::vector<uint8_t> pcm_to_wav16_bytes(const std::vector<float> &data, int sample_rate) {
  const uint16_t num_channels = 1;
  const uint16_t bits_per_sample = 16;
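
The hunk above cuts off inside `pcm_to_wav16_bytes`, so the body is not visible here. For readers unfamiliar with the format, the following is a minimal sketch of how float PCM could be packed into a canonical 44-byte-header, 16-bit mono WAV buffer. It is not the code in this commit: the namespace `wav_sketch`, the helpers `put_le` and `put_tag`, and the function name `encode_wav16_mono` are hypothetical, and the scaling factor of 32767 with clamping is an assumption about a reasonable conversion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not the commit's pcm_to_wav16_bytes.
namespace wav_sketch {

// Append an unsigned integer in little-endian byte order, as RIFF/WAVE requires.
template <typename T>
void put_le(std::vector<uint8_t> &out, T value) {
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out.push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFF));
    }
}

// Append a four-character chunk tag such as "RIFF" (the trailing NUL is skipped).
inline void put_tag(std::vector<uint8_t> &out, const char (&tag)[5]) {
    out.insert(out.end(), tag, tag + 4);
}

inline std::vector<uint8_t> encode_wav16_mono(const std::vector<float> &pcm, uint32_t sample_rate) {
    assert(sample_rate > 0);
    const uint16_t num_channels = 1;
    const uint16_t bits_per_sample = 16;
    const uint16_t block_align = static_cast<uint16_t>(num_channels * (bits_per_sample / 8));
    const uint32_t byte_rate = sample_rate * block_align;
    const uint32_t data_size = static_cast<uint32_t>(pcm.size()) * block_align;

    std::vector<uint8_t> out;
    out.reserve(44 + data_size);

    put_tag(out, "RIFF");
    put_le<uint32_t>(out, 36 + data_size); // bytes that follow this field
    put_tag(out, "WAVE");
    put_tag(out, "fmt ");
    put_le<uint32_t>(out, 16);             // size of the PCM fmt chunk
    put_le<uint16_t>(out, 1);              // audio format 1 = integer PCM
    put_le<uint16_t>(out, num_channels);
    put_le<uint32_t>(out, sample_rate);
    put_le<uint32_t>(out, byte_rate);
    put_le<uint16_t>(out, block_align);
    put_le<uint16_t>(out, bits_per_sample);
    put_tag(out, "data");
    put_le<uint32_t>(out, data_size);

    for (float sample : pcm) {
        // Scale to the int16 range and clamp so out-of-range input cannot wrap around.
        const float scaled = std::max(-32768.0f, std::min(32767.0f, sample * 32767.0f));
        put_le<uint16_t>(out, static_cast<uint16_t>(static_cast<int16_t>(scaled)));
    }
    return out;
}

} // namespace wav_sketch
```

Returning a byte vector rather than writing a file keeps the encoder free of I/O, which matches the stated goal of handing the WAV bytes across the JNI boundary.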

src/main/cpp/tts_engine.cpp

Lines changed: 8 additions & 12 deletions
@@ -1,17 +1,13 @@
  // SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
- // SPDX-FileCopyrightText: 2023-2025 The llama.cpp authors
  //
  // SPDX-License-Identifier: MIT
  //
- // Native OuteTTS pipeline, adapted from llama.cpp tools/tts/tts.cpp main() and restricted to a
- // single stream (n_parallel = 1) with the built-in default speaker. The OuteTTS prompt helpers
- // (process_text / prompt_* / prepare_guide_tokens / get_tts_version) and the default speaker
- // strings are vendored byte-faithfully so a llama.cpp bump is a mechanical re-sync. The audio
- // synthesis DSP lives in tts_dsp.hpp.
+ // Native OuteTTS text-to-speech pipeline, single-stream (n_parallel = 1) with the built-in default
+ // speaker. Loads the TTC (OuteTTS) and CTS (WavTokenizer vocoder) models, builds the OuteTTS prompt,
+ // generates audio codes, runs the vocoder, and turns the result into a 16-bit WAV (DSP in tts_dsp.hpp).
  //
- // Known simplification vs upstream: replace_numbers_with_words is a pass-through, so numeric digits
- // in the input are dropped by the alpha-only filter (the same place upstream notes it skips text
- // romanization). Spell numbers out in the caller for now.
+ // Known simplification: replace_numbers_with_words is a pass-through, so numeric digits in the input
+ // are dropped by the alpha-only filter. Spell numbers out in the caller for now.

  #include "tts_engine.h"

@@ -32,7 +28,7 @@ namespace {

  enum outetts_version { OUTETTS_V0_2, OUTETTS_V0_3 };

- // --- OuteTTS prompt helpers, vendored from tts.cpp ---
+ // --- OuteTTS prompt helpers ---

  std::string process_text(const std::string &text, const outetts_version tts_version) {
  // NOTE: number-to-words and non-English romanization are not ported (see file header).
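
The body of `process_text` is not shown in this hunk. To make the "digits are dropped by the alpha-only filter" caveat concrete, here is a small illustrative sketch of what such a filter does to numeric input. It is an assumption about the general technique, not the commit's implementation: the namespace `text_sketch` and the function `alpha_only` are hypothetical names, and the real helper may treat punctuation and separators differently.

```cpp
#include <string>

// Hypothetical illustration only; this is not the commit's process_text.
namespace text_sketch {

// Lowercase ASCII letters, keep them, collapse runs of whitespace to one space,
// and silently drop every other character (digits included).
inline std::string alpha_only(const std::string &text) {
    std::string out;
    bool pending_space = false;
    for (char raw : text) {
        char ch = raw;
        if (ch >= 'A' && ch <= 'Z') {
            ch = static_cast<char>(ch - 'A' + 'a');
        }
        if (ch >= 'a' && ch <= 'z') {
            // Emit at most one separator between words, never a leading one.
            if (pending_space && !out.empty()) {
                out.push_back(' ');
            }
            pending_space = false;
            out.push_back(ch);
        } else if (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r') {
            pending_space = true;
        }
        // Any other character falls through and is discarded.
    }
    return out;
}

} // namespace text_sketch
```

With a filter of this shape, `"Call me at 5 pm"` becomes `"call me at pm"`: the number vanishes without any error. That is why the file header asks callers to pass `"five"` instead of `"5"` until number-to-words conversion is ported.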
@@ -100,7 +96,7 @@ outetts_version get_tts_version(llama_model *model) {
  return OUTETTS_V0_2;
  }

- // Default speaker profile (OuteTTS en_male_1), vendored from tts.cpp.
+ // Default OuteTTS speaker profile (en_male_1).
  const char *default_audio_text() {
  return "<|text_start|>the<|text_sep|>overall<|text_sep|>package<|text_sep|>from<|text_sep|>just<|text_sep|>two<|"
  "text_sep|>people<|text_sep|>is<|text_sep|>pretty<|text_sep|>remarkable<|text_sep|>sure<|text_sep|>i<|"
@@ -311,7 +307,7 @@ bool engine_synthesize(tts_engine *engine, const std::string &text, int n_predic
  std::vector<float> audio = embd_to_audio(embd, n_codes, n_embd, engine->n_threads);
  llama_batch_free(cts_batch);

- // Zero the first 0.25 s (matches upstream; suppresses a leading click).
+ // Zero the first 0.25 s (suppresses a leading click).
  const int n_sr = 24000;
  for (int i = 0; i < n_sr / 4 && i < (int)audio.size(); ++i) {
  audio[i] = 0.0f;
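
At the 24 000 Hz rate used above, 0.25 s corresponds to 24000 / 4 = 6000 samples, and the loop's second condition guards against clips shorter than that. The same step can be sketched as a standalone helper, which may be easier to unit-test. The namespace `mute_sketch` and the function `zero_leading` are hypothetical names introduced here, not part of the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the commit.
namespace mute_sketch {

// Zero the first `seconds` of `audio` in place and return how many samples were zeroed.
// Buffers shorter than the requested span are zeroed entirely.
inline std::size_t zero_leading(std::vector<float> &audio, int sample_rate, double seconds) {
    assert(sample_rate > 0);
    if (seconds <= 0.0) {
        return 0;
    }
    const std::size_t wanted = static_cast<std::size_t>(sample_rate * seconds);
    const std::size_t count = std::min(wanted, audio.size());
    std::fill(audio.begin(), audio.begin() + static_cast<std::ptrdiff_t>(count), 0.0f);
    return count;
}

} // namespace mute_sketch
```

Calling `mute_sketch::zero_leading(audio, 24000, 0.25)` would reproduce the behavior of the loop in the hunk, under the assumption that the vocoder output is mono at 24 kHz.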

src/main/cpp/tts_engine.h

Lines changed: 3 additions & 4 deletions
@@ -2,10 +2,9 @@
  //
  // SPDX-License-Identifier: MIT
  //
- // Native text-to-speech engine: a self-contained orchestration of llama.cpp's two-model OuteTTS
- // pipeline (TTC OuteTTS LLM -> audio codec tokens; CTS WavTokenizer vocoder -> embeddings ->
- // embd_to_audio -> 16-bit WAV). Adapted from the standalone tools/tts/tts.cpp `main()`, restricted
- // to a single stream (n_parallel = 1). Kept out of jllama.cpp so the JNI layer stays thin.
+ // Native text-to-speech engine: a self-contained orchestration of the two-model OuteTTS pipeline
+ // (TTC OuteTTS LLM -> audio codec tokens; CTS WavTokenizer vocoder -> embeddings -> embd_to_audio ->
+ // 16-bit WAV), single-stream (n_parallel = 1). Kept out of jllama.cpp so the JNI layer stays thin.

  #ifndef JLLAMA_TTS_ENGINE_H
  #define JLLAMA_TTS_ENGINE_H
