Commit ce9ee96

Zbig9000, claude, and GustavoA1604 authored
QVAC-21483 feat: add output-frequency selection (output_sample_rate) (#69)
* QVAC-21483 feat: add output-frequency selection (output_sample_rate)

Add an output_sample_rate knob to the Chatterbox and Supertonic engines and the tts-cli / supertonic-cli CLIs so callers can request an output frequency other than the engine's native rate (24 kHz Chatterbox / the model metadata rate for Supertonic). The final PCM is resampled with the existing Kaiser-windowed sinc resampler and SynthesisResult::sample_rate reports the actual rate. 0 keeps the native rate (default; no behaviour change); explicit rates are validated to [8000, 192000] Hz to match the @qvac/tts-ggml addon's JS-side window.

- voice_features: validate_output_sample_rate + resample_for_output helpers
- chatterbox::Engine: batch resamples the whole utterance once, streaming resamples per chunk (mel-hop bookkeeping stays native, so the documented result.pcm == concat(chunks) invariant still holds)
- supertonic::Engine: resample in run_single_chunk (covers batch + streaming)
- CLIs: --output-sample-rate HZ on tts-cli (chatterbox egress + supertonic path) and supertonic-cli
- tests: output-rate unit checks in test_resample + a new MTL-gated test_output_sample_rate (batch / streaming / out-of-range validation)

* QVAC-21483 fix: make streaming output-rate conversion seam-free (batch-exact)

Code-review follow-up to the output_sample_rate feature. Streaming resampled each chunk independently with the stateless whole-buffer resample_sinc, which restarts the output grid and truncates the sinc window at every chunk edge. That is only artifact-free when each chunk length is an exact multiple of the resample ratio's denominator -- which the pipeline does not guarantee (the trim-faded first chunk, the finalized last chunk, or rates such as 11025 Hz break it). When violated it injects a discontinuity at every seam (streamed-vs-batch SNR collapses below 10 dB) plus a per-chunk length drift.

Add a stateful OutputResampler that buffers native PCM and emits each output sample only once its sinc window is fully covered, flushing the tail on finish(). A stable sample's window is identical whether computed on the prefix or the full buffer, so concatenating every process() result + finish() is bit-for-bit identical to resampling the whole utterance once: the streamed output now equals the batch resample, with no seams and no drift, while result.pcm == concat(chunks) still holds.

- voice_features: OutputResampler (passthrough when target is 0/native)
- chatterbox::Engine streaming: one utterance-spanning resampler
- tts-cli streaming->stdout: same resampler, with inter-segment silence fed through it so it lands in order; fix the stale "@ 24 kHz" banner
- engine.h: correct the "resampled independently / inaudible" comment
- test_resample: model-free checks that streamed == batch (bit-exact) for misaligned / trim-faded / hostile-rate (11025) chunkings + passthrough

Verified: test-resample (24 checks) passes; the MTL-gated test-output-sample-rate passes on real GGUFs; tts-cli streaming to a file vs to stdout at 16 kHz now produce byte-identical PCM.

* QVAC-21483 fix: address review — Supertonic streaming output-rate batch-exact + engine coverage

Resolves GustavoA1604's review on PR #69:

1. supertonic_engine.cpp: run_single_chunk now returns NATIVE PCM. Batch synthesize() resamples the whole utterance once at the end; streaming synthesize_streaming() drives a single utterance-spanning OutputResampler (seam-fade moved to the native domain). Per-chunk resampling restarted the output grid at every chunk edge and injected seams + length drift at non-native rates — this mirrors chatterbox::Engine so streamed output is bit-identical to the batch resample.
2. supertonic_cli.cpp: validate --output-sample-rate at parse time against kOutputSampleRateMin/Max (throws -> clean "error:" + exit 2), matching chatterbox_cli / tts-cli instead of aborting later at the engine ctor.
3. test_output_sample_rate.cpp: assert streamed-16k == whole-buffer resample of streamed-native (the OutputResampler property), the form that isolates resampling and would catch a per-chunk regression.
4. test_output_sample_rate_supertonic.cpp (new) + CMake: sibling Supertonic engine coverage — native rate, 16 kHz batch ratio, construction rejection, streaming result.pcm == concat, and the streaming==batch-resample invariant.

---------

Co-authored-by: Zbigniew Herman <212399199+Zbig9000@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: GustavoA1604 <54457676+GustavoA1604@users.noreply.github.com>
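The stable-sample argument in the second fix can be illustrated without any model. The sketch below is not the library's OutputResampler: it swaps in a Hann-windowed sinc for the Kaiser window, never trims its input buffer, and every name in it (`SketchResampler`, `resample_whole`, `resample_chunked`) is hypothetical. It only aims to show why holding back an output sample until its window is fully covered makes chunked output match a single pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not the library's OutputResampler. A Hann-windowed
// sinc stands in for the Kaiser-windowed kernel to keep the code short.
struct SketchResampler {
    int in_rate;
    int out_rate;
    double cutoff;              // 1.0 when upsampling, out/in when downsampling
    double half_width;          // kernel half-width, in input samples
    std::vector<float> buf;     // all native PCM seen so far (never trimmed here)
    std::int64_t next_out = 0;  // index of the next output sample to emit

    SketchResampler(int in, int out)
        : in_rate(in), out_rate(out),
          cutoff(std::min(1.0, double(out) / double(in))),
          half_width(8.0 / cutoff) {
        assert(in > 0 && out > 0);
    }

    // Position of output sample k on the input time axis.
    double pos(std::int64_t k) const { return double(k) * in_rate / out_rate; }

    // Windowed-sinc interpolation at output index k; input outside the
    // buffer counts as zero.
    float sample_at(std::int64_t k) const {
        const double pi = 3.14159265358979323846;
        const double t = pos(k);
        const auto lo = std::max<std::int64_t>(0, std::int64_t(std::ceil(t - half_width)));
        const auto hi = std::min<std::int64_t>(std::int64_t(buf.size()) - 1,
                                               std::int64_t(std::floor(t + half_width)));
        double acc = 0.0;
        for (std::int64_t n = lo; n <= hi; ++n) {
            const double x = t - double(n);
            const double a = pi * cutoff * x;
            const double sinc = (a == 0.0) ? 1.0 : std::sin(a) / a;
            const double hann = 0.5 * (1.0 + std::cos(pi * x / half_width));
            acc += double(buf[std::size_t(n)]) * cutoff * sinc * hann;
        }
        return float(acc);
    }

    // Append a chunk, then emit only samples whose right window edge is
    // already inside the buffer, so later input cannot change them.
    std::vector<float> process(const std::vector<float>& chunk) {
        buf.insert(buf.end(), chunk.begin(), chunk.end());
        std::vector<float> out;
        while (std::floor(pos(next_out) + half_width) <= double(buf.size()) - 1.0)
            out.push_back(sample_at(next_out++));
        return out;
    }

    // Flush the tail: remaining grid points that fall before the end of input.
    std::vector<float> finish() {
        std::vector<float> out;
        while (pos(next_out) < double(buf.size()))
            out.push_back(sample_at(next_out++));
        return out;
    }
};

// One process() call over the whole buffer, plus the flushed tail.
std::vector<float> resample_whole(const std::vector<float>& x, int in, int out) {
    SketchResampler r(in, out);
    std::vector<float> y = r.process(x);
    const std::vector<float> tail = r.finish();
    y.insert(y.end(), tail.begin(), tail.end());
    return y;
}

// The same input fed in fixed-size chunks through one resampler instance.
std::vector<float> resample_chunked(const std::vector<float>& x, int in, int out,
                                    std::size_t chunk_len) {
    assert(chunk_len > 0);
    SketchResampler r(in, out);
    std::vector<float> y;
    for (std::size_t i = 0; i < x.size(); i += chunk_len) {
        const std::size_t end = std::min(x.size(), i + chunk_len);
        const std::vector<float> part = r.process(
            std::vector<float>(x.begin() + std::ptrdiff_t(i), x.begin() + std::ptrdiff_t(end)));
        y.insert(y.end(), part.begin(), part.end());
    }
    const std::vector<float> tail = r.finish();
    y.insert(y.end(), tail.begin(), tail.end());
    return y;
}
```

Under these assumptions, feeding 1000 samples at 24 kHz to an 11025 Hz target in 137-sample chunks should give the same vector as one whole-buffer call, because each output sample sums the same terms in the same order either way.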
1 parent 28f37ea commit ce9ee96

13 files changed

Lines changed: 903 additions & 35 deletions

tts-cpp/CMakeLists.txt

Lines changed: 25 additions & 0 deletions
@@ -981,6 +981,31 @@ if (TTS_CPP_BUILD_TESTS)
 ARGS "${_tcb_t3_turbo_gguf}" "${_tcb_s3gen_gguf}"
 REQUIRES "${_tcb_t3_turbo_gguf}" "${_tcb_s3gen_gguf}")
 
+# QVAC-21483 — output-frequency selection on the chatterbox::Engine API
+# (batch + streaming + validation). Uses the multilingual fixtures so it
+# runs anywhere the mtl-synth tests do; auto-disabled when they're absent.
+add_executable(test-output-sample-rate test/test_output_sample_rate.cpp)
+target_link_libraries(test-output-sample-rate PRIVATE tts-cpp ggml tts-cpp-backend-defs)
+target_include_directories(test-output-sample-rate PRIVATE ggml/include src include)
+tts_cpp_apply_ccache(test-output-sample-rate)
+tts_cpp_register_test(test-output-sample-rate
+LABEL "fixture"
+ARGS "${_tcb_t3_mtl_gguf}" "${_tcb_s3gen_mtl_gguf}"
+REQUIRES "${_tcb_t3_mtl_gguf}" "${_tcb_s3gen_mtl_gguf}")
+
+# QVAC-21483 — Supertonic sibling of the above: native rate, 16 kHz batch
+# ratio, construction rejection, streaming result.pcm == concat, and the
+# streaming-equals-whole-buffer-resample batch-exact property. Gated on the
+# Supertonic GGUF fixture; auto-disabled when it's absent.
+add_executable(test-output-sample-rate-supertonic test/test_output_sample_rate_supertonic.cpp)
+target_link_libraries(test-output-sample-rate-supertonic PRIVATE tts-cpp ggml tts-cpp-backend-defs)
+target_include_directories(test-output-sample-rate-supertonic PRIVATE ggml/include src include)
+tts_cpp_apply_ccache(test-output-sample-rate-supertonic)
+tts_cpp_register_test(test-output-sample-rate-supertonic
+LABEL "fixture"
+ARGS "${_tcb_super_gguf}"
+REQUIRES "${_tcb_super_gguf}")
+
 # CPU-side persistent-cache validation.
 # Exercises the time_mlp / time_emb / cfm_estimator / weight_mirror
 # caches that amortise per-synth overhead on the multilingual CPU

tts-cpp/README.md

Lines changed: 13 additions & 1 deletion
@@ -403,7 +403,8 @@ harnesses:
 | `build/supertonic-cli` | Supertonic-only end-to-end CLI (text → wav) — the same engine `tts-cli` invokes when it sees a Supertonic GGUF, exposed standalone for scripting and parity work |
 | `build/supertonic-bench` | Per-stage Supertonic benchmark harness (`--text` / `--out` / `--runs`); machine-readable RTF + per-stage timings |
 | `build/test-s3gen` | Staged numerical validation of S3Gen encoder + CFM vs Python dumps |
-| `build/test-resample` | Round-trip SNR of the C++ Kaiser-windowed sinc resampler |
+| `build/test-resample` | Round-trip SNR of the C++ Kaiser-windowed sinc resampler + output-frequency helpers (validate / passthrough / ratio) |
+| `build/test-output-sample-rate` | `--output-sample-rate` on `chatterbox::Engine`: native/16 kHz batch, out-of-range rejection, streaming `pcm == concat(chunks)` invariant (needs the MTL GGUFs) |
 | `build/test-voice-features` | 24 kHz 80-ch mel parity (prompt_feat) |
 | `build/test-fbank` | 16 kHz 80-ch Kaldi fbank parity |
 | `build/test-voice-encoder` | VoiceEncoder 256-d speaker embedding parity |
@@ -634,6 +635,17 @@ N=6 is too aggressive (cosine 0.990 right at the threshold, PCM cosine
 drops to 0.88). Streaming chunks ignore this flag and use
 `--stream-cfm-steps` instead.
 
+`--output-sample-rate HZ` (QVAC-21483) selects the output frequency. The
+pipeline natively emits 24 kHz (Chatterbox) / the model's metadata rate
+(Supertonic); pass a positive rate in `8000..192000` to resample the final
+PCM with the in-tree Kaiser-windowed sinc resampler before it's written or
+streamed (e.g. `--output-sample-rate 16000` for a 16 kHz wav). `0` (the
+default) keeps the native rate — zero behaviour change. Works on the
+single-shot, auto-split, and streaming paths and on both engines (the
+`tts_cpp::*::EngineOptions::output_sample_rate` field exposes the same knob
+to library callers; `SynthesisResult::sample_rate` always reports the actual
+rate).
+
 Everything is self-contained in the two `.gguf` files:
 
 - `chatterbox-t3-turbo.gguf` embeds the BPE tokenizer (vocab + merges +
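The README paragraph added above pins the accepted values to `0` or `8000..192000`. As a rough illustration of that rule, the sketch below reuses the names `validate_output_sample_rate` and `kOutputSampleRateMin/Max` from the commit message, but the signature, return value, and exception type are assumptions and may differ from the real helper in voice_features.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Bounds as stated in the commit message and README. The constant names
// appear in the commit message; their types here are assumed.
constexpr int kOutputSampleRateMin = 8000;
constexpr int kOutputSampleRateMax = 192000;

// Assumed shape of the validator: 0 means "keep the native rate"; any other
// value must lie inside the closed range. Negative values fail the range
// check like any other out-of-range request.
int validate_output_sample_rate(int hz) {
    if (hz != 0 && (hz < kOutputSampleRateMin || hz > kOutputSampleRateMax)) {
        throw std::invalid_argument(
            "output_sample_rate must be 0 or in [8000, 192000] Hz, got " +
            std::to_string(hz));
    }
    // Postcondition: either the native-rate sentinel or an in-range rate.
    assert(hz == 0 || (hz >= kOutputSampleRateMin && hz <= kOutputSampleRateMax));
    return hz;
}
```

A CLI could call such a function at parse time and turn the exception into an error message and a non-zero exit, which is the behaviour the third commit describes for supertonic-cli.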

tts-cpp/include/tts-cpp/chatterbox/engine.h

Lines changed: 14 additions & 0 deletions
@@ -193,6 +193,20 @@ struct EngineOptions {
 // S3Gen side. 0 = library default (2-step meanflow).
 int cfm_steps = 0;
 
+// QVAC-21483 — desired output sample rate in Hz. The Chatterbox pipeline
+// natively emits 24 kHz mono float32; when this is a positive rate other
+// than 24000 the engine resamples the final PCM (Kaiser-windowed sinc, the
+// same primitive used for reference-audio preprocessing) to the requested
+// rate and reports it on SynthesisResult::sample_rate. In streaming mode
+// every chunk is fed through one utterance-spanning resampler that emits an
+// output sample only once its sinc window is fully covered by the audio seen
+// so far, so the delivered chunks concatenate to exactly the same PCM as
+// resampling the whole utterance once — no per-chunk seam artifacts — and
+// the documented `result.pcm == concat(chunks)` invariant still holds.
+// 0 keeps the native 24 kHz (default; zero behaviour change). Validated at
+// construction to 0 or [8000, 192000] Hz.
+int output_sample_rate = 0;
+
 // ---------------- Streaming synthesis ----------------------------
 //
 // When `stream_chunk_tokens > 0` AND the caller passes a non-empty
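The commit message ties per-chunk seam artifacts to chunk lengths that are not a multiple of the resample ratio's denominator. A small sketch of that arithmetic, with hypothetical names (`Ratio`, `reduced_ratio`, `chunk_keeps_grid_aligned`) and assuming the ratio is output rate over native rate in lowest terms, shows why 11025 Hz is a much harder target than 16 kHz. Alignment here only concerns the output grid; it says nothing about sinc-window truncation at chunk edges.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>

// Output/native rate ratio in lowest terms (hypothetical helper type).
struct Ratio {
    int num;  // output samples per period
    int den;  // native samples per period
};

Ratio reduced_ratio(int native_hz, int target_hz) {
    assert(native_hz > 0 && target_hz > 0);
    const int g = std::gcd(native_hz, target_hz);  // C++17, <numeric>
    return Ratio{target_hz / g, native_hz / g};
}

// True when a chunk of chunk_len native samples spans a whole number of
// ratio periods, so restarting the output grid at its edge would not shift it.
bool chunk_keeps_grid_aligned(std::size_t chunk_len, int native_hz, int target_hz) {
    const Ratio r = reduced_ratio(native_hz, target_hz);
    return chunk_len % static_cast<std::size_t>(r.den) == 0;
}
```

For 24000 Hz to 16000 Hz the ratio reduces to 2/3, so any chunk length divisible by 3 stays aligned. For 24000 Hz to 11025 Hz it reduces to 147/320, so a 480-sample chunk that is fine at 16 kHz is already misaligned.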

tts-cpp/include/tts-cpp/supertonic/engine.h

Lines changed: 9 additions & 0 deletions
@@ -116,6 +116,15 @@ struct EngineOptions {
 int n_threads = 0;
 int n_gpu_layers = 0;
 
+// QVAC-21483 — desired output sample rate in Hz. Supertonic natively
+// emits at the model's metadata rate (typically 44.1 kHz); when this is a
+// positive rate other than the native one the engine resamples the final
+// PCM (Kaiser-windowed sinc) and reports it on
+// SynthesisResult::sample_rate. Honoured on both the batch and streaming
+// paths. 0 keeps the native rate (default; zero behaviour change).
+// Validated at construction to 0 or [8000, 192000] Hz.
+int output_sample_rate = 0;
+
 // Compute precision for matmul weights — see Precision enum above.
 // Default F32 is the current behaviour (load q8_0 GGUF, expand to f32).
 // F16 / Q8_0 are non-default GPU paths (Metal-validated).
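For callers sizing buffers, it can help to estimate what a given `output_sample_rate` implies. The diff does not show the library's exact length rule, so the sketch below assumes one: an output sample for every grid point that falls before the end of the native audio, i.e. ceil(n * target / native). Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Rate a caller would expect to see reported: 0 means "keep native".
// Hypothetical helper; the engine reports this via SynthesisResult::sample_rate.
int reported_sample_rate(int native_hz, int requested_hz) {
    assert(native_hz > 0 && requested_hz >= 0);
    return requested_hz == 0 ? native_hz : requested_hz;
}

// ASSUMED length rule, not taken from the library:
// ceil(n_native * target / native), computed in 64-bit integer arithmetic.
std::int64_t expected_output_len(std::int64_t n_native, int native_hz, int requested_hz) {
    assert(n_native >= 0);
    const int target = reported_sample_rate(native_hz, requested_hz);
    if (target == native_hz) return n_native;  // passthrough, no resampling
    return (n_native * target + native_hz - 1) / native_hz;
}
```

With these assumptions, one second of 44.1 kHz Supertonic audio (44100 samples) requested at 16 kHz comes out as 16000 samples, and a request of 0 leaves the length unchanged.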
