vLLM Version: 0.19.0
vLLM-Omni Version: 0.19.0rc2.dev275+ge375b1268
git sha: e375b1268
VoxCPM2 served via vllm serve --omni returns HTTP 200 with content-type: audio/wav,
but the returned WAV file is only 15,404 bytes (~0.16 seconds) and contains completely silent audio.
The same model works correctly with the offline end2end.py script (it produces a valid 3.52 s WAV),
although offline inference takes 38.6 seconds.
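
To confirm that a response is truly silent rather than just very quiet, the returned bytes can be inspected with a short script. The following is a minimal sketch, not part of vLLM or vLLM-Omni: inspect_wav is a hypothetical helper name, and the peak check assumes the server returns 16-bit PCM WAV (Python's wave module does not read float-format WAV files).

```python
import array
import io
import sys
import wave


def inspect_wav(data: bytes) -> dict:
    """Return basic properties of a PCM WAV payload held in memory.

    Assumption: the payload is integer PCM. The peak amplitude is only
    computed for 16-bit samples; for other widths it is left as None.
    """
    with wave.open(io.BytesIO(data), "rb") as wf:
        n_channels = wf.getnchannels()
        sample_width = wf.getsampwidth()  # bytes per sample
        frame_rate = wf.getframerate()
        n_frames = wf.getnframes()
        frames = wf.readframes(n_frames)

    duration_s = n_frames / frame_rate if frame_rate else 0.0

    peak = None
    if sample_width == 2:
        # Drop a trailing odd byte, if any, so the buffer holds whole samples.
        usable = frames[: len(frames) - (len(frames) % 2)]
        samples = array.array("h")
        samples.frombytes(usable)
        if sys.byteorder == "big":
            samples.byteswap()  # WAV sample data is little-endian
        peak = max((abs(s) for s in samples), default=0)

    return {
        "channels": n_channels,
        "sample_width_bytes": sample_width,
        "frame_rate_hz": frame_rate,
        "frames": n_frames,
        "duration_s": duration_s,
        "peak_abs_sample": peak,  # 0 means every sample is exactly zero
    }
```

Calling inspect_wav(open("out.wav", "rb").read()) on the saved response (out.wav is a placeholder file name) and seeing peak_abs_sample equal to 0 would mean every sample is exactly zero.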
Questions:
Is there a known issue with /v1/audio/speech returning blank/silent WAV for VoxCPM2?
Is ref_audio the correct top-level JSON parameter for voice cloning, or should it be passed differently?
Do you have a reference script that loads the model once and benchmarks HTTP inference speed end-to-end?
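
To make the last question concrete, this is roughly the kind of harness meant: the server is started once, and only the HTTP round trips are timed. It is a minimal sketch using the Python standard library. post_speech and benchmark are hypothetical helper names, and the URL, port, model name, and payload fields shown in the usage comments are placeholders rather than a confirmed request schema.

```python
import json
import statistics
import time
import urllib.request


def post_speech(url: str, payload: dict, timeout: float = 120.0) -> bytes:
    """POST a JSON payload and return the raw response body."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def benchmark(send, n_requests: int = 5, warmup: int = 1) -> dict:
    """Time repeated calls to send(), a zero-argument callable returning bytes.

    Warm-up calls are made first and excluded from the statistics, so that
    one-time costs on the first request do not skew the numbers.
    """
    for _ in range(warmup):
        send()

    latencies = []
    sizes = []
    for _ in range(n_requests):
        start = time.perf_counter()
        body = send()
        latencies.append(time.perf_counter() - start)
        sizes.append(len(body))

    return {
        "mean_s": statistics.mean(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
        "sizes": sizes,  # response sizes in bytes, useful for spotting tiny outputs
    }


# Example usage (placeholders: host, port, model name, and payload fields
# are assumptions, not a confirmed schema for this endpoint):
#
# url = "http://localhost:8000/v1/audio/speech"
# payload = {"model": "<model-name>", "input": "Hello world"}
# print(benchmark(lambda: post_speech(url, payload)))
```

Passing the request as a callable keeps the timing loop independent of the payload, so the same loop could be reused with or without ref_audio once the correct way to pass it is known.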