VoxCPM2 /v1/audio/speech returns a blank 0.16s WAV that contains no sound.

vLLM Version: 0.19.0

vLLM-Omni Version: 0.19.0rc2.dev275+ge375b1268
git sha: e375b1268

VoxCPM2 served via vllm serve --omni returns HTTP 200 with content-type: audio/wav,
but the returned WAV file is only 15,404 bytes (~0.16 seconds) and contains completely silent audio.
The same model works correctly with the offline end2end.py script (it produces a valid 3.52s WAV),
but inference there takes 38.6 seconds.
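
For anyone trying to reproduce this, a minimal sketch of how the "blank" claim can be checked on the returned bytes, using only the Python standard library. It assumes the response is 16-bit PCM WAV with a well-formed header; the function name inspect_wav is made up for illustration and is not part of vLLM or vLLM-Omni.

```python
import io
import struct
import wave


def inspect_wav(data: bytes) -> dict:
    """Report duration and peak amplitude for 16-bit PCM WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        n_frames = wav.getnframes()
        frames = wav.readframes(n_frames)

    # Assumption: 16-bit samples. Other widths need different unpacking.
    if width != 2:
        raise ValueError(f"expected 16-bit PCM, got {8 * width}-bit samples")

    # Decode little-endian signed 16-bit samples, ignoring a trailing odd byte.
    count = len(frames) // 2
    samples = struct.unpack(f"<{count}h", frames[: count * 2])
    peak = max((abs(s) for s in samples), default=0)

    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": n_frames / rate,
        "peak": peak,
        "silent": peak == 0,  # True when every sample is exactly zero
    }


# Example: inspect a saved response (the path is hypothetical).
# with open("out.wav", "rb") as f:
#     print(inspect_wav(f.read()))
```

A result with "silent": True and a duration_s near 0.16 would match the behavior described above; a nonzero peak would suggest the audio is only very quiet rather than empty.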

Questions:

Is there a known issue with /v1/audio/speech returning blank/silent WAV for VoxCPM2?
Is ref_audio the correct top-level JSON parameter for voice cloning, or should it be passed differently?
Do you have a reference script that loads the model once and benchmarks HTTP inference speed end-to-end?
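
To make the last question concrete, here is a rough sketch of the kind of benchmark meant: the server is started once separately, and the script only times HTTP round trips. The base URL, the model name, and the request fields are assumptions on my side (the "input" field follows the OpenAI-style speech API; whether vLLM-Omni expects additional fields such as ref_audio is exactly what is being asked above). The helper names post_speech and benchmark are illustrative only.

```python
import json
import statistics
import time
import urllib.request
from typing import Callable

# Assumptions, not confirmed by this issue: server address and request fields.
BASE_URL = "http://localhost:8000"
PAYLOAD = {"model": "VoxCPM2", "input": "Hello from the benchmark."}


def post_speech(payload: dict, base_url: str = BASE_URL) -> bytes:
    """POST a JSON payload to /v1/audio/speech and return the raw response body."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return resp.read()


def benchmark(
    send: Callable[[dict], bytes],
    payload: dict,
    runs: int = 5,
    warmup: int = 1,
) -> dict:
    """Time repeated requests; warmup calls are sent but not measured."""
    for _ in range(warmup):
        send(payload)

    latencies = []
    sizes = []
    for _ in range(runs):
        start = time.perf_counter()
        audio = send(payload)
        latencies.append(time.perf_counter() - start)
        sizes.append(len(audio))

    return {
        "runs": runs,
        "mean_s": statistics.mean(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
        "bytes": sizes,  # response size per run, useful for spotting empty audio
    }


# Example, with the server already running:
# print(benchmark(post_speech, PAYLOAD))
```

Passing the sender as a function keeps the timing loop independent of the HTTP client, so the same loop could wrap a different client or payload without changes.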

Issue #287