VoxCPM2 /v1/audio/speech returns a blank 0.16s WAV that contains no sound. #287

@mishradibyajyoti

vLLM Version: 0.19.0

vLLM-Omni Version: 0.19.0rc2.dev275+ge375b1268
git sha: e375b1268

VoxCPM2 served via vllm serve --omni returns HTTP 200 with content-type: audio/wav,
but the WAV file is only 15,404 bytes (~0.16 seconds) of completely silent audio.
The same model works correctly with the offline end2end.py script (it produces a valid 3.52s WAV),
but that run takes 38.6 seconds of inference time.
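
For anyone trying to reproduce this, the "short and silent" symptom can be confirmed with a small check like the one below. It is only a sketch using the Python standard library, and it assumes the server returns integer 16-bit PCM WAV (the wave module rejects other encodings such as float WAV). The function name inspect_wav is a hypothetical helper, not part of vLLM or vLLM-Omni.

```python
import array
import io
import sys
import wave


def inspect_wav(data: bytes) -> dict:
    """Report duration and peak amplitude for 16-bit PCM WAV bytes.

    Assumption: the payload is integer 16-bit PCM. Other sample widths
    are rejected explicitly rather than guessed at.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        raw = wav.readframes(wav.getnframes())

    if width != 2:
        raise ValueError(f"expected 16-bit PCM, got sample width {width}")

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        # WAV data is little-endian; array uses the native byte order.
        samples.byteswap()

    # Derive the frame count from the bytes actually read, so a wrong
    # frame count in the header does not distort the duration.
    frames = len(samples) // channels
    peak = max((abs(s) for s in samples), default=0)
    return {"duration_s": frames / rate, "peak": peak, "silent": peak == 0}
```

Calling inspect_wav on the bytes of the saved response (for example the contents of the downloaded file read in binary mode) gives the duration in seconds and whether every sample is zero.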

Questions:

Is there a known issue with /v1/audio/speech returning blank/silent WAV for VoxCPM2?
Is ref_audio the correct top-level JSON parameter for voice cloning, or should it be passed differently?
Do you have a reference script that loads the model once and benchmarks HTTP inference speed end-to-end?
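
To make the last question concrete, the kind of script meant is roughly the sketch below: the server holds the model, and a client times repeated POSTs to /v1/audio/speech. This is a standard-library sketch, not a reference implementation. The helper names (build_speech_request, send, benchmark, run_against_server) are hypothetical, and the payload fields and base URL are assumptions that need to be replaced with whatever the server actually expects (including how ref_audio should be passed, which is exactly what the second question asks).

```python
import json
import time
import urllib.request
from typing import Callable


def build_speech_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the /v1/audio/speech endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> bytes:
    """Send the request and return the raw response body (the WAV bytes)."""
    with urllib.request.urlopen(request, timeout=300) as resp:
        return resp.read()


def benchmark(call: Callable[[], bytes], runs: int = 5) -> list[float]:
    """Time `call` end-to-end `runs` times; returns seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return timings


def run_against_server(base_url: str, payload: dict, runs: int = 5) -> list[float]:
    """Benchmark a running server; the model is loaded once, server-side."""
    return benchmark(lambda: send(build_speech_request(base_url, payload)), runs)
```

A call might look like run_against_server("http://localhost:8000", {"model": "...", "input": "Hello"}, runs=5), where the URL, the model value, and the field names are placeholders rather than confirmed parameters.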
