Issues with Hindi Language Voice Cloning in VoxCPM #288

@mishradibyajyoti

Description

When using the VoxCPM2 model for voice cloning on an Ubuntu GPU server, I encountered a few issues while testing Hindi voice generation.

1. Hindi generation using English reference audio
When I provide Hindi input text along with an English reference audio sample (e.g., a speech by Barack Obama), the model fails to generate proper Hindi audio output. In some cases, no meaningful audio is produced, or the output is unintelligible.

2. Extra word/artifact with Hindi reference audio
When I use a Hindi reference audio and generate Hindi speech, the output consistently contains an unexpected extra word or artifact at the beginning of the generated audio. This behavior is reproducible across multiple test samples.
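To put a number on the leading artifact, a rough diagnostic like the one below could report where the first pause in the generated waveform begins. This is only a sketch: the function name, the 20 ms frame size, the -40 dB threshold, and the 120 ms minimum pause are all assumptions chosen for illustration, and it measures the symptom rather than explaining its cause.

```python
import numpy as np

def leading_segment_end(wav, sample_rate, frame_ms=20, silence_db=-40.0, min_gap_ms=120):
    """Return the time (seconds) where the first pause starts, or None if no pause is found.

    Hypothetical helper: thresholds are illustrative, not tuned for VoxCPM output.
    """
    wav = np.asarray(wav, dtype=np.float32)
    frame = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(wav) // frame
    if n_frames == 0:
        return None

    # Frame-level RMS energy, compared against the loudest frame.
    frames = wav[: n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max()
    if peak == 0:
        return None
    voiced = 20 * np.log10(np.maximum(rms, 1e-12) / peak) > silence_db

    # Scan from the first voiced frame for a run of quiet frames long enough to count as a pause.
    min_gap = max(1, int(min_gap_ms / frame_ms))
    first_voiced = int(np.argmax(voiced))
    run = 0
    for i in range(first_voiced, n_frames):
        run = run + 1 if not voiced[i] else 0
        if run >= min_gap:
            return (i + 1 - run) * frame / sample_rate
    return None
```

Called as `leading_segment_end(wav, model.tts_model.sample_rate)` on each output, a short and consistent value across samples would suggest the extra word is a distinct segment ahead of the intended sentence.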

3. Hi-Fi cloning observation
I also tested the Hi-Fi cloning setup (with prompt audio and its transcript). The issues above still persist, especially the initial artifact in Hindi output.
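For clarity, the Hi-Fi setup can be sketched as a thin wrapper around `model.generate`. The keyword names `prompt_wav_path` and `prompt_text` are an assumption about the VoxCPM API and should be checked against the installed version; the wrapper name `generate_hifi` is hypothetical.

```python
def generate_hifi(model, text, prompt_wav, prompt_text,
                  cfg_value=2.0, inference_timesteps=10):
    """Generate speech from prompt audio plus its transcript.

    Hypothetical wrapper: `prompt_wav_path` and `prompt_text` are assumed
    keyword names; the remaining keywords match the example in this issue.
    """
    return model.generate(
        text=text,
        prompt_wav_path=prompt_wav,   # assumed keyword: path to the prompt audio
        prompt_text=prompt_text,      # assumed keyword: transcript of that audio
        cfg_value=cfg_value,
        inference_timesteps=inference_timesteps,
    )
```

With a loaded model this would be called as `generate_hifi(model, text, "hindi_prompt.wav", "<transcript of hindi_prompt.wav>")`, where the file name and transcript are placeholders.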

Minimal reproducible example:

```python
# -*- coding: utf-8 -*-

from voxcpm import VoxCPM
import soundfile as sf

# Load model
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False
)

# INPUTS
text = "हम बैकग्राउंड म्यूज़िक के साथ आवाज़ की स्पष्टता की जाँच कर रहे हैं।"
reference_audio = "Obama_reference.wav"

# Generate cloned speech
wav = model.generate(
    text=text,
    reference_wav_path=reference_audio,
    cfg_value=2.0,
    inference_timesteps=10,
)

# Save output
sf.write("output_hindi.wav", wav, model.tts_model.sample_rate)

print("Output saved as output_hindi.wav")
```

Environment details:

  • OS: Ubuntu
  • GPU: H100
  • Python: 3.10
  • Installation: Cloned from the official VoxCPM repository

Questions:

  1. Is cross-language voice cloning (e.g., English reference → Hindi output) currently supported or recommended?
  2. What could be causing the extra word or artifact at the beginning of Hindi outputs?
  3. Are there any best practices or parameter tuning recommendations for Hindi or multilingual voice cloning?
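To make question 3 concrete, a small sweep along these lines could compare settings side by side. It is a sketch only: the helper name `sweep_settings` and the candidate values are arbitrary assumptions, and the keyword names are the ones used in the example above.

```python
from itertools import product

def sweep_settings(generate, text, reference_wav,
                   cfg_values=(1.5, 2.0, 2.5), timesteps=(10, 20)):
    """Run `generate` once per (cfg_value, inference_timesteps) pair.

    Hypothetical helper: the default grids are illustrative, not recommended values.
    Returns a dict keyed by (cfg_value, inference_timesteps).
    """
    results = {}
    for cfg, steps in product(cfg_values, timesteps):
        results[(cfg, steps)] = generate(
            text=text,
            reference_wav_path=reference_wav,
            cfg_value=cfg,
            inference_timesteps=steps,
        )
    return results
```

Passing `model.generate` as the first argument would produce one waveform per combination, which can then be saved and compared by ear.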

Any guidance would be greatly appreciated. Thank you!
