Qwen3 TTS: Language hint? #2143
I am absolutely stunned by the Qwen3 TTS model. I really hope there is going to be a model that supports inlining emotions or instructions that apply to a specific part of the text prompt. But I was wondering: I get really poor results for languages other than English. It's not that the model isn't capable of them, but it tends to use an English accent most of the time. After enough retries, the model sometimes uses the correct accent and produces a decent output.
Replies: 1 comment 1 reply
Short version: there's no language or accent flag you can pass per request. The TTS request in koboldcpp only carries the text, a speaker seed, an audio seed, the speaker voice/instruction, and an optional reference audio clip. There is no `language` field. (The `language`/`langcode` param you may have seen is for Whisper transcription, not TTS.)

The English-accent leak you're hitting is a model-level thing, not a missing koboldcpp knob. Qwen3-TTS tends to fall back to its dominant English distribution unless it's conditioned on a real example of the target language, which is why you sometimes get the right accent only after a few retries.

Upstream has the same report: QwenLM/Qwen3-TTS#134 ("When I clone a voice in another language keep original accent"), where cloning an English voice into Italian gives a strong American accent, and a couple of people confirm the same for German. The takeaway from that thread is that same-language cloning works fine and cross-language cloning does not.

So the fix that actually works today is voice cloning with a reference clip recorded in the target language by a native speaker. That conditions both the speaker and the accent. You'll want the Qwen3TTS Base model for this (the VoiceDesign variant is the instruction-based one and can't clone). Steps:
Pass the voice value exactly as the file is named, with the …

Tips that make the accent stick: the reference clip is the single biggest lever, so use one that's genuinely native and natural-sounding, not TTS-generated. One clip per target language. If the clip itself has any English flavor, the output inherits it.

On your other question (inlining emotions/instructions): it's a fair ask, and there is partial support, just not the mid-text kind you're describing. The instruction path takes one instruction describing the speaker for the whole utterance, either as an …

Heavier route if cloning still isn't enough: a small fine-tune or LoRA on native sentences in the target language. Be aware it's not a guaranteed fix. In that same upstream thread someone fine-tuned on a few hours of audio and the accent still leaked, so treat it as experimental. For most cases a good native reference clip gets you most of the way.
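If it helps, here is a minimal sketch of the "one clip per target language" tip, written as a small client-side helper. It is only an illustration under assumptions: the payload keys `input` and `voice`, the `REFERENCE_CLIPS` mapping, and the file names are all hypothetical placeholders, so check the koboldcpp API docs for the real field names before using it. Note that the language code is only used to pick the clip and is never sent in the request, since there is no language field to send.

```python
from pathlib import Path

# Hypothetical mapping: one native-speaker reference clip per target language.
# The file names are placeholders, not files that ship with koboldcpp.
REFERENCE_CLIPS = {
    "de": Path("voices/german_native.wav"),
    "it": Path("voices/italian_native.wav"),
}


def build_tts_payload(text: str, language: str) -> dict:
    """Build a request body that uses the reference clip for `language`.

    The keys "input" and "voice" are assumed names for illustration only.
    """
    try:
        clip = REFERENCE_CLIPS[language]
    except KeyError:
        raise ValueError(
            f"no native reference clip registered for {language!r}"
        ) from None
    # Pass the voice value exactly as the file is named (no renaming or
    # normalising). The language code itself is not part of the payload.
    return {"input": text, "voice": clip.name}


payload = build_tts_payload("Guten Morgen, wie geht es dir?", "de")
print(payload)
```

Failing loudly on an unregistered language is deliberate: silently falling back to some other clip would just bring back the cross-language accent leak.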