Qwen3 TTS: Language hint? #2143

schnz · 2026-04-16T07:58:03Z


I am absolutely stunned by the Qwen3 TTS model. I really hope that there is going to be a model that has support for inlining emotions or instructions that shall apply to a specific part of the text prompt.

But I was wondering: I get really poor results for languages other than English. It's not that the model isn't capable of them, but it tends to use an English accent most of the time. After enough retries, the model sometimes uses the correct accent and produces decent output.
Is it possible to pass a language hint to the model for the to-be-synthesized text?

Answered by michael-chipmates


michael-chipmates · 2026-06-22T10:25:33Z


Short version: there's no language or accent flag you can pass per request. The TTS request in koboldcpp only carries the text, a speaker seed, an audio seed, the speaker voice/instruction, and an optional reference audio clip. There is no language field. (The language/langcode param you may have seen is for Whisper transcription, not TTS.)

The English-accent leak you're hitting is a model-level thing, not a missing koboldcpp knob. Qwen3-TTS tends to fall back to its dominant English distribution unless it's conditioned on a real example of the target language, which is why you sometimes get the right accent only after a few retries. Upstream has the same report: QwenLM/Qwen3-TTS#134 ("When I clone a voice in another language keep original accent"), where cloning an English voice into Italian gives a strong American accent, and a couple of people confirm the same for German. The takeaway from that thread is that same-language cloning works fine and cross-language does not.

So the fix that actually works today is voice cloning with a reference clip recorded in the target language by a native speaker. That conditions both the speaker and the accent. You'll want the Qwen3-TTS Base model for this (the VoiceDesign variant is the instruction-based one and can't clone).

Steps:

1. Grab a clean 5 to 10 second WAV of a native speaker of the language you want. Single speaker, no music or background noise, silence trimmed, just a normal sentence in that language.
2. Put it in a folder and point koboldcpp at it (use whichever Base and WavTokenizer GGUFs you downloaded):

koboldcpp --ttsmodel your-qwen3tts-base.gguf \
          --ttswavtokenizer your-wavtokenizer.gguf \
          --ttsgpu \
          --ttsdir /path/to/your/voices

--ttsdir loads every .wav/.mp3 in that folder as a selectable voice, and the filename becomes the voice name.

3. Select that voice per request. Over the API, pass the filename as the voice field, including the extension:

POST /v1/audio/speech
{
  "input": "your text in the target language",
  "voice": "german_sample.wav"
}

Pass the voice value exactly as the file is named, with the .wav on the end. If you drop the extension it silently falls back to no cloning, so that detail matters. In the built-in TTS UI you just pick it from the voice list instead.
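
If you'd rather script the request than hand-write the JSON, here is a minimal Python sketch of the same call. Some assumptions on my part: koboldcpp is reachable at http://localhost:5001, the response body is the raw audio, and the helper names (build_speech_payload, synthesize) are made up for illustration. The extension check is only there to catch the silent fallback described above before the request goes out.

```python
import json
import os
import urllib.request

# Assumption: koboldcpp is listening locally on this address.
BASE_URL = "http://localhost:5001"


def build_speech_payload(text: str, voice_file: str) -> dict:
    """Build the JSON body for POST /v1/audio/speech with a cloned voice."""
    # The voice must be the full filename. Without the extension the server
    # silently falls back to no cloning, so reject that case early.
    if os.path.splitext(voice_file)[1].lower() not in (".wav", ".mp3"):
        raise ValueError(
            f"voice must include its .wav/.mp3 extension: {voice_file!r}"
        )
    return {"input": text, "voice": voice_file}


def synthesize(text: str, voice_file: str, out_path: str) -> None:
    """Send the request and write the returned bytes to out_path.

    Assumption: the response body is the audio itself.
    """
    body = json.dumps(build_speech_payload(text, voice_file)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Usage would be something like synthesize("Guten Morgen, wie geht es dir?", "german_sample.wav", "out.wav") with the server running.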

Tips that make the accent stick: the reference clip is the single biggest lever, so use one that's genuinely native and natural-sounding, not TTS-generated. One clip per target language. If the clip itself has any English flavor, the output inherits it.
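
Since the clip matters this much, a quick sanity check before dropping it into the voices folder can save some retries. This is just a sketch using Python's standard wave module (so uncompressed PCM WAV only); check_reference_clip is a made-up helper name, and the 5 to 10 second bounds are simply the rule of thumb from step 1, not a limit enforced by koboldcpp. It can't tell you whether the speaker sounds native, only whether the length is in range.

```python
import wave


def check_reference_clip(path: str, min_s: float = 5.0, max_s: float = 10.0) -> list[str]:
    """Return a list of problems found with a reference WAV (empty if none)."""
    problems = []
    with wave.open(path, "rb") as w:
        # Duration in seconds = number of frames / frames per second.
        duration = w.getnframes() / w.getframerate()
    if not (min_s <= duration <= max_s):
        problems.append(
            f"duration {duration:.1f}s is outside the {min_s:.0f}-{max_s:.0f}s range"
        )
    return problems
```

An empty list means the length is fine; anything else is worth fixing before you start judging the accent.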

On your other question (inlining emotions/instructions): it's a fair ask, and there is partial support, just not the mid-text kind you're describing. The instruction path takes one instruction describing the speaker for the whole utterance, passed either in an instruction field or inline at the start of the text, as [your instruction] the text to narrate. It's one instruction per generation, not per segment, and per the wiki you can't combine an instruction with voice cloning. So choosing an emotion for one specific part of the text isn't exposed today.
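
For the inline form, a tiny helper shows the shape of the input string. Again a sketch: with_inline_instruction is a made-up name, and rejecting square brackets inside the instruction is my own precaution, not something I've confirmed the parser requires. The payload deliberately carries no cloned voice, because instructions and cloning can't be combined.

```python
def with_inline_instruction(instruction: str, text: str) -> str:
    """Prefix text with a single bracketed speaker instruction."""
    instruction = instruction.strip()
    # Precaution (assumption): keep brackets out of the instruction so the
    # leading [ ... ] block stays unambiguous.
    if "[" in instruction or "]" in instruction:
        raise ValueError("instruction must not contain square brackets")
    return f"[{instruction}] {text}"


# One instruction for the whole utterance, no per-segment control.
payload = {
    "input": with_inline_instruction(
        "A calm, warm female narrator", "Guten Morgen, wie geht es dir?"
    )
}
```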

Heavier route if cloning still isn't enough: a small fine-tune or LoRA on native sentences in the target language. Be aware it's not a guaranteed fix. In that same upstream thread someone fine-tuned on a few hours of audio and the accent still leaked, so treat it as experimental. For most cases a good native reference clip gets you most of the way.

1 reply

schnz Jun 22, 2026
Author

Thanks for taking the time to answer! If this is indeed a model-level thing, I have high hopes that it will be addressed in the next Qwen TTS model. I may try the workarounds you provided, although they're not feasible when working with VoiceDesign a lot.
