Qwen3TTSBatchSession.step does not stream incrementally or emit final chunk on 0.4.3 #720

@terrazorc

Qwen3TTSBatchSession.step does not stream incrementally or emit final chunk on mlx-audio 0.4.3

Summary

Qwen3TTSBatchSession appears to be the canonical API for a fixed TTS model
serving many texts, but in mlx-audio==0.4.3 I cannot get it to behave like a
streaming API for Qwen3-TTS. With stream=True and streaming_interval=0.32,
session.step() emits one large audio event only after the utterance is already
substantially complete, then returns empty events until my safety cap is hit.
ev.is_final_chunk never becomes true.

This makes it hard to use the batch-session path for low-latency voice-agent TTS,
even though it looks like the right API shape for a persistent process with one
loaded voice/model and many short utterances.

Environment

  • macOS: macOS-15.6-arm64-arm-64bit-Mach-O
  • Machine: Apple Silicon / arm64
  • Python: 3.14.4
  • mlx-audio: 0.4.3
  • mlx: 0.31.2
  • mlx-metal: 0.31.2
  • mlx-lm: 0.31.3
  • Model tested: mlx-community/Qwen3-TTS-12Hz-0.6B-Base-8bit

Minimal reproducer shape

import time
from mlx_audio.tts.utils import load_model
from mlx_audio.tts.models.qwen3_tts.continuous_batching import (
    TTSBatchOptions,
    TTSBatchItem,
)

model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-8bit")
options = TTSBatchOptions(
    temperature=0.9,
    top_p=1.0,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=4096,
    lang_code="auto",
    stream=True,
    streaming_interval=0.32,
    max_batch_size=8,
    verbose=False,
)
session = model.create_tts_batch_session(options)

item = TTSBatchItem(
    sequence_id=0,
    text="First halo utterance to warm the session.",
    ref_audio="path/to/ref.wav",
    ref_text="reference transcript text",
)

t_submit = time.monotonic()
session.add([item])

for step_n in range(600):
    events = session.step()
    for ev in events:
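        if ev.audio is not None:
            # Elapsed time is measured from submission (t_submit), so this
            # line prints once per audio event, not only for the first one.
            print(
                "audio event",
                round((time.monotonic() - t_submit) * 1000),
                "ms",
                "samples",
                ev.samples,
            )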
        if ev.is_final_chunk:
            print("final", step_n)
            raise SystemExit(0)

print("hit safety cap without is_final_chunk")

Observed behavior

Three short texts in one session produced:

| Text | First audio | Samples in first event | Completion |
|---|---|---|---|
| First halo utterance to warm the session. | 2536 ms | 71040 | hit 600-step safety cap; no final chunk |
| Second utterance with more content here. | 2339 ms | 72960 | hit 600-step safety cap; no final chunk |
| Third probe text. | 1190 ms | 38400 | hit 600-step safety cap; no final chunk |

After the first audio event, repeated session.step() calls returned empty
results. I did not observe incremental audio chunks matching
streaming_interval=0.32, and I did not observe is_final_chunk=True.
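
To put the sample counts above in perspective, here is a rough back-of-the-envelope sketch. It assumes a 24 kHz output sample rate, which this report did not measure, so treat the durations as estimates. Under that assumption each single event holds roughly the whole utterance, whereas a 0.32 s interval would suggest several smaller chunks. `expected_chunks` is a hypothetical helper, not part of mlx-audio.

```python
# Assumption: 24 kHz output sample rate (not stated or measured in this report).
SAMPLE_RATE = 24_000
STREAMING_INTERVAL_S = 0.32

# Sample counts of the single audio event observed for each text.
observed_samples = {
    "First halo utterance to warm the session.": 71040,
    "Second utterance with more content here.": 72960,
    "Third probe text.": 38400,
}


def expected_chunks(samples, sample_rate=SAMPLE_RATE, interval_s=STREAMING_INTERVAL_S):
    """Return (duration_s, chunk_count) if audio were emitted once per interval."""
    samples_per_chunk = round(interval_s * sample_rate)
    chunk_count = -(-samples // samples_per_chunk)  # ceiling division
    return samples / sample_rate, chunk_count


for text, samples in observed_samples.items():
    duration_s, chunks = expected_chunks(samples)
    print(f"{text!r}: {duration_s:.2f} s of audio, ~{chunks} chunks expected, 1 observed")
```

If the sample rate assumption holds, the first event for each text carries between about 1.6 s and 3 s of audio, which is consistent with the audio arriving only once generation is nearly done.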

Expected behavior

With stream=True and streaming_interval=0.32, I expected one of these:

  • multiple smaller audio events roughly as chunks become available, followed by
    a final event, or
  • one audio event plus a reliable completion/final signal soon after.
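
For concreteness, this is the kind of consumer loop I would like to be able to write against either of those shapes. It is only a sketch: `drain`, `FakeEvent`, and `FakeSession` are hypothetical names used for illustration, and the stub stands in for the real session so the loop can be exercised without loading a model. The only session/event members it relies on are the ones used in the reproducer (`step()`, `ev.audio`, `ev.is_final_chunk`).

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


def drain(session, max_idle_steps=50, timeout_s=30.0):
    """Step `session` until a final chunk arrives, or give up.

    Returns (chunks, outcome), where outcome is "final", "stalled"
    (too many consecutive steps without audio) or "timeout".
    """
    chunks = []
    idle_steps = 0
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        progressed = False
        for ev in session.step() or []:
            if ev.audio is not None:
                chunks.append(ev.audio)
                progressed = True
            if ev.is_final_chunk:
                return chunks, "final"
        # Count consecutive steps that produced no audio at all.
        idle_steps = 0 if progressed else idle_steps + 1
        if idle_steps >= max_idle_steps:
            return chunks, "stalled"
    return chunks, "timeout"


# Hypothetical stand-ins for the real event and session types.
@dataclass
class FakeEvent:
    audio: Optional[Any] = None
    is_final_chunk: bool = False


class FakeSession:
    def __init__(self, script):
        self._script = list(script)  # one list of events per step() call

    def step(self):
        return self._script.pop(0) if self._script else []


# Expected shape: several chunks, then a final event.
streaming = FakeSession([
    [FakeEvent(audio=[0.0])],
    [],
    [FakeEvent(audio=[0.1])],
    [FakeEvent(audio=[0.2], is_final_chunk=True)],
])
print(drain(streaming)[1])

# Shape observed in this report: one event, then empty steps forever.
observed = FakeSession([[FakeEvent(audio=[0.0] * 4)]])
print(drain(observed, max_idle_steps=5)[1])
```

With the real session, the second case is what I hit today, so the caller has to guess a completion point from idle steps instead of relying on `is_final_chunk`.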

Why this matters

For voice-agent TTS, the batch-session API is attractive because it should avoid
recreating per-request model/session state. As observed, however, the
batch-session path has worse first-audio latency than our existing direct
model.generate(..., stream=True) path and gives no reliable completion signal.

Is this expected for Qwen3-TTS in 0.4.3, or is there another required option /
calling pattern to make Qwen3TTSBatchSession produce incremental streaming
chunks and final events?
