Qwen3TTSBatchSession.step does not stream incrementally or emit final chunk on mlx-audio 0.4.3
Summary
Qwen3TTSBatchSession appears to be the canonical API for a fixed TTS model
serving many texts, but in mlx-audio==0.4.3 I cannot get it to behave like a
streaming API for Qwen3-TTS. With stream=True and streaming_interval=0.32,
session.step() emits one large audio event only after the utterance is already
substantially complete, then returns empty events until my safety cap is hit.
ev.is_final_chunk never becomes true.
This makes it hard to use the batch-session path for low-latency voice-agent TTS,
even though it looks like the right API shape for a persistent process with one
loaded voice/model and many short utterances.
Environment
- macOS: macOS-15.6-arm64-arm-64bit-Mach-O
- Machine: Apple Silicon / arm64
- Python: 3.14.4
- mlx-audio: 0.4.3
- mlx: 0.31.2
- mlx-metal: 0.31.2
- mlx-lm: 0.31.3
- Model tested: mlx-community/Qwen3-TTS-12Hz-0.6B-Base-8bit
Minimal reproducer shape
import time

from mlx_audio.tts.utils import load_model
from mlx_audio.tts.models.qwen3_tts.continuous_batching import (
    TTSBatchOptions,
    TTSBatchItem,
)

model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-8bit")

options = TTSBatchOptions(
    temperature=0.9,
    top_p=1.0,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=4096,
    lang_code="auto",
    stream=True,
    streaming_interval=0.32,
    max_batch_size=8,
    verbose=False,
)
session = model.create_tts_batch_session(options)

item = TTSBatchItem(
    sequence_id=0,
    text="First halo utterance to warm the session.",
    ref_audio="path/to/ref.wav",
    ref_text="reference transcript text",
)

t_submit = time.monotonic()
session.add([item])

# Poll the session; 600 steps is an arbitrary safety cap.
for step_n in range(600):
    events = session.step()
    for ev in events:
        if ev.audio is not None:
            # Report latency since submission and the size of this audio event.
            print(
                "first audio",
                round((time.monotonic() - t_submit) * 1000),
                "ms",
                "samples",
                ev.samples,
            )
        if ev.is_final_chunk:
            print("final", step_n)
            raise SystemExit(0)

# Only reached if no event ever set is_final_chunk.
print("hit safety cap without is_final_chunk")
Observed behavior
Three short texts in one session produced:
| Text | First audio | Samples in first event | Completion |
| --- | --- | --- | --- |
| First halo utterance to warm the session. | 2536 ms | 71040 | hit 600-step safety cap; no final chunk |
| Second utterance with more content here. | 2339 ms | 72960 | hit 600-step safety cap; no final chunk |
| Third probe text. | 1190 ms | 38400 | hit 600-step safety cap; no final chunk |
After the first audio event, repeated session.step() calls returned empty
results. I did not observe incremental audio chunks matching
streaming_interval=0.32, and I did not observe is_final_chunk=True.
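As a rough cross-check of the numbers reported above, the sample counts can be converted to audio durations and compared with the configured interval. This is only a sketch: it assumes a 24 kHz output sample rate, which is not stated anywhere in this report, and the constant and helper names are illustrative.

```python
# Assumption: 24 kHz output sample rate (not confirmed in this report).
SAMPLE_RATE_HZ = 24_000
STREAMING_INTERVAL_S = 0.32  # value passed as streaming_interval

# (first-audio latency in ms, samples in first event), copied from the report.
observations = {
    "first": (2536, 71040),
    "second": (2339, 72960),
    "third": (1190, 38400),
}


def summarize(samples):
    """Return (audio duration in seconds, number of streaming intervals)."""
    duration_s = samples / SAMPLE_RATE_HZ
    intervals = duration_s / STREAMING_INTERVAL_S
    return duration_s, intervals


for name, (latency_ms, samples) in observations.items():
    duration_s, intervals = summarize(samples)
    print(
        f"{name}: {duration_s:.2f} s of audio, "
        f"about {intervals:.2f} streaming intervals, "
        f"first audio at {latency_ms} ms"
    )
```

If the 24 kHz assumption holds, each first event carries roughly 1.6 to 3.0 s of audio, which is 5 to 9.5 streaming intervals. That would be consistent with the whole utterance arriving in a single event rather than in 0.32 s chunks.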
Expected behavior
With stream=True and streaming_interval=0.32, I expected one of these:
- multiple smaller audio events roughly as chunks become available, followed by
a final event, or
- one audio event plus a reliable completion/final signal soon after.
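Until there is a reliable final signal, a caller could fall back to an idle cutoff. The following is a minimal sketch of that workaround, not a fix and not part of mlx-audio: `drain_session`, `max_idle_steps`, `FakeEvent`, and `FakeSession` are hypothetical names, and the fake session only mimics the behavior observed above so the example runs without a model. The only things it takes from the real API are `session.step()`, `ev.audio`, and `ev.is_final_chunk`.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


def drain_session(session, max_steps=600, max_idle_steps=20):
    """Poll session.step() until a final chunk, an idle run, or the step cap.

    Returns (audio_chunks, reason), where reason is "final", "idle", or "cap".
    """
    chunks = []
    idle = 0
    for _ in range(max_steps):
        got_audio = False
        for ev in session.step():
            if ev.audio is not None:
                chunks.append(ev.audio)
                got_audio = True
            if ev.is_final_chunk:
                return chunks, "final"
        if got_audio:
            idle = 0
        elif chunks:
            # Count empty steps only after some audio has already arrived.
            idle += 1
            if idle >= max_idle_steps:
                return chunks, "idle"
    return chunks, "cap"


# Hypothetical stand-ins so the sketch runs without mlx-audio installed.
@dataclass
class FakeEvent:
    audio: Optional[Any] = None
    is_final_chunk: bool = False


class FakeSession:
    """Mimics the observed behavior: one audio event, then empty steps."""

    def __init__(self, silent_steps_before_audio=3):
        self._steps = 0
        self._audio_step = silent_steps_before_audio + 1

    def step(self) -> List[FakeEvent]:
        self._steps += 1
        if self._steps == self._audio_step:
            return [FakeEvent(audio=[0.0] * 4)]
        return []


chunks, reason = drain_session(FakeSession())
print(len(chunks), reason)
```

The idle cutoff is a heuristic: if generation legitimately pauses for longer than `max_idle_steps` polls, audio would be truncated, and it is unclear from this report whether the session ever releases the finished sequence. A real final event would still be the preferable signal.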
Why this matters
For voice-agent TTS, the batch-session API is attractive because it should avoid
recreating per-request model/session state. As observed, however, the
batch-session path has worse first-audio latency than our existing direct
model.generate(..., stream=True) path and has no reliable completion signal.
Is this expected for Qwen3-TTS in 0.4.3, or is there another required option /
calling pattern to make Qwen3TTSBatchSession produce incremental streaming
chunks and final events?