Qwen3TTSBatchSession.step does not stream incrementally or emit final chunk on 0.4.3 #720

# Qwen3TTSBatchSession.step does not stream incrementally or emit final chunk on mlx-audio 0.4.3

## Summary

`Qwen3TTSBatchSession` appears to be the canonical API for a fixed TTS model
serving many texts, but in `mlx-audio==0.4.3` I cannot get it to behave like a
streaming API for Qwen3-TTS. With `stream=True` and `streaming_interval=0.32`,
`session.step()` emits one large audio event only after the utterance is already
substantially complete, then returns empty events until my safety cap is hit.
`ev.is_final_chunk` never becomes true.

This makes it hard to use the batch-session path for low-latency voice-agent TTS,
even though it looks like the right API shape for a persistent process with one
loaded voice/model and many short utterances.

## Environment

- macOS: `macOS-15.6-arm64-arm-64bit-Mach-O`
- Machine: Apple Silicon / `arm64`
- Python: `3.14.4`
- `mlx-audio`: `0.4.3`
- `mlx`: `0.31.2`
- `mlx-metal`: `0.31.2`
- `mlx-lm`: `0.31.3`
- Model tested: `mlx-community/Qwen3-TTS-12Hz-0.6B-Base-8bit`

## Minimal reproducer shape

```python
import time
from mlx_audio.tts.utils import load_model
from mlx_audio.tts.models.qwen3_tts.continuous_batching import (
    TTSBatchOptions,
    TTSBatchItem,
)

model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-8bit")
options = TTSBatchOptions(
    temperature=0.9,
    top_p=1.0,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=4096,
    lang_code="auto",
    stream=True,
    streaming_interval=0.32,
    max_batch_size=8,
    verbose=False,
)
session = model.create_tts_batch_session(options)

item = TTSBatchItem(
    sequence_id=0,
    text="First halo utterance to warm the session.",
    ref_audio="path/to/ref.wav",
    ref_text="reference transcript text",
)

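# Start the clock at submission so latency is measured from session.add().
t_submit = time.monotonic()
session.add([item])

# Safety cap: give up after 600 steps if no final chunk ever arrives.
for step_n in range(600):
    events = session.step()
    for ev in events:
        if ev.audio is not None:
            elapsed_ms = round((time.monotonic() - t_submit) * 1000)
            # Printed for every audio event; the first line printed is the
            # first-audio latency.
            print("audio at", elapsed_ms, "ms", "samples", ev.samples)
        if ev.is_final_chunk:
            print("final", step_n)
            raise SystemExit(0)

print("hit safety cap without is_final_chunk")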
```

## Observed behavior

Three short texts in one session produced:

| Text | First audio | Samples in first event | Completion |
|---|---:|---:|---|
| `First halo utterance to warm the session.` | 2536 ms | 71040 | hit 600-step safety cap; no final chunk |
| `Second utterance with more content here.` | 2339 ms | 72960 | hit 600-step safety cap; no final chunk |
| `Third probe text.` | 1190 ms | 38400 | hit 600-step safety cap; no final chunk |

After the first audio event, repeated `session.step()` calls returned empty
results. I did not observe incremental audio chunks matching
`streaming_interval=0.32`, and I did not observe `is_final_chunk=True`.
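
To put the first-event sizes in perspective, here is a rough conversion from samples to seconds of audio and to multiples of `streaming_interval`. The 24 kHz sample rate is an assumption on my part (I have not confirmed it for this checkpoint), so treat the results as indicative only.

```python
# ASSUMPTION: 24 kHz output sample rate; not confirmed for this model.
SAMPLE_RATE_HZ = 24_000
STREAMING_INTERVAL_S = 0.32  # value passed to TTSBatchOptions in the reproducer

# Samples reported in the first (and only) audio event for each text.
first_event_samples = {
    "First halo utterance to warm the session.": 71040,
    "Second utterance with more content here.": 72960,
    "Third probe text.": 38400,
}


def chunk_stats(samples, sample_rate=SAMPLE_RATE_HZ, interval_s=STREAMING_INTERVAL_S):
    """Return (audio duration in seconds, duration in streaming intervals)."""
    duration_s = samples / sample_rate
    return duration_s, duration_s / interval_s


for text, samples in first_event_samples.items():
    duration_s, intervals = chunk_stats(samples)
    print(f"{samples:>6} samples = {duration_s:.2f} s = {intervals:.2f} intervals  ({text})")
```

Under that assumption each first event carries between 1.6 s and about 3 s of audio, which is 5 to 9.5 streaming intervals, so it looks like a whole utterance rather than a 0.32 s chunk.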

## Expected behavior

With `stream=True` and `streaming_interval=0.32`, I expected one of these:

- multiple smaller audio events roughly as chunks become available, followed by
  a final event, or
- one audio event plus a reliable completion/final signal soon after.
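
Either outcome could be checked with a small drain helper like the sketch below. `drain`, `DrainResult`, `FakeEvent`, and `scripted_step` are hypothetical names of mine, not part of `mlx-audio`; the helper only relies on the `audio`, `samples`, and `is_final_chunk` attributes used in the reproducer, and the fake session just replays a scripted sequence of `step()` results.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class DrainResult:
    audio_events: int
    total_samples: int
    saw_final: bool
    steps: int


def drain(step: Callable[[], Iterable], max_idle_steps: int = 50,
          max_steps: int = 600) -> DrainResult:
    """Call step() until a final chunk arrives, or give up after
    max_idle_steps consecutive empty results or max_steps total."""
    audio_events = total_samples = idle = 0
    for n in range(1, max_steps + 1):
        events = list(step())
        idle = 0 if events else idle + 1
        for ev in events:
            if getattr(ev, "audio", None) is not None:
                audio_events += 1
                total_samples += getattr(ev, "samples", 0) or 0
            if getattr(ev, "is_final_chunk", False):
                return DrainResult(audio_events, total_samples, True, n)
        if idle >= max_idle_steps:
            return DrainResult(audio_events, total_samples, False, n)
    return DrainResult(audio_events, total_samples, False, max_steps)


# Hypothetical stand-ins for testing the helper without a model.
@dataclass
class FakeEvent:
    audio: object
    samples: int
    is_final_chunk: bool = False


def scripted_step(script):
    """Return a step() callable that replays `script`, then empty lists."""
    it = iter(script)
    return lambda: next(it, [])


# Shape of what I observe: one large event, then nothing, never final.
observed = drain(scripted_step([[FakeEvent(b"pcm", 71040)]]), max_idle_steps=5)
# Shape of what I expected: small chunks, the last one flagged final.
expected = drain(scripted_step([[FakeEvent(b"pcm", 7680)],
                                [FakeEvent(b"pcm", 7680, True)]]))
print(observed)
print(expected)
```

Against the real session this would be called as `drain(session.step)`. Bailing out on consecutive empty steps rather than on a fixed step count keeps the wait short when no final event ever arrives.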

## Why this matters

For voice-agent TTS, the batch-session API is attractive because it should avoid
recreating per-request model/session state. As observed, however, the
batch-session path has worse first-audio latency than our existing direct
`model.generate(..., stream=True)` path and provides no reliable completion signal.

Is this expected for Qwen3-TTS in `0.4.3`, or is there another required option /
calling pattern to make `Qwen3TTSBatchSession` produce incremental streaming
chunks and final events?

