POST /v1/audio/speech returns 200 OK with empty body when model load fails

## Summary

Any `POST /v1/audio/speech` request that fails inside the worker (e.g., bad model id, HuggingFace 404, missing file) responds with `HTTP/1.1 200 OK`, `Content-Type: audio/wav`, and a zero-byte body. The real exception (most often `huggingface_hub.errors.RepositoryNotFoundError`) only shows up in the server logs. Clients see a successful empty download and have no way to tell what went wrong.

## Repro

```sh
mlx_audio.server --host 127.0.0.1 --port 1237 &
sleep 2
curl -i -X POST http://127.0.0.1:1237/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"model":"does-not-exist","input":"hi","response_format":"wav"}'
```

Observed:

```
HTTP/1.1 200 OK
content-type: audio/wav
content-disposition: attachment; filename=speech.wav
transfer-encoding: chunked

```

Server log:

```
huggingface_hub.errors.RepositoryNotFoundError: 404 Client Error.
Repository Not Found for url: https://huggingface.co/api/models/does-not-exist/revision/main.
```

## Root cause

`server.py::tts_speech` constructs a `StreamingResponse` before the broker worker has touched the model. Once the handler returns that response object, the status line and headers (`200 OK`, `audio/wav`) go out before the first body chunk is produced. Model load happens later, inside the worker thread behind `_stream_inference_results`. When the worker raises (`raise chunk.error` around `_stream_inference_results`), the generator aborts before yielding anything, the status can no longer be changed, and the client gets a clean close with zero body bytes.

The same hazard exists on every route that goes through `_stream_inference_results` / `_await_inference_result` (`/v1/audio/speech`, `/v1/audio/transcriptions`, and the separation route).
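The ordering problem can be modeled without FastAPI. The sketch below is a simplified stand-in for a streaming response, not Starlette's actual implementation; `ModelLoadError`, `failing_stream`, `send_streaming_response`, and the event tuples are hypothetical names used only for illustration.

```python
import asyncio


class ModelLoadError(Exception):
    """Hypothetical stand-in for RepositoryNotFoundError."""


async def failing_stream():
    # The model load happens lazily, on first iteration, and fails there.
    raise ModelLoadError("does-not-exist")
    yield b""  # unreachable; its presence makes this an async generator


async def send_streaming_response(body, events):
    # Simplified model of a streaming response: headers first, body after.
    events.append(("start", 200))
    try:
        async for chunk in body:
            events.append(("body", len(chunk)))
    except ModelLoadError:
        # Too late to change the status; the only option left is to stop.
        events.append(("aborted", 0))


def run_demo():
    events = []
    asyncio.run(send_streaming_response(failing_stream(), events))
    return events
```

`run_demo()` records the `200` start event before the failure is ever observed, which mirrors what the client sees: a success status followed by no body.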

## Proposed fix

Pre-flight the model load synchronously before constructing the response. If it fails, surface the error as a proper `HTTPException`. Sketch:

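```python
import asyncio

from fastapi import HTTPException, Request
from fastapi.responses import StreamingResponse
from huggingface_hub.errors import RepositoryNotFoundError


@app.post("/v1/audio/speech")
async def tts_speech(payload: SpeechRequest, request: Request):
    ...
    # Load the model before any response headers are sent, off the event loop.
    try:
        await asyncio.to_thread(_load_model_for_inference, payload.model)
    except RepositoryNotFoundError as e:
        raise HTTPException(status_code=404, detail=f"Model not found: {payload.model}") from e
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Model load failed: {e}") from e

    # Only now is it safe to commit to a 200 streaming response.
    handle = get_inference_broker().submit(...)
    return StreamingResponse(...)
```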

Pre-flight is a no-op for warm models (already in `model_provider.models`) and only slow on cold load, which is exactly when the client most needs synchronous feedback.
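To make the warm/cold behaviour concrete, here is a minimal, dependency-free sketch. Everything in it is an assumption for illustration: the `_models` dict stands in for `model_provider.models`, the body of `_load_model_for_inference` is invented, `LookupError` stands in for `RepositoryNotFoundError`, and `preflight` returns a `(status, detail)` tuple instead of raising `HTTPException`.

```python
import asyncio

# Hypothetical stand-ins: a dict cache and a record of cold loads.
_models = {}
cold_loads = []


def _load_model_for_inference(model_id):
    if model_id in _models:
        # Warm path: return without touching disk or network.
        return _models[model_id]
    if model_id == "does-not-exist":
        # Simulate a failed download.
        raise LookupError(f"Model not found: {model_id}")
    cold_loads.append(model_id)
    _models[model_id] = object()
    return _models[model_id]


async def preflight(model_id):
    """Return (status, detail) as the handler would decide before streaming."""
    try:
        await asyncio.to_thread(_load_model_for_inference, model_id)
    except LookupError as e:
        return 404, str(e)
    except Exception as e:
        return 500, f"Model load failed: {e}"
    return 200, "ok"
```

Calling `preflight` twice for the same valid id performs one cold load; the second call takes the warm path. An unknown id yields a 404 before any streaming response exists.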

## Acceptance criteria

- `POST /v1/audio/speech {"model":"does-not-exist"}` returns 4xx with a JSON body describing the failure (not 200 with empty body).
- `POST /v1/audio/speech` with a valid model + valid payload still returns 200 + streaming audio bytes.
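These criteria could be encoded as a small checker that a test (or a defensive client) runs against any response from the route. This is a sketch only: `check_speech_response` is a hypothetical helper, and it assumes errors arrive in FastAPI's default `HTTPException` shape, a JSON object with a `detail` field.

```python
import json


def check_speech_response(status, content_type, body):
    """Return a list of acceptance-criteria violations (empty means pass)."""
    problems = []
    if 200 <= status < 300:
        # A success status must carry actual audio bytes.
        if not body:
            problems.append("2xx response with empty body")
    elif not content_type.startswith("application/json"):
        problems.append("error response is not JSON")
    else:
        try:
            detail = json.loads(body).get("detail")
        except (ValueError, AttributeError):
            detail = None
        if not detail:
            problems.append("error JSON has no 'detail' message")
    return problems
```

For the response captured in the repro (`200`, `audio/wav`, zero bytes) the checker reports the empty-body violation; a `404` with `{"detail": "Model not found: does-not-exist"}` passes.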

Happy to send a PR if there's appetite for this shape of fix.

