Summary
Any POST /v1/audio/speech request that fails inside the worker (e.g., bad model id, HuggingFace 404, missing file) responds with HTTP/1.1 200 OK, Content-Type: audio/wav, and a zero-byte body. The real exception (most often huggingface_hub.errors.RepositoryNotFoundError) only shows up in the server logs. Clients see a successful empty download and have no way to tell what went wrong.
Repro
mlx_audio.server --host 127.0.0.1 --port 1237 &
sleep 2
curl -i -X POST http://127.0.0.1:1237/v1/audio/speech \
-H 'Content-Type: application/json' \
-d '{"model":"does-not-exist","input":"hi","response_format":"wav"}'
Observed:
HTTP/1.1 200 OK
content-type: audio/wav
content-disposition: attachment; filename=speech.wav
transfer-encoding: chunked
Server log:
huggingface_hub.errors.RepositoryNotFoundError: 404 Client Error.
Repository Not Found for url: https://huggingface.co/api/models/does-not-exist/revision/main.
Root cause
server.py::tts_speech constructs a StreamingResponse before the broker worker has touched the model. Once the handler returns that response object, the status line and headers are sent before the body generator is iterated, so the 200 is committed before any inference has run. Model load happens inside the worker thread behind _stream_inference_results. When the worker raises (raise chunk.error around _stream_inference_results), the generator aborts before yielding its first chunk and the client gets a clean close with zero body bytes.
The same hazard applies to every route that goes through _stream_inference_results / _await_inference_result (/v1/audio/speech, /v1/audio/transcriptions, and the separation route).
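The ordering can be reproduced without the server. The sketch below is a stdlib-only toy model of a streaming response, not Starlette's actual code: stream_response, failing_body, and run_demo are hypothetical names, and the message dicts only imitate the ASGI message shape. It shows that the 200 is already on the wire when the body generator raises.

```python
import asyncio


async def failing_body():
    # Stands in for _stream_inference_results: the model load fails
    # before the first audio chunk is produced.
    raise RuntimeError("Repository Not Found")
    yield b""  # unreachable; its presence makes this an async generator


async def stream_response(body, send):
    # Toy model of a streaming response: status and headers go out first...
    await send({"type": "http.response.start", "status": 200})
    # ...and only then is the body generator iterated.
    async for chunk in body:
        await send({"type": "http.response.body", "body": chunk, "more_body": True})
    await send({"type": "http.response.body", "body": b"", "more_body": False})


def run_demo():
    sent = []

    async def send(message):
        sent.append(message)

    error = None
    try:
        asyncio.run(stream_response(failing_body(), send))
    except RuntimeError as exc:
        error = exc
    # sent holds everything the client would have received before the failure.
    return sent, error
```

Calling run_demo() returns the single "start" message with status 200 plus the RuntimeError, which matches the observed behavior: headers delivered, no body, error visible only on the server side.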
Proposed fix
Pre-flight the model load in the handler, before constructing the response. If it fails, surface the error as a proper HTTPException. Sketch:
import asyncio
from fastapi import HTTPException, Request
from fastapi.responses import StreamingResponse
from huggingface_hub.errors import RepositoryNotFoundError

@app.post("/v1/audio/speech")
async def tts_speech(payload: SpeechRequest, request: Request):
    ...
    # Load (or look up) the model before any response is committed.
    try:
        await asyncio.to_thread(_load_model_for_inference, payload.model)
    except RepositoryNotFoundError as e:
        raise HTTPException(status_code=404, detail=f"Model not found: {payload.model}") from e
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Model load failed: {e}") from e
    handle = get_inference_broker().submit(...)
    return StreamingResponse(...)
Pre-flight is a no-op for warm models (already in model_provider.models) and only slow on cold load, which is exactly when the client most needs synchronous feedback.
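To make the warm/cold distinction concrete, here is a minimal sketch of the pre-flight decision in isolation. Everything in it is an assumption for illustration: preflight is a hypothetical helper, the plain dict stands in for model_provider.models (whose real shape is not shown here), and ModelNotFoundError stands in for RepositoryNotFoundError so the block has no third-party dependencies.

```python
class ModelNotFoundError(Exception):
    """Hypothetical stand-in for huggingface_hub's RepositoryNotFoundError."""


def preflight(model_id, cache, loader):
    """Return (status, detail); call the loader only on a cache miss."""
    if model_id in cache:
        # Warm model: nothing to do, so the pre-flight costs a dict lookup.
        return 200, "warm"
    try:
        cache[model_id] = loader(model_id)
    except ModelNotFoundError:
        return 404, f"Model not found: {model_id}"
    except Exception as exc:
        return 500, f"Model load failed: {exc}"
    return 200, "loaded"
```

A failed load leaves the cache untouched, so a later request with a corrected model id is not affected by the earlier failure.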
Acceptance criteria
POST /v1/audio/speech {"model":"does-not-exist"} returns 4xx with a JSON body describing the failure (not 200 with empty body).
POST /v1/audio/speech with a valid model + valid payload still returns 200 + streaming audio bytes.
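One way to encode both criteria in a test is a small classifier over (status, content type, body). This is a sketch only: classify_speech_response is a hypothetical helper, and it treats any 4xx or 5xx with a parseable JSON body as an acceptable error, which is slightly looser than the 4xx asked for above.

```python
import json


def classify_speech_response(status, content_type, body):
    """Return 'ok', 'error', or 'silent-failure' for a /v1/audio/speech reply."""
    if 200 <= status < 300:
        # Success must carry audio bytes; an empty 200 is the bug reported here.
        if content_type.startswith("audio/") and len(body) > 0:
            return "ok"
        return "silent-failure"
    if 400 <= status < 600:
        # Errors must carry a JSON body describing the failure.
        try:
            json.loads(body)
        except ValueError:
            return "silent-failure"
        return "error"
    return "silent-failure"
```

Fed the response from the repro (200, audio/wav, zero bytes), it reports a silent failure; the same check can also serve as a client-side guard until the server is fixed.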
Happy to send a PR if there's appetite for this shape of fix.