POST /v1/audio/speech returns 200 OK with empty body when model load fails #724

@guygrigsby

Summary

Any POST /v1/audio/speech request that fails inside the worker (e.g., bad model id, HuggingFace 404, missing file) responds with HTTP/1.1 200 OK, Content-Type: audio/wav, and a zero-byte body. The real exception (most often huggingface_hub.errors.RepositoryNotFoundError) only shows up in the server logs. Clients see a successful empty download and have no way to tell what went wrong.

Repro

mlx_audio.server --host 127.0.0.1 --port 1237 &
sleep 2
curl -i -X POST http://127.0.0.1:1237/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"model":"does-not-exist","input":"hi","response_format":"wav"}'

Observed:

HTTP/1.1 200 OK
content-type: audio/wav
content-disposition: attachment; filename=speech.wav
transfer-encoding: chunked

Server log:

huggingface_hub.errors.RepositoryNotFoundError: 404 Client Error.
Repository Not Found for url: https://huggingface.co/api/models/does-not-exist/revision/main.

Root cause

server.py::tts_speech constructs a StreamingResponse before the broker worker has touched the model. Once the handler returns that object, the status line and headers go out before the body generator produces its first chunk, so the 200 is committed before any model code has run. The model load happens inside the worker thread, behind _stream_inference_results. When the worker fails and the error is re-raised (the raise chunk.error around _stream_inference_results), the generator aborts before yielding anything, and the client gets a clean close with zero body bytes.

The same hazard exists on every route that goes through _stream_inference_results / _await_inference_result (/v1/audio/speech, /v1/audio/transcriptions, and the separation route).
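To make the ordering concrete, here is a stdlib-only simulation of the failure mode. It is not the server code: stream_inference_results and send_streaming_response are hypothetical stand-ins, and LookupError stands in for RepositoryNotFoundError. It only shows that a lazy generator does not fail until it is iterated, which is after the headers are already on the wire.

```python
from typing import Iterator, List, Optional

HEADERS = b"HTTP/1.1 200 OK\r\ncontent-type: audio/wav\r\n\r\n"


def stream_inference_results(model_id: str) -> Iterator[bytes]:
    """Hypothetical stand-in for the worker-backed generator.

    Nothing in this body runs when the generator object is created;
    the simulated model load only happens on the first iteration.
    """
    if model_id == "does-not-exist":
        raise LookupError(f"Repository not found: {model_id}")
    yield b"RIFF"  # placeholder for real audio bytes


def send_streaming_response(body: Iterator[bytes], wire: List[bytes]) -> Optional[Exception]:
    """Simulated streaming write: status and headers first, then body chunks.

    Returns the exception that would end up in the server log, or None.
    """
    wire.append(HEADERS)  # the 200 is committed here
    try:
        for chunk in body:
            wire.append(chunk)
    except Exception as exc:
        # Too late to change the status code; the client just sees the stream end.
        return exc
    return None


bad_wire: List[bytes] = []
logged = send_streaming_response(stream_inference_results("does-not-exist"), bad_wire)

good_wire: List[bytes] = []
send_streaming_response(stream_inference_results("some-valid-model"), good_wire)
```

In the failing case bad_wire holds only the header block and the error exists only in logged, which matches the observed response and server log above.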

Proposed fix

Pre-flight the model load synchronously before constructing the response. If it fails, surface the error as a proper HTTPException. Sketch:

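import asyncio

from fastapi import HTTPException, Request
from fastapi.responses import StreamingResponse
from huggingface_hub.errors import RepositoryNotFoundError

@app.post("/v1/audio/speech")
async def tts_speech(payload: SpeechRequest, request: Request):
    ...
    # Load (or confirm) the model off the event loop, before any response exists.
    try:
        await asyncio.to_thread(_load_model_for_inference, payload.model)
    except RepositoryNotFoundError as e:
        raise HTTPException(status_code=404, detail=f"Model not found: {payload.model}") from e
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Model load failed: {e}") from e

    # Only now commit to a streaming 200.
    handle = get_inference_broker().submit(...)
    return StreamingResponse(...)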

Pre-flight is a no-op for warm models (already in model_provider.models) and only slow on cold load, which is exactly when the client most needs synchronous feedback.
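A minimal sketch of the warm/cold behaviour described above, under stated assumptions: ModelProvider, ensure_loaded, and the injected loader are hypothetical names for illustration, not the actual model_provider API. The point is that a warm hit returns without calling the loader, and a failed load is raised to the caller and not cached.

```python
import threading
from typing import Any, Callable, Dict


class ModelProvider:
    """Hypothetical stand-in for the server's model provider."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self.models: Dict[str, Any] = {}

    def ensure_loaded(self, model_id: str) -> Any:
        # Holding the lock during the load serializes cold loads.
        # That keeps the sketch simple; a real server may want finer locking.
        with self._lock:
            if model_id in self.models:
                return self.models[model_id]  # warm path: loader not called
            model = self._loader(model_id)  # cold path: may raise
            self.models[model_id] = model
            return model


load_calls = []


def fake_loader(model_id: str) -> str:
    """Illustrative loader: records each call and fails for one known-bad id."""
    load_calls.append(model_id)
    if model_id == "does-not-exist":
        raise LookupError(f"Repository not found: {model_id}")
    return f"<model {model_id}>"


provider = ModelProvider(fake_loader)
first = provider.ensure_loaded("valid-model")   # cold: loader runs
second = provider.ensure_loaded("valid-model")  # warm: served from provider.models

try:
    provider.ensure_loaded("does-not-exist")
    preflight_error = None
except LookupError as exc:
    preflight_error = exc  # the handler would map this to an HTTPException
```

In the handler, a call like this would run through asyncio.to_thread so a cold load does not block the event loop.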

Acceptance criteria

  • POST /v1/audio/speech {"model":"does-not-exist"} returns 4xx with a JSON body describing the failure (not 200 with empty body).
  • POST /v1/audio/speech with a valid model + valid payload still returns 200 + streaming audio bytes.
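The two criteria can be written down as plain predicates over (status, content type, body), which a test could then apply to real responses. This is a sketch only: the function names are made up here, and the check for a "detail" key assumes the error is produced by a FastAPI HTTPException, whose default JSON body has that shape.

```python
import json


def is_acceptable_error(status: int, content_type: str, body: bytes) -> bool:
    """Criterion 1: a 4xx status with a JSON body describing the failure."""
    if not 400 <= status < 500:
        return False
    if not content_type.startswith("application/json"):
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # covers empty, non-UTF-8, and malformed bodies
        return False
    # Assumption: FastAPI-style error body, {"detail": "..."}.
    return isinstance(payload, dict) and bool(payload.get("detail"))


def is_acceptable_success(status: int, content_type: str, body: bytes) -> bool:
    """Criterion 2: 200 with a non-empty audio body."""
    return status == 200 and content_type.startswith("audio/") and len(body) > 0


# The response observed in this issue satisfies neither criterion.
observed = (200, "audio/wav", b"")
observed_ok = is_acceptable_error(*observed) or is_acceptable_success(*observed)

# The response shape the proposed fix is meant to produce for a bad model id.
fixed = (404, "application/json", b'{"detail": "Model not found: does-not-exist"}')
fixed_ok = is_acceptable_error(*fixed)
```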

Happy to send a PR if there's appetite for this shape of fix.
