feat: add OpenAI-compatible Audio Speech API endpoint #41

Open
pinghe wants to merge 1 commit into OpenMOSS:main from pinghe:main

@pinghe commented Apr 24, 2026

Summary

  • Add OpenAI-compatible POST /v1/audio/speech endpoint for TTS
  • Fix torchaudio SoX backend segfault
  • Fix voice preset mappings referencing non-existent audio files
  • Enable GPU inference (was hardcoded to CPU)
  • Add timestamps to uvicorn access logs

Changes

New file: openai_audio_api.py

  • OpenAI /v1/audio/speech request/response models (SpeechRequest, make_error_response)
  • Voice mapping: OpenAI voice names to MOSS-TTS-Nano presets (alloy to Junhao, echo to Xiaoyu, etc.)
  • Audio format helpers: WAV header construction, PCM 16-bit encoding, MP3 encoding via lameenc
  • Streaming generators: iter_pcm_audio, generate_wav_stream, generate_mp3_stream, generate_pcm_stream
  • Supports wav, mp3, and pcm response formats
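
The audio format helpers listed above can be illustrated with a minimal sketch. It is not the code from openai_audio_api.py: the function names `float_to_pcm16` and `wav_header` are hypothetical, and the `0xFFFFFFFF` size placeholder for streams of unknown length is a common convention assumed here, not something the PR states. The 48 kHz stereo defaults match the test plan.

```python
import struct

UNKNOWN_SIZE = 0xFFFFFFFF  # assumed placeholder when the stream length is not known up front


def float_to_pcm16(samples):
    """Clamp floats to [-1.0, 1.0] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def wav_header(sample_rate=48000, channels=2, bits_per_sample=16, data_size=UNKNOWN_SIZE):
    """Build a 44-byte canonical RIFF/WAVE header for uncompressed PCM."""
    block_align = channels * bits_per_sample // 8
    byte_rate = sample_rate * block_align
    riff_size = UNKNOWN_SIZE if data_size == UNKNOWN_SIZE else 36 + data_size
    fmt_chunk = struct.pack(
        "<IHHIIHH",
        16,               # fmt chunk size for PCM
        1,                # audio format 1 = PCM
        channels,
        sample_rate,
        byte_rate,
        block_align,
        bits_per_sample,
    )
    return (
        b"RIFF" + struct.pack("<I", riff_size) + b"WAVE"
        + b"fmt " + fmt_chunk
        + b"data" + struct.pack("<I", data_size)
    )
```

A streaming WAV response would send `wav_header()` once and then the output of `float_to_pcm16` for each chunk; a raw PCM response would skip the header.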

app.py

  • Add OpenAI-compatible endpoint using a background-thread + queue streaming model
    • Avoids holding _cpu_execution_lock inside the ASGI streaming iterator (prevents deadlock on client disconnect)
    • _put() has a 30s deadline so worker threads cannot block indefinitely after a client disconnects
    • Explicitly calls events_gen.close() to release the lock promptly
    • Wraps lameenc return values in bytes() to fix a Starlette bytearray type error
  • Add _patch_torchaudio_backend(): monkey-patches torchaudio to default to the soundfile backend, working around the SoX segfault
  • Change --device default from cpu to auto, add a cuda option, and use resolve_device() for automatic GPU detection
  • Customize uvicorn log config to add timestamps to access logs
  • Add request/complete logging with elapsed time and audio chunk count
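
The background thread + queue model described above might look roughly like the following sketch. It is an illustration under assumptions, not the code in app.py: `put_with_deadline` and `stream_in_background` are hypothetical names, the queue size and polling interval are arbitrary, and the lock is taken in the worker here whereas the PR may acquire it elsewhere. The ideas taken from the description are the bounded put with a deadline, the explicit generator `close()`, and converting chunks to `bytes`.

```python
import queue
import threading
import time

_SENTINEL = object()  # marks end of stream


def put_with_deadline(q, item, deadline_s=30.0, poll_s=0.1, cancelled=None):
    """Try to enqueue `item` until the deadline; return False if we gave up."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        if cancelled is not None and cancelled.is_set():
            return False  # consumer went away, stop producing
        try:
            q.put(item, timeout=poll_s)
            return True
        except queue.Full:
            continue
    return False


def stream_in_background(make_events, lock, maxsize=8, deadline_s=30.0):
    """Run a chunk generator in a worker thread and yield its chunks from a queue."""
    q = queue.Queue(maxsize=maxsize)
    cancelled = threading.Event()

    def worker():
        events_gen = make_events()
        try:
            with lock:  # the lock lives in the worker, never in the response iterator
                for chunk in events_gen:
                    # bytes() also normalises bytearray chunks
                    if not put_with_deadline(q, bytes(chunk), deadline_s, cancelled=cancelled):
                        break
        finally:
            events_gen.close()  # release generator resources promptly
            if not cancelled.is_set():
                put_with_deadline(q, _SENTINEL, deadline_s)

    threading.Thread(target=worker, daemon=True).start()

    def consumer():
        try:
            while True:
                item = q.get()
                if item is _SENTINEL:
                    return
                yield item
        finally:
            cancelled.set()  # runs on normal end and when the client disconnects

    return consumer()
```

Because the sentinel is queued only after the `with lock` block exits, a consumer that has seen the end of the stream can rely on the lock already being free, which is the property the "consecutive requests don't deadlock" check in the test plan depends on.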

app_onnx.py, infer.py

  • Add _patch_torchaudio_backend() to fix SoX segfault
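
One way such a monkey-patch could work is to wrap the module's I/O functions so they default to a chosen backend. The sketch below is a guess at the general technique, not the body of `_patch_torchaudio_backend()`: the helper name `default_backend` is hypothetical, it is demonstrated on a stand-in object instead of torchaudio, and whether torchaudio's functions accept a `backend` keyword depends on the installed version.

```python
import functools
import types


def default_backend(module, names=("load", "info", "save"), backend="soundfile"):
    """Wrap selected functions on `module` so calls default to `backend`."""
    for name in names:
        original = getattr(module, name, None)
        if original is None or getattr(original, "_backend_patched", False):
            continue  # missing function, or already patched

        def make_wrapper(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                kwargs.setdefault("backend", backend)  # caller's explicit choice wins
                return fn(*args, **kwargs)

            wrapper._backend_patched = True
            return wrapper

        setattr(module, name, make_wrapper(original))


# Stand-in for the real module, used only to show the effect of the patch.
fake_audio = types.SimpleNamespace(load=lambda path, backend=None: (path, backend))
default_backend(fake_audio)
```

The `_backend_patched` flag makes the patch idempotent, which matters if app.py, app_onnx.py, and infer.py each apply it in the same process.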

moss_tts_nano_runtime.py

  • Remove 8 voice presets whose audio files don't exist in the repository (Zhiming, Weiguo, Trump, Nathan, Sakura, Aoi, Hina, Mei), keeping only the 8 with actual files
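
A check like the following could catch preset entries that point at missing files before they reach a release. It is a sketch only: `existing_presets` is a hypothetical helper, and the file names in the demo are placeholders, not the repository's actual preset paths.

```python
import tempfile
from pathlib import Path


def existing_presets(presets, base_dir):
    """Keep only presets whose audio file exists under `base_dir`."""
    base = Path(base_dir)
    return {name: rel for name, rel in presets.items() if (base / rel).is_file()}


# Demo with a temporary directory standing in for the repository's asset folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "junhao.wav").write_bytes(b"")
    demo = {"Junhao": "junhao.wav", "Zhiming": "zhiming.wav"}  # placeholder paths
    kept = existing_presets(demo, tmp)
    dropped = sorted(set(demo) - set(kept))
```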

pyproject.toml, requirements.txt

  • Add lameenc>=1.7.0 dependency (MP3 encoding)
  • Add openai_audio_api module declaration

Test plan

  • WAV format returns valid audio (RIFF PCM 16-bit stereo 48kHz)
  • MP3 format returns valid audio (MPEG layer III, 128kbps, 48kHz), plays correctly in mpv
  • PCM format returns valid raw audio data
  • Consecutive requests don't deadlock (lock released correctly)
  • Service recovers after client disconnect (_put deadline mechanism)
  • Invalid voice/params return OpenAI-format error JSON (HTTP 400)
  • GPU auto-detection works when CUDA is available
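
The error-handling item in the test plan can be sketched as a small validation step. This is an assumption-laden illustration, not the PR's `make_error_response`: the helper names are hypothetical, `VOICE_MAP` contains only the two pairs named in this description, and the error body follows the `{"error": {"message", "type", "param", "code"}}` shape used by OpenAI's API.

```python
VOICE_MAP = {"alloy": "Junhao", "echo": "Xiaoyu"}  # only the pairs named in this PR description
SUPPORTED_FORMATS = {"wav", "mp3", "pcm"}


def error_body(message, param=None, code=None, type_="invalid_request_error"):
    """Build an OpenAI-style error payload."""
    return {"error": {"message": message, "type": type_, "param": param, "code": code}}


def validate(voice, response_format):
    """Return (http_status, payload) for a speech request's voice and format."""
    if voice not in VOICE_MAP:
        return 400, error_body(f"Unsupported voice: {voice!r}", param="voice")
    if response_format not in SUPPORTED_FORMATS:
        return 400, error_body(
            f"Unsupported response_format: {response_format!r}", param="response_format"
        )
    return 200, {"preset": VOICE_MAP[voice], "format": response_format}
```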
