feat(stt): expose word_timestamps form field on /v1/audio/transcriptions by ciekawy · Pull Request #716 · Blaizzy/mlx-audio

ciekawy · 2026-05-11T19:46:31Z

Summary

- Add word_timestamps: bool = Form(False) and timestamp_granularities: Optional[str] = Form(None) form fields to POST /v1/audio/transcriptions, mirroring the OpenAI Audio API surface
- Update TranscriptionRequest Pydantic model to carry these fields
- Add _STT_EXTRA_KWARGS = {"word_timestamps", "timestamp_granularities"} allowlist in STTExecutionAdapter.run_serial() so they are not silently dropped by the signature-parameter kwarg filter and reach mlx_whisper.transcribe()
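
From the caller's side, opting in should only require one extra form field. Below is a minimal client-side sketch using only the standard library. The `file`, `model`, and `response_format` field names follow the OpenAI convention and are assumptions here, not something this PR confirms; `build_transcription_form` is a hypothetical helper, not part of mlx-audio.

```python
import uuid


def build_transcription_form(audio_bytes, filename, model,
                             word_timestamps=False, boundary=None):
    """Encode a multipart/form-data body for POST /v1/audio/transcriptions.

    Hypothetical helper: field names other than word_timestamps are assumed.
    """
    boundary = boundary or uuid.uuid4().hex
    fields = {
        "model": model,
        "response_format": "verbose_json",
        # Form values travel as text; the server-side bool field parses "true".
        "word_timestamps": "true" if word_timestamps else "false",
    }
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    # The audio payload goes in its own part with a filename.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
        + audio_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    content_type = f"multipart/form-data; boundary={boundary}"
    return b"".join(parts), content_type
```

The returned body and content type can then be sent with `urllib.request.Request(url, data=body, headers={"Content-Type": content_type})`, where `url` points at the server's `/v1/audio/transcriptions` route on whatever host and port the deployment uses.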

Motivation

Karaoke-style language-learning apps (and any per-word synchronisation use case) need word-level timestamps. Without this fix, every consumer must either (a) drop server mode and call mlx_whisper.transcribe in-process, losing broker-queue batching, or (b) post-process segment text → word boundaries via a syllable-proportional fallback, losing acoustic precision.
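
To make option (b) concrete, here is a rough sketch of what a syllable-proportional fallback might look like. It is an illustration only: the vowel-group syllable estimate and both function names are assumptions, not code from any consumer or from mlx-audio. It splits a segment's duration across its words by estimated syllable count, so the resulting boundaries reflect spelling rather than the audio.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def proportional_word_times(text, start, end):
    """Spread the segment span [start, end] over words by syllable weight."""
    words = text.split()
    weights = [count_syllables(w) for w in words]
    total = sum(weights)
    timed = []
    cursor = start
    for word, weight in zip(words, weights):
        duration = (end - start) * weight / total
        timed.append({"word": word, "start": cursor, "end": cursor + duration})
        cursor += duration
    return timed
```

Pauses, elongated vowels, and fast speech are invisible to this kind of estimate, which is why model-provided word timestamps are preferable when available.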

mlx_whisper.transcribe() already accepts word_timestamps=True and returns segments[].words[] = {start, end, probability, word}, but the server's broker → kwarg-filter layer was silently stripping the field. The root cause: STTExecutionAdapter.run_serial() filtered kwargs through inspect.signature(stt_model.generate).parameters, and word_timestamps isn't declared in every model's generate() signature even when it is forwarded to mlx_whisper.transcribe() underneath.
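
The root cause described above can be reproduced in a few lines. This is a simplified sketch, not the actual `STTExecutionAdapter.run_serial()` code: `filter_kwargs` and the stand-in `generate` are hypothetical, and only `_STT_EXTRA_KWARGS` is named in the PR. It shows that a filter based on `inspect.signature(...).parameters` drops `word_timestamps` when `generate()` only accepts it through `**kwargs`, and that an explicit allowlist lets it through.

```python
import inspect

_STT_EXTRA_KWARGS = {"word_timestamps", "timestamp_granularities"}


def filter_kwargs(func, kwargs, extra_allowed=frozenset()):
    """Keep kwargs that func declares by name, plus explicitly allowed extras."""
    declared = set(inspect.signature(func).parameters)
    return {k: v for k, v in kwargs.items()
            if k in declared or k in extra_allowed}


def generate(audio, language=None, **kwargs):
    """Stand-in for a model's generate(): forwards unknown kwargs downstream."""
    return {"language": language, "forwarded": kwargs}


request = {"language": "en", "word_timestamps": True, "unrelated": 1}

# Strict filter: "word_timestamps" is not a declared parameter name
# (only "audio", "language", and "kwargs" are), so it is dropped.
strict = filter_kwargs(generate, request)

# With the allowlist, the field survives and is forwarded by generate().
allowed = filter_kwargs(generate, request, _STT_EXTRA_KWARGS)
result = generate(b"", **allowed)
```

The allowlist keeps the filter's protection against arbitrary unknown fields (`unrelated` is still dropped) while letting the two named fields through.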

Technical note — model-agnostic

The word_timestamps field works with any mlx-whisper model (fp16, q4, and other quantizations) because the kwarg is handled by mlx_whisper.transcribe itself, not by a model-specific code path. The field is deliberately not tied to any particular quantization.

Testing

Verified against mlx-community/whisper-large-v3-turbo (fp16) and whisper-large-v3-turbo-q4 (quantized) — both return populated words[] arrays when the new field is set.

Unit and integration tests added to mlx_audio/tests/test_server.py:

- test_transcription_request_word_timestamps_defaults: model defaults
- test_transcription_request_word_timestamps_accepted: field acceptance
- test_stt_word_timestamps_passed_to_generate: kwarg reaches generate()
- test_stt_word_timestamps_verbose_json_words_passthrough: words[] in verbose_json response

Backward compatibility

Both new fields are optional with False/None defaults. All existing callers are unaffected.

Add `word_timestamps: bool = Form(False)` and `timestamp_granularities: Optional[str] = Form(None)` form fields to the `/v1/audio/transcriptions` route handler and the `TranscriptionRequest` Pydantic model that backs it.

`mlx_whisper.transcribe()` already accepts `word_timestamps` and returns `segments[].words[] = {start, end, probability, word}` when it is set, but the server's `STTExecutionAdapter` kwarg-filter step was silently dropping it because it used a strict signature-parameter allowlist. The fix adds `_STT_EXTRA_KWARGS = {"word_timestamps", "timestamp_granularities"}`, a small explicit allowlist that bypasses the signature check for these two fields, so they always reach the underlying model regardless of how the model's `generate()` signature is declared.

The `word_timestamps` field works with any mlx-whisper model (fp16, quantized q4, etc.) because the kwarg is handled by `mlx_whisper.transcribe` itself, not by a model-specific code path. No change to the response-shaping layer is needed: `verbose_json` already returns the full model payload unchanged, so `segments[].words[]` is included automatically once the kwarg reaches the model.

New form fields are optional and default to `False`/`None`, so all existing callers are unaffected (no breaking change).

Tests added to `mlx_audio/tests/test_server.py`:

- Unit test: `TranscriptionRequest` defaults + field acceptance
- Integration test: `word_timestamps=true` reaches `generate()` kwargs
- Integration test: `verbose_json` response includes `words[]` from model
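
On the consuming side, the `segments[].words[]` shape described above can be flattened into a single timeline for per-word synchronisation. The sketch below assumes a `verbose_json` payload with exactly that shape; `flatten_words` is a hypothetical helper and the sample values are made up for illustration.

```python
def flatten_words(payload):
    """Collect (word, start, end) tuples from every segment, in order.

    Segments without a "words" key (for example, when word_timestamps
    was not requested) contribute nothing.
    """
    timeline = []
    for segment in payload.get("segments", []):
        for entry in segment.get("words", []):
            timeline.append((entry["word"].strip(), entry["start"], entry["end"]))
    return timeline


# Illustrative payload only; the numbers are not real model output.
sample = {
    "text": "Hello world",
    "segments": [
        {
            "start": 0.0,
            "end": 0.9,
            "text": "Hello world",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.4, "probability": 0.98},
                {"word": " world", "start": 0.4, "end": 0.9, "probability": 0.95},
            ],
        }
    ],
}

timeline = flatten_words(sample)
```

Using `.get("words", [])` keeps the helper safe for responses produced without the new field, which matches the backward-compatible defaults.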

lucasnewman · 2026-05-13T15:33:55Z

@ciekawy Can you run the formatter with pre-commit run --all? Thanks!

ciekawy wants to merge 1 commit into Blaizzy:main from ciekawy:feature/word-timestamps
