feat(stt): expose word_timestamps form field on /v1/audio/transcriptions #716

Open: ciekawy wants to merge 1 commit into Blaizzy:main from ciekawy:feature/word-timestamps

ciekawy commented May 11, 2026

Summary

  • Add word_timestamps: bool = Form(False) and timestamp_granularities: Optional[str] = Form(None) form fields to POST /v1/audio/transcriptions, mirroring the OpenAI Audio API surface
  • Update TranscriptionRequest Pydantic model to carry these fields
  • Add _STT_EXTRA_KWARGS = {"word_timestamps", "timestamp_granularities"} allowlist in STTExecutionAdapter.run_serial() so they are not silently dropped by the signature-parameter kwarg filter and reach mlx_whisper.transcribe()

Motivation

Karaoke-style language-learning apps (and any per-word synchronisation use case) need word-level timestamps. Without this fix, every consumer must either (a) drop server mode and call mlx_whisper.transcribe in-process, losing broker-queue batching, or (b) post-process segment text → word boundaries via a syllable-proportional fallback, losing acoustic precision.

mlx_whisper.transcribe() already accepts word_timestamps=True and returns segments[].words[] = {start, end, probability, word}, but the server's broker → kwarg-filter layer was silently stripping the field. The root cause: STTExecutionAdapter.run_serial() filtered kwargs through inspect.signature(stt_model.generate).parameters, and word_timestamps isn't declared in every model's generate() signature even when it is forwarded to mlx_whisper.transcribe() underneath.
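The filtering behaviour described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual adapter code: `filter_generate_kwargs` and the toy `generate` function are hypothetical, and the sketch assumes the model's `generate()` accepts `**kwargs` that it forwards to the transcriber.

```python
import inspect

# Name taken from the PR description; the real adapter code may differ.
_STT_EXTRA_KWARGS = {"word_timestamps", "timestamp_granularities"}


def filter_generate_kwargs(generate, kwargs):
    """Keep kwargs that generate() declares, plus the explicit allowlist."""
    declared = set(inspect.signature(generate).parameters)
    allowed = declared | _STT_EXTRA_KWARGS
    return {k: v for k, v in kwargs.items() if k in allowed}


# Hypothetical model: word_timestamps is not a named parameter, it only
# travels through **kwargs to the underlying transcriber.
def generate(audio, language=None, **kwargs):
    return kwargs


request_kwargs = {"language": "en", "word_timestamps": True, "stream": False}
filtered = filter_generate_kwargs(generate, request_kwargs)
# "stream" is dropped (neither declared nor allowlisted);
# "word_timestamps" survives because of the allowlist.
```

Without the allowlist union, `word_timestamps` would be filtered out here because it never appears as a named parameter in the signature.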

Technical note — model-agnostic

The word_timestamps field works with any mlx-whisper model (fp16, q4, and other quantizations) because the kwarg is handled by mlx_whisper.transcribe itself, not by a model-specific code path, so the field is not tied to a specific quantization.

Testing

Verified against mlx-community/whisper-large-v3-turbo (fp16) and whisper-large-v3-turbo-q4 (quantized) — both return populated words[] arrays when the new field is set.
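For readers who want to repeat this check, here is a minimal sketch of pulling per-word timings out of a response. The payload shape follows the `segments[].words[] = {start, end, probability, word}` description above. The helper name `iter_words` and the sample values are made up for illustration.

```python
def iter_words(payload):
    """Yield (word, start, end) for every word in a verbose_json-style payload."""
    for segment in payload.get("segments", []):
        # "words" is absent when word_timestamps was not requested.
        for w in segment.get("words") or []:
            yield w["word"], w["start"], w["end"]


# Made-up sample in the shape described by the PR, not real model output.
sample = {
    "text": " Hello world",
    "segments": [
        {
            "start": 0.0,
            "end": 1.0,
            "text": " Hello world",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.4, "probability": 0.98},
                {"word": " world", "start": 0.5, "end": 1.0, "probability": 0.95},
            ],
        }
    ],
}

words = list(iter_words(sample))
```

An empty result from a real response would suggest that the kwarg did not reach the model.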

Unit and integration tests added to mlx_audio/tests/test_server.py:

  • test_transcription_request_word_timestamps_defaults: model defaults
  • test_transcription_request_word_timestamps_accepted: field acceptance
  • test_stt_word_timestamps_passed_to_generate: kwarg reaches generate()
  • test_stt_word_timestamps_verbose_json_words_passthrough: words[] in verbose_json response

Backward compatibility

Both new fields are optional with False/None defaults. All existing callers are unaffected.

Add `word_timestamps: bool = Form(False)` and
`timestamp_granularities: Optional[str] = Form(None)` form fields to the
`/v1/audio/transcriptions` route handler and the `TranscriptionRequest`
Pydantic model that backs it.

`mlx_whisper.transcribe()` already accepts `word_timestamps` and returns
`segments[].words[] = {start, end, probability, word}` when it is set, but
the server's `STTExecutionAdapter` kwarg-filter step was silently dropping
it because it used a strict signature-parameter allowlist. The fix adds
`_STT_EXTRA_KWARGS = {"word_timestamps", "timestamp_granularities"}` — a
small explicit allowlist that bypasses the signature check for these two
fields — so they always reach the underlying model regardless of how the
model's `generate()` signature is declared.

The `word_timestamps` field works with any mlx-whisper model (fp16,
quantized q4, etc.) because the kwarg is handled by `mlx_whisper.transcribe`
itself, not by a model-specific code path. No change to the response-shaping
layer is needed: `verbose_json` already returns the full model payload
unchanged, so `segments[].words[]` is included automatically once the kwarg
reaches the model.

New form fields are optional and default to `False`/`None`, so all existing
callers are unaffected (no breaking change).

Tests added to `mlx_audio/tests/test_server.py`:
- Unit test: `TranscriptionRequest` defaults + field acceptance
- Integration test: `word_timestamps=true` reaches `generate()` kwargs
- Integration test: `verbose_json` response includes `words[]` from model

lucasnewman (Collaborator) commented

@ciekawy Can you run the formatter with pre-commit run --all? Thanks!
