feat(stt): expose word_timestamps form field on /v1/audio/transcriptions #716
Open
ciekawy wants to merge 1 commit into
Add `word_timestamps: bool = Form(False)` and
`timestamp_granularities: Optional[str] = Form(None)` form fields to the
`/v1/audio/transcriptions` route handler and the `TranscriptionRequest`
Pydantic model that backs it.
`mlx_whisper.transcribe()` already accepts `word_timestamps` and returns
`segments[].words[] = {start, end, probability, word}` when it is set, but
the server's `STTExecutionAdapter` kwarg-filter step was silently dropping
it because it used a strict signature-parameter allowlist. The fix adds
`_STT_EXTRA_KWARGS = {"word_timestamps", "timestamp_granularities"}` — a
small explicit allowlist that bypasses the signature check for these two
fields — so they always reach the underlying model regardless of how the
model's `generate()` signature is declared.
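The filtering behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the server's actual code: `generate`, `fake_transcribe`, and `filter_kwargs` are hypothetical stand-ins, and only the two allowlisted names are taken from the description.

```python
import inspect

# The two names come from the PR description; the constant itself is a stand-in.
STT_EXTRA_KWARGS = {"word_timestamps", "timestamp_granularities"}


def fake_transcribe(audio, **kwargs):
    """Hypothetical stand-in for the underlying transcribe call; echoes its input."""
    return {"audio": audio, "kwargs": kwargs}


def generate(audio, language=None, **kwargs):
    """Hypothetical model entry point that forwards unknown kwargs downstream."""
    return fake_transcribe(audio, language=language, **kwargs)


def filter_kwargs(fn, kwargs, extra=frozenset()):
    """Keep kwargs declared in fn's signature, plus explicitly allowlisted names."""
    declared = set(inspect.signature(fn).parameters)
    return {k: v for k, v in kwargs.items() if k in declared or k in extra}


request = {"language": "en", "word_timestamps": True, "unrelated": 1}

# Strict filtering: "word_timestamps" is not a declared parameter, so it is dropped.
strict = filter_kwargs(generate, request)

# Allowlisted filtering: the field survives and reaches the downstream call.
relaxed = filter_kwargs(generate, request, STT_EXTRA_KWARGS)
result = generate("sample.wav", **relaxed)
```

In this sketch the bypass is safe only because `generate()` accepts `**kwargs`; a callable without a catch-all parameter would raise `TypeError` when handed an undeclared keyword.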
The `word_timestamps` field works with any mlx-whisper model (fp16,
quantized q4, etc.) because the kwarg is handled by `mlx_whisper.transcribe`
itself, not by a model-specific code path. No change to the response-shaping
layer is needed: `verbose_json` already returns the full model payload
unchanged, so `segments[].words[]` is included automatically once the kwarg
reaches the model.
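As a rough illustration of what a consumer might do with that payload, the sketch below flattens `segments[].words[]` into a list of tuples. The `sample` dictionary is made-up data that only mimics the `{start, end, probability, word}` shape named above, and `iter_words` is a hypothetical helper.

```python
def iter_words(payload):
    """Yield (word, start, end, probability) for every word in every segment."""
    for segment in payload.get("segments", []):
        # Segments produced without word timestamps carry no "words" key.
        for w in segment.get("words", []):
            yield w["word"], w["start"], w["end"], w["probability"]


# Made-up payload for illustration; real values come from the model.
sample = {
    "text": " Hello world again",
    "segments": [
        {
            "start": 0.0,
            "end": 1.0,
            "text": " Hello world",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.4, "probability": 0.98},
                {"word": " world", "start": 0.5, "end": 1.0, "probability": 0.95},
            ],
        },
        {"start": 1.0, "end": 2.0, "text": " again"},  # no word-level data
    ],
}

words = list(iter_words(sample))
```

Using `.get("words", [])` keeps the helper tolerant of responses produced without `word_timestamps`, where segments have no `words` array.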
New form fields are optional and default to `False`/`None`, so all existing
callers are unaffected (no breaking change).
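To make the opt-in nature concrete, here is a sketch of how a client could build the multipart body with and without the new field. The form field names `file`, `model`, and `response_format` are assumptions modelled on the OpenAI Audio API and may not match this server exactly; `build_multipart` is a hypothetical helper, and nothing is sent over the network.

```python
import uuid


def build_multipart(fields, filename, file_bytes, boundary=None):
    """Encode text fields plus one file part as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
        )
        parts.append(f"{value}\r\n".encode())
    # The file part; the "file" field name is an assumption.
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


audio = b"\x00\x01"  # placeholder bytes, not real audio

# Existing caller: sends nothing new, so the server-side defaults apply.
legacy_body, legacy_type = build_multipart({"model": "example-model"}, "a.wav", audio)

# Opt-in caller: adds one form field and asks for the verbose response shape.
opt_in_body, opt_in_type = build_multipart(
    {
        "model": "example-model",
        "response_format": "verbose_json",
        "word_timestamps": "true",
    },
    "a.wav",
    audio,
)
```

The resulting body and content type could be passed to any HTTP client, for example `urllib.request.Request(url, data=body, headers={"Content-Type": content_type})`, with `url` pointing at wherever the server is running.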
Tests added to `mlx_audio/tests/test_server.py`:
- Unit test: `TranscriptionRequest` defaults + field acceptance
- Integration test: `word_timestamps=true` reaches `generate()` kwargs
- Integration test: `verbose_json` response includes `words[]` from model
Collaborator
@ciekawy Can you run the formatter with
Summary
- Add `word_timestamps: bool = Form(False)` and `timestamp_granularities: Optional[str] = Form(None)` form fields to `POST /v1/audio/transcriptions`, mirroring the OpenAI Audio API surface
- Update the `TranscriptionRequest` Pydantic model to carry these fields
- Add a `_STT_EXTRA_KWARGS = {"word_timestamps", "timestamp_granularities"}` allowlist in `STTExecutionAdapter.run_serial()` so the fields are not silently dropped by the signature-parameter kwarg filter and reach `mlx_whisper.transcribe()`

Motivation
Karaoke-style language-learning apps (and any per-word synchronisation use case) need word-level timestamps. Without this fix, every consumer must either (a) drop server mode and call `mlx_whisper.transcribe` in-process, losing broker-queue batching, or (b) post-process segment text into word boundaries via a syllable-proportional fallback, losing acoustic precision.

`mlx_whisper.transcribe()` already accepts `word_timestamps=True` and returns `segments[].words[] = {start, end, probability, word}`, but the server's broker → kwarg-filter layer was silently stripping the field. The root cause: `STTExecutionAdapter.run_serial()` filtered kwargs through `inspect.signature(stt_model.generate).parameters`, and `word_timestamps` isn't declared in every model's `generate()` signature even when it is forwarded to `mlx_whisper.transcribe()` underneath.

Technical note: model-agnostic
The `word_timestamps` field works with any mlx-whisper model (fp16, q4, and other quantizations) because the kwarg is handled by `mlx_whisper.transcribe` itself, not by a model-specific code path. This PR deliberately does not tie the field to a specific quantization.

Testing
Verified against `mlx-community/whisper-large-v3-turbo` (fp16) and `whisper-large-v3-turbo-q4` (quantized); both return populated `words[]` arrays when the new field is set.

Unit and integration tests added to `mlx_audio/tests/test_server.py`:
- `test_transcription_request_word_timestamps_defaults`: model defaults
- `test_transcription_request_word_timestamps_accepted`: field acceptance
- `test_stt_word_timestamps_passed_to_generate`: kwarg reaches `generate()`
- `test_stt_word_timestamps_verbose_json_words_passthrough`: `words[]` in `verbose_json` response

Backward compatibility
Both new fields are optional with `False`/`None` defaults. All existing callers are unaffected.