Prerequisites
Feature Description
Now that we have support for Gemma 4 STT (#21421), we should consider implementing OpenAI's explicit speech-to-text API. The documentation is at https://platform.openai.com/docs/guides/speech-to-text
Example of /v1/audio/transcriptions using the OpenAI Python client:
from openai import OpenAI

client = OpenAI()

# Open the audio in binary mode; the context manager closes the file afterwards.
with open("/path/to/file/audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcription.text)
The llama.cpp web interface already supports uploading audio for transcription, but it does not go through this API.
Supporting streaming transcriptions would be ideal, but non-streaming would be a great start.
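For reference, here is a rough sketch of what a non-streaming request to this endpoint looks like on the wire, built with only the Python standard library. It constructs the multipart/form-data POST but does not send it. The base URL `http://localhost:8080`, the model name `gemma-4-e2b`, and the helper `build_transcription_request` are assumptions for illustration only; llama.cpp does not serve this route today.

```python
import io
import uuid
import urllib.request


def build_transcription_request(base_url, audio_bytes, filename, model):
    """Build (but do not send) a multipart POST for /v1/audio/transcriptions.

    Hypothetical helper: the form field names follow OpenAI's API
    ("model" and "file"); everything else here is illustrative.
    """
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    # Form field: model
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="model"\r\n\r\n')
    body.write(model.encode() + b"\r\n")

    # Form field: file (the raw audio bytes)
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode()
    )
    body.write(b"Content-Type: application/octet-stream\r\n\r\n")
    body.write(audio_bytes + b"\r\n")

    # Closing boundary
    body.write(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Assumed server address and model name; placeholder bytes stand in for real audio.
req = build_transcription_request(
    "http://localhost:8080", b"fake-audio", "audio.mp3", "gemma-4-e2b"
)
print(req.get_method(), req.full_url)
```

With OpenAI's API, the default response to such a request is a JSON object with a `text` field, so a server that accepts these two form fields and returns that shape should be enough for most non-streaming clients.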
Motivation
Implementing the /v1/audio/transcriptions API will allow a wide variety of existing clients to use llama.cpp for STT.
This enhancement would particularly benefit Home Assistant (HA) users. Currently, HA users need to run a separate STT model (such as whisper.cpp or Parakeet) alongside an LLM (such as Gemma), which uses more resources and adds complexity. With this enhancement, HA users could run just Gemma 4 E2B/E4B and use it for both STT and LLM purposes.
Possible Implementation
No response