Commit d30fba1

Refactor speech stack into built-in Kokoro TTS and Whisper STT plugins

Split the legacy core speech stack into two built-in, independently toggleable plugins: `_kokoro_tts` for TTS and `_whisper_stt` for STT. This refactor keeps dependency installation and bootstrap concerns in Docker/bootstrap/preload, while moving speech-specific tooling, APIs, prompts, UI, and runtime behavior into the plugins. Core now exposes engine-agnostic `tts-service` and `stt-service` brokers, with browser-native TTS preserved as the fallback when Kokoro is disabled.

Included in this change:
- add built-in `_kokoro_tts` plugin with plugin-owned synth API, config, status UI, and provider registration
- add built-in `_whisper_stt` plugin with plugin-owned transcribe API, mic runtime, device UI, prompt injection, and provider registration
- remove legacy core speech APIs/helpers/settings/UI and delete unused `webui/js/speech_browser.js`
- replace the old hardcoded speech settings section with a generic voice surface backed by plugin extensions
- update preload/docs/tests to match the new plugin-owned speech architecture

Behavioral intent:
- both plugins are built-in but not `always_enabled`
- users can now hot-switch TTS and STT independently
- browser TTS remains available when `_kokoro_tts` is off
- Whisper mic UI only appears when `_whisper_stt` is enabled
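
The broker-plus-fallback behavior described in the commit message can be pictured with a minimal sketch. This is not the project's code: `TtsBroker`, `register`, `unregister`, `speak`, and the two stand-in synthesizer functions are hypothetical names chosen for illustration, and the real `tts-service` broker may be structured quite differently.

```python
from typing import Callable, Dict

# Hypothetical type: a provider turns text into some playable result.
Synthesizer = Callable[[str], str]


class TtsBroker:
    """Illustrative engine-agnostic broker (assumed shape, not the real one)."""

    def __init__(self, fallback: Synthesizer) -> None:
        self._providers: Dict[str, Synthesizer] = {}
        self._fallback = fallback

    def register(self, name: str, provider: Synthesizer) -> None:
        # Called when a plugin such as `_kokoro_tts` is enabled.
        self._providers[name] = provider

    def unregister(self, name: str) -> None:
        # Called when the plugin is disabled; unknown names are ignored.
        self._providers.pop(name, None)

    def speak(self, text: str) -> str:
        # Prefer the most recently registered provider, else fall back.
        if self._providers:
            provider = list(self._providers.values())[-1]
            return provider(text)
        return self._fallback(text)


def browser_tts(text: str) -> str:
    # Stand-in for browser-native speech synthesis.
    return f"browser:{text}"


def kokoro_tts(text: str) -> str:
    # Stand-in for server-side Kokoro synthesis.
    return f"kokoro:{text}"
```

With this shape, enabling or disabling a plugin only adds or removes a provider, so callers of `speak` never need to know which engine is active.
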
1 parent 16deb6a commit d30fba1

59 files changed

Lines changed: 3084 additions & 2086 deletions

README.md

Lines changed: 3 additions & 0 deletions
@@ -144,6 +144,9 @@ A detailed setup guide for Windows, macOS, and Linux can be found in the Agent Z
 - The Web UI output is very clean, fluid, colorful, readable, and interactive; nothing is hidden.
 - You can load or save chats directly within the Web UI.
 - The same output you see in the terminal is automatically saved to an HTML file in **logs/** folder for every session.
+- Voice is provided by the built-in `_kokoro_tts` and `_whisper_stt` plugins.
+- Docker/bootstrap remains responsible for installing Kokoro, Whisper, `ffmpeg`, and related speech dependencies.
+- If `_kokoro_tts` is disabled, spoken output falls back to the browser's native speech synthesis.
 
 ![Time example](/docs/res/time_example.jpg)

api/synthesize.py

Lines changed: 0 additions & 96 deletions
This file was deleted.

api/transcribe.py

Lines changed: 0 additions & 18 deletions
This file was deleted.

docs/guides/usage.md

Lines changed: 19 additions & 12 deletions
@@ -748,20 +748,27 @@ If you encounter issues with the tunnel feature:
 > Combine tunneling with authentication for secure remote access to your Agent Zero instance from any device, including mobile phones and tablets.
 
 ## Voice Interface
-Agent Zero provides both Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities for natural voice interaction:
+Agent Zero provides both Text-to-Speech (TTS) and Speech-to-Text (STT) capabilities for natural voice interaction through built-in plugins:
+
+- `_kokoro_tts` handles server-side Kokoro speech synthesis when enabled
+- `_whisper_stt` handles server-side Whisper transcription and injects the microphone UI when enabled
+- Browser-native `speechSynthesis` remains the fallback output path when `_kokoro_tts` is disabled
+
+Use the Agent Plugins section in Settings to enable or disable either plugin independently.
 
 ### Text-to-Speech
 Enable voice responses from agents:
 * Toggle the "Speech" switch in the Preferences section of the sidebar
-* Agents will use your system's built-in voice synthesizer to speak their messages
+* If `_kokoro_tts` is enabled, agents will use Kokoro for spoken output
+* If `_kokoro_tts` is disabled, agents will use your browser's built-in voice synthesizer
 * Click the "Stop Speech" button above the input area to immediately stop any ongoing speech
 * You can also click the speech button when hovering over messages to speak individual messages or their parts
 
 ![TTS Stop Speech](../res/usage/ui-tts-stop-speech1.png)
 
 - The interface allows users to stop speech at any time if a response is too lengthy or if they wish to intervene during the conversation.
 
-The TTS uses a standard voice interface provided by modern browsers, which may sound robotic but is effective and does not require complex AI models. This ensures low latency and quick responses across various platforms, including mobile devices.
+Kokoro gives you a local container-side TTS path when the plugin is enabled. When it is disabled, Agent Zero falls back to the browser voice stack, which is lower-friction and works well across devices.
 
 
 > [!TIP]
@@ -771,19 +778,20 @@ The TTS uses a standard voice interface provided by modern browsers, which may s
 > - Creating a more interactive experience
 
 ### Speech-to-Text
-Send voice messages to agents using OpenAI's Whisper model (does not require OpenAI API key!):
+Send voice messages to agents using Whisper (does not require an OpenAI API key):
 
 1. Click the microphone button in the input area to start recording
+- The microphone button only appears when `_whisper_stt` is enabled
 2. The button color indicates the current status:
 - Grey: Inactive
-- Red: Listening
-- Green: Recording
-- Teal: Waiting
-- Cyan (pulsing): Processing
+- Teal: Listening
+- Red: Recording
+- Amber: Waiting
+- Purple: Processing or activating
 
 Users can adjust settings such as silence threshold and message duration before sending to optimize their interaction experience.
 
-Configure STT settings in the Settings page:
+Configure Whisper STT from the plugin settings screen in the Voice section or from Agent Plugins:
 * **Model Size:** Choose between Base (74M, English) or other models
 - Note: Only Large and Turbo models support multiple languages
 * **Language Code:** Set your preferred language (e.g., 'en', 'fr', 'it', 'cz')
@@ -795,9 +803,8 @@ Configure STT settings in the Settings page:
 ![Speech to Text Settings](../res/usage/ui-settings-5-speech-to-text.png)
 
 > [!IMPORTANT]
-> All STT and TTS functionalities operate locally within the Docker container,
-> ensuring that no data is transmitted to external servers or OpenAI APIs. This
-> enhances user privacy while maintaining functionality.
+> Whisper STT and Kokoro TTS operate locally within the Docker/container runtime when their plugins are enabled.
+> Browser fallback TTS runs locally in the browser. No voice path requires OpenAI APIs.
 
 ## Mathematical Expressions
 * **Complex Mathematics:** Supports full KaTeX syntax for:
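
The updated usage guide in the diff above remaps the microphone button colors. As a rough illustration only, the new mapping could be captured in a small lookup table. The names `MicState`, `MIC_COLORS`, and `mic_color` are hypothetical and do not come from the plugin source; only the state names and colors are taken from the documentation change.

```python
from enum import Enum


class MicState(Enum):
    # State names follow the updated usage.md text; the enum itself is assumed.
    INACTIVE = "inactive"
    LISTENING = "listening"
    RECORDING = "recording"
    WAITING = "waiting"
    PROCESSING = "processing"
    ACTIVATING = "activating"


# Colors as listed on the "+" lines of the docs/guides/usage.md diff.
MIC_COLORS = {
    MicState.INACTIVE: "grey",
    MicState.LISTENING: "teal",
    MicState.RECORDING: "red",
    MicState.WAITING: "amber",
    MicState.PROCESSING: "purple",
    MicState.ACTIVATING: "purple",
}


def mic_color(state: MicState) -> str:
    """Return the documented button color for a microphone state."""
    return MIC_COLORS[state]
```

Note that red moved from "Listening" to "Recording" in this commit, so anything keyed on the old colors would need the same update.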

docs/setup/installation.md

Lines changed: 6 additions & 4 deletions
@@ -355,11 +355,13 @@ Use `claude-sonnet-4-5` for Anthropic, but use `anthropic/claude-sonnet-4-5` for
 > [!NOTE]
 > Agent Zero uses a local embedding model by default (runs on CPU), but you can switch to OpenAI embeddings like `text-embedding-3-small` or `text-embedding-3-large` if preferred.
 
-### Speech to Text Options
+### Built-in Voice Plugins
 
-- **Model Size:** Choose the speech recognition model size
-- **Language Code:** Set the primary language for voice recognition
-- **Silence Settings:** Configure silence threshold, duration, and timeout parameters for voice input
+- Agent Zero ships Whisper STT as the built-in `_whisper_stt` plugin and Kokoro TTS as the built-in `_kokoro_tts` plugin.
+- Docker/bootstrap remains responsible for installing the required speech dependencies such as `ffmpeg`, Kokoro, Whisper, and `soundfile`.
+- Both plugins can be enabled or disabled independently from the Agent Plugins section in the Web UI.
+- Whisper model size, language, and silence behavior are configured from the plugin settings screen.
+- If `_kokoro_tts` is disabled, spoken output falls back to the browser's native speech synthesis instead of the container runtime.
 
 ### API Keys

helpers/kokoro_tts.py

Lines changed: 0 additions & 127 deletions
This file was deleted.
