Commit f3acd80

Update README with replacements, snippets, wake, and sidebar docs
Made-with: Cursor
1 parent d841430 commit f3acd80

1 file changed

Lines changed: 95 additions & 13 deletions

voxtral_realtime/macos/README.md

@@ -8,10 +8,14 @@ https://github.com/user-attachments/assets/6d6089fc-5feb-458b-a60b-08379855976a

 - **Live transcription** — real-time token streaming with audio waveform visualization
 - **System-wide dictation** — press `Ctrl+Space` in any app to transcribe speech and auto-paste the result
+- **"Hey torch" voice wake** — hands-free dictation via Silero VAD speech detection and wake phrase matching
+- **Text replacements** — auto-correct names, acronyms, and domain terms after transcription
+- **Snippets** — say a trigger phrase (e.g., "email signature") to paste pre-written templates
 - **Model preloading** — load the model once, transcribe instantly across sessions
 - **Pause / resume** — pause and resume within the same session without losing context
-- **Session history** — searchable history with rename, copy, and delete
-- **Silence detection** — dictation auto-stops after 2 seconds of silence
+- **Session history** — searchable history with pinning, recency grouping, rename, copy, and multi-format export (.txt, .json, .srt)
+- **Silence detection** — dictation auto-stops after configurable silence timeout
+- **Sidebar navigation** — Home, Replacements, Snippets, Wake, and Settings pages
 - **Self-contained DMG** — runner binary, model weights, and runtime libraries all bundled

 ## Download
@@ -40,9 +44,39 @@ https://github.com/user-attachments/assets/6d6089fc-5feb-458b-a60b-08379855976a
 2. Focus any text field in any app (Notes, Slack, browser, etc.)
 3. Press **`Ctrl+Space`** — a floating overlay appears with a waveform
 4. Speak — live transcribed text appears in the overlay
-5. Press **`Ctrl+Space`** again to stop, or wait for 2 seconds of silence
+5. Press **`Ctrl+Space`** again to stop, or wait for silence auto-stop
 6. The transcribed text is automatically pasted into the focused text field

+### Voice wake ("Hey torch")
+
+1. Open the **Wake** page in the sidebar and enable it (or press `Ctrl+Shift+W`)
+2. Say **"Hey torch"** — Silero VAD detects your speech, Voxtral checks for the wake keyword
+3. If matched, the dictation panel appears and you can start speaking
+4. Dictation auto-pastes when you stop speaking, then wake listening resumes
+
+The wake keyword is configurable in the Wake settings (default: "torch"). The app accumulates the full speech segment before checking, so you can say "hey torch" naturally as one phrase.
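The keyword check described above could look roughly like the following Python sketch. It is not the app's Swift implementation: the function name `contains_wake_keyword` and the normalization rules (lowercasing, dropping punctuation, whole-word matching) are assumptions made for illustration.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase the text and keep only word characters, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def contains_wake_keyword(transcript: str, keyword: str = "torch") -> bool:
    """Return True if `keyword` occurs as a whole word (or phrase) in the transcript.

    `transcript` stands for the text transcribed from one complete speech
    segment, so "hey torch" said as a single phrase is checked in one go.
    """
    target = _normalize(keyword)
    if not target:
        return False
    # Pad with spaces so that "torch" does not match inside "torchlight".
    return f" {target} " in f" {_normalize(transcript)} "
```

Checking the whole accumulated segment, rather than individual tokens as they stream in, is what lets punctuation and casing differences such as "Hey, Torch!" be normalized away before the comparison.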
+
+### Replacements
+
+Add text replacements in the **Replacements** page. Each entry has a trigger and replacement — when the trigger appears in transcribed text, it's automatically replaced.
+
+Examples:
+- `mtia` → `MTIA`
+- `executorch` → `ExecuTorch`
+- `execute torch` → `ExecuTorch`
+
+Supports case-preserving matching and word boundary options.
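As an illustration of how trigger matching with a word-boundary option might behave, here is a minimal Python sketch. The function `apply_replacements`, the `whole_word` flag, and the choice of case-insensitive matching are assumptions; the README does not specify the app's actual matching rules.

```python
import re


def apply_replacements(text: str, rules: dict[str, str], whole_word: bool = True) -> str:
    """Replace every trigger in `text` with its replacement in a single pass.

    Matching is case-insensitive (an assumption). With `whole_word=True`,
    a trigger only matches when it is not embedded in a longer word.
    """
    if not rules:
        return text
    lookup = {trigger.lower(): replacement for trigger, replacement in rules.items()}
    # Longest triggers first so a longer phrase wins over a shorter overlapping one.
    alternation = "|".join(
        re.escape(trigger) for trigger in sorted(rules, key=len, reverse=True)
    )
    pattern = f"(?:{alternation})"
    if whole_word:
        pattern = rf"(?<!\w){pattern}(?!\w)"
    return re.sub(
        pattern,
        lambda match: lookup[match.group(0).lower()],
        text,
        flags=re.IGNORECASE,
    )


RULES = {"mtia": "MTIA", "executorch": "ExecuTorch", "execute torch": "ExecuTorch"}
fixed = apply_replacements("we run execute torch on mtia", RULES)
```

A single combined pattern avoids re-scanning text that an earlier rule has already replaced, which a rule-by-rule loop would do.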
+
+### Snippets
+
+Add voice-triggered templates in the **Snippets** page. When your entire dictation matches a snippet trigger, the template content is pasted instead.
+
+Example: say **"email signature"** to paste:
+```
+Best,
+Younghan
+```
+
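The "entire dictation matches a trigger" rule can be sketched in Python as below. The names `resolve_dictation` and `_normalize` are hypothetical, and ignoring case and punctuation is an assumption about how a spoken trigger would be compared to the stored one.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and strip punctuation so "Email signature." equals "email signature"."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def resolve_dictation(transcript: str, snippets: dict[str, str]) -> str:
    """Return the snippet template if the whole transcript equals a trigger.

    A trigger that appears only as part of a longer dictation does not fire;
    in that case the transcript is returned unchanged.
    """
    spoken = _normalize(transcript)
    for trigger, template in snippets.items():
        if _normalize(trigger) == spoken:
            return template
    return transcript


SNIPPETS = {"email signature": "Best,\nYounghan"}
pasted = resolve_dictation("Email signature.", SNIPPETS)
```

Requiring a full match keeps ordinary sentences that merely mention the trigger, such as "send my email signature", from being swallowed by the template.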
 ### Keyboard shortcuts

 | Shortcut | Action |
@@ -53,7 +87,7 @@ https://github.com/user-attachments/assets/6d6089fc-5feb-458b-a60b-08379855976a
 | `Cmd+Shift+C` | Copy transcript |
 | `Cmd+Shift+U` | Unload model |
 | `Ctrl+Space` | Toggle system-wide dictation |
-| `Cmd+,` | Settings |
+| `Ctrl+Shift+W` | Toggle voice wake on/off |

 ---

@@ -129,7 +163,19 @@ The runner binary will be at:
 ${EXECUTORCH_PATH}/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner
 ```

-#### 4. Install Python packages
+#### 4. Build the Silero VAD stream runner (for voice wake)
+
+```bash
+cd ${EXECUTORCH_PATH}
+make silero-vad-cpu
+```
+
+This builds `silero_vad_stream_runner` at:
+```
+${EXECUTORCH_PATH}/cmake-out/examples/models/silero_vad/silero_vad_stream_runner
+```
+
+#### 5. Install Python packages

 ```bash
 pip install huggingface_hub sounddevice
@@ -138,7 +184,7 @@ pip install huggingface_hub sounddevice
 - `huggingface_hub` — to download model artifacts from HuggingFace
 - `sounddevice` — for the CLI mic streaming test script

-#### 5. Download model artifacts
+#### 6. Download model artifacts

 ```bash
 export LOCAL_FOLDER="$HOME/voxtral_realtime_quant_metal"
@@ -152,7 +198,13 @@ This downloads three files (~6.2 GB total):

 HuggingFace repo: [`mistralai/Voxtral-Mini-4B-Realtime-2602-ExecuTorch`](https://huggingface.co/mistral-labs/Voxtral-Mini-4B-Realtime-2602-ExecuTorch)

-#### 6. Test with CLI (optional)
+Download the Silero VAD model:
+
+```bash
+hf download younghan-meta/Silero-VAD-ExecuTorch-XNNPACK --local-dir ~/silero_vad_xnnpack
+```
+
+#### 7. Test with CLI (optional)

 Verify the runner works before building the app:

@@ -170,7 +222,7 @@ cd ${LOCAL_FOLDER} && chmod +x stream_audio.py
 --mic
 ```

-#### 7. Build the app and create DMG
+#### 8. Build the app and create DMG

 ```bash
 cd voxtral_realtime/macos
@@ -198,6 +250,7 @@ VoxtralRealtimeApp
 ├── TranscriptStore (@Observable, @MainActor)
 │   ├── SessionState: idle → loading → transcribing ⇆ paused → idle
 │   ├── ModelState: unloaded → loading → ready
+│   ├── TextPipeline: replacements → snippets → style (no-op)
 │   └── RunnerBridge (actor)
 │       ├── Process (voxtral_realtime_runner)
 │       │   ├── stdin ← raw 16kHz mono f32le PCM
@@ -207,9 +260,21 @@ VoxtralRealtimeApp
 │       └── AVAudioEngine → format conversion → pipe
 ├── DictationManager (@Observable, @MainActor)
 │   ├── Global hotkey (Carbon RegisterEventHotKey)
+│   ├── VadService (actor)
+│   │   ├── Process (silero_vad_stream_runner)
+│   │   │   ├── stdin ← 16kHz mono f32le PCM
+│   │   │   └── stdout → PROB <time> <probability>
+│   │   └── AVAudioEngine → speech segment accumulation
+│   ├── Wake flow: VAD speech-end → Voxtral phrase check → active
 │   ├── DictationPanel (NSPanel, non-activating, floating)
 │   └── Paste via CGEvent (Cmd+V to frontmost app)
-└── Views (SwiftUI)
+├── ReplacementStore / SnippetStore (JSON, Application Support)
+└── Views (SwiftUI, sidebar navigation)
+    ├── Home (welcome, transcription, session detail)
+    ├── Replacements (add/edit/delete trigger→replacement)
+    ├── Snippets (add/edit/delete voice-triggered templates)
+    ├── Wake (VAD toggle, keyword, detection tuning, status)
+    └── Settings (runner paths, model files, silence, shortcuts)
 ```

 ## Troubleshooting
@@ -223,11 +288,13 @@ The app needs **Accessibility** permission to simulate `Cmd+V` in other apps.
 3. If already listed, toggle it **off and back on**
 4. **Quit and relaunch** the app — macOS caches the trust state at process launch

-When running Debug builds from Xcode, each rebuild produces a new binary signature. macOS tracks Accessibility trust per binary identity, so you may need to re-grant permission after rebuilding. To avoid this:
-- Remove the old entry from Accessibility settings before re-adding
-- Or run the Release build for testing dictation
+After every fresh build, reset Accessibility permissions:
+
+```bash
+tccutil reset Accessibility org.pytorch.executorch.VoxtralRealtime
+```

-Even if Accessibility isn't granted, the transcribed text is always copied to the clipboard — you can paste manually with `Cmd+V`.
+Then relaunch and re-grant when prompted.

 ### Model fails to load / runner crashes

@@ -249,6 +316,21 @@ The app requests microphone access on first use. If denied:
 2. Enable `Voxtral Realtime`
 3. **Quit and relaunch** the app — macOS caches permission grants per process lifetime

+### Microphone producing silence
+
+If the Wake status shows "Disabled" with a silence error, macOS is delivering zero-audio to the app. This can happen after fresh builds or permission changes.
+
+1. Toggle the app's microphone permission **off and on** in System Settings
+2. Quit and relaunch the app
+3. If running from Cursor's terminal, try the native macOS Terminal instead
+
+### Voice wake not triggering
+
+- Check the **Wake** page — status should show "Listening for speech..."
+- Ensure `silero_vad_stream_runner` and `silero_vad.pte` paths are correct
+- Try increasing the **Check window** slider (default 4s) — Voxtral needs 1-2s to produce the first token
+- Speak clearly and wait for a brief pause after "hey torch" so VAD detects the speech segment end
+
 ### Permission prompts don't appear (stale TCC entries)

 If you've built or installed the app multiple times (Debug builds, Release builds, DMG installs), macOS may have accumulated multiple permission entries for the same bundle ID. Reset them to get a clean slate:
