|
1 | | -# Moonshine STT Transcription Example |
| 1 | +# Stream × Moonshine + Silero — Live Transcription |
2 | 2 |
|
3 | | -This example demonstrates real-time call transcription using the Moonshine Speech-to-Text plugin with GetStream Video SDK. |
| 3 | +This example spins up a bot that joins a Stream Video call, detects speech |
| 4 | +with **Silero VAD**, transcribes it with the **Moonshine** model, and prints |
| 5 | +final transcripts to the terminal. |
4 | 6 |
|
5 | | -## Features |
6 | | - |
7 | | -- **Real-time Transcription**: Process audio from video calls using Moonshine STT |
8 | | -- **Voice Activity Detection**: Integrated Silero VAD to filter speech from silence |
9 | | -- **Efficient Processing**: Only transcribe actual speech, reducing computational overhead |
10 | | -- **Performance Monitoring**: Track transcription speed, accuracy, and resource usage |
11 | | -- **Model Selection**: Choose between `moonshine/tiny` (fast) and `moonshine/base` (accurate) |
12 | | -- **Configurable Processing**: Adjust VAD sensitivity and STT parameters |
13 | | - |
14 | | -## Prerequisites |
15 | | - |
16 | | -1. **GetStream Account**: Get your API key from [GetStream Dashboard](https://dashboard.getstream.io/) |
17 | | -2. **Moonshine Library**: Install the Moonshine STT library |
18 | | -3. **Python 3.9+**: Required for the GetStream SDK |
19 | | - |
20 | | -## Installation |
21 | | - |
22 | | -1. **Install Moonshine STT Library**: |
23 | | - ```bash |
24 | | - pip install useful-moonshine@git+https://github.com/usefulsensors/moonshine.git |
25 | | - ``` |
26 | | - |
27 | | -2. **Install Example Dependencies**: |
28 | | - ```bash |
29 | | - # From the example directory |
30 | | - uv sync |
31 | | - ``` |
32 | | - |
33 | | -3. **Configure Environment**: |
34 | | - ```bash |
35 | | - cp env.example .env |
36 | | - # Edit .env with your GetStream API key |
37 | | - ``` |
38 | | - |
39 | | -## Configuration |
40 | | - |
41 | | -Edit the `.env` file with your settings: |
42 | | - |
43 | | -```env |
44 | | -# Required: Your GetStream API key |
45 | | -STREAM_API_KEY=your_stream_api_key_here |
46 | | -
|
47 | | -# Optional: Moonshine model selection (default: moonshine/base) |
48 | | -MOONSHINE_MODEL=moonshine/base # or moonshine/tiny |
49 | | -
|
50 | | -# Demo Configuration |
51 | | -DEMO_DURATION_SECONDS=60 # Demo runtime (seconds) |
| 7 | +Pipeline: |
| 8 | +``` |
| 9 | + WebRTC audio ▶︎ Silero VAD ▶︎ Moonshine STT ▶︎ print transcript |
52 | 10 | ``` |
53 | 11 |
|
54 | | -### Model Comparison |
55 | | - |
56 | | -| Model | Size | Speed | Accuracy | Use Case | |
57 | | -|-------|------|-------|----------|----------| |
58 | | -| `moonshine/tiny` | ~190MB | 5-10x real-time | Good | Real-time applications, resource-constrained | |
59 | | -| `moonshine/base` | ~400MB | 3-5x real-time | Better | **Default** - Higher accuracy requirements | |
60 | | - |
61 | | -## Usage |
62 | | - |
63 | | -### Basic Demo |
| 12 | +--- |
64 | 13 |
|
65 | | -Run the transcription demo: |
| 14 | +## Quick start |
66 | 15 |
|
67 | 16 | ```bash |
68 | | -python main.py |
69 | | -``` |
| 17 | +cd examples/stt_moonshine_transcription |
70 | 18 |
|
71 | | -This will: |
72 | | -1. Initialize Moonshine STT with your chosen model |
73 | | -2. Set up Voice Activity Detection (if enabled) |
74 | | -3. Display configuration and wait for audio input |
75 | | -4. Show periodic statistics during runtime |
| 19 | +# create & activate env (fast, no pip) |
| 20 | +uv venv .venv && source .venv/bin/activate |
76 | 21 |
|
77 | | -### Integration with Video Calls |
| 22 | +# install everything declared in this folder's pyproject.toml |
| 23 | +uv sync |
78 | 24 |
|
79 | | -To integrate with actual video calls, modify the `run_transcription_demo` method: |
80 | | - |
81 | | -```python |
82 | | -# Example integration (pseudo-code) |
83 | | -async def process_call_audio(call_id: str): |
84 | | - # Initialize components |
85 | | - stt = Moonshine() |
86 | | - vad = Silero(sample_rate=16000, speech_pad_ms=300, min_speech_ms=250) |
87 | | - |
88 | | - # Set up VAD -> STT pipeline |
89 | | - @vad.on("audio") |
90 | | - async def on_speech_detected(pcm_data, user): |
91 | | - await stt.process_audio(pcm_data, user) |
92 | | - |
93 | | - # Join the call |
94 | | - call = client.video.call("default", call_id) |
95 | | - async with await rtc.join(call, "bot-user") as connection: |
96 | | - @connection.on("audio") |
97 | | - async def on_audio(pcm_data, user): |
98 | | - # Process all audio through VAD first |
99 | | - await vad.process_audio(pcm_data, user) |
| 25 | +# copy credentials and run |
| 26 | +cp env.example .env # fill STREAM_* keys |
| 27 | +python main.py # or: uv run python main.py |
100 | 28 | ``` |
101 | 29 |
|
102 | | -## Performance Characteristics |
103 | | - |
104 | | -### Expected Performance (on modern hardware) |
105 | | - |
106 | | -- **Real-time Factor**: 0.1-0.3x (processes 3-10x faster than real-time) |
107 | | -- **Latency**: 100-300ms for 1-second audio chunks |
108 | | -- **Memory Usage**: 200-400MB depending on model |
109 | | -- **CPU Usage**: 10-30% on modern CPUs |
110 | | - |
111 | | -### Optimization Tips |
112 | | - |
113 | | -1. **Model Selection**: |
114 | | - - Use `moonshine/base` for best balance of accuracy and performance (**default**) |
115 | | - - Use `moonshine/tiny` for maximum speed on resource-constrained devices |
116 | | - |
117 | | -2. **Chunk Duration**: |
118 | | - - Smaller chunks (500-1000ms): Lower latency, more processing overhead |
119 | | - - Larger chunks (1000-2000ms): Higher latency, better efficiency |
120 | | - |
121 | | -3. **VAD Integration**: |
122 | | - - Silero VAD automatically filters out silence |
123 | | - - Only processes actual speech, reducing computational overhead |
124 | | - - Configured with optimal settings: 300ms padding, 250ms minimum speech, 0.3/0.2 activation/deactivation thresholds |
| 30 | +You'll see something like: |
125 | 31 |
|
126 | | -## Troubleshooting |
127 | | - |
128 | | -### Common Issues |
129 | | - |
130 | | -1. **Moonshine Import Error**: |
131 | | - ``` |
132 | | - ImportError: No module named 'moonshine' |
133 | | - ``` |
134 | | - **Solution**: Install Moonshine library: |
135 | | - ```bash |
136 | | - pip install useful-moonshine@git+https://github.com/usefulsensors/moonshine.git |
137 | | - ``` |
138 | | - |
139 | | -2. **CUDA/GPU Issues**: |
140 | | - ``` |
141 | | - RuntimeError: CUDA out of memory |
142 | | - ``` |
143 | | - **Solution**: Force CPU usage: |
144 | | - ```python |
145 | | - stt = Moonshine(device="cpu") |
146 | | - ``` |
| 32 | +```text |
| 33 | +🌙 Stream + Moonshine Real-time Transcription Example |
| 34 | +📞 Call ID: 4b12… |
| 35 | +✅ Bot joined call: 4b12… |
| 36 | +🎧 Listening for audio… (Press Ctrl+C to stop) |
| 37 | +🎤 Speech detected from user: My User, duration: 1.12s |
| 38 | +[14:03:27] My User: hello moonshine |
| 39 | +``` |
147 | 40 |
|
148 | | -3. **No Transcriptions**: |
149 | | - - Check audio input levels |
150 | | - - Verify VAD settings (try disabling VAD) |
151 | | - - Ensure minimum audio length requirements are met |
| 41 | +--- |
152 | 42 |
|
153 | | -4. **Poor Performance**: |
154 | | - - Try the `moonshine/tiny` model |
155 | | - - Increase chunk duration |
156 | | - - Check system resources (CPU/memory) |
| 43 | +## How it works (short version) |
157 | 44 |
|
158 | | -### Debug Mode |
| 45 | +`main.py` does the following: |
159 | 46 |
|
160 | | -Enable debug logging for detailed information: |
| 47 | +1. Creates two temporary users (`create_user`) – **human** and **moonshine-bot**. |
| 48 | +2. Generates a random `call_id`, creates the call, and opens a join URL in your browser. |
| 49 | +3. Initialises: |
| 50 | + * `Silero()` – voice-activity detector (48 kHz by default). |
| 51 | + * `Moonshine()` – STT model (base or tiny, picked in the plugin). |
| 52 | +4. Joins the call with `rtc.join()`, then registers two handlers that chain call audio through the VAD into the STT model:
161 | 53 |
|
162 | 54 | ```python |
163 | | -import logging |
164 | | -logging.basicConfig(level=logging.DEBUG) |
165 | | -``` |
166 | | - |
167 | | -## Example Output |
| 55 | +@connection.on("audio") |
| 56 | +async def on_pcm(pcm, user): |
| 57 | + await vad.process_audio(pcm, user) # silence filtered here |
168 | 58 |
|
| 59 | +@vad.on("audio") |
| 60 | +async def on_speech(pcm, user): |
| 61 | + await stt.process_audio(pcm, user) |
169 | 62 | ``` |
170 | | -🌙 Stream + Moonshine Real-time Transcription Example |
171 | | -=================================================== |
172 | | -📞 Call ID: 12345678-1234-1234-1234-123456789abc |
173 | | -🔑 Created token for browser user: browser-user |
174 | | -🤖 Created token for bot user: transcription-bot |
175 | | -📞 Call created: 12345678-1234-1234-1234-123456789abc |
176 | | -Opening browser to: https://pronto.getstream.io/bare/join/... |
177 | | -
|
178 | | -🤖 Starting transcription bot... |
179 | | -The bot will join the call and transcribe speech using VAD + Moonshine STT. |
180 | | -VAD will filter out silence and only process actual speech. |
181 | | -Join the call in your browser and speak to see transcriptions appear here! |
182 | | -
|
183 | | -🌙 Initializing Moonshine STT... |
184 | | -🔊 Initializing Silero VAD... |
185 | | -✅ Audio processing pipeline ready: VAD → Moonshine STT |
186 | | -✅ Bot joined call: 12345678-1234-1234-1234-123456789abc |
187 | | -🎧 Listening for audio... (Press Ctrl+C to stop) |
188 | | -
|
189 | | -🎤 Speech detected from user: browser-user, duration: 2.34s |
190 | | -[14:30:25] browser-user: Hello, this is a test of the Moonshine transcription system. |
191 | | - └─ model: moonshine/base, device: cpu, RTF: 0.10x |
192 | | -
|
193 | | -🧹 Cleanup completed |
194 | | -``` |
195 | | - |
196 | | -## Next Steps |
197 | 63 |
|
198 | | -1. **Integrate with Real Calls**: Modify the example to process actual call audio |
199 | | -2. **Add Persistence**: Store transcriptions in a database |
200 | | -3. **Implement Webhooks**: Send transcriptions to external services |
201 | | -4. **Add Language Support**: Extend for multiple languages (when supported by Moonshine) |
202 | | -5. **Custom Models**: Train custom Moonshine models for specific domains |
| 64 | +5. `Moonshine` emits a final `transcript` event, which the script prints with a timestamp.
| 65 | +6. On **Ctrl-C** the script closes the STT client and the VAD, then deletes the temporary users.
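To make the event chain in steps 4 and 5 concrete, here is a minimal, self-contained sketch that can be run without Stream, Silero, or Moonshine installed. Everything in it is a stand-in: `Emitter`, `FakeVAD`, `FakeSTT`, and `format_line` are hypothetical names invented for illustration, and the `(text, user)` signature of the `transcript` handler is an assumption, not something the README confirms about the real plugin. Only the `@x.on("audio")` / `process_audio(pcm, user)` shape is taken from the snippet above.

```python
import asyncio
from collections import defaultdict
from datetime import datetime


class Emitter:
    """Tiny stand-in for an object with an `on(event)` decorator (hypothetical)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event):
        def register(handler):
            self._handlers[event].append(handler)
            return handler
        return register

    async def emit(self, event, *args):
        for handler in self._handlers[event]:
            await handler(*args)


class FakeVAD(Emitter):
    """Forwards only non-silent chunks; an all-zero chunk counts as silence."""

    async def process_audio(self, pcm, user):
        if any(pcm):
            await self.emit("audio", pcm, user)


class FakeSTT(Emitter):
    """Pretends to transcribe by reporting how many samples it received."""

    async def process_audio(self, pcm, user):
        # Assumed handler signature: (text, user).
        await self.emit("transcript", f"<{len(pcm)} samples>", user)


def format_line(text, user, now=None):
    """Render a transcript line in the `[HH:MM:SS] user: text` style shown above."""
    now = now or datetime.now()
    return f"[{now:%H:%M:%S}] {user}: {text}"


async def run_demo():
    connection, vad, stt = Emitter(), FakeVAD(), FakeSTT()
    printed = []

    @connection.on("audio")
    async def on_pcm(pcm, user):
        await vad.process_audio(pcm, user)  # silence is dropped here

    @vad.on("audio")
    async def on_speech(pcm, user):
        await stt.process_audio(pcm, user)

    @stt.on("transcript")
    async def on_transcript(text, user):
        line = format_line(text, user)
        printed.append(line)
        print(line)

    await connection.emit("audio", [0, 0, 0], "My User")      # silent chunk
    await connection.emit("audio", [3, -2, 5, 1], "My User")  # "speech" chunk
    return printed


if __name__ == "__main__":
    asyncio.run(run_demo())
```

Only the second chunk reaches the transcript handler, which mirrors the point of placing the VAD in front of the STT model: silent audio never costs a transcription pass.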
203 | 66 |
|
204 | | -## Resources |
| 67 | +--- |
205 | 68 |
|
206 | | -- [Moonshine GitHub Repository](https://github.com/usefulsensors/moonshine) |
207 | | -- [GetStream Video SDK Documentation](https://getstream.io/video/docs/) |
208 | | -- [GetStream Python SDK](https://github.com/GetStream/stream-python) |
209 | | -- [Voice Activity Detection with Silero](https://github.com/snakers4/silero-vad) |
| 69 | +Need help? |
| 70 | +* Stream Video docs – <https://getstream.io/video/docs/> |
| 71 | +* Silero VAD – <https://github.com/snakers4/silero-vad> |
| 72 | +* Moonshine model – <https://github.com/usefulsensors/moonshine> |