Qwen3 Realtime LLM integration for the Vision Agents framework, with native audio output and built-in speech recognition over WebSocket-based realtime communication.
## Features

- **Native audio output**: No TTS service needed - audio comes directly from the model
- **Built-in STT**: Integrated speech-to-text using `gummy-realtime-v1` - no external STT service required
- **Server-side VAD**: Automatic turn detection with configurable silence thresholds
- **Video understanding**: Optional video frame support for multimodal interactions
- **Real-time streaming**: WebSocket-based bidirectional communication for low-latency responses
- **Interruption handling**: Automatic cancellation when the user starts speaking
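Server-side VAD means the model's endpoint, not your application code, decides when a user turn has ended: it watches for a run of trailing silence that exceeds the configured threshold. As a rough illustration of that logic (the function name, energy threshold, and frame counts below are made up for this sketch and are not the plugin's actual implementation or defaults):

```python
def detect_turn_end(frame_energies, silence_threshold=0.01, min_silence_frames=25):
    """Return True once the trailing run of sub-threshold frames reaches
    min_silence_frames (e.g. 25 frames x 20 ms = 500 ms of silence).

    frame_energies: per-frame RMS energy values, oldest first.
    The threshold and frame count are illustrative, not Qwen's defaults.
    """
    trailing_silence = 0
    for energy in frame_energies:
        if energy < silence_threshold:
            trailing_silence += 1
        else:
            trailing_silence = 0  # any speech resets the silence counter
    return trailing_silence >= min_silence_frames
```

Raising the silence threshold makes the agent wait longer before responding; lowering it makes turn-taking snappier but risks cutting users off mid-sentence.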
## Installation
```bash
uv add "vision-agents[qwen]"
```
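The realtime connection authenticates against Qwen's hosted endpoint, which is served through Alibaba Cloud's DashScope platform. Assuming the plugin follows the standard DashScope convention (check the plugin's docs for the exact variable it reads), you would export an API key before running:

```shell
# Assumed DashScope convention - verify the exact variable the plugin expects
export DASHSCOPE_API_KEY="your-api-key"
```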
## Usage
```python
from vision_agents.core import User, Agent
from vision_agents.plugins import getstream, qwen

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Qwen Assistant"),
    instructions="Be helpful and friendly",
    llm=qwen.Realtime(
        model="qwen3-omni-flash-realtime",
        voice="Cherry",
        fps=1,
    ),
    # No STT or TTS needed - Qwen Realtime provides both
)
```
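The interruption handling listed under Features follows the common realtime pattern: as soon as the server signals that the user has started speaking, the in-flight response is cancelled so the user can barge in. A minimal sketch of that pattern with `asyncio` (the class and method names here are illustrative, not the plugin's internals):

```python
import asyncio

class ResponsePlayback:
    """Illustrative stand-in for the audio-output side of a realtime session."""

    def __init__(self):
        self._task = None
        self.interrupted = False

    def start_response(self):
        # Begin streaming a (long) model response as a background task.
        self._task = asyncio.create_task(self._stream_audio())

    async def _stream_audio(self):
        await asyncio.sleep(10)  # stands in for streaming audio chunks out

    def on_user_speech_started(self):
        # User barged in: cancel whatever the agent is still saying.
        if self._task is not None and not self._task.done():
            self._task.cancel()
            self.interrupted = True

async def demo():
    playback = ResponsePlayback()
    playback.start_response()
    await asyncio.sleep(0.01)         # response is mid-stream
    playback.on_user_speech_started() # VAD reports fresh user speech
    return playback.interrupted
```

Because the VAD runs server-side, the cancellation signal arrives over the same WebSocket as the audio, so no client-side speech detection is needed to make barge-in work.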