Commit dd1987e

Enhance voice assistant functionality with TTS sanitization and cross-channel broadcasting

- Introduced a new `sanitizeForTTS` function to clean text for natural TTS output, stripping markdown and special characters.
- Updated voice assistant configuration to control cross-channel message broadcasting via the `broadcastAllChannels` flag.
- Enhanced outbound message handling to ensure all text sent to the voice speaker is sanitized, improving response quality.
- Added detailed instructions for creating and binding a dedicated voice agent for a conversational experience on Raspberry Pi.

This update ensures a more seamless and natural interaction with the voice assistant, enhancing user experience.

1 parent 8ff0f1d commit dd1987e

4 files changed

Lines changed: 234 additions & 38 deletions

RASPBERRY-PI-SETUP.md

Lines changed: 92 additions & 0 deletions
@@ -423,6 +423,98 @@ Telegram msg ──► OpenClaw Agent ──► Response ──► Telegram
 
 The voice plugin skips echoing messages that originated from the voice channel itself, preventing feedback loops.
 
+Cross-channel broadcast is controlled by the `broadcastAllChannels` config flag (default: `true`). Set it to `false` to disable:
+
+```json
+{
+  "channels": {
+    "voice-assistant": {
+      "broadcastAllChannels": false
+    }
+  }
+}
+```
+
+All text sent to the voice speaker — whether from the voice channel itself or broadcast from other channels — is automatically sanitized for TTS. Markdown formatting (`**bold**`, `` `code` ``, headers, links, etc.) is stripped so it sounds natural when spoken aloud.
+
+---
+
+## Conversational Voice Agent (SOUL.md)
+
+By default, the voice channel uses the same agent (and system prompt) as Telegram/WhatsApp. For a natural conversational experience on the Pi speaker, bind a dedicated voice agent with its own persona.
+
+### 9a. Create the Voice Agent Workspace
+
+```bash
+mkdir -p ~/.openclaw/workspaces/voice-agent
+```
+
+Create `~/.openclaw/workspaces/voice-agent/SOUL.md`:
+
+```markdown
+You are OpenClawPi — a conversational voice assistant running on a Raspberry Pi.
+
+You respond through a speaker via text-to-speech. Your responses will be spoken aloud,
+not read on a screen.
+
+Core rules:
+- Speak naturally — use flowing sentences as if in a real conversation
+- NEVER use markdown formatting (no bold, italic, headers, code blocks, bullet lists, links)
+- Keep responses concise: 2-4 sentences for simple questions, up to a short paragraph for complex ones
+- Use contractions naturally ("I'm", "you're", "that's", "it's")
+- For numbers, prefer spoken form for small values ("twenty-three" not "23")
+- For lists, use natural speech ("first... then... and finally..." not "1. 2. 3.")
+- Never output URLs, file paths, or code snippets — describe them verbally instead
+- If asked for code, describe the logic conversationally
+- Don't start responses with "Sure!" or "Of course!" — just answer directly
+- When you don't know something, say so briefly
+```
+
+### 9b. Add Agent Binding to Config
+
+Add the following to `~/.openclaw/openclaw.json`:
+
+```json
+{
+  "agents": {
+    "list": [
+      {
+        "id": "voice-agent",
+        "workspace": "~/.openclaw/workspaces/voice-agent"
+      }
+    ]
+  },
+  "bindings": [
+    {
+      "agentId": "voice-agent",
+      "match": {
+        "channel": "voice-assistant"
+      }
+    }
+  ]
+}
+```
+
+This routes all voice-assistant messages to the `voice-agent` (with the conversational SOUL.md), while Telegram and other channels continue using the default agent with normal rich-text responses.
+
+### 9c. Restart and Test
+
+```bash
+systemctl --user restart openclaw-gateway
+```
+
+Say **"Hey Jarvis, tell me about the weather"** — the response should sound natural and conversational, without any markdown artifacts.
+
+### How It Works
+
+| Message Source | Agent Used | Response Style | TTS Sanitized? |
+|---------------|-----------|----------------|---------------|
+| Voice mic | `voice-agent` | Conversational (SOUL.md) | Yes (safety net) |
+| Telegram | Default agent | Normal rich text | N/A (text channel) |
+| Telegram → Voice broadcast | Default agent | Normal rich text | Yes (stripped for speaker) |
+
+The TTS sanitizer (`tts-sanitize.ts`) acts as a safety net on **all** text reaching the speaker, regardless of which agent generated it. Even if the voice agent's SOUL.md instructions are followed perfectly, the sanitizer ensures no markdown artifacts slip through.
+
 ---
 
 ## Viewing Logs
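
Review note: the binding added in section 9b can be read as a lookup from channel name to agent, with unmatched channels falling through to the default agent. The sketch below only illustrates that reading. `Binding`, `resolveAgentId`, and the `"default"` fallback id are hypothetical names, and OpenClaw's actual resolution logic is not part of this commit.

```typescript
// Hypothetical shape mirroring one entry of "bindings" in openclaw.json.
interface Binding {
  agentId: string;
  match: { channel: string };
}

// Return the agent bound to a channel, or the fallback when no binding matches.
function resolveAgentId(
  bindings: Binding[],
  channel: string,
  defaultAgentId: string,
): string {
  const hit = bindings.find((b) => b.match.channel === channel);
  return hit ? hit.agentId : defaultAgentId;
}

const bindings: Binding[] = [
  { agentId: "voice-agent", match: { channel: "voice-assistant" } },
];

console.log(resolveAgentId(bindings, "voice-assistant", "default")); // voice-agent
console.log(resolveAgentId(bindings, "telegram", "default")); // default
```

Under this reading, Telegram replies keep their rich-text style and only become speakable because the sanitizer runs on the broadcast path.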

extensions/voice-assistant/index.ts

Lines changed: 15 additions & 3 deletions
@@ -67,10 +67,11 @@ const plugin = {
     // Register the channel plugin
     api.registerChannel({ plugin: voiceAssistantPlugin });
 
-    // Hook into gateway events to broadcast outbound messages to voice
-    // The runtime.events might not be available at register time
+    // Hook into gateway events to broadcast outbound messages to voice.
+    // When broadcastAllChannels is enabled (default: true), agent replies
+    // from any channel (Telegram, WhatsApp, etc.) are spoken aloud on the Pi.
     if (api.runtime.events?.on) {
-      api.runtime.events.on("outbound:delivered", (event: {
+      api.runtime.events.on("outbound:delivered", async (event: {
         channel: string;
         to: string;
         text?: string;
@@ -80,6 +81,17 @@ const plugin = {
           return;
         }
 
+        // Check broadcastAllChannels config (default: true)
+        try {
+          const cfg = await api.runtime.config.loadConfig();
+          const voiceCfg = cfg.channels?.["voice-assistant"] ?? {};
+          if (voiceCfg.broadcastAllChannels === false) {
+            return;
+          }
+        } catch {
+          // If config load fails, default to broadcasting
+        }
+
         // Broadcast to voice if text is available
         if (event.text) {
           broadcastToVoice(event.text, event.channel);
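
Review note: the new check treats only an explicit `false` as opting out, so a missing flag, a missing `voice-assistant` block, or a failed config load all leave broadcasting on. A minimal sketch of that default, assuming a config shaped like the JSON in the setup guide. `isBroadcastEnabled` and both interfaces are hypothetical and not part of the plugin.

```typescript
// Hypothetical config shapes; only the fields used by the check are modeled.
interface VoiceChannelConfig {
  broadcastAllChannels?: boolean;
}

interface GatewayConfig {
  channels?: Record<string, VoiceChannelConfig | undefined>;
}

// Broadcasting stays enabled unless the flag is explicitly set to false.
function isBroadcastEnabled(cfg?: GatewayConfig): boolean {
  const voiceCfg: VoiceChannelConfig = cfg?.channels?.["voice-assistant"] ?? {};
  return voiceCfg.broadcastAllChannels !== false;
}

console.log(isBroadcastEnabled(undefined)); // true
console.log(isBroadcastEnabled({ channels: {} })); // true
console.log(
  isBroadcastEnabled({
    channels: { "voice-assistant": { broadcastAllChannels: false } },
  }),
); // false
```

Pulling the predicate out like this would also make the default easy to unit test without a running gateway.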

extensions/voice-assistant/src/channel.ts

Lines changed: 27 additions & 35 deletions
@@ -20,6 +20,7 @@ import type {
 } from "./types.js";
 import { getVoiceAssistantRuntime } from "./runtime.js";
 import { createVoiceGateway, getVoiceGateway, stopVoiceGateway } from "./gateway.js";
+import { sanitizeForTTS } from "./tts-sanitize.js";
 
 // =============================================================================
 // CONSTANTS
@@ -181,15 +182,16 @@ export const voiceAssistantPlugin: ChannelPlugin<ResolvedVoiceAssistantAccount>
         dispatcherOptions: {
           deliver: async (payload) => {
             if (payload.text) {
-              log?.info?.(`Sending TTS response: "${payload.text.slice(0, 80)}..."`);
+              const spokenText = sanitizeForTTS(payload.text);
+              if (!spokenText) return;
+              log?.info?.(`Sending TTS response: "${spokenText.slice(0, 80)}..."`);
               const speakMsg: VoiceSpeakMessage = {
                 type: "speak",
-                text: payload.text,
+                text: spokenText,
                 sourceChannel: CHANNEL_ID,
                 priority: 1,
                 interrupt: false,
               };
-              // Send to the specific client that sent the transcription
               gateway.broadcast(speakMsg);
             }
           },
@@ -243,9 +245,14 @@ export const voiceAssistantPlugin: ChannelPlugin<ResolvedVoiceAssistantAccount>
         };
       }
 
+      const spokenText = sanitizeForTTS(text);
+      if (!spokenText) {
+        return { channel: CHANNEL_ID, ok: true, skipped: true, reason: "Empty after sanitization" };
+      }
+
       const message: VoiceSpeakMessage = {
         type: "speak",
-        text,
+        text: spokenText,
         sourceChannel: CHANNEL_ID,
         priority: 1,
         interrupt: false,
@@ -268,14 +275,17 @@ export const voiceAssistantPlugin: ChannelPlugin<ResolvedVoiceAssistantAccount>
     sendMedia: async ({ to, text }) => {
       // Voice doesn't support media, but we can speak the caption
       if (text) {
-        const gateway = getVoiceGateway();
-        if (gateway?.hasClients()) {
-          gateway.broadcast({
-            type: "speak",
-            text,
-            sourceChannel: CHANNEL_ID,
-            priority: 1,
-          });
+        const spokenText = sanitizeForTTS(text);
+        if (spokenText) {
+          const gateway = getVoiceGateway();
+          if (gateway?.hasClients()) {
+            gateway.broadcast({
+              type: "speak",
+              text: spokenText,
+              sourceChannel: CHANNEL_ID,
+              priority: 1,
+            });
+          }
         }
       }
 
@@ -337,27 +347,6 @@ export const voiceAssistantPlugin: ChannelPlugin<ResolvedVoiceAssistantAccount>
   },
 };
 
-// =============================================================================
-// OUTBOUND BROADCAST HOOK
-// =============================================================================
-
-/**
- * Hook to broadcast outbound messages from ANY channel to voice
- * This enables the "all channels sync" feature
- */
-export function setupOutboundBroadcast(cfg: any): void {
-  const voiceCfg = getVoiceConfig(cfg);
-
-  if (voiceCfg.broadcastAllChannels === false) {
-    console.log("[VoiceAssistant] Outbound broadcast disabled");
-    return;
-  }
-
-  console.log("[VoiceAssistant] Outbound broadcast enabled - all channel messages will be spoken");
-
-  // This will be called by the plugin registration to hook into outbound events
-}
-
 /**
  * Broadcast a message to voice from any channel
  * Called by the gateway event hook
@@ -368,11 +357,14 @@ export function broadcastToVoice(text: string, sourceChannel: string): void {
     return;
   }
 
+  const spokenText = sanitizeForTTS(text);
+  if (!spokenText) return;
+
   gateway.broadcast({
     type: "speak",
-    text,
+    text: spokenText,
     sourceChannel,
-    priority: sourceChannel === CHANNEL_ID ? 1 : 0, // Lower priority for other channels
+    priority: sourceChannel === CHANNEL_ID ? 1 : 0, // Lower priority for cross-channel
     interrupt: false,
   });
 }
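
Review note: after this change the same sequence (sanitize, skip if empty, build a `speak` message) appears in `deliver`, the outbound send handlers, and `broadcastToVoice`. One way it could be factored is sketched below. `buildSpeakMessage` is a hypothetical helper, the `VoiceSpeakMessage` shape is inferred from the fields visible in the diff, `CHANNEL_ID` is assumed to be `"voice-assistant"`, and the sanitizer is passed in as a parameter so the sketch stays self-contained.

```typescript
// Assumed value; the real constant lives in channel.ts and is not shown in the diff.
const CHANNEL_ID = "voice-assistant";

// Shape inferred from the object literals in the diff, not the real type definition.
interface VoiceSpeakMessage {
  type: "speak";
  text: string;
  sourceChannel: string;
  priority: number;
  interrupt: boolean;
}

// Sanitize the text and build a speak message, or return null when nothing is left to say.
function buildSpeakMessage(
  text: string,
  sourceChannel: string,
  sanitize: (raw: string) => string,
): VoiceSpeakMessage | null {
  const spokenText = sanitize(text);
  if (!spokenText) return null;
  return {
    type: "speak",
    text: spokenText,
    sourceChannel,
    // Replies from the voice channel itself outrank cross-channel broadcasts.
    priority: sourceChannel === CHANNEL_ID ? 1 : 0,
    interrupt: false,
  };
}

// Stand-in sanitizer for the demo only: removes asterisks and trims.
const stripStars = (raw: string): string => raw.replace(/\*/g, "").trim();

console.log(buildSpeakMessage("**Hello** there", CHANNEL_ID, stripStars)?.priority); // 1
console.log(buildSpeakMessage("**Hello** there", "telegram", stripStars)?.priority); // 0
console.log(buildSpeakMessage("***", "telegram", stripStars)); // null
```

Each call site would then reduce to a null check followed by `gateway.broadcast(...)`, which keeps the empty-after-sanitization behavior consistent across paths.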

extensions/voice-assistant/src/tts-sanitize.ts

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
+/**
+ * @file tts-sanitize.ts
+ * @description Sanitizes text for natural TTS (text-to-speech) output.
+ *
+ * Strips markdown formatting, special characters, and artifacts that
+ * would sound unnatural when spoken aloud through a speaker.
+ *
+ * Applied to ALL text before it reaches the voice client as a `speak` command,
+ * whether the text originated from the voice channel itself or was broadcast
+ * from another channel (Telegram, WhatsApp, etc.).
+ */
+
+/**
+ * Strips markdown and special characters for clean TTS output.
+ * Returns empty string if the result is only whitespace.
+ */
+export function sanitizeForTTS(text: string): string {
+  let result = text;
+
+  // --- Remove structural markdown ---
+
+  // Fenced code blocks (```lang\n...\n```) — remove entirely
+  result = result.replace(/```[\s\S]*?```/g, "");
+
+  // Inline code → plain text
+  result = result.replace(/`([^`]+)`/g, "$1");
+
+  // Images — remove entirely (alt text isn't useful spoken)
+  result = result.replace(/!\[([^\]]*)\]\([^)]+\)/g, "");
+
+  // Links — keep display text, drop URL
+  result = result.replace(/\[([^\]]+)\]\([^)]+\)/g, "$1");
+
+  // Bare URLs — drop entirely
+  result = result.replace(/https?:\/\/\S+/g, "");
+
+  // --- Remove inline formatting ---
+
+  // Bold + italic (***text*** or ___text___)
+  result = result.replace(/(\*{3}|_{3})([^*_]+)\1/g, "$2");
+
+  // Bold (**text** or __text__)
+  result = result.replace(/(\*{2}|_{2})([^*_]+)\1/g, "$2");
+
+  // Italic (*text* or _text_) — careful not to match mid-word underscores
+  result = result.replace(/(?<!\w)\*([^*]+)\*(?!\w)/g, "$1");
+  result = result.replace(/(?<!\w)_([^_]+)_(?!\w)/g, "$1");
+
+  // Strikethrough (~~text~~)
+  result = result.replace(/~~([^~]+)~~/g, "$1");
+
+  // --- Remove block-level markers ---
+
+  // Headers (# text)
+  result = result.replace(/^#{1,6}\s+/gm, "");
+
+  // Blockquotes (> text)
+  result = result.replace(/^>\s+/gm, "");
+
+  // Horizontal rules (---, ***, ___)
+  result = result.replace(/^[-*_]{3,}\s*$/gm, "");
+
+  // Unordered list markers (- item, * item, + item)
+  result = result.replace(/^[\t ]*[-*+]\s+/gm, "");
+
+  // Ordered list markers (1. item, 2. item)
+  result = result.replace(/^[\t ]*\d+\.\s+/gm, "");
+
+  // --- Remove OpenClaw-specific directives ---
+
+  // TTS directives: [[tts:...]], [[/tts:...]]
+  result = result.replace(/\[\[\/?\s*tts:[^\]]*\]\]/g, "");
+
+  // Reply-to tags
+  result = result.replace(/\[\[reply_to_current\]\]/g, "");
+
+  // --- Clean up special characters ---
+
+  // HTML-like tags (angle brackets)
+  result = result.replace(/<[^>]*>/g, "");
+
+  // Curly braces, pipes, backslashes
+  result = result.replace(/[{}|\\]/g, "");
+
+  // --- Normalize whitespace ---
+
+  // Collapse 3+ newlines into double newline (natural pause)
+  result = result.replace(/\n{3,}/g, "\n\n");
+
+  // Trim leading/trailing whitespace per line
+  result = result
+    .split("\n")
+    .map((line) => line.trim())
+    .join("\n");
+
+  // Collapse multiple spaces
+  result = result.replace(/ {2,}/g, " ");
+
+  return result.trim();
+}
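
Review note: to make the sanitizer's effect concrete, the sketch below applies a condensed subset of the rules from `tts-sanitize.ts`, in the same relative order, to a short markdown reply. `sanitizeSubset` is a hypothetical stand-in written for this example; the regular expressions are copied from the diff, but the full function applies more rules than shown here.

```typescript
// Condensed subset of the tts-sanitize.ts rules, kept in the same relative order.
function sanitizeSubset(text: string): string {
  return text
    .replace(/`([^`]+)`/g, "$1") // inline code: keep the text
    .replace(/!\[([^\]]*)\]\([^)]+\)/g, "") // images: drop entirely
    .replace(/\[([^\]]+)\]\([^)]+\)/g, "$1") // links: keep display text
    .replace(/(\*{2}|_{2})([^*_]+)\1/g, "$2") // bold
    .replace(/^#{1,6}\s+/gm, "") // header markers
    .replace(/^[\t ]*[-*+]\s+/gm, "") // unordered list markers
    .replace(/ {2,}/g, " ") // collapse repeated spaces
    .trim();
}

const reply =
  "## Status\n- **CPU** is at `42%`\n- See [the docs](https://example.com/docs)";

console.log(JSON.stringify(sanitizeSubset(reply)));
// "Status\nCPU is at 42%\nSee the docs"

// A message that is only an image leaves nothing to speak.
console.log(JSON.stringify(sanitizeSubset("![chart](https://example.com/c.png)")));
// ""
```

The second case is why every call site in `channel.ts` checks for an empty result before broadcasting a `speak` message.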
