Enhance voice assistant functionality with TTS sanitization and cross-channel broadcasting

sanchitmonga22 · sanchitmonga22 · commit dd1987e35eaf · 2026-02-10T22:01:20.000-08:00
- Introduced a new `sanitizeForTTS` function to clean text for natural TTS output, stripping markdown and special characters.
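- Updated the `outbound:delivered` event hook to check the `broadcastAllChannels` flag (default: `true`) on each event, so cross-channel broadcasting can be disabled from config; removed the placeholder `setupOutboundBroadcast` function.
- Sanitized text on every path that reaches the voice speaker (the dispatcher `deliver` callback, the outbound send handlers, and `broadcastToVoice`), and skipped messages that are empty after sanitization.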
- Added detailed instructions for creating and binding a dedicated voice agent for a conversational experience on Raspberry Pi.

Together, these changes keep markdown artifacts out of spoken responses and let users turn off cross-channel speech on the Pi.
diff --git a/RASPBERRY-PI-SETUP.md b/RASPBERRY-PI-SETUP.md
@@ -423,6 +423,98 @@ Telegram msg ──► OpenClaw Agent ──► Response ──► Telegram
 
 The voice plugin skips echoing messages that originated from the voice channel itself, preventing feedback loops.
 
+Cross-channel broadcast is controlled by the `broadcastAllChannels` config flag (default: `true`). Set it to `false` to disable:
+
+```json
+{
+  "channels": {
+    "voice-assistant": {
+      "broadcastAllChannels": false
+    }
+  }
+}
+```
+
+All text sent to the voice speaker — whether from the voice channel itself or broadcast from other channels — is automatically sanitized for TTS. Markdown formatting (`**bold**`, `` `code` ``, headers, links, etc.) is stripped so it sounds natural when spoken aloud.
+
+---
+
+## Conversational Voice Agent (SOUL.md)
+
+By default, the voice channel uses the same agent (and system prompt) as Telegram/WhatsApp. For a natural conversational experience on the Pi speaker, bind a dedicated voice agent with its own persona.
+
+### 9a. Create the Voice Agent Workspace
+
+```bash
+mkdir -p ~/.openclaw/workspaces/voice-agent
+```
+
+Create `~/.openclaw/workspaces/voice-agent/SOUL.md`:
+
+```markdown
+You are OpenClawPi — a conversational voice assistant running on a Raspberry Pi.
+
+You respond through a speaker via text-to-speech. Your responses will be spoken aloud,
+not read on a screen.
+
+Core rules:
+- Speak naturally — use flowing sentences as if in a real conversation
+- NEVER use markdown formatting (no bold, italic, headers, code blocks, bullet lists, links)
+- Keep responses concise: 2-4 sentences for simple questions, up to a short paragraph for complex ones
+- Use contractions naturally ("I'm", "you're", "that's", "it's")
+- For numbers, prefer spoken form for small values ("twenty-three" not "23")
+- For lists, use natural speech ("first... then... and finally..." not "1. 2. 3.")
+- Never output URLs, file paths, or code snippets — describe them verbally instead
+- If asked for code, describe the logic conversationally
+- Don't start responses with "Sure!" or "Of course!" — just answer directly
+- When you don't know something, say so briefly
+```
+
+### 9b. Add Agent Binding to Config
+
+Add the following to `~/.openclaw/openclaw.json`:
+
+```json
+{
+  "agents": {
+    "list": [
+      {
+        "id": "voice-agent",
+        "workspace": "~/.openclaw/workspaces/voice-agent"
+      }
+    ]
+  },
+  "bindings": [
+    {
+      "agentId": "voice-agent",
+      "match": {
+        "channel": "voice-assistant"
+      }
+    }
+  ]
+}
+```
+
+This routes all voice-assistant messages to the `voice-agent` (with the conversational SOUL.md), while Telegram and other channels continue using the default agent with normal rich-text responses.
+
+### 9c. Restart and Test
+
+```bash
+systemctl --user restart openclaw-gateway
+```
+
+Say **"Hey Jarvis, tell me about the weather"** — the response should sound natural and conversational, without any markdown artifacts.
+
+### How It Works
+
+| Message Source | Agent Used | Response Style | TTS Sanitized? |
+|---------------|-----------|----------------|---------------|
+| Voice mic | `voice-agent` | Conversational (SOUL.md) | Yes (safety net) |
+| Telegram | Default agent | Normal rich text | N/A (text channel) |
+| Telegram → Voice broadcast | Default agent | Normal rich text | Yes (stripped for speaker) |
+
+The TTS sanitizer (`tts-sanitize.ts`) acts as a safety net on **all** text reaching the speaker, regardless of which agent generated it. Even if the voice agent does not follow its SOUL.md instructions perfectly, the sanitizer ensures no markdown artifacts slip through.
+
 ---
 
 ## Viewing Logs
diff --git a/extensions/voice-assistant/index.ts b/extensions/voice-assistant/index.ts
@@ -67,10 +67,11 @@ const plugin = {
     // Register the channel plugin
     api.registerChannel({ plugin: voiceAssistantPlugin });
 
-    // Hook into gateway events to broadcast outbound messages to voice
-    // The runtime.events might not be available at register time
+    // Hook into gateway events to broadcast outbound messages to voice.
+    // When broadcastAllChannels is enabled (default: true), agent replies
+    // from any channel (Telegram, WhatsApp, etc.) are spoken aloud on the Pi.
     if (api.runtime.events?.on) {
-      api.runtime.events.on("outbound:delivered", (event: {
+      api.runtime.events.on("outbound:delivered", async (event: {
         channel: string;
         to: string;
         text?: string;
@@ -80,6 +81,17 @@ const plugin = {
           return;
         }
 
+        // Check broadcastAllChannels config (default: true)
+        try {
+          const cfg = await api.runtime.config.loadConfig();
+          const voiceCfg = cfg.channels?.["voice-assistant"] ?? {};
+          if (voiceCfg.broadcastAllChannels === false) {
+            return;
+          }
+        } catch {
+          // If config load fails, default to broadcasting
+        }
+
         // Broadcast to voice if text is available
         if (event.text) {
           broadcastToVoice(event.text, event.channel);
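The gating logic in this hook can be read as a single yes/no decision per event. The sketch below restates it as a pure function so the default behaviour is easy to check. It is an illustration, not code from this commit: `shouldBroadcastToVoice`, the interface names, and `VOICE_CHANNEL_ID` are hypothetical, and the echo guard is assumed from the setup guide's note that messages originating on the voice channel are skipped (that part of the handler is outside the hunk). Passing `undefined` for the config models a failed config load, which falls back to broadcasting.

```typescript
// Hypothetical types: only the fields this sketch needs.
interface VoiceChannelConfig {
  broadcastAllChannels?: boolean;
}

interface GatewayConfig {
  channels?: Record<string, VoiceChannelConfig | undefined>;
}

interface OutboundEvent {
  channel: string;
  to: string;
  text?: string;
}

const VOICE_CHANNEL_ID = "voice-assistant";

function shouldBroadcastToVoice(
  event: OutboundEvent,
  cfg: GatewayConfig | undefined,
): boolean {
  // Assumed echo guard: replies that originated on the voice channel are not re-spoken.
  if (event.channel === VOICE_CHANNEL_ID) return false;

  // A missing config or missing flag means "broadcast"; only an explicit false disables it.
  const voiceCfg: VoiceChannelConfig = cfg?.channels?.[VOICE_CHANNEL_ID] ?? {};
  if (voiceCfg.broadcastAllChannels === false) return false;

  // Nothing to speak if the event carries no text.
  return Boolean(event.text);
}

const telegramReply: OutboundEvent = { channel: "telegram", to: "user-1", text: "Done." };

console.log(shouldBroadcastToVoice(telegramReply, undefined)); // true
console.log(
  shouldBroadcastToVoice(telegramReply, {
    channels: { "voice-assistant": { broadcastAllChannels: false } },
  }),
); // false
```

Keeping the decision separate from the event wiring would make the default-to-broadcast fallback testable without a running gateway.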
diff --git a/extensions/voice-assistant/src/channel.ts b/extensions/voice-assistant/src/channel.ts
@@ -20,6 +20,7 @@ import type {
 } from "./types.js";
 import { getVoiceAssistantRuntime } from "./runtime.js";
 import { createVoiceGateway, getVoiceGateway, stopVoiceGateway } from "./gateway.js";
+import { sanitizeForTTS } from "./tts-sanitize.js";
 
 // =============================================================================
 // CONSTANTS
@@ -181,15 +182,16 @@ export const voiceAssistantPlugin: ChannelPlugin<ResolvedVoiceAssistantAccount>
             dispatcherOptions: {
               deliver: async (payload) => {
                 if (payload.text) {
-                  log?.info?.(`Sending TTS response: "${payload.text.slice(0, 80)}..."`);
+                  const spokenText = sanitizeForTTS(payload.text);
+                  if (!spokenText) return;
+                  log?.info?.(`Sending TTS response: "${spokenText.slice(0, 80)}..."`);
                   const speakMsg: VoiceSpeakMessage = {
                     type: "speak",
-                    text: payload.text,
+                    text: spokenText,
                     sourceChannel: CHANNEL_ID,
                     priority: 1,
                     interrupt: false,
                   };
-                  // Send to the specific client that sent the transcription
                   gateway.broadcast(speakMsg);
                 }
               },
@@ -243,9 +245,14 @@ export const voiceAssistantPlugin: ChannelPlugin<ResolvedVoiceAssistantAccount>
         };
       }
 
+      const spokenText = sanitizeForTTS(text);
+      if (!spokenText) {
+        return { channel: CHANNEL_ID, ok: true, skipped: true, reason: "Empty after sanitization" };
+      }
+
       const message: VoiceSpeakMessage = {
         type: "speak",
-        text,
+        text: spokenText,
         sourceChannel: CHANNEL_ID,
         priority: 1,
         interrupt: false,
@@ -268,14 +275,17 @@ export const voiceAssistantPlugin: ChannelPlugin<ResolvedVoiceAssistantAccount>
     sendMedia: async ({ to, text }) => {
       // Voice doesn't support media, but we can speak the caption
       if (text) {
-        const gateway = getVoiceGateway();
-        if (gateway?.hasClients()) {
-          gateway.broadcast({
-            type: "speak",
-            text,
-            sourceChannel: CHANNEL_ID,
-            priority: 1,
-          });
+        const spokenText = sanitizeForTTS(text);
+        if (spokenText) {
+          const gateway = getVoiceGateway();
+          if (gateway?.hasClients()) {
+            gateway.broadcast({
+              type: "speak",
+              text: spokenText,
+              sourceChannel: CHANNEL_ID,
+              priority: 1,
+            });
+          }
         }
       }
 
@@ -337,27 +347,6 @@ export const voiceAssistantPlugin: ChannelPlugin<ResolvedVoiceAssistantAccount>
   },
 };
 
-// =============================================================================
-// OUTBOUND BROADCAST HOOK
-// =============================================================================
-
-/**
- * Hook to broadcast outbound messages from ANY channel to voice
- * This enables the "all channels sync" feature
- */
-export function setupOutboundBroadcast(cfg: any): void {
-  const voiceCfg = getVoiceConfig(cfg);
-
-  if (voiceCfg.broadcastAllChannels === false) {
-    console.log("[VoiceAssistant] Outbound broadcast disabled");
-    return;
-  }
-
-  console.log("[VoiceAssistant] Outbound broadcast enabled - all channel messages will be spoken");
-
-  // This will be called by the plugin registration to hook into outbound events
-}
-
 /**
  * Broadcast a message to voice from any channel
  * Called by the gateway event hook
@@ -368,11 +357,14 @@ export function broadcastToVoice(text: string, sourceChannel: string): void {
     return;
   }
 
+  const spokenText = sanitizeForTTS(text);
+  if (!spokenText) return;
+
   gateway.broadcast({
     type: "speak",
-    text,
+    text: spokenText,
     sourceChannel,
-    priority: sourceChannel === CHANNEL_ID ? 1 : 0, // Lower priority for other channels
+    priority: sourceChannel === CHANNEL_ID ? 1 : 0, // Lower priority for cross-channel
     interrupt: false,
   });
 }
diff --git a/extensions/voice-assistant/src/tts-sanitize.ts b/extensions/voice-assistant/src/tts-sanitize.ts
@@ -0,0 +1,100 @@
+/**
+ * @file tts-sanitize.ts
+ * @description Sanitizes text for natural TTS (text-to-speech) output.
+ *
+ * Strips markdown formatting, special characters, and artifacts that
+ * would sound unnatural when spoken aloud through a speaker.
+ *
+ * Applied to ALL text before it reaches the voice client as a `speak` command,
+ * whether the text originated from the voice channel itself or was broadcast
+ * from another channel (Telegram, WhatsApp, etc.).
+ */
+
+/**
+ * Strips markdown and special characters for clean TTS output.
+ * Returns empty string if the result is only whitespace.
+ */
+export function sanitizeForTTS(text: string): string {
+  let result = text;
+
+  // --- Remove structural markdown ---
+
+  // Fenced code blocks (```lang\n...\n```) — remove entirely
+  result = result.replace(/```[\s\S]*?```/g, "");
+
+  // Inline code → plain text
+  result = result.replace(/`([^`]+)`/g, "$1");
+
+  // Images — remove entirely (alt text isn't useful spoken)
+  result = result.replace(/!\[([^\]]*)\]\([^)]+\)/g, "");
+
+  // Links — keep display text, drop URL
+  result = result.replace(/\[([^\]]+)\]\([^)]+\)/g, "$1");
+
+  // Bare URLs — drop entirely
+  result = result.replace(/https?:\/\/\S+/g, "");
+
+  // --- Remove inline formatting ---
+
+  // Bold + italic (***text*** or ___text___)
+  result = result.replace(/(\*{3}|_{3})([^*_]+)\1/g, "$2");
+
+  // Bold (**text** or __text__)
+  result = result.replace(/(\*{2}|_{2})([^*_]+)\1/g, "$2");
+
+  // Italic (*text* or _text_) — careful not to match mid-word underscores
+  result = result.replace(/(?<!\w)\*([^*]+)\*(?!\w)/g, "$1");
+  result = result.replace(/(?<!\w)_([^_]+)_(?!\w)/g, "$1");
+
+  // Strikethrough (~~text~~)
+  result = result.replace(/~~([^~]+)~~/g, "$1");
+
+  // --- Remove block-level markers ---
+
+  // Headers (# text)
+  result = result.replace(/^#{1,6}\s+/gm, "");
+
+  // Blockquotes (> text)
+  result = result.replace(/^>\s+/gm, "");
+
+  // Horizontal rules (---, ***, ___)
+  result = result.replace(/^[-*_]{3,}\s*$/gm, "");
+
+  // Unordered list markers (- item, * item, + item)
+  result = result.replace(/^[\t ]*[-*+]\s+/gm, "");
+
+  // Ordered list markers (1. item, 2. item)
+  result = result.replace(/^[\t ]*\d+\.\s+/gm, "");
+
+  // --- Remove OpenClaw-specific directives ---
+
+  // TTS directives: [[tts:...]], [[/tts:...]]
+  result = result.replace(/\[\[\/?\s*tts:[^\]]*\]\]/g, "");
+
+  // Reply-to tags
+  result = result.replace(/\[\[reply_to_current\]\]/g, "");
+
+  // --- Clean up special characters ---
+
+  // HTML-like tags such as <b> or </div>; a bare "<" or ">" in prose is left alone
+  result = result.replace(/<\/?[a-zA-Z][^>]*>/g, "");
+
+  // Curly braces, pipes, backslashes
+  result = result.replace(/[{}|\\]/g, "");
+
+  // --- Normalize whitespace ---
+
+  // Trim leading/trailing whitespace per line
+  result = result
+    .split("\n")
+    .map((line) => line.trim())
+    .join("\n");
+
+  // Collapse 3+ newlines into a double newline (after trimming, so whitespace-only lines collapse too)
+  result = result.replace(/\n{3,}/g, "\n\n");
+
+  // Collapse multiple spaces
+  result = result.replace(/ {2,}/g, " ");
+
+  return result.trim();
+}
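To illustrate the intended effect on a typical agent reply, the sketch below applies a small subset of the rules above in a standalone function. The name `stripBasicMarkdown` and the sample reply are hypothetical and not part of this commit. Only inline code, links, bold, header markers, and space collapsing are covered, so it should be read as a rough approximation of `sanitizeForTTS` rather than a substitute for it.

```typescript
// Hypothetical standalone subset of the sanitizer rules; not the committed function.
function stripBasicMarkdown(text: string): string {
  return text
    .replace(/`([^`]+)`/g, "$1") // inline code -> plain text
    .replace(/\[([^\]]+)\]\([^)]+\)/g, "$1") // links -> display text only
    .replace(/(\*{2}|_{2})([^*_]+)\1/g, "$2") // bold -> plain text
    .replace(/^#{1,6}\s+/gm, "") // header markers at line start
    .replace(/ {2,}/g, " ") // collapse runs of spaces
    .trim();
}

const reply =
  "## Status\n**All good**, see [the docs](https://example.com) or run `status`.";

console.log(stripBasicMarkdown(reply));
// Prints:
// Status
// All good, see the docs or run status.

// Callers treat an empty result as "nothing to say" and skip the speak message.
const spoken = stripBasicMarkdown("   ");
if (!spoken) {
  console.log("skipped: empty after sanitization");
}
```

The empty-string check mirrors how the channel code in this commit guards each `speak` message after sanitizing.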