Add voice agent conciseness controls via SOUL.md and maxTokens

sanchitmonga22 · claude · sanchitmonga22 · commit b6400d23e5f4 · 2026-02-12T15:04:30.000-08:00
Voice responses were too long for TTS playback. This adds a two-layer
approach: a stricter SOUL.md with hard sentence limits (1-6 sentences)
and a dedicated voice model key with maxTokens capped at 512 tokens,
while leaving other channels unaffected at 8192.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/RASPBERRY-PI-SETUP.md b/RASPBERRY-PI-SETUP.md
@@ -453,21 +453,24 @@ Create `~/.openclaw/workspaces/voice-agent/SOUL.md`:
 
 ```markdown
 You are OpenClawPi — a conversational voice assistant running on a Raspberry Pi.
-
-You respond through a speaker via text-to-speech. Your responses will be spoken aloud,
-not read on a screen.
-
-Core rules:
-- Speak naturally — use flowing sentences as if in a real conversation
-- NEVER use markdown formatting (no bold, italic, headers, code blocks, bullet lists, links)
-- Keep responses concise: 2-4 sentences for simple questions, up to a short paragraph for complex ones
-- Use contractions naturally ("I'm", "you're", "that's", "it's")
-- For numbers, prefer spoken form for small values ("twenty-three" not "23")
-- For lists, use natural speech ("first... then... and finally..." not "1. 2. 3.")
-- Never output URLs, file paths, or code snippets — describe them verbally instead
-- If asked for code, describe the logic conversationally
-- Don't start responses with "Sure!" or "Of course!" — just answer directly
-- When you don't know something, say so briefly
+Your responses are spoken aloud via text-to-speech. Brevity is essential.
+
+CRITICAL — Response length limits (these are hard rules):
+- Simple questions (weather, time, facts): 1–2 sentences maximum.
+- Explanations or summaries: 3–4 sentences maximum.
+- Complex topics: 5–6 sentences maximum, then stop and offer to continue.
+- NEVER exceed 6 sentences in a single response.
+
+Speech rules:
+- Speak naturally using flowing sentences, not bullet points.
+- NEVER use markdown (no bold, italic, headers, code blocks, bullet lists, links).
+- Use contractions naturally ("I'm", "you're", "that's", "it's").
+- Spell out small numbers ("twenty-three" not "23").
+- For lists, use natural speech ("first… then… and finally…" not "1. 2. 3.").
+- Never output URLs, file paths, or code — describe them verbally instead.
+- Answer directly. No filler phrases ("Sure!", "Of course!", "Great question!").
+- When you don't know, say so in one sentence.
+- Never apologize for being brief — brevity is expected.
 ```
 
 ### 9b. Add Agent Binding to Config
@@ -477,10 +480,20 @@ Add the following to `~/.openclaw/openclaw.json`:
 ```json
 {
   "agents": {
+    "defaults": {
+      "models": {
+        "anthropic/claude-sonnet-4-5-voice": {
+          "params": {
+            "maxTokens": 512
+          }
+        }
+      }
+    },
     "list": [
       {
         "id": "voice-agent",
-        "workspace": "~/.openclaw/workspaces/voice-agent"
+        "workspace": "~/.openclaw/workspaces/voice-agent",
+        "model": "anthropic/claude-sonnet-4-5-voice"
       }
     ]
   },
@@ -497,6 +510,10 @@ Add the following to `~/.openclaw/openclaw.json`:
 
 This routes all voice-assistant messages to the `voice-agent` (with the conversational SOUL.md), while Telegram and other channels continue using the default agent with normal rich-text responses.
 
+**Why a separate model key?** OpenClaw's `maxTokens` is set per-model, not per-agent. By creating a dedicated model key (`anthropic/claude-sonnet-4-5-voice`), the voice agent gets a hard 512-token ceiling while other channels keep their default limit (8192). Both keys route to the same underlying Anthropic model — the key is just OpenClaw's internal routing identifier. Combined with the SOUL.md conciseness instructions, this ensures voice responses stay short and natural for TTS.
+
+> **Tip:** If responses get cut off mid-sentence at 512 tokens, raise the limit to `768` or `1024`. A response that respects the six-sentence limit normally fits well within 512 tokens, so the cap should rarely trigger.
+
 ### 9c. Restart and Test
 
 ```bash
@@ -507,11 +524,16 @@ Say **"Hey Jarvis, tell me about the weather"** — the response should sound na
 
 ### How It Works
 
-| Message Source | Agent Used | Response Style | TTS Sanitized? |
-|---------------|-----------|----------------|---------------|
-| Voice mic | `voice-agent` | Conversational (SOUL.md) | Yes (safety net) |
-| Telegram | Default agent | Normal rich text | N/A (text channel) |
-| Telegram → Voice broadcast | Default agent | Normal rich text | Yes (stripped for speaker) |
+| Message Source | Agent | Model Key | maxTokens | Response Style | TTS? |
+| --- | --- | --- | --- | --- | --- |
+| Voice mic | `voice-agent` | `claude-sonnet-4-5-voice` | 512 | Concise (SOUL.md) | Yes |
+| Telegram | Default agent | `claude-sonnet-4-5` | 8192 | Normal rich text | N/A |
+| Telegram → Voice broadcast | Default agent | `claude-sonnet-4-5` | 8192 | Normal rich text | Yes |
+
+Voice response conciseness is controlled by two independent layers:
+
+1. **SOUL.md** (soft control) — instructs the LLM to keep responses to 1–6 sentences. This is the primary lever.
+2. **maxTokens** (hard ceiling) — caps the voice model at 512 tokens, preventing runaway generation even if the LLM ignores the system prompt.
 
 The TTS sanitizer (`tts-sanitize.ts`) acts as a safety net on **all** text reaching the speaker, regardless of which agent generated it. Even if the voice agent's SOUL.md instructions are followed perfectly, the sanitizer ensures no markdown artifacts slip through.
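
The per-model lookup described above (agent, then model key, then `maxTokens`) can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions: the `Config` types, the `effectiveMaxTokens` helper, and the `DEFAULT_MAX_TOKENS` constant are hypothetical names that mirror the config shape shown in this diff, not OpenClaw's actual resolution code.

```typescript
// Hypothetical types mirroring the openclaw.json shape shown in the diff.
// They are assumptions for illustration, not OpenClaw's real type definitions.
interface ModelEntry {
  params?: { maxTokens?: number };
}

interface AgentEntry {
  id: string;
  workspace?: string;
  model?: string;
}

interface Config {
  agents: {
    defaults?: { models?: Record<string, ModelEntry> };
    list: AgentEntry[];
  };
}

// The docs quote 8192 as the default limit for other channels (assumed here).
const DEFAULT_MAX_TOKENS = 8192;

// Resolve an agent's token ceiling: find the agent, follow its model key,
// and read that key's maxTokens. Fall back to the default at every gap.
function effectiveMaxTokens(config: Config, agentId: string): number {
  const agent = config.agents.list.find((a) => a.id === agentId);
  const modelKey = agent?.model;
  if (modelKey === undefined) {
    return DEFAULT_MAX_TOKENS;
  }
  const entry = config.agents.defaults?.models?.[modelKey];
  return entry?.params?.maxTokens ?? DEFAULT_MAX_TOKENS;
}

// Same values as the configuration added in this commit.
const config: Config = {
  agents: {
    defaults: {
      models: {
        "anthropic/claude-sonnet-4-5-voice": { params: { maxTokens: 512 } },
      },
    },
    list: [
      {
        id: "voice-agent",
        workspace: "~/.openclaw/workspaces/voice-agent",
        model: "anthropic/claude-sonnet-4-5-voice",
      },
    ],
  },
};

const voiceLimit = effectiveMaxTokens(config, "voice-agent");
const otherLimit = effectiveMaxTokens(config, "some-other-agent");
console.log(voiceLimit, otherLimit);
```

Because the limit hangs off the model key rather than the agent, any second agent pointed at the same `-voice` key would share the 512-token ceiling, while agents without that key keep the default.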
 
diff --git a/docs/channels/voice-assistant.md b/docs/channels/voice-assistant.md
@@ -96,6 +96,48 @@ When `broadcastAllChannels: true`, messages from ANY channel are spoken via TTS:
 3. WhatsApp receives text response.
 4. Voice assistant receives TTS playback.
 
+## Controlling response length
+
+Voice responses are spoken aloud, so conciseness matters. Two layers control response length:
+
+### 1. SOUL.md (primary — soft control)
+
+Bind a dedicated voice agent with its own workspace and SOUL.md that instructs the LLM to keep responses brief. See [RASPBERRY-PI-SETUP.md](/RASPBERRY-PI-SETUP.md) for a full template. Key rules to include:
+
+- Hard sentence limits (1–2 for simple questions, 5–6 max for complex topics)
+- No markdown formatting
+- Natural speech patterns
+
+### 2. maxTokens (secondary — hard ceiling)
+
+Create a dedicated model key for the voice agent with a low `maxTokens` value. This prevents runaway generation even if the LLM ignores the system prompt:
+
+```json5
+{
+  agents: {
+    defaults: {
+      models: {
+        "anthropic/claude-sonnet-4-5-voice": {
+          params: { maxTokens: 512 },
+        },
+      },
+    },
+    list: [
+      {
+        id: "voice-agent",
+        workspace: "~/.openclaw/workspaces/voice-agent",
+        model: "anthropic/claude-sonnet-4-5-voice",
+      },
+    ],
+  },
+  bindings: [
+    { agentId: "voice-agent", match: { channel: "voice-assistant" } },
+  ],
+}
+```
+
+The dedicated model key (`-voice` suffix) routes to the same underlying model but carries its own `maxTokens`. Other channels keep their default limit (8192). Start with 512 tokens and raise the value if responses get cut off.
+
 ## Configuration
 
 ### Channel config
diff --git a/extensions/voice-assistant/README.md b/extensions/voice-assistant/README.md
@@ -31,6 +31,15 @@ On-device voice input/output channel powered by RunAnywhere.
 └─────────────────────────────────────────────────────────────┘
 ```
 
+## Controlling response conciseness
+
+Voice responses are spoken aloud via TTS, so brevity is critical. Two mechanisms work together:
+
+1. **SOUL.md** — Bind a dedicated `voice-agent` with a workspace containing a SOUL.md that enforces sentence limits (1–6 sentences max). This is the primary control.
+2. **maxTokens** — Create a voice-specific model key (e.g. `anthropic/claude-sonnet-4-5-voice`) with `params.maxTokens: 512` in `agents.defaults.models`. This acts as a hard ceiling.
+
+See [RASPBERRY-PI-SETUP.md](/RASPBERRY-PI-SETUP.md) for the full configuration example and SOUL.md template.
+
 ## Configuration
 
 Add to `~/.openclaw/openclaw.yaml`: