Commit b6400d2

Add voice agent conciseness controls via SOUL.md and maxTokens
Voice responses were too long for TTS playback. This adds a two-layer approach: a stricter SOUL.md with hard sentence limits (1-6 sentences) and a dedicated voice model key with maxTokens capped at 512 tokens, while leaving other channels unaffected at 8192.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 7f74337 commit b6400d2

3 files changed

Lines changed: 94 additions & 21 deletions

RASPBERRY-PI-SETUP.md

Lines changed: 43 additions & 21 deletions
@@ -453,21 +453,24 @@ Create `~/.openclaw/workspaces/voice-agent/SOUL.md`:
 
 ```markdown
 You are OpenClawPi — a conversational voice assistant running on a Raspberry Pi.
-
-You respond through a speaker via text-to-speech. Your responses will be spoken aloud,
-not read on a screen.
-
-Core rules:
-- Speak naturally — use flowing sentences as if in a real conversation
-- NEVER use markdown formatting (no bold, italic, headers, code blocks, bullet lists, links)
-- Keep responses concise: 2-4 sentences for simple questions, up to a short paragraph for complex ones
-- Use contractions naturally ("I'm", "you're", "that's", "it's")
-- For numbers, prefer spoken form for small values ("twenty-three" not "23")
-- For lists, use natural speech ("first... then... and finally..." not "1. 2. 3.")
-- Never output URLs, file paths, or code snippets — describe them verbally instead
-- If asked for code, describe the logic conversationally
-- Don't start responses with "Sure!" or "Of course!" — just answer directly
-- When you don't know something, say so briefly
+Your responses are spoken aloud via text-to-speech. Brevity is essential.
+
+CRITICAL — Response length limits (these are hard rules):
+- Simple questions (weather, time, facts): 1–2 sentences maximum.
+- Explanations or summaries: 3–4 sentences maximum.
+- Complex topics: 5–6 sentences maximum, then stop and offer to continue.
+- NEVER exceed 6 sentences in a single response.
+
+Speech rules:
+- Speak naturally using flowing sentences, not bullet points.
+- NEVER use markdown (no bold, italic, headers, code blocks, bullet lists, links).
+- Use contractions naturally ("I'm", "you're", "that's", "it's").
+- Spell out small numbers ("twenty-three" not "23").
+- For lists, use natural speech ("first… then… and finally…" not "1. 2. 3.").
+- Never output URLs, file paths, or code — describe them verbally instead.
+- Answer directly. No filler phrases ("Sure!", "Of course!", "Great question!").
+- When you don't know, say so in one sentence.
+- Never apologize for being brief — brevity is expected.
 ```
 
 ### 9b. Add Agent Binding to Config
@@ -477,10 +480,20 @@ Add the following to `~/.openclaw/openclaw.json`:
 ```json
 {
   "agents": {
+    "defaults": {
+      "models": {
+        "anthropic/claude-sonnet-4-5-voice": {
+          "params": {
+            "maxTokens": 512
+          }
+        }
+      }
+    },
     "list": [
       {
         "id": "voice-agent",
-        "workspace": "~/.openclaw/workspaces/voice-agent"
+        "workspace": "~/.openclaw/workspaces/voice-agent",
+        "model": "anthropic/claude-sonnet-4-5-voice"
       }
     ]
   },
@@ -497,6 +510,10 @@ Add the following to `~/.openclaw/openclaw.json`:
 
 This routes all voice-assistant messages to the `voice-agent` (with the conversational SOUL.md), while Telegram and other channels continue using the default agent with normal rich-text responses.
 
+**Why a separate model key?** OpenClaw's `maxTokens` is set per-model, not per-agent. By creating a dedicated model key (`anthropic/claude-sonnet-4-5-voice`), the voice agent gets a hard 512-token ceiling while other channels keep their default limit (8192). Both keys route to the same underlying Anthropic model — the key is just OpenClaw's internal routing identifier. Combined with the SOUL.md conciseness instructions, this ensures voice responses stay short and natural for TTS.
+
+> **Tip:** If 512 tokens feels too restrictive (responses getting cut off), bump it to `768` or `1024`. For most spoken responses, 512 tokens (~3–5 sentences) is the sweet spot.
+
 ### 9c. Restart and Test
 
 ```bash
@@ -507,11 +524,16 @@ Say **"Hey Jarvis, tell me about the weather"** — the response should sound na
 
 ### How It Works
 
-| Message Source | Agent Used | Response Style | TTS Sanitized? |
-|---------------|-----------|----------------|---------------|
-| Voice mic | `voice-agent` | Conversational (SOUL.md) | Yes (safety net) |
-| Telegram | Default agent | Normal rich text | N/A (text channel) |
-| Telegram → Voice broadcast | Default agent | Normal rich text | Yes (stripped for speaker) |
+| Message Source | Agent | Model Key | maxTokens | Response Style | TTS? |
+| --- | --- | --- | --- | --- | --- |
+| Voice mic | `voice-agent` | `claude-sonnet-4-5-voice` | 512 | Concise (SOUL.md) | Yes |
+| Telegram | Default agent | `claude-sonnet-4-5` | 8192 | Normal rich text | N/A |
+| Telegram → Voice broadcast | Default agent | `claude-sonnet-4-5` | 8192 | Normal rich text | Yes |
+
+Voice response conciseness is controlled by two independent layers:
+
+1. **SOUL.md** (soft control) — instructs the LLM to keep responses to 1–6 sentences. This is the primary lever.
+2. **maxTokens** (hard ceiling) — caps the voice model at 512 tokens, preventing runaway generation even if the LLM ignores the system prompt.
 
 The TTS sanitizer (`tts-sanitize.ts`) acts as a safety net on **all** text reaching the speaker, regardless of which agent generated it. Even if the voice agent's SOUL.md instructions are followed perfectly, the sanitizer ensures no markdown artifacts slip through.
 
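
Note on the sanitizer paragraph above: the contents of `tts-sanitize.ts` are not part of this diff, so the following is only a minimal TypeScript sketch of what a markdown-stripping safety net could look like. The function name `stripMarkdownForSpeech` and every rule in it are assumptions made for illustration, not the project's implementation.

```typescript
// Hypothetical sketch; NOT the project's actual tts-sanitize.ts.
// Strips common markdown syntax so a TTS engine does not read symbols aloud.
function stripMarkdownForSpeech(text: string): string {
  return text
    // Drop fenced code blocks entirely; code is not meant to be spoken.
    .replace(/```[\s\S]*?```/g, " ")
    // Keep the inner text of inline code spans.
    .replace(/`([^`]*)`/g, "$1")
    // Replace [label](url) links with just the label.
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")
    // Remove heading, blockquote, bullet, and numbered-list markers at line start.
    .replace(/^[ \t]{0,3}(#{1,6}\s+|>\s?|[-*+]\s+|\d+\.\s+)/gm, "")
    // Unwrap bold (**x** or __x__), then asterisk italics (*x*).
    // Underscore italics are left alone so snake_case words survive.
    .replace(/(\*\*|__)(.+?)\1/g, "$2")
    .replace(/\*(.+?)\*/g, "$1")
    // Collapse the whitespace left behind.
    .replace(/\s+/g, " ")
    .trim();
}

console.log(stripMarkdownForSpeech("**Hello** [world](https://example.com)"));
// prints "Hello world"
```

A regex pass like this is deliberately crude: it will also mangle literal asterisks in arithmetic such as "2 * 3 * 4". That trade-off seems acceptable for a last-resort filter that runs after the SOUL.md instructions have already asked the model not to emit markdown.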

docs/channels/voice-assistant.md

Lines changed: 42 additions & 0 deletions
@@ -96,6 +96,48 @@ When `broadcastAllChannels: true`, messages from ANY channel are spoken via TTS:
 3. WhatsApp receives text response.
 4. Voice assistant receives TTS playback.
 
+## Controlling response length
+
+Voice responses are spoken aloud, so conciseness matters. There are two layers to control this:
+
+### 1. SOUL.md (primary — soft control)
+
+Bind a dedicated voice agent with its own workspace and SOUL.md that instructs the LLM to keep responses brief. See [RASPBERRY-PI-SETUP.md](/RASPBERRY-PI-SETUP.md) for a full template. Key rules to include:
+
+- Hard sentence limits (1–2 for simple questions, 5–6 max for complex topics)
+- No markdown formatting
+- Natural speech patterns
+
+### 2. maxTokens (secondary — hard ceiling)
+
+Create a dedicated model key for the voice agent with a low `maxTokens` value. This prevents runaway generation even if the LLM ignores the system prompt:
+
+```json5
+{
+  agents: {
+    defaults: {
+      models: {
+        "anthropic/claude-sonnet-4-5-voice": {
+          params: { maxTokens: 512 },
+        },
+      },
+    },
+    list: [
+      {
+        id: "voice-agent",
+        workspace: "~/.openclaw/workspaces/voice-agent",
+        model: "anthropic/claude-sonnet-4-5-voice",
+      },
+    ],
+  },
+  bindings: [
+    { agentId: "voice-agent", match: { channel: "voice-assistant" } },
+  ],
+}
+```
+
+The dedicated model key (`-voice` suffix) inherits the same underlying model but gets its own `maxTokens`. Other channels keep their default limit. Start with 512 tokens and adjust up if responses feel cut off.
+
 ## Configuration
 
 ### Channel config
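
Note on the model-key approach in this hunk: the per-model lookup it relies on can be modelled roughly as below. This is a hypothetical TypeScript sketch built only from the config shape shown in the diff. It is not OpenClaw's resolution code, and the names `AgentsConfig`, `resolveMaxTokens`, and the `defaultModel` argument are assumptions. The 8192 fallback is the default limit quoted in the commit message.

```typescript
// Hypothetical types mirroring only the config keys that appear in the diff.
interface ModelEntry {
  params?: { maxTokens?: number };
}

interface AgentEntry {
  id: string;
  workspace?: string;
  model?: string;
}

interface AgentsConfig {
  defaults?: { models?: Record<string, ModelEntry> };
  list?: AgentEntry[];
}

// Default limit for other channels, as stated in the commit message.
const FALLBACK_MAX_TOKENS = 8192;

// Assumed lookup order: agent -> its model key -> that key's params.maxTokens.
function resolveMaxTokens(
  agents: AgentsConfig,
  agentId: string,
  defaultModel: string,
): number {
  const agent = agents.list?.find((a) => a.id === agentId);
  const modelKey = agent?.model ?? defaultModel;
  return agents.defaults?.models?.[modelKey]?.params?.maxTokens ?? FALLBACK_MAX_TOKENS;
}

const exampleAgents: AgentsConfig = {
  defaults: {
    models: {
      "anthropic/claude-sonnet-4-5-voice": { params: { maxTokens: 512 } },
    },
  },
  list: [
    {
      id: "voice-agent",
      workspace: "~/.openclaw/workspaces/voice-agent",
      model: "anthropic/claude-sonnet-4-5-voice",
    },
  ],
};

console.log(resolveMaxTokens(exampleAgents, "voice-agent", "anthropic/claude-sonnet-4-5")); // 512
console.log(resolveMaxTokens(exampleAgents, "default", "anthropic/claude-sonnet-4-5")); // 8192
```

Under this model, the cap follows the model key rather than the agent, which is why the commit introduces a second key instead of adding a per-agent setting.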

extensions/voice-assistant/README.md

Lines changed: 9 additions & 0 deletions
@@ -31,6 +31,15 @@ On-device voice input/output channel powered by RunAnywhere.
 └─────────────────────────────────────────────────────────────┘
 ```
 
+## Controlling response conciseness
+
+Voice responses are spoken aloud via TTS, so brevity is critical. Two mechanisms work together:
+
+1. **SOUL.md** — Bind a dedicated `voice-agent` with a workspace containing a SOUL.md that enforces sentence limits (1–6 sentences max). This is the primary control.
+2. **maxTokens** — Create a voice-specific model key (e.g. `anthropic/claude-sonnet-4-5-voice`) with `params.maxTokens: 512` in `agents.defaults.models`. This acts as a hard ceiling.
+
+See [RASPBERRY-PI-SETUP.md](/RASPBERRY-PI-SETUP.md) for the full configuration example and SOUL.md template.
+
 ## Configuration
 
 Add to `~/.openclaw/openclaw.yaml`:
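
Note on the sentence limits added in this hunk: in this commit the 1–6 sentence rule is enforced only by prompt instructions, with `maxTokens` as a token-level backstop. If a deterministic sentence-level guard were also wanted before text reaches TTS, it might look like the sketch below. Nothing here exists in the repository; `splitSentences`, `clampSentences`, and `MAX_SENTENCES` are hypothetical names, and the sentence splitter is intentionally naive.

```typescript
// Hypothetical guard; not part of this commit or of OpenClaw.
const MAX_SENTENCES = 6; // mirrors the SOUL.md rule "NEVER exceed 6 sentences"

// Naive splitter: a sentence ends at . ! ? or … when followed by whitespace
// or the end of the text. Abbreviations such as "e.g. " will be split wrongly.
function splitSentences(text: string): string[] {
  const input = text.trim();
  const sentences: string[] = [];
  let current = "";
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    current += ch;
    const endsSentence = ".!?…".indexOf(ch) !== -1;
    const atBoundary = i === input.length - 1 || /\s/.test(input.charAt(i + 1));
    if (endsSentence && atBoundary) {
      sentences.push(current.trim());
      current = "";
    }
  }
  // Keep any trailing text that lacks closing punctuation.
  if (current.trim().length > 0) {
    sentences.push(current.trim());
  }
  return sentences;
}

// Returns the text unchanged if it is within the limit, otherwise only the
// first `max` sentences joined by single spaces.
function clampSentences(text: string, max: number = MAX_SENTENCES): string {
  const sentences = splitSentences(text);
  return sentences.length <= max ? text.trim() : sentences.slice(0, max).join(" ");
}

console.log(clampSentences("A. B. C. D. E. F. G.")); // prints "A. B. C. D. E. F."
```

Cutting at a sentence boundary avoids the mid-sentence truncation that a token ceiling alone can produce, at the cost of silently dropping content, so a guard like this would complement the prompt rather than replace it.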
