Spec ID: OVOS-AUDIO-1 · Version: 2 · Status: Draft
This specification defines the audio output service — the
pipeline's output-side counterpart that consumes natural-language
responses and renders them as audio. It covers two rendering modes
(ovos.utterance.speak for local playback and
ovos.utterance.speak.b64 for remote-client delivery), a sequential
playback queue for speech and sound effects, fire-and-forget instant
sounds, and the output lifecycle signals that bookend audio playback.
It builds on three companion specifications:
- the Utterance Lifecycle and Pipeline Specification (OVOS-PIPELINE-1) — the pipeline iteration, the Match and dispatch contract, the handler-lifecycle trio, and the ovos.utterance.speak output exit point;
- the Bus Message Specification (OVOS-MSG-1) — the envelope, routing keys, session carrier, and derivations every Message defined here travels in;
- the Transformer Injection Point Specification (OVOS-TRANSFORM-1) — the dialog-transformer and TTS-transformer chains that run before and after TTS synthesis.
The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used as in RFC 2119.
This specification defines:
- the audio output service role (§2) — the component that receives natural-language responses and renders them as audio;
- the rendering pipeline (§3) — two rendering modes sharing the same TTS pipeline: ovos.utterance.speak enqueues for local playback; ovos.utterance.speak.b64 emits synthesised audio as base64 for remote clients instead;
- the playback model (§4) — the scheduled queue for TTS speech and queued sounds, and fire-and-forget instant sounds for immediate playback;
- output lifecycle signals (§5) — the start/end markers that bookend audio playback;
- stop integration (§6) — how the audio service responds to stop signals;
- bus surface (§7);
- conformance (§8).
It does not define:
- the internal machinery of TTS synthesis — how a TTS plugin converts text to audio, including model inference, voice selection, and audio formatting, is entirely the plugin's business. The spec fixes only the observable bus contract;
- the transformer plugin internals — dialog and TTS transformer chains are defined by OVOS-TRANSFORM-1; this spec only fixes when they run in the output pipeline;
- the audio-input pipeline — microphone capture, wake-word detection, and speech-to-text are separate services covered by other specifications;
- hardware access — how the service accesses audio output hardware is a deployment concern;
- volume control, audio routing, or hardware abstraction — these are deployment-level concerns;
- music and media playback — long-form audio is managed by a separate media-playback service. This spec covers TTS speech and sound effects only.
The audio output service is the component that receives natural-language response text from the pipeline and renders it as audible output. It:
- subscribes to ovos.utterance.speak (OVOS-PIPELINE-1 §9.6) and ovos.utterance.speak.b64 (§3.4) and processes each through the same TTS rendering pipeline (§3), differing only in output stage;
- maintains a scheduled playback queue (§4.1) for TTS speech and queued sounds, ensuring that audio is played back in order without overlapping;
- plays instant sounds (§4.2) immediately on receipt, independently of the scheduled queue and without stopping it;
- emits output lifecycle signals (§5) around each playback session;
- responds to stop signals (§6) by clearing the queue and terminating in-progress playback.
A deployment MAY have no audio output service. The pipeline and handler lifecycle are unaffected by its absence.
The handler does not block on audio output; playback may occur after
ovos.utterance.handled has fired (PIPELINE-1 §6.1).
Both ovos.utterance.speak and ovos.utterance.speak.b64 pass
through the same TTS pipeline. They differ only in the output stage:
    ovos.utterance.speak          ovos.utterance.speak.b64
            │                               │
            ▼                               ▼
    [dialog transformers]         [dialog transformers]       ← TRANSFORM-1 §3.5
            │                               │
            ▼                               ▼
      TTS synthesis                 TTS synthesis
            │                               │
            ▼                               ▼
    [tts transformers]            [tts transformers]          ← TRANSFORM-1 §3.6
            │                               │
            ▼                               ▼
      scheduled queue             ovos.audio.speech (§4.3)
      → local playback            (b64 for remote client)
All rendering stages execute in the audio output service, which MAY run in the same process as the utterance orchestrator or separately.
Before TTS synthesis, the utterance text is passed through the dialog-transformer chain (OVOS-TRANSFORM-1 §3.5) hosted by the audio output service. Each transformer plugin in the chain receives the text and the Message context and MAY mutate either.
The transformed text replaces the original utterance field for
all downstream stages.
The audio output service synthesises the utterance text into audio.
Language is taken from data.lang in the received Message
(PIPELINE-1 §9.6); when absent, the service resolves it from the
session (OVOS-SESSION-1 §3.2).
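The language lookup order described above can be sketched as follows. This is a minimal illustration, not the service's actual code: the Message is modelled as a plain dict, the session's language key (`lang`) and the final `default` fallback are assumptions, and `resolve_lang` is a hypothetical helper name.

```python
def resolve_lang(message: dict, default: str = "en-US") -> str:
    """Prefer data.lang; otherwise fall back to the session's language."""
    lang = message.get("data", {}).get("lang")
    if lang:
        return lang
    session = message.get("context", {}).get("session", {})
    # Assumption: the session carries its language under the key "lang".
    # The `default` value is a hypothetical last resort, not part of the spec.
    return session.get("lang", default)


explicit = resolve_lang({"data": {"lang": "pt-PT"},
                         "context": {"session": {"lang": "en-US"}}})
from_session = resolve_lang({"data": {},
                             "context": {"session": {"lang": "de-DE"}}})
```

Here `explicit` comes from the Message payload and `from_session` from the session, mirroring the order in the text.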
When synthesis fails, the service SHOULD attempt a fallback. Selection and fallback logic are deployment concerns.
For ovos.utterance.speak, the synthesised audio is enqueued for
local playback (§4). For ovos.utterance.speak.b64, the synthesised
audio is emitted as ovos.audio.speech (§3.4) instead — it is not
enqueued and does not play locally.
Note (non-normative): See appendix §4.9 for a discussion of sentence-segmentation as a latency-reduction technique.
After synthesis, the audio data and Message context are passed through the TTS-transformer chain (OVOS-TRANSFORM-1 §3.6) hosted by the audio output service. Each transformer plugin MAY mutate the audio data.
The transformed audio replaces the original for playback.
The audio output service MUST subscribe to
ovos.utterance.speak.b64. A Message on this topic carries the same
utterance text as ovos.utterance.speak and passes through the
same dialog-transformer, TTS-synthesis, and TTS-transformer stages
(§3.1–§3.3). The output stage differs: instead of enqueueing for
local playback, the service MUST emit ovos.audio.speech (§4.3)
with the synthesised audio encoded as base64. The audio is not
enqueued and does not play on the local device.
The listen flag (§4.4) applies: if the originating Message carries
listen: true, the service MUST emit ovos.mic.listen after
emitting ovos.audio.speech.
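The two rendering modes can be sketched as one function that shares every stage and branches only at the output. This is an illustrative sketch under several assumptions: Messages are plain dicts with a `type` key for the topic, transformers are simplified to single-argument callables (the real plugin interface is defined by OVOS-TRANSFORM-1), and `fake_tts`, `render`, and `emit` are hypothetical names.

```python
import base64
from collections import deque


def fake_tts(text: str) -> bytes:
    """Stand-in for a TTS plugin: returns the text as UTF-8 bytes."""
    return text.encode("utf-8")


def render(message, playback_queue, emit, dialog_chain=(), tts_chain=(), synth=fake_tts):
    """Run the shared stages, then branch on the topic for the output stage."""
    text = message["data"]["utterance"]
    for transform in dialog_chain:   # dialog transformers, before synthesis
        text = transform(text)
    audio = synth(text)              # TTS synthesis
    for transform in tts_chain:      # TTS transformers, after synthesis
        audio = transform(audio)
    listen = bool(message["data"].get("listen", False))

    if message["type"] == "ovos.utterance.speak.b64":
        # Remote delivery: emit base64 audio and never enqueue locally.
        payload = {"audio": base64.b64encode(audio).decode("ascii")}
        if listen:
            payload["listen"] = True
        emit("ovos.audio.speech", payload)
        if listen:
            emit("ovos.mic.listen", {})
    else:
        # Local playback: hand the audio to the scheduled queue.
        playback_queue.append({"audio": audio, "listen": listen})


emitted = []
playback_queue = deque()


def emit(topic, data):
    emitted.append((topic, data))


render({"type": "ovos.utterance.speak", "data": {"utterance": "hello"}},
       playback_queue, emit, dialog_chain=[str.upper])
render({"type": "ovos.utterance.speak.b64", "data": {"utterance": "hi", "listen": True}},
       playback_queue, emit)
```

After these two calls the local queue holds only the first utterance, while the second produced an `ovos.audio.speech` emission followed by `ovos.mic.listen`.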
The audio output service has one scheduled queue and a separate instant-sound mechanism:
- Scheduled playback queue (§4.1) — sequential, one-at-a-time playback for TTS speech and queued sound effects. Audio plays in FIFO order without overlapping.
- Instant sounds (§4.2) — fire-and-forget playback that starts immediately on receipt. Instant sounds are independent of the queue: they play over whatever is currently scheduled, MAY overlap each other, and are not stoppable.
The scheduled playback queue holds TTS speech (from ovos.utterance.speak, §3.2)
and queued sounds (from ovos.audio.queue, below).
Session scope. The audio output service MUST only enqueue items
whose context.session.session_id matches a session it is
configured to serve locally. A service co-located with the
orchestrator on a single device SHOULD serve only
session_id: "default" (OVOS-SESSION-2 §5) and MUST NOT
enqueue audio for named sessions — those sessions belong to remote
participants and their audio is delivered via
ovos.utterance.speak.b64 / ovos.audio.speech (§3.4, §4.3).
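A session-scope check along these lines could gate enqueueing. It is a sketch only: `should_enqueue` and `LOCAL_SESSIONS` are hypothetical names, Messages are modelled as dicts, and treating a missing session_id as "default" is an assumption borrowed from the query rule in §5.3 rather than something this section states.

```python
LOCAL_SESSIONS = {"default"}  # sessions this service is configured to serve locally


def should_enqueue(message: dict, served=LOCAL_SESSIONS) -> bool:
    """Return True only for sessions that this service plays locally."""
    session = message.get("context", {}).get("session", {})
    # Assumption: an absent session_id is treated as the default session.
    session_id = session.get("session_id") or "default"
    return session_id in served


local_ok = should_enqueue({"context": {"session": {"session_id": "default"}}})
remote_ok = should_enqueue({"context": {"session": {"session_id": "remote-42"}}})
```

With the single-device configuration shown, `local_ok` is true and `remote_ok` is false, so audio for the named session would go through the b64 path instead.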
Discipline:
- FIFO. Items are dequeued in the order they were enqueued.
- Sequential. Each item plays to completion before the next item begins.
- Clearable. On a stop signal (§6), the queue is emptied of all pending items and any in-progress playback is terminated.
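The three properties above map naturally onto a double-ended queue. The sketch below is one possible shape, not a prescribed design: `ScheduledQueue` and its `play` callable are hypothetical, `play` is assumed to block until the item finishes, and terminating in-progress playback is left out because it depends on the audio backend.

```python
from collections import deque


class ScheduledQueue:
    def __init__(self, play):
        self._items = deque()
        self._play = play  # hypothetical blocking callable that plays one item

    def enqueue(self, item):
        self._items.append(item)

    def drain(self):
        """Play items one at a time, in FIFO order, until the queue is empty."""
        while self._items:
            self._play(self._items.popleft())

    def clear(self):
        """Drop all pending items, as on a stop signal."""
        self._items.clear()


played = []
queue = ScheduledQueue(play=played.append)
queue.enqueue("speech-1")
queue.enqueue("beep")
queue.drain()      # plays "speech-1", then "beep"

queue.enqueue("speech-2")
queue.clear()      # pending item is dropped
queue.drain()      # nothing left to play
```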
Queued sounds use topic ovos.audio.queue:
| Field | Type | Required | Meaning |
|---|---|---|---|
| uri | string | no | URI referencing the audio data. |
| audio | string | no | Base64-encoded audio data, used when the audio source is on a different host (alternative to uri). |
| listen | bool | no | When true, re-opens the user input channel after this item plays (§4.4). |
Exactly one of uri or audio MUST be present.
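The exactly-one rule can be checked before an item is accepted. The sketch below assumes the payload arrives as a dict; `audio_source` is a hypothetical helper, and raising `ValueError` is only an illustration, since the specification does not say how a malformed Message is handled. The same check would apply to ovos.audio.play_sound.

```python
import base64


def audio_source(data: dict):
    """Return ("uri", str) or ("audio", bytes); reject payloads with both or neither."""
    has_uri = data.get("uri") is not None
    has_audio = data.get("audio") is not None
    if has_uri == has_audio:
        # Both present or both absent violates the exactly-one rule.
        raise ValueError("exactly one of 'uri' or 'audio' must be present")
    if has_uri:
        return ("uri", data["uri"])
    return ("audio", base64.b64decode(data["audio"]))


by_uri = audio_source({"uri": "file:///sounds/beep.wav"})
by_bytes = audio_source({"audio": "aGk=", "listen": True})
```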
Instant sounds are played via ovos.audio.play_sound. They start
immediately on receipt, play over any audio currently in progress
from the scheduled queue, MAY overlap each other, and are not
affected by stop signals (§6).
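Fire-and-forget playback could be approximated by giving each instant sound its own thread, so that it neither waits for nor interrupts the scheduled queue. This is a sketch under assumptions: `play_instant` is a hypothetical helper and `play` stands in for whatever blocking playback call the deployment uses.

```python
import threading


def play_instant(sound, play):
    """Start playback on its own thread and return without waiting for it."""
    thread = threading.Thread(target=play, args=(sound,), daemon=True)
    thread.start()
    return thread


played = []
first = play_instant("ding", played.append)
second = play_instant("chime", played.append)  # may overlap with the first

# Joining is only needed here so the example can inspect the result;
# a real service would not wait on instant sounds.
first.join()
second.join()
```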
Play-sound topic ovos.audio.play_sound:
| Field | Type | Required | Meaning |
|---|---|---|---|
| uri | string | no | URI referencing the audio data. |
| audio | string | no | Base64-encoded audio data, used when the audio source is on a different host (alternative to uri). |
Exactly one of uri or audio MUST be present.
ovos.audio.speech is emitted by the audio output service when
processing an ovos.utterance.speak.b64 Message (§3.4). It carries
the synthesised audio as base64; the receiving client is responsible
for decoding and playing it.
| Field | Type | Required | Meaning |
|---|---|---|---|
| audio | string | yes | Base64-encoded synthesised audio. |
| listen | bool | no | When true, the client SHOULD re-open its microphone after playback. |
The session is identified via context.session as usual. A bridge
(OVOS-BRIDGE-1 §4.2.4) subscribes by session_id or destination
and relays this message to the client.
The listen field on ovos.utterance.speak is defined by
OVOS-PIPELINE-1 §9.6. When a received Message carries listen: true,
the audio output service MUST emit ovos.mic.listen after all
audio for that utterance has completed and after
ovos.audio.output.ended (§5.2).
On a stop-initiated end (§6), ovos.mic.listen is not emitted,
regardless of the listen flag.
The audio output service emits lifecycle signals around playback to notify other components of audio state.
When the first item in a playback session begins (queue was empty, first item dequeued), the audio output service MUST emit:
ovos.audio.output.started
Payload: none. The session is identified by context.session.session_id
of this Message.
A playback session runs from the first item's start until the queue
is empty and the last item completes. ovos.audio.output.started
fires once per idle→active transition.
When the queue becomes empty and the last item has completed playback, the audio output service MUST emit:
ovos.audio.output.ended
Payload: none. The session is identified by context.session.session_id
of this Message.
Components that subscribed to ovos.audio.output.started use this
signal to restore state.
If the last completed item carried listen: true (§4.4), the audio
output service emits ovos.mic.listen after ovos.audio.output.ended.
On a stop-initiated end, ovos.mic.listen is not emitted (§4.4).
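The ordering of the lifecycle signals for a session that ends naturally can be sketched as below. It is a simplified, single-threaded illustration: `run_playback_session` is a hypothetical function, items are dicts with an optional `listen` key, and `play` is assumed to block until the item has finished.

```python
from collections import deque


def run_playback_session(items: deque, emit, play=lambda item: None):
    """Play all queued items and emit the signals that bookend the session."""
    if not items:
        return  # still idle: no session, no signals
    emit("ovos.audio.output.started")   # once per idle-to-active transition
    last = None
    while items:
        last = items.popleft()
        play(last)                      # assumed to block until playback completes
    emit("ovos.audio.output.ended")     # queue empty and last item finished
    if last.get("listen"):
        emit("ovos.mic.listen")         # always after the ended signal


events = []
run_playback_session(deque([{"listen": False}, {"listen": True}]), events.append)

idle_events = []
run_playback_session(deque(), idle_events.append)
```

`events` ends up holding the started signal, the ended signal, and then the microphone request, in that order; `idle_events` stays empty.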
A component MAY query whether the audio output service is currently speaking by emitting:
ovos.audio.is_speaking
Request payload: none. To scope the query to a specific session,
the requester sets context.session.session_id in the request
Message; the service answers for that session only. An absent or
"default" session_id asks about the device-local default session
(OVOS-SESSION-1 §3.1); it is not a wildcard over all sessions.
The service replies with:
{ "speaking": true }

| Field | Type | Required | Meaning |
|---|---|---|---|
| speaking | bool | yes | Whether audio is currently playing for the session identified by context.session.session_id of the request. |
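A session-scoped answer to the query might be built as follows. The sketch only constructs the reply payload: `handle_is_speaking` is a hypothetical handler, `active_sessions` is an assumed set of session ids with audio currently playing, and the reply topic is left out because it is not named here.

```python
def handle_is_speaking(request: dict, active_sessions: set) -> dict:
    """Answer for the requesting session only; absent id means "default"."""
    session = request.get("context", {}).get("session", {})
    session_id = session.get("session_id") or "default"
    # Not a wildcard: only the identified session is checked.
    return {"speaking": session_id in active_sessions}


reply_default = handle_is_speaking({"context": {}}, active_sessions={"default"})
reply_named = handle_is_speaking(
    {"context": {"session": {"session_id": "kitchen"}}},
    active_sessions={"default"},
)
```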
When the audio output service receives a stop signal, it:
- clears the scheduled playback queue of all pending items;
- terminates any in-progress scheduled playback;
- emits ovos.audio.output.ended if a playback session was active.
Instant sounds (§4.2) are not affected by stop signals — they play to completion regardless.
The stop signal topics are:
| Topic | Purpose |
|---|---|
| ovos.audio.stop | Stop audio output. |
| ovos.stop | Universal stop broadcast (OVOS-STOP-1). |
Both signals carry context.session.session_id (OVOS-MSG-1 §4).
The audio output service MAY scope its response to that session.
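The stop behaviour can be sketched as a small state holder. This is an assumption-laden illustration rather than the service's design: `AudioOutput` is a hypothetical class, `emit` is a callable that receives a topic name, session scoping is omitted, and the call that would actually terminate the player is represented only by a comment.

```python
from collections import deque


class AudioOutput:
    STOP_TOPICS = {"ovos.audio.stop", "ovos.stop"}

    def __init__(self, emit):
        self.queue = deque()
        self.active = False   # True while a playback session is running
        self.emit = emit

    def start_session(self):
        self.active = True
        self.emit("ovos.audio.output.started")

    def on_message(self, topic):
        if topic not in self.STOP_TOPICS:
            return
        self.queue.clear()    # drop all pending items
        # A real service would also terminate in-progress playback here.
        if self.active:
            self.active = False
            self.emit("ovos.audio.output.ended")
            # No ovos.mic.listen on a stop-initiated end, whatever the listen flag.


events = []
service = AudioOutput(events.append)
service.on_message("ovos.stop")           # idle: nothing to end, nothing emitted

service.queue.append({"listen": True})
service.start_session()
service.on_message("ovos.audio.stop")     # active: queue cleared, ended emitted
```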
| Topic | Direction | Purpose |
|---|---|---|
| ovos.utterance.speak | handler → audio | Natural-language response text for TTS + local playback (PIPELINE-1 §9.6). |
| ovos.utterance.speak.b64 | handler/bridge → audio | Natural-language response text for TTS + remote delivery via ovos.audio.speech (§3.4). |
| ovos.audio.queue | any component → audio | Queue a sound for scheduled playback (§4.1). |
| ovos.audio.play_sound | any component → audio | Play a sound immediately (§4.2). |
| ovos.audio.stop | any component → audio | Stop audio playback and clear queue (§6). |
| ovos.audio.is_speaking | any component → audio | Query whether audio is currently playing (§5.3). |
| ovos.audio.output.started | audio → broadcast | Playback session started (§5.1). |
| ovos.audio.output.ended | audio → broadcast | Playback session ended (§5.2). |
| ovos.audio.speech | audio → broadcast | Synthesised audio as base64 for remote clients (§4.3). |
| ovos.mic.listen | audio → broadcast | Request microphone re-open after listen: true (§4.4). |
- subscribe to ovos.utterance.speak and process each Message through the TTS rendering pipeline for local playback (§3);
- subscribe to ovos.utterance.speak.b64 and process each Message through the same TTS pipeline, emitting ovos.audio.speech instead of enqueueing for local playback (§3.4);
- maintain a scheduled playback queue that plays one item at a time in FIFO order (§4.1);
- support queued sound playback via ovos.audio.queue (§4.1);
- play instant sounds immediately on ovos.audio.play_sound without queuing or stopping scheduled playback (§4.2);
- emit ovos.audio.output.started when a playback session begins (§5.1);
- emit ovos.audio.output.ended when a playback session ends (§5.2);
- clear the scheduled queue and terminate playback on stop signals (§6);
- emit ovos.mic.listen after playback when the last item carries listen: true (§4.4);
- suppress ovos.mic.listen when playback ends due to a stop signal (§4.4, §6).
- pass utterance text through the dialog-transformer chain before TTS synthesis (§3.1);
- pass the synthesised audio through the TTS-transformer chain before enqueueing (§3.3);
- scope stop responses to the context.session.session_id in the stop signal (§6).
- Utterance Lifecycle and Pipeline Specification (OVOS-PIPELINE-1) — the pipeline iteration, ovos.utterance.speak, and ovos.utterance.handled.
- Bus Message Specification (OVOS-MSG-1) — the envelope and derivations used for all bus communication.
- Transformer Injection Point Specification (OVOS-TRANSFORM-1) — the dialog-transformer and TTS-transformer chains that plug into the rendering pipeline.
- Stop Pipeline Plugin Specification (OVOS-STOP-1) — the universal ovos.stop broadcast that the audio output service responds to.