Commit 450f18d

PaulAsjes and claude authored
Speech Engine SDK (#771)
* rename to Speech Engine
* veng -> seng
* initial commit
* Bug fixes and API surface updates
* bug fixes and QoL improvements
* Address comments
* Fix client_wrapper naming and use public API for key access

  Rename client_options parameter to client_wrapper for consistency with Fern SDK conventions. Use get_headers() instead of accessing private _api_key attribute. Update tests to match.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix send_response silently failing when event_id is absent

  Use a dedicated _in_transcript_handler flag instead of checking _current_event_id is None, which conflated "no handler running" with "handler running but transcript had no event_id".

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix crash when user_transcript is null in wire message

  Move the log call after the try/except and use `or []` to handle null transcript_data, preventing len(None) TypeError.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Bump websockets minimum to >=13.0

  The server uses websockets 13.x API (websocket.request.path, websocket.request.headers) which doesn't exist in 11.x-12.x.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Set _closed on protocol close message to stop the receive loop

  Without this, the run() loop continues calling recv() after a close message, processing stale messages and reporting is_open as True.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Move auth to process_request for pre-handshake rejection

  Use websockets process_request callback to verify JWT before the WebSocket handshake completes. Unauthenticated requests now get a plain HTTP 401 instead of completing the upgrade first. Also include exception in logger.exception error handler message.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Guard against non-dict JSON messages in the receive loop

  json.loads can return lists, strings, numbers, or null. These would crash _handle_message which expects a dict. Now emits an error event and continues instead.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Capture event_id upfront in send_response for string path

  The string response path called _send_agent_response twice without an explicit event_id, reading self._current_event_id at each call. An interruption between the two awaits could stamp the terminator with the wrong event_id. Now captures event_id once and passes it through, matching the stream path.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Wire speech engine to Fern-generated CRUD client

  Replace the stub _AsyncSpeechEngineAccessor with custom SpeechEngineClient/AsyncSpeechEngineClient classes that extend the Fern-generated clients and add resource() for WebSocket server setup.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* override get
* fixes
* Return resource instead of response
* fix bugs

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
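Two of the fixes in the commit message, the non-dict JSON guard and the null `user_transcript` guard, follow a common defensive-parsing pattern. The sketch below illustrates that pattern only; it is not the SDK's code. The helper names `parse_wire_message` and `transcript_items` are made up for illustration, and returning `None` stands in for the "emit an error event and continue" behavior the message describes.

```python
import json
from typing import Any, Optional


def parse_wire_message(raw: str) -> Optional[dict]:
    """Return the decoded message if it is a JSON object, else None.

    Hypothetical helper: a real receive loop would report an error event
    instead of returning None.
    """
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # json.loads can also return a list, str, number, bool, or None.
    return data if isinstance(data, dict) else None


def transcript_items(message: dict) -> list:
    """Treat a missing or null "user_transcript" as an empty list."""
    # `or []` covers both an absent key and an explicit JSON null,
    # so a later len() call cannot fail on None.
    return message.get("user_transcript") or []
```

Checking the decoded type before dispatching keeps one malformed frame from taking down the whole session.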
1 parent bb1f507 commit 450f18d

15 files changed

Lines changed: 2775 additions & 1 deletion

.fernignore

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ src/elevenlabs/music_custom.py
 src/elevenlabs/speech_to_text_custom.py
 src/elevenlabs/url_utils.py
 src/elevenlabs/realtime/
+src/elevenlabs/speech_engine/

 # Ignore CI files
 .github/

README.md

Lines changed: 135 additions & 0 deletions
@@ -266,6 +266,141 @@ client_tools.register("calculate_sum", calculate_sum, is_async=False)
 client_tools.register("fetch_data", fetch_data, is_async=True)
 ```

+## Speech Engine
+
+Speech Engine lets you build server-side voice agents that receive real-time transcripts from the ElevenLabs API and stream LLM responses back for text-to-speech synthesis. Your server acts as a WebSocket endpoint: ElevenLabs connects to it, sends user transcripts, and your code decides how to respond.
+
+Speech Engine is async-only and available on `AsyncElevenLabs`.
+
+### Quick Start
+
+```python
+import asyncio
+from openai import AsyncOpenAI
+from elevenlabs import AsyncElevenLabs
+
+openai_client = AsyncOpenAI()
+elevenlabs = AsyncElevenLabs()
+
+async def main():
+    engine = await elevenlabs.speech_engine.get("seng_123")
+
+    async def on_transcript(transcript, session):
+        stream = await openai_client.responses.create(
+            model="gpt-4o",
+            input=[
+                {"role": "assistant" if m.role == "agent" else m.role, "content": m.content}
+                for m in transcript
+            ],
+            stream=True,
+        )
+        await session.send_response(stream)
+
+    async def on_init(conversation_id, session):
+        print(f"Session started: {conversation_id}")
+
+    async def on_close(session):
+        print(f"Session ended: {session.conversation_id}")
+
+    async def on_error(err, session):
+        print(f"Error: {err}")
+
+    await engine.serve(
+        port=3001,
+        debug=True,
+        on_init=on_init,
+        on_transcript=on_transcript,
+        on_close=on_close,
+        on_error=on_error,
+    )
+
+asyncio.run(main())
+```
+
+### How It Works
+
+When `engine.serve()` starts, it opens a WebSocket server on the specified port. For each incoming connection from the ElevenLabs API:
+
+1. An `init` message arrives with a `conversation_id`
+2. As the user speaks, `user_transcript` messages arrive with the full conversation history
+3. Your `on_transcript` handler generates a response (using any LLM) and calls `session.send_response()`
+4. If the user interrupts (speaks again mid-response), the previous handler is automatically cancelled
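Because each `user_transcript` carries the full conversation history, a handler can rebuild its LLM input from scratch on every turn. A minimal sketch of that step, mirroring the role mapping in the Quick Start (`agent` becomes `assistant`); the `Message` dataclass is a hypothetical stand-in for `ConversationMessage`, assumed only to expose `role` and `content`, and `to_llm_input` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Message:
    """Hypothetical stand-in for ConversationMessage (role + content only)."""
    role: str
    content: str


def to_llm_input(transcript: Iterable[Message]) -> List[Dict[str, str]]:
    """Map transcript entries to chat-style dicts, renaming 'agent' to 'assistant'."""
    return [
        {"role": "assistant" if m.role == "agent" else m.role, "content": m.content}
        for m in transcript
    ]


history = [Message("user", "Hi"), Message("agent", "Hello, how can I help?")]
llm_input = to_llm_input(history)
```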
+
+### Sending Responses
+
+`send_response()` accepts strings or async iterators. LLM stream formats from OpenAI, Anthropic, and Google Gemini are auto-detected:
+
+```python
+# Plain string
+await session.send_response("Hello world")
+
+# OpenAI stream (auto-parsed)
+stream = await openai_client.responses.create(model="gpt-4o", ..., stream=True)
+await session.send_response(stream)
+
+# Anthropic stream (auto-parsed)
+stream = anthropic_client.messages.stream(model="claude-sonnet-4-20250514", ...)
+await session.send_response(stream)
+
+# Any async iterator of strings
+async def my_generator():
+    yield "Hello "
+    yield "world"
+await session.send_response(my_generator())
+```
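Accepting "a string or an async iterator" usually means normalizing both into one stream of text chunks. The sketch below shows one way to do that; it is not the SDK's implementation and leaves out the provider-specific stream detection mentioned above. `iter_text_chunks` and `collect` are hypothetical names, and skipping empty chunks is a choice made for this example.

```python
import asyncio
from typing import AsyncIterator, List, Union


async def iter_text_chunks(response: Union[str, AsyncIterator[str]]) -> AsyncIterator[str]:
    """Yield text chunks from either a plain string or an async iterator of strings."""
    if isinstance(response, str):
        # A plain string is treated as a single chunk.
        yield response
        return
    async for chunk in response:
        if chunk:  # drop empty chunks (a choice made for this sketch)
            yield chunk


async def collect(response: Union[str, AsyncIterator[str]]) -> List[str]:
    """Gather all chunks into a list, mainly useful for testing."""
    return [chunk async for chunk in iter_text_chunks(response)]


async def sample_generator() -> AsyncIterator[str]:
    yield "Hello "
    yield ""
    yield "world"
```

Here `asyncio.run(collect(sample_generator()))` would gather the two non-empty chunks, and a plain string would come back as a single-element list.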
+
+### Interruption Handling
+
+When a new transcript arrives while a previous response is still streaming, the previous handler's `asyncio.Task` is cancelled automatically. Any `await` in your handler (including LLM SDK calls) will raise `asyncio.CancelledError`, which cleanly aborts the in-flight request. No manual signal handling needed.
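The cancel-on-new-transcript behavior can be reproduced with plain `asyncio`. The following is a sketch of the general pattern, not the session's actual internals: `HandlerRunner` and `slow_reply` are hypothetical, and `asyncio.sleep` stands in for a streaming LLM call.

```python
import asyncio
from typing import Coroutine, List, Optional


class HandlerRunner:
    """Runs one handler at a time; starting a new one cancels the previous task."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None

    async def start(self, coro: Coroutine) -> None:
        if self._task is not None and not self._task.done():
            self._task.cancel()
            try:
                await self._task  # wait for the old handler to unwind
            except asyncio.CancelledError:
                pass
        self._task = asyncio.create_task(coro)

    async def wait(self) -> None:
        if self._task is not None:
            await self._task


async def slow_reply(label: str, log: List[str]) -> None:
    try:
        await asyncio.sleep(0.2)  # stands in for an in-flight LLM request
        log.append(f"{label}: finished")
    except asyncio.CancelledError:
        log.append(f"{label}: cancelled")
        raise  # re-raise so the task is recorded as cancelled


async def demo() -> List[str]:
    log: List[str] = []
    runner = HandlerRunner()
    await runner.start(slow_reply("first", log))
    await asyncio.sleep(0.05)  # let the first handler reach its await
    await runner.start(slow_reply("second", log))  # interrupts the first
    await runner.wait()
    return log
```

A handler that needs cleanup can catch `asyncio.CancelledError` as `slow_reply` does, provided it re-raises afterwards.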
+
+### Custom Server Integration (FastAPI, Starlette)
+
+For integrating with an existing web server, use `create_session()` instead of `serve()`:
+
+```python
+from fastapi import FastAPI, WebSocket
+
+app = FastAPI()
+engine = ...  # SpeechEngineResource from await client.speech_engine.get(...)
+
+@app.websocket("/api/speech-engine/ws")
+async def speech_engine_ws(ws: WebSocket):
+    await ws.accept()
+    session = engine.create_session(ws, debug=True)
+    session.on("user_transcript", handle_transcript)
+    await session.run()
+```
+
+When using `session.on()` directly, handlers receive just the event data (no `session` argument, since you already have the reference):
+
+| Event | Handler signature |
+|---|---|
+| `"init"` | `async (conversation_id: str) -> None` |
+| `"user_transcript"` | `async (transcript: list[ConversationMessage]) -> None` |
+| `"close"` | `async () -> None` |
+| `"disconnected"` | `async () -> None` |
+| `"error"` | `async (error: Exception) -> None` |
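Since `session.on()` handlers take no `session` argument, a handler that needs to reply has to capture the session some other way. One option is a closure, sketched below. `RecordingSession` is a hypothetical stand-in that only records what would be sent, `make_transcript_handler` is an illustrative name, and the echo reply is a placeholder for a real LLM call.

```python
import asyncio
from types import SimpleNamespace
from typing import Awaitable, Callable, List


class RecordingSession:
    """Hypothetical stand-in for a session; records responses instead of sending."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    async def send_response(self, text: str) -> None:
        self.sent.append(text)


def make_transcript_handler(session) -> Callable[[list], Awaitable[None]]:
    """Build a one-argument handler that closes over `session`."""

    async def handle_transcript(transcript: list) -> None:
        # Reply to the most recent message; a real handler would call an LLM here.
        last = transcript[-1].content if transcript else ""
        await session.send_response(f"You said: {last}")

    return handle_transcript


async def demo() -> List[str]:
    session = RecordingSession()
    handler = make_transcript_handler(session)
    await handler([SimpleNamespace(role="user", content="hello")])
    return session.sent
```

With a real session, the returned handler would presumably be registered with `session.on("user_transcript", make_transcript_handler(session))`.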
+
+### Standalone Server
+
+For full control over the server lifecycle, use `SpeechEngineServer` directly:
+
+```python
+from elevenlabs.speech_engine import SpeechEngineServer
+
+server = SpeechEngineServer(
+    port=3001,
+    debug=True,
+    on_transcript=handle_transcript,
+)
+
+# In one task:
+await server.serve()
+
+# In another task (e.g. signal handler):
+await server.stop()
+```
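The "one task serves, another stops" arrangement can be coordinated with an `asyncio.Event`. The sketch below uses a hypothetical `StubServer` in place of `SpeechEngineServer`, so it only illustrates the coordination. It assumes that calling `stop()` causes `serve()` to return, which this diff does not confirm for the real class.

```python
import asyncio
from typing import List


class StubServer:
    """Hypothetical stand-in exposing serve()/stop() like the README example."""

    def __init__(self) -> None:
        self._stopped = asyncio.Event()
        self.events: List[str] = []

    async def serve(self) -> None:
        self.events.append("serving")
        await self._stopped.wait()  # block until stop() is called
        self.events.append("stopped")

    async def stop(self) -> None:
        self._stopped.set()


async def run_until(server, shutdown: asyncio.Event) -> None:
    """Serve in a background task until `shutdown` is set, then stop cleanly."""
    serve_task = asyncio.create_task(server.serve())
    await shutdown.wait()
    await server.stop()
    await serve_task  # surface any exception raised by serve()


async def demo() -> List[str]:
    server = StubServer()
    shutdown = asyncio.Event()
    # Simulate an external shutdown request shortly after start-up.
    asyncio.get_running_loop().call_later(0.05, shutdown.set)
    await run_until(server, shutdown)
    return server.events
```

In a real process, the `shutdown` event would typically be set from a signal handler or an application shutdown hook rather than a timer.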
+
 ## Languages Supported

 Explore [all models & languages](https://elevenlabs.io/docs/models).

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ pydantic = ">= 1.9.2"
 pydantic-core = ">=2.18.2"
 requests = ">=2.20"
 typing_extensions = ">= 4.0.0"
-websockets = ">=11.0"
+websockets = ">=13.0"

 [tool.poetry.group.dev.dependencies]
 mypy = "==1.13.0"
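The commit message ties this bump to the websockets 13.x API and to rejecting unauthenticated requests in `process_request` before the handshake completes. The sketch below shows roughly what that pattern looks like, assuming the `websockets.asyncio.server` implementation. The `Authorization: Bearer` header, `bearer_token`, `make_process_request`, and the token check are placeholders for illustration. The SDK's own JWT check (`verify_speech_engine_jwt`) is not used because its signature does not appear in this diff.

```python
import http
from typing import Callable, Mapping, Optional


def bearer_token(headers: Mapping[str, str]) -> Optional[str]:
    """Extract the token from an 'Authorization: Bearer <token>' header, if any."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return None


def make_process_request(is_valid_token: Callable[[str], bool]):
    """Build a process_request callback that rejects bad tokens with HTTP 401."""

    def process_request(connection, request):
        token = bearer_token(request.headers)
        if token is None or not is_valid_token(token):
            # Returning a response here aborts the WebSocket upgrade.
            return connection.respond(http.HTTPStatus.UNAUTHORIZED, "Unauthorized\n")
        return None  # continue with the normal handshake

    return process_request


async def main() -> None:
    # Imported lazily so the helpers above work without websockets installed.
    from websockets.asyncio.server import serve  # assumes websockets>=13

    async def handler(websocket) -> None:
        async for message in websocket:
            await websocket.send(message)  # echo, as a placeholder

    check = make_process_request(lambda token: token == "secret")  # placeholder check
    async with serve(handler, "localhost", 3001, process_request=check) as server:
        await server.serve_forever()

# To try it locally: import asyncio; asyncio.run(main())
```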

src/elevenlabs/client.py

Lines changed: 11 additions & 0 deletions
@@ -7,6 +7,7 @@
 from .environment import ElevenLabsEnvironment
 from .music_custom import AsyncMusicClient, MusicClient
 from .realtime_tts import RealtimeTextToSpeechClient
+from .speech_engine_custom import AsyncSpeechEngineClient, SpeechEngineClient
 from .speech_to_text_custom import AsyncSpeechToTextClient, SpeechToTextClient
 from .webhooks_custom import AsyncWebhooksClient, WebhooksClient

@@ -62,6 +63,11 @@ def __init__(
         self._webhooks = WebhooksClient(client_wrapper=self._client_wrapper)
         self._music = MusicClient(client_wrapper=self._client_wrapper)
         self._speech_to_text = SpeechToTextClient(client_wrapper=self._client_wrapper)
+        self._speech_engine = SpeechEngineClient(client_wrapper=self._client_wrapper)
+
+    @property
+    def speech_engine(self) -> SpeechEngineClient:
+        return typing.cast(SpeechEngineClient, self._speech_engine)


 class AsyncElevenLabs(AsyncBaseElevenLabs):
@@ -107,3 +113,8 @@ def __init__(
         self._webhooks = AsyncWebhooksClient(client_wrapper=self._client_wrapper)
         self._music = AsyncMusicClient(client_wrapper=self._client_wrapper)
         self._speech_to_text = AsyncSpeechToTextClient(client_wrapper=self._client_wrapper)
+        self._speech_engine = AsyncSpeechEngineClient(client_wrapper=self._client_wrapper)
+
+    @property
+    def speech_engine(self) -> AsyncSpeechEngineClient:
+        return typing.cast(AsyncSpeechEngineClient, self._speech_engine)

src/elevenlabs/speech_engine/__init__.py

Lines changed: 28 additions & 0 deletions
@@ -2,3 +2,31 @@

 # isort: skip_file

+"""ElevenLabs Speech Engine SDK module."""
+
+from .resource import SpeechEngineResource, verify_speech_engine_jwt
+from .server import SpeechEngineServer
+from .session import SpeechEngineSession
+from .types import (
+    CLOSE,
+    DISCONNECTED,
+    ERROR,
+    INIT,
+    USER_TRANSCRIPT,
+    ConversationMessage,
+    WebSocketLike,
+)
+
+__all__ = [
+    "ConversationMessage",
+    "SpeechEngineResource",
+    "SpeechEngineServer",
+    "SpeechEngineSession",
+    "WebSocketLike",
+    "verify_speech_engine_jwt",
+    "CLOSE",
+    "DISCONNECTED",
+    "ERROR",
+    "INIT",
+    "USER_TRANSCRIPT",
+]
