js-tts-wrapper Engine & Feature Backlog

Reference: speech-sdk (@speech-sdk/core)

Completed

New Engines to Add

Lower Priority (Open-Source / Niche)

| Engine | Models | Key Features | Notes |
| --- | --- | --- | --- |
| fal | `f5-tts`, `kokoro`, `dia-tts`, `orpheus-tts`, `index-tts-2` | Voice cloning, open-source | No streaming, many sub-models |
| Google Gemini TTS | `gemini-2.5-flash-preview-tts`, `gemini-2.5-pro-preview-tts` | Pseudo-streaming, 23 languages | Different from existing Google Cloud TTS |

Cross-Cutting Features

Audio Tags (Cross-Provider Abstraction)

Unified [tag] syntax mapped to provider-specific representations:

ElevenLabs v3 — native passthrough (done)
Cartesia sonic-3 — emotions to <emotion value="..."/> SSML (done)
OpenAI gpt-4o-mini-tts — tags to natural language instructions
xAI grok-tts — native passthrough
Fish Audio s2-pro — native passthrough
All others — strip tags with warnings
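
A minimal sketch of how the mapping could be dispatched, covering three of the strategies above: passthrough, Cartesia emotion SSML, and strip-with-warnings. The names `TagStrategy`, `TagResult`, and `applyAudioTags` are hypothetical and not part of js-tts-wrapper. The sketch also assumes every tag is a simple word or phrase in square brackets, and treats every tag as an emotion for the Cartesia case, which a real implementation would likely need to refine.

```typescript
// Hypothetical names; a sketch of per-provider tag handling, not the library's API.
type TagStrategy = "passthrough" | "cartesia-emotion" | "strip";

interface TagResult {
  text: string;
  warnings: string[];
}

// Assumption: tags look like [excited] or [whispers softly].
const TAG_PATTERN = /\[([a-zA-Z][a-zA-Z ]*)\]/g;

function applyAudioTags(input: string, strategy: TagStrategy): TagResult {
  const warnings: string[] = [];

  // Providers with native tag support receive the text unchanged.
  if (strategy === "passthrough") {
    return { text: input, warnings };
  }

  const replaced = input.replace(TAG_PATTERN, (_match: string, tag: string) => {
    if (strategy === "cartesia-emotion") {
      // Map the tag onto the <emotion value="..."/> form named above.
      return `<emotion value="${tag.trim()}"/>`;
    }
    // Unsupported provider: drop the tag and record a warning.
    warnings.push(`Audio tag [${tag}] is not supported by this engine and was removed`);
    return "";
  });

  // Collapse whitespace left behind by removed tags.
  const text = strategy === "strip" ? replaced.replace(/\s{2,}/g, " ").trim() : replaced;
  return { text, warnings };
}
```

Returning the warnings alongside the text, rather than logging them directly, leaves it to the caller to decide how to surface them.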

Model-Level Feature Declarations

Add per-model capability metadata (from speech-sdk pattern):

streaming — supports real-time audio streaming
audio-tags — supports [tag] syntax
inline-voice-cloning — accepts reference audio inline
open-source — model is open source

Enables runtime capability checks via hasFeature().
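
One possible shape for the metadata and the check, as a sketch only. The `ModelFeature` type, the `MODEL_FEATURES` registry, and the `hasFeature(modelId, feature)` signature are assumptions. The two entries are illustrative and use only what this backlog states about `sonic-3` and `f5-tts`.

```typescript
// Hypothetical types and registry; the real declarations may live on each engine.
type ModelFeature =
  | "streaming"
  | "audio-tags"
  | "inline-voice-cloning"
  | "open-source";

// Illustrative entries based on the lists in this backlog.
const MODEL_FEATURES = new Map<string, ReadonlySet<ModelFeature>>([
  ["sonic-3", new Set<ModelFeature>(["streaming", "audio-tags", "inline-voice-cloning"])],
  ["f5-tts", new Set<ModelFeature>(["inline-voice-cloning", "open-source"])],
]);

// Returns false for unknown models instead of throwing.
function hasFeature(modelId: string, feature: ModelFeature): boolean {
  return MODEL_FEATURES.get(modelId)?.has(feature) ?? false;
}
```

A caller could then guard optional behavior, for example by only requesting a streamed response when `hasFeature(model, "streaming")` is true.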

Unified Voice Type

Current: engine-specific voice IDs
Proposed: string | { url: string } | { audio: string | Uint8Array }

string — standard voice ID
{ url } — voice cloning from URL
{ audio } — voice cloning from inline audio
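
The proposed union can be narrowed with ordinary type guards. The sketch below shows one way an engine might normalize it before building a request. `Voice`, `ResolvedVoice`, and `resolveVoice` are hypothetical names, and the encoding of a string-valued `audio` field (for example base64) is not specified here, so the sketch passes it through untouched.

```typescript
// The proposed unified voice type, written out as a TypeScript alias.
type Voice = string | { url: string } | { audio: string | Uint8Array };

// Hypothetical normalized form with an explicit discriminant.
type ResolvedVoice =
  | { kind: "id"; id: string }
  | { kind: "clone-url"; url: string }
  | { kind: "clone-audio"; audio: string | Uint8Array };

function resolveVoice(voice: Voice): ResolvedVoice {
  if (typeof voice === "string") {
    return { kind: "id", id: voice }; // standard voice ID
  }
  if ("url" in voice) {
    return { kind: "clone-url", url: voice.url }; // cloning from a URL
  }
  return { kind: "clone-audio", audio: voice.audio }; // cloning from inline audio
}
```

Engines without cloning support could reject anything other than `kind: "id"` early, ideally in combination with the `inline-voice-cloning` feature flag.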

Voice Cloning Support

Providers that support inline voice cloning:

Cartesia sonic-3
Hume octave-2
Fish Audio s2-pro
Resemble
Mistral voxtral-mini-tts-2603
fal (f5-tts, dia-tts, index-tts-2)

Streaming Improvements

Cartesia: true streaming (already pipes response.body)
Deepgram: true streaming (already pipes response.body)
ElevenLabs: true streaming (fixed — pipes response.body when not using timestamps)
Polly: true streaming for MP3/OGG (already pipes AudioStream; WAV requires buffering for header)
Standardize synthToBytestream to return actual streaming responses where supported
Google Cloud TTS: SDK returns all audio at once — would need StreamingSynthesize beta API
Google Gemini TTS: pseudo-streaming via SSE base64 chunks (new engine, not yet implemented)
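
The WAV caveat in the Polly entry follows from the container layout: the canonical 44-byte RIFF/WAVE header stores the data length up front, so it cannot be written until the total PCM size is known. The sketch below illustrates this, assuming 16-bit PCM input. `buildWavHeader` and `wrapPcmAsWav` are hypothetical helper names, not existing js-tts-wrapper functions.

```typescript
// Sketch: build a canonical 44-byte WAV header for 16-bit PCM audio.
// The header needs dataLength, which is why the full payload must be buffered first.
function buildWavHeader(dataLength: number, sampleRate: number, channels = 1): Uint8Array {
  const bitsPerSample = 16;
  const blockAlign = channels * (bitsPerSample / 8);
  const byteRate = sampleRate * blockAlign;

  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataLength, true); // RIFF chunk size: file size minus 8
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size for PCM
  view.setUint16(20, 1, true);              // audio format 1 = PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, dataLength, true);     // size of the PCM payload in bytes
  return new Uint8Array(header);
}

// Prepend the header to a fully buffered PCM payload.
function wrapPcmAsWav(pcm: Uint8Array, sampleRate: number): Uint8Array {
  const header = buildWavHeader(pcm.length, sampleRate);
  const wav = new Uint8Array(header.length + pcm.length);
  wav.set(header, 0);
  wav.set(pcm, header.length);
  return wav;
}
```

MP3 and OGG frames carry their own framing, which is consistent with those formats being piped through without this buffering step.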

Tree-Shakeable Subpath Exports

From speech-sdk pattern — add per-provider subpath exports in package.json:

{
  "exports": {
    ".": "./dist/esm/index.js",
    "./cartesia": "./dist/esm/engines/cartesia.js",
    "./deepgram": "./dist/esm/engines/deepgram.js"
  }
}

Unified Error Hierarchy

Standardize errors across engines with rich context (statusCode, model, responseBody).
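
A sketch of what such a hierarchy might look like. The class names (`TTSError`, `TTSAuthError`, `TTSRateLimitError`), the `engine` field, and the `errorFromResponse` helper are all hypothetical. Only the three context fields come from the line above. The status-code mapping relies on standard HTTP semantics (401/403 for authentication and authorization failures, 429 for rate limiting), which individual providers may not follow exactly.

```typescript
// Hypothetical context shape built from the fields listed in this backlog.
interface TTSErrorContext {
  statusCode?: number;
  model?: string;
  responseBody?: string;
}

class TTSError extends Error {
  readonly engine: string;
  readonly statusCode: number | undefined;
  readonly model: string | undefined;
  readonly responseBody: string | undefined;

  constructor(engine: string, message: string, context: TTSErrorContext = {}) {
    super(message);
    // Keep instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
    this.engine = engine;
    this.statusCode = context.statusCode;
    this.model = context.model;
    this.responseBody = context.responseBody;
  }
}

class TTSAuthError extends TTSError {}
class TTSRateLimitError extends TTSError {}

// Map an HTTP failure onto the hierarchy; unrecognized codes fall back to TTSError.
function errorFromResponse(
  engine: string,
  statusCode: number,
  responseBody: string,
  model?: string,
): TTSError {
  const context: TTSErrorContext = { statusCode, model, responseBody };
  const message = `${engine} request failed with status ${statusCode}`;
  if (statusCode === 401 || statusCode === 403) {
    return new TTSAuthError(engine, message, context);
  }
  if (statusCode === 429) {
    return new TTSRateLimitError(engine, message, context);
  }
  return new TTSError(engine, message, context);
}
```

Callers can then branch on `instanceof TTSRateLimitError` for retry logic while still catching every engine failure as `TTSError`.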

Existing Engine Updates Needed

| Engine | Update Needed |
| --- | --- |
| OpenAI | Add `gpt-4o-mini-tts` model with instructions/audio tag support |
| Google | Add Gemini-based TTS alongside existing Cloud TTS |
| ElevenLabs | Close issue #24 (already fixed) |
